AMD Athlon™ Processor
x86 Code Optimization Guide

22007E/0—November 1999
List of Tables

Table 21.  MMX™ Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Table 23.  3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . 217
Table 24.  3DNow!™ Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 218
Revision History

Date       Rev   Description
Nov. 1999  E     Rearranged the appendices. Added Index.
1  Introduction
The AMD Athlon™ processor is the newest microprocessor in  
the AMD K86™ family of microprocessors. The advances in the  
AMD Athlon processor take superscalar operation and  
out-of-order execution to a new level. The AMD Athlon  
processor has been designed to efficiently execute code written  
for previous-generation x86 processors. However, to enable the  
fastest code execution with the AMD Athlon processor,  
programmers should write software that includes specific code  
optimization techniques.  
About this Document  
This document contains information to assist programmers in  
creating optimized code for the AMD Athlon processor. In  
addition to compiler and assembler designers, this document is targeted at C and assembly language programmers writing execution-sensitive code sequences.  
This document assumes that the reader possesses in-depth  
knowledge of the x86 instruction set, the x86 architecture  
(registers, programming modes, etc.), and the IBM PC-AT  
platform.  
This guide has been written specifically for the AMD Athlon processor, but it includes considerations for previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:  
Chapter 1: Introduction. Outlines the material covered in this  
document. Summarizes the AMD Athlon microarchitecture.  
Chapter 2: Top Optimizations. Provides convenient descriptions of  
the most important optimizations a programmer should take  
into consideration.  
Chapter 3: C Source Level Optimizations. Describes optimizations that  
C/C++ programmers can implement.  
Chapter 4: Instruction Decoding Optimizations. Describes methods that  
will make the most efficient use of the three sophisticated  
instruction decoders in the AMD Athlon processor.  
Chapter 5: Cache and Memory Optimizations. Describes optimizations that make efficient use of the large L1 caches and high-bandwidth buses of the AMD Athlon processor.  
Chapter 6: Branch Optimizations. Describes optimizations that improve branch prediction and minimize branch penalties.  
Chapter 7: Scheduling Optimizations. Describes optimizations that improve code scheduling for efficient execution resource utilization.  
Chapter 8: Integer Optimizations. Describes optimizations that improve integer arithmetic and make efficient use of the integer execution units in the AMD Athlon processor.  
Chapter 9: Floating-Point Optimizations. Describes optimizations that make maximum use of the superscalar and pipelined floating-point unit (FPU) of the AMD Athlon processor.  
Chapter 10: 3DNow!™ and MMX™ Optimizations. Describes guidelines for Enhanced 3DNow! and MMX code optimization techniques.  
Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors.  
Appendix A: AMD Athlon Processor Microarchitecture. Describes in  
detail the microarchitecture of the AMD Athlon processor.  
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.  
Appendix C: Implementation of Write Combining. Describes the write-combining algorithm used by the AMD Athlon processor.  
Appendix D: Performance Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.  
Appendix E: Programming the MTRR and PAT. Describes the steps needed to program the Memory Type Range Registers and the Page Attribute Table.  
Appendix F: Instruction Dispatch and Execution Resources. Lists the execution resource usage of the instructions.  
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86  
instructions that are DirectPath and VectorPath instructions.  
AMD Athlon™ Processor Family  
The AMD Athlon processor family uses state-of-the-art  
decoupled decode/execution design techniques to deliver  
next-generation performance with x86 binary software  
compatibility. This next-generation processor family advances  
x86 code execution by using flexible instruction predecoding,  
wide and balanced decoders, aggressive out-of-order execution,  
parallel integer execution pipelines, parallel floating-point  
execution pipelines, deep pipelined execution for higher  
delivered operating frequency, dedicated backside cache  
memory, and a new high-performance double-rate 64-bit local  
bus. As an x86 binary-compatible processor, the AMD Athlon  
processor implements the industry-standard x86 instruction set  
by decoding and executing the x86 instructions using a  
proprietary microarchitecture. This microarchitecture allows  
the delivery of maximum performance when running x86-based  
PC software.  
AMD Athlon™ Processor Microarchitecture Summary  
The AMD Athlon processor brings superscalar performance  
and high operating frequency to PC systems running  
industry-standard x86 software. A brief summary of the  
next-generation design features implemented in the  
AMD Athlon processor is as follows:  
- High-speed double-rate local bus interface  
- Large, split 128-Kbyte level-one (L1) cache  
- Dedicated backside level-two (L2) cache  
- Instruction predecode and branch detection during cache line fills  
- Decoupled decode/execution core  
- Three-way x86 instruction decoding  
- Dynamic scheduling and speculative execution  
- Three-way integer execution  
- Three-way address generation  
- Three-way floating-point execution  
- 3DNow!™ technology and MMX™ single-instruction multiple-data (SIMD) instruction extensions  
- Super data forwarding  
- Deep out-of-order integer and floating-point execution  
- Register renaming  
- Dynamic branch prediction  
The AMD Athlon processor communicates through a next-generation high-speed local bus that is beyond the current Socket 7 or Super7™ bus standard. The local bus can transfer data at twice the rate of the bus operating frequency by using both the rising and falling edges of the clock.  
To reduce on-chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls, the AMD Athlon processor has a dedicated high-speed backside L2 cache. The large 128-Kbyte L1 on-chip cache and the backside L2 cache allow the AMD Athlon execution core to achieve and sustain maximum performance.  
As a decoupled decode/execution processor, the AMD Athlon  
processor makes use of a proprietary microarchitecture, which  
defines the heart of the AMD Athlon processor. With the  
inclusion of all these features, the AMD Athlon processor is  
capable of decoding, issuing, executing, and retiring multiple  
x86 instructions per cycle, resulting in superior, scalable performance.  
The AMD Athlon processor includes both the industry-standard  
MMX SIMD integer instructions and the 3DNow! SIMD  
floating-point instructions that were first introduced in the  
AMD-K6®-2 processor. The design of 3DNow! technology was  
based on suggestions from leading graphics and independent  
software vendors (ISVs). Using SIMD format, the AMD Athlon  
processor can generate up to four 32-bit, single-precision  
floating-point results per clock cycle.  
The 3DNow! execution units allow for high-performance  
floating-point vector operations, which can replace x87  
instructions and enhance the performance of 3D graphics and  
other floating-point-intensive applications. Because the  
3DNow! architecture uses the same registers as the MMX  
instructions, switching between MMX and 3DNow! has no  
penalty.  
The AMD Athlon processor designers took another innovative  
step by carefully integrating the traditional x87 floating-point,  
MMX, and 3DNow! execution units into one operational engine.  
With the introduction of the AMD Athlon processor, the  
switching overhead between x87, MMX, and 3DNow!  
technology is virtually eliminated. The AMD Athlon processor  
combined with 3DNow! technology brings a better multimedia  
experience to mainstream PC users while maintaining  
backwards compatibility with all existing x86 software.  
Although the AMD Athlon processor can extract code  
parallelism on-the-fly from off-the-shelf, commercially available  
x86 software, specific code optimization for the AMD Athlon  
processor can result in even higher delivered performance. This  
document describes the proprietary microarchitecture in the  
AMD Athlon processor and makes recommendations for  
optimizing execution of x86 software on the processor.  
The coding techniques for achieving peak performance on the  
AMD Athlon processor include, but are not limited to, those for  
the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium  
II processors. However, many of these optimizations are not  
necessary for the AMD Athlon processor to achieve maximum  
performance. Due to the more flexible pipeline control and  
aggressive out-of-order execution, the AMD Athlon processor is  
not as sensitive to instruction selection and code scheduling.  
This flexibility is one of the distinct advantages of the  
AMD Athlon processor.  
The AMD Athlon processor uses the latest in processor  
microarchitecture design techniques to provide the highest x86  
performance for today's PC. In short, the AMD Athlon  
processor offers true next-generation performance with x86  
binary software compatibility.  
2  Top Optimizations
This chapter contains concise descriptions of the best  
optimizations for improving the performance of the  
AMD Athlon™ processor. Subsequent chapters contain more  
detailed descriptions of these and other optimizations. The  
optimizations in this chapter are divided into two groups and  
listed in order of importance.  
Group I: Essential Optimizations

Group I contains essential optimizations. Users should follow these critical guidelines closely. The optimizations in Group I are as follows:  
- Memory Size and Alignment Issues  
  - Avoid memory size mismatches  
  - Align data where possible  
- Use the 3DNow!™ PREFETCH and PREFETCHW Instructions  
- Select DirectPath Over VectorPath Instructions  
Group II: Secondary Optimizations

Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:  
- Load-Execute Instruction Usage  
  - Use load-execute instructions  
  - Avoid load-execute floating-point instructions with integer operands  
- Take Advantage of Write Combining  
- Use 3DNow! Instructions  
- Avoid Branches Dependent on Random Data  
- Avoid Placing Code and Data in the Same 64-Byte Cache Line  
Optimization Star

The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.

Group I Optimizations: Essential Optimizations
Memory Size and Alignment Issues  
Avoid Memory Size Mismatches

Avoid memory size mismatches when instructions operate on the same data. For instructions that store and reload the same data, keep operands aligned and keep the loads/stores of each operand the same size.  
Align Data Where Possible

Avoid misaligned data references. A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline.  
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions

For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All of the prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine the prefetch distance:

Prefetch Length = 200 × (DS/C)

- Round up to the nearest cache line.  
- DS is the data stride per loop iteration.  
- C is the number of cycles per loop iteration when hitting in the L1 cache.  

See "Use the 3DNow!™ PREFETCH and PREFETCHW Instructions" on page 46 for more details.  
Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized to decode and execute efficiently by minimizing the number of operations per x86 instruction. Three DirectPath instructions can be decoded in parallel. Using VectorPath instructions blocks DirectPath instructions from decoding simultaneously.

See Appendix G, "DirectPath versus VectorPath Instructions," on page 219 for a list of DirectPath and VectorPath instructions.  
Group II Optimizations: Secondary Optimizations  
Load-Execute Instruction Usage  
Use Load-Execute Instructions

Wherever possible, use load-execute instructions to increase code density, with the one exception described below. The split-instruction form of load-execute instructions can be used to avoid scheduler stalls for longer-executing instructions and to explicitly schedule the load and execute operations.  
Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle.  
Take Advantage of Write Combining

This guideline applies only to operating system, device driver, and BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache-line-aligned write buffer.

See Appendix C, "Implementation of Write Combining," on page 155 for more details.  
Use 3DNow!™ Instructions

Unless accuracy requirements dictate otherwise, perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! instructions achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide for a flat register file instead of the stack-based approach of x87 instructions.

See Table 23 on page 217 for a list of 3DNow! instructions. For information about instruction usage, see the 3DNow!™ Technology Manual, order# 21928.  
Avoid Branches Dependent on Random Data

Avoid data-dependent branches around a single instruction. Data-dependent branches acting upon basically random data can cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences, which result in shorter average execution time.  
Avoid Placing Code and Data in the Same 64-Byte Cache Line  
Consider that the AMD Athlon processor cache line is twice the size of previous processors'. Code and data should not be shared in the same 64-byte cache line, especially if the data ever becomes modified. In order to maintain cache coherency, the AMD Athlon processor may thrash its caches, resulting in lower performance.

In general, the following should be avoided:  
- Self-modifying code  
- Storing data in code segments  

See "Avoid Placing Code and Data in the Same 64-Byte Cache Line" on page 50 for more details.  
3  C Source Level Optimizations
This chapter details C programming practices for optimizing  
code for the AMD Athlon™ processor. Guidelines are listed in  
order of importance.  
Ensure Floating-Point Variables and Expressions are of  
Type Float  
For compilers that generate 3DNow!™ instructions, make sure that all floating-point variables and expressions are of type float. Pay special attention to floating-point constants. These require a suffix of F or f (for example, 3.14f) in order to be of type float; otherwise, they default to type double. To avoid automatic promotion of float arguments to double, always use function prototypes for all functions that accept float arguments.  
Use 32-Bit Data Types for Integer Code  
Use 32-bit data types for integer code. Compiler implementations vary, but typically the following data types are included: int, signed, signed int, unsigned, unsigned int, long, signed long, long int, signed long int, unsigned long, and unsigned long int.  
Consider the Sign of Integer Operands  
In many cases, the data stored in integer variables determines  
whether a signed or an unsigned integer type is appropriate.  
For example, to record the weight of a person in pounds, no  
negative numbers are required so an unsigned type is  
appropriate. However, recording temperatures in degrees  
Celsius may require both positive and negative numbers so a  
signed type is needed.  
Where there is a choice of using either a signed or an unsigned  
type, it should be considered that certain operations are faster  
with unsigned types while others are faster for signed types.  
Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 FPU provides  
instructions for converting signed integers to floating-point, but  
has no instructions for converting unsigned integers. In a  
typical case, a 32-bit integer is converted as follows:  
Example 1 (Avoid):
double x;          ====>   MOV   [temp+4], 0
unsigned int i;            MOV   EAX, i
x = i;                     MOV   [temp], EAX
                           FILD  QWORD PTR [temp]
                           FSTP  QWORD PTR [x]
This code is slow not only because of the number of instructions but also because a size mismatch prevents store-to-load forwarding to the FILD instruction.  
Example 2 (Preferred):
double x;          ====>   FILD  DWORD PTR [i]
int i;                     FSTP  QWORD PTR [x]
x = i;
Computing quotients and remainders in integer division by constants is faster when performed on unsigned types. In a typical case, a 32-bit integer is divided by four as follows:  
Example 1 (Avoid):
int i;             ====>   MOV   EAX, i
i = i / 4;                 CDQ
                           AND   EDX, 3
                           ADD   EAX, EDX
                           SAR   EAX, 2
                           MOV   i, EAX
Example 2 (Preferred):
unsigned int i;    ====>   SHR   i, 2
i = i / 4;
In summary:  
- Use unsigned types for:  
  - Division and remainders  
  - Loop counters  
  - Array indexing  
- Use signed types for:  
  - Integer-to-float conversion  
Use Array Style Instead of Pointer Style Code  
The use of pointers in C makes work difficult for the optimizers  
in C compilers. Without detailed and aggressive pointer  
analysis, the compiler has to assume that writes through a  
pointer can write to any place in memory. This includes storage  
allocated to other variables, creating the issue of aliasing, i.e.,  
the same block of memory is accessible in more than one way.  
In order to help the optimizer of the C compiler in its analysis,  
avoid the use of pointers where possible. One example where  
this is trivially possible is in the access of data organized as  
arrays. C allows the use of either the array operator [] or  
pointers to access the array. Using array-style code makes the  
task of the optimizer easier by reducing possible aliasing.  
For example, x[0] and x[2] cannot possibly refer to the same memory location, while *p and *q could. It is highly  
recommended to use the array style, as significant performance  
advantages can be achieved with most compilers.  
Note that source code transformations interact with a compiler's code generator, and it is difficult to control the generated machine code from the source level. It is even  
possible that source code transformations for improving  
performance and compiler optimizations "fight" each other.  
Depending on the compiler and the specific source code it is  
therefore possible that pointer style code will be compiled into  
machine code that is faster than that generated from equivalent  
array style code. It is advisable to check the performance after  
any source code transformation to see whether performance  
indeed increased.  
Example 1 (Avoid):
typedef struct {
    float x,y,z,w;
} VERTEX;

typedef struct {
    float m[4][4];
} MATRIX;

void XForm (float *res, const float *v, const float *m, int numverts)
{
    float dp;
    int i;
    const VERTEX* vv = (VERTEX *)v;

    for (i = 0; i < numverts; i++) {
        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed x */

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed y */

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed z */

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed w */

        ++vv;             /* next input vertex */
        m -= 16;          /* reset to start of transform matrix */
    }
}
Example 2 (Preferred):
typedef struct {
    float x,y,z,w;
} VERTEX;

typedef struct {
    float m[4][4];
} MATRIX;

void XForm (float *res, const float *v, const float *m, int numverts)
{
    int i;
    const VERTEX* vv = (VERTEX *)v;
    const MATRIX* mm = (MATRIX *)m;
    VERTEX* rr = (VERTEX *)res;

    for (i = 0; i < numverts; i++) {
        rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
                vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
        rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
                vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
        rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
                vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
        rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
                vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
        ++rr;             /* next output vertex */
        ++vv;             /* next input vertex */
    }
}
Completely Unroll Small Loops

Take advantage of the AMD Athlon processor's large 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops. For loops that have a small fixed loop count and a small loop body, completely unrolling the loops at the source level is recommended.  
Example 1 (Avoid):
// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i = 0; i < 4; i++) {
    r[i] = 0;
    for (j = 0; j < 4; j++) {
        r[i] += M[j][i]*V[j];
    }
}

Example 2 (Preferred):
// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];
Avoid Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read back shortly thereafter. The AMD Athlon processor contains hardware to accelerate such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, it is still faster to avoid such dependencies altogether and keep the data in an internal register.  
Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as might occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the code in a way that avoids the store-to-load dependency. In some  
instances the language definition may prohibit the compiler  
from using code transformations that would remove the store-  
to-load dependency. It is therefore recommended that the  
programmer remove the dependency manually, e.g., by  
introducing a temporary variable that can be kept in a register.  
This can result in a significant performance increase. The  
following is an example of this.  
Example 1 (Avoid):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;

for (k = 1; k < VECLEN; k++) {
    x[k] = x[k-1] + y[k];
}

for (k = 1; k < VECLEN; k++) {
    x[k] = z[k] * (y[k] - x[k-1]);
}

Example 2 (Preferred):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;
double t;

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = t + y[k];
    x[k] = t;
}

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = z[k] * (y[k] - t);
    x[k] = t;
}
Consider Expression Order in Compound Branch Conditions  
Branch conditions in C programs are often compound  
conditions consisting of multiple boolean expressions joined by  
the boolean operators && and ||. C guarantees a short-circuit  
evaluation of these operators. This means that in the case of ||,  
the first operand to evaluate to TRUE terminates the  
evaluation, i.e., following operands are not evaluated at all.  
Similarly for &&, the first operand to evaluate to FALSE  
terminates the evaluation. Because of this short-circuit  
evaluation, it is not always possible to swap the operands of ||  
and &&. This is especially the case when the evaluation of one  
of the operands causes a side effect. However, in most cases the  
exchange of operands is possible.  
When used to control conditional branches, expressions  
involving || and && are translated into a series of conditional  
branches. The ordering of the conditional branches is a function  
of the ordering of the expressions in the compound condition,  
and can have a significant impact on performance. It is  
unfortunately not possible to give an easy, closed-form formula  
on how to order the conditions. Overall performance is a  
function of a variety of factors, including the following:  
- the probability of a branch mispredict for each of the branches generated  
- the additional latency incurred due to a branch mispredict  
- the cost of evaluating the conditions controlling each of the branches generated  
- the amount of parallelism that can be extracted in evaluating the branch conditions  
- the data stream consumed by the application (mostly due to the dependence of mispredict probabilities on the nature of the incoming data in data-dependent branches)  
It is therefore recommended to experiment with the ordering of
expressions in compound branch conditions in the most active
areas of a program (so-called hot spots), where most of the
execution time is spent. Such hot spots can be found through
the use of profiling. Feed a "typical" data stream to the
program while doing the experiments.
Switch Statement Usage  
Optimize Switch Statements  
Switch statements are translated using a variety of algorithms.
The most common of these are jump tables and comparison
chains/trees. It is recommended to sort the cases of a switch
statement according to the probability of occurrence, with the
most probable first. This improves performance when the
switch is translated as a comparison chain. It is further
recommended to make the case labels small, contiguous
integers, as this allows the switch to be translated as a jump
table.
Example 1 (Avoid):
int days_in_month, short_months, normal_months, long_months;
switch (days_in_month) {
   case 28:
   case 29: short_months++; break;
   case 30: normal_months++; break;
   case 31: long_months++; break;
   default: printf ("month has fewer than 28 or more than 31 days\n");
}

Example 2 (Preferred):
int days_in_month, short_months, normal_months, long_months;
switch (days_in_month) {
   case 31: long_months++; break;
   case 30: normal_months++; break;
   case 28:
   case 29: short_months++; break;
   default: printf ("month has fewer than 28 or more than 31 days\n");
}
Use Prototypes for All Functions  
In general, use prototypes for all functions. Prototypes can  
convey additional information to the compiler that might  
enable more aggressive optimizations.  
Use Const Type Qualifier  
Use the const type qualifier as much as possible. This
optimization makes code more robust and may enable higher
performance code to be generated due to the additional
information available to the compiler. For example, the C
standard allows compilers to not allocate storage for objects
that are declared const, if their address is never taken.
Generic Loop Hoisting  
To improve the performance of inner loops, it is beneficial to  
reduce redundant constant calculations (i.e., loop invariant  
calculations). However, this idea can be extended to invariant  
control structures.  
The first case is that of a constant "if()" statement inside a
"for()" loop.
Example 1:  
for( i ... ) {
   if( CONSTANT0 ) {
      DoWork0( i );   // does not affect CONSTANT0
   } else {
      DoWork1( i );   // does not affect CONSTANT0
   }
}
The above loop should be transformed into:  
if( CONSTANT0 ) {
   for( i ... ) {
      DoWork0( i );
   }
} else {
   for( i ... ) {
      DoWork1( i );
   }
}
This makes the inner loops tighter by avoiding the repeated
evaluation of a known "if()" control structure. Although the
branch would be easily predicted, the extra instructions and
decode limitations imposed by branching are avoided, which is
usually well worth it.
Generalization for Multiple Constant Control Code  
To generalize this further for multiple constant control code,
some more work may have to be done to create the proper outer
loop. Enumerating the constant cases reduces this to a simple
switch statement.
Example 2:  
for( i ... ) {
   if( CONSTANT0 ) {
      DoWork0( i );   // does not affect CONSTANT0 or CONSTANT1
   } else {
      DoWork1( i );   // does not affect CONSTANT0 or CONSTANT1
   }
   if( CONSTANT1 ) {
      DoWork2( i );   // does not affect CONSTANT0 or CONSTANT1
   } else {
      DoWork3( i );   // does not affect CONSTANT0 or CONSTANT1
   }
}
The above loop should be transformed into:  
#define combine( c1, c2 ) (((c1) << 1) + (c2))

switch( combine( CONSTANT0!=0, CONSTANT1!=0 ) ) {
   case combine( 0, 0 ):
      for( i ... ) {
         DoWork0( i );
         DoWork2( i );
      }
      break;
   case combine( 1, 0 ):
      for( i ... ) {
         DoWork1( i );
         DoWork2( i );
      }
      break;
   case combine( 0, 1 ):
      for( i ... ) {
         DoWork0( i );
         DoWork3( i );
      }
      break;
   case combine( 1, 1 ):
      for( i ... ) {
         DoWork1( i );
         DoWork3( i );
      }
      break;
   default:
      break;
}
The trick here is that there is some up-front work involved in
generating all the combinations for the switch constant, and the
total amount of code has doubled. However, it is also clear that
the inner loops are "if()-free". In ideal cases where the
DoWork*() functions are inlined, the successive functions have
greater overlap, leading to greater parallelism than would be
possible in the presence of intervening "if()" statements.
The same idea can be applied to constant "switch()" statements,
or to combinations of "switch()" and "if()" statements inside
"for()" loops. The method for combining the input constants
gets more complicated, but the performance benefit may well be
worth it.
However, the number of inner loops can also substantially
increase. If the number of inner loops is prohibitively high, only
the most common cases need to be dealt with directly, and the
remaining cases can fall back to the old code in the "default:"
clause of the "switch()" statement.
This typically comes up when the programmer is considering  
runtime generated code. While runtime generated code can  
lead to similar levels of performance improvement, it is much  
harder to maintain, and the developer must do their own  
optimizations for their code generation without the help of an  
available compiler.  
Declare Local Functions as Static  
Functions that are not used outside the file in which they are  
defined should always be declared static, which forces internal  
linkage. Otherwise, such functions default to external linkage,  
which might inhibit certain optimizations with some compilers,
for example, aggressive inlining.
Dynamic Memory Allocation Consideration  
Dynamic memory allocation (malloc in the C language) should
always return a pointer that is suitably aligned for the largest
base type (quadword alignment). Where this aligned pointer
cannot be guaranteed, use the technique shown in the following
code to make the pointer quadword aligned, if needed. This
code assumes that a pointer can be cast to a long.
Example:  
double* p;
double* np;

p  = (double *)malloc(sizeof(double)*number_of_doubles+7L);
np = (double *)((((long)(p))+7L) & (-8L));
Then use np instead of p to access the data. p is still needed
in order to deallocate the storage.
Introduce Explicit Parallelism into Code  
Where possible, break long dependency chains into several
independent dependency chains that can then be executed in
parallel, exploiting the pipelined execution units. This is
especially important for floating-point code, whether it is
mapped to x87 or 3DNow! instructions, because of the longer
latency of floating-point operations. Since most languages,
including ANSI C, guarantee that floating-point expressions are
not re-ordered, compilers usually cannot perform such
optimizations unless they offer a switch to allow ANSI non-
compliant reordering of floating-point expressions according to
algebraic rules.
Note that re-ordered code that is algebraically identical to the  
original code does not necessarily deliver identical  
computational results due to the lack of associativity of floating  
point operations. There are well-known numerical  
considerations in applying these optimizations (consult a book  
on numerical analysis). In some cases, these optimizations may  
lead to unexpected results. Fortunately, in the vast majority of  
cases, the final result will differ only in the least significant  
bits.  
Example 1 (Avoid):
double a[100], sum;
int i;

sum = 0.0;
for (i=0; i<100; i++) {
   sum += a[i];
}
Example 2 (Preferred):
double a[100], sum1, sum2, sum3, sum4, sum;
int i;

sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i=0; i<100; i+=4) {
   sum1 += a[i];
   sum2 += a[i+1];
   sum3 += a[i+2];
   sum4 += a[i+3];
}
sum = (sum4+sum3)+(sum1+sum2);
Notice that the 4-way unrolling was chosen to exploit the 4-stage  
fully pipelined floating-point adder. Each stage of the floating-  
point adder is occupied on every clock cycle, ensuring maximal  
sustained utilization.  
Explicitly Extract Common Subexpressions  
In certain situations, C compilers are unable to extract common  
subexpressions from floating-point expressions due to the  
guarantee against reordering of such expressions in the ANSI  
standard. Specifically, the compiler can not re-arrange the  
computation according to algebraic equivalencies before  
extracting common subexpressions. In such cases, the  
programmer should manually extract the common  
subexpression. It should be noted that re-arranging the  
expression may result in different computational results due to  
the lack of associativity of floating-point operations, but the  
results usually differ in only the least significant bits.  
Example 1 (Avoid):
double a,b,c,d,e,f;
e = b*c/d;
f = b/d*a;

Example 1 (Preferred):
double a,b,c,d,e,f,t;
t = b/d;
e = c*t;
f = a*t;
Example 2 (Avoid):
double a,b,c,e,f;
e = a/c;
f = b/c;

Example 2 (Preferred):
double a,b,c,e,f,t;
t = 1.0/c;
e = a*t;
f = b*t;
C Language Structure Component Considerations  
Many compilers have options that allow padding of structures  
to make their size multiples of words, doublewords, or  
quadwords, in order to achieve better alignment for structures.  
In addition, to improve the alignment of structure members,  
some compilers might allocate structure elements in an order  
that differs from the order in which they are declared. However,  
some compilers might not offer any of these features, or their  
implementation might not work properly in all situations.  
Therefore, to achieve the best alignment of structures and  
structure members while minimizing the amount of padding  
regardless of compiler optimizations, the following methods are  
suggested.  
Sort by Base Type Size
Sort structure members according to their base type size,
declaring members with a larger base type size ahead of
members with a smaller base type size.
Pad by Multiple of Largest Base Type Size
Pad the structure to a multiple of the largest base type size of
any member. In this fashion, if the first member of a structure is
naturally aligned, all other members are naturally aligned as
well. Padding the structure to a multiple of the largest base
type size allows, for example, arrays of structures to be
perfectly aligned.
The following example demonstrates the reordering of  
structure member declarations:  
Original ordering (Avoid):
struct {
   char   a[5];
   long   k;
   double x;
} baz;

New ordering, with padding (Preferred):
struct {
   double x;
   long   k;
   char   a[5];
   char   pad[7];
} baz;
See page 55 for a different perspective.
Sort Local Variables According to Base Type Size  
When a compiler allocates local variables in the same order in  
which they are declared in the source code, it can be helpful to  
declare local variables in such a manner that variables with a  
larger base type size are declared ahead of the variables with  
smaller base type size. Then, if the first variable is allocated so  
that it is naturally aligned, all other variables are allocated  
contiguously in the order they are declared, and are naturally  
aligned without any padding.  
Some compilers do not allocate variables in the order they are  
declared. In these cases, the compiler should automatically  
allocate variables in such a manner as to make them naturally  
aligned with the minimum amount of padding. In addition,  
some compilers do not guarantee that the stack is aligned  
suitably for the largest base type (that is, they do not guarantee  
quadword alignment), so that quadword operands might be  
misaligned, even if this technique is used and the compiler does  
allocate variables in the order they are declared.  
The following example demonstrates the reordering of local  
variable declarations:  
Original ordering (Avoid):
short  ga, gu, gi;
long   foo, bar;
double x, y, z[3];
char   a, b;
float  baz;

Improved ordering (Preferred):
double z[3];
double x, y;
long   foo, bar;
float  baz;
short  ga, gu, gi;
char   a, b;
See "C Language Structure Component Considerations" earlier in this chapter for more information from a different perspective.
Accelerating Floating-Point Divides and Square Roots  
Divides and square roots have a much longer latency than other  
floating-point operations, even though the AMD Athlon  
processor provides significant acceleration of these two  
operations. In some codes, these operations occur so often as to  
seriously impact performance. In these cases, it is  
recommended to port the code to 3DNow! inline assembly or to  
use a compiler that can generate 3DNow! code. If code has hot  
spots that use single-precision arithmetic only (i.e., all  
computation involves data of type float) and for some reason  
cannot be ported to 3DNow!, the following technique may be  
used to improve performance.  
The x87 FPU has a precision-control field as part of the FPU  
control word. The precision-control setting determines what  
precision results get rounded to. It affects the basic arithmetic  
operations, including divides and square roots. AMD Athlon  
and AMD-K6® family processors implement divide and square  
root in such fashion as to only compute the number of bits  
necessary for the currently selected precision. This means that  
setting precision control to single precision (versus Win32  
default of double precision) lowers the latency of those  
operations.  
The Microsoft® Visual C environment provides functions to  
manipulate the FPU control word and thus the precision  
control. Note that these functions are not very fast, so changes  
of precision control should be inserted where it creates little  
overhead, such as outside a computation-intensive loop.  
Otherwise the overhead created by the function calls outweighs  
the benefit from reducing the latencies of divide and square  
root operations.  
The following example shows how to set the precision control to  
single precision and later restore the original settings in the  
Microsoft Visual C environment.  
Example:  
/* prototype for _controlfp() function */  
#include <float.h>  
unsigned int orig_cw;  
/* Get current FPU control word and save it */  
orig_cw = _controlfp (0,0);  
/* Set precision control in FPU control word to single  
precision. This reduces the latency of divide and square  
root operations.  
*/  
_controlfp (_PC_24, MCW_PC);  
/* restore original FPU control word */  
_controlfp (orig_cw, 0xfffff);  
Avoid Unnecessary Integer Division  
Integer division is the slowest of all integer arithmetic
operations and should be avoided wherever possible. One
possibility for reducing the number of integer divisions is
chained divisions, in which a division can be replaced with a
multiplication, as shown in the following examples. This
replacement is possible only if no overflow occurs during the
computation of the product, which can be determined by
considering the possible ranges of the divisors.
Example 1 (Avoid):
int i,j,k,m;
m = i / j / k;

Example 2 (Preferred):
int i,j,k,m;
m = i / (j * k);
Copy Frequently De-referenced Pointer Arguments to Local Variables
Avoid frequently de-referencing pointer arguments inside a
function. Since the compiler has no knowledge of whether
aliasing exists between the pointers, such de-referencing
cannot be optimized away by the compiler. This prevents data
from being kept in registers and significantly increases memory
traffic.
Note that many compilers have an "assume no aliasing"
optimization switch. This allows the compiler to assume that
two different pointers always have disjoint contents and does
not require copying of pointer arguments to local variables.
Otherwise, copy the data pointed to by the pointer arguments  
to local variables at the start of the function and if necessary  
copy them back at the end of the function.  
Example 1 (Avoid):  
//assumes pointers are different and q!=r  
void isqrt (unsigned long a,  
unsigned long *q,  
unsigned long *r)  
{
*q = a;  
if (a > 0)  
{
while (*q > (*r = a / *q))  
{
*q = (*q + *r) >> 1;  
}
}
*r = a - *q * *q;  
}
Example 2 (Preferred):  
//assumes pointers are different and q!=r  
void isqrt (unsigned long a,  
unsigned long *q,  
unsigned long *r)  
{
unsigned long qq, rr;  
qq = a;  
if (a > 0)  
{
while (qq > (rr = a / qq))  
{
qq = (qq + rr) >> 1;  
}
}
rr = a - qq * qq;  
*q = qq;  
*r = rr;  
}
4  Instruction Decoding Optimizations

This chapter discusses ways to maximize the number of
instructions decoded by the instruction decoders in the
AMD Athlon™ processor. Guidelines are listed in order of
importance.
Overview  
The AMD Athlon processor instruction fetcher reads 16-byte  
aligned code windows from the instruction cache. The  
instruction bytes are then merged into a 24-byte instruction  
queue. On each cycle, the in-order front-end engine selects for  
decode up to three x86 instructions from the instruction-byte  
queue.  
All instructions (x86, x87, 3DNow!, and MMX) are classified
into two types of decodes: DirectPath and VectorPath (see
"DirectPath Decoder" and "VectorPath Decoder" on page 133
for more information). DirectPath instructions are common
instructions that are decoded directly in hardware. VectorPath
instructions are more complex instructions that require the use
of a sequence of multiple operations issued from an on-chip
ROM.
Up to three DirectPath instructions can be selected for decode  
per cycle. Only one VectorPath instruction can be selected for  
decode per cycle. DirectPath instructions and VectorPath  
instructions cannot be simultaneously decoded.  
Select DirectPath Over VectorPath Instructions  
Use DirectPath instructions rather than VectorPath
instructions. DirectPath instructions are optimized for decode
and execute efficiently by minimizing the number of operations
per x86 instruction, which includes "register ← register op
memory" as well as "register ← register op register" forms of
instructions. Up to three DirectPath instructions can be
decoded per cycle. VectorPath instructions block the decoding
of DirectPath instructions.
The vast majority of instructions used by a compiler have been
implemented as DirectPath instructions in the AMD Athlon
processor. Assembly writers must still take into consideration
the usage of DirectPath versus VectorPath instructions.
Load-Execute Instruction Usage  
Use Load-Execute Integer Instructions  
Most load-execute integer instructions are DirectPath
decodable and can be decoded at the rate of three per cycle.
Splitting a load-execute integer instruction into two separate
instructions (a load instruction and a "reg, reg" instruction)
reduces decoding bandwidth and increases register pressure,
which results in lower performance. The split-instruction form
can be used to avoid scheduler stalls for longer executing
instructions and to explicitly schedule the load and execute
operations.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When operating on single-precision or double-precision  
floating-point data, wherever possible use floating-point  
load-execute instructions to increase code density.  
Note: This optimization applies only to floating-point instructions  
with floating-point operands and not with integer operands,  
as described in the next optimization.  
This coding style helps in two ways. First, denser code allows  
more work to be held in the instruction cache. Second, the  
denser code generates fewer internal OPs and, therefore, the  
FPU scheduler holds more work, which increases the chances of  
extracting parallelism from the code.  
Example 1 (Avoid):
FLD  QWORD PTR [TEST1]
FLD  QWORD PTR [TEST2]
FMUL ST, ST(1)

Example 2 (Preferred):
FLD  QWORD PTR [TEST1]
FMUL QWORD PTR [TEST2]
Avoid Load-Execute Floating-Point Instructions with Integer Operands  
Do not use load-execute floating-point instructions with integer
operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR,
FICOM, and FICOMP. Remember that floating-point
instructions can have integer operands, while integer
instructions cannot have floating-point operands.
Floating-point computations involving integer-memory  
operands should use separate FILD and arithmetic instructions.  
This optimization has the potential to increase decode  
bandwidth and OP density in the FPU scheduler. The floating-  
point load-execute instructions with integer operands are  
VectorPath and generate two OPs in a cycle, while the discrete  
equivalent enables a third DirectPath instruction to be decoded  
in the same cycle. In some situations this optimization can also
reduce execution time if the FILD can be scheduled several
instructions ahead of the arithmetic instruction in order to
cover the FILD latency.
Example 1 (Avoid):
FLD   QWORD PTR [foo]
FIMUL DWORD PTR [bar]
FIADD DWORD PTR [baz]

Example 2 (Preferred):
FILD  DWORD PTR [bar]
FILD  DWORD PTR [baz]
FLD   QWORD PTR [foo]
FMULP ST(2), ST
FADDP ST(1), ST
Align Branch Targets in Program Hot Spots  
In program hot spots (i.e., innermost loops in the absence of
profiling data), place branch targets at or near the beginning of
16-byte aligned code windows. This technique helps to
maximize the number of instructions that are filled into the
instruction-byte queue while preserving I-cache space in
branch-intensive code.
Use Short Instruction Lengths  
Assemblers and compilers should generate the tightest code  
possible to optimize use of the I-cache and increase average  
decode rate. Wherever possible, use instructions with shorter  
lengths. Using shorter instructions increases the number of  
instructions that can fit into the instruction-byte queue. For  
example, use 8-bit displacements as opposed to 32-bit  
displacements. In addition, use the single-byte format of simple  
integer instructions whenever possible, as opposed to the  
2-byte opcode ModR/M format.  
Example 1 (Avoid):
81 C0 78 56 34 12    add eax, 12345678h ;uses 2-byte opcode
                                        ; form (with ModR/M)
81 C3 FB FF FF FF    add ebx, -5        ;uses 32-bit
                                        ; immediate
0F 84 05 00 00 00    jz $label1         ;uses 2-byte opcode,
                                        ; 32-bit immediate
Example 2 (Preferred):
05 78 56 34 12       add eax, 12345678h ;uses single byte
                                        ; opcode form
83 C3 FB             add ebx, -5        ;uses 8-bit sign
                                        ; extended immediate
74 05                jz $label1         ;uses 1-byte opcode,
                                        ; 8-bit immediate
Avoid Partial Register Reads and Writes  
In order to handle partial register writes, the AMD Athlon  
processor execution core implements a data-merging scheme.  
In the execution unit, an instruction writing a partial register  
merges the modified portion with the current state of the  
remainder of the register. Therefore, the dependency hardware  
can potentially force a false dependency on the most recent  
instruction that writes to any part of the register.  
Example 1 (Avoid):
MOV AL, 10   ;inst 1
MOV AH, 12   ;inst 2 has a false dependency on inst 1
             ;inst 2 merges new AH with current
             ; EAX register value forwarded by inst 1
In addition, an instruction that has a read dependency on any  
part of a given architectural register has a read dependency on  
the most recent instruction that modifies any part of the same  
architectural register.  
Example 2 (Avoid):
MOV BX, 12h  ;inst 1
MOV BL, DL   ;inst 2, false dependency on completion of inst 1
MOV BH, CL   ;inst 3, false dependency on completion of inst 2
MOV AL, BL   ;inst 4, depends on completion of inst 2
Replace Certain SHLD Instructions with Alternative Code  
Certain instances of the SHLD instruction can be replaced by
alternative code using SHR and LEA. The alternative code has
lower latency and requires fewer execution resources. SHR and
LEA (32-bit version) are DirectPath instructions, while SHLD is
a VectorPath instruction. Use of SHR and LEA preserves decode
bandwidth, as it potentially enables the decoding of a third
DirectPath instruction.
Example 1 (Avoid):
SHLD REG1, REG2, 1

Example 1 (Preferred):
SHR REG2, 31
LEA REG1, [REG1*2 + REG2]

Example 2 (Avoid):
SHLD REG1, REG2, 2

Example 2 (Preferred):
SHR REG2, 30
LEA REG1, [REG1*4 + REG2]

Example 3 (Avoid):
SHLD REG1, REG2, 3

Example 3 (Preferred):
SHR REG2, 29
LEA REG1, [REG1*8 + REG2]
Use 8-Bit Sign-Extended Immediates  
Using 8-bit sign-extended immediates improves code density
with no negative effects on the AMD Athlon processor. For
example, ADD BX, -5 should be encoded "83 C3 FB" and not
"81 C3 FF FB".
Use 8-Bit Sign-Extended Displacements  
Use 8-bit sign-extended displacements for conditional  
branches. Using short, 8-bit sign-extended displacements for  
conditional branches improves code density with no negative  
effects on the AMD Athlon processor.  
Code Padding Using Neutral Code Fillers  
Occasionally a need arises to insert neutral code fillers into the  
code stream, e.g., for code alignment purposes or to space out  
branches. Since this filler code can be executed, it should take
up as few execution resources as possible, not diminish decode
density, and not modify any processor state other than
advancing EIP. One-byte padding can easily be achieved using
the NOP instruction (XCHG EAX, EAX; opcode 0x90). In the
x86 architecture, there are several multi-byte "NOP"
instructions available that do not change processor state other
than EIP:
MOV REG, REG  
XCHG REG, REG  
CMOVcc REG, REG  
SHR REG, 0  
SAR REG, 0  
SHL REG, 0  
SHRD REG, REG, 0  
SHLD REG, REG, 0  
LEA REG, [REG]  
LEA REG, [REG+00]  
LEA REG, [REG*1+00]  
LEA REG, [REG+00000000]  
LEA REG, [REG*1+00000000]  
Not all of these instructions are equally suitable for purposes of  
code padding. For example, SHLD/SHRD are microcoded which  
reduces decode bandwidth and takes up execution resources.  
Recommendations for the AMD Athlon™ Processor

For code that is optimized specifically for the AMD Athlon
processor, the optimal code fillers are NOP instructions (opcode
0x90) with up to two REP prefixes (0xF3). In the AMD Athlon
processor, a NOP with up to two REP prefixes can be handled
by a single decoder with no overhead. As the REP prefixes are
redundant and meaningless, they get discarded, and NOPs are
handled without using any execution resources. The three
decoders of the AMD Athlon processor can handle up to three
NOPs, each with up to two REP prefixes, in a single cycle, for a
neutral code filler of up to nine bytes.
Note: When used as a filler instruction, REP/REPNE prefixes can  
be used in conjunction only with NOPs. REP/REPNE has  
undefined behavior when used with instructions other than  
a NOP.  
If a larger amount of code padding is required, it is  
recommended to use a JMP instruction to jump across the  
padding region. The following assembly language macros show  
this:  
NOP1_ATHLON TEXTEQU <DB 090h>  
NOP2_ATHLON TEXTEQU <DB 0F3h, 090h>  
NOP3_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h>  
NOP4_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 090h>  
NOP5_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 090h>  
NOP6_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h>  
NOP7_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h,  
090h>  
NOP8_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h,  
0F3h, 090h>  
NOP9_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h,  
0F3h, 0F3h, 090h>  
NOP10_ATHLON TEXTEQU <DB 0EBh, 008h, 90h, 90h, 90h, 90h,
90h, 90h, 90h, 90h>  
Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code
On x86 processors other than the AMD Athlon processor
(including the AMD-K6 family of processors), the REP prefix
and especially multiple prefixes cause decoding overhead, so
the above technique is not recommended for code that has to
run well both on the AMD Athlon processor and on other x86
processors (blended code). In such cases, the instructions and
instruction sequences below are recommended. For neutral
code fillers longer than eight bytes, the JMP instruction can be
used to jump across the padding region.
Note that each of the instructions and instruction sequences  
below utilizes an x86 register. To avoid performance  
degradation, the register used in the padding should be  
selected so as to not lengthen existing dependency chains, i.e.,  
one should select a register that is not used by instructions in  
the vicinity of the neutral code filler. Note that certain  
instructions use registers implicitly. For example, PUSH, POP,  
CALL, and RET all make implicit use of the ESP register. The  
5-byte filler sequence below consists of two instructions. If flag  
changes across the code padding are acceptable, the following  
instructions may be used as single instruction, 5-byte code  
fillers:  
TEST EAX, 0FFFF0000h  
CMP EAX, 0FFFF0000h  
The following assembly language macros show the
recommended neutral code fillers for code that is optimized for
the AMD Athlon processor but also has to run well on other x86
processors. Note that for some padding lengths, versions using
ESP or EBP are missing due to the lack of fully generalized
addressing modes for those registers.
NOP2_EAX TEXTEQU <DB 08Bh,0C0h> ;mov eax, eax  
NOP2_EBX TEXTEQU <DB 08Bh,0DBh> ;mov ebx, ebx  
NOP2_ECX TEXTEQU <DB 08Bh,0C9h> ;mov ecx, ecx  
NOP2_EDX TEXTEQU <DB 08Bh,0D2h> ;mov edx, edx  
NOP2_ESI TEXTEQU <DB 08Bh,0F6h> ;mov esi, esi  
NOP2_EDI TEXTEQU <DB 08Bh,0FFh> ;mov edi, edi  
NOP2_ESP TEXTEQU <DB 08Bh,0E4h> ;mov esp, esp  
NOP2_EBP TEXTEQU <DB 08Bh,0EDh> ;mov ebp, ebp  
NOP3_EAX TEXTEQU <DB 08Dh,004h,020h> ;lea eax, [eax]  
NOP3_EBX TEXTEQU <DB 08Dh,01Ch,023h> ;lea ebx, [ebx]  
NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]  
NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx]  
NOP3_ESI TEXTEQU <DB 08Dh,034h,026h> ;lea esi, [esi]
NOP3_EDI TEXTEQU <DB 08Dh,03Ch,027h> ;lea edi, [edi]
NOP3_ESP TEXTEQU <DB 08Dh,024h,024h> ;lea esp, [esp]
NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h> ;lea ebp, [ebp]  
NOP4_EAX TEXTEQU <DB 08Dh,044h,020h,000h> ;lea eax, [eax+00]  
NOP4_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h> ;lea ebx, [ebx+00]  
NOP4_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h> ;lea ecx, [ecx+00]  
NOP4_EDX TEXTEQU <DB 08Dh,054h,022h,000h> ;lea edx, [edx+00]  
NOP4_ESI TEXTEQU <DB 08Dh,074h,026h,000h> ;lea esi, [esi+00]
NOP4_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h> ;lea edi, [edi+00]
NOP4_ESP TEXTEQU <DB 08Dh,064h,024h,000h> ;lea esp, [esp+00]
NOP5_EAX TEXTEQU <DB 08Dh,044h,020h,000h,090h> ;lea eax, [eax+00]; nop
NOP5_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h,090h> ;lea ebx, [ebx+00]; nop
NOP5_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h,090h> ;lea ecx, [ecx+00]; nop
NOP5_EDX TEXTEQU <DB 08Dh,054h,022h,000h,090h> ;lea edx, [edx+00]; nop
NOP5_ESI TEXTEQU <DB 08Dh,074h,026h,000h,090h> ;lea esi, [esi+00]; nop
NOP5_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h,090h> ;lea edi, [edi+00]; nop
NOP5_ESP TEXTEQU <DB 08Dh,064h,024h,000h,090h> ;lea esp, [esp+00]; nop
NOP6_EAX TEXTEQU <DB 08Dh,080h,0,0,0,0> ;lea eax, [eax+00000000]
NOP6_EBX TEXTEQU <DB 08Dh,09Bh,0,0,0,0> ;lea ebx, [ebx+00000000]
NOP6_ECX TEXTEQU <DB 08Dh,089h,0,0,0,0> ;lea ecx, [ecx+00000000]
NOP6_EDX TEXTEQU <DB 08Dh,092h,0,0,0,0> ;lea edx, [edx+00000000]
NOP6_ESI TEXTEQU <DB 08Dh,0B6h,0,0,0,0> ;lea esi, [esi+00000000]
NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0> ;lea edi, [edi+00000000]
NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0> ;lea ebp, [ebp+00000000]
NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0> ;lea eax, [eax*1+00000000]
NOP7_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0> ;lea ebx, [ebx*1+00000000]
NOP7_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0> ;lea ecx, [ecx*1+00000000]
NOP7_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0> ;lea edx, [edx*1+00000000]
NOP7_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0> ;lea esi, [esi*1+00000000]
NOP7_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0> ;lea edi, [edi*1+00000000]
NOP7_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0> ;lea ebp, [ebp*1+00000000]
NOP8_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0,90h> ;lea eax, [eax*1+00000000]; nop
NOP8_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0,90h> ;lea ebx, [ebx*1+00000000]; nop
NOP8_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0,90h> ;lea ecx, [ecx*1+00000000]; nop
NOP8_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0,90h> ;lea edx, [edx*1+00000000]; nop
NOP8_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0,90h> ;lea esi, [esi*1+00000000]; nop
NOP8_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0,90h> ;lea edi, [edi*1+00000000]; nop
NOP8_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0,90h> ;lea ebp, [ebp*1+00000000]; nop
NOP9     TEXTEQU <DB 0EBh,007h,90h,90h,90h,90h,90h,90h,90h> ;jmp short over 7 nops
5   Cache and Memory Optimizations
This chapter describes code optimization techniques that take
advantage of the large L1 caches and high-bandwidth buses of
the AMD Athlon™ processor. Guidelines are listed in order of
importance.
Memory Size and Alignment Issues  
Avoid Memory Size Mismatches  
Avoid memory size mismatches when instructions operate on  
the same data. For instructions that store and reload the same  
data, keep operands aligned and keep the loads/stores of each  
operand the same size. The following code examples result in a  
store-to-load-forwarding (STLF) stall:  
Example 1 (Avoid):  
MOV DWORD PTR [FOO], EAX  
MOV DWORD PTR [FOO+4], EDX  
FLD QWORD PTR [FOO]  
Avoid large-to-small mismatches, as shown in the following  
code:  
Example 2 (Avoid):  
FST QWORD PTR [FOO]  
MOV EAX, DWORD PTR [FOO]  
MOV EDX, DWORD PTR [FOO+4]  
Align Data Where Possible  
In general, avoid misaligned data references. All data whose  
size is a power of 2 is considered aligned if it is naturally  
aligned. For example:  
- QWORD accesses are aligned if they access an address divisible by 8.
- DWORD accesses are aligned if they access an address divisible by 4.
- WORD accesses are aligned if they access an address divisible by 2.
- TBYTE accesses are aligned if they access an address divisible by 8.
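The natural-alignment rule above reduces to a simple predicate. A minimal C sketch (the helper name is invented for illustration):

```c
#include <stdint.h>

/* An access of the given size (in bytes) is naturally aligned when its
   address is divisible by that size; TBYTE (10 bytes) is the exception
   and, per the rules above, only needs 8-byte alignment. */
static int is_naturally_aligned(uintptr_t addr, unsigned size_in_bytes)
{
    unsigned required = (size_in_bytes == 10) ? 8 : size_in_bytes;
    return (addr % required) == 0;
}
```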
A misaligned store or load operation suffers a minimum
one-cycle penalty in the AMD Athlon processor load/store
pipeline. In addition, using misaligned loads and stores
increases the likelihood of encountering a store-to-load
forwarding pitfall. For a more detailed discussion of store-to-
load forwarding issues, see "Store-to-Load Forwarding
Restrictions" on page 51.
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
For code that can take advantage of prefetching, use the
3DNow! PREFETCH and PREFETCHW instructions to
increase the effective bandwidth to the AMD Athlon processor.
The PREFETCH and PREFETCHW instructions take
advantage of the AMD Athlon processor's high bus bandwidth
to hide long latencies when fetching data from system memory.
The prefetch instructions are essentially integer instructions
and can be used anywhere, in any type of code (integer, x87,
3DNow!, MMX, etc.).
Large data sets typically require unit-stride access to ensure  
that all data pulled in by PREFETCH or PREFETCHW is  
actually used. If necessary, algorithms or data structures should  
be reorganized to allow unit-stride access.  
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
The PREFETCHNTA/T0/T1/T2 instructions in the MMX
extensions are processor implementation dependent. To
maintain compatibility with the 25 million AMD-K6®-2 and
AMD-K6-III processors already sold, use the 3DNow!
PREFETCH/W instructions instead of the various prefetch
flavors in the new MMX extensions.
PREFETCHW Usage  
Code that intends to modify the cache line brought in through  
prefetching should use the PREFETCHW instruction. While  
PREFETCHW works the same as a PREFETCH on the  
AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a  
hint to the AMD Athlon processor of an intent to modify the  
cache line. The AMD Athlon processor will mark the cache line  
being brought in by PREFETCHW as Modified. Using  
PREFETCHW can save an additional 15-25 cycles compared to  
a PREFETCH and the subsequent cache state change caused by  
a write to the prefetched cache line.  
Multiple Prefetches  
Programmers can initiate multiple outstanding prefetches on  
the AMD Athlon processor. While the AMD-K6-2 and  
AMD-K6-III processors can have only one outstanding prefetch,  
the AMD Athlon processor can have up to six outstanding  
prefetches. When all six buffers are filled by various memory  
read requests, the processor will simply ignore any new  
prefetch requests until a buffer frees up. Multiple prefetch  
requests are essentially handled in-order. If data is needed first,  
then that data should be prefetched first.  
The example below shows how to initiate multiple prefetches  
when traversing more than one array.  
Example (Multiple Prefetches):  
.CODE  
.K3D  
; original C code  
;
; #define LARGE_NUM 65536
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;   array_a[i] = array_b[i] * array_c[i];
; }
        ;ARR_SIZE equals LARGE_NUM * 8 (size of each array in bytes)
        MOV ECX, (-LARGE_NUM)   ;use biased index
        MOV EAX, OFFSET array_a ;get address of array_a
        MOV EDX, OFFSET array_b ;get address of array_b
        MOV ESI, OFFSET array_c ;get address of array_c

$loop:  PREFETCHW [EAX+196]     ;two cachelines ahead
        PREFETCH  [EDX+196]     ;two cachelines ahead
        PREFETCH  [ESI+196]     ;two cachelines ahead

        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE]    ;b[i]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE]    ;b[i]*c[i]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE]    ;a[i] = b[i]*c[i]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+8]  ;b[i+1]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+8]  ;b[i+1]*c[i+1]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+8]  ;a[i+1] = b[i+1]*c[i+1]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+16] ;b[i+2]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+16] ;b[i+2]*c[i+2]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+16] ;a[i+2] = b[i+2]*c[i+2]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+24] ;b[i+3]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+24] ;b[i+3]*c[i+3]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+24] ;a[i+3] = b[i+3]*c[i+3]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+32] ;b[i+4]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+32] ;b[i+4]*c[i+4]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+32] ;a[i+4] = b[i+4]*c[i+4]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+40] ;b[i+5]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+40] ;b[i+5]*c[i+5]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+40] ;a[i+5] = b[i+5]*c[i+5]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+48] ;b[i+6]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+48] ;b[i+6]*c[i+6]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+48] ;a[i+6] = b[i+6]*c[i+6]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+56] ;b[i+7]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+56] ;b[i+7]*c[i+7]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+56] ;a[i+7] = b[i+7]*c[i+7]

        ADD ECX, 8              ;next 8 products
        JNZ $loop               ;until none left
END
The following optimization rules were applied to this example:
- Loops should be unrolled to make sure that the data stride
  per loop iteration is equal to the length of a cache line. This
  avoids overlapping PREFETCH instructions and thus makes
  optimal use of the available number of outstanding
  PREFETCHes.
- Because the array array_a is written rather than read,
  PREFETCHW is used instead of PREFETCH to avoid the
  overhead of switching cache lines to the correct MESI
  state. The PREFETCH lookahead has been optimized such
  that each loop iteration works on three cache lines while
  six active PREFETCHes bring in the next six cache lines.
- Index arithmetic has been reduced to a minimum by the use
  of complex addressing modes and biasing of the array base
  addresses in order to cut down on loop overhead.
Determining Prefetch Distance
Given the latency of a typical AMD Athlon processor system
and expected processor speeds, the following formula should be
used to determine the prefetch distance in bytes for a single
array:

   Prefetch Distance = 200 * (DS/C) bytes

Round up to the nearest 64-byte cache line.
- The number 200 is a constant based upon expected
  AMD Athlon processor clock frequencies and typical system
  memory latencies.
- DS is the data stride in bytes per loop iteration.
- C is the number of cycles for one loop iteration to execute
  entirely from the L1 cache.

The prefetch distance for multiple arrays is typically even
longer.
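The formula and its cache-line rounding can be sketched in C (the helper name is invented; integer arithmetic is used since the result is rounded up anyway):

```c
/* Prefetch distance = 200 * (DS / C) bytes, rounded up to the next
   64-byte cache line.  ds_bytes = data stride per loop iteration,
   c_cycles = cycles per loop iteration when running from L1. */
static unsigned prefetch_distance(unsigned ds_bytes, unsigned c_cycles)
{
    unsigned raw = (200 * ds_bytes) / c_cycles;
    return (raw + 63) & ~63u;   /* round up to a 64-byte boundary */
}
```

For example, with a 64-byte stride and a 20-cycle loop, the distance is 640 bytes (already a cache-line multiple); a 24-byte stride at 10 cycles gives 480, which rounds up to 512.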
Prefetch at Least 64 Bytes Away from Surrounding Stores
The PREFETCH and PREFETCHW instructions can be
affected by false dependencies on stores. If there is a store to an
address that matches a request, that request (the PREFETCH
or PREFETCHW instruction) may be blocked until the store is
written to the cache. Therefore, code should prefetch data that
is located at least 64 bytes away from the data address of any
surrounding store.
Take Advantage of Write Combining  
Operating system and device driver programmers should take  
advantage of the write-combining capabilities of the  
AMD Athlon processor. The AMD Athlon processor has a very  
aggressive write-combining algorithm, which improves  
performance significantly.  
See page 155 for more details.
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Sharing code and data in the same 64-byte cache line may cause
the L1 caches to thrash (unnecessary castout of code/data) in
order to maintain coherency between the separate instruction
and data caches. The AMD Athlon processor has a cache-line
size of 64 bytes, which is twice the size of previous processors.
Programmers must be aware that code and data should not be
shared within this larger cache line, especially if the data
becomes modified.
For example, programmers should consider that a memory  
indirect JMP instruction may have the data for the jump table  
residing in the same 64-byte cache line as the JMP instruction,  
which would result in lower performance.  
Although rare, do not place critical code at the border between
32-byte-aligned code segments and data segments. Code at the
start or end of a data segment should be executed as rarely as
possible or simply padded with garbage.
In general, the following should be avoided:
- self-modifying code
- storing data in code segments
Store-to-Load Forwarding Restrictions  
Store-to-load forwarding refers to the process of a load reading  
(forwarding) data from the store buffer (LS2). There are  
instances in the AMD Athlon processor load/store architecture  
when either a load operation is not allowed to read needed data  
from a store in the store buffer, or a load OP detects a false data  
dependency on a store in the store buffer.  
In either case, the load cannot complete (load the needed data  
into a register) until the store has retired out of the store buffer  
and written to the data cache. A store-buffer entry cannot retire  
and write to the data cache until every instruction before the  
store has completed and retired from the reorder buffer.  
The implication of this restriction is that all instructions in the  
reorder buffer, up to and including the store, must complete  
and retire out of the reorder buffer before the load can  
complete. Effectively, the load has a false dependency on every  
instruction up to the store.  
The following sections describe store-to-load forwarding  
examples that are acceptable and those that should be avoided.  
Store-to-Load Forwarding Pitfalls: True Dependencies
A load is allowed to read data from the store-buffer entry only if
all of the following conditions are satisfied:
- The start address of the load matches the start address of
  the store.
- The load operand size is equal to or smaller than the store
  operand size.
- Neither the load nor the store is misaligned.
- The store data is not from a high-byte register (AH, BH, CH,
  or DH).

The following sections describe common-case scenarios to avoid,
in which a load has a true dependency on an LS2-buffered store
but cannot read (forward) data from a store-buffer entry.
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a
narrow-to-wide store-buffer data forwarding restriction:
- The operand size of the store data is smaller than the
  operand size of the load data.
- The range of addresses spanned by the store data covers
  some sub-region of the range of addresses spanned by the
  load data.

Avoid the type of code shown in the following two examples.
Example 1 (Avoid):
MOV EAX, 10h
MOV WORD PTR [EAX], BX   ;word store
...
MOV ECX, DWORD PTR [EAX] ;doubleword load
                         ;cannot forward upper
                         ; byte from store buffer
Example 2 (Avoid):  
MOV EAX, 10h  
MOV BYTE PTR [EAX + 3], BL ;byte store  
...  
MOV ECX, DWORD PTR [EAX] ;doubleword load  
;cannot forward upper byte  
; from store buffer  
Wide-to-Narrow Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a
wide-to-narrow store-buffer data forwarding restriction:
- The operand size of the store data is greater than the
  operand size of the load data.
- The start address of the store data does not match the start
  address of the load.

Example 3 (Avoid):
MOV EAX, 10h
ADD DWORD PTR [EAX], EBX   ;doubleword store
MOV CX, WORD PTR [EAX + 2] ;word load-cannot forward high
                           ; word from store buffer
Use Example 5 instead of Example 4.

Example 4 (Avoid):
MOVQ [foo], MM1    ;store upper and lower half
...
ADD  EAX, [foo]    ;fine
ADD  EDX, [foo+4]  ;uh-oh!
Example 5 (Preferred):
MOVD      [foo], MM1   ;store lower half
PUNPCKHDQ MM1, MM1     ;get upper half into lower half
MOVD      [foo+4], MM1 ;store upper half
...
ADD       EAX, [foo]   ;fine
ADD       EDX, [foo+4] ;fine
Misaligned Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a misaligned
store-buffer data forwarding restriction:
- The store or load address is misaligned. For example, a
  quadword store is not aligned to a quadword boundary, a
  doubleword store is not aligned to a doubleword boundary,
  etc.

A common case of misaligned store-data forwarding involves
the passing of misaligned quadword floating-point data on the
doubleword-aligned integer stack. Avoid the type of code shown
in the following example.

Example 6 (Avoid):
MOV ESP, 24h
FSTP QWORD PTR [ESP] ;esp=24
                     ;store occurs to quadword
                     ; misaligned address
...
FLD QWORD PTR [ESP]  ;quadword load cannot forward
                     ; from quadword misaligned
                     ; 'fstp [esp]' store OP
High-Byte Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a high-byte
store-data buffer forwarding restriction:
- The store data is from a high-byte register (AH, BH, CH, or
  DH).

Avoid the type of code shown in the following example.

Example 7 (Avoid):
MOV EAX, 10h
MOV [EAX], BH ;high-byte store
...
MOV DL, [EAX] ;load cannot forward from
              ; high-byte store
One Supported Store-to-Load Forwarding Case
There is one case of mismatched store-to-load forwarding that
is supported by the AMD Athlon processor: forwarding the
lower 32 bits of an aligned quadword write into a doubleword
read is allowed.

Example 8 (Allowed):
MOVQ [AlignedQword], MM0
...
MOV  EAX, [AlignedQword]
Summary of Store-to-Load Forwarding Pitfalls to Avoid
To avoid store-to-load forwarding pitfalls, code should conform
to the following guidelines:
- Maintain consistent use of operand size across all loads and
  stores. Preferably, use doubleword or quadword operand
  sizes.
- Avoid misaligned data references.
- Avoid narrow-to-wide and wide-to-narrow forwarding cases.
- When using word or byte stores, avoid loading data from
  anywhere in the same doubleword of memory other than the
  identical start addresses of the stores.
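The first guideline — matched operand sizes — has a direct C analogue. A hedged sketch (names invented for illustration): instead of storing a 64-bit value and reloading its halves at a different width, store the two 32-bit halves separately so each later 32-bit load matches a same-size, same-address store:

```c
#include <stdint.h>

/* Store a 64-bit value as two 32-bit halves so that subsequent 32-bit
   loads match the size and start address of a prior store, avoiding
   the wide-to-narrow forwarding case of Example 4. */
static void store_halves(uint32_t dst[2], uint64_t v)
{
    dst[0] = (uint32_t)v;          /* lower half */
    dst[1] = (uint32_t)(v >> 32);  /* upper half */
}
```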
Stack Alignment Considerations
Make sure the stack is suitably aligned for the local variable
with the largest base type. Then, using the technique described
on page 55, all variables can be properly aligned with no
padding.
Extend to 32 Bits Before Pushing onto Stack
Function arguments smaller than 32 bits should be extended to
32 bits before being pushed onto the stack, which ensures that
the stack is always doubleword aligned on entry to a function.

If a function has no local variables with a base type larger than
a doubleword, no further work is necessary. If the function does
have local variables whose base type is larger than a
doubleword, additional code should be inserted to ensure
proper alignment of the stack. For example, the following code
achieves quadword alignment:
Example (Preferred):

Prolog:
  PUSH EBP
  MOV  EBP, ESP
  SUB  ESP, SIZE_OF_LOCALS ;size of local variables
  AND  ESP, -8
  ;push registers that need to be preserved

Epilog:
  ;pop registers that needed to be preserved
  MOV  ESP, EBP
  POP  EBP
  RET
With this technique, function arguments can be accessed via  
EBP, and local variables can be accessed via ESP. In order to  
free EBP for general use, it needs to be saved and restored  
between the prolog and the epilog.  
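The AND ESP, -8 step works because clearing the low three bits can only move the stack pointer down, past the space just reserved. A small C sketch of the same arithmetic (illustrative only; the function name is invented):

```c
#include <stdint.h>

/* Round a 32-bit stack pointer down to the next quadword boundary,
   mirroring the AND ESP, -8 in the prolog above. */
static uint32_t align_down8(uint32_t sp)
{
    return sp & (uint32_t)-8;  /* clear the low three bits */
}
```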
Align TBYTE Variables on Quadword Aligned Addresses
Align variables of type TBYTE on quadword-aligned addresses.
To make an array of TBYTE variables whose elements are all
aligned, space the array elements 16 bytes apart. In general,
TBYTE variables should be avoided. Use double-precision
variables instead.
C Language Structure Component Considerations
Structures (struct in the C language) should be made a
multiple of the size of the largest base type of any of their
components. To meet this requirement, padding should be used
where necessary.

Language definitions permitting, to minimize padding,
structure components should be sorted and allocated such that
the components with a larger base type are allocated ahead of
those with a smaller base type. For example, consider the
following code:
Example:
struct {
    char   a[5];
    long   k;
    double x;
} baz;
The structure components should be allocated (lowest to  
highest address) as follows:  
x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0  
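The payoff of sorting components by descending base-type size can be checked directly with sizeof. A sketch (the struct names are invented; this assumes a typical ABI where double has 4- or 8-byte alignment, as on mainstream x86 compilers):

```c
#include <stddef.h>

/* Unsorted: the char before the double forces interior padding,
   and the trailing char forces tail padding. */
struct unsorted_locals { char c; double x; char c2; };

/* Sorted largest-first: both chars share one tail-padding region,
   so the struct is never larger than the unsorted layout. */
struct sorted_locals   { double x; char c; char c2; };
```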
See page 27 for more information from a C source code perspective.
Sort Variables According to Base Type Size
Sort local variables according to their base type size, and
allocate variables with a larger base type size ahead of those
with a smaller base type size. Assuming the first variable
allocated is naturally aligned, all other variables are then
naturally aligned without any padding. The following example
is a declaration of local variables in a C function:

Example:
short  ga, gu, gi;
long   foo, bar;
double x, y, z[3];
char   a, b;
float  baz;
Allocate in the following order from left to right (from higher to
lower addresses):
x, y, z[2], z[1], z[0], foo, bar, baz, ga, gu, gi, a, b
See page 28 for more information from a C source code perspective.
6   Branch Optimizations
While the AMD Athlon™ processor contains a very
sophisticated branch unit, certain optimizations increase the
effectiveness of the branch prediction unit. This chapter
discusses rules that improve branch prediction and minimize
branch penalties. Guidelines are listed in order of importance.
Avoid Branches Dependent on Random Data
Avoid conditional branches that depend on random data, as
these are difficult to predict. For example, suppose a piece of
code receives a random stream of characters 'A' through 'Z' and
branches if the character is before 'M' in the collating sequence.
Data-dependent branches acting upon basically random data
cause the branch prediction logic to mispredict the branch
about 50% of the time.
If possible, design branch-free alternative code sequences,
which result in shorter average execution time. This technique
is especially important if the branch body is small. Examples 1
and 2 illustrate this concept using the CMOV instruction. Note
that the AMD-K6® processor does not support the CMOV
instruction. Therefore, blended AMD-K6 and AMD Athlon
processor code should use Examples 3 and 4.
AMD Athlon™ Processor Specific Code
Example 1: Signed integer ABS function (X = labs(X)):
MOV   ECX, [X]  ;load value
MOV   EBX, ECX  ;save value
NEG   ECX       ;-value
CMOVS ECX, EBX  ;if -value is negative, select value
MOV   [X], ECX  ;save labs result
Example 2: Unsigned integer min function (z = x < y ? x : y):
MOV    EAX, [X]  ;load X value
MOV    EBX, [Y]  ;load Y value
CMP    EAX, EBX  ;EBX<=EAX ? CF=0 : CF=1
CMOVNC EAX, EBX  ;EAX=(EBX<=EAX) ? EBX:EAX
MOV    [Z], EAX  ;save min (X,Y)
Blended AMD-K6® and AMD Athlon™ Processor Code
Example 3: Signed integer ABS function (X = labs(X)):
MOV ECX, [X]  ;load value
MOV EBX, ECX  ;save value
SAR ECX, 31   ;x < 0 ? 0xffffffff : 0
XOR EBX, ECX  ;x < 0 ? ~x : x
SUB EBX, ECX  ;x < 0 ? (~x)+1 : x
MOV [X], EBX  ;x < 0 ? -x : x
Example 4: Unsigned integer min function (z = x < y ? x : y):
MOV EAX, [x]  ;load x
MOV EBX, [y]  ;load y
SUB EAX, EBX  ;x < y ? CF : NC ; x - y
SBB ECX, ECX  ;x < y ? 0xffffffff : 0
AND ECX, EAX  ;x < y ? x - y : 0
ADD ECX, EBX  ;x < y ? x - y + y : y
MOV [z], ECX  ;x < y ? x : y
Example 5: Hexadecimal-to-ASCII conversion
(y = x < 10 ? x + 0x30 : x + 0x37):
MOV AL, [X] ;load X value
CMP AL, 10  ;if x is less than 10, set carry flag
SBB AL, 69h ;0..9 -> 96h, Ah..Fh -> A1h..A6h
DAS         ;0..9: subtract 66h, Ah..Fh: subtract 60h
MOV [Y], AL ;save conversion in y
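The same conversion can be written branch-free in C by folding the comparison result into the add (a sketch; the helper name is invented):

```c
/* Branch-free hexadecimal digit (0..15) to ASCII:
   y = x < 10 ? x + '0' : x - 10 + 'A'.
   The comparison (x >= 10) evaluates to 0 or 1, so the adjustment
   of 7 = 'A' - '0' - 10 is applied without a conditional branch. */
static unsigned char hex_to_ascii(unsigned x)
{
    return (unsigned char)(x + 0x30 + 7 * (x >= 10));
}
```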
Example 6: Increment Ring Buffer Offset:
//C Code
char buf[BUFSIZE];
int a;

if (a < (BUFSIZE-1)) {
    a++;
} else {
    a = 0;
}

;-------------
;Assembly Code
MOV EAX, [a]         ; old offset
CMP EAX, (BUFSIZE-1) ; a < (BUFSIZE-1) ? CF : NC
INC EAX              ; a++ (INC does not modify CF)
SBB EDX, EDX         ; a < (BUFSIZE-1) ? 0xffffffff : 0
AND EAX, EDX         ; a < (BUFSIZE-1) ? a++ : 0
MOV [a], EAX         ; store new offset
Example 7: Integer Signum Function:
//C Code
int a, s;

if (!a) {
    s = 0;
} else if (a < 0) {
    s = -1;
} else {
    s = 1;
}

;-------------
;Assembly Code
MOV EAX, [a] ;load a
CDQ          ;t = a < 0 ? 0xffffffff : 0
CMP EDX, EAX ;a > 0 ? CF : NC
ADC EDX, 0   ;a > 0 ? t+1 : t
MOV [s], EDX ;signum(x)
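Both idioms translate directly into portable, branch-free C. A sketch (the function names are invented for illustration):

```c
/* Branch-free signum: -1, 0, or 1.  (a > 0) and (a < 0) each
   evaluate to 0 or 1, so no conditional branch is required. */
static int signum(int a)
{
    return (a > 0) - (a < 0);
}

/* Branch-free ring-buffer increment, mirroring Example 6:
   a = (a < bufsize - 1) ? a + 1 : 0 */
static int ring_inc(int a, int bufsize)
{
    int mask = -(a < bufsize - 1);  /* all ones or all zeros */
    return (a + 1) & mask;
}
```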
Always Pair CALL and RETURN
When the 12-entry return address stack gets out of
synchronization, the latency of returns increases. The return
address stack becomes out of sync when:
- calls and returns do not match
- the depth of the return stack is exceeded because of too
  many levels of nested function calls
Replace Branches with Computation in 3DNow!™ Code
Branches negatively impact the performance of 3DNow! code.
Branches can operate on only one data item at a time, i.e., they
are inherently scalar and inhibit the SIMD processing that
makes 3DNow! code superior. Also, branches based on 3DNow!
comparisons require data to be passed to the integer units,
which requires either transport through memory or the use of
MOVD reg, MMreg instructions. If the body of the branch is
small, one can achieve higher performance by replacing the
branch with computation. The computation simulates
predicated execution or conditional moves. The principal tools
for this are the following instructions: PCMPGT, PFCMPGT,
PFCMPGE, PFMIN, PFMAX, PAND, PANDN, POR, and PXOR.
Muxing Constructs
The most important construct for avoiding branches in
3DNow!™ and MMX™ code is a 2-way muxing construct that is
equivalent to the ternary operator '?:' in C and C++. It is
implemented using the PCMP/PFCMP, PAND, PANDN, and
POR instructions. To maximize performance, it is important to
apply the PAND and PANDN instructions in the proper order.
Example 1 (Avoid):
; r = (x < y) ? b : a
;
; in:  mm0 a
;      mm1 b
;      mm2 x
;      mm3 y
; out: mm1 r
PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
MOVQ    MM4, MM3 ; duplicate mask
PANDN   MM3, MM0 ; y > x ? 0 : a
PAND    MM1, MM4 ; y > x ? b : 0
POR     MM1, MM3 ; r = y > x ? b : a
Because the use of PANDN destroys the mask created by PCMP,  
the mask needs to be saved, which requires an additional  
register. This adds an instruction, lengthens the dependency  
chain, and increases register pressure. Therefore 2-way muxing  
constructs should be written as follows.  
Example 2 (Preferred):
   ; r = (x < y) ? b : a
   ;
   ; in:  mm0 a
   ;      mm1 b
   ;      mm2 x
   ;      mm3 y
   ; out: mm1 r

   PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
   PAND    MM1, MM3 ; y > x ? b : 0
   PANDN   MM3, MM0 ; y > x ? 0 : a
   POR     MM1, MM3 ; r = y > x ? b : a
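The mask-select idiom behind this construct can be sketched in plain C; `mux32` is a hypothetical helper name used only for illustration, and each bitwise step corresponds one-to-one to PAND, PANDN, and POR:

```c
#include <stdint.h>

/* Branchless 2-way mux on a 32-bit lane: the C analogue of the
 * PCMPGTD/PAND/PANDN/POR sequence. mux32 is a hypothetical helper,
 * not code from the manual. */
static uint32_t mux32(uint32_t x, uint32_t y, uint32_t a, uint32_t b)
{
    uint32_t mask = (y > x) ? 0xFFFFFFFFu : 0u; /* PCMPGTD: y > x ? all ones : 0 */
    uint32_t t = b & mask;                      /* PAND:  y > x ? b : 0 */
    uint32_t u = a & ~mask;                     /* PANDN: y > x ? 0 : a */
    return t | u;                               /* POR:   y > x ? b : a */
}
```

Because `~mask & a` consumes the mask nondestructively in C, the register-pressure concern of the assembly version (PANDN overwriting the mask) does not arise here; it is purely an instruction-ordering issue in MMX code.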
Sample Code Translated into 3DNow! Code

The following examples use scalar code translated into 3DNow! code. Note that it is not recommended to use 3DNow! SIMD instructions for scalar code, because the advantage of 3DNow! instructions lies in their "SIMDness." These examples are meant to demonstrate general techniques for translating source code with branches into branchless 3DNow! code. Scalar source code was chosen to keep the examples simple. These techniques work in an identical fashion for vector code. Each example shows the C code and the resulting 3DNow! code.
Example 1:

C code:
   float x,y,z;
   if (x < y) {
      z += 1.0;
   }
   else {
      z -= 1.0;
   }

3DNow! code:
   ;in:  MM0 = x
   ;     MM1 = y
   ;     MM2 = z
   ;out: MM0 = z

   MOVQ    MM3, MM0 ;save x
   MOVQ    MM4, one ;1.0
   PFCMPGE MM0, MM1 ;x < y ? 0 : 0xffffffff
   PSLLD   MM0, 31  ;x < y ? 0 : 0x80000000
   PXOR    MM0, MM4 ;x < y ? 1.0 : -1.0
   PFADD   MM0, MM2 ;x < y ? z+1.0 : z-1.0
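The sign-flip trick in example 1 can be mirrored in C by operating on the float's bit pattern; `cond_add` is a hypothetical name for this sketch:

```c
#include <stdint.h>
#include <string.h>

/* Branchless z += (x < y) ? 1.0f : -1.0f, mirroring the
 * PFCMPGE/PSLLD/PXOR/PFADD sequence on float bit patterns.
 * cond_add is a hypothetical helper name for illustration. */
static float cond_add(float x, float y, float z)
{
    uint32_t mask = (x >= y) ? 0xFFFFFFFFu : 0u; /* PFCMPGE */
    uint32_t sign = mask << 31;                  /* PSLLD: 0 or 0x80000000 */
    float one = 1.0f, delta;
    uint32_t bits;
    memcpy(&bits, &one, sizeof bits);
    bits ^= sign;                                /* PXOR: +1.0f or -1.0f */
    memcpy(&delta, &bits, sizeof delta);
    return z + delta;                            /* PFADD */
}
```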
Example 2:

C code:
   float x,z;
   z = fabs(x);
   if (z >= 1) {
      z = 1/z;
   }

3DNow! code:
   ;in:  MM0 = x
   ;out: MM0 = z

   MOVQ     MM5, mabs ;0x7fffffff
   PAND     MM0, MM5  ;z = abs(x)
   PFRCP    MM2, MM0  ;1/z approx
   MOVQ     MM1, MM0  ;save z
   PFRCPIT1 MM0, MM2  ;1/z step
   PFRCPIT2 MM0, MM2  ;1/z final
   PFMIN    MM0, MM1  ;z = z < 1 ? z : 1/z
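The final PFMIN selection can be checked against the branchy C source. This sketch models only the select, not the PFRCP reciprocal refinement (an exact divide stands in for it); `fmin2` plays the role of the single PFMIN instruction:

```c
/* For z > 0, min(z, 1/z) equals the branchy "if (z >= 1) z = 1/z".
 * fmin2 stands in for PFMIN; names here are illustrative only. */
static float fmin2(float a, float b) { return a < b ? a : b; }

static float recip_clamp(float z)        /* assumes z > 0 */
{
    return fmin2(z, 1.0f / z);
}

static float recip_clamp_ref(float z)    /* branchy reference */
{
    return (z >= 1.0f) ? 1.0f / z : z;
}
```

If z < 1 then 1/z > 1 > z, so the minimum is z; if z >= 1 then 1/z <= 1 <= z, so the minimum is 1/z — exactly the intended select.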
Example 3:

C code:
   float x,z,r,res;
   z = fabs(x);
   if (z < 0.575) {
      res = r;
   }
   else {
      res = PI/2 - 2*r;
   }

3DNow! code:
   ;in:  MM0 = x
   ;     MM1 = r
   ;out: MM0 = res

   MOVQ    MM7, mabs ;mask for absolute value
   PAND    MM0, MM7  ;z = abs(x)
   MOVQ    MM2, bnd  ;0.575
   PCMPGTD MM2, MM0  ;z < 0.575 ? 0xffffffff : 0
   MOVQ    MM3, pio2 ;pi/2
   MOVQ    MM0, MM1  ;save r
   PFADD   MM1, MM1  ;2*r
   PFSUBR  MM1, MM3  ;pi/2 - 2*r
   PAND    MM0, MM2  ;z < 0.575 ? r : 0
   PANDN   MM2, MM1  ;z < 0.575 ? 0 : pi/2 - 2*r
   POR     MM0, MM2  ;res = z < 0.575 ? r : pi/2 - 2*r
Example 4:

C code:
   #define PI 3.14159265358979323
   float x,z,r,res;
   /* 0 <= r <= PI/4 */
   z = fabs(x);
   if (z < 1) {
      res = r;
   }
   else {
      res = PI/2 - r;
   }

3DNow! code:
   ;in:  MM0 = x
   ;     MM1 = r
   ;out: MM1 = res

   MOVQ    MM5, mabs ;mask to clear sign bit
   MOVQ    MM6, one  ;1.0
   PAND    MM0, MM5  ;z = abs(x)
   PCMPGTD MM6, MM0  ;z < 1 ? 0xffffffff : 0
   MOVQ    MM4, pio2 ;pi/2
   PFSUB   MM4, MM1  ;pi/2 - r
   PANDN   MM6, MM4  ;z < 1 ? 0 : pi/2 - r
   PFMAX   MM1, MM6  ;res = z < 1 ? r : pi/2 - r
Example 5:

C code:
   #define PI 3.14159265358979323
   float x,y,xa,ya,r,res;
   int xs,df;
   xs = x < 0 ? 1 : 0;
   xa = fabs(x);
   ya = fabs(y);
   df = (xa < ya);
   if (xs && df) {
      res = PI/2 + r;
   }
   else if (xs) {
      res = PI - r;
   }
   else if (df) {
      res = PI/2 - r;
   }
   else {
      res = r;
   }

3DNow! code:
   ;in:  MM0 = r
   ;     MM1 = y
   ;     MM2 = x
   ;out: MM0 = res

   MOVQ    MM7, sgn   ;mask to extract sign bit
   MOVQ    MM6, sgn   ;mask to extract sign bit
   MOVQ    MM5, mabs  ;mask to clear sign bit
   PAND    MM7, MM2   ;xs = sign(x)
   PAND    MM1, MM5   ;ya = abs(y)
   PAND    MM2, MM5   ;xa = abs(x)
   MOVQ    MM6, MM1   ;y
   PCMPGTD MM6, MM2   ;df = (xa < ya) ? 0xffffffff : 0
   PSLLD   MM6, 31    ;df = bit<31>
   MOVQ    MM5, MM7   ;xs
   PXOR    MM7, MM6   ;xs^df ? 0x80000000 : 0
   MOVQ    MM3, npio2 ;-pi/2
   PXOR    MM5, MM3   ;xs ? pi/2 : -pi/2
   PSRAD   MM6, 31    ;df ? 0xffffffff : 0
   PANDN   MM6, MM5   ;xs ? (df ? 0 : pi/2) : (df ? 0 : -pi/2)
   PFSUB   MM6, MM3   ;pr = pi/2 + (xs ? (df ? 0 : pi/2) :
                      ;             (df ? 0 : -pi/2))
   POR     MM0, MM7   ;ar = xs^df ? -r : r
   PFADD   MM0, MM6   ;res = ar + pr
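The quadrant-fixup logic of example 5 can be replayed in C on float bit patterns and checked against the branchy source. This is a sketch: helper names are invented for illustration, and r is assumed non-negative, as the 0 <= r <= PI/4 examples imply:

```c
#include <stdint.h>
#include <string.h>

#define PI_F 3.14159265358979323f

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Branchless version of example 5; assumes r >= 0 and an arithmetic
 * right shift on int32_t (as PSRAD makes explicit). */
static float quad_fix(float x, float y, float r)
{
    uint32_t xs  = f2u(x) & 0x80000000u;          /* sign(x) */
    uint32_t ya  = f2u(y) & 0x7FFFFFFFu;          /* abs(y) bits */
    uint32_t xa  = f2u(x) & 0x7FFFFFFFu;          /* abs(x) bits */
    uint32_t df  = (ya > xa) ? 0x80000000u : 0u;  /* df in bit 31 */
    uint32_t sel = xs ^ df;                       /* flip r's sign? */
    uint32_t pm  = f2u(-PI_F / 2) ^ xs;           /* xs ? pi/2 : -pi/2 */
    uint32_t dfm = (uint32_t)((int32_t)df >> 31); /* df ? all ones : 0 */
    float pr = u2f(~dfm & pm) - (-PI_F / 2);      /* PANDN + PFSUB */
    float ar = u2f(f2u(r) | sel);                 /* POR: +r or -r */
    return ar + pr;                               /* PFADD */
}

/* Branchy reference, straight from the C source. */
static float quad_ref(float x, float y, float r)
{
    int xs = x < 0, df = (x < 0 ? -x : x) < (y < 0 ? -y : y);
    if (xs && df) return PI_F / 2 + r;
    if (xs)       return PI_F - r;
    if (df)       return PI_F / 2 - r;
    return r;
}
```

Comparing the bit patterns of non-negative floats with an unsigned integer compare is valid because IEEE-754 ordering matches integer ordering there, which is also why PCMPGTD works on the absolute values in the MMX code.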
Avoid the Loop Instruction

The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:

Example 1 (Avoid):
   LOOP LABEL

Example 2 (Preferred):
   DEC ECX
   JNZ LABEL
Avoid Far Control Transfer Instructions

Avoid using far control transfer instructions. Far control transfer branches cannot be predicted by the branch target buffer (BTB).
Avoid Recursive Functions

Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive function is one in which the function's call to itself comes at the end of its body.

Example 1 (Avoid):
   long fac(long a)
   {
      if (a == 0) {
         return (1);
      } else {
         return (a * fac(a - 1));
      }
   }
Example 2 (Preferred):
   long fac(long a)
   {
      long t = 1;
      while (a > 0) {
         t *= a;
         a--;
      }
      return (t);
   }
7   Scheduling Optimizations

This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance.
Schedule Instructions According to their Latency

The AMD Athlon processor can execute up to three x86 instructions per cycle, with each x86 instruction possibly having a different latency. The AMD Athlon processor has flexible scheduling, but for absolute maximum performance, schedule instructions, especially FPU and 3DNow! instructions, according to their latency. Dependent instructions will then not have to wait on instructions with longer latencies. See "Instruction Dispatch and Execution Resources" on page 187 for a list of latency numbers.
Unrolling Loops  
Complete Loop Unrolling  
Make use of the large AMD Athlon processor 64-Kbyte  
instruction cache and unroll loops to get more parallelism and  
reduce loop overhead, even with branch prediction. Complete  
unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.

Unrolling very large code loops, however, can result in the inefficient use of the L1 instruction cache. Loops can be unrolled completely if all of the following conditions are true:

- The loop is in a frequently executed piece of code.
- The loop count is known at compile time.
- The loop body, once unrolled, is less than 100 instructions, which is approximately 400 bytes of code.
Partial Loop Unrolling

Partial loop unrolling can increase register pressure, which can make it inefficient due to the small number of registers in the x86 architecture. However, in certain situations, partial unrolling can be efficient due to the performance gains possible. Partial loop unrolling should be considered if the following conditions are met:

- Spare registers are available.
- The loop body is small, so that loop overhead is significant.
- The number of loop iterations is likely greater than 10.

Consider the following piece of C code:

   double a[MAX_LENGTH], b[MAX_LENGTH];
   for (i=0; i < MAX_LENGTH; i++) {
      a[i] = a[i] + b[i];
   }
Without loop unrolling, the code looks like the following:  
Without Loop Unrolling:
   MOV  ECX, MAX_LENGTH
   MOV  EAX, OFFSET A
   MOV  EBX, OFFSET B
$add_loop:
   FLD  QWORD PTR [EAX]
   FADD QWORD PTR [EBX]
   FSTP QWORD PTR [EAX]
   ADD  EAX, 8
   ADD  EBX, 8
   DEC  ECX
   JNZ  $add_loop
The loop consists of seven instructions. The AMD Athlon  
processor can decode/retire three instructions per cycle, so it  
cannot execute faster than three iterations in seven cycles, or  
3/7 floating-point adds per cycle. However, the pipelined  
floating-point adder allows one add every cycle. In the following  
code, the loop is partially unrolled by a factor of two, which  
creates potential endcases that must be handled outside the  
loop:  
With Partial Loop Unrolling:
   MOV  ECX, MAX_LENGTH
   MOV  EAX, offset A
   MOV  EBX, offset B
   SHR  ECX, 1
   JNC  $add_loop
   FLD  QWORD PTR [EAX]
   FADD QWORD PTR [EBX]
   FSTP QWORD PTR [EAX]
   ADD  EAX, 8
   ADD  EBX, 8
$add_loop:
   FLD  QWORD PTR [EAX]
   FADD QWORD PTR [EBX]
   FSTP QWORD PTR [EAX]
   FLD  QWORD PTR [EAX+8]
   FADD QWORD PTR [EBX+8]
   FSTP QWORD PTR [EAX+8]
   ADD  EAX, 16
   ADD  EBX, 16
   DEC  ECX
   JNZ  $add_loop
Now the loop consists of 10 instructions. Based on the  
decode/retire bandwidth of three OPs per cycle, this loop goes  
no faster than three iterations in 10 cycles, or 6/10  
floating-point adds per cycle, or 1.4 times as fast as the original  
loop.  
Deriving Loop Control For Partially Unrolled Loops
A frequently used loop construct is a counting loop. In a typical  
case, the loop count starts at some lower bound lo, increases by  
some fixed, positive increment inc for each iteration of the  
loop, and may not exceed some upper bound hi. The following  
example shows how to partially unroll such a loop by an  
unrolling factor of fac, and how to derive the loop control for  
the partially unrolled version of the loop.  
Example 1 (rolled loop):
   for (k = lo; k <= hi; k += inc) {
      x[k] = ...
   }

Example 2 (partially unrolled loop):
   for (k = lo; k <= (hi - (fac-1)*inc); k += fac*inc) {
      x[k] = ...
      x[k+inc] = ...
      ...
      x[k+(fac-1)*inc] = ...
   }
   /* handle end cases */
   for (k = k; k <= hi; k += inc) {
      x[k] = ...
   }
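The derived loop control can be sanity-checked in C with arbitrary bounds and an unrolling factor of 4; the order-sensitive hash (an illustrative device, not from the manual) verifies that both loops visit exactly the same sequence of k values:

```c
enum { LO = 3, HI = 50, INC = 2, FAC = 4 };

/* Order-sensitive hash over every k the rolled loop visits. */
static long rolled(void)
{
    long h = 0;
    for (int k = LO; k <= HI; k += INC)
        h = h * 31 + k;
    return h;
}

/* Same loop partially unrolled by FAC, with the derived bound
 * hi - (fac-1)*inc and a second loop for the end cases. */
static long unrolled(void)
{
    long h = 0;
    int k;
    for (k = LO; k <= HI - (FAC - 1) * INC; k += FAC * INC) {
        h = h * 31 + k;
        h = h * 31 + (k + INC);
        h = h * 31 + (k + 2 * INC);
        h = h * 31 + (k + 3 * INC);
    }
    for (; k <= HI; k += INC)   /* handle end cases */
        h = h * 31 + k;
    return h;
}
```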
Use Function Inlining

Overview

Make use of the AMD Athlon processor's large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead. Consider the cost of possible increased register usage, which can increase load/store instructions for register spilling.
Function inlining has the advantage of eliminating function call  
overhead and allowing better register allocation and  
instruction scheduling at the site of the function call. The  
disadvantage is decreasing code locality, which can increase  
execution time due to instruction cache misses. Therefore,  
function inlining is an optimization that has to be used  
judiciously.  
In general, due to its very large instruction cache, the  
AMD Athlon processor is less susceptible than other processors  
to the negative side effect of function inlining. Function call  
overhead on the AMD Athlon processor can be low because  
calls and returns are executed at high speed due to the use of  
prediction mechanisms. However, there is still overhead due to  
passing function arguments through memory, which creates  
STLF (store-to-load-forwarding) dependencies. Some compilers  
allow for a reduction of this overhead by allowing arguments to  
be passed in registers in one of their calling conventions, which  
has the drawback of constraining register allocation in the  
function and at the site of the function call.  
In general, function inlining works best if the compiler can  
utilize feedback from a profiler to identify the function call  
sites most frequently executed. If such data is not available, a  
reasonable heuristic is to concentrate on function calls inside  
loops. Functions that are directly recursive should not be  
considered candidates for inlining. However, if they are  
end-recursive, the compiler should convert them to an iterative  
equivalent to avoid potential overflow of the AMD Athlon  
processor return prediction mechanism (return stack) during  
deep recursion. For best results, a compiler should support  
function inlining across multiple source files. In addition, a  
compiler should provide inline templates for commonly used  
library functions, such as sin(), strcmp(), or memcpy().  
Always Inline Functions if Called from One Site  
A function should always be inlined if it can be established that  
it is called from just one site in the code. For the C language,  
determination of this characteristic is made easier if functions  
are explicitly declared static unless they require external  
linkage. This case occurs quite frequently, as functionality that  
could be concentrated in a single large function is split across  
multiple small functions for improved maintainability and  
readability.  
Always Inline Functions with Fewer than 25 Machine Instructions  
In addition, functions that create fewer than 25 machine  
instructions once inlined should always be inlined because it is  
likely that the function call overhead is close to or more than  
the time spent executing the function body. For large functions,  
the benefits of reduced function call overhead give
diminishing returns. Therefore, a function that results in the  
insertion of more than 500 machine instructions at the call site  
should probably not be inlined. Some larger functions might  
consist of multiple, relatively short paths that are negatively  
affected by function overhead. In such a case, it can be  
advantageous to inline larger functions. Profiling information is  
the best guide in determining whether to inline such large  
functions.  
Avoid Address Generation Interlocks  
Loads and stores are scheduled by the AMD Athlon processor to  
access the data cache in program order. Newer loads and stores  
with their addresses calculated can be blocked by older loads  
and stores whose addresses are not yet calculated; this is
known as an address generation interlock. Therefore, it is  
advantageous to schedule loads and stores that can calculate  
their addresses quickly, ahead of loads and stores that require  
the resolution of a long dependency chain in order to generate  
their addresses. Consider the following code examples.  
Example 1 (Avoid):
   ADD EBX, ECX                 ;inst 1
   MOV EAX, DWORD PTR [10h]     ;inst 2 (fast address calc.)
   MOV ECX, DWORD PTR [EAX+EBX] ;inst 3 (slow address calc.)
   MOV EDX, DWORD PTR [24h]     ;this load is stalled from
                                ; accessing data cache due
                                ; to long latency for
                                ; generating address for
                                ; inst 3

Example 2 (Preferred):
   ADD EBX, ECX                 ;inst 1
   MOV EAX, DWORD PTR [10h]     ;inst 2
   MOV EDX, DWORD PTR [24h]     ;place load above inst 3
                                ; to avoid address
                                ; generation interlock stall
   MOV ECX, DWORD PTR [EAX+EBX] ;inst 3
Use MOVZX and MOVSX

Use the MOVZX and MOVSX instructions to zero-extend and sign-extend byte-size and word-size operands to doubleword length. For example, typical code for zero extension creates a superset dependency when the zero-extended value is used, as in the following code:

Example 1 (Avoid):
   XOR EAX, EAX
   MOV AL, [MEM]

Example 2 (Preferred):
   MOVZX EAX, BYTE PTR [MEM]
Minimize Pointer Arithmetic in Loops  
Minimize pointer arithmetic in loops, especially if the loop  
body is small. In this case, the pointer arithmetic would cause  
significant overhead. Instead, take advantage of the complex  
addressing modes to utilize the loop counter to index into  
memory arrays. Using complex addressing modes does not have  
any negative impact on execution speed, but the reduced  
number of instructions preserves decode bandwidth.  
Example 1 (Avoid):
   int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
   for (i=0; i < MAXSIZE; i++) {
      c[i] = a[i] + b[i];
   }

   MOV ECX, MAXSIZE   ;initialize loop counter
   XOR ESI, ESI       ;initialize offset into array a
   XOR EDI, EDI       ;initialize offset into array b
   XOR EBX, EBX       ;initialize offset into array c
$add_loop:
   MOV EAX, [ESI + a] ;get element a
   MOV EDX, [EDI + b] ;get element b
   ADD EAX, EDX       ;a[i] + b[i]
   MOV [EBX + c], EAX ;write result to c
   ADD ESI, 4         ;increment offset into a
   ADD EDI, 4         ;increment offset into b
   ADD EBX, 4         ;increment offset into c
   DEC ECX            ;decrement loop count
   JNZ $add_loop      ;until loop count 0
Example 2 (Preferred):
   int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
   for (i=0; i < MAXSIZE; i++) {
      c[i] = a[i] + b[i];
   }

   MOV ECX, MAXSIZE-1   ;initialize loop counter
$add_loop:
   MOV EAX, [ECX*4 + a] ;get element a
   MOV EDX, [ECX*4 + b] ;get element b
   ADD EAX, EDX         ;a[i] + b[i]
   MOV [ECX*4 + c], EAX ;write result to c
   DEC ECX              ;decrement index
   JNS $add_loop        ;until index negative
Note that the code in example 2 traverses the arrays in a  
downward direction (i.e., from higher addresses to lower  
addresses), whereas the original code in example 1 traverses  
the arrays in an upward direction. Such a change in the  
direction of the traversal is possible if each loop iteration is  
completely independent of all other loop iterations, as is the  
case here.  
In code where the direction of the array traversal can't be
switched, it is still possible to minimize pointer arithmetic by  
appropriately biasing base addresses and using an index  
variable that starts with a negative value and reaches zero when  
the loop expires. Note that if the base addresses are held in  
registers (e.g., when the base addresses are passed as  
arguments of a function) biasing the base addresses requires  
additional instructions to perform the biasing at run time and a  
small amount of additional overhead is incurred. In the  
examples shown here the base addresses are used in the  
displacement portion of the address and biasing is  
accomplished at compile time by simply modifying the  
displacement.  
Example 3 (Preferred):
   int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
   for (i=0; i < MAXSIZE; i++) {
      c[i] = a[i] + b[i];
   }

   MOV ECX, (-MAXSIZE)              ;initialize index
$add_loop:
   MOV EAX, [ECX*4 + a + MAXSIZE*4] ;get a element
   MOV EDX, [ECX*4 + b + MAXSIZE*4] ;get b element
   ADD EAX, EDX                     ;a[i] + b[i]
   MOV [ECX*4 + c + MAXSIZE*4], EAX ;write result to c
   INC ECX                          ;increment index
   JNZ $add_loop                    ;until index==0
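The same bias-and-negative-index idea can be written directly in C (a sketch; the helper names are illustrative). The + MAXSIZE bias is folded into the base pointers once, just as example 3 folds it into the displacement, so the loop body needs only one index update and the termination test is a compare against zero:

```c
#define MAXSIZE 8

/* Index runs from -MAXSIZE up to 0; the bias lives in the base
 * pointers, and i != 0 doubles as the loop condition. */
static void add_arrays(const int *a, const int *b, int *c)
{
    const int *ab = a + MAXSIZE, *bb = b + MAXSIZE;
    int *cb = c + MAXSIZE;
    for (int i = -MAXSIZE; i != 0; i++)
        cb[i] = ab[i] + bb[i];
}

/* Returns 1 if add_arrays produces a[i] + b[i] for every element. */
static int check_add_arrays(void)
{
    int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE];
    for (int i = 0; i < MAXSIZE; i++) { a[i] = i; b[i] = 10 * i; }
    add_arrays(a, b, c);
    for (int i = 0; i < MAXSIZE; i++)
        if (c[i] != 11 * i) return 0;
    return 1;
}
```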
Push Memory Data Carefully  
Carefully choose the best method for pushing memory data. To  
reduce register pressure and code dependencies, follow  
example 2 below.  
Example 1 (Avoid):  
MOV EAX, [MEM]  
PUSH EAX  
Example 2 (Preferred):  
PUSH [MEM]  
8   Integer Optimizations

This chapter describes ways to improve integer performance through optimized programming techniques. The guidelines are listed in order of importance.
Replace Divides with Multiplies

Replace integer division by constants with multiplication by the reciprocal. Because the AMD Athlon processor has a very fast integer multiply (5-9 cycles signed, 4-8 cycles unsigned) and the integer division delivers only one bit of quotient per cycle (22-47 cycles signed, 17-41 cycles unsigned), the equivalent code is much faster. The user can follow the examples in this chapter that illustrate the use of integer division by constants, or access the executables in the opt_utilities directory of the AMD documentation CD-ROM (order# 21860) to find alternative code for dividing by a constant.
Multiplication by Reciprocal (Division) Utility

The derivation of the code emitted by the utilities can be found in the derivation sections referenced later in this chapter. All utilities were compiled for the Microsoft Windows® 95, Windows 98, and Windows NT® environments. All utilities are provided "as is" and are not supported by AMD.
Signed Division Utility

In the opt_utilities directory of the AMD documentation CD-ROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user enters a signed constant divisor. Type "sdiv > example.out" to output the code to a file.

Unsigned Division Utility

In the opt_utilities directory of the AMD documentation CD-ROM, run udiv.exe in a DOS window to find the fastest code for unsigned division by a constant. The utility displays the code after the user enters an unsigned constant divisor. Type "udiv > example.out" to output the code to a file.
Unsigned Division by Multiplication of Constant

Algorithm: Divisors 1 <= d < 2^31, Odd d

The following code shows an unsigned division using a constant value multiplier.

   ;In:  d = divisor, 1 <= d < 2^31, odd d
   ;Out: a = algorithm
   ;     m = multiplier
   ;     s = shift factor

   ;algorithm 0
   MOV EDX, dividend
   MOV EAX, m
   MUL EDX
   SHR EDX, s ;EDX=quotient

   ;algorithm 1
   MOV EDX, dividend
   MOV EAX, m
   MUL EDX
   ADD EAX, m
   ADC EDX, 0
   SHR EDX, s ;EDX=quotient
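Algorithm 0 is easy to replay in C. The pair m = 0CCCCCCCDh, s = 2 used below is the standard magic pair for d = 5 (shown as an example of the kind of output udiv.exe produces; the specific pair here is the well-known one, not taken from the utility). MUL's high half is modeled with a 64-bit product:

```c
#include <stdint.h>

/* Unsigned "algorithm 0" for d = 5: m = 0xCCCCCCCD, s = 2.
 * The 64-bit multiply models MUL leaving the high half in EDX. */
static uint32_t udiv5(uint32_t n)
{
    uint32_t hi = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 32); /* MUL: EDX */
    return hi >> 2;                                              /* SHR EDX, s */
}
```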
Derivation of a, m, s

The derivation for the algorithm (a), multiplier (m), and shift factor (s) is found in the section "Unsigned Derivation for Algorithm, Multiplier, and Shift Factor" later in this chapter.

Algorithm: Divisors 2^31 <= d < 2^32

For divisors 2^31 <= d < 2^32, the possible quotient values are either 0 or 1. This makes it easy to establish the quotient by simple comparison of the dividend and divisor. In cases where the dividend needs to be preserved, example 1 below is recommended.
Example 1:
   ;In:  EAX = dividend
   ;Out: EDX = quotient
   XOR EDX, EDX ;0
   CMP EAX, d   ;CF = (dividend < divisor) ? 1 : 0
   SBB EDX, -1  ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1

In cases where the dividend does not need to be preserved, the division can be accomplished without the use of an additional register, thus reducing register pressure. This is shown in example 2 below:

Example 2:
   ;In:  EDX = dividend
   ;Out: EAX = quotient
   CMP EDX, d  ;CF = (dividend < divisor) ? 1 : 0
   MOV EAX, 0  ;0
   SBB EAX, -1 ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
Simpler Code for Restricted Dividend

Integer division by a constant can be made faster if the range of the dividend is limited, which removes a shift associated with most divisors. For example, for a divide by 10 operation, use the following code if the dividend is less than 40000005h:

   MOV EAX, dividend
   MOV EDX, 01999999Ah
   MUL EDX
   MOV quotient, EDX
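The shift disappears because 01999999Ah is the smallest 32-bit integer no less than 2^32/10, so for small enough dividends the quotient lands directly in the high half of the product. A C check of this restricted-range divide (sketch; the function name is illustrative):

```c
#include <stdint.h>

/* Divide by 10 with no shift, valid for dividends below 0x40000005:
 * 0x1999999A = ceil(2^32 / 10), so the high half of the product
 * (EDX after MUL) is the quotient directly. */
static uint32_t udiv10_small(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x1999999Au) >> 32);
}
```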
Signed Division by Multiplication of Constant

Algorithm: Divisors 2 <= d < 2^31

These algorithms work if the divisor is positive. If the divisor is negative, use abs(d) instead of d, and append a "NEG EDX" to the code. The code makes use of the fact that n/-d = -(n/d).

   ;IN:  d = divisor, 2 <= d < 2^31
   ;OUT: a = algorithm
   ;     m = multiplier
   ;     s = shift count

   ;algorithm 0
   MOV EAX, m
   MOV EDX, dividend
   MOV ECX, EDX
   IMUL EDX
   SHR ECX, 31
   SAR EDX, s
   ADD EDX, ECX ;quotient in EDX
   ;algorithm 1
   MOV EAX, m
   MOV EDX, dividend
   MOV ECX, EDX
   IMUL EDX
   ADD EDX, ECX
   SHR ECX, 31
   SAR EDX, s
   ADD EDX, ECX ;quotient in EDX
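Signed "algorithm 0" can likewise be replayed in C. The pair m = 55555556h, s = 0 below is the standard magic pair for d = 3 (an example of the kind of output sdiv.exe produces; the specific pair is the well-known one, not taken from the utility). The final add of the dividend's sign bit corrects the quotient toward zero for negative dividends:

```c
#include <stdint.h>

/* Signed "algorithm 0" for d = 3: m = 0x55555556, s = 0.
 * The 64-bit multiply models IMUL leaving the high half in EDX;
 * SAR EDX, s is a no-op here because s = 0. */
static int32_t sdiv3(int32_t n)
{
    int32_t hi = (int32_t)(((int64_t)0x55555556 * n) >> 32); /* IMUL: EDX */
    return hi + (int32_t)((uint32_t)n >> 31);                /* SHR ECX,31 + ADD */
}
```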
Derivation for a, m, s

The derivation for the algorithm (a), multiplier (m), and shift count (s) is found in the section "Signed Derivation for Algorithm, Multiplier, and Shift Factor" later in this chapter.

Signed Division by 2

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CMP EAX, 80000000h ;CY = 1, if dividend >= 0
   SBB EAX, -1        ;Increment dividend if it is < 0
   SAR EAX, 1         ;Perform a right shift
Signed Division by 2^n

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CDQ              ;Sign extend into EDX
   AND EDX, (2^n-1) ;Mask correction (use divisor - 1)
   ADD EAX, EDX     ;Apply correction if necessary
   SAR EAX, (n)     ;Perform right shift by log2(divisor)
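In C, the CDQ/AND/ADD/SAR sequence becomes the following sketch. It assumes >> on a signed 32-bit integer is an arithmetic shift, which the SAR-based original makes explicit but which is implementation-defined in the C standard (true on mainstream compilers):

```c
#include <stdint.h>

/* Branchless truncating division by 2^n, the C analogue of
 * CDQ / AND / ADD / SAR. Assumes arithmetic right shift on int32_t. */
static int32_t sdiv_pow2(int32_t x, int n)
{
    int32_t corr = (x >> 31) & ((1 << n) - 1); /* CDQ + AND: 0 or 2^n - 1 */
    return (x + corr) >> n;                    /* ADD + SAR */
}
```

The correction is what turns the floor behavior of a bare arithmetic shift into the truncation toward zero that IDIV (and C division) produce for negative dividends.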
Signed Division by -2

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CMP EAX, 80000000h ;CY = 1, if dividend >= 0
   SBB EAX, -1        ;Increment dividend if it is < 0
   SAR EAX, 1         ;Perform right shift
   NEG EAX            ;Use (x/-2) == -(x/2)
Signed Division by -(2^n)

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CDQ              ;Sign extend into EDX
   AND EDX, (2^n-1) ;Mask correction (-divisor - 1)
   ADD EAX, EDX     ;Apply correction if necessary
   SAR EAX, (n)     ;Right shift by log2(-divisor)
   NEG EAX          ;Use (x/-(2^n)) == (-(x/2^n))
Remainder of Signed Integer Division by 2 or -2

   ;IN:  EAX = dividend
   ;OUT: EAX = remainder
   CDQ                  ;Sign extend into EDX
   AND EDX, 1           ;Compute remainder
   XOR EAX, EDX         ;Negate remainder if
   SUB EAX, EDX         ; dividend was < 0
   MOV [remainder], EAX
Remainder of Signed Integer Division by 2^n or -(2^n)

   ;IN:  EAX = dividend
   ;OUT: EAX = remainder
   CDQ                  ;Sign extend into EDX
   AND EDX, (2^n-1)     ;Mask correction (abs(divisor) - 1)
   ADD EAX, EDX         ;Apply pre-correction
   AND EAX, (2^n-1)     ;Mask out remainder (abs(divisor) - 1)
   SUB EAX, EDX         ;Apply pre-correction, if necessary
   MOV [remainder], EAX
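The remainder sequence in C (same arithmetic-shift assumption as above; the result carries the sign of the dividend, matching what IDIV and C's % operator produce for a power-of-two divisor of either sign):

```c
#include <stdint.h>

/* Remainder of truncating division by 2^n (or -(2^n); the remainder
 * is the same for both). C analogue of CDQ / AND / ADD / AND / SUB.
 * Assumes arithmetic right shift on int32_t. */
static int32_t srem_pow2(int32_t x, int n)
{
    int32_t m = (1 << n) - 1;
    int32_t corr = (x >> 31) & m;   /* pre-correction: 0 or 2^n - 1 */
    return ((x + corr) & m) - corr; /* mask out remainder, undo correction */
}
```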
Use Alternative Code When Multiplying by a Constant  
A 32-bit integer multiply by a constant has a latency of five  
cycles. Therefore, use alternative code when multiplying by  
certain constants. In addition, because there is just one  
multiply unit, the replacement code may provide better  
throughput.  
The following code samples are designed such that the original  
source also receives the final result. Other sequences are  
possible if the result is in a different register. Adds have been  
favored over shifts to keep code size small. Generally, there is a  
fast replacement if the constant has very few 1 bits in binary.  
More constants are found in the file multiply_by_constants.txt  
located in the same directory where this document is located in  
the SDK.  
by 2:  ADD REG1, REG1           ;1 cycle

by 3:  LEA REG1, [REG1*2+REG1]  ;2 cycles

by 4:  SHL REG1, 2              ;1 cycle

by 5:  LEA REG1, [REG1*4+REG1]  ;2 cycles

by 6:  LEA REG2, [REG1*4+REG1]  ;3 cycles
       ADD REG1, REG2

by 7:  MOV REG2, REG1           ;2 cycles
       SHL REG1, 3
       SUB REG1, REG2

by 8:  SHL REG1, 3              ;1 cycle

by 9:  LEA REG1, [REG1*8+REG1]  ;2 cycles

by 10: LEA REG2, [REG1*8+REG1]  ;3 cycles
       ADD REG1, REG2
by 11: LEA REG2, [REG1*8+REG1]  ;3 cycles
       ADD REG1, REG1
       ADD REG1, REG2

by 12: SHL REG1, 2              ;3 cycles
       LEA REG1, [REG1*2+REG1]

by 13: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 4
       SUB REG1, REG2

by 14: LEA REG2, [REG1*4+REG1]  ;3 cycles
       LEA REG1, [REG1*8+REG1]
       ADD REG1, REG2

by 15: MOV REG2, REG1           ;2 cycles
       SHL REG1, 4
       SUB REG1, REG2

by 16: SHL REG1, 4              ;1 cycle

by 17: MOV REG2, REG1           ;2 cycles
       SHL REG1, 4
       ADD REG1, REG2

by 18: ADD REG1, REG1           ;3 cycles
       LEA REG1, [REG1*8+REG1]

by 19: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2

by 20: SHL REG1, 2              ;3 cycles
       LEA REG1, [REG1*4+REG1]

by 21: LEA REG2, [REG1*4+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2

by 22: use IMUL

by 23: LEA REG2, [REG1*8+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 24: SHL REG1, 3              ;3 cycles
       LEA REG1, [REG1*2+REG1]

by 25: LEA REG2, [REG1*8+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2
by 26: use IMUL

by 27: LEA REG2, [REG1*4+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 28: MOV REG2, REG1           ;3 cycles
       SHL REG1, 3
       SUB REG1, REG2
       SHL REG1, 2

by 29: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 30: MOV REG2, REG1           ;3 cycles
       SHL REG1, 4
       SUB REG1, REG2
       ADD REG1, REG1

by 31: MOV REG2, REG1           ;2 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 32: SHL REG1, 5              ;1 cycle
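A few of the table's decompositions written out in C to show the arithmetic each sequence performs (shifts and adds only; the helper names are illustrative):

```c
#include <stdint.h>

/* Shift/add replacements from the table, spelled out in C. */
static uint32_t mul_by_6(uint32_t x)  { return (x * 4 + x) + x; }      /* LEA + ADD   */
static uint32_t mul_by_7(uint32_t x)  { return (x << 3) - x; }         /* SHL + SUB   */
static uint32_t mul_by_13(uint32_t x) { return (x << 4) - (x*2 + x); } /* LEA,SHL,SUB */
static uint32_t mul_by_28(uint32_t x) { return ((x << 3) - x) << 2; }  /* 7x, then *4 */
```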
Use MMX Instructions for Integer-Only Work

In many programs it can be advantageous to use MMX instructions to do integer-only work, especially if the function already uses 3DNow! or MMX code. Using MMX instructions relieves register pressure on the integer registers. As long as data is simply loaded/stored, added, shifted, etc., MMX instructions are good substitutes for integer instructions. Integer registers are freed up with the following results:

- The number of integer registers to be saved/restored on function entry/exit may be reduced.
- Integer registers are freed up for pointers, loop counters, etc., so that they do not have to be spilled to memory, which reduces memory traffic and latency in dependency chains.
Be careful with regard to passing data between MMX and integer registers and of creating mismatched store-to-load forwarding cases.
In addition, using MMX instructions increases the available  
parallelism. The AMD Athlon processor can issue three integer  
OPs and two MMX OPs per cycle.  
Repeated String Instruction Usage

Latency of Repeated String Instructions

Table 1 shows the latency for repeated string instructions on the AMD Athlon processor.

Table 1. Latency of Repeated String Instructions

   Instruction   ECX=0 (cycles)   DF = 0 (cycles)   DF = 1 (cycles)
   REP MOVS      11               15 + (4/3*c)      25 + (4/3*c)
   REP STOS      11               14 + (1*c)        24 + (1*c)
   REP LODS      11               15 + (2*c)        15 + (2*c)
   REP SCAS      11               15 + (5/2*c)      15 + (5/2*c)
   REP CMPS      11               16 + (10/3*c)     16 + (10/3*c)

   Note: c = value of ECX, (ECX > 0)
Table 1 lists the latencies with the direction flag (DF) = 0 (increment) and DF = 1 (decrement). These latencies also assume aligned memory operands. Note that for MOVS/STOS, when DF = 1 (DOWN), the overhead portion of the latency increases significantly; however, such cases are less common. To determine the latency, use the formula and round up to the nearest integer value; for example, REP MOVS with ECX = 100 and DF = 0 takes 15 + ceil((4/3)*100) = 15 + 134 = 149 cycles.
Guidelines for Repeated String Instructions

To help achieve good performance, this section contains guidelines for the careful scheduling of VectorPath repeated string instructions.

Use the Largest Possible Operand Size
Always move data using the largest operand size possible. For example, use REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and REP STOSW rather than REP STOSB.
Ensure DF=0 (UP)
Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, when source and destination overlap). While string instructions with DF = 1 (DOWN) are slower, only the overhead part of the cycle equation is larger, not the per-iteration part; see Table 1, "Latency of Repeated String Instructions," for the latency numbers.
Align Source and Destination with Operand Size
For REP MOVS, make sure that both source and destination are aligned with regard to the operand size. Handle the end case separately, if necessary. If either source or destination cannot be aligned, make the destination aligned and the source misaligned. For REP STOS, make the destination aligned.
Inline REP String with Low Counts
Expand REP string instructions into equivalent sequences of simple x86 instructions if the repeat count is constant and less than eight. Use an inline sequence of loads and stores to accomplish the move. Use a sequence of stores to emulate REP STOS. This technique eliminates the setup overhead of REP instructions and increases instruction throughput.
Use Loop for REP String with Low Variable Counts
If the repeat count is variable, but is likely less than eight, use a simple loop to move/store the data. This technique avoids the overhead of REP MOVS and REP STOS.
Using MOVQ and MOVNTQ for Block Copy/Fill
To fill or copy blocks of data that are larger than 512 bytes, or where the destination is in uncacheable memory, use the MMX™ instructions MOVQ/MOVNTQ instead of REP STOS and REP MOVS in order to achieve maximum performance. (See the related guideline on using MMX™ instructions for block copies and block fills.)
Use XOR Instruction to Clear Integer Registers  
To clear an integer register to all 0s, use XOR reg, reg. The  
AMD Athlon processor is able to avoid the false read  
dependency on the XOR instruction.  
Example 1 (Acceptable):
  MOV REG, 0

Example 2 (Preferred):
  XOR REG, REG
Efficient 64-Bit Integer Arithmetic  
This section contains a collection of code snippets and  
subroutines showing the efficient implementation of 64-bit  
arithmetic. Addition, subtraction, negation, and shifts are best  
handled by inline code. Multiplies, divides, and remainders are  
less common operations and should usually be implemented as  
subroutines. If these subroutines are used often, the  
programmer should consider inlining them. Except for division  
and remainder, the code presented works for both signed and  
unsigned integers. The division and remainder code shown  
works for unsigned integers, but can easily be extended to  
handle signed integers.  
Example 1 (Addition):
;add operand in ECX:EBX to operand EDX:EAX, result in
; EDX:EAX
  ADD EAX, EBX
  ADC EDX, ECX

Example 2 (Subtraction):
;subtract operand in ECX:EBX from operand EDX:EAX, result in
; EDX:EAX
  SUB EAX, EBX
  SBB EDX, ECX
Example 3 (Negation):
;negate operand in EDX:EAX
  NOT EDX
  NEG EAX
  SBB EDX, -1   ;fixup: increment hi-word if low-word was 0
Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (count
; applied modulo 64)
  SHLD  EDX, EAX, CL    ;first apply shift count
  SHL   EAX, CL         ; mod 32 to EDX:EAX
  TEST  ECX, 32         ;need to shift by another 32?
  JZ    $lshift_done    ;no, done
  MOV   EDX, EAX        ;left shift EDX:EAX
  XOR   EAX, EAX        ; by 32 bits
$lshift_done:

Example 5 (Right shift):
;shift operand in EDX:EAX right, shift count in ECX (count
; applied modulo 64)
  SHRD  EAX, EDX, CL    ;first apply shift count
  SHR   EDX, CL         ; mod 32 to EDX:EAX
  TEST  ECX, 32         ;need to shift by another 32?
  JZ    $rshift_done    ;no, done
  MOV   EAX, EDX        ;right shift EDX:EAX
  XOR   EDX, EDX        ; by 32 bits
$rshift_done:
Example 6 (Multiplication):

;_llmul computes the low-order half of the product of its
; arguments, two 64-bit integers
;
;INPUT:    [ESP+8]:[ESP+4]   multiplicand
;          [ESP+16]:[ESP+12] multiplier
;
;OUTPUT:   EDX:EAX  (multiplicand * multiplier) % 2^64
;
;DESTROYS: EAX,ECX,EDX,EFlags

_llmul PROC
  MOV  EDX, [ESP+8]     ;multiplicand_hi
  MOV  ECX, [ESP+16]    ;multiplier_hi
  OR   EDX, ECX         ;one operand >= 2^32?
  MOV  EDX, [ESP+12]    ;multiplier_lo
  MOV  EAX, [ESP+4]     ;multiplicand_lo
  JNZ  $twomul          ;yes, need two multiplies
  MUL  EDX              ;multiplicand_lo * multiplier_lo
  RET                   ;done, return to caller

$twomul:
  IMUL EDX, [ESP+8]     ;p3_lo = multiplicand_hi*multiplier_lo
  IMUL ECX, EAX         ;p2_lo = multiplier_hi*multiplicand_lo
  ADD  ECX, EDX         ; p2_lo + p3_lo
  MUL  DWORD PTR [ESP+12] ;p1 = multiplicand_lo*multiplier_lo
  ADD  EDX, ECX         ;p1 + p2_lo + p3_lo = result in EDX:EAX
  RET                   ;done, return to caller
_llmul ENDP
Example 7 (Division):

;_ulldiv divides two unsigned 64-bit integers, and returns
; the quotient.
;
;INPUT:    [ESP+8]:[ESP+4]   dividend
;          [ESP+16]:[ESP+12] divisor
;
;OUTPUT:   EDX:EAX  quotient of division
;
;DESTROYS: EAX,ECX,EDX,EFlags

_ulldiv PROC
  PUSH EBX              ;save EBX as per calling convention
  MOV  ECX, [ESP+20]    ;divisor_hi
  MOV  EBX, [ESP+16]    ;divisor_lo
  MOV  EDX, [ESP+12]    ;dividend_hi
  MOV  EAX, [ESP+8]     ;dividend_lo
  TEST ECX, ECX         ;divisor > 2^32-1?
  JNZ  $big_divisor     ;yes, divisor > 2^32-1
  CMP  EDX, EBX         ;only one division needed? (ECX = 0)
  JAE  $two_divs        ;need two divisions
  DIV  EBX              ;EAX = quotient_lo
  MOV  EDX, ECX         ;EDX = quotient_hi = 0 (quotient in EDX:EAX)
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$two_divs:
  MOV  ECX, EAX         ;save dividend_lo in ECX
  MOV  EAX, EDX         ;get dividend_hi
  XOR  EDX, EDX         ;zero extend it into EDX:EAX
  DIV  EBX              ;quotient_hi in EAX
  XCHG EAX, ECX         ;ECX = quotient_hi, EAX = dividend_lo
  DIV  EBX              ;EAX = quotient_lo
  MOV  EDX, ECX         ;EDX = quotient_hi (quotient in EDX:EAX)
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$big_divisor:
  PUSH EDI              ;save EDI as per calling convention
  MOV  EDI, ECX         ;save divisor_hi
  SHR  EDX, 1           ;shift both divisor and dividend right
  RCR  EAX, 1           ; by 1 bit
  ROR  EDI, 1
  RCR  EBX, 1
  BSR  ECX, ECX         ;ECX = number of remaining shifts
  SHRD EBX, EDI, CL     ;scale down divisor and dividend
  SHRD EAX, EDX, CL     ; such that divisor is
  SHR  EDX, CL          ; less than 2^32 (i.e., fits in EBX)
  ROL  EDI, 1           ;restore original divisor_hi
  DIV  EBX              ;compute quotient
  MOV  EBX, [ESP+12]    ;dividend_lo
  MOV  ECX, EAX         ;save quotient
  IMUL EDI, EAX         ;quotient * divisor hi-word (low only)
  MUL  DWORD PTR [ESP+20] ;quotient * divisor lo-word
  ADD  EDX, EDI         ;EDX:EAX = quotient * divisor
  SUB  EBX, EAX         ;dividend_lo - (quot.*divisor)_lo
  MOV  EAX, ECX         ;get quotient
  MOV  ECX, [ESP+16]    ;dividend_hi
  SBB  ECX, EDX         ;subtract divisor * quot. from dividend
  SBB  EAX, 0           ;adjust quotient if remainder negative
  XOR  EDX, EDX         ;clear hi-word of quot (EAX <= FFFFFFFFh)
  POP  EDI              ;restore EDI as per calling convention
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller
_ulldiv ENDP
Example 8 (Remainder):

;_ullrem divides two unsigned 64-bit integers, and returns
; the remainder.
;
;INPUT:    [ESP+8]:[ESP+4]   dividend
;          [ESP+16]:[ESP+12] divisor
;
;OUTPUT:   EDX:EAX  remainder of division
;
;DESTROYS: EAX,ECX,EDX,EFlags

_ullrem PROC
  PUSH EBX              ;save EBX as per calling convention
  MOV  ECX, [ESP+20]    ;divisor_hi
  MOV  EBX, [ESP+16]    ;divisor_lo
  MOV  EDX, [ESP+12]    ;dividend_hi
  MOV  EAX, [ESP+8]     ;dividend_lo
  TEST ECX, ECX         ;divisor > 2^32-1?
  JNZ  $r_big_divisor   ;yes, divisor > 2^32-1
  CMP  EDX, EBX         ;only one division needed? (ECX = 0)
  JAE  $r_two_divs      ;need two divisions
  DIV  EBX              ;EAX = quotient_lo
  MOV  EAX, EDX         ;EAX = remainder_lo
  MOV  EDX, ECX         ;EDX = remainder_hi = 0
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$r_two_divs:
  MOV  ECX, EAX         ;save dividend_lo in ECX
  MOV  EAX, EDX         ;get dividend_hi
  XOR  EDX, EDX         ;zero extend it into EDX:EAX
  DIV  EBX              ;EAX = quotient_hi, EDX = intermediate
                        ; remainder
  MOV  EAX, ECX         ;EAX = dividend_lo
  DIV  EBX              ;EAX = quotient_lo
  MOV  EAX, EDX         ;EAX = remainder_lo
  XOR  EDX, EDX         ;EDX = remainder_hi = 0
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$r_big_divisor:
  PUSH EDI              ;save EDI as per calling convention
  MOV  EDI, ECX         ;save divisor_hi
  SHR  EDX, 1           ;shift both divisor and dividend right
  RCR  EAX, 1           ; by 1 bit
  ROR  EDI, 1
  RCR  EBX, 1
  BSR  ECX, ECX         ;ECX = number of remaining shifts
  SHRD EBX, EDI, CL     ;scale down divisor and dividend such
  SHRD EAX, EDX, CL     ; that divisor is less than 2^32
  SHR  EDX, CL          ; (i.e., fits in EBX)
  ROL  EDI, 1           ;restore original divisor_hi
  DIV  EBX              ;compute quotient
  MOV  EBX, [ESP+12]    ;dividend lo-word
  MOV  ECX, EAX         ;save quotient
  IMUL EDI, EAX         ;quotient * divisor hi-word (low only)
  MUL  DWORD PTR [ESP+20] ;quotient * divisor lo-word
  ADD  EDX, EDI         ;EDX:EAX = quotient * divisor
  SUB  EBX, EAX         ;dividend_lo - (quot.*divisor)_lo
  MOV  ECX, [ESP+16]    ;dividend_hi
  MOV  EAX, [ESP+20]    ;divisor_lo
  SBB  ECX, EDX         ;subtract divisor * quot. from dividend
  SBB  EDX, EDX         ;(remainder < 0) ? 0xFFFFFFFF : 0
  AND  EAX, EDX         ;(remainder < 0) ? divisor_lo : 0
  AND  EDX, [ESP+24]    ;(remainder < 0) ? divisor_hi : 0
  ADD  EAX, EBX         ;remainder += (remainder < 0) ?
  ADC  EDX, ECX         ; divisor : 0
  POP  EDI              ;restore EDI as per calling convention
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller
_ullrem ENDP
Efficient Implementation of Population Count Function  
Population count is an operation that determines the number of  
set bits in a bit string. For example, this can be used to  
determine the cardinality of a set. The following example code  
shows how to efficiently implement a population count  
operation for 32-bit operands. The example is written for the  
inline assembler of Microsoft Visual C.  
Function popcount() implements a branchless computation of the population count. It is based on an O(log(n)) algorithm that successively groups the bits into groups of 2, 4, 8, 16, and 32, while maintaining a count of the set bits in each group. The algorithm consists of the following steps:
Step 1  
Partition the integer into groups of two bits. Compute the  
population count for each 2-bit group and store the result in the  
2-bit group. This calls for the following transformation to be  
performed for each 2-bit group:  
00b -> 00b  
01b -> 01b  
10b -> 01b  
11b -> 10b  
If the original value of a 2-bit group is v, then the new value will  
be v - (v >> 1). In order to handle all 2-bit groups simultaneously,  
it is necessary to mask appropriately to prevent spilling from  
one bit group to the next lower bit group. Thus:  
w = v - ((v >> 1) & 0x55555555)  
Step 2  
Add the population counts of adjacent 2-bit groups and store the
sum in the 4-bit group resulting from merging these adjacent
2-bit groups. To do this simultaneously to all groups, mask out  
the odd numbered groups, mask out the even numbered groups,  
and then add the odd numbered groups to the even numbered  
groups:  
x = (w & 0x33333333) + ((w >> 2) & 0x33333333)  
Each 4-bit field now has value 0000b, 0001b, 0010b, 0011b, or  
0100b.  
Step 3  
For the first time, the value in each k-bit field is small enough  
that adding two k-bit fields results in a value that still fits in the  
k-bit field. Thus the following computation is performed:  
y = (x + (x >> 4)) & 0x0F0F0F0F  
The result is four 8-bit fields whose lower half has the desired  
sum and whose upper half contains "junk" that has to be  
masked out. In a symbolic form:  
x      = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum    = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
The WWWW, XXXX, YYYY, and ZZZZ values are the  
interesting sums with each at most 1000b, or 8 decimal.  
Step 4  
The four 4-bit sums can now be rapidly accumulated by means  
of a multiply with a "magic" multiplier. This can be derived  
from looking at the following chart of partial products:  
0p0q0r0s * 01010101 =  
:0p0q0r0s  
0p:0q0r0s  
0p0q:0r0s  
0p0q0r:0s  
000pxxww:vvuutt0s  
Here p, q, r, and s are the 4-bit sums from the previous step, and  
vv is the final result in which we are interested. Thus, the final  
result:  
z = (y * 0x01010101) >> 24  
Example:

unsigned int popcount(unsigned int v)
{
   unsigned int retVal;
   __asm {
      MOV  EAX, [v]        ;v
      MOV  EDX, EAX        ;v
      SHR  EAX, 1          ;v >> 1
      AND  EAX, 055555555h ;(v >> 1) & 0x55555555
      SUB  EDX, EAX        ;w = v - ((v >> 1) & 0x55555555)
      MOV  EAX, EDX        ;w
      SHR  EDX, 2          ;w >> 2
      AND  EAX, 033333333h ;w & 0x33333333
      AND  EDX, 033333333h ;(w >> 2) & 0x33333333
      ADD  EAX, EDX        ;x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
      MOV  EDX, EAX        ;x
      SHR  EAX, 4          ;x >> 4
      ADD  EAX, EDX        ;x + (x >> 4)
      AND  EAX, 00F0F0F0Fh ;y = (x + (x >> 4)) & 0x0F0F0F0F
      IMUL EAX, 001010101h ;y * 0x01010101
      SHR  EAX, 24         ;population count = (y * 0x01010101) >> 24
      MOV  retVal, EAX     ;store result
   }
   return (retVal);
}
Derivation of Multiplier Used for Integer Division by Constants
Unsigned Derivation for Algorithm, Multiplier, and Shift Factor  
The utility udiv.exe was compiled using the code shown in this  
section.  
The following code derives the multiplier value used when performing integer division by constants. The code works for unsigned integer division and for odd divisors between 1 and 2^31 - 1, inclusive. For even divisors of the form d' = d * 2^n, with d odd, the multiplier is the same as for d and the shift factor is s + n.
/* Code snippet to determine algorithm (a), multiplier (m),  
and shift factor (s) to perform division on unsigned 32-bit  
integers by constant divisor. Code is written for the  
Microsoft Visual C compiler. */  
/*
 In:  d = divisor, 1 <= d < 2^31, d odd

 Out: a = algorithm
      m = multiplier
      s = shift factor

 ;algorithm 0
 MOV EDX, dividend
 MOV EAX, m
 MUL EDX
 SHR EDX, s      ;EDX=quotient

 ;algorithm 1
 MOV EDX, dividend
 MOV EAX, m
 MUL EDX
 ADD EAX, m
 ADC EDX, 0
 SHR EDX, s      ;EDX=quotient
*/
typedef unsigned __int64 U64;
typedef unsigned long    U32;
U32 d, l, s, m, a, r;  
U64 m_low, m_high, j, k;  
U32 log2 (U32 i)  
{
U32 t = 0;  
i = i >> 1;  
while (i) {  
i = i >> 1;  
t++;  
}
return (t);  
}
/* Generate m, s for algorithm 0. Based on: Granlund, T.;  
Montgomery, P.L.:"Division by Invariant Integers using  
Multiplication”. SIGPLAN Notices, Vol. 29, June 1994, page  
61. */  
l = log2(d) + 1;
j = (((U64)(0xffffffff)) % ((U64)(d)));
k = (((U64)(1)) << (32+l)) / ((U64)(0xffffffff - j));
m_low  = (((U64)(1)) << (32+l)) / d;
m_high = ((((U64)(1)) << (32+l)) + k) / d;
while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {  
m_low = m_low >> 1;  
m_high = m_high >> 1;  
   l = l - 1;
}
if ((m_high >> 32) == 0) {  
m = ((U32)(m_high));  
s = l;  
a = 0;  
}
/* Generate m, s for algorithm 1. Based on: Magenheimer,  
D.J.; et al: “Integer Multiplication and Division on the HP  
Precision Architecture”. IEEE Transactions on Computers, Vol  
37, No. 8, August 1988, page 980. */  
else {  
s = log2(d);  
m_low = (((U64)(1)) << (32+s)) / ((U64)(d));  
r
= ((U32)((((U64)(1)) << (32+s)) % ((U64)(d))));  
m = (r < ((d>>1)+1)) ? ((U32)(m_low)) : ((U32)(m_low))+1;  
a = 1;  
}
/* Reduce multiplier/shift factor for either algorithm to  
smallest possible */  
while (!(m&1)) {  
m = m >> 1;  
   s--;
}
Signed Derivation for Algorithm, Multiplier, and Shift Factor  
The utility sdiv.exe was compiled using the following code.  
/* Code snippet to determine algorithm (a), multiplier (m),  
and shift count (s) for 32-bit signed integer division,  
given divisor d. Written for Microsoft Visual C compiler. */  
/*  
IN: d = divisor, 2 <= d < 2^31  
OUT: a = algorithm  
m = multiplier  
s = shift count  
;algorithm 0  
MOV EAX, m  
MOV EDX, dividend  
MOV ECX, EDX  
IMUL EDX  
SHR ECX, 31  
SAR EDX, s  
ADD EDX, ECX    ;quotient in EDX
;algorithm 1  
MOV EAX, m  
MOV EDX, dividend  
MOV ECX, EDX  
IMUL EDX  
ADD EDX, ECX  
SHR ECX, 31  
SAR EDX, s  
ADD EDX, ECX    ;quotient in EDX
*/

typedef unsigned __int64 U64;
typedef unsigned long    U32;
U32 log2 (U32 i)  
{
U32 t = 0;  
i = i >> 1;  
while (i) {  
i = i >> 1;  
t++;  
}
return (t);  
}
U32 d, l, s, m, a;  
U64 m_low, m_high, j, k;  
/* Determine algorithm (a), multiplier (m), and shift count  
(s) for 32-bit signed integer division. Based on: Granlund,  
T.; Montgomery, P.L.: “Division by Invariant Integers using  
Multiplication”. SIGPLAN Notices, Vol. 29, June 1994, page  
61. */  
l = log2(d);
j = (((U64)(0x80000000)) % ((U64)(d)));
k = (((U64)(1)) << (32+l)) / ((U64)(0x80000000 - j));
m_low  = (((U64)(1)) << (32+l)) / d;
m_high = ((((U64)(1)) << (32+l)) + k) / d;
while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {  
m_low = m_low >> 1;  
m_high = m_high >> 1;  
   l = l - 1;
}
m = ((U32)(m_high));  
s = l;  
a = (m_high >> 31) ? 1 : 0;  
9   Floating-Point Optimizations

This chapter details the methods used to optimize floating-point code for the pipelined floating-point unit (FPU). Guidelines are listed in order of importance.
Ensure All FPU Data is Aligned

As discussed in the memory alignment guidelines on page 45, floating-point data should be naturally aligned: words on word boundaries, doublewords on doubleword boundaries, and quadwords on quadword boundaries. Misaligned memory accesses reduce the available memory bandwidth.
Use Multiplies Rather than Divides

If accuracy requirements allow, convert floating-point division by a constant into a multiply by the reciprocal. Divisors that are powers of two, and their reciprocals, are exactly representable, and therefore do not cause an accuracy issue, except in the rare case that the reciprocal overflows or underflows. Unless such an overflow or underflow occurs, a division by a power of two should always be converted to a multiply. Although the AMD Athlon™ processor has high-performance division, multiplies are significantly faster than divides.
Use FFREEP Macro to Pop One Register from the FPU Stack  
In FPU intensive code, frequently accessed data is often  
pre-loaded at the bottom of the FPU stack before processing  
floating-point data. After completion of processing, it is  
desirable to remove the pre-loaded data from the FPU stack as  
quickly as possible. The classical way to clean up the FPU stack  
is to use either of the following instructions:  
  FSTP   ST(0)   ;removes one register from stack
  FCOMPP         ;removes two registers from stack

On the AMD Athlon processor, a faster alternative is to use the FFREEP instruction:

  FFREEP ST(0)   ;removes one register from stack

Note that the FFREEP instruction, although insufficiently documented in the past, is supported by all 32-bit x86 processors. The opcode bytes for FFREEP ST(i) are DF C0+i.

FFREEP ST(i) works like FFREE ST(i) except that it increments the FPU top-of-stack after doing the FFREE work. In other words, FFREEP ST(i) marks ST(i) as empty, then increments the x87 stack pointer. On the AMD Athlon processor, the FFREEP instruction converts to an internal NOP, which can go down any pipe with no dependencies.

Many assemblers do not support the FFREEP instruction. In these cases, a simple text macro can be created to facilitate use of FFREEP ST(0):

  FFREEP_ST0  TEXTEQU  <DB 0DFh, 0C0h>
Floating-Point Compare Instructions  
For branches that are dependent on floating-point comparisons,  
use the following instructions:  
FCOMI  
FCOMIP  
FUCOMI  
FUCOMIP  
These instructions are much faster than the classical approach  
using FSTSW, because FSTSW is essentially a serializing  
instruction on the AMD Athlon processor. When FSTSW cannot  
be avoided (for example, backward compatibility of code with  
older processors), no FPU instruction should occur between an  
FCOM[P], FICOM[P], FUCOM[P], or FTST and a dependent  
FSTSW. This optimization allows the use of a fast forwarding  
mechanism for the FPU condition codes internal to the  
AMD Athlon processor FPU and increases performance.  
Use the FXCH Instruction Rather than FST/FLD Pairs  
Increase parallelism by breaking up dependency chains or by  
evaluating multiple dependency chains simultaneously by  
explicitly switching execution between them. Although the  
AMD Athlon processor FPU has a deep scheduler, which in  
most cases can extract sufficient parallelism from existing code,  
long dependency chains can stall the scheduler while issue slots  
are still available. The maximum dependency chain length that  
the scheduler can absorb is about six 4-cycle instructions.  
To switch execution between dependency chains, use of the  
FXCH instruction is recommended because it has an apparent  
latency of zero cycles and generates only one OP. The  
AMD Athlon processor FPU contains special hardware to  
handle up to three FXCH instructions per cycle. Using FXCH is  
preferred over the use of FST/FLD pairs, even if the FST/FLD  
pair works on a register. An FST/FLD pair adds two cycles of  
latency and consists of two OPs.  
Avoid Using Extended-Precision Data  
Store data as either single-precision or double-precision  
quantities. Loading and storing extended-precision data is  
comparatively slower.  
Minimize Floating-Point-to-Integer Conversions  
C++, C, and Fortran define floating-point-to-integer conversions  
as truncating. This creates a problem because the active  
rounding mode in an application is typically round-to-nearest-  
even. The classical way to do a double-to-int conversion  
therefore works as follows:  
Example 1 (Fast):
  FLD   QWORD PTR [X]           ;load double to be converted
  FSTCW [SAVE_CW]               ;save current FPU control word
  MOVZX EAX, WORD PTR [SAVE_CW] ;retrieve control word
  OR    EAX, 0C00h              ;rounding control field = truncate
  MOV   WORD PTR [NEW_CW], AX   ;new FPU control word
  FLDCW [NEW_CW]                ;load new FPU control word
  FISTP DWORD PTR [I]           ;do double->int conversion
  FLDCW [SAVE_CW]               ;restore original control word
The AMD Athlon processor contains special acceleration  
hardware to execute such code as quickly as possible. In most  
situations, the above code is therefore the fastest way to  
perform floating-point-to-integer conversion and the conversion  
is compliant both with programming language standards and  
the IEEE-754 standard.  
According to the recommendations for inlining (see the inlining guidelines on page 72), the above code should not be put into a separate subroutine (e.g., ftol). It should instead be inlined into the main code.
In some codes, floating-point numbers are converted to an  
integer and the result is immediately converted back to  
floating-point. In such cases, the FRNDINT instruction should  
be used for maximum performance instead of FISTP in the code  
above. FRNDINT delivers the integral result directly to an FPU  
register in floating-point form, which is faster than first using  
FISTP to store the integer result and then converting it back to  
floating-point with FILD.  
If there are multiple, consecutive floating-point-to-integer  
conversions, the cost of FLDCW operations should be  
minimized by saving the current FPU control word, forcing the  
FPU into truncating mode, and performing all of the  
conversions before restoring the original control word.  
The speed of the above code is somewhat dependent on the  
nature of the code surrounding it. For applications in which the  
speed of floating-point-to-integer conversions is extremely  
critical for application performance, experiment with either of  
the following substitutions, which may or may not be faster than  
the code above.  
The first substitution simulates a truncating floating-point to  
integer conversion provided that there are no NaNs, infinities,  
and overflows. This conversion is therefore not IEEE-754  
compliant. This code works properly only if the current FPU  
rounding mode is round-to-nearest-even, which is usually the  
case.  
Example 2 (Potentially faster):
  FLD   QWORD PTR [X]    ;load double to be converted
  FST   DWORD PTR [TX]   ;store X because sign(X) is needed
  FIST  DWORD PTR [I]    ;store rndint(X) as default result
  FISUB DWORD PTR [I]    ;compute DIFF = X - rndint(X)
  FSTP  DWORD PTR [DIFF] ;store DIFF as we need sign(DIFF)
  MOV   EAX, [TX]        ;X
  MOV   EDX, [DIFF]      ;DIFF
  TEST  EDX, EDX         ;DIFF == 0?
  JZ    $DONE            ;default result is OK, done
  XOR   EDX, EAX         ;need correction if sign(X) != sign(DIFF)
  SAR   EAX, 31          ;(X<0) ? 0xFFFFFFFF : 0
  SAR   EDX, 31          ;sign(X)!=sign(DIFF) ? 0xFFFFFFFF : 0
  LEA   EAX, [EAX+EAX+1] ;(X<0) ? 0xFFFFFFFF : 1
  AND   EDX, EAX         ;correction: -1, 0, 1
  SUB   [I], EDX         ;trunc(X) = rndint(X) - correction
$DONE:
The second substitution simulates a truncating floating-point to  
integer conversion using only integer instructions and therefore  
works correctly independent of the FPU's current rounding
mode. It does not handle NaNs, infinities, and overflows  
according to the IEEE-754 standard. Note that the first  
instruction of this code may cause an STLF size mismatch  
resulting in performance degradation if the variable to be  
converted has been stored recently.  
Example 3 (Potentially faster):
  MOV  ECX, DWORD PTR [X+4] ;get upper 32 bits of double
  XOR  EDX, EDX             ;i = 0
  MOV  EAX, ECX             ;save sign bit
  AND  ECX, 07FF00000h      ;isolate exponent field
  CMP  ECX, 03FF00000h      ;if abs(x) < 1.0
  JB   $DONE2               ; then i = 0
  MOV  EDX, DWORD PTR [X]   ;get lower 32 bits of double
  SHR  ECX, 20              ;extract exponent
  SHRD EDX, EAX, 21         ;extract mantissa
  NEG  ECX                  ;compute shift factor for extracting
  ADD  ECX, 1054            ; non-fractional mantissa bits
  OR   EDX, 080000000h      ;set integer bit of mantissa
  SAR  EAX, 31              ;x < 0 ? 0xffffffff : 0
  SHR  EDX, CL              ;i = trunc(abs(x))
  XOR  EDX, EAX             ;i = x < 0 ? ~i : i
  SUB  EDX, EAX             ;i = x < 0 ? -i : i
$DONE2:
  MOV  [I], EDX             ;store result
For applications which can tolerate a floating-point-to-integer  
conversion that is not compliant with existing programming  
language standards (but is IEEE-754 compliant), perform the  
conversion using the rounding mode that is currently in effect  
(usually round-to-nearest-even).  
Example 4 (Fastest):
  FLD   QWORD PTR [X]   ;get double to be converted
  FISTP DWORD PTR [I]   ;store integer result
Some compilers offer an option to use the code from Example 4 for floating-point-to-integer conversion, using the default rounding mode.

Lastly, consider setting the rounding mode throughout an application to truncate and using the code from Example 4 to
language standards and IEEE-754. This mode is also provided  
as an option by some compilers. Note that use of this technique  
also changes the rounding mode for all other FPU operations  
inside the application, which can lead to significant changes in  
numerical results and even program failure (for example, due to  
lack of convergence in iterative algorithms).  
Floating-Point Subexpression Elimination  
There are cases which do not require an FXCH instruction after  
every instruction to allow access to two new stack entries. In the  
cases where two instructions share a source operand, an FXCH  
is not required between the two instructions. When there is an  
opportunity for subexpression elimination, reduce the number  
of superfluous FXCH instructions by putting the shared source  
operand at the top of the stack. For example, using the function:  
func( (x*y), (x+z) )  
Example 1 (Avoid):
      FLD   X              ;st: x
      FLD   Y              ;st: y x
      FLD   Z              ;st: z y x
      FADD  ST, ST(2)      ;st: x+z y x
      FXCH  ST(1)          ;st: y x+z x
      FMUL  ST, ST(2)      ;st: x*y x+z x
      CALL  FUNC
      FSTP  ST(0)          ;pop leftover x
Example 2 (Preferred):
      FLD   Z              ;st: z
      FLD   Y              ;st: y z
      FLD   X              ;st: x y z
      FMUL  ST(1), ST      ;st: x x*y z
      FADDP ST(2), ST      ;st: x*y x+z
      CALL  FUNC
Check Argument Range of Trigonometric Instructions Efficiently
The transcendental instructions FSIN, FCOS, FPTAN, and  
FSINCOS are architecturally restricted in their argument  
range. Only arguments with a magnitude of <= 2^63 can be  
evaluated. If the argument is out of range, the C2 bit in the FPU  
status word is set, and the argument is returned as the result.  
Software needs to guard against such (extremely infrequent)  
cases.  
If an "argument out of range" condition is detected, a range-reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an
argument > 2^63 is unusual, it often indicates a problem  
elsewhere in the code and the code may completely fail in the  
absence of a properly guarded trigonometric instruction. For  
example, in the case of FSIN or FCOS generated from a sin() or  
cos() function invocation in the HLL, the downstream code  
might reasonably expect that the returned result is in the range  
[-1,1].  
A naive solution for guarding a trigonometric instruction may  
check the C2 bit in the FPU status word after each FSIN, FCOS,  
FPTAN, and FSINCOS instruction, and take appropriate action  
if it is set (indicating an argument out of range).  
Example 1 (Avoid):
      FLD   QWORD PTR [x]  ;argument
      FSIN                 ;compute sine
      FSTSW AX             ;store FPU status word to AX
      TEST  AX, 0400h      ;is the C2 bit set?
      JZ    $in_range      ;nope, argument was in range, all OK
      CALL  $reduce_range  ;reduce argument in ST(0) to < 2^63
      FSIN                 ;compute sine (in-range argument
                           ; guaranteed)
$in_range:
Such a solution is inefficient because the FSTSW instruction is serializing with respect to all x87/3DNow!/MMX instructions and should thus be avoided (see "Floating-Point Compare Instructions" on page 98). Use of FSTSW in the above fashion slows down the common path through the code. Instead, it is advisable to check the argument before one of the trigonometric instructions is invoked.
Example 2 (Preferred):
      FLD    QWORD PTR [x]             ;argument
      FLD    DWORD PTR [two_to_the_63] ;2^63
      FCOMIP ST, ST(1)                 ;argument <= 2^63 ?
      JAE    $in_range                 ;yes, it is in range
      CALL   $reduce_range             ;reduce argument in ST(0) to < 2^63
$in_range:
      FSIN                             ;compute sine (in-range argument
                                       ; guaranteed)
Since out-of-range arguments are extremely uncommon, the  
conditional branch will be perfectly predicted, and the other  
instructions used to guard the trigonometric instruction can  
execute in parallel with it.
Take Advantage of the FSINCOS Instruction  
Frequently, a piece of code that needs to compute the sine of an  
argument also needs to compute the cosine of that same  
argument. In such cases, the FSINCOS instruction should be  
used to compute both trigonometric functions concurrently,  
which is faster than using separate FSIN and FCOS instructions  
to accomplish the same task.  
Example 1 (Avoid):
      FLD    QWORD PTR [x]
      FLD    DWORD PTR [two_to_the_63]
      FCOMIP ST, ST(1)
      JAE    $in_range
      CALL   $reduce_range
$in_range:
      FLD    ST(0)
      FCOS
      FSTP   QWORD PTR [cosine_x]
      FSIN
      FSTP   QWORD PTR [sine_x]
Example 2 (Preferred):
      FLD    QWORD PTR [x]
      FLD    DWORD PTR [two_to_the_63]
      FCOMIP ST, ST(1)
      JAE    $in_range
      CALL   $reduce_range
$in_range:
      FSINCOS
      FSTP   QWORD PTR [cosine_x]
      FSTP   QWORD PTR [sine_x]
10  3DNow!™ and MMX™ Optimizations
This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be found in the 3DNow!™ Instruction Porting Guide, order# 22621.
Use 3DNow!™ Instructions

Unless accuracy requirements dictate otherwise, perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide for a flat register file instead of the stack-based approach of x87 instructions.

See the 3DNow!™ Technology Manual, order# 21928, for information on instruction usage.
Use FEMMS Instruction  
Though there is no penalty for switching between x87 FPU and  
3DNow!/MMX instructions in the AMD Athlon processor, the  
FEMMS instruction should be used to ensure the same code  
also runs optimally on AMD-K6® family processors. The  
FEMMS instruction is supported for backward compatibility  
with AMD-K6 family processors, and is aliased to the EMMS  
instruction.  
3DNow! and MMX instructions are designed to be used  
concurrently with no switching issues. Likewise, enhanced  
3DNow! instructions can be used simultaneously with MMX  
instructions. However, x87 and 3DNow! instructions share the  
same architectural registers so there is no easy way to use them  
concurrently without cleaning up the register file in between  
using FEMMS/EMMS.  
Use 3DNow!™ Instructions for Fast Division
3DNow! instructions can be used to compute a very fast, highly  
accurate reciprocal or quotient.  
Optimized 14-Bit Precision Divide  
This divide operation executes with a total latency of seven  
cycles, assuming that the program hides the latency of the first  
MOVD/MOVQ instructions within preceding code.  
Example:
      MOVD  MM0, [MEM]   ;   0 | W
      PFRCP MM0, MM0     ; 1/W | 1/W (approximate)
      MOVQ  MM2, [MEM]   ;   Y | X
      PFMUL MM2, MM0     ; Y/W | X/W
Optimized Full 24-Bit Precision Divide  
This divide operation executes with a total latency of 15 cycles,  
assuming that the program hides the latency of the first  
MOVD/MOVQ instructions within preceding code.  
Example:
      MOVD      MM0, [W]    ;   0 | W
      PFRCP     MM1, MM0    ; 1/W | 1/W (approximate)
      PUNPCKLDQ MM0, MM0    ;   W | W   (MMX instr.)
      PFRCPIT1  MM0, MM1    ; 1/W | 1/W (refine)
      MOVQ      MM2, [X_Y]  ;   Y | X
      PFRCPIT2  MM0, MM1    ; 1/W | 1/W (final)
      PFMUL     MM2, MM0    ; Y/W | X/W
Pipelined Pair of 24-Bit Precision Divides  
This divide operation executes with a total latency of 21 cycles,  
assuming that the program hides the latency of the first  
MOVD/MOVQ instructions within preceding code.  
Example:
      MOVQ      MM0, [DIVISORS]   ;   y | x
      PFRCP     MM1, MM0          ; 1/x | 1/x (approximate)
      MOVQ      MM2, MM0          ;   y | x
      PUNPCKHDQ MM0, MM0          ;   y | y
      PFRCP     MM0, MM0          ; 1/y | 1/y (approximate)
      PUNPCKLDQ MM1, MM0          ; 1/y | 1/x (approximate)
      MOVQ      MM0, [DIVIDENDS]  ;   z | w
      PFRCPIT1  MM2, MM1          ; 1/y | 1/x (intermediate)
      PFRCPIT2  MM2, MM1          ; 1/y | 1/x (final)
      PFMUL     MM0, MM2          ; z/y | w/x
Newton-Raphson Reciprocal  
Consider the quotient q = a/b. An (on-chip) ROM-based table  
lookup can be used to quickly produce a 14-to-15-bit precision  
approximation of 1/b using just one PFRCP instruction. A full  
24-bit precision reciprocal can then be quickly computed from  
this approximation using the Newton-Raphson algorithm.
The general Newton-Raphson recurrence for the reciprocal is as  
follows:  
Z(i+1) = Z(i) * (2 - b * Z(i))
Given that the initial approximation is accurate to at least 14  
bits, and that a full IEEE single-precision mantissa contains 24  
bits, just one Newton-Raphson iteration is required. The  
following sequence shows the 3DNow! instructions that produce  
the initial reciprocal approximation, compute the full precision  
reciprocal from the approximation, and finally, complete the  
desired divide of a/b.  
X0 = PFRCP(b)  
X1 = PFRCPIT1(b,X0)  
X2 = PFRCPIT2(X1,X0)  
q = PFMUL(a,X2)  
The 24-bit final reciprocal value is X2. In the AMD Athlon  
processor 3DNow! technology implementation the operand X2  
contains the correct round-to-nearest single precision  
reciprocal for approximately 99% of all arguments.  
Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root

3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root.

Optimized 15-Bit Precision Square Root

This square root operation can be executed in only seven cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four fewer cycles than the square root operation.
Example:
      MOVD      MM0, [MEM]  ;         0 | a
      PFRSQRT   MM1, MM0    ; 1/sqrt(a) | 1/sqrt(a) (approximate)
      PUNPCKLDQ MM0, MM0    ;         a | a         (MMX instr.)
      PFMUL     MM0, MM1    ;   sqrt(a) | sqrt(a)
Optimized 24-Bit Precision Square Root  
This square root operation can be executed in only 19 cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four fewer cycles than the square root operation.
Example:
      MOVD      MM0, [MEM]  ;         0 | a
      PFRSQRT   MM1, MM0    ; 1/sqrt(a) | 1/sqrt(a) (approx.)
      MOVQ      MM2, MM1    ; X_0 = 1/sqrt(a) (approx.)
      PFMUL     MM1, MM1    ; X_0*X_0 | X_0*X_0 (step 1)
      PUNPCKLDQ MM0, MM0    ;         a | a (MMX instr)
      PFRSQIT1  MM1, MM0    ; (intermediate) (step 2)
      PFRCPIT2  MM1, MM2    ; 1/sqrt(a) | 1/sqrt(a) (step 3)
      PFMUL     MM0, MM1    ;   sqrt(a) | sqrt(a)
Newton-Raphson Reciprocal Square Root  
The general Newton-Raphson reciprocal square root recurrence  
is:  
Z(i+1) = 1/2 * Z(i) * (3 - b * Z(i)^2)
To reduce the number of iterations, the initial approximation is read from a table. The 3DNow! reciprocal square root approximation is accurate to at least 15 bits. Accordingly, to
obtain a single-precision 24-bit reciprocal square root of an  
input operand b, one Newton-Raphson iteration is required,  
using the following sequence of 3DNow! instructions:  
X0 = PFRSQRT(b)  
X1 = PFMUL(X0,X0)  
X2 = PFRSQIT1(b,X1)  
X3 = PFRCPIT2(X2,X0)  
X4 = PFMUL(b,X3)  
The 24-bit final reciprocal square root value is X3. In the  
AMD Athlon processor 3DNow! implementation, the estimate  
contains the correct round-to-nearest value for approximately  
87% of all arguments. The remaining arguments differ from the  
correct round-to-nearest value by one unit-in-the-last-place. The  
square root (X4) is formed in the last step by multiplying by the  
input operand b.  
Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel

The MMX PMADDWD instruction can be used to perform two signed 16x16->32-bit multiplies in parallel, with much higher performance than can be achieved using the IMUL instruction. The PMADDWD instruction is designed to perform four 16x16->32-bit signed multiplies and accumulate the results pairwise. By making one of the results in a pair a zero, there are now just two multiplies. The following example shows how to multiply 16-bit signed numbers a,b,c,d into signed 32-bit products a*c and b*d:
Example:
      PXOR      MM2, MM2   ;   0 | 0
      MOVD      MM0, [ab]  ; 0 0 | b a
      MOVD      MM1, [cd]  ; 0 0 | d c
      PUNPCKLWD MM0, MM2   ; 0 b | 0 a
      PUNPCKLWD MM1, MM2   ; 0 d | 0 c
      PMADDWD   MM0, MM1   ; b*d | a*c
3DNow!™ and MMX™ Intra-Operand Swapping

AMD Athlon™ Specific Code

If swapping of MMX register halves is necessary, use the PSWAPD instruction, which is a new AMD Athlon 3DNow! DSP extension. Use of this instruction should only be for AMD Athlon specific code. "PSWAPD MMreg1, MMreg2" performs the following operation:

mmreg1[63:32] = mmreg2[31:0]
mmreg1[31:0]  = mmreg2[63:32]

See the AMD Extensions to the 3DNow!™ and MMX™ Instruction Set Manual, order #22466, for more usage information.
Blended Code  
Otherwise, for blended code, which needs to run well on  
AMD-K6 and AMD Athlon family processors, the following code  
is recommended:  
Example 1 (Preferred, faster):
      ;MM1 = SWAP(MM0), MM0 destroyed
      MOVQ      MM1, MM0  ;make a copy
      PUNPCKLDQ MM0, MM0  ;duplicate lower half
      PUNPCKHDQ MM1, MM0  ;combine lower halves
Example 2 (Preferred, fast):
      ;MM1 = SWAP(MM0), MM0 preserved
      MOVQ      MM1, MM0  ;make a copy
      PUNPCKHDQ MM1, MM1  ;duplicate upper half
      PUNPCKLDQ MM1, MM0  ;combine upper halves
Both examples accomplish the swapping, but the first example should be used if the original contents of the register do not need to be preserved. The first example is faster because the MOVQ and PUNPCKLDQ instructions can execute in parallel. The instructions in the second example depend on one another and take longer to execute.
Fast Conversion of Signed Words to Floating-Point  
In many applications there is a need to quickly convert data  
consisting of packed 16-bit signed integers into floating-point  
numbers. The following two examples show how this can be  
accomplished efficiently on AMD processors.  
The first example shows how to do the conversion on a processor that supports AMD's 3DNow! extensions, such as the AMD Athlon processor. It demonstrates the increased efficiency from using the PI2FW instruction. Use of this instruction should only be for AMD Athlon processor specific code. See the AMD Extensions to the 3DNow!™ and MMX™ Instruction Set Manual, order #22466, for more information on this instruction.
The second example demonstrates how to accomplish the same  
task in blended code that achieves good performance on the  
AMD Athlon processor as well as on the AMD-K6 family  
processors that support 3DNow! technology.  
Example 1 (AMD Athlon specific code using 3DNow! DSP extension):
      MOVD      MM0, [packed_sword]  ;0 0 | b a
      PUNPCKLWD MM0, MM0             ;b b | a a
      PI2FW     MM0, MM0             ;xb=float(b) | xa=float(a)
      MOVQ      [packed_float], MM0  ;store xb | xa
Example 2 (AMD-K6 Family and AMD Athlon processor blended code):
      MOVD      MM1, [packed_sword]  ;0 0 | b a
      PXOR      MM0, MM0             ;0 0 | 0 0
      PUNPCKLWD MM0, MM1             ;b 0 | a 0
      PSRAD     MM0, 16              ;sign extend: b | a
      PI2FD     MM0, MM0             ;xb=float(b) | xa=float(a)
      MOVQ      [packed_float], MM0  ;store xb | xa
Use MMX™ PXOR to Negate 3DNow!™ Data

For both the AMD Athlon and AMD-K6 processors, it is recommended that code use the MMX PXOR instruction to change the sign bit of 3DNow! operations instead of the 3DNow! PFMUL instruction. On the AMD Athlon processor, using PXOR allows for more parallelism, as it can execute in either the FADD or FMUL pipes. PXOR has an execution latency of two, but because it is an MMX instruction, there is an initial one
cycle bypassing penalty, and another one-cycle penalty if the result goes to a 3DNow! operation. The PFMUL execution latency is four; therefore, in the worst case, the PXOR and PFMUL instructions are the same in terms of latency. On the AMD-K6 processor, there is only a one-cycle latency for PXOR, versus a two-cycle latency for the 3DNow! PFMUL instruction.
Use the following code to negate 3DNow! data:
msgn  DQ 8000000080000000h
      PXOR MM0, [msgn]   ;toggle sign bit
Use MMX™ PCMP Instead of 3DNow!™ PFCMP

Use the MMX PCMP instruction instead of the 3DNow! PFCMP instruction. On the AMD Athlon processor, the PCMP has a latency of two cycles while the PFCMP has a latency of four cycles. In addition to the shorter latency, PCMP can be issued to either the FADD or the FMUL pipe, while PFCMP is restricted to the FADD pipe.

Note: The PFCMP instruction has a "greater or equal" (GE) version, PFCMPGE, that is missing from PCMP.
Both Numbers Positive: If both arguments are positive, PCMP always works.

One Negative, One Positive: If one number is negative and the other is positive, PCMP still works, except when one number is a positive zero and the other is a negative zero.

Both Numbers Negative: Be careful when performing integer comparison using PCMPGT on two negative 3DNow! numbers. The result is the inverse of the PFCMPGT floating-point comparison. For example:

-2.0 = C0000000h
-4.0 = C0800000h

PCMPGT gives C0800000h > C0000000h, but -4 < -2. To address this issue, simply reverse the comparison by swapping the source operands.
Use MMX™ Instructions for Block Copies and Block Fills
For moving or filling small blocks of data (e.g., less than 512  
bytes) between cacheable memory areas, the REP MOVS and  
REP STOS families of instructions deliver good performance  
and are straightforward to use. For moving and filling larger  
blocks of data, or to move/fill blocks of data where the  
destination is in non-cacheable space, it is recommended to  
make use of MMX instructions and MMX extensions. The  
following examples all use quadword-aligned blocks of data. In  
cases where memory blocks are not quadword aligned,  
additional code is required to handle end cases as needed.  
AMD-K6® and AMD Athlon™ Processor Blended Code

The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a large quadword-aligned block of data in the following situations:

- Blended code, i.e., code that needs to perform well on both AMD Athlon and AMD-K6 family processors
- AMD Athlon processor specific code where the destination is in cacheable memory and immediate re-use of the data at the destination is expected
- AMD-K6 family specific code where the destination is in non-cacheable memory
Example 1:
/* block copy (source and destination QWORD aligned) */
__asm {
      mov   eax, [src_ptr]
      mov   edx, [dst_ptr]
      mov   ecx, [blk_size]
      shr   ecx, 6
      align 16
$xfer:
      movq  mm0, [eax]
      add   edx, 64
      movq  mm1, [eax+8]
      add   eax, 64
      movq  mm2, [eax-48]
      movq  [edx-64], mm0
      movq  mm0, [eax-40]
      movq  [edx-56], mm1
      movq  mm1, [eax-32]
      movq  [edx-48], mm2
      movq  mm2, [eax-24]
      movq  [edx-40], mm0
      movq  mm0, [eax-16]
      movq  [edx-32], mm1
      movq  mm1, [eax-8]
      movq  [edx-24], mm2
      movq  [edx-16], mm0
      dec   ecx
      movq  [edx-8], mm1
      jnz   $xfer
      femms
}
/* block fill (destination QWORD aligned) */
__asm {
      mov   edx, [dst_ptr]
      mov   ecx, [blk_size]
      shr   ecx, 6
      movq  mm0, [fill_data]
      align 16
$fill:
      movq  [edx], mm0
      movq  [edx+8], mm0
      movq  [edx+16], mm0
      movq  [edx+24], mm0
      movq  [edx+32], mm0
      movq  [edx+40], mm0
      add   edx, 64
      movq  [edx-16], mm0
      dec   ecx
      movq  [edx-8], mm0
      jnz   $fill
      femms
}
AMD Athlon™ Processor Specific Code

The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a quadword-aligned block of data in the following situations:

- AMD Athlon processor specific code where the destination of the block copy is in non-cacheable memory space
- AMD Athlon processor specific code where the destination of the block copy is in cacheable space, but no immediate re-use of the data at the destination is expected
Example 2:
/* block copy (source and destination QWORD aligned) */
__asm {
      mov    eax, [src_ptr]
      mov    edx, [dst_ptr]
      mov    ecx, [blk_size]
      shr    ecx, 6
      align 16
$xfer_nc:
      prefetchnta [eax+256]
      movq   mm0, [eax]
      add    edx, 64
      movq   mm1, [eax+8]
      add    eax, 64
      movq   mm2, [eax-48]
      movntq [edx-64], mm0
      movq   mm0, [eax-40]
      movntq [edx-56], mm1
      movq   mm1, [eax-32]
      movntq [edx-48], mm2
      movq   mm2, [eax-24]
      movntq [edx-40], mm0
      movq   mm0, [eax-16]
      movntq [edx-32], mm1
      movq   mm1, [eax-8]
      movntq [edx-24], mm2
      movntq [edx-16], mm0
      dec    ecx
      movntq [edx-8], mm1
      jnz    $xfer_nc
      femms
      sfence
}
/* block fill (destination QWORD aligned) */
__asm {
      mov    edx, [dst_ptr]
      mov    ecx, [blk_size]
      shr    ecx, 6
      movq   mm0, [fill_data]
      align 16
$fill_nc:
      movntq [edx], mm0
      movntq [edx+8], mm0
      movntq [edx+16], mm0
      movntq [edx+24], mm0
      movntq [edx+32], mm0
      movntq [edx+40], mm0
      movntq [edx+48], mm0
      movntq [edx+56], mm0
      add    edx, 64
      dec    ecx
      jnz    $fill_nc
      femms
      sfence
}
Use MMX™ PXOR to Clear All Bits in an MMX™ Register
To clear all the bits in an MMX register to zero, use:  
PXOR MMreg, MMreg  
Note that PXOR MMreg, MMreg is dependent on previous writes to MMreg. Therefore, using PXOR in the manner described can lengthen dependency chains, which in turn may lead to reduced performance. An alternative in such cases
is to use:  
zero DD 0  
MOVD MMreg, DWORD PTR [zero]  
i.e., to load a zero from a statically initialized and properly  
aligned memory location. However, loading the data from  
memory runs the risk of cache misses. Cases where MOVD is  
superior to PXOR are therefore rare and PXOR should be used  
in general.  
Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
To set all the bits in an MMX register to one, use:  
PCMPEQD MMreg, MMreg  
Note that PCMPEQD MMreg, MMreg is dependent on previous writes to MMreg. Therefore, using PCMPEQD in the manner described can lengthen dependency chains, which in turn may lead to reduced performance. An alternative in such cases
is to use:  
ones DQ 0FFFFFFFFFFFFFFFFh  
MOVQ MMreg, QWORD PTR [ones]  
i.e., to load a quadword of 0xFFFFFFFFFFFFFFFF from a  
statically initialized and properly aligned memory location.  
However, loading the data from memory runs the risk of cache  
misses. Cases where MOVQ is superior to PCMPEQD are  
therefore rare and PCMPEQD should be used in general.  
Use MMX™ PAND to Find Absolute Value in 3DNow!™ Code
Use the following to compute the absolute value of 3DNow!  
floating-point operands:  
mabs  DQ 7FFFFFFF7FFFFFFFh
      PAND MM0, [mabs]   ;mask out sign bit
Optimized Matrix Multiplication  
The multiplication of a 4x4 matrix with a 4x1 vector is  
commonly used in 3D graphics for geometry transformation.  
This routine serves to translate, scale, rotate, and apply  
perspective to 3D coordinates represented in homogeneous  
coordinates. The following code sample is a 3DNow! optimized,  
general 3D vertex transformation routine that completes in 16  
cycles on the AMD Athlon processor:  
/* Function XForm performs a fully generalized 3D transform on an array  
of vertices pointed to by "v" and stores the transformed vertices in  
the location pointed to by "res". Each vertex consists of four floats.  
The 4x4 transform matrix is pointed to by "m". The matrix elements are  
also floats. The argument "numverts" indicates how many vertices have  
to be transformed. The computation performed for each vertex is:  
res->x = v->x*m[0][0] + v->y*m[1][0] + v->z*m[2][0] + v->w*m[3][0]  
res->y = v->x*m[0][1] + v->y*m[1][1] + v->z*m[2][1] + v->w*m[3][1]  
res->z = v->x*m[0][2] + v->y*m[1][2] + v->z*m[2][2] + v->w*m[3][2]  
res->w = v->x*m[0][3] + v->y*m[1][3] + v->z*m[2][3] + v->w*m[3][3]  
*/  
#define M00 0  
#define M01 4  
#define M02 8  
#define M03 12  
#define M10 16  
#define M11 20  
#define M12 24  
#define M13 28  
#define M20 32  
#define M21 36  
#define M22 40  
#define M23 44  
#define M30 48  
#define M31 52  
#define M32 56  
#define M33 60  
void XForm (float *res, const float *v, const float *m, int numverts)
{
  _asm {
      MOV   EDX, [V]          ;EDX = source vector ptr
      MOV   EAX, [M]          ;EAX = matrix ptr
      MOV   EBX, [RES]        ;EBX = destination vector ptr
      MOV   ECX, [NUMVERTS]   ;ECX = number of vertices to transform

      ;3DNow! version of fully general 3D vertex transformation.
      ;Optimal for AMD Athlon (completes in 16 cycles)
      FEMMS                   ;clear MMX state
      ALIGN 16                ;for optimal branch alignment
$$xform:
      ADD   EBX, 16           ;res++
      MOVQ  MM0, QWORD PTR [EDX]       ;v->y | v->x
      MOVQ  MM1, QWORD PTR [EDX+8]     ;v->w | v->z
      ADD   EDX, 16           ;v++
      MOVQ  MM2, MM0          ;v->y | v->x
      MOVQ  MM3, QWORD PTR [EAX+M00]   ;m[0][1] | m[0][0]
      PUNPCKLDQ MM0, MM0      ;v->x | v->x
      MOVQ  MM4, QWORD PTR [EAX+M10]   ;m[1][1] | m[1][0]
      PFMUL MM3, MM0          ;v->x*m[0][1] | v->x*m[0][0]
      PUNPCKHDQ MM2, MM2      ;v->y | v->y
      PFMUL MM4, MM2          ;v->y*m[1][1] | v->y*m[1][0]
      MOVQ  MM5, QWORD PTR [EAX+M02]   ;m[0][3] | m[0][2]
      MOVQ  MM7, QWORD PTR [EAX+M12]   ;m[1][3] | m[1][2]
      MOVQ  MM6, MM1          ;v->w | v->z
      PFMUL MM5, MM0          ;v->x*m[0][3] | v->x*m[0][2]
      MOVQ  MM0, QWORD PTR [EAX+M20]   ;m[2][1] | m[2][0]
      PUNPCKLDQ MM1, MM1      ;v->z | v->z
      PFMUL MM7, MM2          ;v->y*m[1][3] | v->y*m[1][2]
      MOVQ  MM2, QWORD PTR [EAX+M22]   ;m[2][3] | m[2][2]
      PFMUL MM0, MM1          ;v->z*m[2][1] | v->z*m[2][0]
      PFADD MM3, MM4          ;v->x*m[0][1]+v->y*m[1][1] |
                              ; v->x*m[0][0]+v->y*m[1][0]
      MOVQ  MM4, QWORD PTR [EAX+M30]   ;m[3][1] | m[3][0]
      PFMUL MM2, MM1          ;v->z*m[2][3] | v->z*m[2][2]
      PFADD MM5, MM7          ;v->x*m[0][3]+v->y*m[1][3] |
                              ; v->x*m[0][2]+v->y*m[1][2]
      MOVQ  MM1, QWORD PTR [EAX+M32]   ;m[3][3] | m[3][2]
      PUNPCKHDQ MM6, MM6      ;v->w | v->w
      PFADD MM3, MM0          ;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1] |
                              ; v->x*m[0][0]+v->y*m[1][0]+v->z*m[2][0]
      PFMUL MM4, MM6          ;v->w*m[3][1] | v->w*m[3][0]
      PFMUL MM1, MM6          ;v->w*m[3][3] | v->w*m[3][2]
      PFADD MM5, MM2          ;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3] |
                              ; v->x*m[0][2]+v->y*m[1][2]+v->z*m[2][2]
      PFADD MM3, MM4          ;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1]+
                              ; v->w*m[3][1] | v->x*m[0][0]+v->y*m[1][0]+
                              ; v->z*m[2][0]+v->w*m[3][0]
      MOVQ  [EBX-16], MM3     ;store res->y | res->x
      PFADD MM5, MM1          ;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3]+
                              ; v->w*m[3][3] | v->x*m[0][2]+v->y*m[1][2]+
                              ; v->z*m[2][2]+v->w*m[3][2]
      MOVQ  [EBX-8], MM5      ;store res->w | res->z
      DEC   ECX               ;numverts--
      JNZ   $$xform           ;until numverts == 0
      FEMMS                   ;clear MMX state
  }
}
Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
Clipping is one of the major activities occurring in a 3D  
graphics pipeline. In many instances, this activity is split into  
two parts which do not necessarily have to occur consecutively:  
Computation of the clip code for each vertex, where each  
bit of the clip code indicates whether the vertex is outside  
the frustum with regard to a specific clip plane.  
Examination of the clip code for a vertex and clipping if the  
clip code is non-zero.  
The following example shows how to use 3DNow! instructions to  
efficiently implement a clip code computation for a frustum  
that is defined by:  
-w <= x <= w  
-w <= y <= w  
-w <= z <= w  
.DATA
RIGHT   EQU 01h
LEFT    EQU 02h
ABOVE   EQU 04h
BELOW   EQU 08h
BEHIND  EQU 10h
BEFORE  EQU 20h

ALIGN 8
ABOVE_RIGHT    DD RIGHT
               DD ABOVE
BELOW_LEFT     DD LEFT
               DD BELOW
BEHIND_BEFORE  DD BEFORE
               DD BEHIND
.CODE
;; Generalized computation of 3D clip code (out code)
;;
;; Register usage:  IN:       MM5  y | x
;;                            MM6  w | z
;;                  OUT:      MM2  clip code (out code)
;;                  DESTROYS: MM0, MM1, MM2, MM3, MM4

      PXOR      MM0, MM0   ;  0 | 0
      MOVQ      MM1, MM6   ;  w | z
      MOVQ      MM4, MM5   ;  y | x
      PUNPCKHDQ MM1, MM1   ;  w | w
      MOVQ      MM3, MM6   ;  w | z
      MOVQ      MM2, MM5   ;  y | x
      PFSUBR    MM3, MM0   ; -w | -z
      PFSUBR    MM2, MM0   ; -y | -x
      PUNPCKLDQ MM3, MM6   ;  z | -z
      PFCMPGT   MM4, MM1   ; y>w?FFFFFFFF:0 | x>w?FFFFFFFF:0
      MOVQ      MM0, QWORD PTR [ABOVE_RIGHT]   ; ABOVE | RIGHT
      PFCMPGT   MM3, MM1   ; z>w?FFFFFFFF:0 | -z>w?FFFFFFFF:0
      PFCMPGT   MM2, MM1   ; -y>w?FFFFFFFF:0 | -x>w?FFFFFFFF:0
      MOVQ      MM1, QWORD PTR [BEHIND_BEFORE] ; BEHIND | BEFORE
      PAND      MM4, MM0   ; y > w ? ABOVE:0 | x > w ? RIGHT:0
      MOVQ      MM0, QWORD PTR [BELOW_LEFT]    ; BELOW | LEFT
      PAND      MM3, MM1   ; z > w ? BEHIND:0 | -z > w ? BEFORE:0
      PAND      MM2, MM0   ; -y > w ? BELOW:0 | -x > w ? LEFT:0
      POR       MM2, MM4   ; BELOW,ABOVE | LEFT,RIGHT
      POR       MM2, MM3   ; BELOW,ABOVE,BEHIND | LEFT,RIGHT,BEFORE
      MOVQ      MM1, MM2   ; BELOW,ABOVE,BEHIND | LEFT,RIGHT,BEFORE
      PUNPCKHDQ MM2, MM2   ; BELOW,ABOVE,BEHIND | BELOW,ABOVE,BEHIND
      POR       MM2, MM1   ; zclip, yclip, xclip = clip code
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
Use the 3DNow! PAVGUSB instruction for MPEG-2 motion
compensation. The PAVGUSB instruction produces the rounded
averages of the eight unsigned 8-bit integer values in the source
operand (an MMX register or a 64-bit memory location) and the
eight corresponding unsigned 8-bit integer values in the
destination operand (an MMX register). The PAVGUSB
instruction is extremely useful in DVD (MPEG-2) decoding,
where motion compensation performs a lot of byte averaging
between and within macroblocks. The PAVGUSB instruction
helps speed up these operations. In addition, PAVGUSB can
free up some registers and make unrolling the averaging loops
possible.
The following code fragment uses original MMX code to  
perform averaging between the source macroblock and  
destination macroblock:  
Example 1 (Avoid):  
    MOV   ESI, DWORD PTR Src_MB
    MOV   EDI, DWORD PTR Dst_MB
    MOV   EDX, DWORD PTR SrcStride
    MOV   EBX, DWORD PTR DstStride
    MOVQ  MM7, QWORD PTR [ConstFEFE]
    MOVQ  MM6, QWORD PTR [Const0101]
    MOV   ECX, 16
L1:
    MOVQ  MM0, [ESI]      ;MM0=QWORD1
    MOVQ  MM1, [EDI]      ;MM1=QWORD3
    MOVQ  MM2, MM0
    MOVQ  MM3, MM1
    PAND  MM2, MM6        ;calculate adjustment
    PAND  MM3, MM6
    PAND  MM0, MM7        ;MM0 = QWORD1 & 0xfefefefe
    PAND  MM1, MM7        ;MM1 = QWORD3 & 0xfefefefe
    POR   MM2, MM3
    PSRLQ MM0, 1          ;MM0 = (QWORD1 & 0xfefefefe)/2
    PSRLQ MM1, 1          ;MM1 = (QWORD3 & 0xfefefefe)/2
    PAND  MM2, MM6
    PADDB MM0, MM1        ;MM0 = QWORD1/2 + QWORD3/2 w/o adjustment
    PADDB MM0, MM2        ;add lsb adjustment
    MOVQ  [EDI], MM0
    MOVQ  MM4, [ESI+8]    ;MM4=QWORD2
    MOVQ  MM5, [EDI+8]    ;MM5=QWORD4
    MOVQ  MM2, MM4
    MOVQ  MM3, MM5
    PAND  MM2, MM6        ;calculate adjustment
    PAND  MM3, MM6
    PAND  MM4, MM7        ;MM4 = QWORD2 & 0xfefefefe
    PAND  MM5, MM7        ;MM5 = QWORD4 & 0xfefefefe
    POR   MM2, MM3
    PSRLQ MM4, 1          ;MM4 = (QWORD2 & 0xfefefefe)/2
    PSRLQ MM5, 1          ;MM5 = (QWORD4 & 0xfefefefe)/2
    PAND  MM2, MM6
    PADDB MM4, MM5        ;MM4 = QWORD2/2 + QWORD4/2 w/o adjustment
    PADDB MM4, MM2        ;add lsb adjustment
    MOVQ  [EDI+8], MM4
    ADD   ESI, EDX
    ADD   EDI, EBX
    LOOP  L1
The following code fragment uses the 3DNow! PAVGUSB  
instruction to perform averaging between the source  
macroblock and destination macroblock:  
Example 2 (Preferred):  
    MOV   EAX, DWORD PTR Src_MB
    MOV   EDI, DWORD PTR Dst_MB
    MOV   EDX, DWORD PTR SrcStride
    MOV   EBX, DWORD PTR DstStride
    MOV   ECX, 16
L1:
    MOVQ    MM0, [EAX]      ;MM0=QWORD1
    MOVQ    MM1, [EAX+8]    ;MM1=QWORD2
    PAVGUSB MM0, [EDI]      ;(QWORD1 + QWORD3)/2 with adjustment
    PAVGUSB MM1, [EDI+8]    ;(QWORD2 + QWORD4)/2 with adjustment
    MOVQ    [EDI], MM0
    MOVQ    [EDI+8], MM1
    ADD     EAX, EDX
    ADD     EDI, EBX
    LOOP    L1
Stream of Packed Unsigned Bytes  
The following code is an example of how to process a stream of  
packed unsigned bytes (like RGBA information) with faster  
3DNow! instructions.  
Example:  
outside loop:
    PXOR      MM0, MM0
inside loop:
    MOVD      MM1, [VAR]  ;             0 | v[3],v[2],v[1],v[0]
    PUNPCKLBW MM1, MM0    ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    MOVQ      MM2, MM1    ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    PUNPCKLWD MM1, MM0    ;   0,0,0,v[1] | 0,0,0,v[0]
    PUNPCKHWD MM2, MM0    ;   0,0,0,v[3] | 0,0,0,v[2]
    PI2FD     MM1, MM1    ;  float(v[1]) | float(v[0])
    PI2FD     MM2, MM2    ;  float(v[3]) | float(v[2])
Complex Number Arithmetic  
Complex numbers have a real part and an imaginary part.
Multiplying complex numbers (e.g., 3 + 4i) is an integral part of
many algorithms such as the Discrete Fourier Transform (DFT) and
complex FIR filters. Complex number multiplication is shown
below:

(src0.real + src0.imag*i) * (src1.real + src1.imag*i) = result
result = result.real + result.imag*i

result.real = src0.real*src1.real - src0.imag*src1.imag
result.imag = src0.real*src1.imag + src0.imag*src1.real
Example:
(1+2i) * (3+4i) = result.real + result.imag*i
result.real = 1*3 - 2*4 = -5
result.imag = 1*4 + 2*3 = 10
result = -5 + 10i
Assuming that complex numbers are represented as two-element
vectors [v.real, v.imag], one can see the need for swapping the
elements of src1 to perform the multiplies for result.imag, and
the need for a mixed positive/negative accumulation to complete
the parallel computation of result.real and result.imag.
PSWAPD performs the swapping of elements for src1 and  
PFPNACC performs the mixed positive/negative accumulation  
to complete the computation. The code example below  
summarizes the computation of a complex number multiply.  
Example:  
;MM0 = s0.imag | s0.real    ;reg_hi | reg_lo
;MM1 = s1.imag | s1.real
    PSWAPD  MM2, MM0    ;MM2 = s0.real | s0.imag
    PFMUL   MM0, MM1    ;MM0 = s0.imag*s1.imag | s0.real*s1.real
    PFMUL   MM1, MM2    ;MM1 = s0.real*s1.imag | s0.imag*s1.real
    PFPNACC MM0, MM1    ;MM0 = res.imag | res.real
PSWAPD supports independent source and result operands, which
enables PSWAPD to also perform a copy function. In the above
example, this eliminates the need for a separate MOVQ MM2, MM0
instruction.
11  General x86 Optimization Guidelines
This chapter describes general code optimization techniques  
specific to superscalar processors (that is, techniques common  
to the AMD-K6® processor, AMD Athlon™ processor, and
Pentium® family processors). In general, all optimization  
techniques used for the AMD-K6 processor, Pentium, and  
Pentium Pro processors either improve the performance of the  
AMD Athlon processor or are not required and have a neutral  
effect (usually due to fewer coding restrictions with the  
AMD Athlon processor).  
Short Forms  
Use shorter forms of instructions to increase the effective  
number of instructions that can be examined for decoding at  
any one time. Use 8-bit displacements and jump offsets where  
possible.  
Example 1 (Avoid):
    CMP   REG, 0

Example 2 (Preferred):
    TEST  REG, REG
Although both of these instructions have an execute latency of  
one, fewer opcode bytes need to be examined by the decoders  
for the TEST instruction.  
Dependencies  
Spread out true dependencies to increase the opportunities for  
parallel execution. Anti-dependencies and output  
dependencies do not impact performance.  
Register Operands  
Maintain frequently used values in registers rather than in  
memory. This technique avoids the comparatively long latencies  
for accessing memory.  
Stack Allocation  
When allocating space for local variables and/or outgoing  
parameters within a procedure, adjust the stack pointer and  
use moves rather than pushes. This method of allocation allows  
random access to the outgoing parameters so that they can be  
set up when they are calculated instead of being held  
somewhere else until the procedure call. In addition, this  
method reduces ESP dependencies and uses fewer execution  
resources.  
Appendix A  AMD Athlon™ Processor Microarchitecture
Introduction  
When discussing processor design, it is important to understand
the following terms: architecture, microarchitecture, and design
implementation. The term architecture refers to the instruction
set and features of a processor that are visible to software  
programs running on the processor. The architecture  
determines what software the processor can run. The  
architecture of the AMD Athlon processor is the  
industry-standard x86 instruction set.  
The term microarchitecture refers to the design techniques used  
in the processor to reach the target cost, performance, and  
functionality goals. The AMD Athlon processor  
microarchitecture is a decoupled decode/execution design  
approach. In other words, the decoders essentially operate  
independent of the execution units, and the execution core uses  
a small number of instructions and simplified circuit design for  
fast single-cycle execution and fast operating frequencies.  
The term design implementation refers to the actual logic and  
circuit designs from which the processor is created according to  
the microarchitecture specifications.  
AMD Athlon™ Processor Microarchitecture
The innovative AMD Athlon processor microarchitecture  
approach implements the x86 instruction set by processing  
simpler operations (OPs) instead of complex x86 instructions.  
These OPs are specially designed to include direct support for  
the x86 instructions while observing the high-performance  
principles of fixed-length encoding, regularized instruction  
fields, and a large register set. Instead of executing complex  
x86 instructions, which have lengths from 1 to 15 bytes, the  
AMD Athlon processor executes the simpler fixed-length OPs,  
while maintaining the instruction coding efficiencies found in  
x86 programs. The enhanced microarchitecture used in the  
AMD Athlon processor enables higher processor core  
performance and promotes straightforward extendibility for  
future designs.  
Superscalar Processor  
The AMD Athlon processor is an aggressive, out-of-order,  
three-way superscalar x86 processor. It can fetch, decode, and  
issue up to three x86 instructions per cycle with a centralized  
instruction control unit (ICU) and two independent instruction  
schedulers: an integer scheduler and a floating-point
scheduler. These two schedulers can simultaneously issue up to  
nine OPs to the three general-purpose integer execution units  
(IEUs), three address-generation units (AGUs), and three  
floating-point/3DNow!™/MMX™ execution units. The
AMD Athlon processor moves integer instructions down the integer
execution pipeline, which consists of the integer scheduler and  
the IEUs, as shown in Figure 1 on page 131. Floating-point  
instructions are handled by the floating-point execution  
pipeline, which consists of the floating-point scheduler and the  
x87/3DNow!/MMX execution units.  
Figure 1. AMD Athlon™ Processor Block Diagram
Instruction Cache  
The out-of-order execute engine of the AMD Athlon processor  
contains a very large 64-Kbyte L1 instruction cache. The L1  
instruction cache is organized as a 64-Kbyte, two-way,  
set-associative array. Each line in the instruction array is 64  
bytes long. Functions associated with the L1 instruction cache  
are instruction loads, instruction prefetching, instruction  
predecoding, and branch prediction. Requests that miss in the  
L1 instruction cache are fetched from the backside L2 cache or,  
subsequently, from the local memory using the bus interface  
unit (BIU).  
The instruction cache generates fetches on the naturally  
aligned 64 bytes containing the instructions and the next  
sequential line of 64 bytes (a prefetch). The principle of
program spatial locality makes this prefetching very effective
and avoids or reduces execution stalls caused by the time
spent reading the necessary instruction bytes. Cache-line
replacement is based on a least-recently used (LRU)  
replacement algorithm.  
The L1 instruction cache has an associated two-level translation  
look-aside buffer (TLB) structure. The first-level TLB is fully  
associative and contains 24 entries (16 that map 4-Kbyte pages  
and eight that map 2-Mbyte or 4-Mbyte pages). The second-level  
TLB is four-way set associative and contains 256 entries, which  
can map 4-Kbyte pages.  
Predecode  
Predecoding begins as the L1 instruction cache is filled.  
Predecode information is generated and stored alongside the  
instruction cache. This information is used to help efficiently  
identify the boundaries between variable length x86  
instructions, to distinguish DirectPath from VectorPath  
early-decode instructions, and to locate the opcode byte in each  
instruction. In addition, the predecode logic detects code  
branches such as CALLs, RETURNs and short unconditional  
JMPs. When a branch is detected, predecoding begins at the  
target of the branch.  
Branch Prediction  
The fetch logic accesses the branch prediction table in parallel  
with the instruction cache and uses the information stored in  
the branch prediction table to predict the direction of branch  
instructions.  
The AMD Athlon processor employs combinations of a branch  
target address buffer (BTB), a global history bimodal counter  
(GHBC) table, and return address stack (RAS) hardware in
order to predict and accelerate branches. Predicted-taken  
branches incur only a single-cycle delay to redirect the  
instruction fetcher to the target instruction. In the event of a  
mispredict, the minimum penalty is ten cycles.  
The BTB is a 2048-entry table that caches in each entry the  
predicted target address of a branch.  
In addition, the AMD Athlon processor implements a 12-entry  
return address stack to predict return addresses from a near or  
far call. As CALLs are fetched, the next EIP is pushed onto the  
return stack. Subsequent RETs pop a predicted return address  
off the top of the stack.  
Early Decoding  
The DirectPath and VectorPath decoders perform  
early-decoding of instructions into MacroOPs. A MacroOP is a  
fixed-length instruction that contains one or more OPs. The
outputs of the early decoders keep all (DirectPath or  
VectorPath) instructions in program order. Early decoding  
produces three MacroOPs per cycle from either path. The  
outputs of both decoders are multiplexed together and passed  
to the next stage in the pipeline, the instruction control unit.  
When the target 16-byte instruction window is obtained from  
the instruction cache, the predecode data is examined to  
determine which type of basic decode should occur —  
DirectPath or VectorPath.  
DirectPath Decoder  
DirectPath instructions can be decoded directly into a  
MacroOP, and subsequently into one or two OPs in the final  
issue stage. A DirectPath instruction is limited to those x86  
instructions that can be further decoded into one or two OPs.  
The length of an x86 instruction alone does not determine
whether it is DirectPath. A maximum of three DirectPath x86
instructions can occupy a given aligned 8-byte block. Sixteen
bytes are fetched at a time, so up to six DirectPath x86
instructions can be
passed into the DirectPath decode pipeline.  
VectorPath Decoder  
Uncommon x86 instructions requiring two or more MacroOPs  
proceed down the VectorPath pipeline. The sequence of  
MacroOPs is produced by an on-chip ROM known as the MROM.  
The VectorPath decoder can produce up to three MacroOPs per  
cycle. Decoding a VectorPath instruction may prevent the  
simultaneous decode of a DirectPath instruction.  
Instruction Control Unit  
The instruction control unit (ICU) is the control center for the  
AMD Athlon processor. The ICU controls the following  
resources: the centralized in-flight reorder buffer, the integer
scheduler, and the floating-point scheduler. In turn, the ICU is
responsible for the following functions: MacroOP dispatch,
MacroOP retirement, register and flag dependency resolution  
and renaming, execution resource management, interrupts,  
exceptions, and branch mispredictions.  
The ICU takes the three MacroOPs per cycle from the early  
decoders and places them in a centralized, fixed-issue reorder  
buffer. This buffer is organized into 24 lines of three MacroOPs  
each. The reorder buffer allows the ICU to track and monitor up  
to 72 in-flight MacroOPs (whether integer or floating-point) for  
maximum instruction throughput. The ICU can simultaneously  
dispatch multiple MacroOPs from the reorder buffer to both the  
integer and floating-point schedulers for final decode, issue,  
and execution as OPs. In addition, the ICU handles exceptions  
and manages the retirement of MacroOPs.  
Data Cache  
The L1 data cache contains two 64-bit ports. It is a  
write-allocate and writeback cache that uses an LRU  
replacement policy. The data cache and instruction cache are  
both two-way set-associative and 64 Kbytes in size. The data cache
is divided into eight banks, each 8 bytes wide. In addition, this
cache supports the MOESI (Modified, Owner, Exclusive,  
Shared, and Invalid) cache coherency protocol and data parity.  
The L1 data cache has an associated two-level TLB structure.  
The first-level TLB is fully associative and contains 32 entries  
(24 that map 4-Kbyte pages and eight that map 2-Mbyte or  
4-Mbyte pages). The second-level TLB is four-way set  
associative and contains 256 entries, which can map 4-Kbyte  
pages.  
Integer Scheduler  
The integer scheduler is based on a three-wide queuing system  
(also known as a reservation station) that feeds three integer  
execution positions or pipes. The reservation stations are six  
entries deep, for a total queuing system of 18 integer
MacroOPs. Each reservation station divides the MacroOPs into
integer and address-generation OPs, as required.
Integer Execution Unit  
The integer execution pipeline consists of three identical  
pipes: 0, 1, and 2. Each integer pipe consists of an integer
execution unit (IEU) and an address generation unit (AGU).  
The integer execution pipeline is organized to match the three  
MacroOP dispatch pipes in the ICU as shown in Figure 2 on  
page 135. MacroOPs are broken down into OPs in the  
schedulers. OPs issue when their operands are available either  
from the register file or result buses.  
OPs are executed when their operands are available. OPs from  
a single MacroOP can execute out-of-order. In addition, a  
particular integer pipe can be executing two OPs from different  
MacroOPs (one in the IEU and one in the AGU) at the same  
time.  
Figure 2. Integer Execution Pipeline
Each of the three IEUs is general purpose in that each
performs logic functions, arithmetic functions, conditional  
functions, divide step functions, status flag multiplexing, and  
branch resolutions. The AGUs calculate the logical addresses  
for loads, stores, and LEAs. A load and store unit reads and  
writes data to and from the L1 data cache. The integer  
scheduler sends a completion status to the ICU when the  
outstanding OPs for a given MacroOP are executed.  
All integer operations can be handled within any of the three  
IEUs with the exception of multiplies. Multiplies are handled  
by a pipelined multiplier that is attached to the pipeline at pipe  
0. See Figure 2 on page 135. Multiplies always issue to integer  
pipe 0, and the issue logic creates result bus bubbles for the
multiplier in integer pipes 0 and 1 by preventing non-multiply  
OPs from issuing at the appropriate time.  
Floating-Point Scheduler  
The AMD Athlon processor floating-point logic is a  
high-performance, fully-pipelined, superscalar, out-of-order  
execution unit. It is capable of accepting three MacroOPs of any  
mixture of x87 floating-point, 3DNow!, or MMX operations per
cycle.  
The floating-point scheduler handles register renaming and has  
a dedicated 36-entry scheduler buffer organized as 12 lines of  
three MacroOPs each. It also performs OP issue and
out-of-order execution. The floating-point scheduler  
communicates with the ICU to retire a MacroOP, to manage  
comparison results from the FCOMI instruction, and to back  
out results from a branch misprediction.  
Floating-Point Execution Unit  
The floating-point execution unit (FPU) is implemented as a  
coprocessor that has its own out-of-order control in addition to  
the data path. The FPU handles all register operations for x87  
instructions, all 3DNow! operations, and all MMX operations.  
The FPU consists of a stack renaming unit, a register renaming  
unit, a scheduler, a register file, and three parallel execution  
units. Figure 3 shows a block diagram of the dataflow through  
the FPU.  
Figure 3. Floating-Point Unit Block Diagram
As shown in Figure 3 on page 137, the floating-point logic uses  
three separate execution positions or pipes for superscalar x87,  
3DNow! and MMX operations. The first of the three pipes is  
generally known as the adder pipe (FADD), and it contains  
3DNow! add, MMX ALU/shifter, and floating-point add  
execution units. The second pipe is known as the multiplier  
(FMUL). It contains a 3DNow!/MMX multiplier/reciprocal unit,  
an MMX ALU and a floating-point multiplier/divider/square  
root unit. The third pipe is known as the floating-point  
load/store (FSTORE), which handles floating-point constant  
loads (FLDZ, FLDPI, etc.), stores, FILDs, as well as many OP  
primitives used in VectorPath sequences.  
Load-Store Unit (LSU)  
The load-store unit (LSU) manages data load and store accesses  
to the L1 data cache and, if required, to the backside L2 cache  
or system memory. The 44-entry LSU provides a data interface  
for both the integer scheduler and the floating-point scheduler.  
It consists of two queues: a 12-entry queue for L1 cache load
and store accesses and a 32-entry queue for L2 cache or system
memory load and store accesses. The 12-entry queue can
request a maximum of two L1 cache loads and two L1 cache
32-bit stores per cycle. The 32-entry queue effectively holds
requests that missed in the L1 cache probe by the 12-entry  
queue. Finally, the LSU ensures that the architectural load and  
store ordering rules are preserved (a requirement for x86  
architecture compatibility).  
Figure 4. Load/Store Unit
L2 Cache Controller  
The AMD Athlon processor contains a very flexible onboard L2  
controller. It uses an independent backside bus to access up to  
8-Mbytes of industry-standard SRAMs. There are full on-chip  
tags for a 512-Kbyte cache, while larger sizes use a partial tag  
system. In addition, there is a two-level data TLB structure. The  
first-level TLB is fully associative and contains 32 entries (24  
that map 4-Kbyte pages and eight that map 2-Mbyte or 4-Mbyte  
pages). The second-level TLB is four-way set associative and  
contains 256 entries, which can map 4-Kbyte pages.  
Write Combining
See page 155 for detailed information about write combining.
AMD Athlon™ System Bus
The AMD Athlon system bus is a high-speed bus that consists of  
a pair of unidirectional 13-bit address and control channels and  
a bidirectional 64-bit data bus. The AMD Athlon system bus  
supports low-voltage swing, multiprocessing, clock forwarding,  
and fast data transfers. The clock forwarding technique is used  
to deliver data on both edges of the reference clock, therefore  
doubling the transfer speed. A four-entry 64-byte write buffer is  
integrated into the BIU. The write buffer improves bus  
utilization by combining multiple writes into a single large  
write cycle. By using the AMD Athlon system bus, the  
AMD Athlon processor can transfer data on the 64-bit data bus
at 200 MHz, which yields an effective throughput of 1.6 Gbytes
per second.
Appendix B  Pipeline and Execution Unit Resources Overview
The AMD Athlon™ processor contains two independent
execution pipelines: one for integer operations and one for
floating-point operations. The integer pipeline manages x86
integer operations and the floating-point pipeline manages all
x87, 3DNow!™, and MMX™ instructions. This appendix
describes the operation and functionality of these pipelines.
Fetch and Decode Pipeline Stages  
Figure 5 and Figure 6 show the AMD Athlon processor instruction
fetch and decoding pipeline stages. The pipeline consists of one
cycle for instruction fetches and four cycles of instruction
alignment and decoding. The three ports in stage 5 provide a
maximum bandwidth of three MacroOPs per cycle for dispatching
to the instruction control unit (ICU).
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
The most common x86 instructions flow through the DirectPath  
pipeline stages and are decoded by hardware. The less common  
instructions, which require microcode assistance, flow through  
the VectorPath. Although the DirectPath decodes the common  
x86 instructions, it also contains VectorPath instruction data,  
which allows it to maintain dispatch order at the end of cycle 5.  
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages
Cycle 1: FETCH
The FETCH pipeline stage calculates the address of the next  
x86 instruction window to fetch from the processor caches or  
system memory.  
Cycle 2: SCAN
SCAN determines the start and end pointers of instructions.  
SCAN can send up to six aligned instructions (DirectPath and  
VectorPath) to ALIGN1 and only one VectorPath instruction to  
the microcode engine (MENG) per cycle.  
Cycle 3 (DirectPath): ALIGN1
Because each 8-byte buffer (quadword queue) can contain up to  
three instructions, ALIGN1 can buffer up to a maximum of nine  
instructions, or 24 instruction bytes. ALIGN1 tries to send three  
instructions from an 8-byte buffer to ALIGN2 per cycle.  
Cycle 3 (VectorPath): MECTL
For VectorPath instructions, the microcode engine control  
(MECTL) stage of the pipeline generates the microcode entry  
points.  
Cycle 4 (DirectPath): ALIGN2
ALIGN2 prioritizes prefix bytes, determines the opcode,  
ModR/M, and SIB bytes for each instruction and sends the  
accumulated prefix information to EDEC.  
Cycle 4 (VectorPath): MEROM
In the microcode engine ROM (MEROM) pipeline stage, the  
entry-point generated in the previous cycle, MECTL, is used to  
index into the MROM to obtain the microcode lines necessary  
to decode the instruction sent by SCAN.  
Cycle 5 (DirectPath): EDEC
The early decode (EDEC) stage decodes information from the  
DirectPath stage (ALIGN2) and VectorPath stage (MEROM)  
into MacroOPs. In addition, EDEC determines register  
pointers, flag updates, immediate values, displacements, and  
other information. EDEC then selects either MacroOPs from  
the DirectPath or MacroOPs from the VectorPath to send to the  
instruction decoder (IDEC) stage.  
Cycle 5 (VectorPath): MEDEC/MESEQ
The microcode engine decode (MEDEC) stage converts x86  
instructions into MacroOPs. The microcode engine sequencer  
(MESEQ) performs the sequence controls (redirects and  
exceptions) for the MENG.  
Cycle 6: IDEC/Rename
At the instruction decoder (IDEC)/rename stage, integer and  
floating-point MacroOPs diverge in the pipeline. Integer  
MacroOPs are scheduled for execution in the next cycle.  
Floating-point MacroOPs have their floating-point stack  
operands mapped to registers. Both integer and floating-point  
MacroOPs are placed into the ICU.  
Integer Pipeline Stages  
The integer execution pipeline consists of four or more stages  
for scheduling and execution and, if necessary, accessing data  
in the processor caches or system memory. There are three  
integer pipes associated with the three IEUs.  
Figure 7. Integer Execution Pipeline (stages 7–8: the Instruction Control Unit and register files feed MacroOPs to the 18-entry integer scheduler, which issues to the three IEU/AGU pairs, IEU0/AGU0, IEU1/AGU1, and IEU2/AGU2, and to the integer multiplier, IMUL)
Figure 7 and Figure 8 show the integer execution resources and  
the pipeline stages, which are described in the following  
sections.  
Figure 8. Integer Pipeline Stages (7 = SCHED, 8 = EXEC, 9 = ADDGEN, 10 = DC ACC, 11 = RESP)
Cycle 7 – SCHED
In the scheduler (SCHED) pipeline stage, the scheduler buffers  
can contain MacroOPs that are waiting for integer operands  
from the ICU or the IEU result bus. When all operands are  
received, SCHED schedules the MacroOP for execution and  
issues the OPs to the next stage, EXEC.  
Cycle 8 – EXEC
In the execution (EXEC) pipeline stage, the OP and its  
associated operands are processed by an integer pipe (either  
the IEU or the AGU). If addresses must be calculated to access  
data necessary to complete the operation, the OP proceeds to  
the next stages, ADDGEN and DCACC.  
Cycle 9 – ADDGEN
Cycle 10 – DCACC
In the address generation (ADDGEN) pipeline stage, the load  
or store OP calculates a linear address, which is sent to the data  
cache TLBs and caches.  
In the data cache access (DCACC) pipeline stage, the address  
generated in the previous pipeline stage is used to access the  
data cache arrays and TLBs. Any OP waiting in the scheduler  
for this data snarfs this data and proceeds to the EXEC stage  
(assuming all other operands were available).  
Cycle 11 – RESP
In the response (RESP) pipeline stage, the data cache returns  
hit/miss status and data for the request from DCACC.  
Floating-Point Pipeline Stages  
The floating-point unit (FPU) is implemented as a coprocessor  
that has its own out-of-order control in addition to the data  
path. The FPU handles all register operations for x87  
instructions, all 3DNow! operations, and all MMX operations.  
The FPU consists of a stack renaming unit, a register renaming  
unit, a scheduler, a register file, and three parallel execution  
units. Figure 9 shows a block diagram of the dataflow through  
the FPU.  
Figure 9. Floating-Point Unit Block Diagram (the Instruction Control Unit feeds the Stack Map at stage 7, Register Rename at stage 8, the 36-entry scheduler at stages 9–10, and the 88-entry FPU register file at stage 11; stages 12–15 hold the FADD pipe with MMX ALU and 3DNow!™ units, the FMUL pipe with MMX ALU, MMX multiplier, and 3DNow! units, and the FSTORE pipe)
The floating-point pipeline stages 7–15 are shown in Figure 10  
and described in the following sections. Note that the  
floating-point pipe and integer pipe separate at cycle 7.  
Figure 10. Floating-Point Pipeline Stages (7 = STKREN, 8 = REGREN, 9 = SCHEDW, 10 = SCHED, 11 = FREG, 12–15 = FEXE1–FEXE4)
Cycle 7 – STKREN
The stack rename (STKREN) pipeline stage in cycle 7 receives  
up to three MacroOPs from IDEC and maps stack-relative  
register tags to virtual register tags.  
Cycle 8 – REGREN
The register renaming (REGREN) pipeline stage in cycle 8 is  
responsible for register renaming. In this stage, virtual register  
tags are mapped into physical register tags. Likewise, each  
destination is assigned a new physical register. The MacroOPs  
are then sent to the 36-entry FPU scheduler.  
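The mapping performed across STKREN and REGREN can be illustrated with a small C model: sources read the newest virtual-to-physical mapping, and each destination is assigned a fresh register from the 88-entry file. This is only a sketch inferred from the text above; the function names and the bump-allocator free list are invented, not the hardware mechanism.

```c
#include <assert.h>

#define NUM_TAGS 8     /* virtual register tags (x87 stack positions) */
#define NUM_PHYS 88    /* the AMD Athlon FPU register file has 88 entries */

static int reg_map[NUM_TAGS];  /* current virtual -> physical mapping */
static int next_phys;          /* simplistic free list: a bump allocator */

void rename_init(void) {
    for (int i = 0; i < NUM_TAGS; i++)
        reg_map[i] = i;        /* identity mapping at reset */
    next_phys = NUM_TAGS;
}

/* Rename one MacroOP with one source and one destination tag.
 * The source reads the newest mapping; the destination is assigned
 * a new physical register, removing write-after-write hazards. */
void rename_op(int src_tag, int dst_tag, int *phys_src, int *phys_dst) {
    *phys_src = reg_map[src_tag];
    assert(next_phys < NUM_PHYS);  /* a real core would stall or recycle here */
    reg_map[dst_tag] = next_phys++;
    *phys_dst = reg_map[dst_tag];
}
```

A second write to the same tag gets a different physical register, while a later reader of that tag sees the most recent assignment.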
Cycle 9 – SCHEDW
Cycle 10 – SCHED
The scheduler write (SCHEDW) pipeline stage in cycle 9 can  
receive up to three MacroOPs per cycle.  
The schedule (SCHED) pipeline stage in cycle 10 schedules up  
to three MacroOPs per cycle from the 36-entry FPU scheduler  
to the FREG pipeline stage to read register operands.  
MacroOPs are sent when their operands and/or tags are  
obtained.  
Cycle 11 – FREG
The register file read (FREG) pipeline stage reads the  
floating-point register file for any register source operands of  
MacroOPs. The register file read is done before the MacroOPs  
are sent to the floating-point execution pipelines.  
Cycles 12–15 – Floating-Point Execution (FEXE1–FEXE4)
The FPU has three logical pipes: FADD, FMUL, and FSTORE.  
Each pipe may have several associated execution units. MMX  
execution is in both the FADD and FMUL pipes, with the  
exception of MMX instructions involving multiplies, which are  
limited to the FMUL pipe. The FMUL pipe has special support  
for long latency operations.  
DirectPath/VectorPath operations are dispatched to the FPU  
during cycle 6, but are not acted upon until they receive  
validation from the ICU in cycle 7.  
Execution Unit Resources  
Terminology  
The execution units operate with two types of register values—  
operands and results. There are three operand types and two  
result types, which are described in this section.  
Operands  
The three types of operands are as follows:  
Address register operands – Used for address calculations of load and store instructions  
Data register operands – Used for register instructions  
Store data register operands – Used for memory stores  
Results  
The two types of results are as follows:  
Data register results – Produced by load or register instructions  
Address register results – Produced by LEA or PUSH instructions  
Examples  
The following examples illustrate the operand and result  
definitions:  
ADD EAX, EBX  
The ADD instruction has two data register operands (EAX  
and EBX) and one data register result (EAX).  
MOV EBX, [ESP+4*ECX+8] ;Load  
The Load instruction has two address register operands  
(ESP and ECX as base and index registers, respectively)  
and a data register result (EBX).  
MOV [ESP+4*ECX+8], EAX ;Store  
The Store instruction has a data register operand (EAX)  
and two address register operands (ESP and ECX as base  
and index registers, respectively).  
LEA ESI, [ESP+4*ECX+8]  
The LEA instruction has address register operands (ESP  
and ECX as base and index registers, respectively), and an  
address register result (ESI).  
Integer Pipeline Operations  
Table 2 shows the categories of operations handled by the  
integer pipeline. Table 3 shows examples of the decode types.  
Table 2. Integer Pipeline Operation Types

    Category                                    Execution Unit
    Integer Memory Load or Store Operations     L/S
    Address Generation Operations               AGU
    Integer Execution Unit Operations           IEU
    Integer Multiply Operations                 IMUL

Table 3. Integer Decode Types

    x86 Instruction      Decode Type   OPs
    MOV CX, [SP+4]       DirectPath    AGU, L/S
    ADD AX, BX           DirectPath    IEU
    CMP CX, [AX]         VectorPath    AGU, L/S, IEU
    JZ  Addr             DirectPath    IEU
As shown in Table 3, the MOV instruction early decodes in the  
DirectPath decoder and requires two OPs: an address  
generation operation for the indirect address and a data load  
from memory into a register. The ADD instruction early  
decodes in the DirectPath decoder and requires a single OP  
that can be executed in one of the three IEUs. The CMP  
instruction early decodes in the VectorPath and requires three  
OPs: an address generation operation for the indirect address,  
a data load from memory, and a compare to CX using an IEU.  
The final JZ instruction is a simple operation that early decodes  
in the DirectPath decoder and requires a single OP. Not shown  
is a load-op-store instruction, which translates into only one  
MacroOP (one AGU OP, one IEU OP, and one L/S OP).  
Floating-Point Pipeline Operations  
Table 4 shows the category or type of operations handled by the  
floating-point execution units. Table 5 shows examples of the  
decode types.  
Table 4. Floating-Point Pipeline Operation Types

    Category                                            Execution Unit
    FPU/3DNow!/MMX Load/Store or Miscellaneous Ops      FSTORE
    FPU/3DNow!/MMX Multiply Operations                  FMUL
    FPU/3DNow!/MMX Arithmetic Operations                FADD

Table 5. Floating-Point Decode Types

    x86 Instruction    Decode Type   OPs
    FADD ST, ST(i)     DirectPath    FADD
    FSIN               VectorPath    various
    PFACC              DirectPath    FADD
    PFRSQRT            DirectPath    FMUL
As shown in Table 5, the FADD register-to-register instruction  
generates a single MacroOP targeted for the floating-point  
scheduler. FSIN is considered a VectorPath instruction because  
it is a complex instruction with long execution times, as  
compared to the more common floating-point instructions. The  
MMX PFACC instruction is DirectPath decodeable and  
generates a single MacroOP targeted for the arithmetic  
operation execution pipeline in the floating-point logic. Just  
like PFACC, a single MacroOP is early decoded for the 3DNow!  
PFRSQRT instruction, but it is targeted for the multiply  
operation execution pipeline.  
Load/Store Pipeline Operations  
The AMD Athlon processor decodes any instruction that  
references memory into primitive load/store operations. For  
example, consider the following code sample:  
MOV  AX, [EBX]     ;1 load MacroOP
PUSH EAX           ;1 store MacroOP
POP  EAX           ;1 load MacroOP
ADD  [EAX], EBX    ;1 load/store and 1 IEU MacroOPs
FSTP [EAX]         ;1 store MacroOP
MOVQ [EAX], MM0    ;1 store MacroOP
As shown in Table 6, the load/store unit (LSU) consists of a  
three-stage data cache lookup.  
Table 6. Load/Store Unit Stages

    Stage 1 (Cycle 8):   Address Calculation / LS1 Scan
    Stage 2 (Cycle 9):   Transport Address to Data Cache
    Stage 3 (Cycle 10):  Data Cache Access / LS2 Data Forward
Loads and stores are first dispatched in order into a 12-entry  
deep reservation queue called LS1. LS1 holds loads and stores  
that are waiting to enter the cache subsystem. Loads and stores  
are allocated into LS1 entries at dispatch time in program  
order, and are required by LS1 to probe the data cache in  
program order. The AGUs can calculate addresses out of  
program order, therefore, LS1 acts as an address reorder buffer.  
When a load or store is scanned out of the LS1 queue (Stage 1),  
it is deallocated from the LS1 queue and inserted into the data  
cache probe pipeline (Stage 2 and Stage 3). Up to two memory  
operations can be scheduled (scanned out of LS1) to access the  
data cache per cycle. The LSU can handle the following:  
Two 64-bit loads per cycle, or  
One 64-bit load and one 64-bit store per cycle, or  
Two 32-bit stores per cycle  
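The dual-issue constraint above can be captured in a few lines of C. This is a hypothetical checker for the three allowed pairings, not AMD code, and it assumes stores narrower than 32 bits follow the same rule as 32-bit stores.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MEM_LOAD, MEM_STORE } mem_kind;

typedef struct {
    mem_kind kind;
    int size_bytes;    /* access size: 1, 2, 4, or 8 */
} mem_op;

/* Can two memory operations be scanned out of LS1 in the same cycle?
 * Allowed: two 64-bit loads, one 64-bit load plus one 64-bit store,
 * or two 32-bit stores. */
bool can_dual_issue(mem_op a, mem_op b) {
    if (a.kind == MEM_STORE && b.kind == MEM_STORE)
        return a.size_bytes <= 4 && b.size_bytes <= 4;
    return true;   /* load+load and load+store pairs are always allowed */
}
```

For example, two MOVQ stores (8 bytes each) cannot probe the cache in the same cycle, while a MOVQ load paired with a MOVQ store can.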
Code Sample Analysis  
The samples in this section show the execution behavior of several series of instructions as  
a function of decode constraints, dependencies, and execution  
resource constraints.  
The sample tables show the x86 instructions, the decode pipe in  
the integer execution pipeline, the decode type, the clock  
counts, and a description of the events occurring within the  
processor. The decode pipe gives the specific IEU used (see  
Figure 7 on page 144). The decode type specifies either  
VectorPath (VP) or DirectPath (DP).  
The following nomenclature is used to describe the current  
location of a particular operation:  
D – Dispatch stage (allocate in ICU, reservation stations, load/store (LS1) queue)  
I – Issue stage (schedule operation for AGU or FU execution)  
E – Integer Execution Unit (IEU number corresponds to decode pipe)  
& – Address Generation Unit (AGU number corresponds to decode pipe)  
M – Multiplier execution  
S – Load/Store pipe stage 1 (schedule operation for load/store pipe)  
A – Load/Store pipe stage 2 (first stage of data cache/LS2 buffer access)  
$ – Load/Store pipe stage 3 (second stage of data cache/LS2 buffer access)  
Note: Instructions execute more efficiently (that is, without  
delays) when scheduled apart by suitable distances based on  
dependencies. In general, the samples in this section show  
poorly scheduled code in order to illustrate the resultant  
effects.  
Table 7. Sample 1 – Integer Register Operations

    Num   Instruction         Decode Pipe   Decode Type
    1     IMUL EAX, ECX       0             VP
    2     INC  ESI            0             DP
    3     MOV  EDI, 0x07F4    1             DP
    4     ADD  EDI, EBX       2             DP
    5     SHL  EAX, 8         0             DP
    6     OR   EAX, 0x0F      1             DP
    7     INC  EBX            2             DP
    8     ADD  ESI, EDX       0             DP

(Clock-by-clock stage occupancy over cycles 1–8 uses the D/I/E/M nomenclature defined above; the timing of each instruction is given in the comments below.)
Comments for Each Instruction Number  
1. The IMUL is a VectorPath instruction. It cannot be decoded or paired with other operations and, therefore,  
dispatches alone in pipe 0. The multiply latency is four cycles.  
2. The simple INC operation is paired with instructions 3 and 4. The INC executes in IEU0 in cycle 4.  
3. The MOV executes in IEU1 in cycle 4.  
4. The ADD operation depends on instruction 3. It executes in IEU2 in cycle 5.  
5. The SHL operation depends on the multiply result (instruction 1). The MacroOP waits in a reservation  
station and is eventually scheduled to execute in cycle 7 after the multiply result is available.  
6. This operation executes in cycle 8 in IEU1.  
7. This simple operation has a resource contention for execution in IEU2 in cycle 5. Therefore, the operation  
does not execute until cycle 6.  
8. The ADD operation executes immediately in IEU0 after dispatching.  
Table 8. Sample 2 – Integer Register and Memory Load Operations

    Num   Instruction               Decode Pipe   Decode Type
    1     DEC EDX                   0             DP
    2     MOV EDI, [ECX]            1             DP
    3     SUB EAX, [EDX+20]         2             DP
    4     SAR EAX, 5                0             DP
    5     ADD ECX, [EDI+4]          1             DP
    6     AND EBX, 0x1F             2             DP
    7     MOV ESI, [0x0F100]        0             DP
    8     OR  ECX, [ESI+EAX*4+8]    1             DP

(Clock-by-clock stage occupancy over cycles 1–12 uses the D/I/E/&/S/A/$ nomenclature defined above; the timing of each instruction is given in the comments below.)
Comments for Each Instruction Number  
1. The ALU operation executes in IEU0.  
2. The load operation generates the address in AGU1 and is simultaneously scheduled for the load/store pipe in cycle 3. In  
cycles 4 and 5, the load completes the data cache access.  
3. The load-execute instruction accesses the data cache in tandem with instruction 2. After the load portion completes, the  
subtraction is executed in cycle 6 in IEU2.  
4. The shift operation executes in IEU0 (cycle 7) after instruction 3 completes.  
5. This operation is stalled on its address calculation waiting for instruction 2 to update EDI. The address is calculated in  
cycle 6. In cycle 7/8, the cache access completes.  
6. This simple operation executes quickly in IEU2.  
7. The address for the load is calculated in cycle 5 in AGU0. However, the load is not scheduled to access the data cache  
until cycle 6. The load is blocked for scheduling to access the data cache for one cycle by instruction 5. In cycles 7 and 8,  
instruction 7 accesses the data cache concurrently with instruction 5.  
8. The load-execute instruction accesses the data cache in cycles 10/11 and executes the OR operation in IEU1 in cycle 12.  
Appendix C  
Implementation of  
Write Combining  
Introduction  
This appendix describes the memory write-combining feature  
as implemented in the AMD Athlon™ processor family. The  
AMD Athlon processor supports the memory type and range  
register (MTRR) and the page attribute table (PAT) extensions,  
which allow software to define ranges of memory as either  
writeback (WB), write-protected (WP), writethrough (WT),  
uncacheable (UC), or write-combining (WC).  
Defining the memory type for a range of memory as WC or WT  
allows the processor to conditionally combine data from  
multiple write cycles that are addressed within this range into a  
merge buffer. Merging multiple write cycles into a single write  
cycle reduces processor bus utilization and processor stalls,  
thereby increasing the overall system performance.  
To understand the information presented in this appendix, the  
reader should possess a knowledge of K86™ processors, the x86  
architecture, and programming requirements.  
Write-Combining Definitions and Abbreviations  
This appendix uses the following definitions and abbreviations:  
UC – Uncacheable memory type  
WC – Write-combining memory type  
WT – Writethrough memory type  
WP – Write-protected memory type  
WB – Writeback memory type  
One Byte – 8 bits  
One Word – 16 bits  
Longword – 32 bits (same as an x86 doubleword)  
Quadword – 64 bits, or 2 longwords  
Octaword – 128 bits, or 2 quadwords  
Cache Block – 64 bytes, or 4 octawords, or 8 quadwords  
What is Write Combining?  
Write combining is the merging of multiple memory write  
cycles that target locations within the address range of a write  
buffer. The AMD Athlon processor combines multiple  
memory-write cycles to a 64-byte buffer whenever the memory  
address is within a WC or WT memory type region. The  
processor continues to combine writes to this buffer without  
writing the data to the system, as long as certain rules apply  
(see Table 9 on page 158 for more information).  
Programming Details  
The steps required for programming write combining on the  
AMD Athlon processor are as follows:  
1. Verify the presence of an AMD Athlon processor by using  
the CPUID instruction to check for the instruction family  
code and vendor identification of the processor. Standard  
function 0 on AMD processors returns a vendor  
identification string of "AuthenticAMD" in registers EBX,  
EDX, and ECX. Standard function 1 returns the processor  
signature in register EAX, where EAX[11:8] contains the  
instruction family code. For the AMD Athlon processor, the  
instruction family code is six.  
2. In addition, the presence of the MTRRs is indicated by bit  
12 and the presence of the PAT extension is indicated by bit  
16 of the extended features bits returned in the EDX  
register by CPUID function 8000_0001h. See the AMD  
Processor Recognition Application Note, order# 20734, for  
more details on the CPUID instruction.  
3. Write combining is controlled by the MTRRs and PAT.  
Write combining should be enabled for the appropriate  
memory ranges. The AMD Athlon processor MTRRs and  
PAT are compatible with the Pentium® II.  
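The detection steps above can be sketched in C. The helpers below only decode values already returned by CPUID (the register reads themselves are compiler- and platform-specific); the function names are invented, and the byte layout assumes a little-endian x86 host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assemble the 12-character vendor string from the EBX, EDX, and ECX
 * values returned by CPUID standard function 0 (little-endian x86). */
void cpuid_vendor(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13]) {
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
}

/* Instruction family code: bits 11:8 of the EAX signature returned by
 * standard function 1.  The AMD Athlon processor reports family 6. */
uint32_t cpuid_family(uint32_t eax) { return (eax >> 8) & 0xF; }

/* MTRR (bit 12) and PAT (bit 16) presence from the extended feature
 * bits returned in EDX by CPUID function 8000_0001h. */
int cpuid_has_mtrr(uint32_t edx) { return (edx >> 12) & 1; }
int cpuid_has_pat(uint32_t edx)  { return (edx >> 16) & 1; }
```

On an AMD processor, function 0 returns EBX=68747541h, EDX=69746E65h, ECX=444D4163h, which assemble to "AuthenticAMD".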
Write-Combining Operations  
In order to improve system performance, the AMD Athlon  
processor aggressively combines multiple memory-write cycles  
of any data size that address locations within a 64-byte write  
buffer that is aligned to a cache-line boundary. The data sizes  
can be bytes, words, longwords, or quadwords.  
WC memory type writes can be combined in any order up to a  
full 64-byte sized write buffer.  
WT memory type writes can only be combined up to a fully  
aligned quadword in the 64-byte buffer, and must be combined  
contiguously in ascending order. Combining may be opened at  
any byte boundary in a quadword, but is closed by a write that is  
either not contiguous and ascending, or that fills byte 7.  
All other memory types for stores that go through the write  
buffer (UC and WP) cannot be combined.  
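A small state machine makes the WT rule concrete. This sketch is inferred from the two paragraphs above (the type and function names are invented): it tracks combining within one aligned quadword, where writes must continue contiguously upward and filling byte 7 closes the buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* WT combining state within one aligned quadword: combining may open
 * at any byte offset, must continue contiguously in ascending order,
 * and closes when byte 7 is filled. */
typedef struct {
    int next;      /* next expected byte offset (0..8) in the quadword */
    bool open;
} wt_qword;

void wt_open(wt_qword *q, int first_byte) {
    q->next = first_byte;
    q->open = true;
}

/* Apply a write of `size` bytes at quadword offset `off`.
 * Returns true if combining stays open, false if it closed. */
bool wt_write(wt_qword *q, int off, int size) {
    if (!q->open || off != q->next) {  /* not contiguous and ascending */
        q->open = false;
        return false;
    }
    q->next = off + size;
    if (q->next >= 8) {                /* byte 7 filled: buffer closes */
        q->open = false;
        return false;
    }
    return true;
}
```

A sequence of byte/word/longword stores walking up a quadword keeps combining open until the final store touches byte 7.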
Combining is able to continue until interrupted by one of the  
conditions listed in Table 9 on page 158. When combining is  
interrupted, one or more bus commands are issued to the  
system for that write buffer, as described by Table 10 on  
page 159.  
Table 9. Write Combining Completion Events

Non-WB write outside of current buffer – The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect write combining. Only one line-sized buffer can be open for write combining at a time. Once a buffer is closed for write combining, it cannot be reopened for write combining.

I/O Read or Write – Any IN/INS or OUT/OUTS instruction closes combining. The implied memory type for all IN/OUT instructions is UC, which cannot be combined.

Serializing instructions – Any serializing instruction closes combining. These instructions include: MOVCRx, MOVDRx, WRMSR, INVD, INVLPG, WBINVD, LGDT, LLDT, LIDT, LTR, CPUID, IRET, RSM, INIT, HALT.

Flushing instructions – Any flush instruction causes the WC to complete.

Locks – Any instruction or processor operation that requires a cache or bus lock closes write combining before starting the lock. Writes within a lock can be combined.

Uncacheable Read – A UC read closes write combining. A WC read closes combining only if a cache block address match occurs between the WC read and a write in the write buffer.

Different memory type – Any WT write while write-combining for WC memory, or any WC write while write-combining for WT memory, closes write combining.

Buffer full – Write combining is closed if all 64 bytes of the write buffer are valid.

WT time-out – If 16 processor clocks have passed since the most recent write for WT write combining, write combining is closed. There is no time-out for WC write combining.

WT write fills byte 7 – Write combining is closed if a write fills the most significant byte of a quadword, which includes writes that are misaligned across a quadword boundary. In the misaligned case, combining is closed by the LS part of the misaligned write and combining is opened by the MS part of the misaligned store.

WT Nonsequential – If a subsequent WT write is not in ascending sequential order, the write combining completes. WC writes have no addressing constraints within the 64-byte line being combined.

TLB AD bit set – Write combining is closed whenever a TLB reload sets the accessed (A) or dirty (D) bits of a PDE or PTE.
Sending Write-Buffer Data to the System  
Once write combining is closed for a 64-byte write buffer, the  
contents of the write buffer are eligible to be sent to the system  
as one or more AMD Athlon system bus commands. Table 10  
lists the rules for determining what system commands are  
issued for a write buffer, as a function of the alignment of the  
valid buffer data.  
Table 10. AMD Athlon™ System Bus Command Generation Rules  
1. If all eight quadwords are either full (8 bytes valid) or empty (0 bytes valid), a  
Write-Quadword system command is issued, with an 8-byte mask representing  
which of the eight quadwords are valid. If this case is true, do not proceed to the  
next rule.  
2. If all longwords are either full (4 bytes valid) or empty (0 bytes valid), a  
Write-Longword system command is issued for each 32-byte buffer half that  
contains at least one valid longword. The mask for each Write-Longword system  
command indicates which longwords are valid in that 32-byte write buffer half. If  
this case is true, do not proceed to the next rule.  
3. Sequence through all eight quadwords of the write buffer, from quadword 0 to  
quadword 7. Skip over a quadword if no bytes are valid. Issue a Write-Quadword  
system command if all bytes are valid, asserting one mask bit. Issue a Write-Longword  
system command if the quadword contains one aligned longword, asserting one  
mask bit. Otherwise, issue a Write-Byte system command if there is at least one  
valid byte, asserting a mask bit for each valid byte.  
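The three rules can be expressed as a function of the buffer's valid-byte mask. This is an illustrative model, not the actual bus logic; in particular, the exact mask encodings chosen for the Rule 3 commands are an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { WRITE_QUADWORD, WRITE_LONGWORD, WRITE_BYTE } bus_cmd;
typedef struct { bus_cmd cmd; uint64_t mask; } cmd_rec;

/* Emit bus commands for a closed 64-byte write buffer whose valid
 * bytes are given by `valid` (bit n set = byte n valid).  Returns the
 * number of commands written to out[] (at most 8). */
int gen_commands(uint64_t valid, cmd_rec *out) {
    int n = 0;

    /* Rule 1: every quadword is fully valid or fully empty. */
    int all_q = 1;
    uint64_t qmask = 0;
    for (int q = 0; q < 8; q++) {
        uint64_t b = (valid >> (q * 8)) & 0xFF;
        if (b == 0xFF) qmask |= 1u << q;
        else if (b != 0) all_q = 0;
    }
    if (all_q) {
        out[n++] = (cmd_rec){ WRITE_QUADWORD, qmask };
        return n;
    }

    /* Rule 2: every longword is fully valid or fully empty. */
    int all_l = 1;
    for (int l = 0; l < 16; l++) {
        uint64_t b = (valid >> (l * 4)) & 0xF;
        if (b != 0 && b != 0xF) all_l = 0;
    }
    if (all_l) {
        for (int half = 0; half < 2; half++) {
            uint64_t lmask = 0;
            for (int l = 0; l < 8; l++)
                if (((valid >> (half * 32 + l * 4)) & 0xF) == 0xF)
                    lmask |= 1u << l;
            if (lmask)
                out[n++] = (cmd_rec){ WRITE_LONGWORD, lmask };
        }
        return n;
    }

    /* Rule 3: walk quadwords 0..7, skipping empty ones. */
    for (int q = 0; q < 8; q++) {
        uint64_t b = (valid >> (q * 8)) & 0xFF;
        if (b == 0) continue;
        if (b == 0xFF)
            out[n++] = (cmd_rec){ WRITE_QUADWORD, 1u << q };
        else if (b == 0x0F || b == 0xF0)   /* one aligned longword valid */
            out[n++] = (cmd_rec){ WRITE_LONGWORD,
                                  1u << (q * 2 + (b == 0xF0)) };
        else
            out[n++] = (cmd_rec){ WRITE_BYTE, b << (q * 8) };
    }
    return n;
}
```

A fully valid buffer collapses to a single Write-Quadword command with all eight mask bits set, while a lone valid byte produces a single Write-Byte command.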
Appendix D  
Performance-Monitoring  
Counters  
This appendix describes how to use the AMD Athlon™ processor  
performance-monitoring counters.  
Overview  
The AMD Athlon processor provides four 48-bit performance  
counters, which allow four types of events to be monitored  
simultaneously. These counters can either count events or  
measure duration. When counting events, a counter is  
incremented each time a specified event takes place or a  
specified number of events takes place. When measuring  
duration, a counter counts the number of processor clocks that  
occur while a specified condition is true. The counters can  
count events or measure durations that occur at any privilege  
level. Table 11 on page 164 lists the events that can be counted  
with the performance monitoring counters.  
Performance Counter Usage  
The performance-monitoring counters are supported by eight  
MSRs: PerfEvtSel[3:0] are the performance event select  
MSRs, and PerfCtr[3:0] are the performance counter MSRs.  
These registers can be read from and written to using the  
RDMSR and WRMSR instructions, respectively.  
The PerfEvtSel[3:0] registers are located at MSR locations  
C001_0000h to C001_0003h. The PerfCtr[3:0] registers are  
located at MSR locations C001_0004h to C001_0007h and are  
64-bit registers.  
The PerfEvtSel[3:0] registers can be accessed using the  
RDMSR/WRMSR instructions only when operating at privilege  
level 0. The PerfCtr[3:0] MSRs can be read from any privilege  
level using the RDPMC (read performance-monitoring  
counters) instruction, if the PCE flag in CR4 is set.  
PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h–C001_0003h)  
The PerfEvtSel[3:0] MSRs, shown in Figure 11, control the  
operation of the performance-monitoring counters, with one  
register used to set up each counter. These MSRs specify the  
events to be counted, how they should be counted, and the  
privilege levels at which counting should take place. The  
functions of the flags and fields within these MSRs are  
described in the following sections.  
PerfEvtSel[3:0] bit layout:

    Bits 31–24  Counter Mask
    Bit 23      INV (Invert Mask)
    Bit 22      EN (Enable Counter)
    Bit 21      Reserved
    Bit 20      INT (APIC Interrupt Enable)
    Bit 19      PC (Pin Control)
    Bit 18      E (Edge Detect)
    Bit 17      OS (Operating System Mode)
    Bit 16      USR (User Mode)
    Bits 15–8   Unit Mask
    Bits 7–0    Event Mask

Figure 11. PerfEvtSel[3:0] Registers
Event Select Field (Bits 0–7)
These bits are used to select the event to be monitored. See  
Table 11 on page 164 for a list of event masks and their 8-bit  
codes.  
Unit Mask Field (Bits 8–15)
These bits are used to further qualify the event selected in the  
event select field. For example, for some cache events, the mask  
is used as a MESI-protocol qualifier of cache states. See  
Table 11 on page 164 for a list of unit masks and their 8-bit  
codes.  
USR (User Mode) Flag  
(Bit 16)  
Events are counted only when the processor is operating at  
privilege levels 1, 2, or 3. This flag can be used in conjunction  
with the OS flag.  
OS (Operating System  
Mode) Flag (Bit 17)  
Events are counted only when the processor is operating at  
privilege level 0. This flag can be used in conjunction with the  
USR flag.  
E (Edge Detect) Flag  
(Bit 18)  
When this flag is set, edge detection of events is enabled. The  
processor counts the number of negated-to-asserted transitions  
of any condition that can be expressed by the other fields. The  
mechanism is limited in that it does not permit back-to-back  
assertions to be distinguished. This mechanism allows software  
to measure not only the fraction of time spent in a particular  
state, but also the average length of time spent in such a state  
(for example, the time spent waiting for an interrupt to be  
serviced).  
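The duration/edge relationship can be demonstrated with a simulated waveform. The function below is a hypothetical software model of the two counter configurations (edge detect clear counts cycles in a state; edge detect set counts entries into it), not processor code.

```c
#include <assert.h>
#include <stdint.h>

/* Model one condition sampled every clock (1 = asserted).  With edge
 * detect clear the counter accumulates asserted clocks (duration);
 * with edge detect set it counts negated-to-asserted transitions. */
void count_condition(const int *wave, int n,
                     uint64_t *duration, uint64_t *edges) {
    *duration = 0;
    *edges = 0;
    int prev = 0;                   /* condition negated before sampling */
    for (int i = 0; i < n; i++) {
        if (wave[i]) {
            (*duration)++;
            if (!prev) (*edges)++;  /* a negated-to-asserted transition */
        }
        prev = wave[i];
    }
}
```

Dividing the duration count by the edge count gives the average number of clocks spent in the condition per occurrence.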
PC (Pin Control) Flag  
(Bit 19)  
When this flag is set, the processor toggles the PMi pins when  
the counter overflows. When this flag is clear, the processor  
toggles the PMi pins and increments the counter when  
performance monitoring events occur. The toggling of a pin is  
defined as assertion of the pin for one bus clock followed by  
negation.  
INT (APIC Interrupt  
Enable) Flag (Bit 20)  
When this flag is set, the processor generates an interrupt  
through its local APIC on counter overflow.  
EN (Enable Counter)  
Flag (Bit 22)  
This flag enables/disables the PerfEvtSeln MSR. When set,  
performance counting is enabled for this counter. When clear,  
this counter is disabled.  
INV (Invert) Flag (Bit  
23)  
By inverting the Counter Mask Field, this flag inverts the result  
of the counter comparison, allowing both greater than and less  
than comparisons.  
Counter Mask Field (Bits 31–24)
For events which can have multiple occurrences within one  
clock, this field is used to set a threshold. If the field is non-zero,  
the counter increments each time the number of events is  
greater than or equal to the counter mask. If this field is zero,  
the counter increments by the total number of events.  
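Putting the fields together, the helper below encodes a PerfEvtSel value from the layout in Figure 11. The macro and function names are invented; the field positions come from the sections above. For example, selecting event 40h (data cache accesses, Table 11) with USR, OS, and EN set yields 0043_0040h.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits of PerfEvtSel[3:0] (see Figure 11). */
#define PES_USR  (1u << 16)   /* count at privilege levels 1-3 */
#define PES_OS   (1u << 17)   /* count at privilege level 0 */
#define PES_E    (1u << 18)   /* edge detect */
#define PES_PC   (1u << 19)   /* pin control */
#define PES_INT  (1u << 20)   /* APIC interrupt enable */
#define PES_EN   (1u << 22)   /* enable counter */
#define PES_INV  (1u << 23)   /* invert counter-mask comparison */

/* Build a PerfEvtSel value to be written (at privilege level 0) with
 * WRMSR to C001_0000h-C001_0003h; counts are read back with RDPMC. */
uint32_t perfevtsel(uint8_t event, uint8_t unit_mask,
                    uint8_t counter_mask, uint32_t flags) {
    return (uint32_t)event
         | ((uint32_t)unit_mask << 8)
         | flags
         | ((uint32_t)counter_mask << 24);
}
```

A unit mask of 1Fh with event 42h would count data cache refills in all five MOESI states.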
Table 11. Performance-Monitoring Counters

Event  Source
Number Unit   Event Description                                Notes / Unit Mask (bits 15-8)
20h    LS     Segment register loads                           1xxx_xxxxb = reserved
                                                               x1xx_xxxxb = HS
                                                               xx1x_xxxxb = GS
                                                               xxx1_xxxxb = FS
                                                               xxxx_1xxxb = DS
                                                               xxxx_x1xxb = SS
                                                               xxxx_xx1xb = CS
                                                               xxxx_xxx1b = ES
21h    LS     Stores to active instruction stream
40h    DC     Data cache accesses
41h    DC     Data cache misses
42h    DC     Data cache refills                               xxx1_xxxxb = Modified (M)
                                                               xxxx_1xxxb = Owner (O)
                                                               xxxx_x1xxb = Exclusive (E)
                                                               xxxx_xx1xb = Shared (S)
                                                               xxxx_xxx1b = Invalid (I)
43h    DC     Data cache refills from system                   xxx1_xxxxb = Modified (M)
                                                               xxxx_1xxxb = Owner (O)
                                                               xxxx_x1xxb = Exclusive (E)
                                                               xxxx_xx1xb = Shared (S)
                                                               xxxx_xxx1b = Invalid (I)
44h    DC     Data cache writebacks                            xxx1_xxxxb = Modified (M)
                                                               xxxx_1xxxb = Owner (O)
                                                               xxxx_x1xxb = Exclusive (E)
                                                               xxxx_xx1xb = Shared (S)
                                                               xxxx_xxx1b = Invalid (I)
45h    DC     L1 DTLB misses and L2 DTLB hits
46h    DC     L1 and L2 DTLB misses
47h    DC     Misaligned data references
64h    BU     DRAM system requests
65h    BU     System requests with the selected type           x1xx_xxxxb = WB
                                                               xx1x_xxxxb = WP
                                                               xxx1_xxxxb = WT
                                                               bits 11-10 = reserved
                                                               xxxx_xx1xb = WC
                                                               xxxx_xxx1b = UC
73h    BU     Snoop hits                                       bits 15-11 = reserved
                                                               xxxx_x1xxb = L2 (L2 hit and no DC hit)
                                                               xxxx_xx1xb = Data cache
                                                               xxxx_xxx1b = Instruction cache
74h    BU     Single-bit ECC errors detected/corrected         bits 15-10 = reserved
                                                               xxxx_xx1xb = L2 single bit error
                                                               xxxx_xxx1b = System single bit error
75h    BU     Internal cache line invalidates                  bits 15-12 = reserved
                                                               xxxx_1xxxb = I invalidates D
                                                               xxxx_x1xxb = I invalidates I
                                                               xxxx_xx1xb = D invalidates D
                                                               xxxx_xxx1b = D invalidates I
76h    BU     Cycles processor is running (not in HLT
              or STPCLK)
79h    BU     L2 requests                                      1xxx_xxxxb = Data block write from the L2 (TLB RMW)
                                                               x1xx_xxxxb = Data block write from the DC
                                                               xx1x_xxxxb = Data block write from the system
                                                               xxx1_xxxxb = Data block read data store
                                                               xxxx_1xxxb = Data block read data load
                                                               xxxx_x1xxb = Data block read instruction
                                                               xxxx_xx1xb = Tag write
                                                               xxxx_xxx1b = Tag read
7Ah    BU     Cycles that at least one fill request
              waited to use the L2
80h    PC     Instruction cache fetches
81h    PC     Instruction cache misses
82h    PC     Instruction cache refills from L2
83h    PC     Instruction cache refills from system
84h    PC     L1 ITLB misses (and L2 ITLB hits)
85h    PC     (L1 and) L2 ITLB misses
86h    PC     Snoop resyncs
87h    PC     Instruction fetch stall cycles
88h    PC     Return stack hits
89h    PC     Return stack overflow
C0h    FR     Retired instructions (includes exceptions,
              interrupts, resyncs)
C1h    FR     Retired Ops
C2h    FR     Retired branches (conditional, unconditional,
              exceptions, interrupts)
C3h    FR     Retired branches mispredicted
C4h    FR     Retired taken branches
C5h    FR     Retired taken branches mispredicted
C6h    FR     Retired far control transfers
C8h    FR     Retired near returns
C9h    FR     Retired near returns mispredicted
CAh    FR     Retired indirect branches with target
              mispredicted
CDh    FR     Interrupts masked cycles (IF=0)
CEh    FR     Interrupts masked while pending cycles
              (INTR while IF=0)
CFh    FR     Number of taken hardware interrupts
D0h    FR     Instruction decoder empty
D1h    FR     Dispatch stalls (event masks D2h through
              DAh below combined)
D2h    FR     Branch abort to retire
D3h    FR     Serialize
D4h    FR     Segment load stall
D5h    FR     ICU full
D6h    FR     Reservation stations full
D7h    FR     FPU full
D8h    FR     LS full
D9h    FR     All quiet stall
DAh    FR     Far transfer or resync branch pending
DCh    FR     Breakpoint matches for DR0
DDh    FR     Breakpoint matches for DR1
DEh    FR     Breakpoint matches for DR2
DFh    FR     Breakpoint matches for DR3
PerfCtr[3:0] MSRs (MSR Addresses C001_0004h-C001_0007h)
The performance-counter MSRs contain the event or duration  
counts for the selected events being counted. The RDPMC  
instruction can be used by programs or procedures running at  
any privilege level and in virtual-8086 mode to read these  
counters. The PCE flag in control register CR4 (bit 8) allows the  
use of this instruction to be restricted to only programs and  
procedures running at privilege level 0.  
The RDPMC instruction is not serializing or ordered with other  
instructions. Therefore, it does not necessarily wait until all  
previous instructions have been executed before reading the  
counter. Similarly, subsequent instructions can begin execution  
before the RDPMC instruction operation is performed.  
Only the operating system, executing at privilege level 0, can  
directly manipulate the performance counters, using the  
RDMSR and WRMSR instructions. A secure operating system  
would clear the PCE flag during system initialization, which  
disables direct user access to the performance-monitoring  
counters but provides a user-accessible programming interface  
that emulates the RDPMC instruction.  
The WRMSR instruction cannot arbitrarily write to the performance-monitoring counter MSRs (PerfCtr[3:0]). Instead, the value written is treated as a 64-bit sign-extended quantity, which allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are useful for generating an interrupt after a specific number of events.
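A preload value for "interrupt after N events" can be computed as sketched below. The 48-bit truncation models the sign extension from bit 47 described above; the helper name is illustrative only.

```c
#include <stdint.h>

/* Compute a PerfCtr preload value that overflows after "events" more
   events: write the negative count, kept to 48 bits, so that sign
   extension from bit 47 reproduces the intended negative value.
   Hypothetical helper for illustration. */
uint64_t perfctr_preload(uint64_t events)
{
    int64_t start = -(int64_t)events;            /* negative start count */
    return (uint64_t)start & 0xFFFFFFFFFFFFULL;  /* keep low 48 bits */
}
```

For example, preloading the counter with the value for 1000 events causes an overflow (and, if enabled, an APIC interrupt) on the 1000th event.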
Starting and Stopping the Performance-Monitoring Counters  
The performance-monitoring counters are started by writing  
valid setup information in one or more of the PerfEvtSel[3:0]  
MSRs and setting the enable counters flag in the PerfEvtSel0  
MSR. If the setup is valid, the counters begin counting  
following the execution of a WRMSR instruction, which sets the  
enable counter flag. The counters can be stopped by clearing  
the enable counters flag or by clearing all the bits in the  
PerfEvtSel[3:0] MSRs.  
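As a sketch of the setup step, the helper below assembles a PerfEvtSel value from its fields. The bit positions assumed here (event select in bits 7:0, unit mask in bits 15:8, USR = bit 16, OS = bit 17, EN = bit 22) follow the register layout described in this chapter; the function name and interface are illustrative, and actually writing the MSR requires WRMSR at privilege level 0.

```c
#include <stdint.h>

/* Assemble a PerfEvtSel value (hypothetical helper).
   Assumed field positions: event 7:0, unit mask 15:8,
   USR = bit 16, OS = bit 17, EN = bit 22. */
uint32_t perf_evt_sel(uint8_t event, uint8_t unit_mask,
                      int usr, int os, int en)
{
    uint32_t v = 0;
    v |= event;                        /* event select */
    v |= (uint32_t)unit_mask << 8;     /* unit mask (bits 15-8) */
    if (usr) v |= 1u << 16;            /* count in user mode */
    if (os)  v |= 1u << 17;            /* count in OS mode */
    if (en)  v |= 1u << 22;            /* enable the counter */
    return v;
}
```

For example, selecting event 41h (data cache misses) in both user and OS mode with the counter enabled yields the value 0043_0041h.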
Event and Time-Stamp Monitoring Software  
For applications to use the performance-monitoring counters  
and time-stamp counter, the operating system needs to provide  
an event-monitoring device driver. This driver should include  
procedures for handling the following operations:  
Feature checking
Initialize and start counters
Stop counters
Read the event counters
Read the time-stamp counter
The event monitor feature determination procedure must  
determine whether the current processor supports the  
performance-monitoring counters and time-stamp counter. This  
procedure compares the family and model of the processor  
returned by the CPUID instruction with those of processors  
known to support performance monitoring. In addition, the  
procedure checks the MSR and TSC flags returned to register  
EDX by the CPUID instruction to determine if the MSRs and  
the RDTSC instruction are supported.  
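The flag check can be sketched as a pure function over the CPUID feature bits. The bit positions used (TSC = bit 4, MSR = bit 5 of the EDX value returned by CPUID function 1) are the standard x86 assignments; obtaining edx by actually executing CPUID is omitted here.

```c
#include <stdint.h>

/* Check the CPUID function 1 EDX feature bits the driver needs:
   bit 4 = TSC (RDTSC supported), bit 5 = MSR (RDMSR/WRMSR supported).
   The edx value would come from executing CPUID; it is passed in
   here for illustration. */
int has_perfmon_prereqs(uint32_t edx)
{
    int has_tsc = (edx >> 4) & 1;
    int has_msr = (edx >> 5) & 1;
    return has_tsc && has_msr;
}
```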
22007E/0November 1999  
AMD AthlonProcessor x86 Code Optimization  
The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them, and initializes the counter MSRs (PerfCtr[3:0]) to starting counts. The stop counters procedure stops the performance counters. (See "Starting and Stopping the Performance-Monitoring Counters" on page 168 for more information about starting and stopping the counters.)
The read counters procedure reads the values in the  
PerfCtr[3:0] MSRs, and a read time-stamp counter procedure  
reads the time-stamp counter. These procedures can be used  
instead of enabling the RDTSC and RDPMC instructions, which  
allow application code to read the counters directly.  
Monitoring Counter Overflow  
The AMD Athlon processor provides the option of generating a  
debug interrupt when a performance-monitoring counter  
overflows. This mechanism is enabled by setting the interrupt  
enable flag in one of the PerfEvtSel[3:0] MSRs. The primary  
use of this option is for statistical performance sampling.  
To use this option, the operating system should do the  
following:  
Provide an interrupt routine for handling the counter  
overflow as an APIC interrupt  
Provide an entry in the IDT that points to a stub exception  
handler that returns without executing any instructions  
Provide an event monitor driver that provides the actual  
interrupt handler and modifies the reserved IDT entry to  
point to its interrupt routine  
When interrupted by a counter overflow, the interrupt handler  
needs to perform the following actions:  
Save the instruction pointer (EIP register), code segment  
selector, TSS segment selector, counter values and other  
relevant information at the time of the interrupt  
Reset the counter to its initial setting and return from the  
interrupt  
An event monitor application utility or another application  
program can read the collected performance information of the  
profiled application.  
Appendix E  
Programming the MTRR and PAT
Introduction  
The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includes the Page Attribute Table (PAT) for defining attributes of pages. This chapter documents the use and capabilities of these features.
The purpose of the MTRRs is to provide system software with the ability to manage the memory mapping of the hardware. Both the BIOS software and operating systems utilize this capability. The AMD Athlon processor's implementation is compatible with the Pentium® II. Prior to the MTRR mechanism, chipsets usually provided this capability.
Memory Type Range Register (MTRR) Mechanism  
The memory type and range registers allow the processor to  
determine cacheability of various memory locations prior to  
bus access and to optimize access to the memory system. The  
AMD Athlon processor implements the MTRR programming  
model in a manner compatible with Pentium II.  
There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4-Kbyte, 16-Kbyte, or 64-Kbyte segment within the first 1 Mbyte of memory, there is one fixed-address MTRR. The fixed address ranges all exist in the first 1 Mbyte. There are eight variable address ranges above 1 Mbyte. Each is programmed to a specific memory starting address, size, and alignment. If a variable range overlaps the lower 1 Mbyte and the fixed MTRRs are enabled, then the fixed-memory type dominates.
The address regions have the following priority with respect to  
each other:  
1. Fixed address ranges  
2. Variable address ranges  
3. Default memory type (UC at reset)  
FFFF_FFFFh
    SMM TSeg
    0-8 Variable Ranges (2^12 to 2^32 bytes)
100000h
    64 Fixed Ranges (4 Kbytes each; 256 Kbytes total)
C0000h
    16 Fixed Ranges (16 Kbytes each; 256 Kbytes total)
80000h
    8 Fixed Ranges (64 Kbytes each; 512 Kbytes total)
0

Figure 12. MTRR Mapping of Physical Memory
Memory Types  
Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in Table 12.

Table 12. Memory Type Encodings

Type Number  Type Name              Type Description
00h          UC (Uncacheable)       Uncacheable for reads or writes. Cannot be combined.
                                    Must be non-speculative for reads or writes.
01h          WC (Write-Combining)   Uncacheable for reads or writes. Can be combined. Can be
                                    speculative for reads. Writes can never be speculative.
04h          WT (Writethrough)      Reads allocate on a miss, but only to the S-state. Writes do
                                    not allocate on a miss and, for a hit, writes update the
                                    cached entry and main memory.
05h          WP (Write-Protect)     WP is functionally the same as the WT memory type, except
                                    stores do not actually modify cached data and do not cause
                                    an exception.
06h          WB (Writeback)         Reads allocate on a miss, and allocate to the S state if
                                    returned with a ReadDataShared command, or to the M state if
                                    returned with a ReadDataDirty command. Writes allocate to
                                    the M state, if the read allows the line to be marked E.
MTRR Capability Register Format
The MTRR capability register is a read-only register that defines the specific MTRR capability of the processor and is defined as follows.

Bits 63-11  Reserved
Bit 10      WC    Write Combining Memory Type
Bit 9       Reserved
Bit 8       FIX   Fixed Range Registers
Bits 7-0    VCNT  Number of Variable Range Registers

Figure 13. MTRR Capability Register Format

For the AMD Athlon processor, the MTRR capability register should contain 0508h (write-combining, fixed MTRRs supported, and eight variable MTRRs defined).
MTRR Default Type Register Format. The MTRR default type register is defined as follows.

Bits 63-12  Reserved
Bit 11      E     MTRRs Enabled
Bit 10      FE    Fixed Range Enabled
Bits 9-8    Reserved
Bits 7-0    Type  Default Memory Type

Figure 14. MTRR Default Type Register Format

E     MTRRs are enabled when set. All MTRRs (both fixed and variable range) are disabled when clear, and all of physical memory is mapped as uncacheable memory (reset state = 0).
FE    Fixed-range MTRRs are enabled when set. All MTRRs are disabled when clear. When the fixed-range MTRRs are enabled and an overlap occurs with a variable-range MTRR, the fixed-range MTRR takes priority (reset state = 0).
Type  Defines the default memory type (reset state = 0). See Table 13 for more details.
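A minimal decoder for these fields might look as follows. This is a hypothetical helper for illustration; the bit positions match the register format above.

```c
#include <stdint.h>

/* Decode MTRRdefType fields: bit 11 = E (MTRRs enabled),
   bit 10 = FE (fixed range enabled), bits 7:0 = default type. */
typedef struct {
    int e;
    int fe;
    uint8_t type;
} mtrr_def_type;

mtrr_def_type decode_def_type(uint64_t msr)
{
    mtrr_def_type d;
    d.e    = (int)((msr >> 11) & 1);
    d.fe   = (int)((msr >> 10) & 1);
    d.type = (uint8_t)(msr & 0xFF);
    return d;
}
```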
Table 13. Standard MTRR Types and Properties

Memory Type            Encoding   Internally    Writeback   Allows        Memory Ordering
                       in MTRR    Cacheable     Cacheable   Speculative   Model
                                                            Reads
Uncacheable (UC)       0          No            No          No            Strong ordering
Write Combining (WC)   1          No            No          Yes           Weak ordering
Reserved               2          -             -           -             -
Reserved               3          -             -           -             -
Writethrough (WT)      4          Yes           No          Yes           Speculative ordering
Write Protected (WP)   5          Yes (reads),  No          Yes           Speculative ordering
                                  No (writes)
Writeback (WB)         6          Yes           Yes         Yes           Speculative ordering
Reserved               7-255      -             -           -             -
Note that if two or more variable memory ranges match, then the interactions are defined as follows:
1. If the memory types are identical, then that memory type is  
used.  
2. If one or more of the memory types is UC, the UC memory  
type is used.  
3. If one or more of the memory types is WT and the only other  
matching memory type is WB then the WT memory type is  
used.  
4. Otherwise, if the combination of memory types is not listed  
above then the behavior of the processor is undefined.  
MTRR Overlapping  
The Intel documentation (P6/PII) states that the mapping of  
large pages into regions that are mapped with differing memory  
types can result in undefined behavior. However, testing shows  
that these processors decompose these large pages into 4-Kbyte  
pages.  
When a large page (2 Mbytes/4 Mbytes) mapping covers a region that contains more than one memory type (as mapped by the MTRRs), the AMD Athlon processor does not suppress the caching of that large page mapping, and caches only the mapping for the accessed 4-Kbyte piece in the 4-Kbyte TLB. Therefore, the AMD Athlon processor does not decompose large pages under these conditions. The fixed-range MTRRs are not affected by this issue; only the variable-range (and MTRR DefType) registers are affected.
Page Attribute Table (PAT)  
The Page Attribute Table (PAT) is an extension of the page table entry format, which allows the specification of memory types for regions of physical memory based on the linear address. The PAT provides the same functionality as the MTRRs with the flexibility of the page tables. It allows operating systems and applications to determine the desired memory type for optimal performance. PAT support is detected in the feature flags (bit 16) of the CPUID instruction.
MSR Access  
The PAT is located in a 64-bit MSR at location 277h. It is  
illustrated in Figure 15. Each of the eight PAn fields can contain  
the memory type encodings as described in Table 12 on  
page 174. An attempt to write an undefined memory type  
encoding into the PAT will generate a GP fault.  
Field  Bits
PA0    2-0
PA1    10-8
PA2    18-16
PA3    26-24
PA4    34-32
PA5    42-40
PA6    50-48
PA7    58-56
All other bits are reserved.

Figure 15. Page Attribute Table (MSR 277h)
Accessing the PAT  
A 3-bit index, consisting of the PATi, PCD, and PWT bits of the page table entry, is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is defined as bit 7 for 4-Kbyte PTEs and bit 12 for PDEs that map to 2-Mbyte or 4-Mbyte pages). The memory type from the PAT is used instead of the PCD and PWT bits for the effective memory type.
A 2-bit index, consisting of the PCD and PWT bits of the page table entry, is used to select one of four PAT register fields when PAE (page address extensions) is enabled, or when the PDE doesn't describe a large page. In the latter case, the PATi bit for a PTE (bit 7) corresponds to the page size bit in a PDE. Therefore, the OS should only use PA0-PA3 when setting the memory type for a page table that is also used as a page directory. See Table 14 for the encodings.
Table 14. PATi 3-Bit Encodings

PATi  PCD  PWT  PAT Entry  Reset Value
0     0    0    0          WB
0     0    1    1          WT
0     1    0    2          UC-
0     1    1    3          UC
1     0    0    4          WB
1     0    1    5          WT
1     1    0    6          UC-
1     1    1    7          UC
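The index computation and field extraction can be sketched as follows. These are hypothetical helpers for illustration; each PAn field occupies the low 3 bits of byte n of MSR 277h, per Figure 15.

```c
#include <stdint.h>

/* Build the 3-bit PAT index from the page-table bits. */
unsigned pat_index(int pati, int pcd, int pwt)
{
    return ((unsigned)pati << 2) | ((unsigned)pcd << 1) | (unsigned)pwt;
}

/* Extract the memory type encoding of PAn from the PAT MSR value:
   the low 3 bits of byte n. */
uint8_t pat_field(uint64_t pat_msr, unsigned index)
{
    return (uint8_t)((pat_msr >> (8 * index)) & 0x07);
}
```

As a usage sketch, a PAT MSR value of 0007_0406_0007_0406h (byte fields 06h, 04h, 07h, 00h repeating) returns encoding 06h (WB) for index 0 and 00h (UC) for index 3.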
MTRRs and PAT  
The processor contains MTRRs, as described earlier, which provide a limited way of assigning memory types to specific regions. However, the page tables allow memory types to be assigned to the pages used for linear-to-physical translation. The memory types defined by the PAT and the MTRRs are combined to determine the effective memory type, as listed in Table 15 and Table 16. Shaded areas indicate reserved settings.
Table 15. Effective Memory Type Based on PAT and MTRRs

PAT Memory Type   MTRR Memory Type   Effective Memory Type
UC-               WB, WT, WP, WC     UC-Page
UC-               UC                 UC-MTRR
WC                x                  WC
WT                WB, WT             WT
WT                UC                 UC
WT                WC                 CD
WT                WP                 CD
WP                WB, WP             WP
WP                UC                 UC-MTRR
WP                WC, WT             CD
WB                WB                 WB
WB                UC                 UC
WB                WC                 WC
WB                WT                 WT
WB                WP                 WP

Notes:
1. UC-MTRR indicates that the UC attribute came from the MTRRs and that the processor caches should not be probed for performance reasons.
2. UC-Page indicates that the UC attribute came from the page tables and that the processor caches must be probed due to page aliasing.
3. All reserved combinations default to CD.
Table 16. Final Output Memory Types

[This table maps each input memory type (UC, CD, WC, WT, WP, and WB) and access condition to the final output memory type produced with the AMD-751 system controller; its matrix is not legible in this copy. The special cases it encodes are captured in the notes below.]

Notes:
1. WP is not functional for RdMem/WrMem.
2. ForceCD must cause the MTRR memory type to be ignored in order to avoid x's.
3. D-I should always be WP because the BIOS will only program RdMem-WrIO for WP. CD is forced to preserve the write-protect intent.
4. Since cached IO lines cannot be copied back to IO, the processor forces WB to WT to prevent cached IO from going dirty.
5. ForceCD. The memory type is forced CD due to (1) CR0[CD]=1, (2) memory type is for the ITLB and the I-cache is disabled or for the DTLB and the D-cache is disabled, (3) when clean victims must be written back and RdIO and WrIO and WT, WB, or WP, or (4) access to Local APIC space.
6. The processor does not support this memory type.
MTRR Fixed-Range Register Format
The memory types for the memory segments controlled by each of the MTRR fixed-range registers are defined in the byte fields shown in Table 17.

Table 17. MTRR Fixed Range Register Format

                    Address Range (in hexadecimal)
Register            63:56   55:48   47:40   39:32   31:24   23:16   15:8    7:0
MTRR_fix64K_00000   70000-  60000-  50000-  40000-  30000-  20000-  10000-  00000-
                    7FFFF   6FFFF   5FFFF   4FFFF   3FFFF   2FFFF   1FFFF   0FFFF
MTRR_fix16K_80000   9C000-  98000-  94000-  90000-  8C000-  88000-  84000-  80000-
                    9FFFF   9BFFF   97FFF   93FFF   8FFFF   8BFFF   87FFF   83FFF
MTRR_fix16K_A0000   BC000-  B8000-  B4000-  B0000-  AC000-  A8000-  A4000-  A0000-
                    BFFFF   BBFFF   B7FFF   B3FFF   AFFFF   ABFFF   A7FFF   A3FFF
MTRR_fix4K_C0000    C7000-  C6000-  C5000-  C4000-  C3000-  C2000-  C1000-  C0000-
                    C7FFF   C6FFF   C5FFF   C4FFF   C3FFF   C2FFF   C1FFF   C0FFF
MTRR_fix4K_C8000    CF000-  CE000-  CD000-  CC000-  CB000-  CA000-  C9000-  C8000-
                    CFFFF   CEFFF   CDFFF   CCFFF   CBFFF   CAFFF   C9FFF   C8FFF
MTRR_fix4K_D0000    D7000-  D6000-  D5000-  D4000-  D3000-  D2000-  D1000-  D0000-
                    D7FFF   D6FFF   D5FFF   D4FFF   D3FFF   D2FFF   D1FFF   D0FFF
MTRR_fix4K_D8000    DF000-  DE000-  DD000-  DC000-  DB000-  DA000-  D9000-  D8000-
                    DFFFF   DEFFF   DDFFF   DCFFF   DBFFF   DAFFF   D9FFF   D8FFF
MTRR_fix4K_E0000    E7000-  E6000-  E5000-  E4000-  E3000-  E2000-  E1000-  E0000-
                    E7FFF   E6FFF   E5FFF   E4FFF   E3FFF   E2FFF   E1FFF   E0FFF
MTRR_fix4K_E8000    EF000-  EE000-  ED000-  EC000-  EB000-  EA000-  E9000-  E8000-
                    EFFFF   EEFFF   EDFFF   ECFFF   EBFFF   EAFFF   E9FFF   E8FFF
MTRR_fix4K_F0000    F7000-  F6000-  F5000-  F4000-  F3000-  F2000-  F1000-  F0000-
                    F7FFF   F6FFF   F5FFF   F4FFF   F3FFF   F2FFF   F1FFF   F0FFF
MTRR_fix4K_F8000    FF000-  FE000-  FD000-  FC000-  FB000-  FA000-  F9000-  F8000-
                    FFFFF   FEFFF   FDFFF   FCFFF   FBFFF   FAFFF   F9FFF   F8FFF
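Given this layout, software can locate the register and byte field that cover a physical address below 1 Mbyte. The helper below is an illustrative sketch of that arithmetic (register indexes 0-10 in the order the registers appear in Table 17).

```c
#include <stdint.h>

/* Locate the fixed-range MTRR covering a physical address below 1 Mbyte:
   one 64K-granularity register below 80000h, two 16K-granularity
   registers for 80000h-BFFFFh, and eight 4K-granularity registers for
   C0000h-FFFFFh. Returns the register index (0-10, in Table 17 order)
   and writes the byte-field index (0-7) to *field. Hypothetical helper. */
int fixed_mtrr_locate(uint32_t addr, int *field)
{
    if (addr < 0x80000) {                       /* MTRR_fix64K_00000 */
        *field = (int)(addr / 0x10000);
        return 0;
    }
    if (addr < 0xC0000) {                       /* MTRR_fix16K_80000/_A0000 */
        *field = (int)((addr % 0x20000) / 0x4000);
        return 1 + (int)((addr - 0x80000) / 0x20000);
    }
    /* MTRR_fix4K_C0000 through MTRR_fix4K_F8000 */
    *field = (int)((addr % 0x8000) / 0x1000);
    return 3 + (int)((addr - 0xC0000) / 0x8000);
}
```

For example, address 9D000h falls in MTRR_fix16K_80000 (register index 1), field 7, which per Table 17 covers 9C000h-9FFFFh.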
Variable-Range MTRRs
A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones. However, it is recommended not to create an overlap.
The upper two variable MTRRs should not be used by the BIOS and are reserved for operating system use.
Variable-Range MTRR Register Format
The variable address range is power-of-2 sized and aligned. The range of supported sizes is from 2^12 to 2^36 in powers of 2. The AMD Athlon processor does not implement A[35:32].
Bits 63-36  Reserved
Bits 35-12  Physical Base  Base address in Register Pair
Bits 11-8   Reserved
Bits 7-0    Type           See MTRR Types and Properties

Figure 16. MTRRphysBasen Register Format

Note: A software attempt to write to reserved bits will generate a general protection exception.

Physical Base  Specifies a 24-bit value which is extended by 12 bits to form the base address of the region defined in the register pair.
Type           Specifies the memory type for the range (see Table 13).
Bits 63-36  Reserved
Bits 35-12  Physical Mask  24-Bit Mask
Bit 11      V              Variable Range Register Pair Enabled (V = 0 at reset)
Bits 10-0   Reserved

Figure 17. MTRRphysMaskn Register Format

Note: A software attempt to write to reserved bits will generate a general protection exception.

Physical Mask  Specifies a 24-bit mask to determine the range of the region defined in the register pair.
V              Enables the register pair when set (V = 0 at reset).
Mask values can represent discontinuous ranges (when the  
mask defines a lower significant bit as zero and a higher  
significant bit as one). In a discontinuous range, the memory  
area not mapped by the mask value is set to the default type.  
Discontinuous ranges should not be used.  
The range that is mapped by the variable-range MTRR register pair must meet the following range size and alignment rule:
Each defined memory range must have a size equal to 2^n (12 <= n <= 36).
The base address for the address pair must be aligned to a similar 2^n boundary.
An example of a variable MTRR pair is as follows:
To map the address range from 8 Mbytes (0080_0000h) to 16 Mbytes (00FF_FFFFh) as writeback memory, the base register should be loaded with 80_0006h, and the mask should be loaded with F_FF80_0800h.
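The register values for a power-of-2 sized, naturally aligned region can be derived mechanically. The helpers below sketch that computation under the 36-bit layout of Figures 16 and 17; the function names are illustrative.

```c
#include <stdint.h>

/* Build the MTRRphysBasen value: physical base (bits 35:12) plus the
   memory type encoding in bits 7:0. Hypothetical helper. */
uint64_t mtrr_phys_base(uint64_t base, uint8_t type)
{
    return (base & 0xFFFFFF000ULL) | type;
}

/* Build the MTRRphysMaskn value for a power-of-2 region size: the mask
   covers bits 35:12, and bit 11 is the V (valid) bit. */
uint64_t mtrr_phys_mask(uint64_t size)
{
    return (~(size - 1) & 0xFFFFFF000ULL) | (1ULL << 11);
}
```

Applying these to the 8-Mbyte writeback region above (base 0080_0000h, type 06h, size 0080_0000h) reproduces the base value 80_0006h and the mask value F_FF80_0800h.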
MTRR MSR Format  
This table defines the model-specific registers related to the  
memory type range register implementation. All MTRRs are  
defined to be 64 bits.  
Table 18. MTRR-Related Model-Specific Register (MSR) Map

Register Address   Register Name
0FEh               MTRRcap
200h               MTRR Base0
201h               MTRR Mask0
202h               MTRR Base1
203h               MTRR Mask1
204h               MTRR Base2
205h               MTRR Mask2
206h               MTRR Base3
207h               MTRR Mask3
208h               MTRR Base4
209h               MTRR Mask4
20Ah               MTRR Base5
20Bh               MTRR Mask5
20Ch               MTRR Base6
20Dh               MTRR Mask6
20Eh               MTRR Base7
20Fh               MTRR Mask7
250h               MTRRFIX64k_00000
258h               MTRRFIX16k_80000
259h               MTRRFIX16k_A0000
268h               MTRRFIX4k_C0000
269h               MTRRFIX4k_C8000
26Ah               MTRRFIX4k_D0000
26Bh               MTRRFIX4k_D8000
26Ch               MTRRFIX4k_E0000
26Dh               MTRRFIX4k_E8000
26Eh               MTRRFIX4k_F0000
26Fh               MTRRFIX4k_F8000
2FFh               MTRRdefType
Appendix F  
Instruction Dispatch and Execution Resources
This appendix describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations. Tables 19 through 24, starting on page 188, define the integer, MMX, MMX extension, floating-point, 3DNow!, and 3DNow! extension instructions, respectively.
The first column in these tables indicates the instruction  
mnemonic and operand types with the following notations:  
reg8: byte integer register defined by instruction byte(s) or bits 5, 4, and 3 of the modR/M byte
mreg8: byte integer register defined by bits 2, 1, and 0 of the modR/M byte
reg16/32: word and doubleword integer register defined by instruction byte(s) or bits 5, 4, and 3 of the modR/M byte
mreg16/32: word and doubleword integer register defined by bits 2, 1, and 0 of the modR/M byte
mem8: byte memory location
mem16/32: word or doubleword memory location
mem32/48: doubleword or 6-byte memory location
mem48: 48-bit integer value in memory
mem64: 64-bit value in memory
imm8/16/32: 8-bit, 16-bit, or 32-bit immediate value
disp8: 8-bit displacement value
disp16/32: 16-bit or 32-bit displacement value
disp32/48: 32-bit or 48-bit displacement value
eXX: register width depending on the operand size
mem32real: 32-bit floating-point value in memory
mem64real: 64-bit floating-point value in memory
mem80real: 80-bit floating-point value in memory
mmreg: MMX/3DNow! register
mmreg1: MMX/3DNow! register defined by bits 5, 4, and 3 of the modR/M byte
mmreg2: MMX/3DNow! register defined by bits 2, 1, and 0 of the modR/M byte
The second and third columns list all applicable encoding  
opcode bytes.  
The fourth column lists the modR/M byte used by the  
instruction. The modR/M byte defines the instruction as  
register or memory form. If mod bits 7 and 6 are documented as  
mm (memory form), mm can only be 10b, 01b, or 00b.  
The fifth column lists the type of instruction decode, DirectPath or VectorPath (see "DirectPath Decoder" for more information). The AMD Athlon™ processor's enhanced decode logic can process three instructions per clock.
The FPU, MMX, and 3DNow! instruction tables have an  
additional column that lists the possible FPU execution  
pipelines available for use by any particular DirectPath  
decoded operation. Typically, VectorPath instructions require  
more than one execution pipe resource.  
Table 19. Integer Instructions  
Instruction Mnemonic  
First Second ModR/M  
Decode  
Type  
Byte Byte  
Byte  
AAA  
AAD  
AAM  
AAS  
37h  
VectorPath  
VectorPath  
VectorPath  
VectorPath  
D5h  
D4h  
3Fh  
0Ah  
0Ah  
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
ADC mreg8, reg8                              10h            11-xxx-xxx  DirectPath
ADC mem8, reg8                               10h            mm-xxx-xxx  DirectPath
ADC mreg16/32, reg16/32                      11h            11-xxx-xxx  DirectPath
ADC mem16/32, reg16/32                       11h            mm-xxx-xxx  DirectPath
ADC reg8, mreg8                              12h            11-xxx-xxx  DirectPath
ADC reg8, mem8                               12h            mm-xxx-xxx  DirectPath
ADC reg16/32, mreg16/32                      13h            11-xxx-xxx  DirectPath
ADC reg16/32, mem16/32                       13h            mm-xxx-xxx  DirectPath
ADC AL, imm8                                 14h                        DirectPath
ADC EAX, imm16/32                            15h                        DirectPath
ADC mreg8, imm8                              80h            11-010-xxx  DirectPath
ADC mem8, imm8                               80h            mm-010-xxx  DirectPath
ADC mreg16/32, imm16/32                      81h            11-010-xxx  DirectPath
ADC mem16/32, imm16/32                       81h            mm-010-xxx  DirectPath
ADC mreg16/32, imm8 (sign extended)          83h            11-010-xxx  DirectPath
ADC mem16/32, imm8 (sign extended)           83h            mm-010-xxx  DirectPath
ADD mreg8, reg8                              00h            11-xxx-xxx  DirectPath
ADD mem8, reg8                               00h            mm-xxx-xxx  DirectPath
ADD mreg16/32, reg16/32                      01h            11-xxx-xxx  DirectPath
ADD mem16/32, reg16/32                       01h            mm-xxx-xxx  DirectPath
ADD reg8, mreg8                              02h            11-xxx-xxx  DirectPath
ADD reg8, mem8                               02h            mm-xxx-xxx  DirectPath
ADD reg16/32, mreg16/32                      03h            11-xxx-xxx  DirectPath
ADD reg16/32, mem16/32                       03h            mm-xxx-xxx  DirectPath
ADD AL, imm8                                 04h                        DirectPath
ADD EAX, imm16/32                            05h                        DirectPath
ADD mreg8, imm8                              80h            11-000-xxx  DirectPath
ADD mem8, imm8                               80h            mm-000-xxx  DirectPath
ADD mreg16/32, imm16/32                      81h            11-000-xxx  DirectPath
ADD mem16/32, imm16/32                       81h            mm-000-xxx  DirectPath
ADD mreg16/32, imm8 (sign extended)          83h            11-000-xxx  DirectPath
ADD mem16/32, imm8 (sign extended)           83h            mm-000-xxx  DirectPath
AND mreg8, reg8                              20h            11-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
AND mem8, reg8                               20h            mm-xxx-xxx  DirectPath
AND mreg16/32, reg16/32                      21h            11-xxx-xxx  DirectPath
AND mem16/32, reg16/32                       21h            mm-xxx-xxx  DirectPath
AND reg8, mreg8                              22h            11-xxx-xxx  DirectPath
AND reg8, mem8                               22h            mm-xxx-xxx  DirectPath
AND reg16/32, mreg16/32                      23h            11-xxx-xxx  DirectPath
AND reg16/32, mem16/32                       23h            mm-xxx-xxx  DirectPath
AND AL, imm8                                 24h                        DirectPath
AND EAX, imm16/32                            25h                        DirectPath
AND mreg8, imm8                              80h            11-100-xxx  DirectPath
AND mem8, imm8                               80h            mm-100-xxx  DirectPath
AND mreg16/32, imm16/32                      81h            11-100-xxx  DirectPath
AND mem16/32, imm16/32                       81h            mm-100-xxx  DirectPath
AND mreg16/32, imm8 (sign extended)          83h            11-100-xxx  DirectPath
AND mem16/32, imm8 (sign extended)           83h            mm-100-xxx  DirectPath
ARPL mreg16, reg16                           63h            11-xxx-xxx  VectorPath
ARPL mem16, reg16                            63h            mm-xxx-xxx  VectorPath
BOUND                                        62h                        VectorPath
BSF reg16/32, mreg16/32                      0Fh    BCh     11-xxx-xxx  VectorPath
BSF reg16/32, mem16/32                       0Fh    BCh     mm-xxx-xxx  VectorPath
BSR reg16/32, mreg16/32                      0Fh    BDh     11-xxx-xxx  VectorPath
BSR reg16/32, mem16/32                       0Fh    BDh     mm-xxx-xxx  VectorPath
BSWAP EAX                                    0Fh    C8h                 DirectPath
BSWAP ECX                                    0Fh    C9h                 DirectPath
BSWAP EDX                                    0Fh    CAh                 DirectPath
BSWAP EBX                                    0Fh    CBh                 DirectPath
BSWAP ESP                                    0Fh    CCh                 DirectPath
BSWAP EBP                                    0Fh    CDh                 DirectPath
BSWAP ESI                                    0Fh    CEh                 DirectPath
BSWAP EDI                                    0Fh    CFh                 DirectPath
BT mreg16/32, reg16/32                       0Fh    A3h     11-xxx-xxx  DirectPath
BT mem16/32, reg16/32                        0Fh    A3h     mm-xxx-xxx  VectorPath
BT mreg16/32, imm8                           0Fh    BAh     11-100-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
BT mem16/32, imm8                            0Fh    BAh     mm-100-xxx  DirectPath
BTC mreg16/32, reg16/32                      0Fh    BBh     11-xxx-xxx  VectorPath
BTC mem16/32, reg16/32                       0Fh    BBh     mm-xxx-xxx  VectorPath
BTC mreg16/32, imm8                          0Fh    BAh     11-111-xxx  VectorPath
BTC mem16/32, imm8                           0Fh    BAh     mm-111-xxx  VectorPath
BTR mreg16/32, reg16/32                      0Fh    B3h     11-xxx-xxx  VectorPath
BTR mem16/32, reg16/32                       0Fh    B3h     mm-xxx-xxx  VectorPath
BTR mreg16/32, imm8                          0Fh    BAh     11-110-xxx  VectorPath
BTR mem16/32, imm8                           0Fh    BAh     mm-110-xxx  VectorPath
BTS mreg16/32, reg16/32                      0Fh    ABh     11-xxx-xxx  VectorPath
BTS mem16/32, reg16/32                       0Fh    ABh     mm-xxx-xxx  VectorPath
BTS mreg16/32, imm8                          0Fh    BAh     11-101-xxx  VectorPath
BTS mem16/32, imm8                           0Fh    BAh     mm-101-xxx  VectorPath
CALL full pointer                            9Ah                        VectorPath
CALL near imm16/32                           E8h                        VectorPath
CALL mem16:16/32                             FFh            mm-011-xxx  VectorPath
CALL near mreg32 (indirect)                  FFh            11-010-xxx  VectorPath
CALL near mem32 (indirect)                   FFh            mm-010-xxx  VectorPath
CBW/CWDE                                     98h                        DirectPath
CLC                                          F8h                        DirectPath
CLD                                          FCh                        VectorPath
CLI                                          FAh                        VectorPath
CLTS                                         0Fh    06h                 VectorPath
CMC                                          F5h                        DirectPath
CMOVA/CMOVNBE reg16/32, reg16/32             0Fh    47h     11-xxx-xxx  DirectPath
CMOVA/CMOVNBE reg16/32, mem16/32             0Fh    47h     mm-xxx-xxx  DirectPath
CMOVAE/CMOVNB/CMOVNC reg16/32, reg16/32      0Fh    43h     11-xxx-xxx  DirectPath
CMOVAE/CMOVNB/CMOVNC reg16/32, mem16/32      0Fh    43h     mm-xxx-xxx  DirectPath
CMOVB/CMOVC/CMOVNAE reg16/32, reg16/32       0Fh    42h     11-xxx-xxx  DirectPath
CMOVB/CMOVC/CMOVNAE reg16/32, mem16/32       0Fh    42h     mm-xxx-xxx  DirectPath
CMOVBE/CMOVNA reg16/32, reg16/32             0Fh    46h     11-xxx-xxx  DirectPath
CMOVBE/CMOVNA reg16/32, mem16/32             0Fh    46h     mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
CMOVE/CMOVZ reg16/32, reg16/32               0Fh    44h     11-xxx-xxx  DirectPath
CMOVE/CMOVZ reg16/32, mem16/32               0Fh    44h     mm-xxx-xxx  DirectPath
CMOVG/CMOVNLE reg16/32, reg16/32             0Fh    4Fh     11-xxx-xxx  DirectPath
CMOVG/CMOVNLE reg16/32, mem16/32             0Fh    4Fh     mm-xxx-xxx  DirectPath
CMOVGE/CMOVNL reg16/32, reg16/32             0Fh    4Dh     11-xxx-xxx  DirectPath
CMOVGE/CMOVNL reg16/32, mem16/32             0Fh    4Dh     mm-xxx-xxx  DirectPath
CMOVL/CMOVNGE reg16/32, reg16/32             0Fh    4Ch     11-xxx-xxx  DirectPath
CMOVL/CMOVNGE reg16/32, mem16/32             0Fh    4Ch     mm-xxx-xxx  DirectPath
CMOVLE/CMOVNG reg16/32, reg16/32             0Fh    4Eh     11-xxx-xxx  DirectPath
CMOVLE/CMOVNG reg16/32, mem16/32             0Fh    4Eh     mm-xxx-xxx  DirectPath
CMOVNE/CMOVNZ reg16/32, reg16/32             0Fh    45h     11-xxx-xxx  DirectPath
CMOVNE/CMOVNZ reg16/32, mem16/32             0Fh    45h     mm-xxx-xxx  DirectPath
CMOVNO reg16/32, reg16/32                    0Fh    41h     11-xxx-xxx  DirectPath
CMOVNO reg16/32, mem16/32                    0Fh    41h     mm-xxx-xxx  DirectPath
CMOVNP/CMOVPO reg16/32, reg16/32             0Fh    4Bh     11-xxx-xxx  DirectPath
CMOVNP/CMOVPO reg16/32, mem16/32             0Fh    4Bh     mm-xxx-xxx  DirectPath
CMOVNS reg16/32, reg16/32                    0Fh    49h     11-xxx-xxx  DirectPath
CMOVNS reg16/32, mem16/32                    0Fh    49h     mm-xxx-xxx  DirectPath
CMOVO reg16/32, reg16/32                     0Fh    40h     11-xxx-xxx  DirectPath
CMOVO reg16/32, mem16/32                     0Fh    40h     mm-xxx-xxx  DirectPath
CMOVP/CMOVPE reg16/32, reg16/32              0Fh    4Ah     11-xxx-xxx  DirectPath
CMOVP/CMOVPE reg16/32, mem16/32              0Fh    4Ah     mm-xxx-xxx  DirectPath
CMOVS reg16/32, reg16/32                     0Fh    48h     11-xxx-xxx  DirectPath
CMOVS reg16/32, mem16/32                     0Fh    48h     mm-xxx-xxx  DirectPath
CMP mreg8, reg8                              38h            11-xxx-xxx  DirectPath
CMP mem8, reg8                               38h            mm-xxx-xxx  DirectPath
CMP mreg16/32, reg16/32                      39h            11-xxx-xxx  DirectPath
CMP mem16/32, reg16/32                       39h            mm-xxx-xxx  DirectPath
CMP reg8, mreg8                              3Ah            11-xxx-xxx  DirectPath
CMP reg8, mem8                               3Ah            mm-xxx-xxx  DirectPath
CMP reg16/32, mreg16/32                      3Bh            11-xxx-xxx  DirectPath
CMP reg16/32, mem16/32                       3Bh            mm-xxx-xxx  DirectPath
CMP AL, imm8                                 3Ch                        DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
CMP EAX, imm16/32                            3Dh                        DirectPath
CMP mreg8, imm8                              80h            11-111-xxx  DirectPath
CMP mem8, imm8                               80h            mm-111-xxx  DirectPath
CMP mreg16/32, imm16/32                      81h            11-111-xxx  DirectPath
CMP mem16/32, imm16/32                       81h            mm-111-xxx  DirectPath
CMP mreg16/32, imm8 (sign extended)          83h            11-111-xxx  DirectPath
CMP mem16/32, imm8 (sign extended)           83h            mm-111-xxx  DirectPath
CMPSB mem8, mem8                             A6h                        VectorPath
CMPSW mem16, mem16                           A7h                        VectorPath
CMPSD mem32, mem32                           A7h                        VectorPath
CMPXCHG mreg8, reg8                          0Fh    B0h     11-xxx-xxx  VectorPath
CMPXCHG mem8, reg8                           0Fh    B0h     mm-xxx-xxx  VectorPath
CMPXCHG mreg16/32, reg16/32                  0Fh    B1h     11-xxx-xxx  VectorPath
CMPXCHG mem16/32, reg16/32                   0Fh    B1h     mm-xxx-xxx  VectorPath
CMPXCHG8B mem64                              0Fh    C7h     mm-xxx-xxx  VectorPath
CPUID                                        0Fh    A2h                 VectorPath
CWD/CDQ                                      99h                        DirectPath
DAA                                          27h                        VectorPath
DAS                                          2Fh                        VectorPath
DEC EAX                                      48h                        DirectPath
DEC ECX                                      49h                        DirectPath
DEC EDX                                      4Ah                        DirectPath
DEC EBX                                      4Bh                        DirectPath
DEC ESP                                      4Ch                        DirectPath
DEC EBP                                      4Dh                        DirectPath
DEC ESI                                      4Eh                        DirectPath
DEC EDI                                      4Fh                        DirectPath
DEC mreg8                                    FEh            11-001-xxx  DirectPath
DEC mem8                                     FEh            mm-001-xxx  DirectPath
DEC mreg16/32                                FFh            11-001-xxx  DirectPath
DEC mem16/32                                 FFh            mm-001-xxx  DirectPath
DIV AL, mreg8                                F6h            11-110-xxx  VectorPath
DIV AL, mem8                                 F6h            mm-110-xxx  VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
DIV EAX, mreg16/32                           F7h            11-110-xxx  VectorPath
DIV EAX, mem16/32                            F7h            mm-110-xxx  VectorPath
ENTER                                        C8h                        VectorPath
IDIV mreg8                                   F6h            11-111-xxx  VectorPath
IDIV mem8                                    F6h            mm-111-xxx  VectorPath
IDIV EAX, mreg16/32                          F7h            11-111-xxx  VectorPath
IDIV EAX, mem16/32                           F7h            mm-111-xxx  VectorPath
IMUL reg16/32, imm16/32                      69h            11-xxx-xxx  VectorPath
IMUL reg16/32, mreg16/32, imm16/32           69h            11-xxx-xxx  VectorPath
IMUL reg16/32, mem16/32, imm16/32            69h            mm-xxx-xxx  VectorPath
IMUL reg16/32, imm8 (sign extended)          6Bh            11-xxx-xxx  VectorPath
IMUL reg16/32, mreg16/32, imm8 (signed)      6Bh            11-xxx-xxx  VectorPath
IMUL reg16/32, mem16/32, imm8 (signed)       6Bh            mm-xxx-xxx  VectorPath
IMUL AX, AL, mreg8                           F6h            11-101-xxx  VectorPath
IMUL AX, AL, mem8                            F6h            mm-101-xxx  VectorPath
IMUL EDX:EAX, EAX, mreg16/32                 F7h            11-101-xxx  VectorPath
IMUL EDX:EAX, EAX, mem16/32                  F7h            mm-101-xxx  VectorPath
IMUL reg16/32, mreg16/32                     0Fh    AFh     11-xxx-xxx  VectorPath
IMUL reg16/32, mem16/32                      0Fh    AFh     mm-xxx-xxx  VectorPath
IN AL, imm8                                  E4h                        VectorPath
IN AX, imm8                                  E5h                        VectorPath
IN EAX, imm8                                 E5h                        VectorPath
IN AL, DX                                    ECh                        VectorPath
IN AX, DX                                    EDh                        VectorPath
IN EAX, DX                                   EDh                        VectorPath
INC EAX                                      40h                        DirectPath
INC ECX                                      41h                        DirectPath
INC EDX                                      42h                        DirectPath
INC EBX                                      43h                        DirectPath
INC ESP                                      44h                        DirectPath
INC EBP                                      45h                        DirectPath
INC ESI                                      46h                        DirectPath
INC EDI                                      47h                        DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
INC mreg8                                    FEh            11-000-xxx  DirectPath
INC mem8                                     FEh            mm-000-xxx  DirectPath
INC mreg16/32                                FFh            11-000-xxx  DirectPath
INC mem16/32                                 FFh            mm-000-xxx  DirectPath
INVD                                         0Fh    08h                 VectorPath
INVLPG                                       0Fh    01h     mm-111-xxx  VectorPath
JO short disp8                               70h                        DirectPath
JNO short disp8                              71h                        DirectPath
JB/JNAE/JC short disp8                       72h                        DirectPath
JNB/JAE/JNC short disp8                      73h                        DirectPath
JZ/JE short disp8                            74h                        DirectPath
JNZ/JNE short disp8                          75h                        DirectPath
JBE/JNA short disp8                          76h                        DirectPath
JNBE/JA short disp8                          77h                        DirectPath
JS short disp8                               78h                        DirectPath
JNS short disp8                              79h                        DirectPath
JP/JPE short disp8                           7Ah                        DirectPath
JNP/JPO short disp8                          7Bh                        DirectPath
JL/JNGE short disp8                          7Ch                        DirectPath
JNL/JGE short disp8                          7Dh                        DirectPath
JLE/JNG short disp8                          7Eh                        DirectPath
JNLE/JG short disp8                          7Fh                        DirectPath
JCXZ/JECXZ short disp8                       E3h                        VectorPath
JO near disp16/32                            0Fh    80h                 DirectPath
JNO near disp16/32                           0Fh    81h                 DirectPath
JB/JNAE near disp16/32                       0Fh    82h                 DirectPath
JNB/JAE near disp16/32                       0Fh    83h                 DirectPath
JZ/JE near disp16/32                         0Fh    84h                 DirectPath
JNZ/JNE near disp16/32                       0Fh    85h                 DirectPath
JBE/JNA near disp16/32                       0Fh    86h                 DirectPath
JNBE/JA near disp16/32                       0Fh    87h                 DirectPath
JS near disp16/32                            0Fh    88h                 DirectPath
JNS near disp16/32                           0Fh    89h                 DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
JP/JPE near disp16/32                        0Fh    8Ah                 DirectPath
JNP/JPO near disp16/32                       0Fh    8Bh                 DirectPath
JL/JNGE near disp16/32                       0Fh    8Ch                 DirectPath
JNL/JGE near disp16/32                       0Fh    8Dh                 DirectPath
JLE/JNG near disp16/32                       0Fh    8Eh                 DirectPath
JNLE/JG near disp16/32                       0Fh    8Fh                 DirectPath
JMP near disp16/32 (direct)                  E9h                        DirectPath
JMP far disp32/48 (direct)                   EAh                        VectorPath
JMP disp8 (short)                            EBh                        DirectPath
JMP far mem32 (indirect)                     FFh            mm-101-xxx  VectorPath
JMP far mreg32 (indirect)                    FFh            mm-101-xxx  VectorPath
JMP near mreg16/32 (indirect)                FFh            11-100-xxx  DirectPath
JMP near mem16/32 (indirect)                 FFh            mm-100-xxx  DirectPath
LAHF                                         9Fh                        VectorPath
LAR reg16/32, mreg16/32                      0Fh    02h     11-xxx-xxx  VectorPath
LAR reg16/32, mem16/32                       0Fh    02h     mm-xxx-xxx  VectorPath
LDS reg16/32, mem32/48                       C5h            mm-xxx-xxx  VectorPath
LEA reg16, mem16/32                          8Dh            mm-xxx-xxx  VectorPath
LEA reg32, mem16/32                          8Dh            mm-xxx-xxx  DirectPath
LEAVE                                        C9h                        VectorPath
LES reg16/32, mem32/48                       C4h            mm-xxx-xxx  VectorPath
LFS reg16/32, mem32/48                       0Fh    B4h     mm-xxx-xxx  VectorPath
LGDT mem48                                   0Fh    01h     mm-010-xxx  VectorPath
LGS reg16/32, mem32/48                       0Fh    B5h     mm-xxx-xxx  VectorPath
LIDT mem48                                   0Fh    01h     mm-011-xxx  VectorPath
LLDT mreg16                                  0Fh    00h     11-010-xxx  VectorPath
LLDT mem16                                   0Fh    00h     mm-010-xxx  VectorPath
LMSW mreg16                                  0Fh    01h     11-100-xxx  VectorPath
LMSW mem16                                   0Fh    01h     mm-100-xxx  VectorPath
LODSB AL, mem8                               ACh                        VectorPath
LODSW AX, mem16                              ADh                        VectorPath
LODSD EAX, mem32                             ADh                        VectorPath
LOOP disp8                                   E2h                        VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
LOOPE/LOOPZ disp8                            E1h                        VectorPath
LOOPNE/LOOPNZ disp8                          E0h                        VectorPath
LSL reg16/32, mreg16/32                      0Fh    03h     11-xxx-xxx  VectorPath
LSL reg16/32, mem16/32                       0Fh    03h     mm-xxx-xxx  VectorPath
LSS reg16/32, mem32/48                       0Fh    B2h     mm-xxx-xxx  VectorPath
LTR mreg16                                   0Fh    00h     11-011-xxx  VectorPath
LTR mem16                                    0Fh    00h     mm-011-xxx  VectorPath
MOV mreg8, reg8                              88h            11-xxx-xxx  DirectPath
MOV mem8, reg8                               88h            mm-xxx-xxx  DirectPath
MOV mreg16/32, reg16/32                      89h            11-xxx-xxx  DirectPath
MOV mem16/32, reg16/32                       89h            mm-xxx-xxx  DirectPath
MOV reg8, mreg8                              8Ah            11-xxx-xxx  DirectPath
MOV reg8, mem8                               8Ah            mm-xxx-xxx  DirectPath
MOV reg16/32, mreg16/32                      8Bh            11-xxx-xxx  DirectPath
MOV reg16/32, mem16/32                       8Bh            mm-xxx-xxx  DirectPath
MOV mreg16, segment reg                      8Ch            11-xxx-xxx  VectorPath
MOV mem16, segment reg                       8Ch            mm-xxx-xxx  VectorPath
MOV segment reg, mreg16                      8Eh            11-xxx-xxx  VectorPath
MOV segment reg, mem16                       8Eh            mm-xxx-xxx  VectorPath
MOV AL, mem8                                 A0h                        DirectPath
MOV EAX, mem16/32                            A1h                        DirectPath
MOV mem8, AL                                 A2h                        DirectPath
MOV mem16/32, EAX                            A3h                        DirectPath
MOV AL, imm8                                 B0h                        DirectPath
MOV CL, imm8                                 B1h                        DirectPath
MOV DL, imm8                                 B2h                        DirectPath
MOV BL, imm8                                 B3h                        DirectPath
MOV AH, imm8                                 B4h                        DirectPath
MOV CH, imm8                                 B5h                        DirectPath
MOV DH, imm8                                 B6h                        DirectPath
MOV BH, imm8                                 B7h                        DirectPath
MOV EAX, imm16/32                            B8h                        DirectPath
MOV ECX, imm16/32                            B9h                        DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
MOV EDX, imm16/32                            BAh                        DirectPath
MOV EBX, imm16/32                            BBh                        DirectPath
MOV ESP, imm16/32                            BCh                        DirectPath
MOV EBP, imm16/32                            BDh                        DirectPath
MOV ESI, imm16/32                            BEh                        DirectPath
MOV EDI, imm16/32                            BFh                        DirectPath
MOV mreg8, imm8                              C6h            11-000-xxx  DirectPath
MOV mem8, imm8                               C6h            mm-000-xxx  DirectPath
MOV mreg16/32, imm16/32                      C7h            11-000-xxx  DirectPath
MOV mem16/32, imm16/32                       C7h            mm-000-xxx  DirectPath
MOVSB mem8, mem8                             A4h                        VectorPath
MOVSW mem16, mem16                           A5h                        VectorPath
MOVSD mem32, mem32                           A5h                        VectorPath
MOVSX reg16/32, mreg8                        0Fh    BEh     11-xxx-xxx  DirectPath
MOVSX reg16/32, mem8                         0Fh    BEh     mm-xxx-xxx  DirectPath
MOVSX reg32, mreg16                          0Fh    BFh     11-xxx-xxx  DirectPath
MOVSX reg32, mem16                           0Fh    BFh     mm-xxx-xxx  DirectPath
MOVZX reg16/32, mreg8                        0Fh    B6h     11-xxx-xxx  DirectPath
MOVZX reg16/32, mem8                         0Fh    B6h     mm-xxx-xxx  DirectPath
MOVZX reg32, mreg16                          0Fh    B7h     11-xxx-xxx  DirectPath
MOVZX reg32, mem16                           0Fh    B7h     mm-xxx-xxx  DirectPath
MUL AL, mreg8                                F6h            11-100-xxx  VectorPath
MUL AL, mem8                                 F6h            mm-100-xxx  VectorPath
MUL AX, mreg16                               F7h            11-100-xxx  VectorPath
MUL AX, mem16                                F7h            mm-100-xxx  VectorPath
MUL EAX, mreg32                              F7h            11-100-xxx  VectorPath
MUL EAX, mem32                               F7h            mm-100-xxx  VectorPath
NEG mreg8                                    F6h            11-011-xxx  DirectPath
NEG mem8                                     F6h            mm-011-xxx  DirectPath
NEG mreg16/32                                F7h            11-011-xxx  DirectPath
NEG mem16/32                                 F7h            mm-011-xxx  DirectPath
NOP (XCHG EAX, EAX)                          90h                        DirectPath
NOT mreg8                                    F6h            11-010-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
NOT mem8                                     F6h            mm-010-xxx  DirectPath
NOT mreg16/32                                F7h            11-010-xxx  DirectPath
NOT mem16/32                                 F7h            mm-010-xxx  DirectPath
OR mreg8, reg8                               08h            11-xxx-xxx  DirectPath
OR mem8, reg8                                08h            mm-xxx-xxx  DirectPath
OR mreg16/32, reg16/32                       09h            11-xxx-xxx  DirectPath
OR mem16/32, reg16/32                        09h            mm-xxx-xxx  DirectPath
OR reg8, mreg8                               0Ah            11-xxx-xxx  DirectPath
OR reg8, mem8                                0Ah            mm-xxx-xxx  DirectPath
OR reg16/32, mreg16/32                       0Bh            11-xxx-xxx  DirectPath
OR reg16/32, mem16/32                        0Bh            mm-xxx-xxx  DirectPath
OR AL, imm8                                  0Ch                        DirectPath
OR EAX, imm16/32                             0Dh                        DirectPath
OR mreg8, imm8                               80h            11-001-xxx  DirectPath
OR mem8, imm8                                80h            mm-001-xxx  DirectPath
OR mreg16/32, imm16/32                       81h            11-001-xxx  DirectPath
OR mem16/32, imm16/32                        81h            mm-001-xxx  DirectPath
OR mreg16/32, imm8 (sign extended)           83h            11-001-xxx  DirectPath
OR mem16/32, imm8 (sign extended)            83h            mm-001-xxx  DirectPath
OUT imm8, AL                                 E6h                        VectorPath
OUT imm8, AX                                 E7h                        VectorPath
OUT imm8, EAX                                E7h                        VectorPath
OUT DX, AL                                   EEh                        VectorPath
OUT DX, AX                                   EFh                        VectorPath
OUT DX, EAX                                  EFh                        VectorPath
POP ES                                       07h                        VectorPath
POP SS                                       17h                        VectorPath
POP DS                                       1Fh                        VectorPath
POP FS                                       0Fh    A1h                 VectorPath
POP GS                                       0Fh    A9h                 VectorPath
POP EAX                                      58h                        VectorPath
POP ECX                                      59h                        VectorPath
POP EDX                                      5Ah                        VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
POP EBX                                      5Bh                        VectorPath
POP ESP                                      5Ch                        VectorPath
POP EBP                                      5Dh                        VectorPath
POP ESI                                      5Eh                        VectorPath
POP EDI                                      5Fh                        VectorPath
POP mreg16/32                                8Fh            11-000-xxx  VectorPath
POP mem16/32                                 8Fh            mm-000-xxx  VectorPath
POPA/POPAD                                   61h                        VectorPath
POPF/POPFD                                   9Dh                        VectorPath
PUSH ES                                      06h                        VectorPath
PUSH CS                                      0Eh                        VectorPath
PUSH FS                                      0Fh    A0h                 VectorPath
PUSH GS                                      0Fh    A8h                 VectorPath
PUSH SS                                      16h                        VectorPath
PUSH DS                                      1Eh                        VectorPath
PUSH EAX                                     50h                        DirectPath
PUSH ECX                                     51h                        DirectPath
PUSH EDX                                     52h                        DirectPath
PUSH EBX                                     53h                        DirectPath
PUSH ESP                                     54h                        DirectPath
PUSH EBP                                     55h                        DirectPath
PUSH ESI                                     56h                        DirectPath
PUSH EDI                                     57h                        DirectPath
PUSH imm8                                    6Ah                        DirectPath
PUSH imm16/32                                68h                        DirectPath
PUSH mreg16/32                               FFh            11-110-xxx  VectorPath
PUSH mem16/32                                FFh            mm-110-xxx  VectorPath
PUSHA/PUSHAD                                 60h                        VectorPath
PUSHF/PUSHFD                                 9Ch                        VectorPath
RCL mreg8, imm8                              C0h            11-010-xxx  DirectPath
RCL mem8, imm8                               C0h            mm-010-xxx  VectorPath
RCL mreg16/32, imm8                          C1h            11-010-xxx  DirectPath
RCL mem16/32, imm8                           C1h            mm-010-xxx  VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
RCL mreg8, 1                                 D0h            11-010-xxx  DirectPath
RCL mem8, 1                                  D0h            mm-010-xxx  DirectPath
RCL mreg16/32, 1                             D1h            11-010-xxx  DirectPath
RCL mem16/32, 1                              D1h            mm-010-xxx  DirectPath
RCL mreg8, CL                                D2h            11-010-xxx  DirectPath
RCL mem8, CL                                 D2h            mm-010-xxx  VectorPath
RCL mreg16/32, CL                            D3h            11-010-xxx  DirectPath
RCL mem16/32, CL                             D3h            mm-010-xxx  VectorPath
RCR mreg8, imm8                              C0h            11-011-xxx  DirectPath
RCR mem8, imm8                               C0h            mm-011-xxx  VectorPath
RCR mreg16/32, imm8                          C1h            11-011-xxx  DirectPath
RCR mem16/32, imm8                           C1h            mm-011-xxx  VectorPath
RCR mreg8, 1                                 D0h            11-011-xxx  DirectPath
RCR mem8, 1                                  D0h            mm-011-xxx  DirectPath
RCR mreg16/32, 1                             D1h            11-011-xxx  DirectPath
RCR mem16/32, 1                              D1h            mm-011-xxx  DirectPath
RCR mreg8, CL                                D2h            11-011-xxx  DirectPath
RCR mem8, CL                                 D2h            mm-011-xxx  VectorPath
RCR mreg16/32, CL                            D3h            11-011-xxx  DirectPath
RCR mem16/32, CL                             D3h            mm-011-xxx  VectorPath
RDMSR                                        0Fh    32h                 VectorPath
RDPMC                                        0Fh    33h                 VectorPath
RDTSC                                        0Fh    31h                 VectorPath
RET near imm16                               C2h                        VectorPath
RET near                                     C3h                        VectorPath
RET far imm16                                CAh                        VectorPath
RET far                                      CBh                        VectorPath
ROL mreg8, imm8                              C0h            11-000-xxx  DirectPath
ROL mem8, imm8                               C0h            mm-000-xxx  DirectPath
ROL mreg16/32, imm8                          C1h            11-000-xxx  DirectPath
ROL mem16/32, imm8                           C1h            mm-000-xxx  DirectPath
ROL mreg8, 1                                 D0h            11-000-xxx  DirectPath
ROL mem8, 1                                  D0h            mm-000-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
ROL mreg16/32, 1                             D1h            11-000-xxx  DirectPath
ROL mem16/32, 1                              D1h            mm-000-xxx  DirectPath
ROL mreg8, CL                                D2h            11-000-xxx  DirectPath
ROL mem8, CL                                 D2h            mm-000-xxx  DirectPath
ROL mreg16/32, CL                            D3h            11-000-xxx  DirectPath
ROL mem16/32, CL                             D3h            mm-000-xxx  DirectPath
ROR mreg8, imm8                              C0h            11-001-xxx  DirectPath
ROR mem8, imm8                               C0h            mm-001-xxx  DirectPath
ROR mreg16/32, imm8                          C1h            11-001-xxx  DirectPath
ROR mem16/32, imm8                           C1h            mm-001-xxx  DirectPath
ROR mreg8, 1                                 D0h            11-001-xxx  DirectPath
ROR mem8, 1                                  D0h            mm-001-xxx  DirectPath
ROR mreg16/32, 1                             D1h            11-001-xxx  DirectPath
ROR mem16/32, 1                              D1h            mm-001-xxx  DirectPath
ROR mreg8, CL                                D2h            11-001-xxx  DirectPath
ROR mem8, CL                                 D2h            mm-001-xxx  DirectPath
ROR mreg16/32, CL                            D3h            11-001-xxx  DirectPath
ROR mem16/32, CL                             D3h            mm-001-xxx  DirectPath
SAHF                                         9Eh                        VectorPath
SAR mreg8, imm8                              C0h            11-111-xxx  DirectPath
SAR mem8, imm8                               C0h            mm-111-xxx  DirectPath
SAR mreg16/32, imm8                          C1h            11-111-xxx  DirectPath
SAR mem16/32, imm8                           C1h            mm-111-xxx  DirectPath
SAR mreg8, 1                                 D0h            11-111-xxx  DirectPath
SAR mem8, 1                                  D0h            mm-111-xxx  DirectPath
SAR mreg16/32, 1                             D1h            11-111-xxx  DirectPath
SAR mem16/32, 1                              D1h            mm-111-xxx  DirectPath
SAR mreg8, CL                                D2h            11-111-xxx  DirectPath
SAR mem8, CL                                 D2h            mm-111-xxx  DirectPath
SAR mreg16/32, CL                            D3h            11-111-xxx  DirectPath
SAR mem16/32, CL                             D3h            mm-111-xxx  DirectPath
SBB mreg8, reg8                              18h            11-xxx-xxx  DirectPath
SBB mem8, reg8                               18h            mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
SBB mreg16/32, reg16/32                      19h            11-xxx-xxx  DirectPath
SBB mem16/32, reg16/32                       19h            mm-xxx-xxx  DirectPath
SBB reg8, mreg8                              1Ah            11-xxx-xxx  DirectPath
SBB reg8, mem8                               1Ah            mm-xxx-xxx  DirectPath
SBB reg16/32, mreg16/32                      1Bh            11-xxx-xxx  DirectPath
SBB reg16/32, mem16/32                       1Bh            mm-xxx-xxx  DirectPath
SBB AL, imm8                                 1Ch                        DirectPath
SBB EAX, imm16/32                            1Dh                        DirectPath
SBB mreg8, imm8                              80h            11-011-xxx  DirectPath
SBB mem8, imm8                               80h            mm-011-xxx  DirectPath
SBB mreg16/32, imm16/32                      81h            11-011-xxx  DirectPath
SBB mem16/32, imm16/32                       81h            mm-011-xxx  DirectPath
SBB mreg16/32, imm8 (sign extended)          83h            11-011-xxx  DirectPath
SBB mem16/32, imm8 (sign extended)           83h            mm-011-xxx  DirectPath
SCASB AL, mem8                               AEh                        VectorPath
SCASW AX, mem16                              AFh                        VectorPath
SCASD EAX, mem32                             AFh                        VectorPath
SETO mreg8                                   0Fh    90h     11-xxx-xxx  DirectPath
SETO mem8                                    0Fh    90h     mm-xxx-xxx  DirectPath
SETNO mreg8                                  0Fh    91h     11-xxx-xxx  DirectPath
SETNO mem8                                   0Fh    91h     mm-xxx-xxx  DirectPath
SETB/SETC/SETNAE mreg8                       0Fh    92h     11-xxx-xxx  DirectPath
SETB/SETC/SETNAE mem8                        0Fh    92h     mm-xxx-xxx  DirectPath
SETAE/SETNB/SETNC mreg8                      0Fh    93h     11-xxx-xxx  DirectPath
SETAE/SETNB/SETNC mem8                       0Fh    93h     mm-xxx-xxx  DirectPath
SETE/SETZ mreg8                              0Fh    94h     11-xxx-xxx  DirectPath
SETE/SETZ mem8                               0Fh    94h     mm-xxx-xxx  DirectPath
SETNE/SETNZ mreg8                            0Fh    95h     11-xxx-xxx  DirectPath
SETNE/SETNZ mem8                             0Fh    95h     mm-xxx-xxx  DirectPath
SETBE/SETNA mreg8                            0Fh    96h     11-xxx-xxx  DirectPath
SETBE/SETNA mem8                             0Fh    96h     mm-xxx-xxx  DirectPath
SETA/SETNBE mreg8                            0Fh    97h     11-xxx-xxx  DirectPath
SETA/SETNBE mem8                             0Fh    97h     mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
SETS mreg8                                   0Fh    98h     11-xxx-xxx  DirectPath
SETS mem8                                    0Fh    98h     mm-xxx-xxx  DirectPath
SETNS mreg8                                  0Fh    99h     11-xxx-xxx  DirectPath
SETNS mem8                                   0Fh    99h     mm-xxx-xxx  DirectPath
SETP/SETPE mreg8                             0Fh    9Ah     11-xxx-xxx  DirectPath
SETP/SETPE mem8                              0Fh    9Ah     mm-xxx-xxx  DirectPath
SETNP/SETPO mreg8                            0Fh    9Bh     11-xxx-xxx  DirectPath
SETNP/SETPO mem8                             0Fh    9Bh     mm-xxx-xxx  DirectPath
SETL/SETNGE mreg8                            0Fh    9Ch     11-xxx-xxx  DirectPath
SETL/SETNGE mem8                             0Fh    9Ch     mm-xxx-xxx  DirectPath
SETGE/SETNL mreg8                            0Fh    9Dh     11-xxx-xxx  DirectPath
SETGE/SETNL mem8                             0Fh    9Dh     mm-xxx-xxx  DirectPath
SETLE/SETNG mreg8                            0Fh    9Eh     11-xxx-xxx  DirectPath
SETLE/SETNG mem8                             0Fh    9Eh     mm-xxx-xxx  DirectPath
SETG/SETNLE mreg8                            0Fh    9Fh     11-xxx-xxx  DirectPath
SETG/SETNLE mem8                             0Fh    9Fh     mm-xxx-xxx  DirectPath
SGDT mem48                                   0Fh    01h     mm-000-xxx  VectorPath
SIDT mem48                                   0Fh    01h     mm-001-xxx  VectorPath
SHL/SAL mreg8, imm8                          C0h            11-100-xxx  DirectPath
SHL/SAL mem8, imm8                           C0h            mm-100-xxx  DirectPath
SHL/SAL mreg16/32, imm8                      C1h            11-100-xxx  DirectPath
SHL/SAL mem16/32, imm8                       C1h            mm-100-xxx  DirectPath
SHL/SAL mreg8, 1                             D0h            11-100-xxx  DirectPath
SHL/SAL mem8, 1                              D0h            mm-100-xxx  DirectPath
SHL/SAL mreg16/32, 1                         D1h            11-100-xxx  DirectPath
SHL/SAL mem16/32, 1                          D1h            mm-100-xxx  DirectPath
SHL/SAL mreg8, CL                            D2h            11-100-xxx  DirectPath
SHL/SAL mem8, CL                             D2h            mm-100-xxx  DirectPath
SHL/SAL mreg16/32, CL                        D3h            11-100-xxx  DirectPath
SHL/SAL mem16/32, CL                         D3h            mm-100-xxx  DirectPath
SHR mreg8, imm8                              C0h            11-101-xxx  DirectPath
SHR mem8, imm8                               C0h            mm-101-xxx  DirectPath
SHR mreg16/32, imm8                          C1h            11-101-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
SHR mem16/32, imm8                           C1h            mm-101-xxx  DirectPath
SHR mreg8, 1                                 D0h            11-101-xxx  DirectPath
SHR mem8, 1                                  D0h            mm-101-xxx  DirectPath
SHR mreg16/32, 1                             D1h            11-101-xxx  DirectPath
SHR mem16/32, 1                              D1h            mm-101-xxx  DirectPath
SHR mreg8, CL                                D2h            11-101-xxx  DirectPath
SHR mem8, CL                                 D2h            mm-101-xxx  DirectPath
SHR mreg16/32, CL                            D3h            11-101-xxx  DirectPath
SHR mem16/32, CL                             D3h            mm-101-xxx  DirectPath
SHLD mreg16/32, reg16/32, imm8               0Fh    A4h     11-xxx-xxx  VectorPath
SHLD mem16/32, reg16/32, imm8                0Fh    A4h     mm-xxx-xxx  VectorPath
SHLD mreg16/32, reg16/32, CL                 0Fh    A5h     11-xxx-xxx  VectorPath
SHLD mem16/32, reg16/32, CL                  0Fh    A5h     mm-xxx-xxx  VectorPath
SHRD mreg16/32, reg16/32, imm8               0Fh    ACh     11-xxx-xxx  VectorPath
SHRD mem16/32, reg16/32, imm8                0Fh    ACh     mm-xxx-xxx  VectorPath
SHRD mreg16/32, reg16/32, CL                 0Fh    ADh     11-xxx-xxx  VectorPath
SHRD mem16/32, reg16/32, CL                  0Fh    ADh     mm-xxx-xxx  VectorPath
SLDT mreg16                                  0Fh    00h     11-000-xxx  VectorPath
SLDT mem16                                   0Fh    00h     mm-000-xxx  VectorPath
SMSW mreg16                                  0Fh    01h     11-100-xxx  VectorPath
SMSW mem16                                   0Fh    01h     mm-100-xxx  VectorPath
STC                                          F9h                        DirectPath
STD                                          FDh                        VectorPath
STI                                          FBh                        VectorPath
STOSB mem8, AL                               AAh                        VectorPath
STOSW mem16, AX                              ABh                        VectorPath
STOSD mem32, EAX                             ABh                        VectorPath
STR mreg16                                   0Fh    00h     11-001-xxx  VectorPath
STR mem16                                    0Fh    00h     mm-001-xxx  VectorPath
SUB mreg8, reg8                              28h            11-xxx-xxx  DirectPath
SUB mem8, reg8                               28h            mm-xxx-xxx  DirectPath
SUB mreg16/32, reg16/32                      29h            11-xxx-xxx  DirectPath
SUB mem16/32, reg16/32                       29h            mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                    First  Second  ModR/M      Decode
                                        Byte   Byte    Byte        Type
SUB reg8, mreg8                         2Ah            11-xxx-xxx  DirectPath
SUB reg8, mem8                          2Ah            mm-xxx-xxx  DirectPath
SUB reg16/32, mreg16/32                 2Bh            11-xxx-xxx  DirectPath
SUB reg16/32, mem16/32                  2Bh            mm-xxx-xxx  DirectPath
SUB AL, imm8                            2Ch                        DirectPath
SUB EAX, imm16/32                       2Dh                        DirectPath
SUB mreg8, imm8                         80h            11-101-xxx  DirectPath
SUB mem8, imm8                          80h            mm-101-xxx  DirectPath
SUB mreg16/32, imm16/32                 81h            11-101-xxx  DirectPath
SUB mem16/32, imm16/32                  81h            mm-101-xxx  DirectPath
SUB mreg16/32, imm8 (sign extended)     83h            11-101-xxx  DirectPath
SUB mem16/32, imm8 (sign extended)      83h            mm-101-xxx  DirectPath
SYSCALL                                 0Fh    05h                 VectorPath
SYSENTER                                0Fh    34h                 VectorPath
SYSEXIT                                 0Fh    35h                 VectorPath
SYSRET                                  0Fh    07h                 VectorPath
TEST mreg8, reg8                        84h            11-xxx-xxx  DirectPath
TEST mem8, reg8                         84h            mm-xxx-xxx  DirectPath
TEST mreg16/32, reg16/32                85h            11-xxx-xxx  DirectPath
TEST mem16/32, reg16/32                 85h            mm-xxx-xxx  DirectPath
TEST AL, imm8                           A8h                        DirectPath
TEST EAX, imm16/32                      A9h                        DirectPath
TEST mreg8, imm8                        F6h            11-000-xxx  DirectPath
TEST mem8, imm8                         F6h            mm-000-xxx  DirectPath
TEST mreg16/32, imm16/32                F7h            11-000-xxx  DirectPath
TEST mem16/32, imm16/32                 F7h            mm-000-xxx  DirectPath
VERR mreg16                             0Fh    00h     11-100-xxx  VectorPath
VERR mem16                              0Fh    00h     mm-100-xxx  VectorPath
VERW mreg16                             0Fh    00h     11-101-xxx  VectorPath
VERW mem16                              0Fh    00h     mm-101-xxx  VectorPath
WAIT                                    9Bh                        DirectPath
WBINVD                                  0Fh    09h                 VectorPath
WRMSR                                   0Fh    30h                 VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                    First  Second  ModR/M      Decode
                                        Byte   Byte    Byte        Type
XADD mreg8, reg8                        0Fh    C0h     11-100-xxx  VectorPath
XADD mem8, reg8                         0Fh    C0h     mm-100-xxx  VectorPath
XADD mreg16/32, reg16/32                0Fh    C1h     11-101-xxx  VectorPath
XADD mem16/32, reg16/32                 0Fh    C1h     mm-101-xxx  VectorPath
XCHG reg8, mreg8                        86h            11-xxx-xxx  VectorPath
XCHG reg8, mem8                         86h            mm-xxx-xxx  VectorPath
XCHG reg16/32, mreg16/32                87h            11-xxx-xxx  VectorPath
XCHG reg16/32, mem16/32                 87h            mm-xxx-xxx  VectorPath
XCHG EAX, EAX                           90h                        DirectPath
XCHG EAX, ECX                           91h                        VectorPath
XCHG EAX, EDX                           92h                        VectorPath
XCHG EAX, EBX                           93h                        VectorPath
XCHG EAX, ESP                           94h                        VectorPath
XCHG EAX, EBP                           95h                        VectorPath
XCHG EAX, ESI                           96h                        VectorPath
XCHG EAX, EDI                           97h                        VectorPath
XLAT                                    D7h                        VectorPath
XOR mreg8, reg8                         30h            11-xxx-xxx  DirectPath
XOR mem8, reg8                          30h            mm-xxx-xxx  DirectPath
XOR mreg16/32, reg16/32                 31h            11-xxx-xxx  DirectPath
XOR mem16/32, reg16/32                  31h            mm-xxx-xxx  DirectPath
XOR reg8, mreg8                         32h            11-xxx-xxx  DirectPath
XOR reg8, mem8                          32h            mm-xxx-xxx  DirectPath
XOR reg16/32, mreg16/32                 33h            11-xxx-xxx  DirectPath
XOR reg16/32, mem16/32                  33h            mm-xxx-xxx  DirectPath
XOR AL, imm8                            34h                        DirectPath
XOR EAX, imm16/32                       35h                        DirectPath
XOR mreg8, imm8                         80h            11-110-xxx  DirectPath
XOR mem8, imm8                          80h            mm-110-xxx  DirectPath
XOR mreg16/32, imm16/32                 81h            11-110-xxx  DirectPath
XOR mem16/32, imm16/32                  81h            mm-110-xxx  DirectPath
XOR mreg16/32, imm8 (sign extended)     83h            11-110-xxx  DirectPath
XOR mem16/32, imm8 (sign extended)      83h            mm-110-xxx  DirectPath
Table 20. MMX™ Instructions

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)       Notes
                              Byte(s)   Byte   Byte        Type
EMMS                          0Fh       77h                DirectPath  FADD/FMUL/FSTORE
MOVD mmreg, reg32             0Fh       6Eh    11-xxx-xxx  VectorPath                    1
MOVD mmreg, mem32             0Fh       6Eh    mm-xxx-xxx  DirectPath  FADD/FMUL/FSTORE
MOVD reg32, mmreg             0Fh       7Eh    11-xxx-xxx  VectorPath                    1
MOVD mem32, mmreg             0Fh       7Eh    mm-xxx-xxx  DirectPath  FSTORE
MOVQ mmreg1, mmreg2           0Fh       6Fh    11-xxx-xxx  DirectPath  FADD/FMUL
MOVQ mmreg, mem64             0Fh       6Fh    mm-xxx-xxx  DirectPath  FADD/FMUL/FSTORE
MOVQ mmreg2, mmreg1           0Fh       7Fh    11-xxx-xxx  DirectPath  FADD/FMUL
MOVQ mem64, mmreg             0Fh       7Fh    mm-xxx-xxx  DirectPath  FSTORE
PACKSSDW mmreg1, mmreg2       0Fh       6Bh    11-xxx-xxx  DirectPath  FADD/FMUL
PACKSSDW mmreg, mem64         0Fh       6Bh    mm-xxx-xxx  DirectPath  FADD/FMUL
PACKSSWB mmreg1, mmreg2       0Fh       63h    11-xxx-xxx  DirectPath  FADD/FMUL
PACKSSWB mmreg, mem64         0Fh       63h    mm-xxx-xxx  DirectPath  FADD/FMUL
PACKUSWB mmreg1, mmreg2       0Fh       67h    11-xxx-xxx  DirectPath  FADD/FMUL
PACKUSWB mmreg, mem64         0Fh       67h    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDB mmreg1, mmreg2          0Fh       FCh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDB mmreg, mem64            0Fh       FCh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDD mmreg1, mmreg2          0Fh       FEh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDD mmreg, mem64            0Fh       FEh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDSB mmreg1, mmreg2         0Fh       ECh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDSB mmreg, mem64           0Fh       ECh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDSW mmreg1, mmreg2         0Fh       EDh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDSW mmreg, mem64           0Fh       EDh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDUSB mmreg1, mmreg2        0Fh       DCh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDUSB mmreg, mem64          0Fh       DCh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDUSW mmreg1, mmreg2        0Fh       DDh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDUSW mmreg, mem64          0Fh       DDh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDW mmreg1, mmreg2          0Fh       FDh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDW mmreg, mem64            0Fh       FDh    mm-xxx-xxx  DirectPath  FADD/FMUL
PAND mmreg1, mmreg2           0Fh       DBh    11-xxx-xxx  DirectPath  FADD/FMUL
PAND mmreg, mem64             0Fh       DBh    mm-xxx-xxx  DirectPath  FADD/FMUL
Notes:
1. Bits 2, 1, and 0 of the modR/M byte select the integer register.
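Throughout these tables, the ModR/M Byte column uses a mod-reg-r/m bit pattern: the first two characters are the mod field (11 for a register form, mm for a memory form), the middle three bits are the reg/opcode-extension field, and xxx marks bits that vary with the operands. Per note 1, bits 2, 1, and 0 select the integer register. The following sketch is illustrative only (the helper name is not part of the manual) and shows how the three fields are extracted:

```python
def decode_modrm(byte):
    """Split a ModR/M byte into its mod (bits 7-6), reg (bits 5-3),
    and r/m (bits 2-0) fields, matching the tables' mod-reg-r/m notation."""
    mod = (byte >> 6) & 0b11    # 0b11 means a register operand ("11-...")
    reg = (byte >> 3) & 0b111   # opcode extension or source register
    rm = byte & 0b111           # per note 1, selects the integer register
    return mod, reg, rm

# 0xEE = 11-101-110: a register form matching the "11-101-xxx" pattern
print(decode_modrm(0xEE))  # (3, 5, 6)
```

For example, 0x05 (00-000-101) decodes as a memory form, matching the "mm-000-xxx" pattern.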
 
Table 20. MMX™ Instructions (Continued)

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)   Byte   Byte        Type
PANDN mmreg1, mmreg2          0Fh       DFh    11-xxx-xxx  DirectPath  FADD/FMUL
PANDN mmreg, mem64            0Fh       DFh    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQB mmreg1, mmreg2        0Fh       74h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQB mmreg, mem64          0Fh       74h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQD mmreg1, mmreg2        0Fh       76h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQD mmreg, mem64          0Fh       76h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQW mmreg1, mmreg2        0Fh       75h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQW mmreg, mem64          0Fh       75h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTB mmreg1, mmreg2        0Fh       64h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTB mmreg, mem64          0Fh       64h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTD mmreg1, mmreg2        0Fh       66h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTD mmreg, mem64          0Fh       66h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTW mmreg1, mmreg2        0Fh       65h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTW mmreg, mem64          0Fh       65h    mm-xxx-xxx  DirectPath  FADD/FMUL
PMADDWD mmreg1, mmreg2        0Fh       F5h    11-xxx-xxx  DirectPath  FMUL
PMADDWD mmreg, mem64          0Fh       F5h    mm-xxx-xxx  DirectPath  FMUL
PMULHW mmreg1, mmreg2         0Fh       E5h    11-xxx-xxx  DirectPath  FMUL
PMULHW mmreg, mem64           0Fh       E5h    mm-xxx-xxx  DirectPath  FMUL
PMULLW mmreg1, mmreg2         0Fh       D5h    11-xxx-xxx  DirectPath  FMUL
PMULLW mmreg, mem64           0Fh       D5h    mm-xxx-xxx  DirectPath  FMUL
POR mmreg1, mmreg2            0Fh       EBh    11-xxx-xxx  DirectPath  FADD/FMUL
POR mmreg, mem64              0Fh       EBh    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLD mmreg1, mmreg2          0Fh       F2h    11-xxx-xxx  DirectPath  FADD/FMUL
PSLLD mmreg, mem64            0Fh       F2h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLD mmreg, imm8             0Fh       72h    11-110-xxx  DirectPath  FADD/FMUL
PSLLQ mmreg1, mmreg2          0Fh       F3h    11-xxx-xxx  DirectPath  FADD/FMUL
PSLLQ mmreg, mem64            0Fh       F3h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLQ mmreg, imm8             0Fh       73h    11-110-xxx  DirectPath  FADD/FMUL
PSLLW mmreg1, mmreg2          0Fh       F1h    11-xxx-xxx  DirectPath  FADD/FMUL
PSLLW mmreg, mem64            0Fh       F1h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLW mmreg, imm8             0Fh       71h    11-110-xxx  DirectPath  FADD/FMUL
Table 20. MMX™ Instructions (Continued)

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)   Byte   Byte        Type
PSRAW mmreg1, mmreg2          0Fh       E1h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRAW mmreg, mem64            0Fh       E1h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRAW mmreg, imm8             0Fh       71h    11-100-xxx  DirectPath  FADD/FMUL
PSRAD mmreg1, mmreg2          0Fh       E2h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRAD mmreg, mem64            0Fh       E2h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRAD mmreg, imm8             0Fh       72h    11-100-xxx  DirectPath  FADD/FMUL
PSRLD mmreg1, mmreg2          0Fh       D2h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRLD mmreg, mem64            0Fh       D2h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRLD mmreg, imm8             0Fh       72h    11-010-xxx  DirectPath  FADD/FMUL
PSRLQ mmreg1, mmreg2          0Fh       D3h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRLQ mmreg, mem64            0Fh       D3h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRLQ mmreg, imm8             0Fh       73h    11-010-xxx  DirectPath  FADD/FMUL
PSRLW mmreg1, mmreg2          0Fh       D1h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRLW mmreg, mem64            0Fh       D1h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRLW mmreg, imm8             0Fh       71h    11-010-xxx  DirectPath  FADD/FMUL
PSUBB mmreg1, mmreg2          0Fh       F8h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBB mmreg, mem64            0Fh       F8h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBD mmreg1, mmreg2          0Fh       FAh    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBD mmreg, mem64            0Fh       FAh    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBSB mmreg1, mmreg2         0Fh       E8h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBSB mmreg, mem64           0Fh       E8h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBSW mmreg1, mmreg2         0Fh       E9h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBSW mmreg, mem64           0Fh       E9h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSB mmreg1, mmreg2        0Fh       D8h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSB mmreg, mem64          0Fh       D8h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSW mmreg1, mmreg2        0Fh       D9h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSW mmreg, mem64          0Fh       D9h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBW mmreg1, mmreg2          0Fh       F9h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBW mmreg, mem64            0Fh       F9h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHBW mmreg1, mmreg2      0Fh       68h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHBW mmreg, mem64        0Fh       68h    mm-xxx-xxx  DirectPath  FADD/FMUL
Table 20. MMX™ Instructions (Continued)

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)   Byte   Byte        Type
PUNPCKHDQ mmreg1, mmreg2      0Fh       6Ah    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHDQ mmreg, mem64        0Fh       6Ah    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHWD mmreg1, mmreg2      0Fh       69h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHWD mmreg, mem64        0Fh       69h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLBW mmreg1, mmreg2      0Fh       60h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLBW mmreg, mem64        0Fh       60h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLDQ mmreg1, mmreg2      0Fh       62h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLDQ mmreg, mem64        0Fh       62h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLWD mmreg1, mmreg2      0Fh       61h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLWD mmreg, mem64        0Fh       61h    mm-xxx-xxx  DirectPath  FADD/FMUL
PXOR mmreg1, mmreg2           0Fh       EFh    11-xxx-xxx  DirectPath  FADD/FMUL
PXOR mmreg, mem64             0Fh       EFh    mm-xxx-xxx  DirectPath  FADD/FMUL

Table 21. MMX™ Extensions

Instruction Mnemonic            Prefix    First  ModR/M      Decode      FPU Pipe(s)
                                Byte(s)   Byte   Byte        Type
MASKMOVQ mmreg1, mmreg2         0Fh       F7h                VectorPath  FADD/FMUL/FSTORE
MOVNTQ mem64, mmreg             0Fh       E7h                DirectPath  FSTORE
PAVGB mmreg1, mmreg2            0Fh       E0h    11-xxx-xxx  DirectPath  FADD/FMUL
PAVGB mmreg, mem64              0Fh       E0h    mm-xxx-xxx  DirectPath  FADD/FMUL
PAVGW mmreg1, mmreg2            0Fh       E3h    11-xxx-xxx  DirectPath  FADD/FMUL
PAVGW mmreg, mem64              0Fh       E3h    mm-xxx-xxx  DirectPath  FADD/FMUL
PEXTRW reg32, mmreg, imm8       0Fh       C5h                VectorPath
PINSRW mmreg, reg32, imm8       0Fh       C4h                VectorPath
PINSRW mmreg, mem16, imm8       0Fh       C4h                VectorPath
PMAXSW mmreg1, mmreg2           0Fh       EEh    11-xxx-xxx  DirectPath  FADD/FMUL
PMAXSW mmreg, mem64             0Fh       EEh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMAXUB mmreg1, mmreg2           0Fh       DEh    11-xxx-xxx  DirectPath  FADD/FMUL
PMAXUB mmreg, mem64             0Fh       DEh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMINSW mmreg1, mmreg2           0Fh       EAh    11-xxx-xxx  DirectPath  FADD/FMUL
Table 21. MMX™ Extensions (Continued)

Instruction Mnemonic            Prefix    First  ModR/M      Decode      FPU Pipe(s)  Notes
                                Byte(s)   Byte   Byte        Type
PMINSW mmreg, mem64             0Fh       EAh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMINUB mmreg1, mmreg2           0Fh       DAh    11-xxx-xxx  DirectPath  FADD/FMUL
PMINUB mmreg, mem64             0Fh       DAh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMOVMSKB reg32, mmreg           0Fh       D7h                VectorPath
PMULHUW mmreg1, mmreg2          0Fh       E4h    11-xxx-xxx  DirectPath  FMUL
PMULHUW mmreg, mem64            0Fh       E4h    mm-xxx-xxx  DirectPath  FMUL
PSADBW mmreg1, mmreg2           0Fh       F6h    11-xxx-xxx  DirectPath  FADD
PSADBW mmreg, mem64             0Fh       F6h    mm-xxx-xxx  DirectPath  FADD
PSHUFW mmreg1, mmreg2, imm8     0Fh       70h                DirectPath  FADD/FMUL
PSHUFW mmreg, mem64, imm8       0Fh       70h                DirectPath  FADD/FMUL
PREFETCHNTA mem8                0Fh       18h                DirectPath  -            1
PREFETCHT0 mem8                 0Fh       18h                DirectPath  -            1
PREFETCHT1 mem8                 0Fh       18h                DirectPath  -            1
PREFETCHT2 mem8                 0Fh       18h                DirectPath  -            1
SFENCE                          0Fh       AEh                VectorPath  -
Notes:
1. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
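The note above means a prefetch operates on an entire 64-byte cache line, not on a single byte: the effective address is aligned down to the start of the line it falls in, and the whole line is fetched. A minimal sketch of that alignment, assuming the 64-byte line size stated in the note (the helper name is illustrative):

```python
LINE_SIZE = 64  # cache line size stated in the prefetch note

def line_base(addr):
    """Return the base address of the 64-byte line containing addr --
    the whole line is what PREFETCHNTA/T0/T1/T2 actually fetches."""
    return addr & ~(LINE_SIZE - 1)

print(hex(line_base(0x1234)))  # 0x1200: any address in 0x1200-0x123F maps here
```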
Table 22. Floating-Point Instructions

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
F2XM1                       D9h    F0h                 VectorPath
FABS                        D9h    E1h                 DirectPath  FMUL
FADD ST, ST(i)              D8h            11-000-xxx  DirectPath  FADD              1
FADD [mem32real]            D8h            mm-000-xxx  DirectPath  FADD
FADD ST(i), ST              DCh            11-000-xxx  DirectPath  FADD              1
FADD [mem64real]            DCh            mm-000-xxx  DirectPath  FADD
FADDP ST(i), ST             DEh            11-000-xxx  DirectPath  FADD              1
FBLD [mem80]                DFh            mm-100-xxx  VectorPath
FBSTP [mem80]               DFh            mm-110-xxx  VectorPath
FCHS                        D9h    E0h                 DirectPath  FMUL
FCLEX                       DBh    E2h                 VectorPath
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
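Note 1 means the register forms are encoded by placing i in the low three bits of the ModR/M byte. FADD ST, ST(i), for example, has first byte D8h and ModR/M pattern 11-000-xxx, so FADD ST, ST(1) assembles to D8h C1h. A sketch of that encoding (the helper name is illustrative, not from the manual):

```python
def encode_fadd_st_sti(i):
    """Encode FADD ST, ST(i): first byte D8h, then a ModR/M byte of the
    form 11-000-xxx where the last three bits select ST(i) (note 1)."""
    assert 0 <= i <= 7, "the x87 stack has eight entries, ST(0)-ST(7)"
    modrm = 0b11_000_000 | i   # mod=11 (register form), reg=000, r/m=i
    return bytes([0xD8, modrm])

print(encode_fadd_st_sti(1).hex())  # 'd8c1'
```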
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FCMOVB ST(0), ST(i)         DAh    C0-C7h              VectorPath  FADD
FCMOVE ST(0), ST(i)         DAh    C8-CFh              VectorPath  FADD
FCMOVBE ST(0), ST(i)        DAh    D0-D7h              VectorPath  FADD
FCMOVU ST(0), ST(i)         DAh    D8-DFh              VectorPath  FADD
FCMOVNB ST(0), ST(i)        DBh    C0-C7h              VectorPath  FADD
FCMOVNE ST(0), ST(i)        DBh    C8-CFh              VectorPath  FADD
FCMOVNBE ST(0), ST(i)       DBh    D0-D7h              VectorPath  FADD
FCMOVNU ST(0), ST(i)        DBh    D8-DFh              VectorPath  FADD
FCOM ST(i)                  D8h            11-010-xxx  DirectPath  FADD              1
FCOMP ST(i)                 D8h            11-011-xxx  DirectPath  FADD              1
FCOM [mem32real]            D8h            mm-010-xxx  DirectPath  FADD
FCOM [mem64real]            DCh            mm-010-xxx  DirectPath  FADD
FCOMI ST, ST(i)             DBh    F0-F7h              VectorPath  FADD
FCOMIP ST, ST(i)            DFh    F0-F7h              VectorPath  FADD
FCOMP [mem32real]           D8h            mm-011-xxx  DirectPath  FADD
FCOMP [mem64real]           DCh            mm-011-xxx  DirectPath  FADD
FCOMPP                      DEh            11-011-001  DirectPath  FADD
FCOS                        D9h    FFh                 VectorPath
FDECSTP                     D9h    F6h                 DirectPath  FADD/FMUL/FSTORE
FDIV ST, ST(i)              D8h            11-110-xxx  DirectPath  FMUL              1
FDIV ST(i), ST              DCh            11-111-xxx  DirectPath  FMUL              1
FDIV [mem32real]            D8h            mm-110-xxx  DirectPath  FMUL
FDIV [mem64real]            DCh            mm-110-xxx  DirectPath  FMUL
FDIVP ST(i), ST             DEh            11-111-xxx  DirectPath  FMUL              1
FDIVR ST, ST(i)             D8h            11-111-xxx  DirectPath  FMUL              1
FDIVR ST(i), ST             DCh            11-110-xxx  DirectPath  FMUL              1
FDIVR [mem32real]           D8h            mm-111-xxx  DirectPath  FMUL
FDIVR [mem64real]           DCh            mm-111-xxx  DirectPath  FMUL
FDIVRP ST(i), ST            DEh            11-110-xxx  DirectPath  FMUL              1
FFREE ST(i)                 DDh            11-000-xxx  DirectPath  FADD/FMUL/FSTORE  1
FFREEP ST(i)                DFh    C0-C7h              DirectPath  FADD/FMUL/FSTORE  1
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FIADD [mem32int]            DAh            mm-000-xxx  VectorPath
FIADD [mem16int]            DEh            mm-000-xxx  VectorPath
FICOM [mem32int]            DAh            mm-010-xxx  VectorPath
FICOM [mem16int]            DEh            mm-010-xxx  VectorPath
FICOMP [mem32int]           DAh            mm-011-xxx  VectorPath
FICOMP [mem16int]           DEh            mm-011-xxx  VectorPath
FIDIV [mem32int]            DAh            mm-110-xxx  VectorPath
FIDIV [mem16int]            DEh            mm-110-xxx  VectorPath
FIDIVR [mem32int]           DAh            mm-111-xxx  VectorPath
FIDIVR [mem16int]           DEh            mm-111-xxx  VectorPath
FILD [mem16int]             DFh            mm-000-xxx  DirectPath  FSTORE
FILD [mem32int]             DBh            mm-000-xxx  DirectPath  FSTORE
FILD [mem64int]             DFh            mm-101-xxx  DirectPath  FSTORE
FIMUL [mem32int]            DAh            mm-001-xxx  VectorPath
FIMUL [mem16int]            DEh            mm-001-xxx  VectorPath
FINCSTP                     D9h    F7h                 DirectPath  FADD/FMUL/FSTORE
FINIT                       DBh    E3h                 VectorPath
FIST [mem16int]             DFh            mm-010-xxx  DirectPath  FSTORE
FIST [mem32int]             DBh            mm-010-xxx  DirectPath  FSTORE
FISTP [mem16int]            DFh            mm-011-xxx  DirectPath  FSTORE
FISTP [mem32int]            DBh            mm-011-xxx  DirectPath  FSTORE
FISTP [mem64int]            DFh            mm-111-xxx  DirectPath  FSTORE
FISUB [mem32int]            DAh            mm-100-xxx  VectorPath
FISUB [mem16int]            DEh            mm-100-xxx  VectorPath
FISUBR [mem32int]           DAh            mm-101-xxx  VectorPath
FISUBR [mem16int]           DEh            mm-101-xxx  VectorPath
FLD ST(i)                   D9h            11-000-xxx  DirectPath  FADD/FMUL         1
FLD [mem32real]             D9h            mm-000-xxx  DirectPath  FADD/FMUL/FSTORE
FLD [mem64real]             DDh            mm-000-xxx  DirectPath  FADD/FMUL/FSTORE
FLD [mem80real]             DBh            mm-101-xxx  VectorPath
FLD1                        D9h    E8h                 DirectPath  FSTORE
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FLDCW [mem16]               D9h            mm-101-xxx  VectorPath
FLDENV [mem14byte]          D9h            mm-100-xxx  VectorPath
FLDENV [mem28byte]          D9h            mm-100-xxx  VectorPath
FLDL2E                      D9h    EAh                 DirectPath  FSTORE
FLDL2T                      D9h    E9h                 DirectPath  FSTORE
FLDLG2                      D9h    ECh                 DirectPath  FSTORE
FLDLN2                      D9h    EDh                 DirectPath  FSTORE
FLDPI                       D9h    EBh                 DirectPath  FSTORE
FLDZ                        D9h    EEh                 DirectPath  FSTORE
FMUL ST, ST(i)              D8h            11-001-xxx  DirectPath  FMUL              1
FMUL ST(i), ST              DCh            11-001-xxx  DirectPath  FMUL              1
FMUL [mem32real]            D8h            mm-001-xxx  DirectPath  FMUL
FMUL [mem64real]            DCh            mm-001-xxx  DirectPath  FMUL
FMULP ST(i), ST             DEh            11-001-xxx  DirectPath  FMUL              1
FNOP                        D9h    D0h                 DirectPath  FADD/FMUL/FSTORE
FPTAN                       D9h    F2h                 VectorPath
FPATAN                      D9h    F3h                 VectorPath
FPREM                       D9h    F8h                 DirectPath  FMUL
FPREM1                      D9h    F5h                 DirectPath  FMUL
FRNDINT                     D9h    FCh                 VectorPath
FRSTOR [mem94byte]          DDh            mm-100-xxx  VectorPath
FRSTOR [mem108byte]         DDh            mm-100-xxx  VectorPath
FSAVE [mem94byte]           DDh            mm-110-xxx  VectorPath
FSAVE [mem108byte]          DDh            mm-110-xxx  VectorPath
FSCALE                      D9h    FDh                 VectorPath
FSIN                        D9h    FEh                 VectorPath
FSINCOS                     D9h    FBh                 VectorPath
FSQRT                       D9h    FAh                 DirectPath  FMUL
FST [mem32real]             D9h            mm-010-xxx  DirectPath  FSTORE
FST [mem64real]             DDh            mm-010-xxx  DirectPath  FSTORE
FST ST(i)                   DDh            11-010-xxx  DirectPath  FADD/FMUL
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FSTCW [mem16]               D9h            mm-111-xxx  VectorPath
FSTENV [mem14byte]          D9h            mm-110-xxx  VectorPath
FSTENV [mem28byte]          D9h            mm-110-xxx  VectorPath
FSTP [mem32real]            D9h            mm-011-xxx  DirectPath  FADD/FMUL
FSTP [mem64real]            DDh            mm-011-xxx  DirectPath  FADD/FMUL
FSTP [mem80real]            DBh            mm-111-xxx  VectorPath
FSTP ST(i)                  DDh            11-011-xxx  DirectPath  FADD/FMUL
FSTSW AX                    DFh    E0h                 VectorPath
FSTSW [mem16]               DDh            mm-111-xxx  VectorPath
FSUB [mem32real]            D8h            mm-100-xxx  DirectPath  FADD
FSUB [mem64real]            DCh            mm-100-xxx  DirectPath  FADD
FSUB ST, ST(i)              D8h            11-100-xxx  DirectPath  FADD              1
FSUB ST(i), ST              DCh            11-101-xxx  DirectPath  FADD              1
FSUBP ST(i), ST             DEh            11-101-xxx  DirectPath  FADD              1
FSUBR [mem32real]           D8h            mm-101-xxx  DirectPath  FADD
FSUBR [mem64real]           DCh            mm-101-xxx  DirectPath  FADD
FSUBR ST, ST(i)             D8h            11-101-xxx  DirectPath  FADD              1
FSUBR ST(i), ST             DCh            11-100-xxx  DirectPath  FADD              1
FSUBRP ST(i), ST            DEh            11-100-xxx  DirectPath  FADD              1
FTST                        D9h    E4h                 DirectPath  FADD
FUCOM                       DDh            11-100-xxx  DirectPath  FADD
FUCOMI ST, ST(i)            DBh    E8-EFh              VectorPath  FADD
FUCOMIP ST, ST(i)           DFh    E8-EFh              VectorPath  FADD
FUCOMP                      DDh            11-101-xxx  DirectPath  FADD
FUCOMPP                     DAh    E9h                 DirectPath  FADD
FWAIT                       9Bh                        DirectPath
FXAM                        D9h    E5h                 VectorPath
FXCH                        D9h            11-001-xxx  DirectPath  FADD/FMUL/FSTORE
FXTRACT                     D9h    F4h                 VectorPath
FYL2X                       D9h    F1h                 VectorPath
FYL2XP1                     D9h    F9h                 VectorPath
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 23. 3DNow!™ Instructions

Instruction Mnemonic          Prefix     imm8  ModR/M      Decode      FPU Pipe(s)       Note
                              Byte(s)          Byte        Type
FEMMS                         0Fh        0Eh               DirectPath  FADD/FMUL/FSTORE  2
PAVGUSB mmreg1, mmreg2        0Fh, 0Fh   BFh   11-xxx-xxx  DirectPath  FADD/FMUL
PAVGUSB mmreg, mem64          0Fh, 0Fh   BFh   mm-xxx-xxx  DirectPath  FADD/FMUL
PF2ID mmreg1, mmreg2          0Fh, 0Fh   1Dh   11-xxx-xxx  DirectPath  FADD
PF2ID mmreg, mem64            0Fh, 0Fh   1Dh   mm-xxx-xxx  DirectPath  FADD
PFACC mmreg1, mmreg2          0Fh, 0Fh   AEh   11-xxx-xxx  DirectPath  FADD
PFACC mmreg, mem64            0Fh, 0Fh   AEh   mm-xxx-xxx  DirectPath  FADD
PFADD mmreg1, mmreg2          0Fh, 0Fh   9Eh   11-xxx-xxx  DirectPath  FADD
PFADD mmreg, mem64            0Fh, 0Fh   9Eh   mm-xxx-xxx  DirectPath  FADD
PFCMPEQ mmreg1, mmreg2        0Fh, 0Fh   B0h   11-xxx-xxx  DirectPath  FADD
PFCMPEQ mmreg, mem64          0Fh, 0Fh   B0h   mm-xxx-xxx  DirectPath  FADD
PFCMPGE mmreg1, mmreg2        0Fh, 0Fh   90h   11-xxx-xxx  DirectPath  FADD
PFCMPGE mmreg, mem64          0Fh, 0Fh   90h   mm-xxx-xxx  DirectPath  FADD
PFCMPGT mmreg1, mmreg2        0Fh, 0Fh   A0h   11-xxx-xxx  DirectPath  FADD
PFCMPGT mmreg, mem64          0Fh, 0Fh   A0h   mm-xxx-xxx  DirectPath  FADD
PFMAX mmreg1, mmreg2          0Fh, 0Fh   A4h   11-xxx-xxx  DirectPath  FADD
PFMAX mmreg, mem64            0Fh, 0Fh   A4h   mm-xxx-xxx  DirectPath  FADD
PFMIN mmreg1, mmreg2          0Fh, 0Fh   94h   11-xxx-xxx  DirectPath  FADD
PFMIN mmreg, mem64            0Fh, 0Fh   94h   mm-xxx-xxx  DirectPath  FADD
PFMUL mmreg1, mmreg2          0Fh, 0Fh   B4h   11-xxx-xxx  DirectPath  FMUL
PFMUL mmreg, mem64            0Fh, 0Fh   B4h   mm-xxx-xxx  DirectPath  FMUL
PFRCP mmreg1, mmreg2          0Fh, 0Fh   96h   11-xxx-xxx  DirectPath  FMUL
PFRCP mmreg, mem64            0Fh, 0Fh   96h   mm-xxx-xxx  DirectPath  FMUL
PFRCPIT1 mmreg1, mmreg2       0Fh, 0Fh   A6h   11-xxx-xxx  DirectPath  FMUL
PFRCPIT1 mmreg, mem64         0Fh, 0Fh   A6h   mm-xxx-xxx  DirectPath  FMUL
PFRCPIT2 mmreg1, mmreg2       0Fh, 0Fh   B6h   11-xxx-xxx  DirectPath  FMUL
PFRCPIT2 mmreg, mem64         0Fh, 0Fh   B6h   mm-xxx-xxx  DirectPath  FMUL
PFRSQIT1 mmreg1, mmreg2       0Fh, 0Fh   A7h   11-xxx-xxx  DirectPath  FMUL
PFRSQIT1 mmreg, mem64         0Fh, 0Fh   A7h   mm-xxx-xxx  DirectPath  FMUL
PFRSQRT mmreg1, mmreg2        0Fh, 0Fh   97h   11-xxx-xxx  DirectPath  FMUL
Notes:
1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
2. The byte listed in the column titled "imm8" is actually the opcode byte.
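Per note 2, a 3DNow! instruction is encoded as the 0Fh 0Fh prefix, a ModR/M byte (plus any addressing bytes the memory form requires), and then the suffix opcode from the "imm8" column in the position an immediate byte would occupy. A sketch of the register form only (the helper name is illustrative, not from the manual):

```python
def encode_3dnow_reg(suffix, dst, src):
    """Encode a register-form 3DNow! instruction: 0Fh 0Fh prefix, a
    ModR/M byte of the form 11-dst-src, then the suffix opcode taken
    from the "imm8" column (note 2) -- e.g., B4h for PFMUL."""
    modrm = 0b11000000 | (dst << 3) | src  # mod=11: register operands
    return bytes([0x0F, 0x0F, modrm, suffix])

# PFMUL mm0, mm1
print(encode_3dnow_reg(0xB4, 0, 1).hex())  # '0f0fc1b4'
```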
   
Table 23. 3DNow!™ Instructions (Continued)

Instruction Mnemonic          Prefix     imm8  ModR/M      Decode      FPU Pipe(s)  Note
                              Byte(s)          Byte        Type
PFRSQRT mmreg, mem64          0Fh, 0Fh   97h   mm-xxx-xxx  DirectPath  FMUL
PFSUB mmreg1, mmreg2          0Fh, 0Fh   9Ah   11-xxx-xxx  DirectPath  FADD
PFSUB mmreg, mem64            0Fh, 0Fh   9Ah   mm-xxx-xxx  DirectPath  FADD
PFSUBR mmreg1, mmreg2         0Fh, 0Fh   AAh   11-xxx-xxx  DirectPath  FADD
PFSUBR mmreg, mem64           0Fh, 0Fh   AAh   mm-xxx-xxx  DirectPath  FADD
PI2FD mmreg1, mmreg2          0Fh, 0Fh   0Dh   11-xxx-xxx  DirectPath  FADD
PI2FD mmreg, mem64            0Fh, 0Fh   0Dh   mm-xxx-xxx  DirectPath  FADD
PMULHRW mmreg1, mmreg2        0Fh, 0Fh   B7h   11-xxx-xxx  DirectPath  FMUL
PMULHRW mmreg, mem64          0Fh, 0Fh   B7h   mm-xxx-xxx  DirectPath  FMUL
PREFETCH mem8                 0Fh        0Dh   mm-000-xxx  DirectPath  -            1, 2
PREFETCHW mem8                0Fh        0Dh   mm-001-xxx  DirectPath  -            1, 2
Notes:
1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
2. The byte listed in the column titled "imm8" is actually the opcode byte.
Table 24. 3DNow!™ Extensions

Instruction Mnemonic          Prefix     imm8  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)          Byte        Type
PF2IW mmreg1, mmreg2          0Fh, 0Fh   1Ch   11-xxx-xxx  DirectPath  FADD
PF2IW mmreg, mem64            0Fh, 0Fh   1Ch   mm-xxx-xxx  DirectPath  FADD
PFNACC mmreg1, mmreg2         0Fh, 0Fh   8Ah   11-xxx-xxx  DirectPath  FADD
PFNACC mmreg, mem64           0Fh, 0Fh   8Ah   mm-xxx-xxx  DirectPath  FADD
PFPNACC mmreg1, mmreg2        0Fh, 0Fh   8Eh   11-xxx-xxx  DirectPath  FADD
PFPNACC mmreg, mem64          0Fh, 0Fh   8Eh   mm-xxx-xxx  DirectPath  FADD
PI2FW mmreg1, mmreg2          0Fh, 0Fh   0Ch   11-xxx-xxx  DirectPath  FADD
PI2FW mmreg, mem64            0Fh, 0Fh   0Ch   mm-xxx-xxx  DirectPath  FADD
PSWAPD mmreg1, mmreg2         0Fh, 0Fh   BBh   11-xxx-xxx  DirectPath  FADD/FMUL
PSWAPD mmreg, mem64           0Fh, 0Fh   BBh   mm-xxx-xxx  DirectPath  FADD/FMUL
   
Appendix G
DirectPath versus VectorPath Instructions
Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execution efficiency because they minimize the number of operations per x86 instruction. This includes "register ← register op memory" as well as "register ← register op register" forms of instructions.
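One practical way to apply this guidance is to audit an instruction trace against the decode-type tables. The sketch below is illustrative only: it hand-copies a few rows from Tables 19 and 25 into a lookup (the tables themselves remain the authority) and flags the VectorPath instructions, which are candidates for replacement with DirectPath equivalents:

```python
# A few decode types copied by hand from Tables 19 and 25; the full
# tables in this appendix are the authoritative source.
DECODE_TYPE = {
    "MOV reg16/32, mreg16/32": "DirectPath",
    "ADD mreg16/32, reg16/32": "DirectPath",
    "XCHG EAX, ECX": "VectorPath",
    "XLAT": "VectorPath",
}

def vectorpath_ops(trace):
    """Return the instructions in a trace that decode as VectorPath
    and are therefore candidates for DirectPath replacements."""
    return [op for op in trace if DECODE_TYPE.get(op) == "VectorPath"]

print(vectorpath_ops(["ADD mreg16/32, reg16/32", "XCHG EAX, ECX", "XLAT"]))
# ['XCHG EAX, ECX', 'XLAT']
```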
DirectPath Instructions

The following tables contain the DirectPath instructions, which should be used in the AMD Athlon processor wherever possible. All 3DNow! instructions, including the 3DNow! extensions, are DirectPath and are listed in Table 23, "3DNow!™ Instructions", and Table 24, "3DNow!™ Extensions".
Table 25. DirectPath Integer Instructions  
Instruction Mnemonic  
AND mreg16/32, reg16/32  
AND mem16/32, reg16/32  
AND reg8, mreg8  
Instruction Mnemonic  
ADC mreg8, reg8  
ADC mem8, reg8  
ADC mreg16/32, reg16/32  
ADC mem16/32, reg16/32  
ADC reg8, mreg8  
AND reg8, mem8  
AND reg16/32, mreg16/32  
AND reg16/32, mem16/32  
AND AL, imm8  
ADC reg8, mem8  
ADC reg16/32, mreg16/32  
ADC reg16/32, mem16/32  
ADC AL, imm8  
AND EAX, imm16/32  
AND mreg8, imm8  
AND mem8, imm8  
ADC EAX, imm16/32  
AND mreg16/32, imm16/32  
AND mem16/32, imm16/32  
AND mreg16/32, imm8 (sign extended)  
AND mem16/32, imm8 (sign extended)  
BSWAP EAX  
ADC mreg8, imm8  
ADC mem8, imm8  
ADC mreg16/32, imm16/32  
ADC mem16/32, imm16/32  
ADC mreg16/32, imm8 (sign extended)  
ADC mem16/32, imm8 (sign extended)  
ADD mreg8, reg8  
BSWAP ECX  
BSWAP EDX  
BSWAP EBX  
ADD mem8, reg8  
BSWAP ESP  
ADD mreg16/32, reg16/32  
ADD mem16/32, reg16/32  
ADD reg8, mreg8  
BSWAP EBP  
BSWAP ESI  
BSWAP EDI  
ADD reg8, mem8  
BT mreg16/32, reg16/32  
BT mreg16/32, imm8  
ADD reg16/32, mreg16/32  
ADD reg16/32, mem16/32  
ADD AL, imm8  
BT mem16/32, imm8  
CBW/CWDE  
ADD EAX, imm16/32  
CLC  
ADD mreg8, imm8  
CMC  
ADD mem8, imm8  
CMOVA/CMOVBE reg16/32, reg16/32  
CMOVA/CMOVBE reg16/32, mem16/32  
CMOVAE/CMOVNB/CMOVNC reg16/32, mem16/32  
CMOVAE/CMOVNB/CMOVNC mem16/32, mem16/32  
CMOVB/CMOVC/CMOVNAE reg16/32, reg16/32  
CMOVB/CMOVC/CMOVNAE mem16/32, reg16/32  
ADD mreg16/32, imm16/32  
ADD mem16/32, imm16/32  
ADD mreg16/32, imm8 (sign extended)  
ADD mem16/32, imm8 (sign extended)  
AND mreg8, reg8  
AND mem8, reg8  
220  
DirectPath Instructions  
Download from Www.Somanuals.com. All Manuals Search And Download.  
 
22007E/0November 1999  
AMD AthlonProcessor x86 Code Optimization  
Table 25. DirectPath Integer Instructions (Continued)
Instruction Mnemonic  
CMOVBE/CMOVNA reg16/32, reg16/32  
CMOVBE/CMOVNA reg16/32, mem16/32  
CMOVE/CMOVZ reg16/32, reg16/32  
CMOVE/CMOVZ reg16/32, mem16/32  
CMOVG/CMOVNLE reg16/32, reg16/32  
CMOVG/CMOVNLE reg16/32, mem16/32  
CMOVGE/CMOVNL reg16/32, reg16/32  
CMOVGE/CMOVNL reg16/32, mem16/32  
CMOVL/CMOVNGE reg16/32, reg16/32  
CMOVL/CMOVNGE reg16/32, mem16/32  
CMOVLE/CMOVNG reg16/32, reg16/32  
CMOVLE/CMOVNG reg16/32, mem16/32  
CMOVNE/CMOVNZ reg16/32, reg16/32  
CMOVNE/CMOVNZ reg16/32, mem16/32  
CMOVNO reg16/32, reg16/32  
CMOVNO reg16/32, mem16/32  
CMOVNP/CMOVPO reg16/32, reg16/32  
CMOVNP/CMOVPO reg16/32, mem16/32  
CMOVNS reg16/32, reg16/32  
CMOVNS reg16/32, mem16/32  
CMOVO reg16/32, reg16/32  
CMOVO reg16/32, mem16/32  
CMOVP/CMOVPE reg16/32, reg16/32  
CMOVP/CMOVPE reg16/32, mem16/32  
CMOVS reg16/32, reg16/32  
CMOVS reg16/32, mem16/32  
CMP mreg8, reg8  
CMP mem8, reg8  
CMP mreg16/32, reg16/32  
CMP mem16/32, reg16/32  
CMP reg8, mreg8  
CMP reg8, mem8  
CMP reg16/32, mreg16/32  
CMP reg16/32, mem16/32  
CMP AL, imm8  
CMP EAX, imm16/32  
CMP mreg8, imm8  
CMP mem8, imm8  
CMP mreg16/32, imm16/32  
CMP mem16/32, imm16/32  
CMP mreg16/32, imm8 (sign extended)  
CMP mem16/32, imm8 (sign extended)  
CWD/CDQ  
DEC EAX  
DEC ECX  
DEC EDX  
DEC EBX  
DEC ESP  
DEC EBP  
DEC ESI  
DEC EDI  
DEC mreg8  
DEC mem8  
DEC mreg16/32  
DEC mem16/32  
INC EAX  
INC ECX  
INC EDX  
INC EBX  
INC ESP  
INC EBP  
INC ESI  
INC EDI  
INC mreg8  
INC mem8  
INC mreg16/32  
INC mem16/32  
JO short disp8  
JNO short disp8  
JB/JNAE short disp8  
JNB/JAE short disp8  
JZ/JE short disp8  
JNZ/JNE short disp8  
JBE/JNA short disp8  
JNBE/JA short disp8  
JS short disp8  
JNS short disp8  
JP/JPE short disp8  
JNP/JPO short disp8  
JL/JNGE short disp8  
JNL/JGE short disp8  
JLE/JNG short disp8  
JNLE/JG short disp8  
JO near disp16/32  
JNO near disp16/32  
JB/JNAE near disp16/32  
JNB/JAE near disp16/32  
JZ/JE near disp16/32  
JNZ/JNE near disp16/32  
JBE/JNA near disp16/32  
JNBE/JA near disp16/32  
JS near disp16/32  
JNS near disp16/32  
JP/JPE near disp16/32  
JNP/JPO near disp16/32  
JL/JNGE near disp16/32  
JNL/JGE near disp16/32  
JLE/JNG near disp16/32  
JNLE/JG near disp16/32  
JMP near disp16/32 (direct)  
JMP far disp32/48 (direct)  
JMP disp8 (short)  
JMP near mreg16/32 (indirect)  
JMP near mem16/32 (indirect)  
LEA reg32, mem16/32  
MOV mreg8, reg8  
MOV mem8, reg8  
MOV mreg16/32, reg16/32  
MOV mem16/32, reg16/32  
MOV reg8, mreg8  
MOV reg8, mem8  
MOV reg16/32, mreg16/32  
MOV reg16/32, mem16/32  
MOV AL, mem8  
MOV EAX, mem16/32  
MOV mem8, AL  
MOV mem16/32, EAX  
MOV AL, imm8  
MOV CL, imm8  
MOV DL, imm8  
MOV BL, imm8  
MOV AH, imm8  
MOV CH, imm8  
MOV DH, imm8  
MOV BH, imm8  
MOV EAX, imm16/32  
MOV ECX, imm16/32  
MOV EDX, imm16/32  
MOV EBX, imm16/32  
MOV ESP, imm16/32  
MOV EBP, imm16/32  
MOV ESI, imm16/32  
MOV EDI, imm16/32  
MOV mreg8, imm8  
MOV mem8, imm8  
MOV mreg16/32, imm16/32  
MOV mem16/32, imm16/32  
MOVSX reg16/32, mreg8  
MOVSX reg16/32, mem8  
MOVSX reg32, mreg16  
MOVSX reg32, mem16  
MOVZX reg16/32, mreg8  
MOVZX reg16/32, mem8  
MOVZX reg32, mreg16  
MOVZX reg32, mem16  
NEG mreg8  
NEG mem8  
NEG mreg16/32  
NEG mem16/32  
NOP (XCHG EAX, EAX)  
NOT mreg8  
NOT mem8  
NOT mreg16/32  
NOT mem16/32  
OR mreg8, reg8  
OR mem8, reg8  
OR mreg16/32, reg16/32  
OR mem16/32, reg16/32  
OR reg8, mreg8  
OR reg8, mem8  
OR reg16/32, mreg16/32  
OR reg16/32, mem16/32  
OR AL, imm8  
OR EAX, imm16/32  
OR mreg8, imm8  
OR mem8, imm8  
OR mreg16/32, imm16/32  
OR mem16/32, imm16/32  
OR mreg16/32, imm8 (sign extended)  
OR mem16/32, imm8 (sign extended)  
PUSH EAX  
PUSH ECX  
PUSH EDX  
PUSH EBX  
PUSH ESP  
PUSH EBP  
PUSH ESI  
PUSH EDI  
PUSH imm8  
PUSH imm16/32  
RCL mreg8, imm8  
RCL mreg16/32, imm8  
RCL mreg8, 1  
RCL mem8, 1  
RCL mreg16/32, 1  
RCL mem16/32, 1  
RCL mreg8, CL  
RCL mreg16/32, CL  
RCR mreg8, imm8  
RCR mreg16/32, imm8  
RCR mreg8, 1  
RCR mem8, 1  
RCR mreg16/32, 1  
RCR mem16/32, 1  
RCR mreg8, CL  
RCR mreg16/32, CL  
ROL mreg8, imm8  
ROL mem8, imm8  
ROL mreg16/32, imm8  
ROL mem16/32, imm8  
ROL mreg8, 1  
ROL mem8, 1  
ROL mreg16/32, 1  
ROL mem16/32, 1  
ROL mreg8, CL  
ROL mem8, CL  
ROL mreg16/32, CL  
ROL mem16/32, CL  
ROR mreg8, imm8  
ROR mem8, imm8  
ROR mreg16/32, imm8  
ROR mem16/32, imm8  
ROR mreg8, 1  
ROR mem8, 1  
ROR mreg16/32, 1  
ROR mem16/32, 1  
ROR mreg8, CL  
ROR mem8, CL  
ROR mreg16/32, CL  
ROR mem16/32, CL  
SAR mreg8, imm8  
SAR mem8, imm8  
SAR mreg16/32, imm8  
SAR mem16/32, imm8  
SAR mreg8, 1  
SAR mem8, 1  
SAR mreg16/32, 1  
SAR mem16/32, 1  
SAR mreg8, CL  
SAR mem8, CL  
SAR mreg16/32, CL  
SAR mem16/32, CL  
SBB mreg8, reg8  
SBB mem8, reg8  
SBB mreg16/32, reg16/32  
SBB mem16/32, reg16/32  
SBB reg8, mreg8  
SBB reg8, mem8  
SBB reg16/32, mreg16/32  
SBB reg16/32, mem16/32  
SBB AL, imm8  
SBB EAX, imm16/32  
SBB mreg8, imm8  
SBB mem8, imm8  
SBB mreg16/32, imm16/32  
SBB mem16/32, imm16/32  
SBB mreg16/32, imm8 (sign extended)  
SBB mem16/32, imm8 (sign extended)  
SETO mreg8  
SETO mem8  
SETNO mreg8  
SETNO mem8  
SETB/SETC/SETNAE mreg8  
SETB/SETC/SETNAE mem8  
SETAE/SETNB/SETNC mreg8  
SETAE/SETNB/SETNC mem8  
SETE/SETZ mreg8  
SETE/SETZ mem8  
SETNE/SETNZ mreg8  
SETNE/SETNZ mem8  
SETBE/SETNA mreg8  
SETBE/SETNA mem8  
SETA/SETNBE mreg8  
SETA/SETNBE mem8  
SETS mreg8  
SETS mem8  
SETNS mreg8  
SETNS mem8  
SETP/SETPE mreg8  
SETP/SETPE mem8  
SETNP/SETPO mreg8  
SETNP/SETPO mem8  
SETL/SETNGE mreg8  
SETL/SETNGE mem8  
SETGE/SETNL mreg8  
SETGE/SETNL mem8  
SETLE/SETNG mreg8  
SETLE/SETNG mem8  
SETG/SETNLE mreg8  
SETG/SETNLE mem8  
SHL/SAL mreg8, imm8  
SHL/SAL mem8, imm8  
SHL/SAL mreg16/32, imm8  
SHL/SAL mem16/32, imm8  
SHL/SAL mreg8, 1  
SHL/SAL mem8, 1  
SHL/SAL mreg16/32, 1  
SHL/SAL mem16/32, 1  
SHL/SAL mreg8, CL  
SHL/SAL mem8, CL  
SHL/SAL mreg16/32, CL  
SHL/SAL mem16/32, CL  
SHR mreg8, imm8  
SHR mem8, imm8  
SHR mreg16/32, imm8  
SHR mem16/32, imm8  
SHR mreg8, 1  
SHR mem8, 1  
SHR mreg16/32, 1  
SHR mem16/32, 1  
SHR mreg8, CL  
SHR mem8, CL  
SHR mreg16/32, CL  
SHR mem16/32, CL  
STC  
SUB mreg8, reg8  
SUB mem8, reg8  
SUB mreg16/32, reg16/32  
SUB mem16/32, reg16/32  
SUB reg8, mreg8  
SUB reg8, mem8  
SUB reg16/32, mreg16/32  
SUB reg16/32, mem16/32  
SUB AL, imm8  
SUB EAX, imm16/32  
SUB mreg8, imm8  
SUB mem8, imm8  
SUB mreg16/32, imm16/32  
SUB mem16/32, imm16/32  
SUB mreg16/32, imm8 (sign extended)  
SUB mem16/32, imm8 (sign extended)  
TEST mreg8, reg8  
TEST mem8, reg8  
TEST mreg16/32, reg16/32  
TEST mem16/32, reg16/32  
TEST AL, imm8  
TEST EAX, imm16/32  
TEST mreg8, imm8  
TEST mem8, imm8  
TEST mreg16/32, imm16/32  
TEST mem16/32, imm16/32  
WAIT  
XCHG EAX, EAX  
XOR mreg8, reg8  
XOR mem8, reg8  
XOR mreg16/32, reg16/32  
XOR mem16/32, reg16/32  
XOR reg8, mreg8  
XOR reg8, mem8  
XOR reg16/32, mreg16/32  
XOR reg16/32, mem16/32  
XOR AL, imm8  
XOR EAX, imm16/32  
XOR mreg8, imm8  
XOR mem8, imm8  
XOR mreg16/32, imm16/32  
XOR mem16/32, imm16/32  
XOR mreg16/32, imm8 (sign extended)  
XOR mem16/32, imm8 (sign extended)  
Table 26. DirectPath MMX™ Instructions  
Instruction Mnemonic  
EMMS  
MOVD mmreg, mem32  
MOVD mem32, mmreg  
MOVQ mmreg1, mmreg2  
MOVQ mmreg, mem64  
MOVQ mmreg2, mmreg1  
MOVQ mem64, mmreg  
PACKSSDW mmreg1, mmreg2  
PACKSSDW mmreg, mem64  
PACKSSWB mmreg1, mmreg2  
PACKSSWB mmreg, mem64  
PACKUSWB mmreg1, mmreg2  
PACKUSWB mmreg, mem64  
PADDB mmreg1, mmreg2  
PADDB mmreg, mem64  
PADDD mmreg1, mmreg2  
PADDD mmreg, mem64  
PADDSB mmreg1, mmreg2  
PADDSB mmreg, mem64  
PADDSW mmreg1, mmreg2  
PADDSW mmreg, mem64  
PADDUSB mmreg1, mmreg2  
PADDUSB mmreg, mem64  
PADDUSW mmreg1, mmreg2  
PADDUSW mmreg, mem64  
PADDW mmreg1, mmreg2  
PADDW mmreg, mem64  
PAND mmreg1, mmreg2  
PAND mmreg, mem64  
PANDN mmreg1, mmreg2  
PANDN mmreg, mem64  
PCMPEQB mmreg1, mmreg2  
PCMPEQB mmreg, mem64  
PCMPEQD mmreg1, mmreg2  
PCMPEQD mmreg, mem64  
PCMPEQW mmreg1, mmreg2  
PCMPEQW mmreg, mem64  
PCMPGTB mmreg1, mmreg2  
PCMPGTB mmreg, mem64  
PCMPGTD mmreg1, mmreg2  
PCMPGTD mmreg, mem64  
PCMPGTW mmreg1, mmreg2  
PCMPGTW mmreg, mem64  
PMADDWD mmreg1, mmreg2  
PMADDWD mmreg, mem64  
PMULHW mmreg1, mmreg2  
PMULHW mmreg, mem64  
PMULLW mmreg1, mmreg2  
PMULLW mmreg, mem64  
POR mmreg1, mmreg2  
POR mmreg, mem64  
PSLLD mmreg1, mmreg2  
PSLLD mmreg, mem64  
PSLLD mmreg, imm8  
PSLLQ mmreg1, mmreg2  
PSLLQ mmreg, mem64  
PSLLQ mmreg, imm8  
PSLLW mmreg1, mmreg2  
PSLLW mmreg, mem64  
PSLLW mmreg, imm8  
PSRAW mmreg1, mmreg2  
PSRAW mmreg, mem64  
PSRAW mmreg, imm8  
PSRAD mmreg1, mmreg2  
PSRAD mmreg, mem64  
PSRAD mmreg, imm8  
PSRLD mmreg1, mmreg2  
PSRLD mmreg, mem64  
PSRLD mmreg, imm8  
PSRLQ mmreg1, mmreg2  
PSRLQ mmreg, mem64  
PSRLQ mmreg, imm8  
PSRLW mmreg1, mmreg2  
PSRLW mmreg, mem64  
PSRLW mmreg, imm8  
PSUBB mmreg1, mmreg2  
PSUBB mmreg, mem64  
PSUBD mmreg1, mmreg2  
PSUBD mmreg, mem64  
PSUBSB mmreg1, mmreg2  
PSUBSB mmreg, mem64  
PSUBSW mmreg1, mmreg2  
PSUBSW mmreg, mem64  
PSUBUSB mmreg1, mmreg2  
PSUBUSB mmreg, mem64  
PSUBUSW mmreg1, mmreg2  
PSUBUSW mmreg, mem64  
PSUBW mmreg1, mmreg2  
PSUBW mmreg, mem64  
PUNPCKHBW mmreg1, mmreg2  
PUNPCKHBW mmreg, mem64  
PUNPCKHDQ mmreg1, mmreg2  
PUNPCKHDQ mmreg, mem64  
PUNPCKHWD mmreg1, mmreg2  
PUNPCKHWD mmreg, mem64  
PUNPCKLBW mmreg1, mmreg2  
PUNPCKLBW mmreg, mem64  
PUNPCKLDQ mmreg1, mmreg2  
PUNPCKLDQ mmreg, mem64  
PUNPCKLWD mmreg1, mmreg2  
PUNPCKLWD mmreg, mem64  
PXOR mmreg1, mmreg2  
PXOR mmreg, mem64  
Table 27. DirectPath MMX™ Extensions  
Instruction Mnemonic  
MOVNTQ mem64, mmreg  
PAVGB mmreg1, mmreg2  
PAVGB mmreg, mem64  
PAVGW mmreg1, mmreg2  
PAVGW mmreg, mem64  
PMAXSW mmreg1, mmreg2  
PMAXSW mmreg, mem64  
PMAXUB mmreg1, mmreg2  
PMAXUB mmreg, mem64  
PMINSW mmreg1, mmreg2  
PMINSW mmreg, mem64  
PMINUB mmreg1, mmreg2  
PMINUB mmreg, mem64  
PMULHUW mmreg1, mmreg2  
PMULHUW mmreg, mem64  
PSADBW mmreg1, mmreg2  
PSADBW mmreg, mem64  
PSHUFW mmreg1, mmreg2, imm8  
PSHUFW mmreg, mem64, imm8  
PREFETCHNTA mem8  
PREFETCHT0 mem8  
PREFETCHT1 mem8  
PREFETCHT2 mem8  
Table 28. DirectPath Floating-Point Instructions  
Instruction Mnemonic  
FABS  
FADD ST, ST(i)  
FADD [mem32real]  
FADD ST(i), ST  
FADD [mem64real]  
FADDP ST(i), ST  
FCHS  
FCOM ST(i)  
FCOMP ST(i)  
FCOM [mem32real]  
FCOM [mem64real]  
FCOMP [mem32real]  
FCOMP [mem64real]  
FCOMPP  
FDECSTP  
FDIV ST, ST(i)  
FDIV ST(i), ST  
FDIV [mem32real]  
FDIV [mem64real]  
FDIVP ST, ST(i)  
FDIVR ST, ST(i)  
FDIVR ST(i), ST  
FDIVR [mem32real]  
FDIVR [mem64real]  
FDIVRP ST(i), ST  
FFREE ST(i)  
FFREEP ST(i)  
FILD [mem16int]  
FILD [mem32int]  
FILD [mem64int]  
FIMUL [mem32int]  
FIMUL [mem16int]  
FINCSTP  
FIST [mem16int]  
FIST [mem32int]  
FISTP [mem16int]  
FISTP [mem32int]  
FISTP [mem64int]  
FLD ST(i)  
FLD [mem32real]  
FLD [mem64real]  
FLD [mem80real]  
FLD1  
FLDL2E  
FLDL2T  
FLDLG2  
FLDLN2  
FLDPI  
FLDZ  
FMUL ST, ST(i)  
FMUL ST(i), ST  
FMUL [mem32real]  
FMUL [mem64real]  
FMULP ST, ST(i)  
FNOP  
FPREM  
FPREM1  
FSQRT  
FST [mem32real]  
FST [mem64real]  
FST ST(i)  
FSTP [mem32real]  
FSTP [mem64real]  
FSTP [mem80real]  
FSTP ST(i)  
FSUB [mem32real]  
FSUB [mem64real]  
FSUB ST, ST(i)  
FSUB ST(i), ST  
FSUBP ST, ST(i)  
FSUBR [mem32real]  
FSUBR [mem64real]  
FSUBR ST, ST(i)  
FSUBR ST(i), ST  
FSUBRP ST(i), ST  
FTST  
FUCOM  
FUCOMP  
FUCOMPP  
FWAIT  
FXCH  
VectorPath Instructions  
The following tables list the VectorPath instructions, which  
should be avoided in code optimized for the AMD Athlon™ processor:  
Table 29. VectorPath Integer Instructions  
Instruction Mnemonic  
AAA  
AAD  
AAM  
AAS  
ARPL mreg16, reg16  
ARPL mem16, reg16  
BOUND  
BSF reg16/32, mreg16/32  
BSF reg16/32, mem16/32  
BSR reg16/32, mreg16/32  
BSR reg16/32, mem16/32  
BT mem16/32, reg16/32  
BTC mreg16/32, reg16/32  
BTC mem16/32, reg16/32  
BTC mreg16/32, imm8  
BTC mem16/32, imm8  
BTR mreg16/32, reg16/32  
BTR mem16/32, reg16/32  
BTR mreg16/32, imm8  
BTR mem16/32, imm8  
BTS mreg16/32, reg16/32  
BTS mem16/32, reg16/32  
BTS mreg16/32, imm8  
BTS mem16/32, imm8  
CALL full pointer  
CALL near imm16/32  
CALL mem16:16/32  
CALL near mreg32 (indirect)  
CALL near mem32 (indirect)  
CLD  
CLI  
CLTS  
CMPSB mem8, mem8  
CMPSW mem16, mem16  
CMPSD mem32, mem32  
CMPXCHG mreg8, reg8  
CMPXCHG mem8, reg8  
CMPXCHG mreg16/32, reg16/32  
CMPXCHG mem16/32, reg16/32  
CMPXCHG8B mem64  
CPUID  
DAA  
DAS  
DIV AL, mreg8  
DIV AL, mem8  
DIV EAX, mreg16/32  
DIV EAX, mem16/32  
ENTER  
IDIV mreg8  
IDIV mem8  
IDIV EAX, mreg16/32  
IDIV EAX, mem16/32  
IMUL reg16/32, imm16/32  
IMUL reg16/32, mreg16/32, imm16/32  
IMUL reg16/32, mem16/32, imm16/32  
IMUL reg16/32, imm8 (sign extended)  
IMUL reg16/32, mreg16/32, imm8 (signed)  
IMUL reg16/32, mem16/32, imm8 (signed)  
IMUL AX, AL, mreg8  
IMUL AX, AL, mem8  
IMUL EDX:EAX, EAX, mreg16/32  
IMUL EDX:EAX, EAX, mem16/32  
IMUL reg16/32, mreg16/32  
IMUL reg16/32, mem16/32  
IN AL, imm8  
IN AX, imm8  
IN EAX, imm8  
IN AL, DX  
IN AX, DX  
IN EAX, DX  
INVD  
INVLPG  
JCXZ/JECXZ short disp8  
JMP far disp32/48 (direct)  
JMP far mem32 (indirect)  
JMP far mreg32 (indirect)  
LAHF  
LAR reg16/32, mreg16/32  
LAR reg16/32, mem16/32  
LDS reg16/32, mem32/48  
LEA reg16, mem16/32  
LEAVE  
LES reg16/32, mem32/48  
LFS reg16/32, mem32/48  
LGDT mem48  
LGS reg16/32, mem32/48  
LIDT mem48  
LLDT mreg16  
LLDT mem16  
LMSW mreg16  
LMSW mem16  
LODSB AL, mem8  
LODSW AX, mem16  
LODSD EAX, mem32  
LOOP disp8  
LOOPE/LOOPZ disp8  
LOOPNE/LOOPNZ disp8  
LSL reg16/32, mreg16/32  
LSL reg16/32, mem16/32  
LSS reg16/32, mem32/48  
LTR mreg16  
LTR mem16  
MOV mreg16, segment reg  
MOV mem16, segment reg  
MOV segment reg, mreg16  
MOV segment reg, mem16  
MOVSB mem8, mem8  
MOVSW mem16, mem16  
MOVSD mem32, mem32  
MUL AL, mreg8  
MUL AL, mem8  
MUL AX, mreg16  
MUL AX, mem16  
MUL EAX, mreg32  
MUL EAX, mem32  
OUT imm8, AL  
OUT imm8, AX  
OUT imm8, EAX  
OUT DX, AL  
OUT DX, AX  
OUT DX, EAX  
POP ES  
POP SS  
POP DS  
POP FS  
POP GS  
POP EAX  
POP ECX  
POP EDX  
POP EBX  
POP ESP  
POP EBP  
POP ESI  
POP EDI  
POP mreg16/32  
POP mem16/32  
POPA/POPAD  
POPF/POPFD  
PUSH ES  
PUSH CS  
PUSH FS  
PUSH GS  
PUSH SS  
PUSH DS  
PUSH mreg16/32  
PUSH mem16/32  
PUSHA/PUSHAD  
PUSHF/PUSHFD  
RCL mem8, imm8  
RCL mem16/32, imm8  
RCL mem8, CL  
RCL mem16/32, CL  
RCR mem8, imm8  
RCR mem16/32, imm8  
RCR mem8, CL  
RCR mem16/32, CL  
RDMSR  
RDPMC  
RDTSC  
RET near imm16  
RET near  
RET far imm16  
RET far  
SAHF  
SCASB AL, mem8  
SCASW AX, mem16  
SCASD EAX, mem32  
SGDT mem48  
SIDT mem48  
SHLD mreg16/32, reg16/32, imm8  
SHLD mem16/32, reg16/32, imm8  
SHLD mreg16/32, reg16/32, CL  
SHLD mem16/32, reg16/32, CL  
SHRD mreg16/32, reg16/32, imm8  
SHRD mem16/32, reg16/32, imm8  
SHRD mreg16/32, reg16/32, CL  
SHRD mem16/32, reg16/32, CL  
SLDT mreg16  
SLDT mem16  
SMSW mreg16  
SMSW mem16  
STD  
STI  
STOSB mem8, AL  
STOSW mem16, AX  
STOSD mem32, EAX  
STR mreg16  
STR mem16  
SYSCALL  
SYSENTER  
SYSEXIT  
SYSRET  
VERR mreg16  
VERR mem16  
VERW mreg16  
VERW mem16  
WBINVD  
WRMSR  
XADD mreg8, reg8  
XADD mem8, reg8  
XADD mreg16/32, reg16/32  
XADD mem16/32, reg16/32  
XCHG reg8, mreg8  
XCHG reg8, mem8  
XCHG reg16/32, mreg16/32  
XCHG reg16/32, mem16/32  
XCHG EAX, ECX  
XCHG EAX, EDX  
XCHG EAX, EBX  
XCHG EAX, ESP  
XCHG EAX, EBP  
XCHG EAX, ESI  
XCHG EAX, EDI  
XLAT  
Table 30. VectorPath MMX™ Instructions  
Instruction Mnemonic  
MOVD mmreg, mreg32  
MOVD mreg32, mmreg  
Table 31. VectorPath MMX™ Extensions  
Instruction Mnemonic  
MASKMOVQ mmreg1, mmreg2  
PEXTRW reg32, mmreg, imm8  
PINSRW mmreg, reg32, imm8  
PINSRW mmreg, mem16, imm8  
PMOVMSKB reg32, mmreg  
SFENCE  
Table 32. VectorPath Floating-Point Instructions  
Instruction Mnemonic  
F2XM1  
FBLD [mem80]  
FBSTP [mem80]  
FCLEX  
FCMOVB ST(0), ST(i)  
FCMOVE ST(0), ST(i)  
FCMOVBE ST(0), ST(i)  
FCMOVU ST(0), ST(i)  
FCMOVNB ST(0), ST(i)  
FCMOVNE ST(0), ST(i)  
FCMOVNBE ST(0), ST(i)  
FCMOVNU ST(0), ST(i)  
FCOMI ST, ST(i)  
FCOMIP ST, ST(i)  
FCOS  
FIADD [mem32int]  
FIADD [mem16int]  
FICOM [mem32int]  
FICOM [mem16int]  
FICOMP [mem32int]  
FICOMP [mem16int]  
FIDIV [mem32int]  
FIDIV [mem16int]  
FIDIVR [mem32int]  
FIDIVR [mem16int]  
FIMUL [mem32int]  
FIMUL [mem16int]  
FINIT  
FISUB [mem32int]  
FISUB [mem16int]  
FISUBR [mem32int]  
FISUBR [mem16int]  
FLD [mem80real]  
FLDCW [mem16]  
FLDENV [mem14byte]  
FLDENV [mem28byte]  
FPATAN  
FPTAN  
FRNDINT  
FRSTOR [mem94byte]  
FRSTOR [mem108byte]  
FSAVE [mem94byte]  
FSAVE [mem108byte]  
FSCALE  
FSIN  
FSINCOS  
FSTCW [mem16]  
FSTENV [mem14byte]  
FSTENV [mem28byte]  
FSTP [mem80real]  
FSTSW AX  
FSTSW [mem16]  
FUCOMI ST, ST(i)  
FUCOMIP ST, ST(i)  
FXAM  
FXTRACT  
FYL2X  
FYL2XP1  
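Many VectorPath instructions in the tables above have simple DirectPath equivalents. As one illustration, the VectorPath LOOP instruction can be replaced by a DEC/JNZ pair, both of which appear in Table 25 as DirectPath integer instructions. This is a minimal sketch: the label $loop_top is hypothetical, the counter is assumed to be in ECX, and the loop body must tolerate the flag updates performed by DEC, which LOOP does not perform:

```
        ; VectorPath form (avoid):
        ;       loop    $loop_top
        ;
        ; DirectPath replacement, counter assumed in ECX:
        dec     ecx             ;DirectPath: decrement loop counter
        jnz     $loop_top       ;DirectPath: repeat while ECX != 0
```

The same pattern generalizes to other counter registers, which LOOP cannot use.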
Index  
A  
AMD Athlon™ Processor  
B  
Blended Code, AMD-K6 and AMD Athlon Processors  
Branches  
D  
DirectPath  
F  
Floating-Point  
I  
Instruction  
L  
Loops  
M  
Memory  
Multiplication  
P  
Pointers  
Prefetch  
S  
Stack  