AMD Athlon™ Processor
x86 Code Optimization Guide

22007E/0—November 1999
List of Tables

Table 21.  MMX™ Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Table 23.  3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . 217
Table 24.  3DNow!™ Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 218
Revision History

Date       Rev   Description
Nov. 1999  E     Rearranged the appendices. Added Index.
1  Introduction
The AMD Athlon™ processor is the newest microprocessor in  
the AMD K86™ family of microprocessors. The advances in the  
AMD Athlon processor take superscalar operation and  
out-of-order execution to a new level. The AMD Athlon  
processor has been designed to efficiently execute code written  
for previous-generation x86 processors. However, to enable the  
fastest code execution with the AMD Athlon processor,  
programmers should write software that includes specific code  
optimization techniques.  
About this Document  
This document contains information to assist programmers in  
creating optimized code for the AMD Athlon processor. In  
addition to compiler and assembler designers, this document is targeted at C and assembly language programmers writing execution-sensitive code sequences.  
This document assumes that the reader possesses in-depth  
knowledge of the x86 instruction set, the x86 architecture  
(registers, programming modes, etc.), and the IBM PC-AT  
platform.  
This guide has been written specifically for the AMD Athlon processor, but it includes considerations for previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:  
Chapter 1: Introduction. Outlines the material covered in this  
document. Summarizes the AMD Athlon microarchitecture.  
Chapter 2: Top Optimizations. Provides convenient descriptions of  
the most important optimizations a programmer should take  
into consideration.  
Chapter 3: C Source Level Optimizations. Describes optimizations that  
C/C++ programmers can implement.  
Chapter 4: Instruction Decoding Optimizations. Describes methods that  
will make the most efficient use of the three sophisticated  
instruction decoders in the AMD Athlon processor.  
Chapter 5: Cache and Memory Optimizations. Describes optimizations that make efficient use of the large L1 caches and high-bandwidth buses of the AMD Athlon processor.  
Chapter 6: Branch Optimizations. Describes optimizations that improve branch prediction and minimize branch penalties.  
Chapter 7: Scheduling Optimizations. Describes optimizations that improve code scheduling for efficient execution resource utilization.  
Chapter 8: Integer Optimizations. Describes optimizations that improve integer arithmetic and make efficient use of the integer execution units in the AMD Athlon processor.  
Chapter 9: Floating-Point Optimizations. Describes optimizations that make maximum use of the superscalar and pipelined floating-point unit (FPU) of the AMD Athlon processor.  
Chapter 10: 3DNow!™ and MMX™ Optimizations. Describes guidelines for Enhanced 3DNow! and MMX code optimization techniques.  
Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors.  
Appendix A: AMD Athlon Processor Microarchitecture. Describes in  
detail the microarchitecture of the AMD Athlon processor.  
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.  
Appendix C: Implementation of Write Combining. Describes the write-combining algorithm used by the AMD Athlon processor.  
Appendix D: Performance Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.  
Appendix E: Programming the MTRR and PAT. Describes the steps needed to program the Memory Type Range Registers and the Page Attribute Table.  
Appendix F: Instruction Dispatch and Execution Resources. Lists the execution resource usage of the instructions.  
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86  
instructions that are DirectPath and VectorPath instructions.  
AMD Athlon™ Processor Family  
The AMD Athlon processor family uses state-of-the-art  
decoupled decode/execution design techniques to deliver  
next-generation performance with x86 binary software  
compatibility. This next-generation processor family advances  
x86 code execution by using flexible instruction predecoding,  
wide and balanced decoders, aggressive out-of-order execution,  
parallel integer execution pipelines, parallel floating-point  
execution pipelines, deep pipelined execution for higher  
delivered operating frequency, dedicated backside cache  
memory, and a new high-performance double-rate 64-bit local  
bus. As an x86 binary-compatible processor, the AMD Athlon  
processor implements the industry-standard x86 instruction set  
by decoding and executing the x86 instructions using a  
proprietary microarchitecture. This microarchitecture allows  
the delivery of maximum performance when running x86-based  
PC software.  
AMD Athlon™ Processor Microarchitecture Summary  
The AMD Athlon processor brings superscalar performance  
and high operating frequency to PC systems running  
industry-standard x86 software. A brief summary of the  
next-generation design features implemented in the  
AMD Athlon processor is as follows:  
- High-speed double-rate local bus interface  
- Large, split 128-Kbyte level-one (L1) cache  
- Dedicated backside level-two (L2) cache  
- Instruction predecode and branch detection during cache line fills  
- Decoupled decode/execution core  
- Three-way x86 instruction decoding  
- Dynamic scheduling and speculative execution  
- Three-way integer execution  
- Three-way address generation  
- Three-way floating-point execution  
- 3DNow!™ technology and MMX™ single-instruction multiple-data (SIMD) instruction extensions  
- Super data forwarding  
- Deep out-of-order integer and floating-point execution  
- Register renaming  
- Dynamic branch prediction  
The AMD Athlon processor communicates through a next-generation high-speed local bus that is beyond the current Socket 7 or Super7™ bus standard. The local bus can transfer data at twice the rate of the bus operating frequency by using both the rising and falling edges of the clock.  
To reduce on-chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls, the AMD Athlon processor has a dedicated high-speed backside L2 cache. The large 128-Kbyte L1 on-chip cache and the backside L2 cache allow the AMD Athlon execution core to achieve and sustain maximum performance.  
As a decoupled decode/execution processor, the AMD Athlon  
processor makes use of a proprietary microarchitecture, which  
defines the heart of the AMD Athlon processor. With the  
inclusion of all these features, the AMD Athlon processor is  
capable of decoding, issuing, executing, and retiring multiple  
x86 instructions per cycle, resulting in superior, scalable performance.  
The AMD Athlon processor includes both the industry-standard  
MMX SIMD integer instructions and the 3DNow! SIMD  
floating-point instructions that were first introduced in the  
AMD-K6®-2 processor. The design of 3DNow! technology was  
based on suggestions from leading graphics and independent  
software vendors (ISVs). Using SIMD format, the AMD Athlon  
processor can generate up to four 32-bit, single-precision  
floating-point results per clock cycle.  
The 3DNow! execution units allow for high-performance  
floating-point vector operations, which can replace x87  
instructions and enhance the performance of 3D graphics and  
other floating-point-intensive applications. Because the  
3DNow! architecture uses the same registers as the MMX  
instructions, switching between MMX and 3DNow! has no  
penalty.  
The AMD Athlon processor designers took another innovative  
step by carefully integrating the traditional x87 floating-point,  
MMX, and 3DNow! execution units into one operational engine.  
With the introduction of the AMD Athlon processor, the  
switching overhead between x87, MMX, and 3DNow!  
technology is virtually eliminated. The AMD Athlon processor  
combined with 3DNow! technology brings a better multimedia  
experience to mainstream PC users while maintaining  
backwards compatibility with all existing x86 software.  
Although the AMD Athlon processor can extract code  
parallelism on-the-fly from off-the-shelf, commercially available  
x86 software, specific code optimization for the AMD Athlon  
processor can result in even higher delivered performance. This  
document describes the proprietary microarchitecture in the  
AMD Athlon processor and makes recommendations for  
optimizing execution of x86 software on the processor.  
The coding techniques for achieving peak performance on the  
AMD Athlon processor include, but are not limited to, those for  
the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium  
II processors. However, many of these optimizations are not  
necessary for the AMD Athlon processor to achieve maximum  
performance. Due to the more flexible pipeline control and  
aggressive out-of-order execution, the AMD Athlon processor is  
not as sensitive to instruction selection and code scheduling.  
This flexibility is one of the distinct advantages of the  
AMD Athlon processor.  
The AMD Athlon processor uses the latest in processor  
microarchitecture design techniques to provide the highest x86  
performance for today's PC. In short, the AMD Athlon  
processor offers true next-generation performance with x86  
binary software compatibility.  
2  Top Optimizations
This chapter contains concise descriptions of the best  
optimizations for improving the performance of the  
AMD Athlon™ processor. Subsequent chapters contain more  
detailed descriptions of these and other optimizations. The  
optimizations in this chapter are divided into two groups and  
listed in order of importance.  
Group I: Essential Optimizations

Group I contains essential optimizations. Users should follow these critical guidelines closely. The optimizations in Group I are as follows:  
- Memory Size and Alignment Issues  
  - Avoid memory size mismatches  
  - Align data where possible  
- Use the 3DNow!™ PREFETCH and PREFETCHW Instructions  
- Select DirectPath Over VectorPath Instructions  
Group II: Secondary Optimizations

Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:  
- Load-Execute Instruction Usage  
  - Use load-execute instructions  
  - Avoid load-execute floating-point instructions with integer operands  
- Take Advantage of Write Combining  
- Use 3DNow! Instructions  
- Avoid Branches Dependent on Random Data  
- Avoid Placing Code and Data in the Same 64-Byte Cache Line  
Optimization Star

The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.

Group I Optimizations: Essential Optimizations
Memory Size and Alignment Issues  
Avoid Memory Size Mismatches

Avoid memory size mismatches when instructions operate on the same data. For instructions that store and reload the same data, keep operands aligned and keep the loads/stores of each operand the same size.  
Align Data Where Possible

Avoid misaligned data references. A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline.  
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions

For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All of the prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine the prefetch distance:

Prefetch Length = 200 × (DS/C)

- Round up to the nearest cache line.  
- DS is the data stride per loop iteration.  
- C is the number of cycles per loop iteration when hitting in the L1 cache.  

See "Use the 3DNow!™ PREFETCH and PREFETCHW Instructions" on page 46 for more details.  
Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized to decode and execute efficiently by minimizing the number of operations per x86 instruction. Three DirectPath instructions can be decoded in parallel. Using VectorPath instructions blocks DirectPath instructions from decoding simultaneously.

See Appendix G, "DirectPath versus VectorPath Instructions," on page 219 for a list of DirectPath and VectorPath instructions.  
Group II Optimizations: Secondary Optimizations  
Load-Execute Instruction Usage  
Use Load-Execute Instructions

Wherever possible, use load-execute instructions to increase code density, with the one exception described below. The split-instruction form of load-execute instructions can be used to avoid scheduler stalls for longer-executing instructions and to explicitly schedule the load and execute operations.  
Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle.  
Take Advantage of Write Combining

This guideline applies only to operating system, device driver, and BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache-line-aligned write buffer.

See Appendix C, "Implementation of Write Combining," on page 155 for more details.  
Use 3DNow!™ Instructions

Unless accuracy requirements dictate otherwise, perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! instructions achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide for a flat register file instead of the stack-based approach of x87 instructions.

See Table 23 on page 217 for a list of 3DNow! instructions. For information about instruction usage, see the 3DNow!™ Technology Manual, order# 21928.  
Avoid Branches Dependent on Random Data

Avoid data-dependent branches around a single instruction. Data-dependent branches acting upon basically random data can cause the branch prediction logic to mispredict the branch about 50% of the time. Design branch-free alternative code sequences, which result in shorter average execution time.  
Avoid Placing Code and Data in the Same 64-Byte Cache Line  
Consider that the AMD Athlon processor cache line is twice the size of previous processors'. Code and data should not be shared in the same 64-byte cache line, especially if the data ever becomes modified. In order to maintain cache coherency, the AMD Athlon processor may thrash its caches, resulting in lower performance.

In general, the following should be avoided:  
- Self-modifying code  
- Storing data in code segments  

See "Avoid Placing Code and Data in the Same 64-Byte Cache Line" on page 50 for more details.  
3  C Source Level Optimizations
This chapter details C programming practices for optimizing  
code for the AMD Athlon™ processor. Guidelines are listed in  
order of importance.  
Ensure Floating-Point Variables and Expressions are of  
Type Float  
For compilers that generate 3DNow!™ instructions, make sure that all floating-point variables and expressions are of type float. Pay special attention to floating-point constants. These require a suffix of F or f (for example, 3.14f) in order to be of type float; otherwise, they default to type double. To avoid automatic promotion of float arguments to double, always use function prototypes for all functions that accept float arguments.  
Use 32-Bit Data Types for Integer Code  
Use 32-bit data types for integer code. Compiler implementations vary, but typically the following data types are included: int, signed, signed int, unsigned, unsigned int, long, signed long, long int, signed long int, unsigned long, and unsigned long int.  
Consider the Sign of Integer Operands  
In many cases, the data stored in integer variables determines  
whether a signed or an unsigned integer type is appropriate.  
For example, to record the weight of a person in pounds, no  
negative numbers are required so an unsigned type is  
appropriate. However, recording temperatures in degrees  
Celsius may require both positive and negative numbers so a  
signed type is needed.  
Where there is a choice of using either a signed or an unsigned  
type, it should be considered that certain operations are faster  
with unsigned types while others are faster for signed types.  
Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 FPU provides  
instructions for converting signed integers to floating-point, but  
has no instructions for converting unsigned integers. In a  
typical case, a 32-bit integer is converted as follows:  
Example 1 (Avoid):
double x;          ====>   MOV   [temp+4], 0
unsigned int i;            MOV   EAX, i
x = i;                     MOV   [temp], EAX
                           FILD  QWORD PTR [temp]
                           FSTP  QWORD PTR [x]
This code is slow not only because of the number of instructions but also because a size mismatch prevents store-to-load forwarding to the FILD instruction.  
Example 2 (Preferred):
double x;          ====>   FILD  DWORD PTR [i]
int i;                     FSTP  QWORD PTR [x]
x = i;
Computing quotients and remainders in integer division by constants is faster when performed on unsigned types. In a typical case, a 32-bit integer is divided by four as follows:  
Example 1 (Avoid):
int i;             ====>   MOV   EAX, i
i = i / 4;                 CDQ
                           AND   EDX, 3
                           ADD   EAX, EDX
                           SAR   EAX, 2
                           MOV   i, EAX
Example 2 (Preferred):
unsigned int i;    ====>   SHR   i, 2
i = i / 4;
In summary:  
- Use unsigned types for:  
  - Division and remainders  
  - Loop counters  
  - Array indexing  
- Use signed types for:  
  - Integer-to-float conversion  
Use Array Style Instead of Pointer Style Code  
The use of pointers in C makes work difficult for the optimizers  
in C compilers. Without detailed and aggressive pointer  
analysis, the compiler has to assume that writes through a  
pointer can write to any place in memory. This includes storage  
allocated to other variables, creating the issue of aliasing, i.e.,  
the same block of memory is accessible in more than one way.  
In order to help the optimizer of the C compiler in its analysis,  
avoid the use of pointers where possible. One example where  
this is trivially possible is in the access of data organized as  
arrays. C allows the use of either the array operator [] or  
pointers to access the array. Using array-style code makes the  
task of the optimizer easier by reducing possible aliasing.  
For example, x[0] and x[2] cannot possibly refer to the same memory location, while *p and *q could. It is highly  
recommended to use the array style, as significant performance  
advantages can be achieved with most compilers.  
Note that source code transformations interact with a compiler's code generator, and it is difficult to control the generated machine code from the source level. It is even  
possible that source code transformations for improving  
performance and compiler optimizations "fight" each other.  
Depending on the compiler and the specific source code it is  
therefore possible that pointer style code will be compiled into  
machine code that is faster than that generated from equivalent  
array style code. It is advisable to check the performance after  
any source code transformation to see whether performance  
indeed increased.  
Example 1 (Avoid):
typedef struct {
    float x,y,z,w;
} VERTEX;

typedef struct {
    float m[4][4];
} MATRIX;

void XForm (float *res, const float *v, const float *m, int numverts)
{
    float dp;
    int i;
    const VERTEX* vv = (VERTEX *)v;

    for (i = 0; i < numverts; i++) {
        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed x */

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed y */

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed z */

        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;      /* write transformed w */

        ++vv;             /* next input vertex */
        m -= 16;          /* reset to start of transform matrix */
    }
}
Example 2 (Preferred):
typedef struct {
    float x,y,z,w;
} VERTEX;

typedef struct {
    float m[4][4];
} MATRIX;

void XForm (float *res, const float *v, const float *m, int numverts)
{
    int i;
    const VERTEX* vv = (VERTEX *)v;
    const MATRIX* mm = (MATRIX *)m;
    VERTEX* rr = (VERTEX *)res;

    for (i = 0; i < numverts; i++) {
        rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
                vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
        rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
                vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
        rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
                vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
        rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
                vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
        ++rr;             /* next output vertex */
        ++vv;             /* next input vertex */
    }
}
Completely Unroll Small Loops

Take advantage of the AMD Athlon processor's large 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops. For loops that have a small fixed loop count and a small loop body, completely unrolling the loops at the source level is recommended.  
Example 1 (Avoid):
// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i = 0; i < 4; i++) {
    r[i] = 0;
    for (j = 0; j < 4; j++) {
        r[i] += M[j][i]*V[j];
    }
}

Example 2 (Preferred):
// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];
Avoid Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read back shortly thereafter. The AMD Athlon processor contains hardware to accelerate such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, it is still faster to avoid such dependencies altogether and keep the data in an internal register.  
Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as might occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the code in a way that avoids the store-to-load dependency. In some  
instances the language definition may prohibit the compiler  
from using code transformations that would remove the store-  
to-load dependency. It is therefore recommended that the  
programmer remove the dependency manually, e.g., by  
introducing a temporary variable that can be kept in a register.  
This can result in a significant performance increase. The  
following is an example of this.  
Example 1 (Avoid):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;

for (k = 1; k < VECLEN; k++) {
    x[k] = x[k-1] + y[k];
}

for (k = 1; k < VECLEN; k++) {
    x[k] = z[k] * (y[k] - x[k-1]);
}

Example 2 (Preferred):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;
double t;

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = t + y[k];
    x[k] = t;
}

t = x[0];
for (k = 1; k < VECLEN; k++) {
    t = z[k] * (y[k] - t);
    x[k] = t;
}
Consider Expression Order in Compound Branch Conditions  
Branch conditions in C programs are often compound  
conditions consisting of multiple boolean expressions joined by  
the boolean operators && and ||. C guarantees a short-circuit  
evaluation of these operators. This means that in the case of ||,  
the first operand to evaluate to TRUE terminates the  
evaluation, i.e., following operands are not evaluated at all.  
Similarly for &&, the first operand to evaluate to FALSE  
terminates the evaluation. Because of this short-circuit  
evaluation, it is not always possible to swap the operands of ||  
and &&. This is especially the case when the evaluation of one  
of the operands causes a side effect. However, in most cases the  
exchange of operands is possible.  
When used to control conditional branches, expressions  
involving || and && are translated into a series of conditional  
branches. The ordering of the conditional branches is a function  
of the ordering of the expressions in the compound condition,  
and can have a significant impact on performance. It is  
unfortunately not possible to give an easy, closed-form formula  
on how to order the conditions. Overall performance is a  
function of a variety of factors, including the following:  
- the probability of a branch mispredict for each of the branches generated  
- the additional latency incurred due to a branch mispredict  
- the cost of evaluating the conditions controlling each of the branches generated  
- the amount of parallelism that can be extracted in evaluating the branch conditions  
- the data stream consumed by the application (mostly due to the dependence of mispredict probabilities on the nature of the incoming data in data-dependent branches)  
It is therefore recommended to experiment with the ordering of
expressions in compound branch conditions in the most active
areas of a program (so-called hot spots), where most of the
execution time is spent. Such hot spots can be found through
the use of profiling. Feed a "typical" data stream to the
program while doing the experiments.
Switch Statement Usage  
Optimize Switch Statements  
Switch statements are translated using a variety of algorithms.
The most common of these are jump tables and comparison
chains/trees. It is recommended to sort the cases of a switch
statement according to the probability of occurrence, with the
most probable first. This improves performance when the
switch is translated as a comparison chain. It is further
recommended to make the case labels small, contiguous
integers, as this allows the switch to be translated as a jump
table.
Example 1 (Avoid):
int days_in_month, short_months, normal_months, long_months;
switch (days_in_month) {
   case 28:
   case 29: short_months++; break;
   case 30: normal_months++; break;
   case 31: long_months++; break;
   default: printf ("month has fewer than 28 or more than 31 days\n");
}

Example 2 (Preferred):
int days_in_month, short_months, normal_months, long_months;
switch (days_in_month) {
   case 31: long_months++; break;
   case 30: normal_months++; break;
   case 28:
   case 29: short_months++; break;
   default: printf ("month has fewer than 28 or more than 31 days\n");
}
Use Prototypes for All Functions  
In general, use prototypes for all functions. Prototypes can  
convey additional information to the compiler that might  
enable more aggressive optimizations.  
Use Const Type Qualifier  
Use the const type qualifier as much as possible. This
optimization makes code more robust and may enable higher
performance code to be generated due to the additional
information available to the compiler. For example, the C
standard allows compilers to not allocate storage for objects
that are declared const, if their address is never taken.
Generic Loop Hoisting  
To improve the performance of inner loops, it is beneficial to  
reduce redundant constant calculations (i.e., loop invariant  
calculations). However, this idea can be extended to invariant  
control structures.  
The first case is that of a constant "if()" statement inside a
"for()" loop.
Example 1:  
for( i ... ) {
   if( CONSTANT0 ) {
      DoWork0( i );   // does not affect CONSTANT0
   } else {
      DoWork1( i );   // does not affect CONSTANT0
   }
}
The above loop should be transformed into:  
if( CONSTANT0 ) {
   for( i ... ) {
      DoWork0( i );
   }
} else {
   for( i ... ) {
      DoWork1( i );
   }
}
This makes the inner loops tighter by avoiding the repeated
evaluation of a known "if()" control structure. Although the
branch would be easily predicted, the extra instructions and
decode limitations imposed by branching are avoided, which is
usually well worth it.
Generalization for Multiple Constant Control Code  
To generalize this further for multiple constant control code,
some more work may have to be done to create the proper outer
loop. Enumerating the constant cases reduces this to a simple
switch statement.
Example 2:  
for( i ... ) {
   if( CONSTANT0 ) {
      DoWork0( i );   // does not affect CONSTANT0 or CONSTANT1
   } else {
      DoWork1( i );   // does not affect CONSTANT0 or CONSTANT1
   }
   if( CONSTANT1 ) {
      DoWork2( i );   // does not affect CONSTANT0 or CONSTANT1
   } else {
      DoWork3( i );   // does not affect CONSTANT0 or CONSTANT1
   }
}
The above loop should be transformed into:  
#define combine( c1, c2 ) (((c1) << 1) + (c2))

switch( combine( CONSTANT0!=0, CONSTANT1!=0 ) ) {
   case combine( 0, 0 ):
      for( i ... ) {
         DoWork0( i );
         DoWork2( i );
      }
      break;
   case combine( 1, 0 ):
      for( i ... ) {
         DoWork1( i );
         DoWork2( i );
      }
      break;
   case combine( 0, 1 ):
      for( i ... ) {
         DoWork0( i );
         DoWork3( i );
      }
      break;
   case combine( 1, 1 ):
      for( i ... ) {
         DoWork1( i );
         DoWork3( i );
      }
      break;
   default:
      break;
}
The trick here is that there is some up-front work involved in
generating all the combinations for the switch constant, and the
total amount of code has doubled. However, it is also clear that
the inner loops are "if()-free". In ideal cases where the
DoWork*() functions are inlined, the successive functions have
greater overlap, leading to greater parallelism than would be
possible in the presence of intervening "if()" statements.
The same idea can be applied to constant "switch()" statements,
or to combinations of "switch()" and "if()" statements inside
"for()" loops. The method for combining the input constants
gets more complicated, but the performance benefit may well be
worth it.
However, the number of inner loops can also substantially
increase. If the number of inner loops is prohibitively high, only
the most common cases need to be dealt with directly, and the
remaining cases can fall back to the old code in the "default:"
clause of the "switch()" statement.
This typically comes up when the programmer is considering  
runtime generated code. While runtime generated code can  
lead to similar levels of performance improvement, it is much  
harder to maintain, and the developer must do their own  
optimizations for their code generation without the help of an  
available compiler.  
Declare Local Functions as Static  
Functions that are not used outside the file in which they are  
defined should always be declared static, which forces internal  
linkage. Otherwise, such functions default to external linkage,  
which might inhibit certain optimizations with some compilers,
for example, aggressive inlining.
Dynamic Memory Allocation Consideration  
Dynamic memory allocation (malloc in the C language) should
always return a pointer that is suitably aligned for the largest
base type (quadword alignment). Where this aligned pointer
cannot be guaranteed, use the technique shown in the following
code to make the pointer quadword aligned, if needed. This
code assumes that a pointer can be cast to a long.
Example:  
double* p;
double* np;

p  = (double *)malloc(sizeof(double)*number_of_doubles+7L);
np = (double *)((((long)(p))+7L) & (-8L));
Then use np instead of p to access the data. p is still needed
in order to deallocate the storage.
Introduce Explicit Parallelism into Code  
Where possible, break long dependency chains into several
independent dependency chains that can then be executed in
parallel, exploiting the pipelined execution units. This is
especially important for floating-point code, whether it is
mapped to x87 or 3DNow! instructions, because of the longer
latency of floating-point operations. Since most languages,
including ANSI C, guarantee that floating-point expressions are
not re-ordered, compilers usually cannot perform such
optimizations unless they offer a switch to allow ANSI non-
compliant reordering of floating-point expressions according to
algebraic rules.
Note that re-ordered code that is algebraically identical to the  
original code does not necessarily deliver identical  
computational results due to the lack of associativity of floating  
point operations. There are well-known numerical  
considerations in applying these optimizations (consult a book  
on numerical analysis). In some cases, these optimizations may  
lead to unexpected results. Fortunately, in the vast majority of  
cases, the final result will differ only in the least significant  
bits.  
Example 1 (Avoid):
double a[100], sum;
int i;

sum = 0.0;
for (i=0; i<100; i++) {
   sum += a[i];
}
Example 2 (Preferred):
double a[100], sum1, sum2, sum3, sum4, sum;
int i;

sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i=0; i<100; i+=4) {
   sum1 += a[i];
   sum2 += a[i+1];
   sum3 += a[i+2];
   sum4 += a[i+3];
}
sum = (sum4+sum3)+(sum1+sum2);
Notice that the 4-way unrolling was chosen to exploit the 4-stage  
fully pipelined floating-point adder. Each stage of the floating-  
point adder is occupied on every clock cycle, ensuring maximal  
sustained utilization.  
Explicitly Extract Common Subexpressions  
In certain situations, C compilers are unable to extract common  
subexpressions from floating-point expressions due to the  
guarantee against reordering of such expressions in the ANSI  
standard. Specifically, the compiler can not re-arrange the  
computation according to algebraic equivalencies before  
extracting common subexpressions. In such cases, the  
programmer should manually extract the common  
subexpression. It should be noted that re-arranging the  
expression may result in different computational results due to  
the lack of associativity of floating-point operations, but the  
results usually differ in only the least significant bits.  
Example 1 (Avoid):
double a,b,c,d,e,f;
e = b*c/d;
f = b/d*a;

Example 1 (Preferred):
double a,b,c,d,e,f,t;
t = b/d;
e = c*t;
f = a*t;
Example 2 (Avoid):
double a,b,c,e,f;
e = a/c;
f = b/c;

Example 2 (Preferred):
double a,b,c,e,f,t;
t = 1.0/c;
e = a*t;
f = b*t;
C Language Structure Component Considerations  
Many compilers have options that allow padding of structures  
to make their size multiples of words, doublewords, or  
quadwords, in order to achieve better alignment for structures.  
In addition, to improve the alignment of structure members,  
some compilers might allocate structure elements in an order  
that differs from the order in which they are declared. However,  
some compilers might not offer any of these features, or their  
implementation might not work properly in all situations.  
Therefore, to achieve the best alignment of structures and  
structure members while minimizing the amount of padding  
regardless of compiler optimizations, the following methods are  
suggested.  
Sort by Base Type Size
Sort structure members according to their base type size,
declaring members with a larger base type size ahead of
members with a smaller base type size.
Pad by Multiple of Largest Base Type Size
Pad the structure to a multiple of the largest base type size of
any member. In this fashion, if the first member of a structure is
naturally aligned, all other members are naturally aligned as
well. Padding the structure to a multiple of the largest base
type size allows, for example, arrays of structures to be
perfectly aligned.
The following example demonstrates the reordering of  
structure member declarations:  
Original ordering (Avoid):
struct {
   char   a[5];
   long   k;
   double x;
} baz;

New ordering, with padding (Preferred):
struct {
   double x;
   long   k;
   char   a[5];
   char   pad[7];
} baz;
See page 55 for a different perspective.
Sort Local Variables According to Base Type Size  
When a compiler allocates local variables in the same order in  
which they are declared in the source code, it can be helpful to  
declare local variables in such a manner that variables with a  
larger base type size are declared ahead of the variables with  
smaller base type size. Then, if the first variable is allocated so  
that it is naturally aligned, all other variables are allocated  
contiguously in the order they are declared, and are naturally  
aligned without any padding.  
Some compilers do not allocate variables in the order they are  
declared. In these cases, the compiler should automatically  
allocate variables in such a manner as to make them naturally  
aligned with the minimum amount of padding. In addition,  
some compilers do not guarantee that the stack is aligned  
suitably for the largest base type (that is, they do not guarantee  
quadword alignment), so that quadword operands might be  
misaligned, even if this technique is used and the compiler does  
allocate variables in the order they are declared.  
The following example demonstrates the reordering of local  
variable declarations:  
Original ordering (Avoid):
short  ga, gu, gi;
long   foo, bar;
double x, y, z[3];
char   a, b;
float  baz;

Improved ordering (Preferred):
double z[3];
double x, y;
long   foo, bar;
float  baz;
short  ga, gu, gi;
char   a, b;
See "C Language Structure Component Considerations" earlier in this chapter for more information from a different perspective.
Accelerating Floating-Point Divides and Square Roots  
Divides and square roots have a much longer latency than other  
floating-point operations, even though the AMD Athlon  
processor provides significant acceleration of these two  
operations. In some codes, these operations occur so often as to  
seriously impact performance. In these cases, it is  
recommended to port the code to 3DNow! inline assembly or to  
use a compiler that can generate 3DNow! code. If code has hot  
spots that use single-precision arithmetic only (i.e., all  
computation involves data of type float) and for some reason  
cannot be ported to 3DNow!, the following technique may be  
used to improve performance.  
The x87 FPU has a precision-control field as part of the FPU  
control word. The precision-control setting determines what  
precision results get rounded to. It affects the basic arithmetic  
operations, including divides and square roots. AMD Athlon  
and AMD-K6® family processors implement divide and square  
root in such fashion as to only compute the number of bits  
necessary for the currently selected precision. This means that  
setting precision control to single precision (versus Win32  
default of double precision) lowers the latency of those  
operations.  
The Microsoft® Visual C environment provides functions to  
manipulate the FPU control word and thus the precision  
control. Note that these functions are not very fast, so changes  
of precision control should be inserted where it creates little  
overhead, such as outside a computation-intensive loop.  
Otherwise the overhead created by the function calls outweighs  
the benefit from reducing the latencies of divide and square  
root operations.  
The following example shows how to set the precision control to  
single precision and later restore the original settings in the  
Microsoft Visual C environment.  
Example:  
/* prototype for _controlfp() function */  
#include <float.h>  
unsigned int orig_cw;  
/* Get current FPU control word and save it */  
orig_cw = _controlfp (0,0);  
/* Set precision control in FPU control word to single  
precision. This reduces the latency of divide and square  
root operations.  
*/  
_controlfp (_PC_24, MCW_PC);  
/* restore original FPU control word */  
_controlfp (orig_cw, 0xfffff);  
Avoid Unnecessary Integer Division  
Integer division is the slowest of all integer arithmetic
operations and should be avoided wherever possible. One
possibility for reducing the number of integer divisions is
chained divisions, in which a division can be replaced with a
multiplication, as shown in the following examples. This
replacement is possible only if no overflow occurs during the
computation of the product, which can be determined by
considering the possible ranges of the divisors.
Example 1 (Avoid):
int i,j,k,m;
m = i / j / k;

Example 2 (Preferred):
int i,j,k,m;
m = i / (j * k);
Copy Frequently De-referenced Pointer Arguments to Local Variables
Avoid frequently de-referencing pointer arguments inside a
function. Since the compiler has no knowledge of whether
aliasing exists between the pointers, such de-referencing
cannot be optimized away by the compiler. This prevents data
from being kept in registers and significantly increases memory
traffic.
Note that many compilers have an "assume no aliasing"
optimization switch. This allows the compiler to assume that
two different pointers always have disjoint contents and does
not require copying of pointer arguments to local variables.
Otherwise, copy the data pointed to by the pointer arguments  
to local variables at the start of the function and if necessary  
copy them back at the end of the function.  
Example 1 (Avoid):  
//assumes pointers are different and q!=r  
void isqrt (unsigned long a,  
unsigned long *q,  
unsigned long *r)  
{
*q = a;  
if (a > 0)  
{
while (*q > (*r = a / *q))  
{
*q = (*q + *r) >> 1;  
}
}
*r = a - *q * *q;  
}
Example 2 (Preferred):  
//assumes pointers are different and q!=r  
void isqrt (unsigned long a,  
unsigned long *q,  
unsigned long *r)  
{
unsigned long qq, rr;  
qq = a;  
if (a > 0)  
{
while (qq > (rr = a / qq))  
{
qq = (qq + rr) >> 1;  
}
}
rr = a - qq * qq;  
*q = qq;  
*r = rr;  
}
4  Instruction Decoding Optimizations

This chapter discusses ways to maximize the number of
instructions decoded by the instruction decoders in the
AMD Athlon™ processor. Guidelines are listed in order of
importance.
Overview  
The AMD Athlon processor instruction fetcher reads 16-byte  
aligned code windows from the instruction cache. The  
instruction bytes are then merged into a 24-byte instruction  
queue. On each cycle, the in-order front-end engine selects for  
decode up to three x86 instructions from the instruction-byte  
queue.  
All instructions (x86, x87, 3DNow!, and MMX) are classified
into two types of decodes: DirectPath and VectorPath (see
"DirectPath Decoder" and "VectorPath Decoder" on page 133
for more information). DirectPath instructions are common
instructions that are decoded directly in hardware. VectorPath
instructions are more complex instructions that require the use
of a sequence of multiple operations issued from an on-chip
ROM.
Up to three DirectPath instructions can be selected for decode  
per cycle. Only one VectorPath instruction can be selected for  
decode per cycle. DirectPath instructions and VectorPath  
instructions cannot be simultaneously decoded.  
Select DirectPath Over VectorPath Instructions  
Use DirectPath instructions rather than VectorPath
instructions. DirectPath instructions are optimized for decode
and execute efficiently by minimizing the number of operations
per x86 instruction, which includes "register ← register op
memory" as well as "register ← register op register" forms of
instructions. Up to three DirectPath instructions can be
decoded per cycle. VectorPath instructions block the decoding
of DirectPath instructions.
The vast majority of instructions used by a compiler have been
implemented as DirectPath instructions in the AMD Athlon
processor. Assembly writers must still take into consideration
the usage of DirectPath versus VectorPath instructions.
Load-Execute Instruction Usage  
Use Load-Execute Integer Instructions  
Most load-execute integer instructions are DirectPath
decodable and can be decoded at the rate of three per cycle.
Splitting a load-execute integer instruction into two separate
instructions (a load instruction and a "reg, reg" instruction)
reduces decoding bandwidth and increases register pressure,
which results in lower performance. The split-instruction form
can be used to avoid scheduler stalls for longer executing
instructions and to explicitly schedule the load and execute
operations.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When operating on single-precision or double-precision  
floating-point data, wherever possible use floating-point  
load-execute instructions to increase code density.  
Note: This optimization applies only to floating-point instructions  
with floating-point operands and not with integer operands,  
as described in the next optimization.  
This coding style helps in two ways. First, denser code allows  
more work to be held in the instruction cache. Second, the  
denser code generates fewer internal OPs and, therefore, the  
FPU scheduler holds more work, which increases the chances of  
extracting parallelism from the code.  
Example 1 (Avoid):
FLD  QWORD PTR [TEST1]
FLD  QWORD PTR [TEST2]
FMUL ST, ST(1)

Example 2 (Preferred):
FLD  QWORD PTR [TEST1]
FMUL QWORD PTR [TEST2]
Avoid Load-Execute Floating-Point Instructions with Integer Operands  
Do not use load-execute floating-point instructions with integer
operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR,
FICOM, and FICOMP. Remember that floating-point
instructions can have integer operands, while integer
instructions cannot have floating-point operands.
Floating-point computations involving integer-memory  
operands should use separate FILD and arithmetic instructions.  
This optimization has the potential to increase decode  
bandwidth and OP density in the FPU scheduler. The floating-  
point load-execute instructions with integer operands are  
VectorPath and generate two OPs in a cycle, while the discrete  
equivalent enables a third DirectPath instruction to be decoded  
in the same cycle. In some situations this optimization can also
reduce execution time if the FILD can be scheduled several
instructions ahead of the arithmetic instruction in order to
cover the FILD latency.
Example 1 (Avoid):
FLD   QWORD PTR [foo]
FIMUL DWORD PTR [bar]
FIADD DWORD PTR [baz]

Example 2 (Preferred):
FILD  DWORD PTR [bar]
FILD  DWORD PTR [baz]
FLD   QWORD PTR [foo]
FMULP ST(2), ST
FADDP ST(1), ST
Align Branch Targets in Program Hot Spots  
In program hot spots (i.e., innermost loops in the absence of
profiling data), place branch targets at or near the beginning of
16-byte aligned code windows. This technique helps to
maximize the number of instructions that are filled into the
instruction-byte queue while preserving I-cache space in
branch-intensive code.
Use Short Instruction Lengths  
Assemblers and compilers should generate the tightest code  
possible to optimize use of the I-cache and increase average  
decode rate. Wherever possible, use instructions with shorter  
lengths. Using shorter instructions increases the number of  
instructions that can fit into the instruction-byte queue. For  
example, use 8-bit displacements as opposed to 32-bit  
displacements. In addition, use the single-byte format of simple  
integer instructions whenever possible, as opposed to the  
2-byte opcode ModR/M format.  
Example 1 (Avoid):
81 C0 78 56 34 12    add eax, 12345678h ;uses 2-byte opcode
                                        ; form (with ModR/M)
81 C3 FB FF FF FF    add ebx, -5        ;uses 32-bit
                                        ; immediate
0F 84 05 00 00 00    jz $label1         ;uses 2-byte opcode,
                                        ; 32-bit immediate
Example 2 (Preferred):
05 78 56 34 12       add eax, 12345678h ;uses single byte
                                        ; opcode form
83 C3 FB             add ebx, -5        ;uses 8-bit sign
                                        ; extended immediate
74 05                jz $label1         ;uses 1-byte opcode,
                                        ; 8-bit immediate
Avoid Partial Register Reads and Writes  
In order to handle partial register writes, the AMD Athlon  
processor execution core implements a data-merging scheme.  
In the execution unit, an instruction writing a partial register  
merges the modified portion with the current state of the  
remainder of the register. Therefore, the dependency hardware  
can potentially force a false dependency on the most recent  
instruction that writes to any part of the register.  
Example 1 (Avoid):
MOV AL, 10   ;inst 1
MOV AH, 12   ;inst 2 has a false dependency on inst 1
             ;inst 2 merges new AH with current
             ; EAX register value forwarded by inst 1
In addition, an instruction that has a read dependency on any  
part of a given architectural register has a read dependency on  
the most recent instruction that modifies any part of the same  
architectural register.  
Example 2 (Avoid):
MOV BX, 12h  ;inst 1
MOV BL, DL   ;inst 2, false dependency on completion of inst 1
MOV BH, CL   ;inst 3, false dependency on completion of inst 2
MOV AL, BL   ;inst 4, depends on completion of inst 2
Replace Certain SHLD Instructions with Alternative Code  
Certain instances of the SHLD instruction can be replaced by
alternative code using SHR and LEA. The alternative code has
lower latency and requires fewer execution resources. SHR and
LEA (32-bit version) are DirectPath instructions, while SHLD is
a VectorPath instruction. Use of SHR and LEA preserves decode
bandwidth, as it potentially enables the decoding of a third
DirectPath instruction.
Example 1 (Avoid):
SHLD REG1, REG2, 1

Example 1 (Preferred):
SHR REG2, 31
LEA REG1, [REG1*2 + REG2]

Example 2 (Avoid):
SHLD REG1, REG2, 2

Example 2 (Preferred):
SHR REG2, 30
LEA REG1, [REG1*4 + REG2]

Example 3 (Avoid):
SHLD REG1, REG2, 3

Example 3 (Preferred):
SHR REG2, 29
LEA REG1, [REG1*8 + REG2]
Use 8-Bit Sign-Extended Immediates  
Using 8-bit sign-extended immediates improves code density
with no negative effects on the AMD Athlon processor. For
example, ADD BX, -5 should be encoded "83 C3 FB" and not
"81 C3 FF FB".
Use 8-Bit Sign-Extended Displacements  
Use 8-bit sign-extended displacements for conditional  
branches. Using short, 8-bit sign-extended displacements for  
conditional branches improves code density with no negative  
effects on the AMD Athlon processor.  
Code Padding Using Neutral Code Fillers  
Occasionally a need arises to insert neutral code fillers into the  
code stream, e.g., for code alignment purposes or to space out  
branches. Since this filler code can be executed, it should take
up as few execution resources as possible, not diminish decode
density, and not modify any processor state other than
advancing EIP. One-byte padding can easily be achieved using
the NOP instruction (XCHG EAX, EAX; opcode 0x90). In the
x86 architecture, there are several multi-byte "NOP"
instructions available that do not change processor state other
than EIP:
MOV REG, REG  
XCHG REG, REG  
CMOVcc REG, REG  
SHR REG, 0  
SAR REG, 0  
SHL REG, 0  
SHRD REG, REG, 0  
SHLD REG, REG, 0  
LEA REG, [REG]  
LEA REG, [REG+00]  
LEA REG, [REG*1+00]  
LEA REG, [REG+00000000]  
LEA REG, [REG*1+00000000]  
Not all of these instructions are equally suitable for purposes of  
code padding. For example, SHLD/SHRD are microcoded which  
reduces decode bandwidth and takes up execution resources.  
Recommendations for the AMD Athlon™ Processor

For code that is optimized specifically for the AMD Athlon
processor, the optimal code fillers are NOP instructions (opcode
0x90) with up to two REP prefixes (0xF3). In the AMD Athlon
processor, a NOP with up to two REP prefixes can be handled
by a single decoder with no overhead. As the REP prefixes are
redundant and meaningless, they get discarded, and NOPs are
handled without using any execution resources. The three
decoders of the AMD Athlon processor can handle up to three
NOPs, each with up to two REP prefixes, in a single cycle, for a
neutral code filler of up to nine bytes.
Note: When used as a filler instruction, REP/REPNE prefixes can  
be used in conjunction only with NOPs. REP/REPNE has  
undefined behavior when used with instructions other than  
a NOP.  
If a larger amount of code padding is required, it is  
recommended to use a JMP instruction to jump across the  
padding region. The following assembly language macros show  
this:  
NOP1_ATHLON TEXTEQU <DB 090h>  
NOP2_ATHLON TEXTEQU <DB 0F3h, 090h>  
NOP3_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h>  
NOP4_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 090h>  
NOP5_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 090h>  
NOP6_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h>  
NOP7_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h,  
090h>  
NOP8_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h,  
0F3h, 090h>  
NOP9_ATHLON TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h,  
0F3h, 0F3h, 090h>  
NOP10_ATHLON TEXTEQU <DB 0EBh, 008h, 90h, 90h, 90h, 90h,
90h, 90h, 90h, 90h>  
Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code
On x86 processors other than the AMD Athlon processor
(including the AMD-K6 family of processors), the REP prefix
and especially multiple prefixes cause decoding overhead, so
the above technique is not recommended for code that has to
run well both on the AMD Athlon processor and on other x86
processors (blended code). In such cases, the instructions and
instruction sequences below are recommended. For neutral
code fillers longer than eight bytes, the JMP instruction can be
used to jump across the padding region.
Note that each of the instructions and instruction sequences  
below utilizes an x86 register. To avoid performance  
degradation, the register used in the padding should be  
selected so as to not lengthen existing dependency chains, i.e.,  
one should select a register that is not used by instructions in  
the vicinity of the neutral code filler. Note that certain  
instructions use registers implicitly. For example, PUSH, POP,  
CALL, and RET all make implicit use of the ESP register. The  
5-byte filler sequence below consists of two instructions. If flag  
changes across the code padding are acceptable, the following  
instructions may be used as single instruction, 5-byte code  
fillers:  
TEST EAX, 0FFFF0000h  
CMP EAX, 0FFFF0000h  
The following assembly language macros show the
recommended neutral code fillers for code that is optimized for
the AMD Athlon processor but also has to run well on other x86
processors. Note that for some padding lengths, versions using
ESP or EBP are missing due to the lack of fully generalized
addressing modes for those registers.
NOP2_EAX TEXTEQU <DB 08Bh,0C0h> ;mov eax, eax  
NOP2_EBX TEXTEQU <DB 08Bh,0DBh> ;mov ebx, ebx  
NOP2_ECX TEXTEQU <DB 08Bh,0C9h> ;mov ecx, ecx  
NOP2_EDX TEXTEQU <DB 08Bh,0D2h> ;mov edx, edx  
NOP2_ESI TEXTEQU <DB 08Bh,0F6h> ;mov esi, esi  
NOP2_EDI TEXTEQU <DB 08Bh,0FFh> ;mov edi, edi  
NOP2_ESP TEXTEQU <DB 08Bh,0E4h> ;mov esp, esp  
NOP2_EBP TEXTEQU <DB 08Bh,0EDh> ;mov ebp, ebp  
NOP3_EAX TEXTEQU <DB 08Dh,004h,020h> ;lea eax, [eax]  
NOP3_EBX TEXTEQU <DB 08Dh,01Ch,023h> ;lea ebx, [ebx]  
NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]  
NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx]  
NOP3_ESI TEXTEQU <DB 08Dh,034h,026h> ;lea esi, [esi]
NOP3_EDI TEXTEQU <DB 08Dh,03Ch,027h> ;lea edi, [edi]
NOP3_ESP TEXTEQU <DB 08Dh,024h,024h> ;lea esp, [esp]
NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h> ;lea ebp, [ebp]  
NOP4_EAX TEXTEQU <DB 08Dh,044h,020h,000h> ;lea eax, [eax+00]  
NOP4_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h> ;lea ebx, [ebx+00]  
NOP4_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h> ;lea ecx, [ecx+00]  
NOP4_EDX TEXTEQU <DB 08Dh,054h,022h,000h> ;lea edx, [edx+00]  
NOP4_ESI TEXTEQU <DB 08Dh,074h,026h,000h> ;lea esi, [esi+00]
NOP4_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h> ;lea edi, [edi+00]
NOP4_ESP TEXTEQU <DB 08Dh,064h,024h,000h> ;lea esp, [esp+00]
NOP5_EAX TEXTEQU <DB 08Dh,044h,020h,000h,090h> ;lea eax, [eax+00]; nop
NOP5_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h,090h> ;lea ebx, [ebx+00]; nop
NOP5_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h,090h> ;lea ecx, [ecx+00]; nop
NOP5_EDX TEXTEQU <DB 08Dh,054h,022h,000h,090h> ;lea edx, [edx+00]; nop
NOP5_ESI TEXTEQU <DB 08Dh,074h,026h,000h,090h> ;lea esi, [esi+00]; nop
NOP5_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h,090h> ;lea edi, [edi+00]; nop
NOP5_ESP TEXTEQU <DB 08Dh,064h,024h,000h,090h> ;lea esp, [esp+00]; nop
NOP6_EAX TEXTEQU <DB 08Dh,080h,0,0,0,0> ;lea eax, [eax+00000000]
NOP6_EBX TEXTEQU <DB 08Dh,09Bh,0,0,0,0> ;lea ebx, [ebx+00000000]
NOP6_ECX TEXTEQU <DB 08Dh,089h,0,0,0,0> ;lea ecx, [ecx+00000000]
NOP6_EDX TEXTEQU <DB 08Dh,092h,0,0,0,0> ;lea edx, [edx+00000000]
NOP6_ESI TEXTEQU <DB 08Dh,0B6h,0,0,0,0> ;lea esi, [esi+00000000]
NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0> ;lea edi, [edi+00000000]
NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0> ;lea ebp, [ebp+00000000]
NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0> ;lea eax, [eax*1+00000000]
NOP7_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0> ;lea ebx, [ebx*1+00000000]
NOP7_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0> ;lea ecx, [ecx*1+00000000]
NOP7_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0> ;lea edx, [edx*1+00000000]
NOP7_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0> ;lea esi, [esi*1+00000000]
NOP7_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0> ;lea edi, [edi*1+00000000]
NOP7_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0> ;lea ebp, [ebp*1+00000000]
NOP8_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0,90h> ;lea eax, [eax*1+00000000]; nop
NOP8_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0,90h> ;lea ebx, [ebx*1+00000000]; nop
NOP8_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0,90h> ;lea ecx, [ecx*1+00000000]; nop
NOP8_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0,90h> ;lea edx, [edx*1+00000000]; nop
NOP8_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0,90h> ;lea esi, [esi*1+00000000]; nop
NOP8_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0,90h> ;lea edi, [edi*1+00000000]; nop
NOP8_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0,90h> ;lea ebp, [ebp*1+00000000]; nop
NOP9     TEXTEQU <DB 0EBh,007h,90h,90h,90h,90h,90h,90h,90h> ;jmp short over 7 nops
5   Cache and Memory Optimizations
This chapter describes code optimization techniques that take
advantage of the large L1 caches and high-bandwidth buses of
the AMD Athlon™ processor. Guidelines are listed in order of
importance.
Memory Size and Alignment Issues  
Avoid Memory Size Mismatches  
Avoid memory size mismatches when instructions operate on  
the same data. For instructions that store and reload the same  
data, keep operands aligned and keep the loads/stores of each  
operand the same size. The following code examples result in a  
store-to-load-forwarding (STLF) stall:  
Example 1 (Avoid):  
MOV DWORD PTR [FOO], EAX  
MOV DWORD PTR [FOO+4], EDX  
FLD QWORD PTR [FOO]  
Avoid large-to-small mismatches, as shown in the following  
code:  
Example 2 (Avoid):  
FST QWORD PTR [FOO]  
MOV EAX, DWORD PTR [FOO]  
MOV EDX, DWORD PTR [FOO+4]  
Align Data Where Possible  
In general, avoid misaligned data references. All data whose  
size is a power of 2 is considered aligned if it is naturally  
aligned. For example:  
- QWORD accesses are aligned if they access an address divisible by 8.
- DWORD accesses are aligned if they access an address divisible by 4.
- WORD accesses are aligned if they access an address divisible by 2.
- TBYTE accesses are aligned if they access an address divisible by 8.
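The natural-alignment rule above reduces to a simple predicate. A minimal C sketch (the helper name is invented for illustration):

```c
#include <stdint.h>

/* An access of the given size (in bytes) is naturally aligned when its
   address is divisible by that size; TBYTE (10 bytes) is the exception
   and, per the rules above, only needs 8-byte alignment. */
static int is_naturally_aligned(uintptr_t addr, unsigned size_in_bytes)
{
    unsigned required = (size_in_bytes == 10) ? 8 : size_in_bytes;
    return (addr % required) == 0;
}
```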
A misaligned store or load operation suffers a minimum
one-cycle penalty in the AMD Athlon processor load/store
pipeline. In addition, using misaligned loads and stores
increases the likelihood of encountering a store-to-load
forwarding pitfall. For a more detailed discussion of store-to-
load forwarding issues, see "Store-to-Load Forwarding
Restrictions" on page 51.
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
For code that can take advantage of prefetching, use the
3DNow! PREFETCH and PREFETCHW instructions to
increase the effective bandwidth to the AMD Athlon processor.
The PREFETCH and PREFETCHW instructions take
advantage of the AMD Athlon processor's high bus bandwidth
to hide long latencies when fetching data from system memory.
The prefetch instructions are essentially integer instructions
and can be used anywhere, in any type of code (integer, x87,
3DNow!, MMX, etc.).
Large data sets typically require unit-stride access to ensure  
that all data pulled in by PREFETCH or PREFETCHW is  
actually used. If necessary, algorithms or data structures should  
be reorganized to allow unit-stride access.  
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
The PREFETCHNTA/T0/T1/T2 instructions in the MMX
extensions are processor implementation dependent. To
maintain compatibility with the 25 million AMD-K6®-2 and
AMD-K6-III processors already sold, use the 3DNow!
PREFETCH/W instructions instead of the various prefetch
flavors in the new MMX extensions.
PREFETCHW Usage  
Code that intends to modify the cache line brought in through  
prefetching should use the PREFETCHW instruction. While  
PREFETCHW works the same as a PREFETCH on the  
AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a  
hint to the AMD Athlon processor of an intent to modify the  
cache line. The AMD Athlon processor will mark the cache line  
being brought in by PREFETCHW as Modified. Using  
PREFETCHW can save an additional 15-25 cycles compared to  
a PREFETCH and the subsequent cache state change caused by  
a write to the prefetched cache line.  
Multiple Prefetches  
Programmers can initiate multiple outstanding prefetches on  
the AMD Athlon processor. While the AMD-K6-2 and  
AMD-K6-III processors can have only one outstanding prefetch,  
the AMD Athlon processor can have up to six outstanding  
prefetches. When all six buffers are filled by various memory  
read requests, the processor will simply ignore any new  
prefetch requests until a buffer frees up. Multiple prefetch  
requests are essentially handled in-order. If data is needed first,  
then that data should be prefetched first.  
The example below shows how to initiate multiple prefetches  
when traversing more than one array.  
Example (Multiple Prefetches):  
.CODE  
.K3D  
; original C code  
;
; #define LARGE_NUM 65536
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;   array_a[i] = array_b[i] * array_c[i];
; }
        ;ARR_SIZE equals LARGE_NUM * 8 (size of each array in bytes)
        MOV ECX, (-LARGE_NUM)   ;use biased index
        MOV EAX, OFFSET array_a ;get address of array_a
        MOV EDX, OFFSET array_b ;get address of array_b
        MOV ESI, OFFSET array_c ;get address of array_c

$loop:  PREFETCHW [EAX+196]     ;two cachelines ahead
        PREFETCH  [EDX+196]     ;two cachelines ahead
        PREFETCH  [ESI+196]     ;two cachelines ahead

        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE]    ;b[i]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE]    ;b[i]*c[i]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE]    ;a[i] = b[i]*c[i]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+8]  ;b[i+1]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+8]  ;b[i+1]*c[i+1]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+8]  ;a[i+1] = b[i+1]*c[i+1]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+16] ;b[i+2]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+16] ;b[i+2]*c[i+2]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+16] ;a[i+2] = b[i+2]*c[i+2]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+24] ;b[i+3]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+24] ;b[i+3]*c[i+3]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+24] ;a[i+3] = b[i+3]*c[i+3]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+32] ;b[i+4]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+32] ;b[i+4]*c[i+4]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+32] ;a[i+4] = b[i+4]*c[i+4]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+40] ;b[i+5]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+40] ;b[i+5]*c[i+5]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+40] ;a[i+5] = b[i+5]*c[i+5]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+48] ;b[i+6]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+48] ;b[i+6]*c[i+6]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+48] ;a[i+6] = b[i+6]*c[i+6]
        FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+56] ;b[i+7]
        FMUL QWORD PTR [ESI+ECX*8+ARR_SIZE+56] ;b[i+7]*c[i+7]
        FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+56] ;a[i+7] = b[i+7]*c[i+7]

        ADD ECX, 8              ;next 8 products
        JNZ $loop               ;until none left
END
The following optimization rules were applied to this example:
- Loops should be unrolled to make sure that the data stride
  per loop iteration is equal to the length of a cache line. This
  avoids overlapping PREFETCH instructions and thus makes
  optimal use of the available number of outstanding
  PREFETCHes.
- Because the array array_a is written rather than read,
  PREFETCHW is used instead of PREFETCH to avoid the
  overhead of switching cache lines to the correct MESI
  state. The PREFETCH lookahead has been optimized such
  that each loop iteration works on three cache lines while
  six active PREFETCHes bring in the next six cache lines.
- Index arithmetic has been reduced to a minimum by the use
  of complex addressing modes and biasing of the array base
  addresses in order to cut down on loop overhead.
Determining Prefetch Distance
Given the latency of a typical AMD Athlon processor system
and expected processor speeds, the following formula should be
used to determine the prefetch distance in bytes for a single
array:

   Prefetch Distance = 200 * (DS/C) bytes

Round up to the nearest 64-byte cache line.
- The number 200 is a constant based upon expected
  AMD Athlon processor clock frequencies and typical system
  memory latencies.
- DS is the data stride in bytes per loop iteration.
- C is the number of cycles for one loop iteration to execute
  entirely from the L1 cache.

The prefetch distance for multiple arrays is typically even
longer.
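The formula and its cache-line rounding can be sketched in C (the helper name is invented; integer arithmetic is used since the result is rounded up anyway):

```c
/* Prefetch distance = 200 * (DS / C) bytes, rounded up to the next
   64-byte cache line.  ds_bytes = data stride per loop iteration,
   c_cycles = cycles per loop iteration when running from L1. */
static unsigned prefetch_distance(unsigned ds_bytes, unsigned c_cycles)
{
    unsigned raw = (200 * ds_bytes) / c_cycles;
    return (raw + 63) & ~63u;   /* round up to a 64-byte boundary */
}
```

For example, with a 64-byte stride and a 20-cycle loop, the distance is 640 bytes (already a cache-line multiple); a 24-byte stride at 10 cycles gives 480, which rounds up to 512.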
Prefetch at Least 64 Bytes Away from Surrounding Stores
The PREFETCH and PREFETCHW instructions can be
affected by false dependencies on stores. If there is a store to an
address that matches a request, that request (the PREFETCH
or PREFETCHW instruction) may be blocked until the store is
written to the cache. Therefore, code should prefetch data that
is located at least 64 bytes away from the data address of any
surrounding store.
Take Advantage of Write Combining  
Operating system and device driver programmers should take  
advantage of the write-combining capabilities of the  
AMD Athlon processor. The AMD Athlon processor has a very  
aggressive write-combining algorithm, which improves  
performance significantly.  
See page 155 for more details.
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Sharing code and data in the same 64-byte cache line may cause
the L1 caches to thrash (unnecessary castout of code/data) in
order to maintain coherency between the separate instruction
and data caches. The AMD Athlon processor has a cache-line
size of 64 bytes, which is twice the size of previous processors.
Programmers must be aware that code and data should not be
shared within this larger cache line, especially if the data
becomes modified.
For example, programmers should consider that a memory  
indirect JMP instruction may have the data for the jump table  
residing in the same 64-byte cache line as the JMP instruction,  
which would result in lower performance.  
Although rare, do not place critical code at the border between
32-byte-aligned code segments and data segments. Code at the
start or end of a data segment should be executed as rarely as
possible or simply padded with garbage.
In general, the following should be avoided:
- self-modifying code
- storing data in code segments
Store-to-Load Forwarding Restrictions  
Store-to-load forwarding refers to the process of a load reading  
(forwarding) data from the store buffer (LS2). There are  
instances in the AMD Athlon processor load/store architecture  
when either a load operation is not allowed to read needed data  
from a store in the store buffer, or a load OP detects a false data  
dependency on a store in the store buffer.  
In either case, the load cannot complete (load the needed data  
into a register) until the store has retired out of the store buffer  
and written to the data cache. A store-buffer entry cannot retire  
and write to the data cache until every instruction before the  
store has completed and retired from the reorder buffer.  
The implication of this restriction is that all instructions in the  
reorder buffer, up to and including the store, must complete  
and retire out of the reorder buffer before the load can  
complete. Effectively, the load has a false dependency on every  
instruction up to the store.  
The following sections describe store-to-load forwarding  
examples that are acceptable and those that should be avoided.  
Store-to-Load Forwarding Pitfalls: True Dependencies
A load is allowed to read data from the store-buffer entry only if
all of the following conditions are satisfied:
- The start address of the load matches the start address of
  the store.
- The load operand size is equal to or smaller than the store
  operand size.
- Neither the load nor the store is misaligned.
- The store data is not from a high-byte register (AH, BH, CH,
  or DH).

The following sections describe common-case scenarios to avoid,
in which a load has a true dependency on an LS2-buffered store
but cannot read (forward) data from a store-buffer entry.
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a
narrow-to-wide store-buffer data forwarding restriction:
- The operand size of the store data is smaller than the
  operand size of the load data.
- The range of addresses spanned by the store data covers
  some sub-region of the range of addresses spanned by the
  load data.

Avoid the type of code shown in the following two examples.
Example 1 (Avoid):
MOV EAX, 10h
MOV WORD PTR [EAX], BX   ;word store
...
MOV ECX, DWORD PTR [EAX] ;doubleword load
                         ;cannot forward upper
                         ; byte from store buffer
Example 2 (Avoid):  
MOV EAX, 10h  
MOV BYTE PTR [EAX + 3], BL ;byte store  
...  
MOV ECX, DWORD PTR [EAX] ;doubleword load  
;cannot forward upper byte  
; from store buffer  
Wide-to-Narrow Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a
wide-to-narrow store-buffer data forwarding restriction:
- The operand size of the store data is greater than the
  operand size of the load data.
- The start address of the store data does not match the start
  address of the load.

Example 3 (Avoid):
MOV EAX, 10h
ADD DWORD PTR [EAX], EBX   ;doubleword store
MOV CX, WORD PTR [EAX + 2] ;word load-cannot forward high
                           ; word from store buffer
Use Example 5 instead of Example 4.

Example 4 (Avoid):
MOVQ [foo], MM1    ;store upper and lower half
...
ADD  EAX, [foo]    ;fine
ADD  EDX, [foo+4]  ;uh-oh!
Example 5 (Preferred):
MOVD      [foo], MM1   ;store lower half
PUNPCKHDQ MM1, MM1     ;get upper half into lower half
MOVD      [foo+4], MM1 ;store upper half
...
ADD       EAX, [foo]   ;fine
ADD       EDX, [foo+4] ;fine
Misaligned Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a misaligned
store-buffer data forwarding restriction:
- The store or load address is misaligned. For example, a
  quadword store is not aligned to a quadword boundary, a
  doubleword store is not aligned to a doubleword boundary,
  etc.

A common case of misaligned store-data forwarding involves
the passing of misaligned quadword floating-point data on the
doubleword-aligned integer stack. Avoid the type of code shown
in the following example.

Example 6 (Avoid):
MOV ESP, 24h
FSTP QWORD PTR [ESP] ;esp=24
                     ;store occurs to quadword
                     ; misaligned address
...
FLD QWORD PTR [ESP]  ;quadword load cannot forward
                     ; from quadword misaligned
                     ; 'fstp [esp]' store OP
High-Byte Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a high-byte
store-data buffer forwarding restriction:
- The store data is from a high-byte register (AH, BH, CH, or
  DH).

Avoid the type of code shown in the following example.

Example 7 (Avoid):
MOV EAX, 10h
MOV [EAX], BH ;high-byte store
...
MOV DL, [EAX] ;load cannot forward from
              ; high-byte store
One Supported Store-to-Load Forwarding Case
There is one case of mismatched store-to-load forwarding that
is supported by the AMD Athlon processor: forwarding the
lower 32 bits of an aligned quadword write into a doubleword
read is allowed.

Example 8 (Allowed):
MOVQ [AlignedQword], MM0
...
MOV  EAX, [AlignedQword]
Summary of Store-to-Load Forwarding Pitfalls to Avoid
To avoid store-to-load forwarding pitfalls, code should conform
to the following guidelines:
- Maintain consistent use of operand size across all loads and
  stores. Preferably, use doubleword or quadword operand
  sizes.
- Avoid misaligned data references.
- Avoid narrow-to-wide and wide-to-narrow forwarding cases.
- When using word or byte stores, avoid loading data from
  anywhere in the same doubleword of memory other than the
  identical start addresses of the stores.
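The first guideline — matched operand sizes — has a direct C analogue. A hedged sketch (names invented for illustration): instead of storing a 64-bit value and reloading its halves at a different width, store the two 32-bit halves separately so each later 32-bit load matches a same-size, same-address store:

```c
#include <stdint.h>

/* Store a 64-bit value as two 32-bit halves so that subsequent 32-bit
   loads match the size and start address of a prior store, avoiding
   the wide-to-narrow forwarding case of Example 4. */
static void store_halves(uint32_t dst[2], uint64_t v)
{
    dst[0] = (uint32_t)v;          /* lower half */
    dst[1] = (uint32_t)(v >> 32);  /* upper half */
}
```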
Stack Alignment Considerations
Make sure the stack is suitably aligned for the local variable
with the largest base type. Then, using the technique described
on page 55, all variables can be properly aligned with no
padding.
Extend to 32 Bits Before Pushing onto Stack
Function arguments smaller than 32 bits should be extended to
32 bits before being pushed onto the stack, which ensures that
the stack is always doubleword aligned on entry to a function.

If a function has no local variables with a base type larger than
a doubleword, no further work is necessary. If the function does
have local variables whose base type is larger than a
doubleword, additional code should be inserted to ensure
proper alignment of the stack. For example, the following code
achieves quadword alignment:
Example (Preferred):

Prolog:
  PUSH EBP
  MOV  EBP, ESP
  SUB  ESP, SIZE_OF_LOCALS ;size of local variables
  AND  ESP, -8
  ;push registers that need to be preserved

Epilog:
  ;pop registers that needed to be preserved
  MOV  ESP, EBP
  POP  EBP
  RET
With this technique, function arguments can be accessed via  
EBP, and local variables can be accessed via ESP. In order to  
free EBP for general use, it needs to be saved and restored  
between the prolog and the epilog.  
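The AND ESP, -8 step works because clearing the low three bits can only move the stack pointer down, past the space just reserved. A small C sketch of the same arithmetic (illustrative only; the function name is invented):

```c
#include <stdint.h>

/* Round a 32-bit stack pointer down to the next quadword boundary,
   mirroring the AND ESP, -8 in the prolog above. */
static uint32_t align_down8(uint32_t sp)
{
    return sp & (uint32_t)-8;  /* clear the low three bits */
}
```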
Align TBYTE Variables on Quadword Aligned Addresses
Align variables of type TBYTE on quadword-aligned addresses.
To make an array of TBYTE variables whose elements are all
aligned, space the array elements 16 bytes apart. In general,
TBYTE variables should be avoided. Use double-precision
variables instead.
C Language Structure Component Considerations
Structures (struct in the C language) should be made a
multiple of the size of the largest base type of any of their
components. To meet this requirement, padding should be used
where necessary.

Language definitions permitting, to minimize padding,
structure components should be sorted and allocated such that
the components with a larger base type are allocated ahead of
those with a smaller base type. For example, consider the
following code:
Example:
struct {
    char   a[5];
    long   k;
    double x;
} baz;
The structure components should be allocated (lowest to  
highest address) as follows:  
x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0  
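The payoff of sorting components by descending base-type size can be checked directly with sizeof. A sketch (the struct names are invented; this assumes a typical ABI where double has 4- or 8-byte alignment, as on mainstream x86 compilers):

```c
#include <stddef.h>

/* Unsorted: the char before the double forces interior padding,
   and the trailing char forces tail padding. */
struct unsorted_locals { char c; double x; char c2; };

/* Sorted largest-first: both chars share one tail-padding region,
   so the struct is never larger than the unsorted layout. */
struct sorted_locals   { double x; char c; char c2; };
```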
See page 27 for more information from a C source code perspective.
Sort Variables According to Base Type Size
Sort local variables according to their base type size, and
allocate variables with a larger base type size ahead of those
with a smaller base type size. Assuming the first variable
allocated is naturally aligned, all other variables are then
naturally aligned without any padding. The following example
is a declaration of local variables in a C function:

Example:
short  ga, gu, gi;
long   foo, bar;
double x, y, z[3];
char   a, b;
float  baz;
Allocate in the following order from left to right (from higher to
lower addresses):
x, y, z[2], z[1], z[0], foo, bar, baz, ga, gu, gi, a, b
See page 28 for more information from a C source code perspective.
6   Branch Optimizations
While the AMD Athlon™ processor contains a very
sophisticated branch unit, certain optimizations increase the
effectiveness of the branch prediction unit. This chapter
discusses rules that improve branch prediction and minimize
branch penalties. Guidelines are listed in order of importance.
Avoid Branches Dependent on Random Data
Avoid conditional branches that depend on random data, as
these are difficult to predict. For example, suppose a piece of
code receives a random stream of characters 'A' through 'Z' and
branches if the character is before 'M' in the collating sequence.
Data-dependent branches acting upon basically random data
cause the branch prediction logic to mispredict the branch
about 50% of the time.
If possible, design branch-free alternative code sequences,
which result in shorter average execution time. This technique
is especially important if the branch body is small. Examples 1
and 2 illustrate this concept using the CMOV instruction. Note
that the AMD-K6® processor does not support the CMOV
instruction. Therefore, blended AMD-K6 and AMD Athlon
processor code should use Examples 3 and 4.
AMD Athlon™ Processor Specific Code
Example 1: Signed integer ABS function (X = labs(X)):
MOV   ECX, [X]  ;load value
MOV   EBX, ECX  ;save value
NEG   ECX       ;-value
CMOVS ECX, EBX  ;if -value is negative, select value
MOV   [X], ECX  ;save labs result
Example 2: Unsigned integer min function (z = x < y ? x : y):
MOV    EAX, [X]  ;load X value
MOV    EBX, [Y]  ;load Y value
CMP    EAX, EBX  ;EBX<=EAX ? CF=0 : CF=1
CMOVNC EAX, EBX  ;EAX=(EBX<=EAX) ? EBX:EAX
MOV    [Z], EAX  ;save min (X,Y)
Blended AMD-K6® and AMD Athlon™ Processor Code
Example 3: Signed integer ABS function (X = labs(X)):
MOV ECX, [X]  ;load value
MOV EBX, ECX  ;save value
SAR ECX, 31   ;x < 0 ? 0xffffffff : 0
XOR EBX, ECX  ;x < 0 ? ~x : x
SUB EBX, ECX  ;x < 0 ? (~x)+1 : x
MOV [X], EBX  ;x < 0 ? -x : x
Example 4: Unsigned integer min function (z = x < y ? x : y):
MOV EAX, [x]  ;load x
MOV EBX, [y]  ;load y
SUB EAX, EBX  ;x < y ? CF : NC ; x - y
SBB ECX, ECX  ;x < y ? 0xffffffff : 0
AND ECX, EAX  ;x < y ? x - y : 0
ADD ECX, EBX  ;x < y ? x - y + y : y
MOV [z], ECX  ;x < y ? x : y
Example 5: Hexadecimal-to-ASCII conversion
(y = x < 10 ? x + 0x30 : x + 0x37):
MOV AL, [X] ;load X value
CMP AL, 10  ;if x is less than 10, set carry flag
SBB AL, 69h ;0..9 -> 96h, Ah..Fh -> A1h..A6h
DAS         ;0..9: subtract 66h, Ah..Fh: subtract 60h
MOV [Y], AL ;save conversion in y
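The same conversion can be written branch-free in C by folding the comparison result into the add (a sketch; the helper name is invented):

```c
/* Branch-free hexadecimal digit (0..15) to ASCII:
   y = x < 10 ? x + '0' : x - 10 + 'A'.
   The comparison (x >= 10) evaluates to 0 or 1, so the adjustment
   of 7 = 'A' - '0' - 10 is applied without a conditional branch. */
static unsigned char hex_to_ascii(unsigned x)
{
    return (unsigned char)(x + 0x30 + 7 * (x >= 10));
}
```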
Example 6: Increment Ring Buffer Offset:
//C Code
char buf[BUFSIZE];
int a;

if (a < (BUFSIZE-1)) {
    a++;
} else {
    a = 0;
}

;-------------
;Assembly Code
MOV EAX, [a]         ; old offset
CMP EAX, (BUFSIZE-1) ; a < (BUFSIZE-1) ? CF : NC
INC EAX              ; a++ (INC does not modify CF)
SBB EDX, EDX         ; a < (BUFSIZE-1) ? 0xffffffff : 0
AND EAX, EDX         ; a < (BUFSIZE-1) ? a++ : 0
MOV [a], EAX         ; store new offset
Example 7: Integer Signum Function:
//C Code
int a, s;

if (!a) {
    s = 0;
} else if (a < 0) {
    s = -1;
} else {
    s = 1;
}

;-------------
;Assembly Code
MOV EAX, [a] ;load a
CDQ          ;t = a < 0 ? 0xffffffff : 0
CMP EDX, EAX ;a > 0 ? CF : NC
ADC EDX, 0   ;a > 0 ? t+1 : t
MOV [s], EDX ;signum(x)
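Both idioms translate directly into portable, branch-free C. A sketch (the function names are invented for illustration):

```c
/* Branch-free signum: -1, 0, or 1.  (a > 0) and (a < 0) each
   evaluate to 0 or 1, so no conditional branch is required. */
static int signum(int a)
{
    return (a > 0) - (a < 0);
}

/* Branch-free ring-buffer increment, mirroring Example 6:
   a = (a < bufsize - 1) ? a + 1 : 0 */
static int ring_inc(int a, int bufsize)
{
    int mask = -(a < bufsize - 1);  /* all ones or all zeros */
    return (a + 1) & mask;
}
```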
Always Pair CALL and RETURN
When the 12-entry return address stack gets out of
synchronization, the latency of returns increases. The return
address stack becomes out of sync when:
- calls and returns do not match
- the depth of the return stack is exceeded because of too
  many levels of nested function calls
Replace Branches with Computation in 3DNow!™ Code
Branches negatively impact the performance of 3DNow! code.
Branches can operate on only one data item at a time, i.e., they
are inherently scalar and inhibit the SIMD processing that
makes 3DNow! code superior. Also, branches based on 3DNow!
comparisons require data to be passed to the integer units,
which requires either transport through memory or the use of
MOVD reg, MMreg instructions. If the body of the branch is
small, one can achieve higher performance by replacing the
branch with computation. The computation simulates
predicated execution or conditional moves. The principal tools
for this are the following instructions: PCMPGT, PFCMPGT,
PFCMPGE, PFMIN, PFMAX, PAND, PANDN, POR, and PXOR.
Muxing Constructs
The most important construct for avoiding branches in
3DNow!™ and MMX™ code is a 2-way muxing construct that is
equivalent to the ternary operator '?:' in C and C++. It is
implemented using the PCMP/PFCMP, PAND, PANDN, and
POR instructions. To maximize performance, it is important to
apply the PAND and PANDN instructions in the proper order.
Example 1 (Avoid):
; r = (x < y) ? b : a
;
; in:  mm0 a
;      mm1 b
;      mm2 x
;      mm3 y
; out: mm1 r
PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
MOVQ    MM4, MM3 ; duplicate mask
PANDN   MM3, MM0 ; y > x ? 0 : a
PAND    MM1, MM4 ; y > x ? b : 0
POR     MM1, MM3 ; r = y > x ? b : a
Because the use of PANDN destroys the mask created by PCMP,  
the mask needs to be saved, which requires an additional  
register. This adds an instruction, lengthens the dependency  
chain, and increases register pressure. Therefore 2-way muxing  
constructs should be written as follows.  
Example 2 (Preferred):
   ; r = (x < y) ? b : a
   ;
   ; in:  mm0 a
   ;      mm1 b
   ;      mm2 x
   ;      mm3 y
   ; out: mm1 r

   PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
   PAND    MM1, MM3 ; y > x ? b : 0
   PANDN   MM3, MM0 ; y > x ? 0 : a
   POR     MM1, MM3 ; r = y > x ? b : a
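The mask-select idiom behind this construct can be sketched in plain C; `mux32` is a hypothetical helper name used only for illustration, and each bitwise step corresponds one-to-one to PAND, PANDN, and POR:

```c
#include <stdint.h>

/* Branchless 2-way mux on a 32-bit lane: the C analogue of the
 * PCMPGTD/PAND/PANDN/POR sequence. mux32 is a hypothetical helper,
 * not code from the manual. */
static uint32_t mux32(uint32_t x, uint32_t y, uint32_t a, uint32_t b)
{
    uint32_t mask = (y > x) ? 0xFFFFFFFFu : 0u; /* PCMPGTD: y > x ? all ones : 0 */
    uint32_t t = b & mask;                      /* PAND:  y > x ? b : 0 */
    uint32_t u = a & ~mask;                     /* PANDN: y > x ? 0 : a */
    return t | u;                               /* POR:   y > x ? b : a */
}
```

Because `~mask & a` consumes the mask nondestructively in C, the register-pressure concern of the assembly version (PANDN overwriting the mask) does not arise here; it is purely an instruction-ordering issue in MMX code.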
Sample Code Translated into 3DNow! Code

The following examples use scalar code translated into 3DNow! code. Note that it is not recommended to use 3DNow! SIMD instructions for scalar code, because the advantage of 3DNow! instructions lies in their "SIMDness." These examples are meant to demonstrate general techniques for translating source code with branches into branchless 3DNow! code. Scalar source code was chosen to keep the examples simple. These techniques work in an identical fashion for vector code. Each example shows the C code and the resulting 3DNow! code.
Example 1:

C code:
   float x,y,z;
   if (x < y) {
      z += 1.0;
   }
   else {
      z -= 1.0;
   }

3DNow! code:
   ;in:  MM0 = x
   ;     MM1 = y
   ;     MM2 = z
   ;out: MM0 = z

   MOVQ    MM3, MM0 ;save x
   MOVQ    MM4, one ;1.0
   PFCMPGE MM0, MM1 ;x < y ? 0 : 0xffffffff
   PSLLD   MM0, 31  ;x < y ? 0 : 0x80000000
   PXOR    MM0, MM4 ;x < y ? 1.0 : -1.0
   PFADD   MM0, MM2 ;x < y ? z+1.0 : z-1.0
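The sign-flip trick in example 1 can be mirrored in C by operating on the float's bit pattern; `cond_add` is a hypothetical name for this sketch:

```c
#include <stdint.h>
#include <string.h>

/* Branchless z += (x < y) ? 1.0f : -1.0f, mirroring the
 * PFCMPGE/PSLLD/PXOR/PFADD sequence on float bit patterns.
 * cond_add is a hypothetical helper name for illustration. */
static float cond_add(float x, float y, float z)
{
    uint32_t mask = (x >= y) ? 0xFFFFFFFFu : 0u; /* PFCMPGE */
    uint32_t sign = mask << 31;                  /* PSLLD: 0 or 0x80000000 */
    float one = 1.0f, delta;
    uint32_t bits;
    memcpy(&bits, &one, sizeof bits);
    bits ^= sign;                                /* PXOR: +1.0f or -1.0f */
    memcpy(&delta, &bits, sizeof delta);
    return z + delta;                            /* PFADD */
}
```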
Example 2:

C code:
   float x,z;
   z = fabs(x);
   if (z >= 1) {
      z = 1/z;
   }

3DNow! code:
   ;in:  MM0 = x
   ;out: MM0 = z

   MOVQ     MM5, mabs ;0x7fffffff
   PAND     MM0, MM5  ;z = abs(x)
   PFRCP    MM2, MM0  ;1/z approx
   MOVQ     MM1, MM0  ;save z
   PFRCPIT1 MM0, MM2  ;1/z step
   PFRCPIT2 MM0, MM2  ;1/z final
   PFMIN    MM0, MM1  ;z = z < 1 ? z : 1/z
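The final PFMIN selection can be checked against the branchy C source. This sketch models only the select, not the PFRCP reciprocal refinement (an exact divide stands in for it); `fmin2` plays the role of the single PFMIN instruction:

```c
/* For z > 0, min(z, 1/z) equals the branchy "if (z >= 1) z = 1/z".
 * fmin2 stands in for PFMIN; names here are illustrative only. */
static float fmin2(float a, float b) { return a < b ? a : b; }

static float recip_clamp(float z)        /* assumes z > 0 */
{
    return fmin2(z, 1.0f / z);
}

static float recip_clamp_ref(float z)    /* branchy reference */
{
    return (z >= 1.0f) ? 1.0f / z : z;
}
```

If z < 1 then 1/z > 1 > z, so the minimum is z; if z >= 1 then 1/z <= 1 <= z, so the minimum is 1/z — exactly the intended select.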
Example 3:

C code:
   float x,z,r,res;
   z = fabs(x);
   if (z < 0.575) {
      res = r;
   }
   else {
      res = PI/2 - 2*r;
   }

3DNow! code:
   ;in:  MM0 = x
   ;     MM1 = r
   ;out: MM0 = res

   MOVQ    MM7, mabs ;mask for absolute value
   PAND    MM0, MM7  ;z = abs(x)
   MOVQ    MM2, bnd  ;0.575
   PCMPGTD MM2, MM0  ;z < 0.575 ? 0xffffffff : 0
   MOVQ    MM3, pio2 ;pi/2
   MOVQ    MM0, MM1  ;save r
   PFADD   MM1, MM1  ;2*r
   PFSUBR  MM1, MM3  ;pi/2 - 2*r
   PAND    MM0, MM2  ;z < 0.575 ? r : 0
   PANDN   MM2, MM1  ;z < 0.575 ? 0 : pi/2 - 2*r
   POR     MM0, MM2  ;res = z < 0.575 ? r : pi/2 - 2*r
Example 4:

C code:
   #define PI 3.14159265358979323
   float x,z,r,res;
   /* 0 <= r <= PI/4 */
   z = fabs(x);
   if (z < 1) {
      res = r;
   }
   else {
      res = PI/2 - r;
   }

3DNow! code:
   ;in:  MM0 = x
   ;     MM1 = r
   ;out: MM1 = res

   MOVQ    MM5, mabs ;mask to clear sign bit
   MOVQ    MM6, one  ;1.0
   PAND    MM0, MM5  ;z = abs(x)
   PCMPGTD MM6, MM0  ;z < 1 ? 0xffffffff : 0
   MOVQ    MM4, pio2 ;pi/2
   PFSUB   MM4, MM1  ;pi/2 - r
   PANDN   MM6, MM4  ;z < 1 ? 0 : pi/2 - r
   PFMAX   MM1, MM6  ;res = z < 1 ? r : pi/2 - r
Example 5:

C code:
   #define PI 3.14159265358979323
   float x,y,xa,ya,r,res;
   int xs,df;
   xs = x < 0 ? 1 : 0;
   xa = fabs(x);
   ya = fabs(y);
   df = (xa < ya);
   if (xs && df) {
      res = PI/2 + r;
   }
   else if (xs) {
      res = PI - r;
   }
   else if (df) {
      res = PI/2 - r;
   }
   else {
      res = r;
   }

3DNow! code:
   ;in:  MM0 = r
   ;     MM1 = y
   ;     MM2 = x
   ;out: MM0 = res

   MOVQ    MM7, sgn   ;mask to extract sign bit
   MOVQ    MM6, sgn   ;mask to extract sign bit
   MOVQ    MM5, mabs  ;mask to clear sign bit
   PAND    MM7, MM2   ;xs = sign(x)
   PAND    MM1, MM5   ;ya = abs(y)
   PAND    MM2, MM5   ;xa = abs(x)
   MOVQ    MM6, MM1   ;y
   PCMPGTD MM6, MM2   ;df = (xa < ya) ? 0xffffffff : 0
   PSLLD   MM6, 31    ;df = bit<31>
   MOVQ    MM5, MM7   ;xs
   PXOR    MM7, MM6   ;xs^df ? 0x80000000 : 0
   MOVQ    MM3, npio2 ;-pi/2
   PXOR    MM5, MM3   ;xs ? pi/2 : -pi/2
   PSRAD   MM6, 31    ;df ? 0xffffffff : 0
   PANDN   MM6, MM5   ;xs ? (df ? 0 : pi/2) : (df ? 0 : -pi/2)
   PFSUB   MM6, MM3   ;pr = pi/2 + (xs ? (df ? 0 : pi/2) :
                      ;             (df ? 0 : -pi/2))
   POR     MM0, MM7   ;ar = xs^df ? -r : r
   PFADD   MM0, MM6   ;res = ar + pr
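The quadrant-fixup logic of example 5 can be replayed in C on float bit patterns and checked against the branchy source. This is a sketch: helper names are invented for illustration, and r is assumed non-negative, as the 0 <= r <= PI/4 examples imply:

```c
#include <stdint.h>
#include <string.h>

#define PI_F 3.14159265358979323f

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Branchless version of example 5; assumes r >= 0 and an arithmetic
 * right shift on int32_t (as PSRAD makes explicit). */
static float quad_fix(float x, float y, float r)
{
    uint32_t xs  = f2u(x) & 0x80000000u;          /* sign(x) */
    uint32_t ya  = f2u(y) & 0x7FFFFFFFu;          /* abs(y) bits */
    uint32_t xa  = f2u(x) & 0x7FFFFFFFu;          /* abs(x) bits */
    uint32_t df  = (ya > xa) ? 0x80000000u : 0u;  /* df in bit 31 */
    uint32_t sel = xs ^ df;                       /* flip r's sign? */
    uint32_t pm  = f2u(-PI_F / 2) ^ xs;           /* xs ? pi/2 : -pi/2 */
    uint32_t dfm = (uint32_t)((int32_t)df >> 31); /* df ? all ones : 0 */
    float pr = u2f(~dfm & pm) - (-PI_F / 2);      /* PANDN + PFSUB */
    float ar = u2f(f2u(r) | sel);                 /* POR: +r or -r */
    return ar + pr;                               /* PFADD */
}

/* Branchy reference, straight from the C source. */
static float quad_ref(float x, float y, float r)
{
    int xs = x < 0, df = (x < 0 ? -x : x) < (y < 0 ? -y : y);
    if (xs && df) return PI_F / 2 + r;
    if (xs)       return PI_F - r;
    if (df)       return PI_F / 2 - r;
    return r;
}
```

Comparing the bit patterns of non-negative floats with an unsigned integer compare is valid because IEEE-754 ordering matches integer ordering there, which is also why PCMPGTD works on the absolute values in the MMX code.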
Avoid the Loop Instruction

The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:

Example 1 (Avoid):
   LOOP LABEL

Example 2 (Preferred):
   DEC ECX
   JNZ LABEL
Avoid Far Control Transfer Instructions

Avoid using far control transfer instructions. Far control transfer branches cannot be predicted by the branch target buffer (BTB).
Avoid Recursive Functions

Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive function is one in which the function's call to itself comes at the end of its body.

Example 1 (Avoid):
   long fac(long a)
   {
      if (a == 0) {
         return (1);
      } else {
         return (a * fac(a - 1));
      }
   }
Example 2 (Preferred):
   long fac(long a)
   {
      long t = 1;
      while (a > 0) {
         t *= a;
         a--;
      }
      return (t);
   }
7   Scheduling Optimizations

This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance.
Schedule Instructions According to their Latency

The AMD Athlon processor can execute up to three x86 instructions per cycle, with each x86 instruction possibly having a different latency. The AMD Athlon processor has flexible scheduling, but for absolute maximum performance, schedule instructions, especially FPU and 3DNow! instructions, according to their latency. Dependent instructions will then not have to wait on instructions with longer latencies. See "Instruction Dispatch and Execution Resources" on page 187 for a list of latency numbers.
Unrolling Loops  
Complete Loop Unrolling  
Make use of the large AMD Athlon processor 64-Kbyte  
instruction cache and unroll loops to get more parallelism and  
reduce loop overhead, even with branch prediction. Complete  
unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.

Unrolling very large code loops, however, can result in the inefficient use of the L1 instruction cache. Loops can be unrolled completely if all of the following conditions are true:

- The loop is in a frequently executed piece of code.
- The loop count is known at compile time.
- The loop body, once unrolled, is less than 100 instructions, which is approximately 400 bytes of code.
Partial Loop Unrolling

Partial loop unrolling can increase register pressure, which can make it inefficient due to the small number of registers in the x86 architecture. However, in certain situations, partial unrolling can be efficient due to the performance gains possible. Partial loop unrolling should be considered if the following conditions are met:

- Spare registers are available.
- The loop body is small, so that loop overhead is significant.
- The number of loop iterations is likely greater than 10.

Consider the following piece of C code:

   double a[MAX_LENGTH], b[MAX_LENGTH];
   for (i=0; i < MAX_LENGTH; i++) {
      a[i] = a[i] + b[i];
   }
Without loop unrolling, the code looks like the following:  
Without Loop Unrolling:
   MOV  ECX, MAX_LENGTH
   MOV  EAX, OFFSET A
   MOV  EBX, OFFSET B
$add_loop:
   FLD  QWORD PTR [EAX]
   FADD QWORD PTR [EBX]
   FSTP QWORD PTR [EAX]
   ADD  EAX, 8
   ADD  EBX, 8
   DEC  ECX
   JNZ  $add_loop
The loop consists of seven instructions. The AMD Athlon  
processor can decode/retire three instructions per cycle, so it  
cannot execute faster than three iterations in seven cycles, or  
3/7 floating-point adds per cycle. However, the pipelined  
floating-point adder allows one add every cycle. In the following  
code, the loop is partially unrolled by a factor of two, which  
creates potential endcases that must be handled outside the  
loop:  
With Partial Loop Unrolling:
   MOV  ECX, MAX_LENGTH
   MOV  EAX, offset A
   MOV  EBX, offset B
   SHR  ECX, 1
   JNC  $add_loop
   FLD  QWORD PTR [EAX]
   FADD QWORD PTR [EBX]
   FSTP QWORD PTR [EAX]
   ADD  EAX, 8
   ADD  EBX, 8
$add_loop:
   FLD  QWORD PTR [EAX]
   FADD QWORD PTR [EBX]
   FSTP QWORD PTR [EAX]
   FLD  QWORD PTR [EAX+8]
   FADD QWORD PTR [EBX+8]
   FSTP QWORD PTR [EAX+8]
   ADD  EAX, 16
   ADD  EBX, 16
   DEC  ECX
   JNZ  $add_loop
Now the loop consists of 10 instructions. Based on the  
decode/retire bandwidth of three OPs per cycle, this loop goes  
no faster than three iterations in 10 cycles, or 6/10  
floating-point adds per cycle, or 1.4 times as fast as the original  
loop.  
Deriving Loop Control For Partially Unrolled Loops
A frequently used loop construct is a counting loop. In a typical  
case, the loop count starts at some lower bound lo, increases by  
some fixed, positive increment inc for each iteration of the  
loop, and may not exceed some upper bound hi. The following  
example shows how to partially unroll such a loop by an  
unrolling factor of fac, and how to derive the loop control for  
the partially unrolled version of the loop.  
Example 1 (rolled loop):
   for (k = lo; k <= hi; k += inc) {
      x[k] = ...
   }

Example 2 (partially unrolled loop):
   for (k = lo; k <= (hi - (fac-1)*inc); k += fac*inc) {
      x[k] = ...
      x[k+inc] = ...
      ...
      x[k+(fac-1)*inc] = ...
   }
   /* handle end cases */
   for (k = k; k <= hi; k += inc) {
      x[k] = ...
   }
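The derived loop control can be sanity-checked in C with arbitrary bounds and an unrolling factor of 4; the order-sensitive hash (an illustrative device, not from the manual) verifies that both loops visit exactly the same sequence of k values:

```c
enum { LO = 3, HI = 50, INC = 2, FAC = 4 };

/* Order-sensitive hash over every k the rolled loop visits. */
static long rolled(void)
{
    long h = 0;
    for (int k = LO; k <= HI; k += INC)
        h = h * 31 + k;
    return h;
}

/* Same loop partially unrolled by FAC, with the derived bound
 * hi - (fac-1)*inc and a second loop for the end cases. */
static long unrolled(void)
{
    long h = 0;
    int k;
    for (k = LO; k <= HI - (FAC - 1) * INC; k += FAC * INC) {
        h = h * 31 + k;
        h = h * 31 + (k + INC);
        h = h * 31 + (k + 2 * INC);
        h = h * 31 + (k + 3 * INC);
    }
    for (; k <= HI; k += INC)   /* handle end cases */
        h = h * 31 + k;
    return h;
}
```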
Use Function Inlining

Overview

Make use of the AMD Athlon processor's large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead. Consider the cost of possible increased register usage, which can increase load/store instructions for register spilling.
Function inlining has the advantage of eliminating function call  
overhead and allowing better register allocation and  
instruction scheduling at the site of the function call. The  
disadvantage is decreasing code locality, which can increase  
execution time due to instruction cache misses. Therefore,  
function inlining is an optimization that has to be used  
judiciously.  
In general, due to its very large instruction cache, the  
AMD Athlon processor is less susceptible than other processors  
to the negative side effect of function inlining. Function call  
overhead on the AMD Athlon processor can be low because  
calls and returns are executed at high speed due to the use of  
prediction mechanisms. However, there is still overhead due to  
passing function arguments through memory, which creates  
STLF (store-to-load-forwarding) dependencies. Some compilers  
allow for a reduction of this overhead by allowing arguments to  
be passed in registers in one of their calling conventions, which  
has the drawback of constraining register allocation in the  
function and at the site of the function call.  
In general, function inlining works best if the compiler can  
utilize feedback from a profiler to identify the function call  
sites most frequently executed. If such data is not available, a  
reasonable heuristic is to concentrate on function calls inside  
loops. Functions that are directly recursive should not be  
considered candidates for inlining. However, if they are  
end-recursive, the compiler should convert them to an iterative  
equivalent to avoid potential overflow of the AMD Athlon  
processor return prediction mechanism (return stack) during  
deep recursion. For best results, a compiler should support  
function inlining across multiple source files. In addition, a  
compiler should provide inline templates for commonly used  
library functions, such as sin(), strcmp(), or memcpy().  
Always Inline Functions if Called from One Site  
A function should always be inlined if it can be established that  
it is called from just one site in the code. For the C language,  
determination of this characteristic is made easier if functions  
are explicitly declared static unless they require external  
linkage. This case occurs quite frequently, as functionality that  
could be concentrated in a single large function is split across  
multiple small functions for improved maintainability and  
readability.  
Always Inline Functions with Fewer than 25 Machine Instructions  
In addition, functions that create fewer than 25 machine  
instructions once inlined should always be inlined because it is  
likely that the function call overhead is close to or more than  
the time spent executing the function body. For large functions,  
the benefits of reduced function call overhead give
diminishing returns. Therefore, a function that results in the  
insertion of more than 500 machine instructions at the call site  
should probably not be inlined. Some larger functions might  
consist of multiple, relatively short paths that are negatively  
affected by function overhead. In such a case, it can be  
advantageous to inline larger functions. Profiling information is  
the best guide in determining whether to inline such large  
functions.  
Avoid Address Generation Interlocks  
Loads and stores are scheduled by the AMD Athlon processor to  
access the data cache in program order. Newer loads and stores  
with their addresses calculated can be blocked by older loads  
and stores whose addresses are not yet calculated; this is
known as an address generation interlock. Therefore, it is  
advantageous to schedule loads and stores that can calculate  
their addresses quickly, ahead of loads and stores that require  
the resolution of a long dependency chain in order to generate  
their addresses. Consider the following code examples.  
Example 1 (Avoid):
   ADD EBX, ECX                 ;inst 1
   MOV EAX, DWORD PTR [10h]     ;inst 2 (fast address calc.)
   MOV ECX, DWORD PTR [EAX+EBX] ;inst 3 (slow address calc.)
   MOV EDX, DWORD PTR [24h]     ;this load is stalled from
                                ; accessing data cache due
                                ; to long latency for
                                ; generating address for
                                ; inst 3

Example 2 (Preferred):
   ADD EBX, ECX                 ;inst 1
   MOV EAX, DWORD PTR [10h]     ;inst 2
   MOV EDX, DWORD PTR [24h]     ;place load above inst 3
                                ; to avoid address
                                ; generation interlock stall
   MOV ECX, DWORD PTR [EAX+EBX] ;inst 3
Use MOVZX and MOVSX

Use the MOVZX and MOVSX instructions to zero-extend and sign-extend byte-size and word-size operands to doubleword length. For example, typical code for zero extension creates a superset dependency when the zero-extended value is used, as in the following code:

Example 1 (Avoid):
   XOR EAX, EAX
   MOV AL, [MEM]

Example 2 (Preferred):
   MOVZX EAX, BYTE PTR [MEM]
Minimize Pointer Arithmetic in Loops  
Minimize pointer arithmetic in loops, especially if the loop  
body is small. In this case, the pointer arithmetic would cause  
significant overhead. Instead, take advantage of the complex  
addressing modes to utilize the loop counter to index into  
memory arrays. Using complex addressing modes does not have  
any negative impact on execution speed, but the reduced  
number of instructions preserves decode bandwidth.  
Example 1 (Avoid):
   int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
   for (i=0; i < MAXSIZE; i++) {
      c[i] = a[i] + b[i];
   }

   MOV ECX, MAXSIZE   ;initialize loop counter
   XOR ESI, ESI       ;initialize offset into array a
   XOR EDI, EDI       ;initialize offset into array b
   XOR EBX, EBX       ;initialize offset into array c
$add_loop:
   MOV EAX, [ESI + a] ;get element a
   MOV EDX, [EDI + b] ;get element b
   ADD EAX, EDX       ;a[i] + b[i]
   MOV [EBX + c], EAX ;write result to c
   ADD ESI, 4         ;increment offset into a
   ADD EDI, 4         ;increment offset into b
   ADD EBX, 4         ;increment offset into c
   DEC ECX            ;decrement loop count
   JNZ $add_loop      ;until loop count 0
Example 2 (Preferred):
   int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
   for (i=0; i < MAXSIZE; i++) {
      c[i] = a[i] + b[i];
   }

   MOV ECX, MAXSIZE-1   ;initialize loop counter
$add_loop:
   MOV EAX, [ECX*4 + a] ;get element a
   MOV EDX, [ECX*4 + b] ;get element b
   ADD EAX, EDX         ;a[i] + b[i]
   MOV [ECX*4 + c], EAX ;write result to c
   DEC ECX              ;decrement index
   JNS $add_loop        ;until index negative
Note that the code in example 2 traverses the arrays in a  
downward direction (i.e., from higher addresses to lower  
addresses), whereas the original code in example 1 traverses  
the arrays in an upward direction. Such a change in the  
direction of the traversal is possible if each loop iteration is  
completely independent of all other loop iterations, as is the  
case here.  
In code where the direction of the array traversal can't be
switched, it is still possible to minimize pointer arithmetic by  
appropriately biasing base addresses and using an index  
variable that starts with a negative value and reaches zero when  
the loop expires. Note that if the base addresses are held in  
registers (e.g., when the base addresses are passed as  
arguments of a function) biasing the base addresses requires  
additional instructions to perform the biasing at run time and a  
small amount of additional overhead is incurred. In the  
examples shown here the base addresses are used in the  
displacement portion of the address and biasing is  
accomplished at compile time by simply modifying the  
displacement.  
Example 3 (Preferred):
   int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
   for (i=0; i < MAXSIZE; i++) {
      c[i] = a[i] + b[i];
   }

   MOV ECX, (-MAXSIZE)              ;initialize index
$add_loop:
   MOV EAX, [ECX*4 + a + MAXSIZE*4] ;get a element
   MOV EDX, [ECX*4 + b + MAXSIZE*4] ;get b element
   ADD EAX, EDX                     ;a[i] + b[i]
   MOV [ECX*4 + c + MAXSIZE*4], EAX ;write result to c
   INC ECX                          ;increment index
   JNZ $add_loop                    ;until index==0
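The same bias-and-negative-index idea can be written directly in C (a sketch; the helper names are illustrative). The + MAXSIZE bias is folded into the base pointers once, just as example 3 folds it into the displacement, so the loop body needs only one index update and the termination test is a compare against zero:

```c
#define MAXSIZE 8

/* Index runs from -MAXSIZE up to 0; the bias lives in the base
 * pointers, and i != 0 doubles as the loop condition. */
static void add_arrays(const int *a, const int *b, int *c)
{
    const int *ab = a + MAXSIZE, *bb = b + MAXSIZE;
    int *cb = c + MAXSIZE;
    for (int i = -MAXSIZE; i != 0; i++)
        cb[i] = ab[i] + bb[i];
}

/* Returns 1 if add_arrays produces a[i] + b[i] for every element. */
static int check_add_arrays(void)
{
    int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE];
    for (int i = 0; i < MAXSIZE; i++) { a[i] = i; b[i] = 10 * i; }
    add_arrays(a, b, c);
    for (int i = 0; i < MAXSIZE; i++)
        if (c[i] != 11 * i) return 0;
    return 1;
}
```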
Push Memory Data Carefully  
Carefully choose the best method for pushing memory data. To  
reduce register pressure and code dependencies, follow  
example 2 below.  
Example 1 (Avoid):  
MOV EAX, [MEM]  
PUSH EAX  
Example 2 (Preferred):  
PUSH [MEM]  
8   Integer Optimizations

This chapter describes ways to improve integer performance through optimized programming techniques. The guidelines are listed in order of importance.
Replace Divides with Multiplies

Replace integer division by constants with multiplication by the reciprocal. Because the AMD Athlon processor has a very fast integer multiply (5-9 cycles signed, 4-8 cycles unsigned) and the integer division delivers only one bit of quotient per cycle (22-47 cycles signed, 17-41 cycles unsigned), the equivalent code is much faster. The user can follow the examples in this chapter that illustrate the use of integer division by constants, or access the executables in the opt_utilities directory of the AMD documentation CD-ROM (order# 21860) to find alternative code for dividing by a constant.
Multiplication by Reciprocal (Division) Utility

The derivation of the code emitted by the utilities can be found in the derivation sections referenced later in this chapter. All utilities were compiled for the Microsoft Windows® 95, Windows 98, and Windows NT® environments. All utilities are provided "as is" and are not supported by AMD.
Signed Division Utility

In the opt_utilities directory of the AMD documentation CD-ROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user enters a signed constant divisor. Type "sdiv > example.out" to output the code to a file.

Unsigned Division Utility

In the opt_utilities directory of the AMD documentation CD-ROM, run udiv.exe in a DOS window to find the fastest code for unsigned division by a constant. The utility displays the code after the user enters an unsigned constant divisor. Type "udiv > example.out" to output the code to a file.
Unsigned Division by Multiplication of Constant

Algorithm: Divisors 1 <= d < 2^31, Odd d

The following code shows an unsigned division using a constant value multiplier.

   ;In:  d = divisor, 1 <= d < 2^31, odd d
   ;Out: a = algorithm
   ;     m = multiplier
   ;     s = shift factor

   ;algorithm 0
   MOV EDX, dividend
   MOV EAX, m
   MUL EDX
   SHR EDX, s ;EDX=quotient

   ;algorithm 1
   MOV EDX, dividend
   MOV EAX, m
   MUL EDX
   ADD EAX, m
   ADC EDX, 0
   SHR EDX, s ;EDX=quotient
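Algorithm 0 is easy to replay in C. The pair m = 0CCCCCCCDh, s = 2 used below is the standard magic pair for d = 5 (shown as an example of the kind of output udiv.exe produces; the specific pair here is the well-known one, not taken from the utility). MUL's high half is modeled with a 64-bit product:

```c
#include <stdint.h>

/* Unsigned "algorithm 0" for d = 5: m = 0xCCCCCCCD, s = 2.
 * The 64-bit multiply models MUL leaving the high half in EDX. */
static uint32_t udiv5(uint32_t n)
{
    uint32_t hi = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 32); /* MUL: EDX */
    return hi >> 2;                                              /* SHR EDX, s */
}
```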
Derivation of a, m, s

The derivation for the algorithm (a), multiplier (m), and shift factor (s) is found in the section "Unsigned Derivation for Algorithm, Multiplier, and Shift Factor" later in this chapter.

Algorithm: Divisors 2^31 <= d < 2^32

For divisors 2^31 <= d < 2^32, the possible quotient values are either 0 or 1. This makes it easy to establish the quotient by simple comparison of the dividend and divisor. In cases where the dividend needs to be preserved, example 1 below is recommended.
Example 1:
   ;In:  EAX = dividend
   ;Out: EDX = quotient
   XOR EDX, EDX ;0
   CMP EAX, d   ;CF = (dividend < divisor) ? 1 : 0
   SBB EDX, -1  ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1

In cases where the dividend does not need to be preserved, the division can be accomplished without the use of an additional register, thus reducing register pressure. This is shown in example 2 below:

Example 2:
   ;In:  EDX = dividend
   ;Out: EAX = quotient
   CMP EDX, d  ;CF = (dividend < divisor) ? 1 : 0
   MOV EAX, 0  ;0
   SBB EAX, -1 ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
Simpler Code for Restricted Dividend

Integer division by a constant can be made faster if the range of the dividend is limited, which removes a shift associated with most divisors. For example, for a divide by 10 operation, use the following code if the dividend is less than 40000005h:

   MOV EAX, dividend
   MOV EDX, 01999999Ah
   MUL EDX
   MOV quotient, EDX
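The shift disappears because 01999999Ah is the smallest 32-bit integer no less than 2^32/10, so for small enough dividends the quotient lands directly in the high half of the product. A C check of this restricted-range divide (sketch; the function name is illustrative):

```c
#include <stdint.h>

/* Divide by 10 with no shift, valid for dividends below 0x40000005:
 * 0x1999999A = ceil(2^32 / 10), so the high half of the product
 * (EDX after MUL) is the quotient directly. */
static uint32_t udiv10_small(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x1999999Au) >> 32);
}
```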
Signed Division by Multiplication of Constant

Algorithm: Divisors 2 <= d < 2^31

These algorithms work if the divisor is positive. If the divisor is negative, use abs(d) instead of d, and append a "NEG EDX" to the code. The code makes use of the fact that n/-d = -(n/d).

   ;IN:  d = divisor, 2 <= d < 2^31
   ;OUT: a = algorithm
   ;     m = multiplier
   ;     s = shift count

   ;algorithm 0
   MOV EAX, m
   MOV EDX, dividend
   MOV ECX, EDX
   IMUL EDX
   SHR ECX, 31
   SAR EDX, s
   ADD EDX, ECX ;quotient in EDX
   ;algorithm 1
   MOV EAX, m
   MOV EDX, dividend
   MOV ECX, EDX
   IMUL EDX
   ADD EDX, ECX
   SHR ECX, 31
   SAR EDX, s
   ADD EDX, ECX ;quotient in EDX
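Signed "algorithm 0" can likewise be replayed in C. The pair m = 55555556h, s = 0 below is the standard magic pair for d = 3 (an example of the kind of output sdiv.exe produces; the specific pair is the well-known one, not taken from the utility). The final add of the dividend's sign bit corrects the quotient toward zero for negative dividends:

```c
#include <stdint.h>

/* Signed "algorithm 0" for d = 3: m = 0x55555556, s = 0.
 * The 64-bit multiply models IMUL leaving the high half in EDX;
 * SAR EDX, s is a no-op here because s = 0. */
static int32_t sdiv3(int32_t n)
{
    int32_t hi = (int32_t)(((int64_t)0x55555556 * n) >> 32); /* IMUL: EDX */
    return hi + (int32_t)((uint32_t)n >> 31);                /* SHR ECX,31 + ADD */
}
```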
Derivation for a, m, s

The derivation for the algorithm (a), multiplier (m), and shift count (s) is found in the section "Signed Derivation for Algorithm, Multiplier, and Shift Factor" later in this chapter.

Signed Division by 2

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CMP EAX, 80000000h ;CY = 1, if dividend >= 0
   SBB EAX, -1        ;Increment dividend if it is < 0
   SAR EAX, 1         ;Perform a right shift
Signed Division by 2^n

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CDQ              ;Sign extend into EDX
   AND EDX, (2^n-1) ;Mask correction (use divisor - 1)
   ADD EAX, EDX     ;Apply correction if necessary
   SAR EAX, (n)     ;Perform right shift by log2(divisor)
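In C, the CDQ/AND/ADD/SAR sequence becomes the following sketch. It assumes >> on a signed 32-bit integer is an arithmetic shift, which the SAR-based original makes explicit but which is implementation-defined in the C standard (true on mainstream compilers):

```c
#include <stdint.h>

/* Branchless truncating division by 2^n, the C analogue of
 * CDQ / AND / ADD / SAR. Assumes arithmetic right shift on int32_t. */
static int32_t sdiv_pow2(int32_t x, int n)
{
    int32_t corr = (x >> 31) & ((1 << n) - 1); /* CDQ + AND: 0 or 2^n - 1 */
    return (x + corr) >> n;                    /* ADD + SAR */
}
```

The correction is what turns the floor behavior of a bare arithmetic shift into the truncation toward zero that IDIV (and C division) produce for negative dividends.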
Signed Division by -2

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CMP EAX, 80000000h ;CY = 1, if dividend >= 0
   SBB EAX, -1        ;Increment dividend if it is < 0
   SAR EAX, 1         ;Perform right shift
   NEG EAX            ;Use (x/-2) == -(x/2)
Signed Division by -(2^n)

   ;IN:  EAX = dividend
   ;OUT: EAX = quotient
   CDQ              ;Sign extend into EDX
   AND EDX, (2^n-1) ;Mask correction (-divisor - 1)
   ADD EAX, EDX     ;Apply correction if necessary
   SAR EAX, (n)     ;Right shift by log2(-divisor)
   NEG EAX          ;Use (x/-(2^n)) == (-(x/2^n))
Remainder of Signed Integer Division by 2 or -2

   ;IN:  EAX = dividend
   ;OUT: EAX = remainder
   CDQ                  ;Sign extend into EDX
   AND EDX, 1           ;Compute remainder
   XOR EAX, EDX         ;Negate remainder if
   SUB EAX, EDX         ; dividend was < 0
   MOV [remainder], EAX
Remainder of Signed Integer Division by 2^n or -(2^n)

   ;IN:  EAX = dividend
   ;OUT: EAX = remainder
   CDQ                  ;Sign extend into EDX
   AND EDX, (2^n-1)     ;Mask correction (abs(divisor) - 1)
   ADD EAX, EDX         ;Apply pre-correction
   AND EAX, (2^n-1)     ;Mask out remainder (abs(divisor) - 1)
   SUB EAX, EDX         ;Apply pre-correction, if necessary
   MOV [remainder], EAX
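The remainder sequence in C (same arithmetic-shift assumption as above; the result carries the sign of the dividend, matching what IDIV and C's % operator produce for a power-of-two divisor of either sign):

```c
#include <stdint.h>

/* Remainder of truncating division by 2^n (or -(2^n); the remainder
 * is the same for both). C analogue of CDQ / AND / ADD / AND / SUB.
 * Assumes arithmetic right shift on int32_t. */
static int32_t srem_pow2(int32_t x, int n)
{
    int32_t m = (1 << n) - 1;
    int32_t corr = (x >> 31) & m;   /* pre-correction: 0 or 2^n - 1 */
    return ((x + corr) & m) - corr; /* mask out remainder, undo correction */
}
```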
Use Alternative Code When Multiplying by a Constant  
A 32-bit integer multiply by a constant has a latency of five  
cycles. Therefore, use alternative code when multiplying by  
certain constants. In addition, because there is just one  
multiply unit, the replacement code may provide better  
throughput.  
The following code samples are designed such that the original  
source also receives the final result. Other sequences are  
possible if the result is in a different register. Adds have been  
favored over shifts to keep code size small. Generally, there is a  
fast replacement if the constant has very few 1 bits in binary.  
More constants are found in the file multiply_by_constants.txt  
located in the same directory where this document is located in  
the SDK.  
by 2:  ADD REG1, REG1           ;1 cycle

by 3:  LEA REG1, [REG1*2+REG1]  ;2 cycles

by 4:  SHL REG1, 2              ;1 cycle

by 5:  LEA REG1, [REG1*4+REG1]  ;2 cycles

by 6:  LEA REG2, [REG1*4+REG1]  ;3 cycles
       ADD REG1, REG2

by 7:  MOV REG2, REG1           ;2 cycles
       SHL REG1, 3
       SUB REG1, REG2

by 8:  SHL REG1, 3              ;1 cycle

by 9:  LEA REG1, [REG1*8+REG1]  ;2 cycles

by 10: LEA REG2, [REG1*8+REG1]  ;3 cycles
       ADD REG1, REG2
by 11: LEA REG2, [REG1*8+REG1]  ;3 cycles
       ADD REG1, REG1
       ADD REG1, REG2

by 12: SHL REG1, 2              ;3 cycles
       LEA REG1, [REG1*2+REG1]

by 13: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 4
       SUB REG1, REG2

by 14: LEA REG2, [REG1*4+REG1]  ;3 cycles
       LEA REG1, [REG1*8+REG1]
       ADD REG1, REG2

by 15: MOV REG2, REG1           ;2 cycles
       SHL REG1, 4
       SUB REG1, REG2

by 16: SHL REG1, 4              ;1 cycle

by 17: MOV REG2, REG1           ;2 cycles
       SHL REG1, 4
       ADD REG1, REG2

by 18: ADD REG1, REG1           ;3 cycles
       LEA REG1, [REG1*8+REG1]

by 19: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2

by 20: SHL REG1, 2              ;3 cycles
       LEA REG1, [REG1*4+REG1]

by 21: LEA REG2, [REG1*4+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2

by 22: use IMUL

by 23: LEA REG2, [REG1*8+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 24: SHL REG1, 3              ;3 cycles
       LEA REG1, [REG1*2+REG1]

by 25: LEA REG2, [REG1*8+REG1]  ;3 cycles
       SHL REG1, 4
       ADD REG1, REG2
by 26: use IMUL

by 27: LEA REG2, [REG1*4+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 28: MOV REG2, REG1           ;3 cycles
       SHL REG1, 3
       SUB REG1, REG2
       SHL REG1, 2

by 29: LEA REG2, [REG1*2+REG1]  ;3 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 30: MOV REG2, REG1           ;3 cycles
       SHL REG1, 4
       SUB REG1, REG2
       ADD REG1, REG1

by 31: MOV REG2, REG1           ;2 cycles
       SHL REG1, 5
       SUB REG1, REG2

by 32: SHL REG1, 5              ;1 cycle
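A few of the table's decompositions written out in C to show the arithmetic each sequence performs (shifts and adds only; the helper names are illustrative):

```c
#include <stdint.h>

/* Shift/add replacements from the table, spelled out in C. */
static uint32_t mul_by_6(uint32_t x)  { return (x * 4 + x) + x; }      /* LEA + ADD   */
static uint32_t mul_by_7(uint32_t x)  { return (x << 3) - x; }         /* SHL + SUB   */
static uint32_t mul_by_13(uint32_t x) { return (x << 4) - (x*2 + x); } /* LEA,SHL,SUB */
static uint32_t mul_by_28(uint32_t x) { return ((x << 3) - x) << 2; }  /* 7x, then *4 */
```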
Use MMX Instructions for Integer-Only Work

In many programs it can be advantageous to use MMX instructions to do integer-only work, especially if the function already uses 3DNow! or MMX code. Using MMX instructions relieves register pressure on the integer registers. As long as data is simply loaded/stored, added, shifted, etc., MMX instructions are good substitutes for integer instructions. Integer registers are freed up with the following results:

- The number of integer registers to be saved/restored on function entry/exit may be reduced.
- Integer registers are freed up for pointers, loop counters, etc., so that they do not have to be spilled to memory, which reduces memory traffic and latency in dependency chains.
Be careful with regard to passing data between MMX and integer registers and of creating mismatched store-to-load forwarding cases.
In addition, using MMX instructions increases the available  
parallelism. The AMD Athlon processor can issue three integer  
OPs and two MMX OPs per cycle.  
Repeated String Instruction Usage

Latency of Repeated String Instructions

Table 1 shows the latency for repeated string instructions on the AMD Athlon processor.

Table 1. Latency of Repeated String Instructions

   Instruction   ECX=0 (cycles)   DF = 0 (cycles)   DF = 1 (cycles)
   REP MOVS      11               15 + (4/3*c)      25 + (4/3*c)
   REP STOS      11               14 + (1*c)        24 + (1*c)
   REP LODS      11               15 + (2*c)        15 + (2*c)
   REP SCAS      11               15 + (5/2*c)      15 + (5/2*c)
   REP CMPS      11               16 + (10/3*c)     16 + (10/3*c)

   Note: c = value of ECX, (ECX > 0)
Table 1 lists the latencies with the direction flag (DF) = 0 (increment) and DF = 1 (decrement). These latencies also assume aligned memory operands. Note that for MOVS/STOS, when DF = 1 (DOWN), the overhead portion of the latency increases significantly; however, such cases are less common. To determine the latency, use the formula and round up to the nearest integer value; for example, REP MOVS with ECX = 100 and DF = 0 takes 15 + ceil((4/3)*100) = 15 + 134 = 149 cycles.
Guidelines for Repeated String Instructions

To help achieve good performance, this section contains guidelines for the careful scheduling of VectorPath repeated string instructions.

Use the Largest Possible Operand Size
Always move data using the largest operand size possible. For example, use REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and REP STOSW rather than REP STOSB.
Ensure DF=0 (UP)
Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, when source and destination overlap). While string instructions with DF = 1 (DOWN) are slower, only the overhead part of the cycle equation is larger, not the per-iteration part; see Table 1, "Latency of Repeated String Instructions," for the latency numbers.
Align Source and Destination with Operand Size
For REP MOVS, make sure that both source and destination are aligned with regard to the operand size. Handle the end case separately, if necessary. If either source or destination cannot be aligned, make the destination aligned and the source misaligned. For REP STOS, make the destination aligned.
Inline REP String with Low Counts
Expand REP string instructions into equivalent sequences of simple x86 instructions if the repeat count is constant and less than eight. Use an inline sequence of loads and stores to accomplish the move. Use a sequence of stores to emulate REP STOS. This technique eliminates the setup overhead of REP instructions and increases instruction throughput.
Use Loop for REP String with Low Variable Counts
If the repeat count is variable, but is likely less than eight, use a simple loop to move/store the data. This technique avoids the overhead of REP MOVS and REP STOS.
Using MOVQ and MOVNTQ for Block Copy/Fill
To fill or copy blocks of data that are larger than 512 bytes, or where the destination is in uncacheable memory, use the MMX™ instructions MOVQ/MOVNTQ instead of REP STOS and REP MOVS in order to achieve maximum performance. (See the related guideline on using MMX™ instructions for block copies and block fills.)
Use XOR Instruction to Clear Integer Registers  
To clear an integer register to all 0s, use XOR reg, reg. The  
AMD Athlon processor is able to avoid the false read  
dependency on the XOR instruction.  
Example 1 (Acceptable):
  MOV REG, 0

Example 2 (Preferred):
  XOR REG, REG
Efficient 64-Bit Integer Arithmetic  
This section contains a collection of code snippets and  
subroutines showing the efficient implementation of 64-bit  
arithmetic. Addition, subtraction, negation, and shifts are best  
handled by inline code. Multiplies, divides, and remainders are  
less common operations and should usually be implemented as  
subroutines. If these subroutines are used often, the  
programmer should consider inlining them. Except for division  
and remainder, the code presented works for both signed and  
unsigned integers. The division and remainder code shown  
works for unsigned integers, but can easily be extended to  
handle signed integers.  
Example 1 (Addition):
;add operand in ECX:EBX to operand EDX:EAX, result in
; EDX:EAX
  ADD EAX, EBX
  ADC EDX, ECX

Example 2 (Subtraction):
;subtract operand in ECX:EBX from operand EDX:EAX, result in
; EDX:EAX
  SUB EAX, EBX
  SBB EDX, ECX
Example 3 (Negation):
;negate operand in EDX:EAX
  NOT EDX
  NEG EAX
  SBB EDX, -1   ;fixup: increment hi-word if low-word was 0
Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (count
; applied modulo 64)
  SHLD  EDX, EAX, CL    ;first apply shift count
  SHL   EAX, CL         ; mod 32 to EDX:EAX
  TEST  ECX, 32         ;need to shift by another 32?
  JZ    $lshift_done    ;no, done
  MOV   EDX, EAX        ;left shift EDX:EAX
  XOR   EAX, EAX        ; by 32 bits
$lshift_done:

Example 5 (Right shift):
;shift operand in EDX:EAX right, shift count in ECX (count
; applied modulo 64)
  SHRD  EAX, EDX, CL    ;first apply shift count
  SHR   EDX, CL         ; mod 32 to EDX:EAX
  TEST  ECX, 32         ;need to shift by another 32?
  JZ    $rshift_done    ;no, done
  MOV   EAX, EDX        ;right shift EDX:EAX
  XOR   EDX, EDX        ; by 32 bits
$rshift_done:
Example 6 (Multiplication):

;_llmul computes the low-order half of the product of its
; arguments, two 64-bit integers
;
;INPUT:    [ESP+8]:[ESP+4]   multiplicand
;          [ESP+16]:[ESP+12] multiplier
;
;OUTPUT:   EDX:EAX  (multiplicand * multiplier) % 2^64
;
;DESTROYS: EAX,ECX,EDX,EFlags

_llmul PROC
  MOV  EDX, [ESP+8]     ;multiplicand_hi
  MOV  ECX, [ESP+16]    ;multiplier_hi
  OR   EDX, ECX         ;one operand >= 2^32?
  MOV  EDX, [ESP+12]    ;multiplier_lo
  MOV  EAX, [ESP+4]     ;multiplicand_lo
  JNZ  $twomul          ;yes, need two multiplies
  MUL  EDX              ;multiplicand_lo * multiplier_lo
  RET                   ;done, return to caller

$twomul:
  IMUL EDX, [ESP+8]     ;p3_lo = multiplicand_hi*multiplier_lo
  IMUL ECX, EAX         ;p2_lo = multiplier_hi*multiplicand_lo
  ADD  ECX, EDX         ; p2_lo + p3_lo
  MUL  DWORD PTR [ESP+12] ;p1 = multiplicand_lo*multiplier_lo
  ADD  EDX, ECX         ;p1 + p2_lo + p3_lo = result in EDX:EAX
  RET                   ;done, return to caller
_llmul ENDP
Example 7 (Division):

;_ulldiv divides two unsigned 64-bit integers, and returns
; the quotient.
;
;INPUT:    [ESP+8]:[ESP+4]   dividend
;          [ESP+16]:[ESP+12] divisor
;
;OUTPUT:   EDX:EAX  quotient of division
;
;DESTROYS: EAX,ECX,EDX,EFlags

_ulldiv PROC
  PUSH EBX              ;save EBX as per calling convention
  MOV  ECX, [ESP+20]    ;divisor_hi
  MOV  EBX, [ESP+16]    ;divisor_lo
  MOV  EDX, [ESP+12]    ;dividend_hi
  MOV  EAX, [ESP+8]     ;dividend_lo
  TEST ECX, ECX         ;divisor > 2^32-1?
  JNZ  $big_divisor     ;yes, divisor > 2^32-1
  CMP  EDX, EBX         ;only one division needed? (ECX = 0)
  JAE  $two_divs        ;need two divisions
  DIV  EBX              ;EAX = quotient_lo
  MOV  EDX, ECX         ;EDX = quotient_hi = 0 (quotient in EDX:EAX)
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$two_divs:
  MOV  ECX, EAX         ;save dividend_lo in ECX
  MOV  EAX, EDX         ;get dividend_hi
  XOR  EDX, EDX         ;zero extend it into EDX:EAX
  DIV  EBX              ;quotient_hi in EAX
  XCHG EAX, ECX         ;ECX = quotient_hi, EAX = dividend_lo
  DIV  EBX              ;EAX = quotient_lo
  MOV  EDX, ECX         ;EDX = quotient_hi (quotient in EDX:EAX)
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$big_divisor:
  PUSH EDI              ;save EDI as per calling convention
  MOV  EDI, ECX         ;save divisor_hi
  SHR  EDX, 1           ;shift both divisor and dividend right
  RCR  EAX, 1           ; by 1 bit
  ROR  EDI, 1
  RCR  EBX, 1
  BSR  ECX, ECX         ;ECX = number of remaining shifts
  SHRD EBX, EDI, CL     ;scale down divisor and dividend
  SHRD EAX, EDX, CL     ; such that divisor is
  SHR  EDX, CL          ; less than 2^32 (i.e., fits in EBX)
  ROL  EDI, 1           ;restore original divisor_hi
  DIV  EBX              ;compute quotient
  MOV  EBX, [ESP+12]    ;dividend_lo
  MOV  ECX, EAX         ;save quotient
  IMUL EDI, EAX         ;quotient * divisor hi-word (low only)
  MUL  DWORD PTR [ESP+20] ;quotient * divisor lo-word
  ADD  EDX, EDI         ;EDX:EAX = quotient * divisor
  SUB  EBX, EAX         ;dividend_lo - (quot.*divisor)_lo
  MOV  EAX, ECX         ;get quotient
  MOV  ECX, [ESP+16]    ;dividend_hi
  SBB  ECX, EDX         ;subtract divisor * quot. from dividend
  SBB  EAX, 0           ;adjust quotient if remainder negative
  XOR  EDX, EDX         ;clear hi-word of quot (EAX <= FFFFFFFFh)
  POP  EDI              ;restore EDI as per calling convention
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller
_ulldiv ENDP
Example 8 (Remainder):

;_ullrem divides two unsigned 64-bit integers, and returns
; the remainder.
;
;INPUT:    [ESP+8]:[ESP+4]   dividend
;          [ESP+16]:[ESP+12] divisor
;
;OUTPUT:   EDX:EAX  remainder of division
;
;DESTROYS: EAX,ECX,EDX,EFlags

_ullrem PROC
  PUSH EBX              ;save EBX as per calling convention
  MOV  ECX, [ESP+20]    ;divisor_hi
  MOV  EBX, [ESP+16]    ;divisor_lo
  MOV  EDX, [ESP+12]    ;dividend_hi
  MOV  EAX, [ESP+8]     ;dividend_lo
  TEST ECX, ECX         ;divisor > 2^32-1?
  JNZ  $r_big_divisor   ;yes, divisor > 2^32-1
  CMP  EDX, EBX         ;only one division needed? (ECX = 0)
  JAE  $r_two_divs      ;need two divisions
  DIV  EBX              ;EAX = quotient_lo
  MOV  EAX, EDX         ;EAX = remainder_lo
  MOV  EDX, ECX         ;EDX = remainder_hi = 0
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$r_two_divs:
  MOV  ECX, EAX         ;save dividend_lo in ECX
  MOV  EAX, EDX         ;get dividend_hi
  XOR  EDX, EDX         ;zero extend it into EDX:EAX
  DIV  EBX              ;EAX = quotient_hi, EDX = intermediate
                        ; remainder
  MOV  EAX, ECX         ;EAX = dividend_lo
  DIV  EBX              ;EAX = quotient_lo
  MOV  EAX, EDX         ;EAX = remainder_lo
  XOR  EDX, EDX         ;EDX = remainder_hi = 0
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller

$r_big_divisor:
  PUSH EDI              ;save EDI as per calling convention
  MOV  EDI, ECX         ;save divisor_hi
  SHR  EDX, 1           ;shift both divisor and dividend right
  RCR  EAX, 1           ; by 1 bit
  ROR  EDI, 1
  RCR  EBX, 1
  BSR  ECX, ECX         ;ECX = number of remaining shifts
  SHRD EBX, EDI, CL     ;scale down divisor and dividend such
  SHRD EAX, EDX, CL     ; that divisor is less than 2^32
  SHR  EDX, CL          ; (i.e., fits in EBX)
  ROL  EDI, 1           ;restore original divisor_hi
  DIV  EBX              ;compute quotient
  MOV  EBX, [ESP+12]    ;dividend lo-word
  MOV  ECX, EAX         ;save quotient
  IMUL EDI, EAX         ;quotient * divisor hi-word (low only)
  MUL  DWORD PTR [ESP+20] ;quotient * divisor lo-word
  ADD  EDX, EDI         ;EDX:EAX = quotient * divisor
  SUB  EBX, EAX         ;dividend_lo - (quot.*divisor)_lo
  MOV  ECX, [ESP+16]    ;dividend_hi
  MOV  EAX, [ESP+20]    ;divisor_lo
  SBB  ECX, EDX         ;subtract divisor * quot. from dividend
  SBB  EDX, EDX         ;(remainder < 0) ? 0xFFFFFFFF : 0
  AND  EAX, EDX         ;(remainder < 0) ? divisor_lo : 0
  AND  EDX, [ESP+24]    ;(remainder < 0) ? divisor_hi : 0
  ADD  EAX, EBX         ;remainder += (remainder < 0) ?
  ADC  EDX, ECX         ; divisor : 0
  POP  EDI              ;restore EDI as per calling convention
  POP  EBX              ;restore EBX as per calling convention
  RET                   ;done, return to caller
_ullrem ENDP
Efficient Implementation of Population Count Function  
Population count is an operation that determines the number of  
set bits in a bit string. For example, this can be used to  
determine the cardinality of a set. The following example code  
shows how to efficiently implement a population count  
operation for 32-bit operands. The example is written for the  
inline assembler of Microsoft Visual C.  
Function popcount() implements a branchless computation of the population count. It is based on an O(log(n)) algorithm that successively groups the bits into groups of 2, 4, 8, 16, and 32, while maintaining a count of the set bits in each group. The algorithm consists of the following steps:
Step 1  
Partition the integer into groups of two bits. Compute the  
population count for each 2-bit group and store the result in the  
2-bit group. This calls for the following transformation to be  
performed for each 2-bit group:  
00b -> 00b  
01b -> 01b  
10b -> 01b  
11b -> 10b  
If the original value of a 2-bit group is v, then the new value will  
be v - (v >> 1). In order to handle all 2-bit groups simultaneously,  
it is necessary to mask appropriately to prevent spilling from  
one bit group to the next lower bit group. Thus:  
w = v - ((v >> 1) & 0x55555555)  
Step 2  
Add the population counts of adjacent 2-bit groups and store the
sum in the 4-bit group resulting from merging these adjacent
2-bit groups. To do this simultaneously to all groups, mask out  
the odd numbered groups, mask out the even numbered groups,  
and then add the odd numbered groups to the even numbered  
groups:  
x = (w & 0x33333333) + ((w >> 2) & 0x33333333)  
Each 4-bit field now has value 0000b, 0001b, 0010b, 0011b, or  
0100b.  
Step 3  
For the first time, the value in each k-bit field is small enough  
that adding two k-bit fields results in a value that still fits in the  
k-bit field. Thus the following computation is performed:  
y = (x + (x >> 4)) & 0x0F0F0F0F  
The result is four 8-bit fields whose lower half has the desired  
sum and whose upper half contains "junk" that has to be  
masked out. In a symbolic form:  
x      = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum    = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
The WWWW, XXXX, YYYY, and ZZZZ values are the  
interesting sums with each at most 1000b, or 8 decimal.  
Step 4  
The four 4-bit sums can now be rapidly accumulated by means  
of a multiply with a "magic" multiplier. This can be derived  
from looking at the following chart of partial products:  
0p0q0r0s * 01010101 =  
:0p0q0r0s  
0p:0q0r0s  
0p0q:0r0s  
0p0q0r:0s  
000pxxww:vvuutt0s  
Here p, q, r, and s are the 4-bit sums from the previous step, and  
vv is the final result in which we are interested. Thus, the final  
result:  
z = (y * 0x01010101) >> 24  
Example:

unsigned int popcount(unsigned int v)
{
   unsigned int retVal;
   __asm {
      MOV  EAX, [v]        ;v
      MOV  EDX, EAX        ;v
      SHR  EAX, 1          ;v >> 1
      AND  EAX, 055555555h ;(v >> 1) & 0x55555555
      SUB  EDX, EAX        ;w = v - ((v >> 1) & 0x55555555)
      MOV  EAX, EDX        ;w
      SHR  EDX, 2          ;w >> 2
      AND  EAX, 033333333h ;w & 0x33333333
      AND  EDX, 033333333h ;(w >> 2) & 0x33333333
      ADD  EAX, EDX        ;x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
      MOV  EDX, EAX        ;x
      SHR  EAX, 4          ;x >> 4
      ADD  EAX, EDX        ;x + (x >> 4)
      AND  EAX, 00F0F0F0Fh ;y = (x + (x >> 4)) & 0x0F0F0F0F
      IMUL EAX, 001010101h ;y * 0x01010101
      SHR  EAX, 24         ;population count = (y * 0x01010101) >> 24
      MOV  retVal, EAX     ;store result
   }
   return (retVal);
}
Derivation of Multiplier Used for Integer Division by Constants
Unsigned Derivation for Algorithm, Multiplier, and Shift Factor  
The utility udiv.exe was compiled using the code shown in this  
section.  
The following code derives the multiplier value used when performing integer division by constants. The code works for unsigned integer division and for odd divisors between 1 and 2^31 - 1, inclusive. For even divisors of the form d' = d * 2^n, with d odd, the multiplier is the same as for d and the shift factor is s + n.
/* Code snippet to determine algorithm (a), multiplier (m),  
and shift factor (s) to perform division on unsigned 32-bit  
integers by constant divisor. Code is written for the  
Microsoft Visual C compiler. */  
/*
 In:  d = divisor, 1 <= d < 2^31, d odd

 Out: a = algorithm
      m = multiplier
      s = shift factor

 ;algorithm 0
 MOV EDX, dividend
 MOV EAX, m
 MUL EDX
 SHR EDX, s      ;EDX=quotient

 ;algorithm 1
 MOV EDX, dividend
 MOV EAX, m
 MUL EDX
 ADD EAX, m
 ADC EDX, 0
 SHR EDX, s      ;EDX=quotient
*/
typedef unsigned __int64 U64;
typedef unsigned long    U32;
U32 d, l, s, m, a, r;  
U64 m_low, m_high, j, k;  
U32 log2 (U32 i)  
{
U32 t = 0;  
i = i >> 1;  
while (i) {  
i = i >> 1;  
t++;  
}
return (t);  
}
/* Generate m, s for algorithm 0. Based on: Granlund, T.;  
Montgomery, P.L.:"Division by Invariant Integers using  
Multiplication”. SIGPLAN Notices, Vol. 29, June 1994, page  
61. */  
l = log2(d) + 1;
j = (((U64)(0xffffffff)) % ((U64)(d)));
k = (((U64)(1)) << (32+l)) / ((U64)(0xffffffff - j));
m_low  = (((U64)(1)) << (32+l)) / d;
m_high = ((((U64)(1)) << (32+l)) + k) / d;
while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {  
m_low = m_low >> 1;  
m_high = m_high >> 1;  
   l = l - 1;
}
if ((m_high >> 32) == 0) {  
m = ((U32)(m_high));  
s = l;  
a = 0;  
}
/* Generate m, s for algorithm 1. Based on: Magenheimer,  
D.J.; et al: “Integer Multiplication and Division on the HP  
Precision Architecture”. IEEE Transactions on Computers, Vol  
37, No. 8, August 1988, page 980. */  
else {  
s = log2(d);  
m_low = (((U64)(1)) << (32+s)) / ((U64)(d));  
r
= ((U32)((((U64)(1)) << (32+s)) % ((U64)(d))));  
m = (r < ((d>>1)+1)) ? ((U32)(m_low)) : ((U32)(m_low))+1;  
a = 1;  
}
/* Reduce multiplier/shift factor for either algorithm to  
smallest possible */  
while (!(m&1)) {  
m = m >> 1;  
   s--;
}
Signed Derivation for Algorithm, Multiplier, and Shift Factor  
The utility sdiv.exe was compiled using the following code.  
/* Code snippet to determine algorithm (a), multiplier (m),  
and shift count (s) for 32-bit signed integer division,  
given divisor d. Written for Microsoft Visual C compiler. */  
/*  
IN: d = divisor, 2 <= d < 2^31  
OUT: a = algorithm  
m = multiplier  
s = shift count  
;algorithm 0  
MOV EAX, m  
MOV EDX, dividend  
MOV ECX, EDX  
IMUL EDX  
SHR ECX, 31  
SAR EDX, s  
ADD EDX, ECX    ;quotient in EDX
;algorithm 1  
MOV EAX, m  
MOV EDX, dividend  
MOV ECX, EDX  
IMUL EDX  
ADD EDX, ECX  
SHR ECX, 31  
SAR EDX, s  
ADD EDX, ECX    ;quotient in EDX
*/

typedef unsigned __int64 U64;
typedef unsigned long    U32;
U32 log2 (U32 i)  
{
U32 t = 0;  
i = i >> 1;  
while (i) {  
i = i >> 1;  
t++;  
}
return (t);  
}
U32 d, l, s, m, a;  
U64 m_low, m_high, j, k;  
/* Determine algorithm (a), multiplier (m), and shift count  
(s) for 32-bit signed integer division. Based on: Granlund,  
T.; Montgomery, P.L.: “Division by Invariant Integers using  
Multiplication”. SIGPLAN Notices, Vol. 29, June 1994, page  
61. */  
l = log2(d);
j = (((U64)(0x80000000)) % ((U64)(d)));
k = (((U64)(1)) << (32+l)) / ((U64)(0x80000000 - j));
m_low  = (((U64)(1)) << (32+l)) / d;
m_high = ((((U64)(1)) << (32+l)) + k) / d;
while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {  
m_low = m_low >> 1;  
m_high = m_high >> 1;  
   l = l - 1;
}
m = ((U32)(m_high));  
s = l;  
a = (m_high >> 31) ? 1 : 0;  
9   Floating-Point Optimizations

This chapter details the methods used to optimize floating-point code for the pipelined floating-point unit (FPU). Guidelines are listed in order of importance.
Ensure All FPU Data is Aligned

As discussed in the memory alignment guidelines on page 45, floating-point data should be naturally aligned: words on word boundaries, doublewords on doubleword boundaries, and quadwords on quadword boundaries. Misaligned memory accesses reduce the available memory bandwidth.
Use Multiplies Rather than Divides

If accuracy requirements allow, convert floating-point division by a constant into a multiply by the reciprocal. Divisors that are powers of two, and their reciprocals, are exactly representable, and therefore do not cause an accuracy issue, except in the rare case that the reciprocal overflows or underflows. Unless such an overflow or underflow occurs, a division by a power of two should always be converted to a multiply. Although the AMD Athlon™ processor has high-performance division, multiplies are significantly faster than divides.
Use FFREEP Macro to Pop One Register from the FPU Stack  
In FPU intensive code, frequently accessed data is often  
pre-loaded at the bottom of the FPU stack before processing  
floating-point data. After completion of processing, it is  
desirable to remove the pre-loaded data from the FPU stack as  
quickly as possible. The classical way to clean up the FPU stack  
is to use either of the following instructions:  
  FSTP   ST(0)   ;removes one register from stack
  FCOMPP         ;removes two registers from stack

On the AMD Athlon processor, a faster alternative is to use the FFREEP instruction:

  FFREEP ST(0)   ;removes one register from stack

Note that the FFREEP instruction, although insufficiently documented in the past, is supported by all 32-bit x86 processors. The opcode bytes for FFREEP ST(i) are DF C0+i.

FFREEP ST(i) works like FFREE ST(i) except that it increments the FPU top-of-stack after doing the FFREE work. In other words, FFREEP ST(i) marks ST(i) as empty, then increments the x87 stack pointer. On the AMD Athlon processor, the FFREEP instruction converts to an internal NOP, which can go down any pipe with no dependencies.

Many assemblers do not support the FFREEP instruction. In these cases, a simple text macro can be created to facilitate use of FFREEP ST(0):

  FFREEP_ST0  TEXTEQU  <DB 0DFh, 0C0h>
Floating-Point Compare Instructions  
For branches that are dependent on floating-point comparisons,  
use the following instructions:  
FCOMI  
FCOMIP  
FUCOMI  
FUCOMIP  
These instructions are much faster than the classical approach  
using FSTSW, because FSTSW is essentially a serializing  
instruction on the AMD Athlon processor. When FSTSW cannot  
be avoided (for example, backward compatibility of code with  
older processors), no FPU instruction should occur between an  
FCOM[P], FICOM[P], FUCOM[P], or FTST and a dependent  
FSTSW. This optimization allows the use of a fast forwarding  
mechanism for the FPU condition codes internal to the  
AMD Athlon processor FPU and increases performance.  
Use the FXCH Instruction Rather than FST/FLD Pairs  
Increase parallelism by breaking up dependency chains or by  
evaluating multiple dependency chains simultaneously by  
explicitly switching execution between them. Although the  
AMD Athlon processor FPU has a deep scheduler, which in  
most cases can extract sufficient parallelism from existing code,  
long dependency chains can stall the scheduler while issue slots  
are still available. The maximum dependency chain length that  
the scheduler can absorb is about six 4-cycle instructions.  
To switch execution between dependency chains, use of the  
FXCH instruction is recommended because it has an apparent  
latency of zero cycles and generates only one OP. The  
AMD Athlon processor FPU contains special hardware to  
handle up to three FXCH instructions per cycle. Using FXCH is  
preferred over the use of FST/FLD pairs, even if the FST/FLD  
pair works on a register. An FST/FLD pair adds two cycles of  
latency and consists of two OPs.  
Avoid Using Extended-Precision Data  
Store data as either single-precision or double-precision  
quantities. Loading and storing extended-precision data is  
comparatively slower.  
Minimize Floating-Point-to-Integer Conversions  
C++, C, and Fortran define floating-point-to-integer conversions  
as truncating. This creates a problem because the active  
rounding mode in an application is typically round-to-nearest-  
even. The classical way to do a double-to-int conversion  
therefore works as follows:  
Example 1 (Fast):
  FLD   QWORD PTR [X]           ;load double to be converted
  FSTCW [SAVE_CW]               ;save current FPU control word
  MOVZX EAX, WORD PTR [SAVE_CW] ;retrieve control word
  OR    EAX, 0C00h              ;rounding control field = truncate
  MOV   WORD PTR [NEW_CW], AX   ;new FPU control word
  FLDCW [NEW_CW]                ;load new FPU control word
  FISTP DWORD PTR [I]           ;do double->int conversion
  FLDCW [SAVE_CW]               ;restore original control word
The AMD Athlon processor contains special acceleration  
hardware to execute such code as quickly as possible. In most  
situations, the above code is therefore the fastest way to  
perform floating-point-to-integer conversion and the conversion  
is compliant both with programming language standards and  
the IEEE-754 standard.  
According to the recommendations for inlining (see the inlining guidelines on page 72), the above code should not be put into a separate subroutine (e.g., ftol). It should instead be inlined into the main code.
In some codes, floating-point numbers are converted to an  
integer and the result is immediately converted back to  
floating-point. In such cases, the FRNDINT instruction should  
be used for maximum performance instead of FISTP in the code  
above. FRNDINT delivers the integral result directly to an FPU  
register in floating-point form, which is faster than first using  
FISTP to store the integer result and then converting it back to  
floating-point with FILD.  
If there are multiple, consecutive floating-point-to-integer  
conversions, the cost of FLDCW operations should be  
minimized by saving the current FPU control word, forcing the  
FPU into truncating mode, and performing all of the  
conversions before restoring the original control word.  
The speed of the above code is somewhat dependent on the  
nature of the code surrounding it. For applications in which the  
speed of floating-point-to-integer conversions is extremely  
critical for application performance, experiment with either of  
the following substitutions, which may or may not be faster than  
the code above.  
The first substitution simulates a truncating floating-point to  
integer conversion provided that there are no NaNs, infinities,  
and overflows. This conversion is therefore not IEEE-754  
compliant. This code works properly only if the current FPU  
rounding mode is round-to-nearest-even, which is usually the  
case.  
Example 2 (Potentially faster):
  FLD   QWORD PTR [X]    ;load double to be converted
  FST   DWORD PTR [TX]   ;store X because sign(X) is needed
  FIST  DWORD PTR [I]    ;store rndint(X) as default result
  FISUB DWORD PTR [I]    ;compute DIFF = X - rndint(X)
  FSTP  DWORD PTR [DIFF] ;store DIFF as we need sign(DIFF)
  MOV   EAX, [TX]        ;X
  MOV   EDX, [DIFF]      ;DIFF
  TEST  EDX, EDX         ;DIFF == 0?
  JZ    $DONE            ;default result is OK, done
  XOR   EDX, EAX         ;need correction if sign(X) != sign(DIFF)
  SAR   EAX, 31          ;(X<0) ? 0xFFFFFFFF : 0
  SAR   EDX, 31          ;sign(X)!=sign(DIFF) ? 0xFFFFFFFF : 0
  LEA   EAX, [EAX+EAX+1] ;(X<0) ? 0xFFFFFFFF : 1
  AND   EDX, EAX         ;correction: -1, 0, 1
  SUB   [I], EDX         ;trunc(X) = rndint(X) - correction
$DONE:
The second substitution simulates a truncating floating-point to  
integer conversion using only integer instructions and therefore  
works correctly independent of the FPU's current rounding
mode. It does not handle NaNs, infinities, and overflows  
according to the IEEE-754 standard. Note that the first  
instruction of this code may cause an STLF size mismatch  
resulting in performance degradation if the variable to be  
converted has been stored recently.  
Example 3 (Potentially faster):
  MOV  ECX, DWORD PTR [X+4] ;get upper 32 bits of double
  XOR  EDX, EDX             ;i = 0
  MOV  EAX, ECX             ;save sign bit
  AND  ECX, 07FF00000h      ;isolate exponent field
  CMP  ECX, 03FF00000h      ;if abs(x) < 1.0
  JB   $DONE2               ; then i = 0
  MOV  EDX, DWORD PTR [X]   ;get lower 32 bits of double
  SHR  ECX, 20              ;extract exponent
  SHRD EDX, EAX, 21         ;extract mantissa
  NEG  ECX                  ;compute shift factor for extracting
  ADD  ECX, 1054            ; non-fractional mantissa bits
  OR   EDX, 080000000h      ;set integer bit of mantissa
  SAR  EAX, 31              ;x < 0 ? 0xffffffff : 0
  SHR  EDX, CL              ;i = trunc(abs(x))
  XOR  EDX, EAX             ;i = x < 0 ? ~i : i
  SUB  EDX, EAX             ;i = x < 0 ? -i : i
$DONE2:
  MOV  [I], EDX             ;store result
For applications which can tolerate a floating-point-to-integer  
conversion that is not compliant with existing programming  
language standards (but is IEEE-754 compliant), perform the  
conversion using the rounding mode that is currently in effect  
(usually round-to-nearest-even).  
Example 4 (Fastest):
  FLD   QWORD PTR [X]   ;get double to be converted
  FISTP DWORD PTR [I]   ;store integer result
Some compilers offer an option to use the code from Example 4 for floating-point-to-integer conversion, using the default rounding mode.

Lastly, consider setting the rounding mode throughout an application to truncate and using the code from Example 4 to
language standards and IEEE-754. This mode is also provided  
as an option by some compilers. Note that use of this technique  
also changes the rounding mode for all other FPU operations  
inside the application, which can lead to significant changes in  
numerical results and even program failure (for example, due to  
lack of convergence in iterative algorithms).  
Floating-Point Subexpression Elimination  
There are cases which do not require an FXCH instruction after  
every instruction to allow access to two new stack entries. In the  
cases where two instructions share a source operand, an FXCH  
is not required between the two instructions. When there is an  
opportunity for subexpression elimination, reduce the number  
of superfluous FXCH instructions by putting the shared source  
operand at the top of the stack. For example, using the function:  
func( (x*y), (x+z) )  
Example 1 (Avoid):
      FLD   X              ;st: x
      FLD   Y              ;st: y x
      FLD   Z              ;st: z y x
      FADD  ST, ST(2)      ;st: x+z y x
      FXCH  ST(1)          ;st: y x+z x
      FMUL  ST, ST(2)      ;st: x*y x+z x
      CALL  FUNC
      FSTP  ST(0)          ;pop leftover x
Example 2 (Preferred):
      FLD   Z              ;st: z
      FLD   Y              ;st: y z
      FLD   X              ;st: x y z
      FMUL  ST(1), ST      ;st: x x*y z
      FADDP ST(2), ST      ;st: x*y x+z
      CALL  FUNC
Check Argument Range of Trigonometric Instructions Efficiently
The transcendental instructions FSIN, FCOS, FPTAN, and  
FSINCOS are architecturally restricted in their argument  
range. Only arguments with a magnitude of <= 2^63 can be  
evaluated. If the argument is out of range, the C2 bit in the FPU  
status word is set, and the argument is returned as the result.  
Software needs to guard against such (extremely infrequent)  
cases.  
If an "argument out of range" condition is detected, a range-reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an
argument > 2^63 is unusual, it often indicates a problem  
elsewhere in the code and the code may completely fail in the  
absence of a properly guarded trigonometric instruction. For  
example, in the case of FSIN or FCOS generated from a sin() or  
cos() function invocation in the HLL, the downstream code  
might reasonably expect that the returned result is in the range  
[-1,1].  
A naive solution for guarding a trigonometric instruction may  
check the C2 bit in the FPU status word after each FSIN, FCOS,  
FPTAN, and FSINCOS instruction, and take appropriate action  
if it is set (indicating an argument out of range).  
Example 1 (Avoid):
      FLD   QWORD PTR [x]  ;argument
      FSIN                 ;compute sine
      FSTSW AX             ;store FPU status word to AX
      TEST  AX, 0400h      ;is the C2 bit set?
      JZ    $in_range      ;nope, argument was in range, all OK
      CALL  $reduce_range  ;reduce argument in ST(0) to < 2^63
      FSIN                 ;compute sine (in-range argument
                           ; guaranteed)
$in_range:
Such a solution is inefficient because the FSTSW instruction is serializing with respect to all x87/3DNow!/MMX instructions and should thus be avoided (see "Floating-Point Compare Instructions" on page 98). Use of FSTSW in the above fashion slows down the common path through the code. Instead, it is advisable to check the argument before one of the trigonometric instructions is invoked.
Example 2 (Preferred):
      FLD    QWORD PTR [x]             ;argument
      FLD    DWORD PTR [two_to_the_63] ;2^63
      FCOMIP ST, ST(1)                 ;argument <= 2^63 ?
      JAE    $in_range                 ;yes, it is in range
      CALL   $reduce_range             ;reduce argument in ST(0) to < 2^63
$in_range:
      FSIN                             ;compute sine (in-range argument
                                       ; guaranteed)
Since out-of-range arguments are extremely uncommon, the  
conditional branch will be perfectly predicted, and the other  
instructions used to guard the trigonometric instruction can  
execute in parallel with it.
Take Advantage of the FSINCOS Instruction  
Frequently, a piece of code that needs to compute the sine of an  
argument also needs to compute the cosine of that same  
argument. In such cases, the FSINCOS instruction should be  
used to compute both trigonometric functions concurrently,  
which is faster than using separate FSIN and FCOS instructions  
to accomplish the same task.  
Example 1 (Avoid):
      FLD    QWORD PTR [x]
      FLD    DWORD PTR [two_to_the_63]
      FCOMIP ST, ST(1)
      JAE    $in_range
      CALL   $reduce_range
$in_range:
      FLD    ST(0)
      FCOS
      FSTP   QWORD PTR [cosine_x]
      FSIN
      FSTP   QWORD PTR [sine_x]
Example 2 (Preferred):
      FLD    QWORD PTR [x]
      FLD    DWORD PTR [two_to_the_63]
      FCOMIP ST, ST(1)
      JAE    $in_range
      CALL   $reduce_range
$in_range:
      FSINCOS
      FSTP   QWORD PTR [cosine_x]
      FSTP   QWORD PTR [sine_x]
10  3DNow!™ and MMX™ Optimizations
This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be found in the 3DNow!™ Instruction Porting Guide, order# 22621.
Use 3DNow!™ Instructions

Unless accuracy requirements dictate otherwise, perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide for a flat register file instead of the stack-based approach of x87 instructions.

See the 3DNow!™ Technology Manual, order# 21928, for information on instruction usage.
Use FEMMS Instruction  
Though there is no penalty for switching between x87 FPU and  
3DNow!/MMX instructions in the AMD Athlon processor, the  
FEMMS instruction should be used to ensure the same code  
also runs optimally on AMD-K6® family processors. The  
FEMMS instruction is supported for backward compatibility  
with AMD-K6 family processors, and is aliased to the EMMS  
instruction.  
3DNow! and MMX instructions are designed to be used  
concurrently with no switching issues. Likewise, enhanced  
3DNow! instructions can be used simultaneously with MMX  
instructions. However, x87 and 3DNow! instructions share the  
same architectural registers so there is no easy way to use them  
concurrently without cleaning up the register file in between  
using FEMMS/EMMS.  
Use 3DNow!™ Instructions for Fast Division
3DNow! instructions can be used to compute a very fast, highly  
accurate reciprocal or quotient.  
Optimized 14-Bit Precision Divide  
This divide operation executes with a total latency of seven  
cycles, assuming that the program hides the latency of the first  
MOVD/MOVQ instructions within preceding code.  
Example:
      MOVD  MM0, [MEM]   ;   0 | W
      PFRCP MM0, MM0     ; 1/W | 1/W (approximate)
      MOVQ  MM2, [MEM]   ;   Y | X
      PFMUL MM2, MM0     ; Y/W | X/W
Optimized Full 24-Bit Precision Divide  
This divide operation executes with a total latency of 15 cycles,  
assuming that the program hides the latency of the first  
MOVD/MOVQ instructions within preceding code.  
Example:
      MOVD      MM0, [W]    ;   0 | W
      PFRCP     MM1, MM0    ; 1/W | 1/W (approximate)
      PUNPCKLDQ MM0, MM0    ;   W | W   (MMX instr.)
      PFRCPIT1  MM0, MM1    ; 1/W | 1/W (refine)
      MOVQ      MM2, [X_Y]  ;   Y | X
      PFRCPIT2  MM0, MM1    ; 1/W | 1/W (final)
      PFMUL     MM2, MM0    ; Y/W | X/W
Pipelined Pair of 24-Bit Precision Divides  
This divide operation executes with a total latency of 21 cycles,  
assuming that the program hides the latency of the first  
MOVD/MOVQ instructions within preceding code.  
Example:
      MOVQ      MM0, [DIVISORS]   ;   y | x
      PFRCP     MM1, MM0          ; 1/x | 1/x (approximate)
      MOVQ      MM2, MM0          ;   y | x
      PUNPCKHDQ MM0, MM0          ;   y | y
      PFRCP     MM0, MM0          ; 1/y | 1/y (approximate)
      PUNPCKLDQ MM1, MM0          ; 1/y | 1/x (approximate)
      MOVQ      MM0, [DIVIDENDS]  ;   z | w
      PFRCPIT1  MM2, MM1          ; 1/y | 1/x (intermediate)
      PFRCPIT2  MM2, MM1          ; 1/y | 1/x (final)
      PFMUL     MM0, MM2          ; z/y | w/x
Newton-Raphson Reciprocal  
Consider the quotient q = a/b. An (on-chip) ROM-based table  
lookup can be used to quickly produce a 14-to-15-bit precision  
approximation of 1/b using just one PFRCP instruction. A full  
24-bit precision reciprocal can then be quickly computed from  
this approximation using the Newton-Raphson algorithm.
The general Newton-Raphson recurrence for the reciprocal is as  
follows:  
Z(i+1) = Z(i) * (2 - b * Z(i))
Given that the initial approximation is accurate to at least 14  
bits, and that a full IEEE single-precision mantissa contains 24  
bits, just one Newton-Raphson iteration is required. The  
following sequence shows the 3DNow! instructions that produce  
the initial reciprocal approximation, compute the full precision  
reciprocal from the approximation, and finally, complete the  
desired divide of a/b.  
X0 = PFRCP(b)  
X1 = PFRCPIT1(b,X0)  
X2 = PFRCPIT2(X1,X0)  
q = PFMUL(a,X2)  
The 24-bit final reciprocal value is X2. In the AMD Athlon  
processor 3DNow! technology implementation the operand X2  
contains the correct round-to-nearest single precision  
reciprocal for approximately 99% of all arguments.  
Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root

3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square root.

Optimized 15-Bit Precision Square Root

This square root operation can be executed in only seven cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four fewer cycles than the square root operation.
Example:
      MOVD      MM0, [MEM]  ;         0 | a
      PFRSQRT   MM1, MM0    ; 1/sqrt(a) | 1/sqrt(a) (approximate)
      PUNPCKLDQ MM0, MM0    ;         a | a         (MMX instr.)
      PFMUL     MM0, MM1    ;   sqrt(a) | sqrt(a)
Optimized 24-Bit Precision Square Root  
This square root operation can be executed in only 19 cycles, assuming a program hides the latency of the first MOVD instruction within previous code. The reciprocal square root operation requires four fewer cycles than the square root operation.
Example:
      MOVD      MM0, [MEM]  ;         0 | a
      PFRSQRT   MM1, MM0    ; 1/sqrt(a) | 1/sqrt(a) (approx.)
      MOVQ      MM2, MM1    ; X_0 = 1/sqrt(a) (approx.)
      PFMUL     MM1, MM1    ; X_0*X_0 | X_0*X_0 (step 1)
      PUNPCKLDQ MM0, MM0    ;         a | a (MMX instr)
      PFRSQIT1  MM1, MM0    ; (intermediate) (step 2)
      PFRCPIT2  MM1, MM2    ; 1/sqrt(a) | 1/sqrt(a) (step 3)
      PFMUL     MM0, MM1    ;   sqrt(a) | sqrt(a)
Newton-Raphson Reciprocal Square Root  
The general Newton-Raphson reciprocal square root recurrence  
is:  
Z(i+1) = 1/2 * Z(i) * (3 - b * Z(i)^2)
To reduce the number of iterations, the initial approximation is read from a table. The 3DNow! reciprocal square root approximation is accurate to at least 15 bits. Accordingly, to
obtain a single-precision 24-bit reciprocal square root of an  
input operand b, one Newton-Raphson iteration is required,  
using the following sequence of 3DNow! instructions:  
X0 = PFRSQRT(b)  
X1 = PFMUL(X0,X0)  
X2 = PFRSQIT1(b,X1)  
X3 = PFRCPIT2(X2,X0)  
X4 = PFMUL(b,X3)  
The 24-bit final reciprocal square root value is X3. In the  
AMD Athlon processor 3DNow! implementation, the estimate  
contains the correct round-to-nearest value for approximately  
87% of all arguments. The remaining arguments differ from the  
correct round-to-nearest value by one unit-in-the-last-place. The  
square root (X4) is formed in the last step by multiplying by the  
input operand b.  
Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel

The MMX PMADDWD instruction can be used to perform two signed 16x16->32-bit multiplies in parallel, with much higher performance than can be achieved using the IMUL instruction. The PMADDWD instruction is designed to perform four 16x16->32-bit signed multiplies and accumulate the results pairwise. By making one of the results in a pair a zero, there are now just two multiplies. The following example shows how to multiply 16-bit signed numbers a,b,c,d into signed 32-bit products a*c and b*d:
Example:
      PXOR      MM2, MM2   ;   0 | 0
      MOVD      MM0, [ab]  ; 0 0 | b a
      MOVD      MM1, [cd]  ; 0 0 | d c
      PUNPCKLWD MM0, MM2   ; 0 b | 0 a
      PUNPCKLWD MM1, MM2   ; 0 d | 0 c
      PMADDWD   MM0, MM1   ; b*d | a*c
3DNow!™ and MMX™ Intra-Operand Swapping

AMD Athlon™ Specific Code

If swapping of MMX register halves is necessary, use the PSWAPD instruction, which is a new AMD Athlon 3DNow! DSP extension. Use of this instruction should only be for AMD Athlon specific code. "PSWAPD MMreg1, MMreg2" performs the following operation:

mmreg1[63:32] = mmreg2[31:0]
mmreg1[31:0]  = mmreg2[63:32]

See the AMD Extensions to the 3DNow!™ and MMX™ Instruction Set Manual, order #22466, for more usage information.
Blended Code  
Otherwise, for blended code, which needs to run well on  
AMD-K6 and AMD Athlon family processors, the following code  
is recommended:  
Example 1 (Preferred, faster):
      ;MM1 = SWAP(MM0), MM0 destroyed
      MOVQ      MM1, MM0  ;make a copy
      PUNPCKLDQ MM0, MM0  ;duplicate lower half
      PUNPCKHDQ MM1, MM0  ;combine lower halves
Example 2 (Preferred, fast):
      ;MM1 = SWAP(MM0), MM0 preserved
      MOVQ      MM1, MM0  ;make a copy
      PUNPCKHDQ MM1, MM1  ;duplicate upper half
      PUNPCKLDQ MM1, MM0  ;combine upper halves
Both examples accomplish the swapping, but the first example should be used if the original contents of the register do not need to be preserved. The first example is faster because the MOVQ and PUNPCKLDQ instructions can execute in parallel. The instructions in the second example depend on one another and take longer to execute.
Fast Conversion of Signed Words to Floating-Point  
In many applications there is a need to quickly convert data  
consisting of packed 16-bit signed integers into floating-point  
numbers. The following two examples show how this can be  
accomplished efficiently on AMD processors.  
The first example shows how to do the conversion on a processor that supports AMD's 3DNow! extensions, such as the AMD Athlon processor. It demonstrates the increased efficiency from using the PI2FW instruction. Use of this instruction should only be for AMD Athlon processor specific code. See the AMD Extensions to the 3DNow!™ and MMX™ Instruction Set Manual, order #22466, for more information on this instruction.
The second example demonstrates how to accomplish the same  
task in blended code that achieves good performance on the  
AMD Athlon processor as well as on the AMD-K6 family  
processors that support 3DNow! technology.  
Example 1 (AMD Athlon specific code using 3DNow! DSP extension):
      MOVD      MM0, [packed_sword]  ;0 0 | b a
      PUNPCKLWD MM0, MM0             ;b b | a a
      PI2FW     MM0, MM0             ;xb=float(b) | xa=float(a)
      MOVQ      [packed_float], MM0  ;store xb | xa
Example 2 (AMD-K6 Family and AMD Athlon processor blended code):
      MOVD      MM1, [packed_sword]  ;0 0 | b a
      PXOR      MM0, MM0             ;0 0 | 0 0
      PUNPCKLWD MM0, MM1             ;b 0 | a 0
      PSRAD     MM0, 16              ;sign extend: b | a
      PI2FD     MM0, MM0             ;xb=float(b) | xa=float(a)
      MOVQ      [packed_float], MM0  ;store xb | xa
Use MMX™ PXOR to Negate 3DNow!™ Data

For both the AMD Athlon and AMD-K6 processors, it is recommended that code use the MMX PXOR instruction to change the sign bit of 3DNow! operations instead of the 3DNow! PFMUL instruction. On the AMD Athlon processor, using PXOR allows for more parallelism, as it can execute in either the FADD or FMUL pipes. PXOR has an execution latency of two, but because it is an MMX instruction, there is an initial one
cycle bypassing penalty, and another one-cycle penalty if the result goes to a 3DNow! operation. The PFMUL execution latency is four; therefore, in the worst case, the PXOR and PFMUL instructions are the same in terms of latency. On the AMD-K6 processor, there is only a one-cycle latency for PXOR, versus a two-cycle latency for the 3DNow! PFMUL instruction.
Use the following code to negate 3DNow! data:
msgn  DQ 8000000080000000h
      PXOR MM0, [msgn]   ;toggle sign bit
Use MMX™ PCMP Instead of 3DNow!™ PFCMP

Use the MMX PCMP instruction instead of the 3DNow! PFCMP instruction. On the AMD Athlon processor, the PCMP has a latency of two cycles while the PFCMP has a latency of four cycles. In addition to the shorter latency, PCMP can be issued to either the FADD or the FMUL pipe, while PFCMP is restricted to the FADD pipe.

Note: The PFCMP instruction has a "greater or equal" (GE) version, PFCMPGE, that is missing from PCMP.
Both Numbers Positive: If both arguments are positive, PCMP always works.

One Negative, One Positive: If one number is negative and the other is positive, PCMP still works, except when one number is a positive zero and the other is a negative zero.

Both Numbers Negative: Be careful when performing integer comparison using PCMPGT on two negative 3DNow! numbers. The result is the inverse of the PFCMPGT floating-point comparison. For example:

-2.0 = C0000000h
-4.0 = C0800000h

PCMPGT gives C0800000h > C0000000h, but -4 < -2. To address this issue, simply reverse the comparison by swapping the source operands.
Use MMX™ Instructions for Block Copies and Block Fills
For moving or filling small blocks of data (e.g., less than 512  
bytes) between cacheable memory areas, the REP MOVS and  
REP STOS families of instructions deliver good performance  
and are straightforward to use. For moving and filling larger  
blocks of data, or to move/fill blocks of data where the  
destination is in non-cacheable space, it is recommended to  
make use of MMX instructions and MMX extensions. The  
following examples all use quadword-aligned blocks of data. In  
cases where memory blocks are not quadword aligned,  
additional code is required to handle end cases as needed.  
AMD-K6® and AMD Athlon™ Processor Blended Code

The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a large quadword-aligned block of data in the following situations:

- Blended code, i.e., code that needs to perform well on both AMD Athlon and AMD-K6 family processors
- AMD Athlon processor specific code where the destination is in cacheable memory and immediate re-use of the data at the destination is expected
- AMD-K6 family specific code where the destination is in non-cacheable memory
Example 1:
/* block copy (source and destination QWORD aligned) */
__asm {
      mov   eax, [src_ptr]
      mov   edx, [dst_ptr]
      mov   ecx, [blk_size]
      shr   ecx, 6
      align 16
$xfer:
      movq  mm0, [eax]
      add   edx, 64
      movq  mm1, [eax+8]
      add   eax, 64
      movq  mm2, [eax-48]
      movq  [edx-64], mm0
      movq  mm0, [eax-40]
      movq  [edx-56], mm1
      movq  mm1, [eax-32]
      movq  [edx-48], mm2
      movq  mm2, [eax-24]
      movq  [edx-40], mm0
      movq  mm0, [eax-16]
      movq  [edx-32], mm1
      movq  mm1, [eax-8]
      movq  [edx-24], mm2
      movq  [edx-16], mm0
      dec   ecx
      movq  [edx-8], mm1
      jnz   $xfer
      femms
}
/* block fill (destination QWORD aligned) */
__asm {
      mov   edx, [dst_ptr]
      mov   ecx, [blk_size]
      shr   ecx, 6
      movq  mm0, [fill_data]
      align 16
$fill:
      movq  [edx], mm0
      movq  [edx+8], mm0
      movq  [edx+16], mm0
      movq  [edx+24], mm0
      movq  [edx+32], mm0
      movq  [edx+40], mm0
      add   edx, 64
      movq  [edx-16], mm0
      dec   ecx
      movq  [edx-8], mm0
      jnz   $fill
      femms
}
AMD Athlon™ Processor Specific Code

The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a quadword-aligned block of data in the following situations:

- AMD Athlon processor specific code where the destination of the block copy is in non-cacheable memory space
- AMD Athlon processor specific code where the destination of the block copy is in cacheable space, but no immediate re-use of the data at the destination is expected
Example 2:
/* block copy (source and destination QWORD aligned) */
__asm {
      mov    eax, [src_ptr]
      mov    edx, [dst_ptr]
      mov    ecx, [blk_size]
      shr    ecx, 6
      align 16
$xfer_nc:
      prefetchnta [eax+256]
      movq   mm0, [eax]
      add    edx, 64
      movq   mm1, [eax+8]
      add    eax, 64
      movq   mm2, [eax-48]
      movntq [edx-64], mm0
      movq   mm0, [eax-40]
      movntq [edx-56], mm1
      movq   mm1, [eax-32]
      movntq [edx-48], mm2
      movq   mm2, [eax-24]
      movntq [edx-40], mm0
      movq   mm0, [eax-16]
      movntq [edx-32], mm1
      movq   mm1, [eax-8]
      movntq [edx-24], mm2
      movntq [edx-16], mm0
      dec    ecx
      movntq [edx-8], mm1
      jnz    $xfer_nc
      femms
      sfence
}
/* block fill (destination QWORD aligned) */
__asm {
      mov    edx, [dst_ptr]
      mov    ecx, [blk_size]
      shr    ecx, 6
      movq   mm0, [fill_data]
      align 16
$fill_nc:
      movntq [edx], mm0
      movntq [edx+8], mm0
      movntq [edx+16], mm0
      movntq [edx+24], mm0
      movntq [edx+32], mm0
      movntq [edx+40], mm0
      movntq [edx+48], mm0
      movntq [edx+56], mm0
      add    edx, 64
      dec    ecx
      jnz    $fill_nc
      femms
      sfence
}
Use MMX™ PXOR to Clear All Bits in an MMX™ Register
To clear all the bits in an MMX register to zero, use:  
PXOR MMreg, MMreg  
Note that PXOR MMreg, MMreg is dependent on previous writes to MMreg. Therefore, using PXOR in the manner described can lengthen dependency chains, which in turn may lead to reduced performance. An alternative in such cases
is to use:  
zero DD 0  
MOVD MMreg, DWORD PTR [zero]  
i.e., to load a zero from a statically initialized and properly  
aligned memory location. However, loading the data from  
memory runs the risk of cache misses. Cases where MOVD is  
superior to PXOR are therefore rare and PXOR should be used  
in general.  
Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
To set all the bits in an MMX register to one, use:  
PCMPEQD MMreg, MMreg  
Note that PCMPEQD MMreg, MMreg is dependent on previous writes to MMreg. Therefore, using PCMPEQD in the manner described can lengthen dependency chains, which in turn may lead to reduced performance. An alternative in such cases
is to use:  
ones DQ 0FFFFFFFFFFFFFFFFh  
MOVQ MMreg, QWORD PTR [ones]  
i.e., to load a quadword of 0xFFFFFFFFFFFFFFFF from a  
statically initialized and properly aligned memory location.  
However, loading the data from memory runs the risk of cache  
misses. Cases where MOVQ is superior to PCMPEQD are  
therefore rare and PCMPEQD should be used in general.  
Use MMX™ PAND to Find Absolute Value in 3DNow!™ Code
Use the following to compute the absolute value of 3DNow!  
floating-point operands:  
mabs  DQ 7FFFFFFF7FFFFFFFh
      PAND MM0, [mabs]   ;mask out sign bit
Optimized Matrix Multiplication  
The multiplication of a 4x4 matrix with a 4x1 vector is  
commonly used in 3D graphics for geometry transformation.  
This routine serves to translate, scale, rotate, and apply  
perspective to 3D coordinates represented in homogeneous  
coordinates. The following code sample is a 3DNow! optimized,  
general 3D vertex transformation routine that completes in 16  
cycles on the AMD Athlon processor:  
/* Function XForm performs a fully generalized 3D transform on an array  
of vertices pointed to by "v" and stores the transformed vertices in  
the location pointed to by "res". Each vertex consists of four floats.  
The 4x4 transform matrix is pointed to by "m". The matrix elements are  
also floats. The argument "numverts" indicates how many vertices have  
to be transformed. The computation performed for each vertex is:  
res->x = v->x*m[0][0] + v->y*m[1][0] + v->z*m[2][0] + v->w*m[3][0]  
res->y = v->x*m[0][1] + v->y*m[1][1] + v->z*m[2][1] + v->w*m[3][1]  
res->z = v->x*m[0][2] + v->y*m[1][2] + v->z*m[2][2] + v->w*m[3][2]  
res->w = v->x*m[0][3] + v->y*m[1][3] + v->z*m[2][3] + v->w*m[3][3]  
*/  
#define M00 0  
#define M01 4  
#define M02 8  
#define M03 12  
#define M10 16  
#define M11 20  
#define M12 24  
#define M13 28  
#define M20 32  
#define M21 36  
#define M22 40  
#define M23 44  
#define M30 48  
#define M31 52  
#define M32 56  
#define M33 60  
void XForm (float *res, const float *v, const float *m, int numverts)
{
  _asm {
      MOV   EDX, [V]          ;EDX = source vector ptr
      MOV   EAX, [M]          ;EAX = matrix ptr
      MOV   EBX, [RES]        ;EBX = destination vector ptr
      MOV   ECX, [NUMVERTS]   ;ECX = number of vertices to transform

      ;3DNow! version of fully general 3D vertex transformation.
      ;Optimal for AMD Athlon (completes in 16 cycles)
      FEMMS                   ;clear MMX state
      ALIGN 16                ;for optimal branch alignment
$$xform:
      ADD   EBX, 16           ;res++
      MOVQ  MM0, QWORD PTR [EDX]       ;v->y | v->x
      MOVQ  MM1, QWORD PTR [EDX+8]     ;v->w | v->z
      ADD   EDX, 16           ;v++
      MOVQ  MM2, MM0          ;v->y | v->x
      MOVQ  MM3, QWORD PTR [EAX+M00]   ;m[0][1] | m[0][0]
      PUNPCKLDQ MM0, MM0      ;v->x | v->x
      MOVQ  MM4, QWORD PTR [EAX+M10]   ;m[1][1] | m[1][0]
      PFMUL MM3, MM0          ;v->x*m[0][1] | v->x*m[0][0]
      PUNPCKHDQ MM2, MM2      ;v->y | v->y
      PFMUL MM4, MM2          ;v->y*m[1][1] | v->y*m[1][0]
      MOVQ  MM5, QWORD PTR [EAX+M02]   ;m[0][3] | m[0][2]
      MOVQ  MM7, QWORD PTR [EAX+M12]   ;m[1][3] | m[1][2]
      MOVQ  MM6, MM1          ;v->w | v->z
      PFMUL MM5, MM0          ;v->x*m[0][3] | v->x*m[0][2]
      MOVQ  MM0, QWORD PTR [EAX+M20]   ;m[2][1] | m[2][0]
      PUNPCKLDQ MM1, MM1      ;v->z | v->z
      PFMUL MM7, MM2          ;v->y*m[1][3] | v->y*m[1][2]
      MOVQ  MM2, QWORD PTR [EAX+M22]   ;m[2][3] | m[2][2]
      PFMUL MM0, MM1          ;v->z*m[2][1] | v->z*m[2][0]
      PFADD MM3, MM4          ;v->x*m[0][1]+v->y*m[1][1] |
                              ; v->x*m[0][0]+v->y*m[1][0]
      MOVQ  MM4, QWORD PTR [EAX+M30]   ;m[3][1] | m[3][0]
      PFMUL MM2, MM1          ;v->z*m[2][3] | v->z*m[2][2]
      PFADD MM5, MM7          ;v->x*m[0][3]+v->y*m[1][3] |
                              ; v->x*m[0][2]+v->y*m[1][2]
      MOVQ  MM1, QWORD PTR [EAX+M32]   ;m[3][3] | m[3][2]
      PUNPCKHDQ MM6, MM6      ;v->w | v->w
      PFADD MM3, MM0          ;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1] |
                              ; v->x*m[0][0]+v->y*m[1][0]+v->z*m[2][0]
      PFMUL MM4, MM6          ;v->w*m[3][1] | v->w*m[3][0]
      PFMUL MM1, MM6          ;v->w*m[3][3] | v->w*m[3][2]
      PFADD MM5, MM2          ;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3] |
                              ; v->x*m[0][2]+v->y*m[1][2]+v->z*m[2][2]
      PFADD MM3, MM4          ;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1]+
                              ; v->w*m[3][1] | v->x*m[0][0]+v->y*m[1][0]+
                              ; v->z*m[2][0]+v->w*m[3][0]
      MOVQ  [EBX-16], MM3     ;store res->y | res->x
      PFADD MM5, MM1          ;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3]+
                              ; v->w*m[3][3] | v->x*m[0][2]+v->y*m[1][2]+
                              ; v->z*m[2][2]+v->w*m[3][2]
      MOVQ  [EBX-8], MM5      ;store res->w | res->z
      DEC   ECX               ;numverts--
      JNZ   $$xform           ;until numverts == 0
      FEMMS                   ;clear MMX state
  }
}
Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
Clipping is one of the major activities occurring in a 3D  
graphics pipeline. In many instances, this activity is split into  
two parts which do not necessarily have to occur consecutively:  
Computation of the clip code for each vertex, where each  
bit of the clip code indicates whether the vertex is outside  
the frustum with regard to a specific clip plane.  
Examination of the clip code for a vertex and clipping if the  
clip code is non-zero.  
The following example shows how to use 3DNow! instructions to  
efficiently implement a clip code computation for a frustum  
that is defined by:  
-w <= x <= w  
-w <= y <= w  
-w <= z <= w  
.DATA
RIGHT   EQU 01h
LEFT    EQU 02h
ABOVE   EQU 04h
BELOW   EQU 08h
BEHIND  EQU 10h
BEFORE  EQU 20h

ALIGN 8
ABOVE_RIGHT    DD RIGHT
               DD ABOVE
BELOW_LEFT     DD LEFT
               DD BELOW
BEHIND_BEFORE  DD BEFORE
               DD BEHIND
.CODE
;; Generalized computation of 3D clip code (out code)
;;
;; Register usage:  IN:       MM5  y | x
;;                            MM6  w | z
;;                  OUT:      MM2  clip code (out code)
;;                  DESTROYS: MM0, MM1, MM2, MM3, MM4

      PXOR      MM0, MM0   ;  0 | 0
      MOVQ      MM1, MM6   ;  w | z
      MOVQ      MM4, MM5   ;  y | x
      PUNPCKHDQ MM1, MM1   ;  w | w
      MOVQ      MM3, MM6   ;  w | z
      MOVQ      MM2, MM5   ;  y | x
      PFSUBR    MM3, MM0   ; -w | -z
      PFSUBR    MM2, MM0   ; -y | -x
      PUNPCKLDQ MM3, MM6   ;  z | -z
      PFCMPGT   MM4, MM1   ; y>w?FFFFFFFF:0 | x>w?FFFFFFFF:0
      MOVQ      MM0, QWORD PTR [ABOVE_RIGHT]   ; ABOVE | RIGHT
      PFCMPGT   MM3, MM1   ; z>w?FFFFFFFF:0 | -z>w?FFFFFFFF:0
      PFCMPGT   MM2, MM1   ; -y>w?FFFFFFFF:0 | -x>w?FFFFFFFF:0
      MOVQ      MM1, QWORD PTR [BEHIND_BEFORE] ; BEHIND | BEFORE
      PAND      MM4, MM0   ; y > w ? ABOVE:0 | x > w ? RIGHT:0
      MOVQ      MM0, QWORD PTR [BELOW_LEFT]    ; BELOW | LEFT
      PAND      MM3, MM1   ; z > w ? BEHIND:0 | -z > w ? BEFORE:0
      PAND      MM2, MM0   ; -y > w ? BELOW:0 | -x > w ? LEFT:0
      POR       MM2, MM4   ; BELOW,ABOVE | LEFT,RIGHT
      POR       MM2, MM3   ; BELOW,ABOVE,BEHIND | LEFT,RIGHT,BEFORE
      MOVQ      MM1, MM2   ; BELOW,ABOVE,BEHIND | LEFT,RIGHT,BEFORE
      PUNPCKHDQ MM2, MM2   ; BELOW,ABOVE,BEHIND | BELOW,ABOVE,BEHIND
      POR       MM2, MM1   ; zclip, yclip, xclip = clip code
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
Use the 3DNow! PAVGUSB instruction for MPEG-2 motion
compensation. The PAVGUSB instruction produces the rounded
averages of the eight unsigned 8-bit integer values in the source
operand (an MMX register or a 64-bit memory location) and the
eight corresponding unsigned 8-bit integer values in the
destination operand (an MMX register). The PAVGUSB
instruction is extremely useful in DVD (MPEG-2) decoding,
where motion compensation performs a lot of byte averaging
between and within macroblocks. The PAVGUSB instruction
helps speed up these operations. In addition, PAVGUSB can
free up some registers and make unrolling the averaging loops
possible.
The following code fragment uses original MMX code to  
perform averaging between the source macroblock and  
destination macroblock:  
Example 1 (Avoid):  
    MOV   ESI, DWORD PTR Src_MB
    MOV   EDI, DWORD PTR Dst_MB
    MOV   EDX, DWORD PTR SrcStride
    MOV   EBX, DWORD PTR DstStride
    MOVQ  MM7, QWORD PTR [ConstFEFE]
    MOVQ  MM6, QWORD PTR [Const0101]
    MOV   ECX, 16
L1:
    MOVQ  MM0, [ESI]      ;MM0=QWORD1
    MOVQ  MM1, [EDI]      ;MM1=QWORD3
    MOVQ  MM2, MM0
    MOVQ  MM3, MM1
    PAND  MM2, MM6        ;calculate adjustment
    PAND  MM3, MM6
    PAND  MM0, MM7        ;MM0 = QWORD1 & 0xfefefefe
    PAND  MM1, MM7        ;MM1 = QWORD3 & 0xfefefefe
    POR   MM2, MM3
    PSRLQ MM0, 1          ;MM0 = (QWORD1 & 0xfefefefe)/2
    PSRLQ MM1, 1          ;MM1 = (QWORD3 & 0xfefefefe)/2
    PAND  MM2, MM6
    PADDB MM0, MM1        ;MM0 = QWORD1/2 + QWORD3/2 w/o adjustment
    PADDB MM0, MM2        ;add lsb adjustment
    MOVQ  [EDI], MM0
    MOVQ  MM4, [ESI+8]    ;MM4=QWORD2
    MOVQ  MM5, [EDI+8]    ;MM5=QWORD4
    MOVQ  MM2, MM4
    MOVQ  MM3, MM5
    PAND  MM2, MM6        ;calculate adjustment
    PAND  MM3, MM6
    PAND  MM4, MM7        ;MM4 = QWORD2 & 0xfefefefe
    PAND  MM5, MM7        ;MM5 = QWORD4 & 0xfefefefe
    POR   MM2, MM3
    PSRLQ MM4, 1          ;MM4 = (QWORD2 & 0xfefefefe)/2
    PSRLQ MM5, 1          ;MM5 = (QWORD4 & 0xfefefefe)/2
    PAND  MM2, MM6
    PADDB MM4, MM5        ;MM4 = QWORD2/2 + QWORD4/2 w/o adjustment
    PADDB MM4, MM2        ;add lsb adjustment
    MOVQ  [EDI+8], MM4
    ADD   ESI, EDX
    ADD   EDI, EBX
    LOOP  L1
The following code fragment uses the 3DNow! PAVGUSB  
instruction to perform averaging between the source  
macroblock and destination macroblock:  
Example 2 (Preferred):  
    MOV   EAX, DWORD PTR Src_MB
    MOV   EDI, DWORD PTR Dst_MB
    MOV   EDX, DWORD PTR SrcStride
    MOV   EBX, DWORD PTR DstStride
    MOV   ECX, 16
L1:
    MOVQ    MM0, [EAX]      ;MM0=QWORD1
    MOVQ    MM1, [EAX+8]    ;MM1=QWORD2
    PAVGUSB MM0, [EDI]      ;(QWORD1 + QWORD3)/2 with adjustment
    PAVGUSB MM1, [EDI+8]    ;(QWORD2 + QWORD4)/2 with adjustment
    MOVQ    [EDI], MM0
    MOVQ    [EDI+8], MM1
    ADD     EAX, EDX
    ADD     EDI, EBX
    LOOP    L1
Stream of Packed Unsigned Bytes  
The following code is an example of how to process a stream of  
packed unsigned bytes (like RGBA information) with faster  
3DNow! instructions.  
Example:  
outside loop:
    PXOR      MM0, MM0
inside loop:
    MOVD      MM1, [VAR]  ;             0 | v[3],v[2],v[1],v[0]
    PUNPCKLBW MM1, MM0    ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    MOVQ      MM2, MM1    ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    PUNPCKLWD MM1, MM0    ;   0,0,0,v[1] | 0,0,0,v[0]
    PUNPCKHWD MM2, MM0    ;   0,0,0,v[3] | 0,0,0,v[2]
    PI2FD     MM1, MM1    ;  float(v[1]) | float(v[0])
    PI2FD     MM2, MM2    ;  float(v[3]) | float(v[2])
Complex Number Arithmetic  
Complex numbers have a real part and an imaginary part.
Multiplying complex numbers (e.g., 3 + 4i) is an integral part of
many algorithms such as the Discrete Fourier Transform (DFT) and
complex FIR filters. Complex number multiplication is shown
below:

(src0.real + src0.imag*i) * (src1.real + src1.imag*i) = result
result = result.real + result.imag*i

result.real = src0.real*src1.real - src0.imag*src1.imag
result.imag = src0.real*src1.imag + src0.imag*src1.real
Example:
(1+2i) * (3+4i) = result.real + result.imag*i
result.real = 1*3 - 2*4 = -5
result.imag = 1*4 + 2*3 = 10
result = -5 + 10i
Assuming that complex numbers are represented as two-element
vectors [v.real, v.imag], one can see the need for swapping the
elements of src1 to perform the multiplies for result.imag, and
the need for a mixed positive/negative accumulation to complete
the parallel computation of result.real and result.imag.
PSWAPD performs the swapping of elements for src1 and  
PFPNACC performs the mixed positive/negative accumulation  
to complete the computation. The code example below  
summarizes the computation of a complex number multiply.  
Example:  
;MM0 = s0.imag | s0.real    ;reg_hi | reg_lo
;MM1 = s1.imag | s1.real
    PSWAPD  MM2, MM0    ;MM2 = s0.real | s0.imag
    PFMUL   MM0, MM1    ;MM0 = s0.imag*s1.imag | s0.real*s1.real
    PFMUL   MM1, MM2    ;MM1 = s0.real*s1.imag | s0.imag*s1.real
    PFPNACC MM0, MM1    ;MM0 = res.imag | res.real
PSWAPD supports independent source and result operands, which
enables PSWAPD to also perform a copy function. In the above
example, this eliminates the need for a separate MOVQ MM2, MM0
instruction.
11  General x86 Optimization Guidelines
This chapter describes general code optimization techniques  
specific to superscalar processors (that is, techniques common  
to the AMD-K6® processor, AMD Athlon™ processor, and
Pentium® family processors). In general, all optimization  
techniques used for the AMD-K6 processor, Pentium, and  
Pentium Pro processors either improve the performance of the  
AMD Athlon processor or are not required and have a neutral  
effect (usually due to fewer coding restrictions with the  
AMD Athlon processor).  
Short Forms  
Use shorter forms of instructions to increase the effective  
number of instructions that can be examined for decoding at  
any one time. Use 8-bit displacements and jump offsets where  
possible.  
Example 1 (Avoid):
    CMP   REG, 0

Example 2 (Preferred):
    TEST  REG, REG
Although both of these instructions have an execute latency of  
one, fewer opcode bytes need to be examined by the decoders  
for the TEST instruction.  
Dependencies  
Spread out true dependencies to increase the opportunities for  
parallel execution. Anti-dependencies and output  
dependencies do not impact performance.  
Register Operands  
Maintain frequently used values in registers rather than in  
memory. This technique avoids the comparatively long latencies  
for accessing memory.  
Stack Allocation  
When allocating space for local variables and/or outgoing  
parameters within a procedure, adjust the stack pointer and  
use moves rather than pushes. This method of allocation allows  
random access to the outgoing parameters so that they can be  
set up when they are calculated instead of being held  
somewhere else until the procedure call. In addition, this  
method reduces ESP dependencies and uses fewer execution  
resources.  
Appendix A  AMD Athlon™ Processor Microarchitecture
Introduction  
When discussing processor design, it is important to understand
the following terms: architecture, microarchitecture, and design
implementation. The term architecture refers to the instruction
set and features of a processor that are visible to software  
programs running on the processor. The architecture  
determines what software the processor can run. The  
architecture of the AMD Athlon processor is the  
industry-standard x86 instruction set.  
The term microarchitecture refers to the design techniques used  
in the processor to reach the target cost, performance, and  
functionality goals. The AMD Athlon processor  
microarchitecture is a decoupled decode/execution design  
approach. In other words, the decoders essentially operate  
independent of the execution units, and the execution core uses  
a small number of instructions and simplified circuit design for  
fast single-cycle execution and fast operating frequencies.  
The term design implementation refers to the actual logic and  
circuit designs from which the processor is created according to  
the microarchitecture specifications.  
AMD Athlon™ Processor Microarchitecture
The innovative AMD Athlon processor microarchitecture  
approach implements the x86 instruction set by processing  
simpler operations (OPs) instead of complex x86 instructions.  
These OPs are specially designed to include direct support for  
the x86 instructions while observing the high-performance  
principles of fixed-length encoding, regularized instruction  
fields, and a large register set. Instead of executing complex  
x86 instructions, which have lengths from 1 to 15 bytes, the  
AMD Athlon processor executes the simpler fixed-length OPs,  
while maintaining the instruction coding efficiencies found in  
x86 programs. The enhanced microarchitecture used in the  
AMD Athlon processor enables higher processor core  
performance and promotes straightforward extendibility for  
future designs.  
Superscalar Processor  
The AMD Athlon processor is an aggressive, out-of-order,  
three-way superscalar x86 processor. It can fetch, decode, and  
issue up to three x86 instructions per cycle with a centralized  
instruction control unit (ICU) and two independent instruction  
schedulers: an integer scheduler and a floating-point
scheduler. These two schedulers can simultaneously issue up to  
nine OPs to the three general-purpose integer execution units  
(IEUs), three address-generation units (AGUs), and three  
floating-point/3DNow!™/MMX™ execution units. The
AMD Athlon processor moves integer instructions down the integer
execution pipeline, which consists of the integer scheduler and  
the IEUs, as shown in Figure 1 on page 131. Floating-point  
instructions are handled by the floating-point execution  
pipeline, which consists of the floating-point scheduler and the  
x87/3DNow!/MMX execution units.  
Figure 1. AMD Athlon™ Processor Block Diagram
Instruction Cache  
The out-of-order execute engine of the AMD Athlon processor  
contains a very large 64-Kbyte L1 instruction cache. The L1  
instruction cache is organized as a 64-Kbyte, two-way,  
set-associative array. Each line in the instruction array is 64  
bytes long. Functions associated with the L1 instruction cache  
are instruction loads, instruction prefetching, instruction  
predecoding, and branch prediction. Requests that miss in the  
L1 instruction cache are fetched from the backside L2 cache or,  
subsequently, from the local memory using the bus interface  
unit (BIU).  
The instruction cache generates fetches on the naturally  
aligned 64 bytes containing the instructions and the next  
sequential line of 64 bytes (a prefetch). The principle of
program spatial locality makes this prefetching very effective
and avoids or reduces execution stalls caused by the time
spent reading the necessary instruction bytes. Cache-line
replacement is based on a least-recently used (LRU)  
replacement algorithm.  
The L1 instruction cache has an associated two-level translation  
look-aside buffer (TLB) structure. The first-level TLB is fully  
associative and contains 24 entries (16 that map 4-Kbyte pages  
and eight that map 2-Mbyte or 4-Mbyte pages). The second-level  
TLB is four-way set associative and contains 256 entries, which  
can map 4-Kbyte pages.  
Predecode  
Predecoding begins as the L1 instruction cache is filled.  
Predecode information is generated and stored alongside the  
instruction cache. This information is used to help efficiently  
identify the boundaries between variable length x86  
instructions, to distinguish DirectPath from VectorPath  
early-decode instructions, and to locate the opcode byte in each  
instruction. In addition, the predecode logic detects code  
branches such as CALLs, RETURNs and short unconditional  
JMPs. When a branch is detected, predecoding begins at the  
target of the branch.  
Branch Prediction  
The fetch logic accesses the branch prediction table in parallel  
with the instruction cache and uses the information stored in  
the branch prediction table to predict the direction of branch  
instructions.  
The AMD Athlon processor employs combinations of a branch  
target address buffer (BTB), a global history bimodal counter  
(GHBC) table, and return address stack (RAS) hardware in
order to predict and accelerate branches. Predicted-taken  
branches incur only a single-cycle delay to redirect the  
instruction fetcher to the target instruction. In the event of a  
mispredict, the minimum penalty is ten cycles.  
The BTB is a 2048-entry table that caches in each entry the  
predicted target address of a branch.  
In addition, the AMD Athlon processor implements a 12-entry  
return address stack to predict return addresses from a near or  
far call. As CALLs are fetched, the next EIP is pushed onto the  
return stack. Subsequent RETs pop a predicted return address  
off the top of the stack.  
Early Decoding  
The DirectPath and VectorPath decoders perform  
early-decoding of instructions into MacroOPs. A MacroOP is a  
fixed-length instruction that contains one or more OPs. The
outputs of the early decoders keep all (DirectPath or  
VectorPath) instructions in program order. Early decoding  
produces three MacroOPs per cycle from either path. The  
outputs of both decoders are multiplexed together and passed  
to the next stage in the pipeline, the instruction control unit.  
When the target 16-byte instruction window is obtained from  
the instruction cache, the predecode data is examined to  
determine which type of basic decode should occur —  
DirectPath or VectorPath.  
DirectPath Decoder  
DirectPath instructions can be decoded directly into a  
MacroOP, and subsequently into one or two OPs in the final  
issue stage. A DirectPath instruction is limited to those x86  
instructions that can be further decoded into one or two OPs.  
The length of an x86 instruction alone does not determine
whether it is DirectPath. A maximum of three DirectPath x86
instructions can occupy a given aligned 8-byte block. Sixteen
bytes are fetched at a time, so up to six DirectPath x86
instructions can be
passed into the DirectPath decode pipeline.  
VectorPath Decoder  
Uncommon x86 instructions requiring two or more MacroOPs  
proceed down the VectorPath pipeline. The sequence of  
MacroOPs is produced by an on-chip ROM known as the MROM.  
The VectorPath decoder can produce up to three MacroOPs per  
cycle. Decoding a VectorPath instruction may prevent the  
simultaneous decode of a DirectPath instruction.  
Instruction Control Unit  
The instruction control unit (ICU) is the control center for the  
AMD Athlon processor. The ICU controls the following  
resources: the centralized in-flight reorder buffer, the integer
scheduler, and the floating-point scheduler. In turn, the ICU is
responsible for the following functions: MacroOP dispatch,
MacroOP retirement, register and flag dependency resolution  
and renaming, execution resource management, interrupts,  
exceptions, and branch mispredictions.  
The ICU takes the three MacroOPs per cycle from the early  
decoders and places them in a centralized, fixed-issue reorder  
buffer. This buffer is organized into 24 lines of three MacroOPs  
each. The reorder buffer allows the ICU to track and monitor up  
to 72 in-flight MacroOPs (whether integer or floating-point) for  
maximum instruction throughput. The ICU can simultaneously  
dispatch multiple MacroOPs from the reorder buffer to both the  
integer and floating-point schedulers for final decode, issue,  
and execution as OPs. In addition, the ICU handles exceptions  
and manages the retirement of MacroOPs.  
Data Cache  
The L1 data cache contains two 64-bit ports. It is a  
write-allocate and writeback cache that uses an LRU  
replacement policy. The data cache and instruction cache are  
both two-way set-associative and 64 Kbytes in size. The data cache
is divided into eight banks, each 8 bytes wide. In addition, this
cache supports the MOESI (Modified, Owner, Exclusive,  
Shared, and Invalid) cache coherency protocol and data parity.  
The L1 data cache has an associated two-level TLB structure.  
The first-level TLB is fully associative and contains 32 entries  
(24 that map 4-Kbyte pages and eight that map 2-Mbyte or  
4-Mbyte pages). The second-level TLB is four-way set  
associative and contains 256 entries, which can map 4-Kbyte  
pages.  
Integer Scheduler  
The integer scheduler is based on a three-wide queuing system  
(also known as a reservation station) that feeds three integer  
execution positions or pipes. The reservation stations are six  
entries deep, for a total queuing system of 18 integer
MacroOPs. Each reservation station divides the MacroOPs into
integer and address-generation OPs, as required.
Integer Execution Unit  
The integer execution pipeline consists of three identical  
pipes: 0, 1, and 2. Each integer pipe consists of an integer
execution unit (IEU) and an address generation unit (AGU).  
The integer execution pipeline is organized to match the three  
MacroOP dispatch pipes in the ICU as shown in Figure 2 on  
page 135. MacroOPs are broken down into OPs in the  
schedulers. OPs issue when their operands are available either  
from the register file or result buses.  
OPs are executed when their operands are available. OPs from  
a single MacroOP can execute out-of-order. In addition, a  
particular integer pipe can be executing two OPs from different  
MacroOPs (one in the IEU and one in the AGU) at the same  
time.  
Figure 2. Integer Execution Pipeline
Each of the three IEUs is general purpose in that each
performs logic functions, arithmetic functions, conditional  
functions, divide step functions, status flag multiplexing, and  
branch resolutions. The AGUs calculate the logical addresses  
for loads, stores, and LEAs. A load and store unit reads and  
writes data to and from the L1 data cache. The integer  
scheduler sends a completion status to the ICU when the  
outstanding OPs for a given MacroOP are executed.  
All integer operations can be handled within any of the three  
IEUs with the exception of multiplies. Multiplies are handled  
by a pipelined multiplier that is attached to the pipeline at pipe  
0. See Figure 2 on page 135. Multiplies always issue to integer  
pipe 0, and the issue logic creates result bus bubbles for the
multiplier in integer pipes 0 and 1 by preventing non-multiply  
OPs from issuing at the appropriate time.  
Floating-Point Scheduler  
The AMD Athlon processor floating-point logic is a  
high-performance, fully-pipelined, superscalar, out-of-order  
execution unit. It is capable of accepting three MacroOPs of any  
mixture of x87 floating-point, 3DNow!, or MMX operations per
cycle.  
The floating-point scheduler handles register renaming and has  
a dedicated 36-entry scheduler buffer organized as 12 lines of  
three MacroOPs each. It also performs OP issue and
out-of-order execution. The floating-point scheduler  
communicates with the ICU to retire a MacroOP, to manage  
comparison results from the FCOMI instruction, and to back  
out results from a branch misprediction.  
Floating-Point Execution Unit  
The floating-point execution unit (FPU) is implemented as a  
coprocessor that has its own out-of-order control in addition to  
the data path. The FPU handles all register operations for x87  
instructions, all 3DNow! operations, and all MMX operations.  
The FPU consists of a stack renaming unit, a register renaming  
unit, a scheduler, a register file, and three parallel execution  
units. Figure 3 shows a block diagram of the dataflow through  
the FPU.  
Figure 3. Floating-Point Unit Block Diagram
As shown in Figure 3 on page 137, the floating-point logic uses  
three separate execution positions or pipes for superscalar x87,  
3DNow! and MMX operations. The first of the three pipes is  
generally known as the adder pipe (FADD), and it contains  
3DNow! add, MMX ALU/shifter, and floating-point add  
execution units. The second pipe is known as the multiplier  
(FMUL). It contains a 3DNow!/MMX multiplier/reciprocal unit,  
an MMX ALU and a floating-point multiplier/divider/square  
root unit. The third pipe is known as the floating-point  
load/store (FSTORE), which handles floating-point constant  
loads (FLDZ, FLDPI, etc.), stores, FILDs, as well as many OP  
primitives used in VectorPath sequences.  
Load-Store Unit (LSU)  
The load-store unit (LSU) manages data load and store accesses  
to the L1 data cache and, if required, to the backside L2 cache  
or system memory. The 44-entry LSU provides a data interface  
for both the integer scheduler and the floating-point scheduler.  
It consists of two queues: a 12-entry queue for L1 cache load
and store accesses and a 32-entry queue for L2 cache or system
memory load and store accesses. The 12-entry queue can
request a maximum of two L1 cache loads and two L1 cache
32-bit stores per cycle. The 32-entry queue effectively holds
requests that missed in the L1 cache probe by the 12-entry  
queue. Finally, the LSU ensures that the architectural load and  
store ordering rules are preserved (a requirement for x86  
architecture compatibility).  
Figure 4. Load/Store Unit
L2 Cache Controller  
The AMD Athlon processor contains a very flexible onboard L2  
controller. It uses an independent backside bus to access up to  
8-Mbytes of industry-standard SRAMs. There are full on-chip  
tags for a 512-Kbyte cache, while larger sizes use a partial tag  
system. In addition, there is a two-level data TLB structure. The  
first-level TLB is fully associative and contains 32 entries (24  
that map 4-Kbyte pages and eight that map 2-Mbyte or 4-Mbyte  
pages). The second-level TLB is four-way set associative and  
contains 256 entries, which can map 4-Kbyte pages.  
Write Combining
See page 155 for detailed information about write combining.
AMD Athlon™ System Bus
The AMD Athlon system bus is a high-speed bus that consists of  
a pair of unidirectional 13-bit address and control channels and  
a bidirectional 64-bit data bus. The AMD Athlon system bus  
supports low-voltage swing, multiprocessing, clock forwarding,  
and fast data transfers. The clock forwarding technique is used  
to deliver data on both edges of the reference clock, therefore  
doubling the transfer speed. A four-entry 64-byte write buffer is  
integrated into the BIU. The write buffer improves bus  
utilization by combining multiple writes into a single large  
write cycle. By using the AMD Athlon system bus, the  
AMD Athlon processor can transfer data on the 64-bit data bus
at 200 MHz, which yields an effective throughput of 1.6 Gbytes
per second.
Appendix B  Pipeline and Execution Unit Resources Overview
The AMD Athlon™ processor contains two independent
execution pipelines: one for integer operations and one for
floating-point operations. The integer pipeline manages x86
integer operations and the floating-point pipeline manages all
x87, 3DNow!™, and MMX™ instructions. This appendix
describes the operation and functionality of these pipelines.
Fetch and Decode Pipeline Stages  
Figure 5 and Figure 6 show the AMD Athlon processor instruction
fetch and decoding pipeline stages. The pipeline consists of one
cycle for instruction fetches and four cycles of instruction
alignment and decoding. The three ports in stage 5 provide a
maximum bandwidth of three MacroOPs per cycle for dispatching
to the instruction control unit (ICU).
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
The most common x86 instructions flow through the DirectPath  
pipeline stages and are decoded by hardware. The less common  
instructions, which require microcode assistance, flow through  
the VectorPath. Although the DirectPath decodes the common  
x86 instructions, it also contains VectorPath instruction data,  
which allows it to maintain dispatch order at the end of cycle 5.  
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages
Cycle 1: FETCH
The FETCH pipeline stage calculates the address of the next  
x86 instruction window to fetch from the processor caches or  
system memory.  
Cycle 2: SCAN
SCAN determines the start and end pointers of instructions.  
SCAN can send up to six aligned instructions (DirectPath and  
VectorPath) to ALIGN1 and only one VectorPath instruction to  
the microcode engine (MENG) per cycle.  
Cycle 3 (DirectPath): ALIGN1
Because each 8-byte buffer (quadword queue) can contain up to  
three instructions, ALIGN1 can buffer up to a maximum of nine  
instructions, or 24 instruction bytes. ALIGN1 tries to send three  
instructions from an 8-byte buffer to ALIGN2 per cycle.  
Cycle 3 (VectorPath): MECTL
For VectorPath instructions, the microcode engine control  
(MECTL) stage of the pipeline generates the microcode entry  
points.  
Cycle 4 (DirectPath): ALIGN2
ALIGN2 prioritizes prefix bytes, determines the opcode,  
ModR/M, and SIB bytes for each instruction and sends the  
accumulated prefix information to EDEC.  
Cycle 4 (VectorPath): MEROM
In the microcode engine ROM (MEROM) pipeline stage, the  
entry-point generated in the previous cycle, MECTL, is used to  
index into the MROM to obtain the microcode lines necessary  
to decode the instruction sent by SCAN.  
Cycle 5 (DirectPath): EDEC
The early decode (EDEC) stage decodes information from the  
DirectPath stage (ALIGN2) and VectorPath stage (MEROM)  
into MacroOPs. In addition, EDEC determines register  
pointers, flag updates, immediate values, displacements, and  
other information. EDEC then selects either MacroOPs from  
the DirectPath or MacroOPs from the VectorPath to send to the  
instruction decoder (IDEC) stage.  
Cycle 5 (VectorPath): MEDEC/MESEQ
The microcode engine decode (MEDEC) stage converts x86  
instructions into MacroOPs. The microcode engine sequencer  
(MESEQ) performs the sequence controls (redirects and  
exceptions) for the MENG.  
Cycle 6: IDEC/Rename
At the instruction decoder (IDEC)/rename stage, integer and  
floating-point MacroOPs diverge in the pipeline. Integer  
MacroOPs are scheduled for execution in the next cycle.  
Floating-point MacroOPs have their floating-point stack  
operands mapped to registers. Both integer and floating-point  
MacroOPs are placed into the ICU.  
Integer Pipeline Stages  
The integer execution pipeline consists of four or more stages  
for scheduling and execution and, if necessary, accessing data  
in the processor caches or system memory. There are three  
integer pipes associated with the three IEUs.  
Figure 7. Integer Execution Pipeline (stages 7–8: the Instruction Control Unit and register files feed MacroOPs to the 18-entry integer scheduler, which issues to the three IEU/AGU pairs, IEU0/AGU0, IEU1/AGU1, and IEU2/AGU2, and to the integer multiplier, IMUL)
Figure 7 and Figure 8 show the integer execution resources and  
the pipeline stages, which are described in the following  
sections.  
Figure 8. Integer Pipeline Stages (7 = SCHED, 8 = EXEC, 9 = ADDGEN, 10 = DC ACC, 11 = RESP)
Cycle 7 – SCHED
In the scheduler (SCHED) pipeline stage, the scheduler buffers  
can contain MacroOPs that are waiting for integer operands  
from the ICU or the IEU result bus. When all operands are  
received, SCHED schedules the MacroOP for execution and  
issues the OPs to the next stage, EXEC.  
Cycle 8 – EXEC
In the execution (EXEC) pipeline stage, the OP and its  
associated operands are processed by an integer pipe (either  
the IEU or the AGU). If addresses must be calculated to access  
data necessary to complete the operation, the OP proceeds to  
the next stages, ADDGEN and DCACC.  
Cycle 9 – ADDGEN
Cycle 10 – DCACC
In the address generation (ADDGEN) pipeline stage, the load  
or store OP calculates a linear address, which is sent to the data  
cache TLBs and caches.  
In the data cache access (DCACC) pipeline stage, the address  
generated in the previous pipeline stage is used to access the  
data cache arrays and TLBs. Any OP waiting in the scheduler  
for this data snarfs this data and proceeds to the EXEC stage  
(assuming all other operands were available).  
Cycle 11 – RESP
In the response (RESP) pipeline stage, the data cache returns  
hit/miss status and data for the request from DCACC.  
Floating-Point Pipeline Stages  
The floating-point unit (FPU) is implemented as a coprocessor  
that has its own out-of-order control in addition to the data  
path. The FPU handles all register operations for x87  
instructions, all 3DNow! operations, and all MMX operations.  
The FPU consists of a stack renaming unit, a register renaming  
unit, a scheduler, a register file, and three parallel execution  
units. Figure 9 shows a block diagram of the dataflow through  
the FPU.  
Figure 9. Floating-Point Unit Block Diagram (the Instruction Control Unit feeds the Stack Map at stage 7, Register Rename at stage 8, the 36-entry scheduler at stages 9–10, and the 88-entry FPU register file at stage 11; stages 12–15 hold the FADD pipe with MMX ALU and 3DNow!™ units, the FMUL pipe with MMX ALU, MMX multiplier, and 3DNow! units, and the FSTORE pipe)
The floating-point pipeline stages 7–15 are shown in Figure 10  
and described in the following sections. Note that the  
floating-point pipe and integer pipe separate at cycle 7.  
Figure 10. Floating-Point Pipeline Stages (7 = STKREN, 8 = REGREN, 9 = SCHEDW, 10 = SCHED, 11 = FREG, 12–15 = FEXE1–FEXE4)
Cycle 7 – STKREN
The stack rename (STKREN) pipeline stage in cycle 7 receives  
up to three MacroOPs from IDEC and maps stack-relative  
register tags to virtual register tags.  
Cycle 8 – REGREN
The register renaming (REGREN) pipeline stage in cycle 8 is  
responsible for register renaming. In this stage, virtual register  
tags are mapped into physical register tags. Likewise, each  
destination is assigned a new physical register. The MacroOPs  
are then sent to the 36-entry FPU scheduler.  
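The mapping performed across STKREN and REGREN can be illustrated with a small C model: sources read the newest virtual-to-physical mapping, and each destination is assigned a fresh register from the 88-entry file. This is only a sketch inferred from the text above; the function names and the bump-allocator free list are invented, not the hardware mechanism.

```c
#include <assert.h>

#define NUM_TAGS 8     /* virtual register tags (x87 stack positions) */
#define NUM_PHYS 88    /* the AMD Athlon FPU register file has 88 entries */

static int reg_map[NUM_TAGS];  /* current virtual -> physical mapping */
static int next_phys;          /* simplistic free list: a bump allocator */

void rename_init(void) {
    for (int i = 0; i < NUM_TAGS; i++)
        reg_map[i] = i;        /* identity mapping at reset */
    next_phys = NUM_TAGS;
}

/* Rename one MacroOP with one source and one destination tag.
 * The source reads the newest mapping; the destination is assigned
 * a new physical register, removing write-after-write hazards. */
void rename_op(int src_tag, int dst_tag, int *phys_src, int *phys_dst) {
    *phys_src = reg_map[src_tag];
    assert(next_phys < NUM_PHYS);  /* a real core would stall or recycle here */
    reg_map[dst_tag] = next_phys++;
    *phys_dst = reg_map[dst_tag];
}
```

A second write to the same tag gets a different physical register, while a later reader of that tag sees the most recent assignment.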
Cycle 9 – SCHEDW
Cycle 10 – SCHED
The scheduler write (SCHEDW) pipeline stage in cycle 9 can  
receive up to three MacroOPs per cycle.  
The schedule (SCHED) pipeline stage in cycle 10 schedules up  
to three MacroOPs per cycle from the 36-entry FPU scheduler  
to the FREG pipeline stage to read register operands.  
MacroOPs are sent when their operands and/or tags are  
obtained.  
Cycle 11 – FREG
The register file read (FREG) pipeline stage reads the  
floating-point register file for any register source operands of  
MacroOPs. The register file read is done before the MacroOPs  
are sent to the floating-point execution pipelines.  
Cycles 12–15 – Floating-Point Execution (FEXE1–FEXE4)
The FPU has three logical pipes: FADD, FMUL, and FSTORE.  
Each pipe may have several associated execution units. MMX  
execution is in both the FADD and FMUL pipes, with the  
exception of MMX instructions involving multiplies, which are  
limited to the FMUL pipe. The FMUL pipe has special support  
for long latency operations.  
DirectPath/VectorPath operations are dispatched to the FPU  
during cycle 6, but are not acted upon until they receive  
validation from the ICU in cycle 7.  
Execution Unit Resources  
Terminology  
The execution units operate with two types of register values—  
operands and results. There are three operand types and two  
result types, which are described in this section.  
Operands  
The three types of operands are as follows:  
Address register operands – Used for address calculations of load and store instructions  
Data register operands – Used for register instructions  
Store data register operands – Used for memory stores  
Results  
The two types of results are as follows:  
Data register results – Produced by load or register instructions  
Address register results – Produced by LEA or PUSH instructions  
Examples  
The following examples illustrate the operand and result  
definitions:  
ADD EAX, EBX  
The ADD instruction has two data register operands (EAX  
and EBX) and one data register result (EAX).  
MOV EBX, [ESP+4*ECX+8] ;Load  
The Load instruction has two address register operands  
(ESP and ECX as base and index registers, respectively)  
and a data register result (EBX).  
MOV [ESP+4*ECX+8], EAX ;Store  
The Store instruction has a data register operand (EAX)  
and two address register operands (ESP and ECX as base  
and index registers, respectively).  
LEA ESI, [ESP+4*ECX+8]  
The LEA instruction has address register operands (ESP  
and ECX as base and index registers, respectively), and an  
address register result (ESI).  
Integer Pipeline Operations  
Table 2 shows the categories of operations handled by the  
integer pipeline. Table 3 shows examples of the decode types.  
Table 2. Integer Pipeline Operation Types

    Category                                    Execution Unit
    Integer Memory Load or Store Operations     L/S
    Address Generation Operations               AGU
    Integer Execution Unit Operations           IEU
    Integer Multiply Operations                 IMUL

Table 3. Integer Decode Types

    x86 Instruction      Decode Type   OPs
    MOV CX, [SP+4]       DirectPath    AGU, L/S
    ADD AX, BX           DirectPath    IEU
    CMP CX, [AX]         VectorPath    AGU, L/S, IEU
    JZ  Addr             DirectPath    IEU
As shown in Table 3, the MOV instruction early decodes in the  
DirectPath decoder and requires two OPs: an address  
generation operation for the indirect address and a data load  
from memory into a register. The ADD instruction early  
decodes in the DirectPath decoder and requires a single OP  
that can be executed in one of the three IEUs. The CMP  
instruction early decodes in the VectorPath and requires three  
OPs: an address generation operation for the indirect address,  
a data load from memory, and a compare to CX using an IEU.  
The final JZ instruction is a simple operation that early decodes  
in the DirectPath decoder and requires a single OP. Not shown  
is a load-op-store instruction, which translates into only one  
MacroOP (one AGU OP, one IEU OP, and one L/S OP).  
Floating-Point Pipeline Operations  
Table 4 shows the category or type of operations handled by the  
floating-point execution units. Table 5 shows examples of the  
decode types.  
Table 4. Floating-Point Pipeline Operation Types

    Category                                            Execution Unit
    FPU/3DNow!/MMX Load/Store or Miscellaneous Ops      FSTORE
    FPU/3DNow!/MMX Multiply Operations                  FMUL
    FPU/3DNow!/MMX Arithmetic Operations                FADD

Table 5. Floating-Point Decode Types

    x86 Instruction    Decode Type   OPs
    FADD ST, ST(i)     DirectPath    FADD
    FSIN               VectorPath    various
    PFACC              DirectPath    FADD
    PFRSQRT            DirectPath    FMUL
As shown in Table 5, the FADD register-to-register instruction  
generates a single MacroOP targeted for the floating-point  
scheduler. FSIN is considered a VectorPath instruction because  
it is a complex instruction with long execution times, as  
compared to the more common floating-point instructions. The  
MMX PFACC instruction is DirectPath decodeable and  
generates a single MacroOP targeted for the arithmetic  
operation execution pipeline in the floating-point logic. Just  
like PFACC, a single MacroOP is early decoded for the 3DNow!  
PFRSQRT instruction, but it is targeted for the multiply  
operation execution pipeline.  
Load/Store Pipeline Operations  
The AMD Athlon processor decodes any instruction that  
references memory into primitive load/store operations. For  
example, consider the following code sample:  
MOV  AX, [EBX]     ;1 load MacroOP
PUSH EAX           ;1 store MacroOP
POP  EAX           ;1 load MacroOP
ADD  [EAX], EBX    ;1 load/store and 1 IEU MacroOPs
FSTP [EAX]         ;1 store MacroOP
MOVQ [EAX], MM0    ;1 store MacroOP
As shown in Table 6, the load/store unit (LSU) consists of a  
three-stage data cache lookup.  
Table 6. Load/Store Unit Stages

    Stage 1 (Cycle 8):   Address Calculation / LS1 Scan
    Stage 2 (Cycle 9):   Transport Address to Data Cache
    Stage 3 (Cycle 10):  Data Cache Access / LS2 Data Forward
Loads and stores are first dispatched in order into a 12-entry  
deep reservation queue called LS1. LS1 holds loads and stores  
that are waiting to enter the cache subsystem. Loads and stores  
are allocated into LS1 entries at dispatch time in program  
order, and are required by LS1 to probe the data cache in  
program order. The AGUs can calculate addresses out of  
program order, therefore, LS1 acts as an address reorder buffer.  
When a load or store is scanned out of the LS1 queue (Stage 1),  
it is deallocated from the LS1 queue and inserted into the data  
cache probe pipeline (Stage 2 and Stage 3). Up to two memory  
operations can be scheduled (scanned out of LS1) to access the  
data cache per cycle. The LSU can handle the following:  
Two 64-bit loads per cycle, or  
One 64-bit load and one 64-bit store per cycle, or  
Two 32-bit stores per cycle  
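The dual-issue constraint above can be captured in a few lines of C. This is a hypothetical checker for the three allowed pairings, not AMD code, and it assumes stores narrower than 32 bits follow the same rule as 32-bit stores.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MEM_LOAD, MEM_STORE } mem_kind;

typedef struct {
    mem_kind kind;
    int size_bytes;    /* access size: 1, 2, 4, or 8 */
} mem_op;

/* Can two memory operations be scanned out of LS1 in the same cycle?
 * Allowed: two 64-bit loads, one 64-bit load plus one 64-bit store,
 * or two 32-bit stores. */
bool can_dual_issue(mem_op a, mem_op b) {
    if (a.kind == MEM_STORE && b.kind == MEM_STORE)
        return a.size_bytes <= 4 && b.size_bytes <= 4;
    return true;   /* load+load and load+store pairs are always allowed */
}
```

For example, two MOVQ stores (8 bytes each) cannot probe the cache in the same cycle, while a MOVQ load paired with a MOVQ store can.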
Code Sample Analysis  
The samples in this section show the execution behavior of several series of instructions as  
a function of decode constraints, dependencies, and execution  
resource constraints.  
The sample tables show the x86 instructions, the decode pipe in  
the integer execution pipeline, the decode type, the clock  
counts, and a description of the events occurring within the  
processor. The decode pipe gives the specific IEU used (see  
Figure 7 on page 144). The decode type specifies either  
VectorPath (VP) or DirectPath (DP).  
The following nomenclature is used to describe the current  
location of a particular operation:  
D – Dispatch stage (allocate in ICU, reservation stations, load/store (LS1) queue)  
I – Issue stage (schedule operation for AGU or FU execution)  
E – Integer Execution Unit (IEU number corresponds to decode pipe)  
& – Address Generation Unit (AGU number corresponds to decode pipe)  
M – Multiplier execution  
S – Load/Store pipe stage 1 (schedule operation for load/store pipe)  
A – Load/Store pipe stage 2 (first stage of data cache/LS2 buffer access)  
$ – Load/Store pipe stage 3 (second stage of data cache/LS2 buffer access)  
Note: Instructions execute more efficiently (that is, without  
delays) when scheduled apart by suitable distances based on  
dependencies. In general, the samples in this section show  
poorly scheduled code in order to illustrate the resultant  
effects.  
Table 7. Sample 1 – Integer Register Operations

    Num   Instruction         Decode Pipe   Decode Type
    1     IMUL EAX, ECX       0             VP
    2     INC  ESI            0             DP
    3     MOV  EDI, 0x07F4    1             DP
    4     ADD  EDI, EBX       2             DP
    5     SHL  EAX, 8         0             DP
    6     OR   EAX, 0x0F      1             DP
    7     INC  EBX            2             DP
    8     ADD  ESI, EDX       0             DP

(Clock-by-clock stage occupancy over cycles 1–8 uses the D/I/E/M nomenclature defined above; the timing of each instruction is given in the comments below.)
Comments for Each Instruction Number  
1. The IMUL is a VectorPath instruction. It cannot be decoded or paired with other operations and, therefore,  
dispatches alone in pipe 0. The multiply latency is four cycles.  
2. The simple INC operation is paired with instructions 3 and 4. The INC executes in IEU0 in cycle 4.  
3. The MOV executes in IEU1 in cycle 4.  
4. The ADD operation depends on instruction 3. It executes in IEU2 in cycle 5.  
5. The SHL operation depends on the multiply result (instruction 1). The MacroOP waits in a reservation  
station and is eventually scheduled to execute in cycle 7 after the multiply result is available.  
6. This operation executes in cycle 8 in IEU1.  
7. This simple operation has a resource contention for execution in IEU2 in cycle 5. Therefore, the operation  
does not execute until cycle 6.  
8. The ADD operation executes immediately in IEU0 after dispatching.  
Table 8. Sample 2 – Integer Register and Memory Load Operations

    Num   Instruction               Decode Pipe   Decode Type
    1     DEC EDX                   0             DP
    2     MOV EDI, [ECX]            1             DP
    3     SUB EAX, [EDX+20]         2             DP
    4     SAR EAX, 5                0             DP
    5     ADD ECX, [EDI+4]          1             DP
    6     AND EBX, 0x1F             2             DP
    7     MOV ESI, [0x0F100]        0             DP
    8     OR  ECX, [ESI+EAX*4+8]    1             DP

(Clock-by-clock stage occupancy over cycles 1–12 uses the D/I/E/&/S/A/$ nomenclature defined above; the timing of each instruction is given in the comments below.)
Comments for Each Instruction Number  
1. The ALU operation executes in IEU0.  
2. The load operation generates the address in AGU1 and is simultaneously scheduled for the load/store pipe in cycle 3. In  
cycles 4 and 5, the load completes the data cache access.  
3. The load-execute instruction accesses the data cache in tandem with instruction 2. After the load portion completes, the  
subtraction is executed in cycle 6 in IEU2.  
4. The shift operation executes in IEU0 (cycle 7) after instruction 3 completes.  
5. This operation is stalled on its address calculation waiting for instruction 2 to update EDI. The address is calculated in  
cycle 6. In cycle 7/8, the cache access completes.  
6. This simple operation executes quickly in IEU2.  
7. The address for the load is calculated in cycle 5 in AGU0. However, the load is not scheduled to access the data cache  
until cycle 6. The load is blocked for scheduling to access the data cache for one cycle by instruction 5. In cycles 7 and 8,  
instruction 7 accesses the data cache concurrently with instruction 5.  
8. The load-execute instruction accesses the data cache in cycles 10/11 and executes the OR operation in IEU1 in cycle 12.  
Appendix C  
Implementation of  
Write Combining  
Introduction  
This appendix describes the memory write-combining feature  
as implemented in the AMD Athlon™ processor family. The  
AMD Athlon processor supports the memory type and range  
register (MTRR) and the page attribute table (PAT) extensions,  
which allow software to define ranges of memory as either  
writeback (WB), write-protected (WP), writethrough (WT),  
uncacheable (UC), or write-combining (WC).  
Defining the memory type for a range of memory as WC or WT  
allows the processor to conditionally combine data from  
multiple write cycles that are addressed within this range into a  
merge buffer. Merging multiple write cycles into a single write  
cycle reduces processor bus utilization and processor stalls,  
thereby increasing the overall system performance.  
To understand the information presented in this appendix, the  
reader should possess a knowledge of K86™ processors, the x86  
architecture, and programming requirements.  
Write-Combining Definitions and Abbreviations  
This appendix uses the following definitions and abbreviations:  
UC – Uncacheable memory type  
WC – Write-combining memory type  
WT – Writethrough memory type  
WP – Write-protected memory type  
WB – Writeback memory type  
One Byte – 8 bits  
One Word – 16 bits  
Longword – 32 bits (same as an x86 doubleword)  
Quadword – 64 bits, or 2 longwords  
Octaword – 128 bits, or 2 quadwords  
Cache Block – 64 bytes, or 4 octawords, or 8 quadwords  
What is Write Combining?  
Write combining is the merging of multiple memory write  
cycles that target locations within the address range of a write  
buffer. The AMD Athlon processor combines multiple  
memory-write cycles to a 64-byte buffer whenever the memory  
address is within a WC or WT memory type region. The  
processor continues to combine writes to this buffer without  
writing the data to the system, as long as certain rules apply  
(see Table 9 on page 158 for more information).  
Programming Details  
The steps required for programming write combining on the  
AMD Athlon processor are as follows:  
1. Verify the presence of an AMD Athlon processor by using  
the CPUID instruction to check for the instruction family  
code and vendor identification of the processor. Standard  
function 0 on AMD processors returns a vendor  
identification string of "AuthenticAMD" in registers EBX,  
EDX, and ECX. Standard function 1 returns the processor  
signature in register EAX, where EAX[11:8] contains the  
instruction family code. For the AMD Athlon processor, the  
instruction family code is six.  
2. In addition, the presence of the MTRRs is indicated by bit  
12 and the presence of the PAT extension is indicated by bit  
16 of the extended features bits returned in the EDX  
register by CPUID function 8000_0001h. See the AMD  
Processor Recognition Application Note, order# 20734, for  
more details on the CPUID instruction.  
3. Write combining is controlled by the MTRRs and PAT.  
Write combining should be enabled for the appropriate  
memory ranges. The AMD Athlon processor MTRRs and  
PAT are compatible with the Pentium® II.  
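The detection steps above can be sketched in C. The helpers below only decode values already returned by CPUID (the register reads themselves are compiler- and platform-specific); the function names are invented, and the byte layout assumes a little-endian x86 host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assemble the 12-character vendor string from the EBX, EDX, and ECX
 * values returned by CPUID standard function 0 (little-endian x86). */
void cpuid_vendor(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13]) {
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
}

/* Instruction family code: bits 11:8 of the EAX signature returned by
 * standard function 1.  The AMD Athlon processor reports family 6. */
uint32_t cpuid_family(uint32_t eax) { return (eax >> 8) & 0xF; }

/* MTRR (bit 12) and PAT (bit 16) presence from the extended feature
 * bits returned in EDX by CPUID function 8000_0001h. */
int cpuid_has_mtrr(uint32_t edx) { return (edx >> 12) & 1; }
int cpuid_has_pat(uint32_t edx)  { return (edx >> 16) & 1; }
```

On an AMD processor, function 0 returns EBX=68747541h, EDX=69746E65h, ECX=444D4163h, which assemble to "AuthenticAMD".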
Write-Combining Operations  
In order to improve system performance, the AMD Athlon  
processor aggressively combines multiple memory-write cycles  
of any data size that address locations within a 64-byte write  
buffer that is aligned to a cache-line boundary. The data sizes  
can be bytes, words, longwords, or quadwords.  
WC memory type writes can be combined in any order up to a  
full 64-byte sized write buffer.  
WT memory type writes can only be combined up to a fully  
aligned quadword in the 64-byte buffer, and must be combined  
contiguously in ascending order. Combining may be opened at  
any byte boundary in a quadword, but is closed by a write that is  
either not contiguous and ascending, or that fills byte 7.  
All other memory types for stores that go through the write  
buffer (UC and WP) cannot be combined.  
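A small state machine makes the WT rule concrete. This sketch is inferred from the two paragraphs above (the type and function names are invented): it tracks combining within one aligned quadword, where writes must continue contiguously upward and filling byte 7 closes the buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* WT combining state within one aligned quadword: combining may open
 * at any byte offset, must continue contiguously in ascending order,
 * and closes when byte 7 is filled. */
typedef struct {
    int next;      /* next expected byte offset (0..8) in the quadword */
    bool open;
} wt_qword;

void wt_open(wt_qword *q, int first_byte) {
    q->next = first_byte;
    q->open = true;
}

/* Apply a write of `size` bytes at quadword offset `off`.
 * Returns true if combining stays open, false if it closed. */
bool wt_write(wt_qword *q, int off, int size) {
    if (!q->open || off != q->next) {  /* not contiguous and ascending */
        q->open = false;
        return false;
    }
    q->next = off + size;
    if (q->next >= 8) {                /* byte 7 filled: buffer closes */
        q->open = false;
        return false;
    }
    return true;
}
```

A sequence of byte/word/longword stores walking up a quadword keeps combining open until the final store touches byte 7.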
Combining is able to continue until interrupted by one of the  
conditions listed in Table 9 on page 158. When combining is  
interrupted, one or more bus commands are issued to the  
system for that write buffer, as described by Table 10 on  
page 159.  
Table 9. Write Combining Completion Events

Non-WB write outside of current buffer – The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect write combining. Only one line-sized buffer can be open for write combining at a time. Once a buffer is closed for write combining, it cannot be reopened for write combining.

I/O Read or Write – Any IN/INS or OUT/OUTS instruction closes combining. The implied memory type for all IN/OUT instructions is UC, which cannot be combined.

Serializing instructions – Any serializing instruction closes combining. These instructions include: MOVCRx, MOVDRx, WRMSR, INVD, INVLPG, WBINVD, LGDT, LLDT, LIDT, LTR, CPUID, IRET, RSM, INIT, HALT.

Flushing instructions – Any flush instruction causes the WC to complete.

Locks – Any instruction or processor operation that requires a cache or bus lock closes write combining before starting the lock. Writes within a lock can be combined.

Uncacheable Read – A UC read closes write combining. A WC read closes combining only if a cache block address match occurs between the WC read and a write in the write buffer.

Different memory type – Any WT write while write-combining for WC memory, or any WC write while write-combining for WT memory, closes write combining.

Buffer full – Write combining is closed if all 64 bytes of the write buffer are valid.

WT time-out – If 16 processor clocks have passed since the most recent write for WT write combining, write combining is closed. There is no time-out for WC write combining.

WT write fills byte 7 – Write combining is closed if a write fills the most significant byte of a quadword, which includes writes that are misaligned across a quadword boundary. In the misaligned case, combining is closed by the LS part of the misaligned write and combining is opened by the MS part of the misaligned store.

WT Nonsequential – If a subsequent WT write is not in ascending sequential order, the write combining completes. WC writes have no addressing constraints within the 64-byte line being combined.

TLB AD bit set – Write combining is closed whenever a TLB reload sets the accessed (A) or dirty (D) bits of a PDE or PTE.
Sending Write-Buffer Data to the System  
Once write combining is closed for a 64-byte write buffer, the  
contents of the write buffer are eligible to be sent to the system  
as one or more AMD Athlon system bus commands. Table 10  
lists the rules for determining what system commands are  
issued for a write buffer, as a function of the alignment of the  
valid buffer data.  
Table 10. AMD Athlon™ System Bus Command Generation Rules  
1. If all eight quadwords are either full (8 bytes valid) or empty (0 bytes valid), a  
Write-Quadword system command is issued, with an 8-byte mask representing  
which of the eight quadwords are valid. If this case is true, do not proceed to the  
next rule.  
2. If all longwords are either full (4 bytes valid) or empty (0 bytes valid), a  
Write-Longword system command is issued for each 32-byte buffer half that  
contains at least one valid longword. The mask for each Write-Longword system  
command indicates which longwords are valid in that 32-byte write buffer half. If  
this case is true, do not proceed to the next rule.  
3. Sequence through all eight quadwords of the write buffer, from quadword 0 to  
quadword 7. Skip over a quadword if no bytes are valid. Issue a Write-Quadword  
system command if all bytes are valid, asserting one mask bit. Issue a Write-Longword  
system command if the quadword contains one aligned longword, asserting one  
mask bit. Otherwise, issue a Write-Byte system command if there is at least one  
valid byte, asserting a mask bit for each valid byte.  
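The three rules can be expressed as a function of the buffer's valid-byte mask. This is an illustrative model, not the actual bus logic; in particular, the exact mask encodings chosen for the Rule 3 commands are an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { WRITE_QUADWORD, WRITE_LONGWORD, WRITE_BYTE } bus_cmd;
typedef struct { bus_cmd cmd; uint64_t mask; } cmd_rec;

/* Emit bus commands for a closed 64-byte write buffer whose valid
 * bytes are given by `valid` (bit n set = byte n valid).  Returns the
 * number of commands written to out[] (at most 8). */
int gen_commands(uint64_t valid, cmd_rec *out) {
    int n = 0;

    /* Rule 1: every quadword is fully valid or fully empty. */
    int all_q = 1;
    uint64_t qmask = 0;
    for (int q = 0; q < 8; q++) {
        uint64_t b = (valid >> (q * 8)) & 0xFF;
        if (b == 0xFF) qmask |= 1u << q;
        else if (b != 0) all_q = 0;
    }
    if (all_q) {
        out[n++] = (cmd_rec){ WRITE_QUADWORD, qmask };
        return n;
    }

    /* Rule 2: every longword is fully valid or fully empty. */
    int all_l = 1;
    for (int l = 0; l < 16; l++) {
        uint64_t b = (valid >> (l * 4)) & 0xF;
        if (b != 0 && b != 0xF) all_l = 0;
    }
    if (all_l) {
        for (int half = 0; half < 2; half++) {
            uint64_t lmask = 0;
            for (int l = 0; l < 8; l++)
                if (((valid >> (half * 32 + l * 4)) & 0xF) == 0xF)
                    lmask |= 1u << l;
            if (lmask)
                out[n++] = (cmd_rec){ WRITE_LONGWORD, lmask };
        }
        return n;
    }

    /* Rule 3: walk quadwords 0..7, skipping empty ones. */
    for (int q = 0; q < 8; q++) {
        uint64_t b = (valid >> (q * 8)) & 0xFF;
        if (b == 0) continue;
        if (b == 0xFF)
            out[n++] = (cmd_rec){ WRITE_QUADWORD, 1u << q };
        else if (b == 0x0F || b == 0xF0)   /* one aligned longword valid */
            out[n++] = (cmd_rec){ WRITE_LONGWORD,
                                  1u << (q * 2 + (b == 0xF0)) };
        else
            out[n++] = (cmd_rec){ WRITE_BYTE, b << (q * 8) };
    }
    return n;
}
```

A fully valid buffer collapses to a single Write-Quadword command with all eight mask bits set, while a lone valid byte produces a single Write-Byte command.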
Appendix D  
Performance-Monitoring  
Counters  
This appendix describes how to use the AMD Athlon™ processor  
performance-monitoring counters.  
Overview  
The AMD Athlon processor provides four 48-bit performance  
counters, which allow four types of events to be monitored  
simultaneously. These counters can either count events or  
measure duration. When counting events, a counter is  
incremented each time a specified event takes place or a  
specified number of events takes place. When measuring  
duration, a counter counts the number of processor clocks that  
occur while a specified condition is true. The counters can  
count events or measure durations that occur at any privilege  
level. Table 11 on page 164 lists the events that can be counted  
with the performance monitoring counters.  
Performance Counter Usage  
The performance-monitoring counters are supported by eight  
MSRs: PerfEvtSel[3:0] are the performance event select  
MSRs, and PerfCtr[3:0] are the performance counter MSRs.  
These registers can be read from and written to using the  
RDMSR and WRMSR instructions, respectively.  
The PerfEvtSel[3:0] registers are located at MSR locations  
C001_0000h to C001_0003h. The PerfCtr[3:0] registers are  
located at MSR locations C001_0004h to C001_0007h and are  
64-bit registers.  
The PerfEvtSel[3:0] registers can be accessed using the  
RDMSR/WRMSR instructions only when operating at privilege  
level 0. The PerfCtr[3:0] MSRs can be read from any privilege  
level using the RDPMC (read performance-monitoring  
counters) instruction, if the PCE flag in CR4 is set.  
PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h–C001_0003h)  
The PerfEvtSel[3:0] MSRs, shown in Figure 11, control the  
operation of the performance-monitoring counters, with one  
register used to set up each counter. These MSRs specify the  
events to be counted, how they should be counted, and the  
privilege levels at which counting should take place. The  
functions of the flags and fields within these MSRs are  
described in the following sections.  
PerfEvtSel[3:0] bit layout:

    Bits 31–24  Counter Mask
    Bit 23      INV (Invert Mask)
    Bit 22      EN (Enable Counter)
    Bit 21      Reserved
    Bit 20      INT (APIC Interrupt Enable)
    Bit 19      PC (Pin Control)
    Bit 18      E (Edge Detect)
    Bit 17      OS (Operating System Mode)
    Bit 16      USR (User Mode)
    Bits 15–8   Unit Mask
    Bits 7–0    Event Mask

Figure 11. PerfEvtSel[3:0] Registers
Event Select Field (Bits 0–7)
These bits are used to select the event to be monitored. See  
Table 11 on page 164 for a list of event masks and their 8-bit  
codes.  
Unit Mask Field (Bits 8–15)
These bits are used to further qualify the event selected in the  
event select field. For example, for some cache events, the mask  
is used as a MESI-protocol qualifier of cache states. See  
Table 11 on page 164 for a list of unit masks and their 8-bit  
codes.  
USR (User Mode) Flag  
(Bit 16)  
Events are counted only when the processor is operating at  
privilege levels 1, 2, or 3. This flag can be used in conjunction  
with the OS flag.  
OS (Operating System  
Mode) Flag (Bit 17)  
Events are counted only when the processor is operating at  
privilege level 0. This flag can be used in conjunction with the  
USR flag.  
E (Edge Detect) Flag  
(Bit 18)  
When this flag is set, edge detection of events is enabled. The  
processor counts the number of negated-to-asserted transitions  
of any condition that can be expressed by the other fields. The  
mechanism is limited in that it does not permit back-to-back  
assertions to be distinguished. This mechanism allows software  
to measure not only the fraction of time spent in a particular  
state, but also the average length of time spent in such a state  
(for example, the time spent waiting for an interrupt to be  
serviced).  
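The duration/edge relationship can be demonstrated with a simulated waveform. The function below is a hypothetical software model of the two counter configurations (edge detect clear counts cycles in a state; edge detect set counts entries into it), not processor code.

```c
#include <assert.h>
#include <stdint.h>

/* Model one condition sampled every clock (1 = asserted).  With edge
 * detect clear the counter accumulates asserted clocks (duration);
 * with edge detect set it counts negated-to-asserted transitions. */
void count_condition(const int *wave, int n,
                     uint64_t *duration, uint64_t *edges) {
    *duration = 0;
    *edges = 0;
    int prev = 0;                   /* condition negated before sampling */
    for (int i = 0; i < n; i++) {
        if (wave[i]) {
            (*duration)++;
            if (!prev) (*edges)++;  /* a negated-to-asserted transition */
        }
        prev = wave[i];
    }
}
```

Dividing the duration count by the edge count gives the average number of clocks spent in the condition per occurrence.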
PC (Pin Control) Flag  
(Bit 19)  
When this flag is set, the processor toggles the PMi pins when  
the counter overflows. When this flag is clear, the processor  
toggles the PMi pins and increments the counter when  
performance monitoring events occur. The toggling of a pin is  
defined as assertion of the pin for one bus clock followed by  
negation.  
INT (APIC Interrupt  
Enable) Flag (Bit 20)  
When this flag is set, the processor generates an interrupt  
through its local APIC on counter overflow.  
EN (Enable Counter)  
Flag (Bit 22)  
This flag enables/disables the PerfEvtSeln MSR. When set,  
performance counting is enabled for this counter. When clear,  
this counter is disabled.  
INV (Invert) Flag (Bit  
23)  
By inverting the Counter Mask Field, this flag inverts the result  
of the counter comparison, allowing both greater than and less  
than comparisons.  
Counter Mask Field (Bits 31–24)
For events which can have multiple occurrences within one  
clock, this field is used to set a threshold. If the field is non-zero,  
the counter increments each time the number of events is  
greater than or equal to the counter mask. If this field is zero,  
the counter increments by the total number of events.  
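Putting the fields together, the helper below encodes a PerfEvtSel value from the layout in Figure 11. The macro and function names are invented; the field positions come from the sections above. For example, selecting event 40h (data cache accesses, Table 11) with USR, OS, and EN set yields 0043_0040h.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits of PerfEvtSel[3:0] (see Figure 11). */
#define PES_USR  (1u << 16)   /* count at privilege levels 1-3 */
#define PES_OS   (1u << 17)   /* count at privilege level 0 */
#define PES_E    (1u << 18)   /* edge detect */
#define PES_PC   (1u << 19)   /* pin control */
#define PES_INT  (1u << 20)   /* APIC interrupt enable */
#define PES_EN   (1u << 22)   /* enable counter */
#define PES_INV  (1u << 23)   /* invert counter-mask comparison */

/* Build a PerfEvtSel value to be written (at privilege level 0) with
 * WRMSR to C001_0000h-C001_0003h; counts are read back with RDPMC. */
uint32_t perfevtsel(uint8_t event, uint8_t unit_mask,
                    uint8_t counter_mask, uint32_t flags) {
    return (uint32_t)event
         | ((uint32_t)unit_mask << 8)
         | flags
         | ((uint32_t)counter_mask << 24);
}
```

A unit mask of 1Fh with event 42h would count data cache refills in all five MOESI states.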
Table 11. Performance-Monitoring Counters

Event  Source
Number Unit   Event Description                                Notes / Unit Mask (bits 15-8)
20h    LS     Segment register loads                           1xxx_xxxxb = reserved
                                                               x1xx_xxxxb = HS
                                                               xx1x_xxxxb = GS
                                                               xxx1_xxxxb = FS
                                                               xxxx_1xxxb = DS
                                                               xxxx_x1xxb = SS
                                                               xxxx_xx1xb = CS
                                                               xxxx_xxx1b = ES
21h    LS     Stores to active instruction stream
40h    DC     Data cache accesses
41h    DC     Data cache misses
42h    DC     Data cache refills                               xxx1_xxxxb = Modified (M)
                                                               xxxx_1xxxb = Owner (O)
                                                               xxxx_x1xxb = Exclusive (E)
                                                               xxxx_xx1xb = Shared (S)
                                                               xxxx_xxx1b = Invalid (I)
43h    DC     Data cache refills from system                   xxx1_xxxxb = Modified (M)
                                                               xxxx_1xxxb = Owner (O)
                                                               xxxx_x1xxb = Exclusive (E)
                                                               xxxx_xx1xb = Shared (S)
                                                               xxxx_xxx1b = Invalid (I)
44h    DC     Data cache writebacks                            xxx1_xxxxb = Modified (M)
                                                               xxxx_1xxxb = Owner (O)
                                                               xxxx_x1xxb = Exclusive (E)
                                                               xxxx_xx1xb = Shared (S)
                                                               xxxx_xxx1b = Invalid (I)
45h    DC     L1 DTLB misses and L2 DTLB hits
46h    DC     L1 and L2 DTLB misses
47h    DC     Misaligned data references
64h    BU     DRAM system requests
65h    BU     System requests with the selected type           x1xx_xxxxb = WB
                                                               xx1x_xxxxb = WP
                                                               xxx1_xxxxb = WT
                                                               bits 11-10 = reserved
                                                               xxxx_xx1xb = WC
                                                               xxxx_xxx1b = UC
73h    BU     Snoop hits                                       bits 15-11 = reserved
                                                               xxxx_x1xxb = L2 (L2 hit and no DC hit)
                                                               xxxx_xx1xb = Data cache
                                                               xxxx_xxx1b = Instruction cache
74h    BU     Single-bit ECC errors detected/corrected         bits 15-10 = reserved
                                                               xxxx_xx1xb = L2 single bit error
                                                               xxxx_xxx1b = System single bit error
75h    BU     Internal cache line invalidates                  bits 15-12 = reserved
                                                               xxxx_1xxxb = I invalidates D
                                                               xxxx_x1xxb = I invalidates I
                                                               xxxx_xx1xb = D invalidates D
                                                               xxxx_xxx1b = D invalidates I
76h    BU     Cycles processor is running (not in HLT
              or STPCLK)
79h    BU     L2 requests                                      1xxx_xxxxb = Data block write from the L2 (TLB RMW)
                                                               x1xx_xxxxb = Data block write from the DC
                                                               xx1x_xxxxb = Data block write from the system
                                                               xxx1_xxxxb = Data block read data store
                                                               xxxx_1xxxb = Data block read data load
                                                               xxxx_x1xxb = Data block read instruction
                                                               xxxx_xx1xb = Tag write
                                                               xxxx_xxx1b = Tag read
7Ah    BU     Cycles that at least one fill request
              waited to use the L2
80h    PC     Instruction cache fetches
81h    PC     Instruction cache misses
82h    PC     Instruction cache refills from L2
83h    PC     Instruction cache refills from system
84h    PC     L1 ITLB misses (and L2 ITLB hits)
85h    PC     (L1 and) L2 ITLB misses
86h    PC     Snoop resyncs
87h    PC     Instruction fetch stall cycles
88h    PC     Return stack hits
89h    PC     Return stack overflow
C0h    FR     Retired instructions (includes exceptions,
              interrupts, resyncs)
C1h    FR     Retired Ops
C2h    FR     Retired branches (conditional, unconditional,
              exceptions, interrupts)
C3h    FR     Retired branches mispredicted
C4h    FR     Retired taken branches
C5h    FR     Retired taken branches mispredicted
C6h    FR     Retired far control transfers
C8h    FR     Retired near returns
C9h    FR     Retired near returns mispredicted
CAh    FR     Retired indirect branches with target
              mispredicted
CDh    FR     Interrupts masked cycles (IF=0)
CEh    FR     Interrupts masked while pending cycles
              (INTR while IF=0)
CFh    FR     Number of taken hardware interrupts
D0h    FR     Instruction decoder empty
D1h    FR     Dispatch stalls (event masks D2h through
              DAh below combined)
D2h    FR     Branch abort to retire
D3h    FR     Serialize
D4h    FR     Segment load stall
D5h    FR     ICU full
D6h    FR     Reservation stations full
D7h    FR     FPU full
D8h    FR     LS full
D9h    FR     All quiet stall
DAh    FR     Far transfer or resync branch pending
DCh    FR     Breakpoint matches for DR0
DDh    FR     Breakpoint matches for DR1
DEh    FR     Breakpoint matches for DR2
DFh    FR     Breakpoint matches for DR3
PerfCtr[3:0] MSRs (MSR Addresses C001_0004h-C001_0007h)
The performance-counter MSRs contain the event or duration  
counts for the selected events being counted. The RDPMC  
instruction can be used by programs or procedures running at  
any privilege level and in virtual-8086 mode to read these  
counters. The PCE flag in control register CR4 (bit 8) allows the  
use of this instruction to be restricted to only programs and  
procedures running at privilege level 0.  
The RDPMC instruction is not serializing or ordered with other  
instructions. Therefore, it does not necessarily wait until all  
previous instructions have been executed before reading the  
counter. Similarly, subsequent instructions can begin execution  
before the RDPMC instruction operation is performed.  
Only the operating system, executing at privilege level 0, can  
directly manipulate the performance counters, using the  
RDMSR and WRMSR instructions. A secure operating system  
would clear the PCE flag during system initialization, which  
disables direct user access to the performance-monitoring  
counters but provides a user-accessible programming interface  
that emulates the RDPMC instruction.  
The WRMSR instruction cannot arbitrarily write to the performance-monitoring counter MSRs (PerfCtr[3:0]). Instead, the value written is treated as a 64-bit sign-extended quantity, which allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are useful for generating an interrupt after a specific number of events.
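A preload value for "interrupt after N events" can be computed as sketched below. The 48-bit truncation models the sign extension from bit 47 described above; the helper name is illustrative only.

```c
#include <stdint.h>

/* Compute a PerfCtr preload value that overflows after "events" more
   events: write the negative count, kept to 48 bits, so that sign
   extension from bit 47 reproduces the intended negative value.
   Hypothetical helper for illustration. */
uint64_t perfctr_preload(uint64_t events)
{
    int64_t start = -(int64_t)events;            /* negative start count */
    return (uint64_t)start & 0xFFFFFFFFFFFFULL;  /* keep low 48 bits */
}
```

For example, preloading the counter with the value for 1000 events causes an overflow (and, if enabled, an APIC interrupt) on the 1000th event.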
Starting and Stopping the Performance-Monitoring Counters  
The performance-monitoring counters are started by writing  
valid setup information in one or more of the PerfEvtSel[3:0]  
MSRs and setting the enable counters flag in the PerfEvtSel0  
MSR. If the setup is valid, the counters begin counting  
following the execution of a WRMSR instruction, which sets the  
enable counter flag. The counters can be stopped by clearing  
the enable counters flag or by clearing all the bits in the  
PerfEvtSel[3:0] MSRs.  
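As a sketch of the setup step, the helper below assembles a PerfEvtSel value from its fields. The bit positions assumed here (event select in bits 7:0, unit mask in bits 15:8, USR = bit 16, OS = bit 17, EN = bit 22) follow the register layout described in this chapter; the function name and interface are illustrative, and actually writing the MSR requires WRMSR at privilege level 0.

```c
#include <stdint.h>

/* Assemble a PerfEvtSel value (hypothetical helper).
   Assumed field positions: event 7:0, unit mask 15:8,
   USR = bit 16, OS = bit 17, EN = bit 22. */
uint32_t perf_evt_sel(uint8_t event, uint8_t unit_mask,
                      int usr, int os, int en)
{
    uint32_t v = 0;
    v |= event;                        /* event select */
    v |= (uint32_t)unit_mask << 8;     /* unit mask (bits 15-8) */
    if (usr) v |= 1u << 16;            /* count in user mode */
    if (os)  v |= 1u << 17;            /* count in OS mode */
    if (en)  v |= 1u << 22;            /* enable the counter */
    return v;
}
```

For example, selecting event 41h (data cache misses) in both user and OS mode with the counter enabled yields the value 0043_0041h.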
Event and Time-Stamp Monitoring Software  
For applications to use the performance-monitoring counters  
and time-stamp counter, the operating system needs to provide  
an event-monitoring device driver. This driver should include  
procedures for handling the following operations:  
Feature checking
Initialize and start counters
Stop counters
Read the event counters
Read the time-stamp counter
The event monitor feature determination procedure must  
determine whether the current processor supports the  
performance-monitoring counters and time-stamp counter. This  
procedure compares the family and model of the processor  
returned by the CPUID instruction with those of processors  
known to support performance monitoring. In addition, the  
procedure checks the MSR and TSC flags returned to register  
EDX by the CPUID instruction to determine if the MSRs and  
the RDTSC instruction are supported.  
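The flag check can be sketched as a pure function over the CPUID feature bits. The bit positions used (TSC = bit 4, MSR = bit 5 of the EDX value returned by CPUID function 1) are the standard x86 assignments; obtaining edx by actually executing CPUID is omitted here.

```c
#include <stdint.h>

/* Check the CPUID function 1 EDX feature bits the driver needs:
   bit 4 = TSC (RDTSC supported), bit 5 = MSR (RDMSR/WRMSR supported).
   The edx value would come from executing CPUID; it is passed in
   here for illustration. */
int has_perfmon_prereqs(uint32_t edx)
{
    int has_tsc = (edx >> 4) & 1;
    int has_msr = (edx >> 5) & 1;
    return has_tsc && has_msr;
}
```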
22007E/0November 1999  
AMD AthlonProcessor x86 Code Optimization  
The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them, and initializes the counter MSRs (PerfCtr[3:0]) to starting counts. The stop counters procedure stops the performance counters. (See "Starting and Stopping the Performance-Monitoring Counters" on page 168 for more information about starting and stopping the counters.)
The read counters procedure reads the values in the  
PerfCtr[3:0] MSRs, and a read time-stamp counter procedure  
reads the time-stamp counter. These procedures can be used  
instead of enabling the RDTSC and RDPMC instructions, which  
allow application code to read the counters directly.  
Monitoring Counter Overflow  
The AMD Athlon processor provides the option of generating a  
debug interrupt when a performance-monitoring counter  
overflows. This mechanism is enabled by setting the interrupt  
enable flag in one of the PerfEvtSel[3:0] MSRs. The primary  
use of this option is for statistical performance sampling.  
To use this option, the operating system should do the  
following:  
Provide an interrupt routine for handling the counter  
overflow as an APIC interrupt  
Provide an entry in the IDT that points to a stub exception  
handler that returns without executing any instructions  
Provide an event monitor driver that provides the actual  
interrupt handler and modifies the reserved IDT entry to  
point to its interrupt routine  
When interrupted by a counter overflow, the interrupt handler  
needs to perform the following actions:  
Save the instruction pointer (EIP register), code segment  
selector, TSS segment selector, counter values and other  
relevant information at the time of the interrupt  
Reset the counter to its initial setting and return from the  
interrupt  
An event monitor application utility or another application  
program can read the collected performance information of the  
profiled application.  
Appendix E  
Programming the MTRR and PAT
Introduction  
The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includes the Page Attribute Table (PAT) for defining attributes of pages. This chapter documents the use and capabilities of these features.
The purpose of the MTRRs is to provide system software with the ability to manage the memory mapping of the hardware. Both the BIOS software and operating systems utilize this capability. The AMD Athlon processor's implementation is compatible with the Pentium® II. Prior to the MTRR mechanism, chipsets usually provided this capability.
Memory Type Range Register (MTRR) Mechanism  
The memory type and range registers allow the processor to  
determine cacheability of various memory locations prior to  
bus access and to optimize access to the memory system. The  
AMD Athlon processor implements the MTRR programming  
model in a manner compatible with Pentium II.  
There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4-Kbyte, 16-Kbyte, or 64-Kbyte segment within the first 1 Mbyte of memory, there is one fixed-address MTRR. The fixed address ranges all exist in the first 1 Mbyte. There are eight variable address ranges above 1 Mbyte. Each is programmed to a specific memory starting address, size, and alignment. If a variable range overlaps the lower 1 Mbyte and the fixed MTRRs are enabled, then the fixed-memory type dominates.
The address regions have the following priority with respect to  
each other:  
1. Fixed address ranges  
2. Variable address ranges  
3. Default memory type (UC at reset)  
FFFF_FFFFh
    SMM TSeg
    0-8 Variable Ranges (2^12 to 2^32 bytes)
100000h
    64 Fixed Ranges (4 Kbytes each; 256 Kbytes total)
C0000h
    16 Fixed Ranges (16 Kbytes each; 256 Kbytes total)
80000h
    8 Fixed Ranges (64 Kbytes each; 512 Kbytes total)
0

Figure 12. MTRR Mapping of Physical Memory
Memory Types  
Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in Table 12.

Table 12. Memory Type Encodings

Type Number  Type Name              Type Description
00h          UC (Uncacheable)       Uncacheable for reads or writes. Cannot be combined.
                                    Must be non-speculative for reads or writes.
01h          WC (Write-Combining)   Uncacheable for reads or writes. Can be combined. Can be
                                    speculative for reads. Writes can never be speculative.
04h          WT (Writethrough)      Reads allocate on a miss, but only to the S-state. Writes do
                                    not allocate on a miss and, for a hit, writes update the
                                    cached entry and main memory.
05h          WP (Write-Protect)     WP is functionally the same as the WT memory type, except
                                    stores do not actually modify cached data and do not cause
                                    an exception.
06h          WB (Writeback)         Reads allocate on a miss, and allocate to the S state if
                                    returned with a ReadDataShared command, or to the M state if
                                    returned with a ReadDataDirty command. Writes allocate to
                                    the M state, if the read allows the line to be marked E.
MTRR Capability Register Format
The MTRR capability register is a read-only register that defines the specific MTRR capability of the processor and is defined as follows.

Bits 63-11  Reserved
Bit 10      WC    Write Combining Memory Type
Bit 9       Reserved
Bit 8       FIX   Fixed Range Registers
Bits 7-0    VCNT  Number of Variable Range Registers

Figure 13. MTRR Capability Register Format

For the AMD Athlon processor, the MTRR capability register should contain 0508h (write-combining, fixed MTRRs supported, and eight variable MTRRs defined).
MTRR Default Type Register Format. The MTRR default type register is defined as follows.

Bits 63-12  Reserved
Bit 11      E     MTRRs Enabled
Bit 10      FE    Fixed Range Enabled
Bits 9-8    Reserved
Bits 7-0    Type  Default Memory Type

Figure 14. MTRR Default Type Register Format

E     MTRRs are enabled when set. All MTRRs (both fixed and variable range) are disabled when clear, and all of physical memory is mapped as uncacheable memory (reset state = 0).
FE    Fixed-range MTRRs are enabled when set. All MTRRs are disabled when clear. When the fixed-range MTRRs are enabled and an overlap occurs with a variable-range MTRR, the fixed-range MTRR takes priority (reset state = 0).
Type  Defines the default memory type (reset state = 0). See Table 13 for more details.
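A minimal decoder for these fields might look as follows. This is a hypothetical helper for illustration; the bit positions match the register format above.

```c
#include <stdint.h>

/* Decode MTRRdefType fields: bit 11 = E (MTRRs enabled),
   bit 10 = FE (fixed range enabled), bits 7:0 = default type. */
typedef struct {
    int e;
    int fe;
    uint8_t type;
} mtrr_def_type;

mtrr_def_type decode_def_type(uint64_t msr)
{
    mtrr_def_type d;
    d.e    = (int)((msr >> 11) & 1);
    d.fe   = (int)((msr >> 10) & 1);
    d.type = (uint8_t)(msr & 0xFF);
    return d;
}
```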
Table 13. Standard MTRR Types and Properties

Memory Type            Encoding   Internally    Writeback   Allows        Memory Ordering
                       in MTRR    Cacheable     Cacheable   Speculative   Model
                                                            Reads
Uncacheable (UC)       0          No            No          No            Strong ordering
Write Combining (WC)   1          No            No          Yes           Weak ordering
Reserved               2          -             -           -             -
Reserved               3          -             -           -             -
Writethrough (WT)      4          Yes           No          Yes           Speculative ordering
Write Protected (WP)   5          Yes (reads),  No          Yes           Speculative ordering
                                  No (writes)
Writeback (WB)         6          Yes           Yes         Yes           Speculative ordering
Reserved               7-255      -             -           -             -
Note that if two or more variable memory ranges match, then the interactions are defined as follows:
1. If the memory types are identical, then that memory type is  
used.  
2. If one or more of the memory types is UC, the UC memory  
type is used.  
3. If one or more of the memory types is WT and the only other  
matching memory type is WB then the WT memory type is  
used.  
4. Otherwise, if the combination of memory types is not listed  
above then the behavior of the processor is undefined.  
MTRR Overlapping  
The Intel documentation (P6/PII) states that the mapping of  
large pages into regions that are mapped with differing memory  
types can result in undefined behavior. However, testing shows  
that these processors decompose these large pages into 4-Kbyte  
pages.  
When a large page (2 Mbytes/4 Mbytes) mapping covers a region that contains more than one memory type (as mapped by the MTRRs), the AMD Athlon processor does not suppress the caching of that large page mapping, and caches only the mapping for the accessed 4-Kbyte piece in the 4-Kbyte TLB. Therefore, the AMD Athlon processor does not decompose large pages under these conditions. The fixed-range MTRRs are not affected by this issue; only the variable-range (and MTRR DefType) registers are affected.
Page Attribute Table (PAT)  
The Page Attribute Table (PAT) is an extension of the page table entry format, which allows the specification of memory types for regions of physical memory based on the linear address. The PAT provides the same functionality as the MTRRs with the flexibility of the page tables. It allows operating systems and applications to determine the desired memory type for optimal performance. PAT support is detected in the feature flags (bit 16) of the CPUID instruction.
MSR Access  
The PAT is located in a 64-bit MSR at location 277h. It is  
illustrated in Figure 15. Each of the eight PAn fields can contain  
the memory type encodings as described in Table 12 on  
page 174. An attempt to write an undefined memory type  
encoding into the PAT will generate a GP fault.  
Field  Bits
PA0    2-0
PA1    10-8
PA2    18-16
PA3    26-24
PA4    34-32
PA5    42-40
PA6    50-48
PA7    58-56
All other bits are reserved.

Figure 15. Page Attribute Table (MSR 277h)
Accessing the PAT  
A 3-bit index, consisting of the PATi, PCD, and PWT bits of the page table entry, is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is defined as bit 7 for 4-Kbyte PTEs and bit 12 for PDEs that map to 2-Mbyte or 4-Mbyte pages). The memory type from the PAT is used instead of the PCD and PWT bits for the effective memory type.
A 2-bit index, consisting of the PCD and PWT bits of the page table entry, is used to select one of four PAT register fields when PAE (page address extensions) is enabled, or when the PDE doesn't describe a large page. In the latter case, the PATi bit for a PTE (bit 7) corresponds to the page size bit in a PDE. Therefore, the OS should only use PA0-PA3 when setting the memory type for a page table that is also used as a page directory. See Table 14 for the encodings.
Table 14. PATi 3-Bit Encodings

PATi  PCD  PWT  PAT Entry  Reset Value
0     0    0    0          WB
0     0    1    1          WT
0     1    0    2          UC-
0     1    1    3          UC
1     0    0    4          WB
1     0    1    5          WT
1     1    0    6          UC-
1     1    1    7          UC
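The index computation and field extraction can be sketched as follows. These are hypothetical helpers for illustration; each PAn field occupies the low 3 bits of byte n of MSR 277h, per Figure 15.

```c
#include <stdint.h>

/* Build the 3-bit PAT index from the page-table bits. */
unsigned pat_index(int pati, int pcd, int pwt)
{
    return ((unsigned)pati << 2) | ((unsigned)pcd << 1) | (unsigned)pwt;
}

/* Extract the memory type encoding of PAn from the PAT MSR value:
   the low 3 bits of byte n. */
uint8_t pat_field(uint64_t pat_msr, unsigned index)
{
    return (uint8_t)((pat_msr >> (8 * index)) & 0x07);
}
```

As a usage sketch, a PAT MSR value of 0007_0406_0007_0406h (byte fields 06h, 04h, 07h, 00h repeating) returns encoding 06h (WB) for index 0 and 00h (UC) for index 3.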
MTRRs and PAT  
The processor contains MTRRs, as described earlier, which provide a limited way of assigning memory types to specific regions. However, the page tables allow memory types to be assigned to the pages used for linear-to-physical translation. The memory types defined by the PAT and the MTRRs are combined to determine the effective memory type, as listed in Table 15 and Table 16. Shaded areas indicate reserved settings.
Table 15. Effective Memory Type Based on PAT and MTRRs

PAT Memory Type   MTRR Memory Type   Effective Memory Type
UC-               WB, WT, WP, WC     UC-Page
UC-               UC                 UC-MTRR
WC                x                  WC
WT                WB, WT             WT
WT                UC                 UC
WT                WC                 CD
WT                WP                 CD
WP                WB, WP             WP
WP                UC                 UC-MTRR
WP                WC, WT             CD
WB                WB                 WB
WB                UC                 UC
WB                WC                 WC
WB                WT                 WT
WB                WP                 WP

Notes:
1. UC-MTRR indicates that the UC attribute came from the MTRRs and that the processor caches should not be probed for performance reasons.
2. UC-Page indicates that the UC attribute came from the page tables and that the processor caches must be probed due to page aliasing.
3. All reserved combinations default to CD.
Table 16. Final Output Memory Types

[This table maps each input memory type (UC, CD, WC, WT, WP, and WB) and access condition to the final output memory type produced with the AMD-751 system controller; its matrix is not legible in this copy. The special cases it encodes are captured in the notes below.]

Notes:
1. WP is not functional for RdMem/WrMem.
2. ForceCD must cause the MTRR memory type to be ignored in order to avoid x's.
3. D-I should always be WP because the BIOS will only program RdMem-WrIO for WP. CD is forced to preserve the write-protect intent.
4. Since cached IO lines cannot be copied back to IO, the processor forces WB to WT to prevent cached IO from going dirty.
5. ForceCD. The memory type is forced CD due to (1) CR0[CD]=1, (2) memory type is for the ITLB and the I-cache is disabled or for the DTLB and the D-cache is disabled, (3) when clean victims must be written back and RdIO and WrIO and WT, WB, or WP, or (4) access to Local APIC space.
6. The processor does not support this memory type.
MTRR Fixed-Range Register Format
The memory types for the memory segments controlled by each of the MTRR fixed-range registers are defined in the byte fields shown in Table 17.

Table 17. MTRR Fixed Range Register Format

                    Address Range (in hexadecimal)
Register            63:56   55:48   47:40   39:32   31:24   23:16   15:8    7:0
MTRR_fix64K_00000   70000-  60000-  50000-  40000-  30000-  20000-  10000-  00000-
                    7FFFF   6FFFF   5FFFF   4FFFF   3FFFF   2FFFF   1FFFF   0FFFF
MTRR_fix16K_80000   9C000-  98000-  94000-  90000-  8C000-  88000-  84000-  80000-
                    9FFFF   9BFFF   97FFF   93FFF   8FFFF   8BFFF   87FFF   83FFF
MTRR_fix16K_A0000   BC000-  B8000-  B4000-  B0000-  AC000-  A8000-  A4000-  A0000-
                    BFFFF   BBFFF   B7FFF   B3FFF   AFFFF   ABFFF   A7FFF   A3FFF
MTRR_fix4K_C0000    C7000-  C6000-  C5000-  C4000-  C3000-  C2000-  C1000-  C0000-
                    C7FFF   C6FFF   C5FFF   C4FFF   C3FFF   C2FFF   C1FFF   C0FFF
MTRR_fix4K_C8000    CF000-  CE000-  CD000-  CC000-  CB000-  CA000-  C9000-  C8000-
                    CFFFF   CEFFF   CDFFF   CCFFF   CBFFF   CAFFF   C9FFF   C8FFF
MTRR_fix4K_D0000    D7000-  D6000-  D5000-  D4000-  D3000-  D2000-  D1000-  D0000-
                    D7FFF   D6FFF   D5FFF   D4FFF   D3FFF   D2FFF   D1FFF   D0FFF
MTRR_fix4K_D8000    DF000-  DE000-  DD000-  DC000-  DB000-  DA000-  D9000-  D8000-
                    DFFFF   DEFFF   DDFFF   DCFFF   DBFFF   DAFFF   D9FFF   D8FFF
MTRR_fix4K_E0000    E7000-  E6000-  E5000-  E4000-  E3000-  E2000-  E1000-  E0000-
                    E7FFF   E6FFF   E5FFF   E4FFF   E3FFF   E2FFF   E1FFF   E0FFF
MTRR_fix4K_E8000    EF000-  EE000-  ED000-  EC000-  EB000-  EA000-  E9000-  E8000-
                    EFFFF   EEFFF   EDFFF   ECFFF   EBFFF   EAFFF   E9FFF   E8FFF
MTRR_fix4K_F0000    F7000-  F6000-  F5000-  F4000-  F3000-  F2000-  F1000-  F0000-
                    F7FFF   F6FFF   F5FFF   F4FFF   F3FFF   F2FFF   F1FFF   F0FFF
MTRR_fix4K_F8000    FF000-  FE000-  FD000-  FC000-  FB000-  FA000-  F9000-  F8000-
                    FFFFF   FEFFF   FDFFF   FCFFF   FBFFF   FAFFF   F9FFF   F8FFF
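Given this layout, software can locate the register and byte field that cover a physical address below 1 Mbyte. The helper below is an illustrative sketch of that arithmetic (register indexes 0-10 in the order the registers appear in Table 17).

```c
#include <stdint.h>

/* Locate the fixed-range MTRR covering a physical address below 1 Mbyte:
   one 64K-granularity register below 80000h, two 16K-granularity
   registers for 80000h-BFFFFh, and eight 4K-granularity registers for
   C0000h-FFFFFh. Returns the register index (0-10, in Table 17 order)
   and writes the byte-field index (0-7) to *field. Hypothetical helper. */
int fixed_mtrr_locate(uint32_t addr, int *field)
{
    if (addr < 0x80000) {                       /* MTRR_fix64K_00000 */
        *field = (int)(addr / 0x10000);
        return 0;
    }
    if (addr < 0xC0000) {                       /* MTRR_fix16K_80000/_A0000 */
        *field = (int)((addr % 0x20000) / 0x4000);
        return 1 + (int)((addr - 0x80000) / 0x20000);
    }
    /* MTRR_fix4K_C0000 through MTRR_fix4K_F8000 */
    *field = (int)((addr % 0x8000) / 0x1000);
    return 3 + (int)((addr - 0xC0000) / 0x8000);
}
```

For example, address 9D000h falls in MTRR_fix16K_80000 (register index 1), field 7, which per Table 17 covers 9C000h-9FFFFh.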
Variable-Range MTRRs
A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones. However, it is recommended not to create an overlap.
The upper two variable MTRRs should not be used by the BIOS and are reserved for operating system use.
Variable-Range MTRR Register Format
The variable address range is power-of-2 sized and aligned. The range of supported sizes is from 2^12 to 2^36 in powers of 2. The AMD Athlon processor does not implement A[35:32].
Bits 63-36  Reserved
Bits 35-12  Physical Base  Base address in Register Pair
Bits 11-8   Reserved
Bits 7-0    Type           See MTRR Types and Properties

Figure 16. MTRRphysBasen Register Format

Note: A software attempt to write to reserved bits will generate a general protection exception.

Physical Base  Specifies a 24-bit value which is extended by 12 bits to form the base address of the region defined in the register pair.
Type           Specifies the memory type for the range (see Table 13).
Bits 63-36  Reserved
Bits 35-12  Physical Mask  24-Bit Mask
Bit 11      V              Variable Range Register Pair Enabled (V = 0 at reset)
Bits 10-0   Reserved

Figure 17. MTRRphysMaskn Register Format

Note: A software attempt to write to reserved bits will generate a general protection exception.

Physical Mask  Specifies a 24-bit mask to determine the range of the region defined in the register pair.
V              Enables the register pair when set (V = 0 at reset).
Mask values can represent discontinuous ranges (when the  
mask defines a lower significant bit as zero and a higher  
significant bit as one). In a discontinuous range, the memory  
area not mapped by the mask value is set to the default type.  
Discontinuous ranges should not be used.  
The range that is mapped by the variable-range MTRR register pair must meet the following range size and alignment rule:
Each defined memory range must have a size equal to 2^n (12 <= n <= 36).
The base address for the address pair must be aligned to a similar 2^n boundary.
An example of a variable MTRR pair is as follows:
To map the address range from 8 Mbytes (0080_0000h) to 16 Mbytes (00FF_FFFFh) as writeback memory, the base register should be loaded with 80_0006h, and the mask should be loaded with F_FF80_0800h.
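The register values for a power-of-2 sized, naturally aligned region can be derived mechanically. The helpers below sketch that computation under the 36-bit layout of Figures 16 and 17; the function names are illustrative.

```c
#include <stdint.h>

/* Build the MTRRphysBasen value: physical base (bits 35:12) plus the
   memory type encoding in bits 7:0. Hypothetical helper. */
uint64_t mtrr_phys_base(uint64_t base, uint8_t type)
{
    return (base & 0xFFFFFF000ULL) | type;
}

/* Build the MTRRphysMaskn value for a power-of-2 region size: the mask
   covers bits 35:12, and bit 11 is the V (valid) bit. */
uint64_t mtrr_phys_mask(uint64_t size)
{
    return (~(size - 1) & 0xFFFFFF000ULL) | (1ULL << 11);
}
```

Applying these to the 8-Mbyte writeback region above (base 0080_0000h, type 06h, size 0080_0000h) reproduces the base value 80_0006h and the mask value F_FF80_0800h.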
MTRR MSR Format  
This table defines the model-specific registers related to the  
memory type range register implementation. All MTRRs are  
defined to be 64 bits.  
Table 18. MTRR-Related Model-Specific Register (MSR) Map

Register Address   Register Name
0FEh               MTRRcap
200h               MTRR Base0
201h               MTRR Mask0
202h               MTRR Base1
203h               MTRR Mask1
204h               MTRR Base2
205h               MTRR Mask2
206h               MTRR Base3
207h               MTRR Mask3
208h               MTRR Base4
209h               MTRR Mask4
20Ah               MTRR Base5
20Bh               MTRR Mask5
20Ch               MTRR Base6
20Dh               MTRR Mask6
20Eh               MTRR Base7
20Fh               MTRR Mask7
250h               MTRRFIX64k_00000
258h               MTRRFIX16k_80000
259h               MTRRFIX16k_A0000
268h               MTRRFIX4k_C0000
269h               MTRRFIX4k_C8000
26Ah               MTRRFIX4k_D0000
26Bh               MTRRFIX4k_D8000
26Ch               MTRRFIX4k_E0000
26Dh               MTRRFIX4k_E8000
26Eh               MTRRFIX4k_F0000
26Fh               MTRRFIX4k_F8000
2FFh               MTRRdefType
Appendix F  
Instruction Dispatch and Execution Resources
This appendix describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations. Tables 19 through 24, starting on page 188, define the integer, MMX, MMX extension, floating-point, 3DNow!, and 3DNow! extension instructions, respectively.
The first column in these tables indicates the instruction  
mnemonic and operand types with the following notations:  
reg8: byte integer register defined by instruction byte(s) or bits 5, 4, and 3 of the modR/M byte
mreg8: byte integer register defined by bits 2, 1, and 0 of the modR/M byte
reg16/32: word and doubleword integer register defined by instruction byte(s) or bits 5, 4, and 3 of the modR/M byte
mreg16/32: word and doubleword integer register defined by bits 2, 1, and 0 of the modR/M byte
mem8: byte memory location
mem16/32: word or doubleword memory location
mem32/48: doubleword or 6-byte memory location
mem48: 48-bit integer value in memory
mem64: 64-bit value in memory
imm8/16/32: 8-bit, 16-bit, or 32-bit immediate value
disp8: 8-bit displacement value
disp16/32: 16-bit or 32-bit displacement value
disp32/48: 32-bit or 48-bit displacement value
eXX: register width depending on the operand size
mem32real: 32-bit floating-point value in memory
mem64real: 64-bit floating-point value in memory
mem80real: 80-bit floating-point value in memory
mmreg: MMX/3DNow! register
mmreg1: MMX/3DNow! register defined by bits 5, 4, and 3 of the modR/M byte
mmreg2: MMX/3DNow! register defined by bits 2, 1, and 0 of the modR/M byte
The second and third columns list all applicable encoding  
opcode bytes.  
The fourth column lists the modR/M byte used by the  
instruction. The modR/M byte defines the instruction as  
register or memory form. If mod bits 7 and 6 are documented as  
mm (memory form), mm can only be 10b, 01b, or 00b.  
The fifth column lists the type of instruction decode, DirectPath or VectorPath (see "DirectPath Decoder" for more information). The AMD Athlon™ processor's enhanced decode logic can process three instructions per clock.
The FPU, MMX, and 3DNow! instruction tables have an  
additional column that lists the possible FPU execution  
pipelines available for use by any particular DirectPath  
decoded operation. Typically, VectorPath instructions require  
more than one execution pipe resource.  
Table 19. Integer Instructions  
Instruction Mnemonic  
First Second ModR/M  
Decode  
Type  
Byte Byte  
Byte  
AAA  
AAD  
AAM  
AAS  
37h  
VectorPath  
VectorPath  
VectorPath  
VectorPath  
D5h  
D4h  
3Fh  
0Ah  
0Ah  
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
ADC mreg8, reg8                              10h            11-xxx-xxx  DirectPath
ADC mem8, reg8                               10h            mm-xxx-xxx  DirectPath
ADC mreg16/32, reg16/32                      11h            11-xxx-xxx  DirectPath
ADC mem16/32, reg16/32                       11h            mm-xxx-xxx  DirectPath
ADC reg8, mreg8                              12h            11-xxx-xxx  DirectPath
ADC reg8, mem8                               12h            mm-xxx-xxx  DirectPath
ADC reg16/32, mreg16/32                      13h            11-xxx-xxx  DirectPath
ADC reg16/32, mem16/32                       13h            mm-xxx-xxx  DirectPath
ADC AL, imm8                                 14h                        DirectPath
ADC EAX, imm16/32                            15h                        DirectPath
ADC mreg8, imm8                              80h            11-010-xxx  DirectPath
ADC mem8, imm8                               80h            mm-010-xxx  DirectPath
ADC mreg16/32, imm16/32                      81h            11-010-xxx  DirectPath
ADC mem16/32, imm16/32                       81h            mm-010-xxx  DirectPath
ADC mreg16/32, imm8 (sign extended)          83h            11-010-xxx  DirectPath
ADC mem16/32, imm8 (sign extended)           83h            mm-010-xxx  DirectPath
ADD mreg8, reg8                              00h            11-xxx-xxx  DirectPath
ADD mem8, reg8                               00h            mm-xxx-xxx  DirectPath
ADD mreg16/32, reg16/32                      01h            11-xxx-xxx  DirectPath
ADD mem16/32, reg16/32                       01h            mm-xxx-xxx  DirectPath
ADD reg8, mreg8                              02h            11-xxx-xxx  DirectPath
ADD reg8, mem8                               02h            mm-xxx-xxx  DirectPath
ADD reg16/32, mreg16/32                      03h            11-xxx-xxx  DirectPath
ADD reg16/32, mem16/32                       03h            mm-xxx-xxx  DirectPath
ADD AL, imm8                                 04h                        DirectPath
ADD EAX, imm16/32                            05h                        DirectPath
ADD mreg8, imm8                              80h            11-000-xxx  DirectPath
ADD mem8, imm8                               80h            mm-000-xxx  DirectPath
ADD mreg16/32, imm16/32                      81h            11-000-xxx  DirectPath
ADD mem16/32, imm16/32                       81h            mm-000-xxx  DirectPath
ADD mreg16/32, imm8 (sign extended)          83h            11-000-xxx  DirectPath
ADD mem16/32, imm8 (sign extended)           83h            mm-000-xxx  DirectPath
AND mreg8, reg8                              20h            11-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
AND mem8, reg8                               20h            mm-xxx-xxx  DirectPath
AND mreg16/32, reg16/32                      21h            11-xxx-xxx  DirectPath
AND mem16/32, reg16/32                       21h            mm-xxx-xxx  DirectPath
AND reg8, mreg8                              22h            11-xxx-xxx  DirectPath
AND reg8, mem8                               22h            mm-xxx-xxx  DirectPath
AND reg16/32, mreg16/32                      23h            11-xxx-xxx  DirectPath
AND reg16/32, mem16/32                       23h            mm-xxx-xxx  DirectPath
AND AL, imm8                                 24h                        DirectPath
AND EAX, imm16/32                            25h                        DirectPath
AND mreg8, imm8                              80h            11-100-xxx  DirectPath
AND mem8, imm8                               80h            mm-100-xxx  DirectPath
AND mreg16/32, imm16/32                      81h            11-100-xxx  DirectPath
AND mem16/32, imm16/32                       81h            mm-100-xxx  DirectPath
AND mreg16/32, imm8 (sign extended)          83h            11-100-xxx  DirectPath
AND mem16/32, imm8 (sign extended)           83h            mm-100-xxx  DirectPath
ARPL mreg16, reg16                           63h            11-xxx-xxx  VectorPath
ARPL mem16, reg16                            63h            mm-xxx-xxx  VectorPath
BOUND                                        62h                        VectorPath
BSF reg16/32, mreg16/32                      0Fh    BCh     11-xxx-xxx  VectorPath
BSF reg16/32, mem16/32                       0Fh    BCh     mm-xxx-xxx  VectorPath
BSR reg16/32, mreg16/32                      0Fh    BDh     11-xxx-xxx  VectorPath
BSR reg16/32, mem16/32                       0Fh    BDh     mm-xxx-xxx  VectorPath
BSWAP EAX                                    0Fh    C8h                 DirectPath
BSWAP ECX                                    0Fh    C9h                 DirectPath
BSWAP EDX                                    0Fh    CAh                 DirectPath
BSWAP EBX                                    0Fh    CBh                 DirectPath
BSWAP ESP                                    0Fh    CCh                 DirectPath
BSWAP EBP                                    0Fh    CDh                 DirectPath
BSWAP ESI                                    0Fh    CEh                 DirectPath
BSWAP EDI                                    0Fh    CFh                 DirectPath
BT mreg16/32, reg16/32                       0Fh    A3h     11-xxx-xxx  DirectPath
BT mem16/32, reg16/32                        0Fh    A3h     mm-xxx-xxx  VectorPath
BT mreg16/32, imm8                           0Fh    BAh     11-100-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
BT mem16/32, imm8                            0Fh    BAh     mm-100-xxx  DirectPath
BTC mreg16/32, reg16/32                      0Fh    BBh     11-xxx-xxx  VectorPath
BTC mem16/32, reg16/32                       0Fh    BBh     mm-xxx-xxx  VectorPath
BTC mreg16/32, imm8                          0Fh    BAh     11-111-xxx  VectorPath
BTC mem16/32, imm8                           0Fh    BAh     mm-111-xxx  VectorPath
BTR mreg16/32, reg16/32                      0Fh    B3h     11-xxx-xxx  VectorPath
BTR mem16/32, reg16/32                       0Fh    B3h     mm-xxx-xxx  VectorPath
BTR mreg16/32, imm8                          0Fh    BAh     11-110-xxx  VectorPath
BTR mem16/32, imm8                           0Fh    BAh     mm-110-xxx  VectorPath
BTS mreg16/32, reg16/32                      0Fh    ABh     11-xxx-xxx  VectorPath
BTS mem16/32, reg16/32                       0Fh    ABh     mm-xxx-xxx  VectorPath
BTS mreg16/32, imm8                          0Fh    BAh     11-101-xxx  VectorPath
BTS mem16/32, imm8                           0Fh    BAh     mm-101-xxx  VectorPath
CALL full pointer                            9Ah                        VectorPath
CALL near imm16/32                           E8h                        VectorPath
CALL mem16:16/32                             FFh            mm-011-xxx  VectorPath
CALL near mreg32 (indirect)                  FFh            11-010-xxx  VectorPath
CALL near mem32 (indirect)                   FFh            mm-010-xxx  VectorPath
CBW/CWDE                                     98h                        DirectPath
CLC                                          F8h                        DirectPath
CLD                                          FCh                        VectorPath
CLI                                          FAh                        VectorPath
CLTS                                         0Fh    06h                 VectorPath
CMC                                          F5h                        DirectPath
CMOVA/CMOVNBE reg16/32, reg16/32             0Fh    47h     11-xxx-xxx  DirectPath
CMOVA/CMOVNBE reg16/32, mem16/32             0Fh    47h     mm-xxx-xxx  DirectPath
CMOVAE/CMOVNB/CMOVNC reg16/32, reg16/32      0Fh    43h     11-xxx-xxx  DirectPath
CMOVAE/CMOVNB/CMOVNC reg16/32, mem16/32      0Fh    43h     mm-xxx-xxx  DirectPath
CMOVB/CMOVC/CMOVNAE reg16/32, reg16/32       0Fh    42h     11-xxx-xxx  DirectPath
CMOVB/CMOVC/CMOVNAE reg16/32, mem16/32       0Fh    42h     mm-xxx-xxx  DirectPath
CMOVBE/CMOVNA reg16/32, reg16/32             0Fh    46h     11-xxx-xxx  DirectPath
CMOVBE/CMOVNA reg16/32, mem16/32             0Fh    46h     mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
CMOVE/CMOVZ reg16/32, reg16/32               0Fh    44h     11-xxx-xxx  DirectPath
CMOVE/CMOVZ reg16/32, mem16/32               0Fh    44h     mm-xxx-xxx  DirectPath
CMOVG/CMOVNLE reg16/32, reg16/32             0Fh    4Fh     11-xxx-xxx  DirectPath
CMOVG/CMOVNLE reg16/32, mem16/32             0Fh    4Fh     mm-xxx-xxx  DirectPath
CMOVGE/CMOVNL reg16/32, reg16/32             0Fh    4Dh     11-xxx-xxx  DirectPath
CMOVGE/CMOVNL reg16/32, mem16/32             0Fh    4Dh     mm-xxx-xxx  DirectPath
CMOVL/CMOVNGE reg16/32, reg16/32             0Fh    4Ch     11-xxx-xxx  DirectPath
CMOVL/CMOVNGE reg16/32, mem16/32             0Fh    4Ch     mm-xxx-xxx  DirectPath
CMOVLE/CMOVNG reg16/32, reg16/32             0Fh    4Eh     11-xxx-xxx  DirectPath
CMOVLE/CMOVNG reg16/32, mem16/32             0Fh    4Eh     mm-xxx-xxx  DirectPath
CMOVNE/CMOVNZ reg16/32, reg16/32             0Fh    45h     11-xxx-xxx  DirectPath
CMOVNE/CMOVNZ reg16/32, mem16/32             0Fh    45h     mm-xxx-xxx  DirectPath
CMOVNO reg16/32, reg16/32                    0Fh    41h     11-xxx-xxx  DirectPath
CMOVNO reg16/32, mem16/32                    0Fh    41h     mm-xxx-xxx  DirectPath
CMOVNP/CMOVPO reg16/32, reg16/32             0Fh    4Bh     11-xxx-xxx  DirectPath
CMOVNP/CMOVPO reg16/32, mem16/32             0Fh    4Bh     mm-xxx-xxx  DirectPath
CMOVNS reg16/32, reg16/32                    0Fh    49h     11-xxx-xxx  DirectPath
CMOVNS reg16/32, mem16/32                    0Fh    49h     mm-xxx-xxx  DirectPath
CMOVO reg16/32, reg16/32                     0Fh    40h     11-xxx-xxx  DirectPath
CMOVO reg16/32, mem16/32                     0Fh    40h     mm-xxx-xxx  DirectPath
CMOVP/CMOVPE reg16/32, reg16/32              0Fh    4Ah     11-xxx-xxx  DirectPath
CMOVP/CMOVPE reg16/32, mem16/32              0Fh    4Ah     mm-xxx-xxx  DirectPath
CMOVS reg16/32, reg16/32                     0Fh    48h     11-xxx-xxx  DirectPath
CMOVS reg16/32, mem16/32                     0Fh    48h     mm-xxx-xxx  DirectPath
CMP mreg8, reg8                              38h            11-xxx-xxx  DirectPath
CMP mem8, reg8                               38h            mm-xxx-xxx  DirectPath
CMP mreg16/32, reg16/32                      39h            11-xxx-xxx  DirectPath
CMP mem16/32, reg16/32                       39h            mm-xxx-xxx  DirectPath
CMP reg8, mreg8                              3Ah            11-xxx-xxx  DirectPath
CMP reg8, mem8                               3Ah            mm-xxx-xxx  DirectPath
CMP reg16/32, mreg16/32                      3Bh            11-xxx-xxx  DirectPath
CMP reg16/32, mem16/32                       3Bh            mm-xxx-xxx  DirectPath
CMP AL, imm8                                 3Ch                        DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
CMP EAX, imm16/32                            3Dh                        DirectPath
CMP mreg8, imm8                              80h            11-111-xxx  DirectPath
CMP mem8, imm8                               80h            mm-111-xxx  DirectPath
CMP mreg16/32, imm16/32                      81h            11-111-xxx  DirectPath
CMP mem16/32, imm16/32                       81h            mm-111-xxx  DirectPath
CMP mreg16/32, imm8 (sign extended)          83h            11-111-xxx  DirectPath
CMP mem16/32, imm8 (sign extended)           83h            mm-111-xxx  DirectPath
CMPSB mem8, mem8                             A6h                        VectorPath
CMPSW mem16, mem16                           A7h                        VectorPath
CMPSD mem32, mem32                           A7h                        VectorPath
CMPXCHG mreg8, reg8                          0Fh    B0h     11-xxx-xxx  VectorPath
CMPXCHG mem8, reg8                           0Fh    B0h     mm-xxx-xxx  VectorPath
CMPXCHG mreg16/32, reg16/32                  0Fh    B1h     11-xxx-xxx  VectorPath
CMPXCHG mem16/32, reg16/32                   0Fh    B1h     mm-xxx-xxx  VectorPath
CMPXCHG8B mem64                              0Fh    C7h     mm-xxx-xxx  VectorPath
CPUID                                        0Fh    A2h                 VectorPath
CWD/CDQ                                      99h                        DirectPath
DAA                                          27h                        VectorPath
DAS                                          2Fh                        VectorPath
DEC EAX                                      48h                        DirectPath
DEC ECX                                      49h                        DirectPath
DEC EDX                                      4Ah                        DirectPath
DEC EBX                                      4Bh                        DirectPath
DEC ESP                                      4Ch                        DirectPath
DEC EBP                                      4Dh                        DirectPath
DEC ESI                                      4Eh                        DirectPath
DEC EDI                                      4Fh                        DirectPath
DEC mreg8                                    FEh            11-001-xxx  DirectPath
DEC mem8                                     FEh            mm-001-xxx  DirectPath
DEC mreg16/32                                FFh            11-001-xxx  DirectPath
DEC mem16/32                                 FFh            mm-001-xxx  DirectPath
DIV AL, mreg8                                F6h            11-110-xxx  VectorPath
DIV AL, mem8                                 F6h            mm-110-xxx  VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
DIV EAX, mreg16/32                           F7h            11-110-xxx  VectorPath
DIV EAX, mem16/32                            F7h            mm-110-xxx  VectorPath
ENTER                                        C8h                        VectorPath
IDIV mreg8                                   F6h            11-111-xxx  VectorPath
IDIV mem8                                    F6h            mm-111-xxx  VectorPath
IDIV EAX, mreg16/32                          F7h            11-111-xxx  VectorPath
IDIV EAX, mem16/32                           F7h            mm-111-xxx  VectorPath
IMUL reg16/32, imm16/32                      69h            11-xxx-xxx  VectorPath
IMUL reg16/32, mreg16/32, imm16/32           69h            11-xxx-xxx  VectorPath
IMUL reg16/32, mem16/32, imm16/32            69h            mm-xxx-xxx  VectorPath
IMUL reg16/32, imm8 (sign extended)          6Bh            11-xxx-xxx  VectorPath
IMUL reg16/32, mreg16/32, imm8 (signed)      6Bh            11-xxx-xxx  VectorPath
IMUL reg16/32, mem16/32, imm8 (signed)       6Bh            mm-xxx-xxx  VectorPath
IMUL AX, AL, mreg8                           F6h            11-101-xxx  VectorPath
IMUL AX, AL, mem8                            F6h            mm-101-xxx  VectorPath
IMUL EDX:EAX, EAX, mreg16/32                 F7h            11-101-xxx  VectorPath
IMUL EDX:EAX, EAX, mem16/32                  F7h            mm-101-xxx  VectorPath
IMUL reg16/32, mreg16/32                     0Fh    AFh     11-xxx-xxx  VectorPath
IMUL reg16/32, mem16/32                      0Fh    AFh     mm-xxx-xxx  VectorPath
IN AL, imm8                                  E4h                        VectorPath
IN AX, imm8                                  E5h                        VectorPath
IN EAX, imm8                                 E5h                        VectorPath
IN AL, DX                                    ECh                        VectorPath
IN AX, DX                                    EDh                        VectorPath
IN EAX, DX                                   EDh                        VectorPath
INC EAX                                      40h                        DirectPath
INC ECX                                      41h                        DirectPath
INC EDX                                      42h                        DirectPath
INC EBX                                      43h                        DirectPath
INC ESP                                      44h                        DirectPath
INC EBP                                      45h                        DirectPath
INC ESI                                      46h                        DirectPath
INC EDI                                      47h                        DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
INC mreg8                                    FEh            11-000-xxx  DirectPath
INC mem8                                     FEh            mm-000-xxx  DirectPath
INC mreg16/32                                FFh            11-000-xxx  DirectPath
INC mem16/32                                 FFh            mm-000-xxx  DirectPath
INVD                                         0Fh    08h                 VectorPath
INVLPG                                       0Fh    01h     mm-111-xxx  VectorPath
JO short disp8                               70h                        DirectPath
JNO short disp8                              71h                        DirectPath
JB/JNAE/JC short disp8                       72h                        DirectPath
JNB/JAE/JNC short disp8                      73h                        DirectPath
JZ/JE short disp8                            74h                        DirectPath
JNZ/JNE short disp8                          75h                        DirectPath
JBE/JNA short disp8                          76h                        DirectPath
JNBE/JA short disp8                          77h                        DirectPath
JS short disp8                               78h                        DirectPath
JNS short disp8                              79h                        DirectPath
JP/JPE short disp8                           7Ah                        DirectPath
JNP/JPO short disp8                          7Bh                        DirectPath
JL/JNGE short disp8                          7Ch                        DirectPath
JNL/JGE short disp8                          7Dh                        DirectPath
JLE/JNG short disp8                          7Eh                        DirectPath
JNLE/JG short disp8                          7Fh                        DirectPath
JCXZ/JECXZ short disp8                       E3h                        VectorPath
JO near disp16/32                            0Fh    80h                 DirectPath
JNO near disp16/32                           0Fh    81h                 DirectPath
JB/JNAE near disp16/32                       0Fh    82h                 DirectPath
JNB/JAE near disp16/32                       0Fh    83h                 DirectPath
JZ/JE near disp16/32                         0Fh    84h                 DirectPath
JNZ/JNE near disp16/32                       0Fh    85h                 DirectPath
JBE/JNA near disp16/32                       0Fh    86h                 DirectPath
JNBE/JA near disp16/32                       0Fh    87h                 DirectPath
JS near disp16/32                            0Fh    88h                 DirectPath
JNS near disp16/32                           0Fh    89h                 DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
JP/JPE near disp16/32                        0Fh    8Ah                 DirectPath
JNP/JPO near disp16/32                       0Fh    8Bh                 DirectPath
JL/JNGE near disp16/32                       0Fh    8Ch                 DirectPath
JNL/JGE near disp16/32                       0Fh    8Dh                 DirectPath
JLE/JNG near disp16/32                       0Fh    8Eh                 DirectPath
JNLE/JG near disp16/32                       0Fh    8Fh                 DirectPath
JMP near disp16/32 (direct)                  E9h                        DirectPath
JMP far disp32/48 (direct)                   EAh                        VectorPath
JMP disp8 (short)                            EBh                        DirectPath
JMP far mem32 (indirect)                     FFh            mm-101-xxx  VectorPath
JMP far mreg32 (indirect)                    FFh            mm-101-xxx  VectorPath
JMP near mreg16/32 (indirect)                FFh            11-100-xxx  DirectPath
JMP near mem16/32 (indirect)                 FFh            mm-100-xxx  DirectPath
LAHF                                         9Fh                        VectorPath
LAR reg16/32, mreg16/32                      0Fh    02h     11-xxx-xxx  VectorPath
LAR reg16/32, mem16/32                       0Fh    02h     mm-xxx-xxx  VectorPath
LDS reg16/32, mem32/48                       C5h            mm-xxx-xxx  VectorPath
LEA reg16, mem16/32                          8Dh            mm-xxx-xxx  VectorPath
LEA reg32, mem16/32                          8Dh            mm-xxx-xxx  DirectPath
LEAVE                                        C9h                        VectorPath
LES reg16/32, mem32/48                       C4h            mm-xxx-xxx  VectorPath
LFS reg16/32, mem32/48                       0Fh    B4h     mm-xxx-xxx  VectorPath
LGDT mem48                                   0Fh    01h     mm-010-xxx  VectorPath
LGS reg16/32, mem32/48                       0Fh    B5h     mm-xxx-xxx  VectorPath
LIDT mem48                                   0Fh    01h     mm-011-xxx  VectorPath
LLDT mreg16                                  0Fh    00h     11-010-xxx  VectorPath
LLDT mem16                                   0Fh    00h     mm-010-xxx  VectorPath
LMSW mreg16                                  0Fh    01h     11-100-xxx  VectorPath
LMSW mem16                                   0Fh    01h     mm-100-xxx  VectorPath
LODSB AL, mem8                               ACh                        VectorPath
LODSW AX, mem16                              ADh                        VectorPath
LODSD EAX, mem32                             ADh                        VectorPath
LOOP disp8                                   E2h                        VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
LOOPE/LOOPZ disp8                            E1h                        VectorPath
LOOPNE/LOOPNZ disp8                          E0h                        VectorPath
LSL reg16/32, mreg16/32                      0Fh    03h     11-xxx-xxx  VectorPath
LSL reg16/32, mem16/32                       0Fh    03h     mm-xxx-xxx  VectorPath
LSS reg16/32, mem32/48                       0Fh    B2h     mm-xxx-xxx  VectorPath
LTR mreg16                                   0Fh    00h     11-011-xxx  VectorPath
LTR mem16                                    0Fh    00h     mm-011-xxx  VectorPath
MOV mreg8, reg8                              88h            11-xxx-xxx  DirectPath
MOV mem8, reg8                               88h            mm-xxx-xxx  DirectPath
MOV mreg16/32, reg16/32                      89h            11-xxx-xxx  DirectPath
MOV mem16/32, reg16/32                       89h            mm-xxx-xxx  DirectPath
MOV reg8, mreg8                              8Ah            11-xxx-xxx  DirectPath
MOV reg8, mem8                               8Ah            mm-xxx-xxx  DirectPath
MOV reg16/32, mreg16/32                      8Bh            11-xxx-xxx  DirectPath
MOV reg16/32, mem16/32                       8Bh            mm-xxx-xxx  DirectPath
MOV mreg16, segment reg                      8Ch            11-xxx-xxx  VectorPath
MOV mem16, segment reg                       8Ch            mm-xxx-xxx  VectorPath
MOV segment reg, mreg16                      8Eh            11-xxx-xxx  VectorPath
MOV segment reg, mem16                       8Eh            mm-xxx-xxx  VectorPath
MOV AL, mem8                                 A0h                        DirectPath
MOV EAX, mem16/32                            A1h                        DirectPath
MOV mem8, AL                                 A2h                        DirectPath
MOV mem16/32, EAX                            A3h                        DirectPath
MOV AL, imm8                                 B0h                        DirectPath
MOV CL, imm8                                 B1h                        DirectPath
MOV DL, imm8                                 B2h                        DirectPath
MOV BL, imm8                                 B3h                        DirectPath
MOV AH, imm8                                 B4h                        DirectPath
MOV CH, imm8                                 B5h                        DirectPath
MOV DH, imm8                                 B6h                        DirectPath
MOV BH, imm8                                 B7h                        DirectPath
MOV EAX, imm16/32                            B8h                        DirectPath
MOV ECX, imm16/32                            B9h                        DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
MOV EDX, imm16/32                            BAh                        DirectPath
MOV EBX, imm16/32                            BBh                        DirectPath
MOV ESP, imm16/32                            BCh                        DirectPath
MOV EBP, imm16/32                            BDh                        DirectPath
MOV ESI, imm16/32                            BEh                        DirectPath
MOV EDI, imm16/32                            BFh                        DirectPath
MOV mreg8, imm8                              C6h            11-000-xxx  DirectPath
MOV mem8, imm8                               C6h            mm-000-xxx  DirectPath
MOV mreg16/32, imm16/32                      C7h            11-000-xxx  DirectPath
MOV mem16/32, imm16/32                       C7h            mm-000-xxx  DirectPath
MOVSB mem8, mem8                             A4h                        VectorPath
MOVSW mem16, mem16                           A5h                        VectorPath
MOVSD mem32, mem32                           A5h                        VectorPath
MOVSX reg16/32, mreg8                        0Fh    BEh     11-xxx-xxx  DirectPath
MOVSX reg16/32, mem8                         0Fh    BEh     mm-xxx-xxx  DirectPath
MOVSX reg32, mreg16                          0Fh    BFh     11-xxx-xxx  DirectPath
MOVSX reg32, mem16                           0Fh    BFh     mm-xxx-xxx  DirectPath
MOVZX reg16/32, mreg8                        0Fh    B6h     11-xxx-xxx  DirectPath
MOVZX reg16/32, mem8                         0Fh    B6h     mm-xxx-xxx  DirectPath
MOVZX reg32, mreg16                          0Fh    B7h     11-xxx-xxx  DirectPath
MOVZX reg32, mem16                           0Fh    B7h     mm-xxx-xxx  DirectPath
MUL AL, mreg8                                F6h            11-100-xxx  VectorPath
MUL AL, mem8                                 F6h            mm-100-xxx  VectorPath
MUL AX, mreg16                               F7h            11-100-xxx  VectorPath
MUL AX, mem16                                F7h            mm-100-xxx  VectorPath
MUL EAX, mreg32                              F7h            11-100-xxx  VectorPath
MUL EAX, mem32                               F7h            mm-100-xxx  VectorPath
NEG mreg8                                    F6h            11-011-xxx  DirectPath
NEG mem8                                     F6h            mm-011-xxx  DirectPath
NEG mreg16/32                                F7h            11-011-xxx  DirectPath
NEG mem16/32                                 F7h            mm-011-xxx  DirectPath
NOP (XCHG EAX, EAX)                          90h                        DirectPath
NOT mreg8                                    F6h            11-010-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
NOT mem8                                     F6h            mm-010-xxx  DirectPath
NOT mreg16/32                                F7h            11-010-xxx  DirectPath
NOT mem16/32                                 F7h            mm-010-xxx  DirectPath
OR mreg8, reg8                               08h            11-xxx-xxx  DirectPath
OR mem8, reg8                                08h            mm-xxx-xxx  DirectPath
OR mreg16/32, reg16/32                       09h            11-xxx-xxx  DirectPath
OR mem16/32, reg16/32                        09h            mm-xxx-xxx  DirectPath
OR reg8, mreg8                               0Ah            11-xxx-xxx  DirectPath
OR reg8, mem8                                0Ah            mm-xxx-xxx  DirectPath
OR reg16/32, mreg16/32                       0Bh            11-xxx-xxx  DirectPath
OR reg16/32, mem16/32                        0Bh            mm-xxx-xxx  DirectPath
OR AL, imm8                                  0Ch                        DirectPath
OR EAX, imm16/32                             0Dh                        DirectPath
OR mreg8, imm8                               80h            11-001-xxx  DirectPath
OR mem8, imm8                                80h            mm-001-xxx  DirectPath
OR mreg16/32, imm16/32                       81h            11-001-xxx  DirectPath
OR mem16/32, imm16/32                        81h            mm-001-xxx  DirectPath
OR mreg16/32, imm8 (sign extended)           83h            11-001-xxx  DirectPath
OR mem16/32, imm8 (sign extended)            83h            mm-001-xxx  DirectPath
OUT imm8, AL                                 E6h                        VectorPath
OUT imm8, AX                                 E7h                        VectorPath
OUT imm8, EAX                                E7h                        VectorPath
OUT DX, AL                                   EEh                        VectorPath
OUT DX, AX                                   EFh                        VectorPath
OUT DX, EAX                                  EFh                        VectorPath
POP ES                                       07h                        VectorPath
POP SS                                       17h                        VectorPath
POP DS                                       1Fh                        VectorPath
POP FS                                       0Fh    A1h                 VectorPath
POP GS                                       0Fh    A9h                 VectorPath
POP EAX                                      58h                        VectorPath
POP ECX                                      59h                        VectorPath
POP EDX                                      5Ah                        VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
POP EBX                                      5Bh                        VectorPath
POP ESP                                      5Ch                        VectorPath
POP EBP                                      5Dh                        VectorPath
POP ESI                                      5Eh                        VectorPath
POP EDI                                      5Fh                        VectorPath
POP mreg16/32                                8Fh            11-000-xxx  VectorPath
POP mem16/32                                 8Fh            mm-000-xxx  VectorPath
POPA/POPAD                                   61h                        VectorPath
POPF/POPFD                                   9Dh                        VectorPath
PUSH ES                                      06h                        VectorPath
PUSH CS                                      0Eh                        VectorPath
PUSH FS                                      0Fh    A0h                 VectorPath
PUSH GS                                      0Fh    A8h                 VectorPath
PUSH SS                                      16h                        VectorPath
PUSH DS                                      1Eh                        VectorPath
PUSH EAX                                     50h                        DirectPath
PUSH ECX                                     51h                        DirectPath
PUSH EDX                                     52h                        DirectPath
PUSH EBX                                     53h                        DirectPath
PUSH ESP                                     54h                        DirectPath
PUSH EBP                                     55h                        DirectPath
PUSH ESI                                     56h                        DirectPath
PUSH EDI                                     57h                        DirectPath
PUSH imm8                                    6Ah                        DirectPath
PUSH imm16/32                                68h                        DirectPath
PUSH mreg16/32                               FFh            11-110-xxx  VectorPath
PUSH mem16/32                                FFh            mm-110-xxx  VectorPath
PUSHA/PUSHAD                                 60h                        VectorPath
PUSHF/PUSHFD                                 9Ch                        VectorPath
RCL mreg8, imm8                              C0h            11-010-xxx  DirectPath
RCL mem8, imm8                               C0h            mm-010-xxx  VectorPath
RCL mreg16/32, imm8                          C1h            11-010-xxx  DirectPath
RCL mem16/32, imm8                           C1h            mm-010-xxx  VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
RCL mreg8, 1                                 D0h            11-010-xxx  DirectPath
RCL mem8, 1                                  D0h            mm-010-xxx  DirectPath
RCL mreg16/32, 1                             D1h            11-010-xxx  DirectPath
RCL mem16/32, 1                              D1h            mm-010-xxx  DirectPath
RCL mreg8, CL                                D2h            11-010-xxx  DirectPath
RCL mem8, CL                                 D2h            mm-010-xxx  VectorPath
RCL mreg16/32, CL                            D3h            11-010-xxx  DirectPath
RCL mem16/32, CL                             D3h            mm-010-xxx  VectorPath
RCR mreg8, imm8                              C0h            11-011-xxx  DirectPath
RCR mem8, imm8                               C0h            mm-011-xxx  VectorPath
RCR mreg16/32, imm8                          C1h            11-011-xxx  DirectPath
RCR mem16/32, imm8                           C1h            mm-011-xxx  VectorPath
RCR mreg8, 1                                 D0h            11-011-xxx  DirectPath
RCR mem8, 1                                  D0h            mm-011-xxx  DirectPath
RCR mreg16/32, 1                             D1h            11-011-xxx  DirectPath
RCR mem16/32, 1                              D1h            mm-011-xxx  DirectPath
RCR mreg8, CL                                D2h            11-011-xxx  DirectPath
RCR mem8, CL                                 D2h            mm-011-xxx  VectorPath
RCR mreg16/32, CL                            D3h            11-011-xxx  DirectPath
RCR mem16/32, CL                             D3h            mm-011-xxx  VectorPath
RDMSR                                        0Fh    32h                 VectorPath
RDPMC                                        0Fh    33h                 VectorPath
RDTSC                                        0Fh    31h                 VectorPath
RET near imm16                               C2h                        VectorPath
RET near                                     C3h                        VectorPath
RET far imm16                                CAh                        VectorPath
RET far                                      CBh                        VectorPath
ROL mreg8, imm8                              C0h            11-000-xxx  DirectPath
ROL mem8, imm8                               C0h            mm-000-xxx  DirectPath
ROL mreg16/32, imm8                          C1h            11-000-xxx  DirectPath
ROL mem16/32, imm8                           C1h            mm-000-xxx  DirectPath
ROL mreg8, 1                                 D0h            11-000-xxx  DirectPath
ROL mem8, 1                                  D0h            mm-000-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
ROL mreg16/32, 1                             D1h            11-000-xxx  DirectPath
ROL mem16/32, 1                              D1h            mm-000-xxx  DirectPath
ROL mreg8, CL                                D2h            11-000-xxx  DirectPath
ROL mem8, CL                                 D2h            mm-000-xxx  DirectPath
ROL mreg16/32, CL                            D3h            11-000-xxx  DirectPath
ROL mem16/32, CL                             D3h            mm-000-xxx  DirectPath
ROR mreg8, imm8                              C0h            11-001-xxx  DirectPath
ROR mem8, imm8                               C0h            mm-001-xxx  DirectPath
ROR mreg16/32, imm8                          C1h            11-001-xxx  DirectPath
ROR mem16/32, imm8                           C1h            mm-001-xxx  DirectPath
ROR mreg8, 1                                 D0h            11-001-xxx  DirectPath
ROR mem8, 1                                  D0h            mm-001-xxx  DirectPath
ROR mreg16/32, 1                             D1h            11-001-xxx  DirectPath
ROR mem16/32, 1                              D1h            mm-001-xxx  DirectPath
ROR mreg8, CL                                D2h            11-001-xxx  DirectPath
ROR mem8, CL                                 D2h            mm-001-xxx  DirectPath
ROR mreg16/32, CL                            D3h            11-001-xxx  DirectPath
ROR mem16/32, CL                             D3h            mm-001-xxx  DirectPath
SAHF                                         9Eh                        VectorPath
SAR mreg8, imm8                              C0h            11-111-xxx  DirectPath
SAR mem8, imm8                               C0h            mm-111-xxx  DirectPath
SAR mreg16/32, imm8                          C1h            11-111-xxx  DirectPath
SAR mem16/32, imm8                           C1h            mm-111-xxx  DirectPath
SAR mreg8, 1                                 D0h            11-111-xxx  DirectPath
SAR mem8, 1                                  D0h            mm-111-xxx  DirectPath
SAR mreg16/32, 1                             D1h            11-111-xxx  DirectPath
SAR mem16/32, 1                              D1h            mm-111-xxx  DirectPath
SAR mreg8, CL                                D2h            11-111-xxx  DirectPath
SAR mem8, CL                                 D2h            mm-111-xxx  DirectPath
SAR mreg16/32, CL                            D3h            11-111-xxx  DirectPath
SAR mem16/32, CL                             D3h            mm-111-xxx  DirectPath
SBB mreg8, reg8                              18h            11-xxx-xxx  DirectPath
SBB mem8, reg8                               18h            mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
SBB mreg16/32, reg16/32                      19h            11-xxx-xxx  DirectPath
SBB mem16/32, reg16/32                       19h            mm-xxx-xxx  DirectPath
SBB reg8, mreg8                              1Ah            11-xxx-xxx  DirectPath
SBB reg8, mem8                               1Ah            mm-xxx-xxx  DirectPath
SBB reg16/32, mreg16/32                      1Bh            11-xxx-xxx  DirectPath
SBB reg16/32, mem16/32                       1Bh            mm-xxx-xxx  DirectPath
SBB AL, imm8                                 1Ch                        DirectPath
SBB EAX, imm16/32                            1Dh                        DirectPath
SBB mreg8, imm8                              80h            11-011-xxx  DirectPath
SBB mem8, imm8                               80h            mm-011-xxx  DirectPath
SBB mreg16/32, imm16/32                      81h            11-011-xxx  DirectPath
SBB mem16/32, imm16/32                       81h            mm-011-xxx  DirectPath
SBB mreg16/32, imm8 (sign extended)          83h            11-011-xxx  DirectPath
SBB mem16/32, imm8 (sign extended)           83h            mm-011-xxx  DirectPath
SCASB AL, mem8                               AEh                        VectorPath
SCASW AX, mem16                              AFh                        VectorPath
SCASD EAX, mem32                             AFh                        VectorPath
SETO mreg8                                   0Fh    90h     11-xxx-xxx  DirectPath
SETO mem8                                    0Fh    90h     mm-xxx-xxx  DirectPath
SETNO mreg8                                  0Fh    91h     11-xxx-xxx  DirectPath
SETNO mem8                                   0Fh    91h     mm-xxx-xxx  DirectPath
SETB/SETC/SETNAE mreg8                       0Fh    92h     11-xxx-xxx  DirectPath
SETB/SETC/SETNAE mem8                        0Fh    92h     mm-xxx-xxx  DirectPath
SETAE/SETNB/SETNC mreg8                      0Fh    93h     11-xxx-xxx  DirectPath
SETAE/SETNB/SETNC mem8                       0Fh    93h     mm-xxx-xxx  DirectPath
SETE/SETZ mreg8                              0Fh    94h     11-xxx-xxx  DirectPath
SETE/SETZ mem8                               0Fh    94h     mm-xxx-xxx  DirectPath
SETNE/SETNZ mreg8                            0Fh    95h     11-xxx-xxx  DirectPath
SETNE/SETNZ mem8                             0Fh    95h     mm-xxx-xxx  DirectPath
SETBE/SETNA mreg8                            0Fh    96h     11-xxx-xxx  DirectPath
SETBE/SETNA mem8                             0Fh    96h     mm-xxx-xxx  DirectPath
SETA/SETNBE mreg8                            0Fh    97h     11-xxx-xxx  DirectPath
SETA/SETNBE mem8                             0Fh    97h     mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
SETS mreg8                                   0Fh    98h     11-xxx-xxx  DirectPath
SETS mem8                                    0Fh    98h     mm-xxx-xxx  DirectPath
SETNS mreg8                                  0Fh    99h     11-xxx-xxx  DirectPath
SETNS mem8                                   0Fh    99h     mm-xxx-xxx  DirectPath
SETP/SETPE mreg8                             0Fh    9Ah     11-xxx-xxx  DirectPath
SETP/SETPE mem8                              0Fh    9Ah     mm-xxx-xxx  DirectPath
SETNP/SETPO mreg8                            0Fh    9Bh     11-xxx-xxx  DirectPath
SETNP/SETPO mem8                             0Fh    9Bh     mm-xxx-xxx  DirectPath
SETL/SETNGE mreg8                            0Fh    9Ch     11-xxx-xxx  DirectPath
SETL/SETNGE mem8                             0Fh    9Ch     mm-xxx-xxx  DirectPath
SETGE/SETNL mreg8                            0Fh    9Dh     11-xxx-xxx  DirectPath
SETGE/SETNL mem8                             0Fh    9Dh     mm-xxx-xxx  DirectPath
SETLE/SETNG mreg8                            0Fh    9Eh     11-xxx-xxx  DirectPath
SETLE/SETNG mem8                             0Fh    9Eh     mm-xxx-xxx  DirectPath
SETG/SETNLE mreg8                            0Fh    9Fh     11-xxx-xxx  DirectPath
SETG/SETNLE mem8                             0Fh    9Fh     mm-xxx-xxx  DirectPath
SGDT mem48                                   0Fh    01h     mm-000-xxx  VectorPath
SIDT mem48                                   0Fh    01h     mm-001-xxx  VectorPath
SHL/SAL mreg8, imm8                          C0h            11-100-xxx  DirectPath
SHL/SAL mem8, imm8                           C0h            mm-100-xxx  DirectPath
SHL/SAL mreg16/32, imm8                      C1h            11-100-xxx  DirectPath
SHL/SAL mem16/32, imm8                       C1h            mm-100-xxx  DirectPath
SHL/SAL mreg8, 1                             D0h            11-100-xxx  DirectPath
SHL/SAL mem8, 1                              D0h            mm-100-xxx  DirectPath
SHL/SAL mreg16/32, 1                         D1h            11-100-xxx  DirectPath
SHL/SAL mem16/32, 1                          D1h            mm-100-xxx  DirectPath
SHL/SAL mreg8, CL                            D2h            11-100-xxx  DirectPath
SHL/SAL mem8, CL                             D2h            mm-100-xxx  DirectPath
SHL/SAL mreg16/32, CL                        D3h            11-100-xxx  DirectPath
SHL/SAL mem16/32, CL                         D3h            mm-100-xxx  DirectPath
SHR mreg8, imm8                              C0h            11-101-xxx  DirectPath
SHR mem8, imm8                               C0h            mm-101-xxx  DirectPath
SHR mreg16/32, imm8                          C1h            11-101-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                         First  Second  ModR/M      Decode
                                             Byte   Byte    Byte        Type
SHR mem16/32, imm8                           C1h            mm-101-xxx  DirectPath
SHR mreg8, 1                                 D0h            11-101-xxx  DirectPath
SHR mem8, 1                                  D0h            mm-101-xxx  DirectPath
SHR mreg16/32, 1                             D1h            11-101-xxx  DirectPath
SHR mem16/32, 1                              D1h            mm-101-xxx  DirectPath
SHR mreg8, CL                                D2h            11-101-xxx  DirectPath
SHR mem8, CL                                 D2h            mm-101-xxx  DirectPath
SHR mreg16/32, CL                            D3h            11-101-xxx  DirectPath
SHR mem16/32, CL                             D3h            mm-101-xxx  DirectPath
SHLD mreg16/32, reg16/32, imm8               0Fh    A4h     11-xxx-xxx  VectorPath
SHLD mem16/32, reg16/32, imm8                0Fh    A4h     mm-xxx-xxx  VectorPath
SHLD mreg16/32, reg16/32, CL                 0Fh    A5h     11-xxx-xxx  VectorPath
SHLD mem16/32, reg16/32, CL                  0Fh    A5h     mm-xxx-xxx  VectorPath
SHRD mreg16/32, reg16/32, imm8               0Fh    ACh     11-xxx-xxx  VectorPath
SHRD mem16/32, reg16/32, imm8                0Fh    ACh     mm-xxx-xxx  VectorPath
SHRD mreg16/32, reg16/32, CL                 0Fh    ADh     11-xxx-xxx  VectorPath
SHRD mem16/32, reg16/32, CL                  0Fh    ADh     mm-xxx-xxx  VectorPath
SLDT mreg16                                  0Fh    00h     11-000-xxx  VectorPath
SLDT mem16                                   0Fh    00h     mm-000-xxx  VectorPath
SMSW mreg16                                  0Fh    01h     11-100-xxx  VectorPath
SMSW mem16                                   0Fh    01h     mm-100-xxx  VectorPath
STC                                          F9h                        DirectPath
STD                                          FDh                        VectorPath
STI                                          FBh                        VectorPath
STOSB mem8, AL                               AAh                        VectorPath
STOSW mem16, AX                              ABh                        VectorPath
STOSD mem32, EAX                             ABh                        VectorPath
STR mreg16                                   0Fh    00h     11-001-xxx  VectorPath
STR mem16                                    0Fh    00h     mm-001-xxx  VectorPath
SUB mreg8, reg8                              28h            11-xxx-xxx  DirectPath
SUB mem8, reg8                               28h            mm-xxx-xxx  DirectPath
SUB mreg16/32, reg16/32                      29h            11-xxx-xxx  DirectPath
SUB mem16/32, reg16/32                       29h            mm-xxx-xxx  DirectPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                    First  Second  ModR/M      Decode
                                        Byte   Byte    Byte        Type
SUB reg8, mreg8                         2Ah            11-xxx-xxx  DirectPath
SUB reg8, mem8                          2Ah            mm-xxx-xxx  DirectPath
SUB reg16/32, mreg16/32                 2Bh            11-xxx-xxx  DirectPath
SUB reg16/32, mem16/32                  2Bh            mm-xxx-xxx  DirectPath
SUB AL, imm8                            2Ch                        DirectPath
SUB EAX, imm16/32                       2Dh                        DirectPath
SUB mreg8, imm8                         80h            11-101-xxx  DirectPath
SUB mem8, imm8                          80h            mm-101-xxx  DirectPath
SUB mreg16/32, imm16/32                 81h            11-101-xxx  DirectPath
SUB mem16/32, imm16/32                  81h            mm-101-xxx  DirectPath
SUB mreg16/32, imm8 (sign extended)     83h            11-101-xxx  DirectPath
SUB mem16/32, imm8 (sign extended)      83h            mm-101-xxx  DirectPath
SYSCALL                                 0Fh    05h                 VectorPath
SYSENTER                                0Fh    34h                 VectorPath
SYSEXIT                                 0Fh    35h                 VectorPath
SYSRET                                  0Fh    07h                 VectorPath
TEST mreg8, reg8                        84h            11-xxx-xxx  DirectPath
TEST mem8, reg8                         84h            mm-xxx-xxx  DirectPath
TEST mreg16/32, reg16/32                85h            11-xxx-xxx  DirectPath
TEST mem16/32, reg16/32                 85h            mm-xxx-xxx  DirectPath
TEST AL, imm8                           A8h                        DirectPath
TEST EAX, imm16/32                      A9h                        DirectPath
TEST mreg8, imm8                        F6h            11-000-xxx  DirectPath
TEST mem8, imm8                         F6h            mm-000-xxx  DirectPath
TEST mreg16/32, imm16/32                F7h            11-000-xxx  DirectPath
TEST mem16/32, imm16/32                 F7h            mm-000-xxx  DirectPath
VERR mreg16                             0Fh    00h     11-100-xxx  VectorPath
VERR mem16                              0Fh    00h     mm-100-xxx  VectorPath
VERW mreg16                             0Fh    00h     11-101-xxx  VectorPath
VERW mem16                              0Fh    00h     mm-101-xxx  VectorPath
WAIT                                    9Bh                        DirectPath
WBINVD                                  0Fh    09h                 VectorPath
WRMSR                                   0Fh    30h                 VectorPath
Table 19. Integer Instructions (Continued)

Instruction Mnemonic                    First  Second  ModR/M      Decode
                                        Byte   Byte    Byte        Type
XADD mreg8, reg8                        0Fh    C0h     11-100-xxx  VectorPath
XADD mem8, reg8                         0Fh    C0h     mm-100-xxx  VectorPath
XADD mreg16/32, reg16/32                0Fh    C1h     11-101-xxx  VectorPath
XADD mem16/32, reg16/32                 0Fh    C1h     mm-101-xxx  VectorPath
XCHG reg8, mreg8                        86h            11-xxx-xxx  VectorPath
XCHG reg8, mem8                         86h            mm-xxx-xxx  VectorPath
XCHG reg16/32, mreg16/32                87h            11-xxx-xxx  VectorPath
XCHG reg16/32, mem16/32                 87h            mm-xxx-xxx  VectorPath
XCHG EAX, EAX                           90h                        DirectPath
XCHG EAX, ECX                           91h                        VectorPath
XCHG EAX, EDX                           92h                        VectorPath
XCHG EAX, EBX                           93h                        VectorPath
XCHG EAX, ESP                           94h                        VectorPath
XCHG EAX, EBP                           95h                        VectorPath
XCHG EAX, ESI                           96h                        VectorPath
XCHG EAX, EDI                           97h                        VectorPath
XLAT                                    D7h                        VectorPath
XOR mreg8, reg8                         30h            11-xxx-xxx  DirectPath
XOR mem8, reg8                          30h            mm-xxx-xxx  DirectPath
XOR mreg16/32, reg16/32                 31h            11-xxx-xxx  DirectPath
XOR mem16/32, reg16/32                  31h            mm-xxx-xxx  DirectPath
XOR reg8, mreg8                         32h            11-xxx-xxx  DirectPath
XOR reg8, mem8                          32h            mm-xxx-xxx  DirectPath
XOR reg16/32, mreg16/32                 33h            11-xxx-xxx  DirectPath
XOR reg16/32, mem16/32                  33h            mm-xxx-xxx  DirectPath
XOR AL, imm8                            34h                        DirectPath
XOR EAX, imm16/32                       35h                        DirectPath
XOR mreg8, imm8                         80h            11-110-xxx  DirectPath
XOR mem8, imm8                          80h            mm-110-xxx  DirectPath
XOR mreg16/32, imm16/32                 81h            11-110-xxx  DirectPath
XOR mem16/32, imm16/32                  81h            mm-110-xxx  DirectPath
XOR mreg16/32, imm8 (sign extended)     83h            11-110-xxx  DirectPath
XOR mem16/32, imm8 (sign extended)      83h            mm-110-xxx  DirectPath
Table 20. MMX™ Instructions

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)       Notes
                              Byte(s)   Byte   Byte        Type
EMMS                          0Fh       77h                DirectPath  FADD/FMUL/FSTORE
MOVD mmreg, reg32             0Fh       6Eh    11-xxx-xxx  VectorPath                    1
MOVD mmreg, mem32             0Fh       6Eh    mm-xxx-xxx  DirectPath  FADD/FMUL/FSTORE
MOVD reg32, mmreg             0Fh       7Eh    11-xxx-xxx  VectorPath                    1
MOVD mem32, mmreg             0Fh       7Eh    mm-xxx-xxx  DirectPath  FSTORE
MOVQ mmreg1, mmreg2           0Fh       6Fh    11-xxx-xxx  DirectPath  FADD/FMUL
MOVQ mmreg, mem64             0Fh       6Fh    mm-xxx-xxx  DirectPath  FADD/FMUL/FSTORE
MOVQ mmreg2, mmreg1           0Fh       7Fh    11-xxx-xxx  DirectPath  FADD/FMUL
MOVQ mem64, mmreg             0Fh       7Fh    mm-xxx-xxx  DirectPath  FSTORE
PACKSSDW mmreg1, mmreg2       0Fh       6Bh    11-xxx-xxx  DirectPath  FADD/FMUL
PACKSSDW mmreg, mem64         0Fh       6Bh    mm-xxx-xxx  DirectPath  FADD/FMUL
PACKSSWB mmreg1, mmreg2       0Fh       63h    11-xxx-xxx  DirectPath  FADD/FMUL
PACKSSWB mmreg, mem64         0Fh       63h    mm-xxx-xxx  DirectPath  FADD/FMUL
PACKUSWB mmreg1, mmreg2       0Fh       67h    11-xxx-xxx  DirectPath  FADD/FMUL
PACKUSWB mmreg, mem64         0Fh       67h    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDB mmreg1, mmreg2          0Fh       FCh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDB mmreg, mem64            0Fh       FCh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDD mmreg1, mmreg2          0Fh       FEh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDD mmreg, mem64            0Fh       FEh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDSB mmreg1, mmreg2         0Fh       ECh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDSB mmreg, mem64           0Fh       ECh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDSW mmreg1, mmreg2         0Fh       EDh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDSW mmreg, mem64           0Fh       EDh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDUSB mmreg1, mmreg2        0Fh       DCh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDUSB mmreg, mem64          0Fh       DCh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDUSW mmreg1, mmreg2        0Fh       DDh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDUSW mmreg, mem64          0Fh       DDh    mm-xxx-xxx  DirectPath  FADD/FMUL
PADDW mmreg1, mmreg2          0Fh       FDh    11-xxx-xxx  DirectPath  FADD/FMUL
PADDW mmreg, mem64            0Fh       FDh    mm-xxx-xxx  DirectPath  FADD/FMUL
PAND mmreg1, mmreg2           0Fh       DBh    11-xxx-xxx  DirectPath  FADD/FMUL
PAND mmreg, mem64             0Fh       DBh    mm-xxx-xxx  DirectPath  FADD/FMUL
Notes:
1. Bits 2, 1, and 0 of the modR/M byte select the integer register.
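Throughout these tables, the ModR/M Byte column uses a mod-reg-r/m bit pattern: the first two characters are the mod field (11 for a register form, mm for a memory form), the middle three bits are the reg/opcode-extension field, and xxx marks bits that vary with the operands. Per note 1, bits 2, 1, and 0 select the integer register. The following sketch is illustrative only (the helper name is not part of the manual) and shows how the three fields are extracted:

```python
def decode_modrm(byte):
    """Split a ModR/M byte into its mod (bits 7-6), reg (bits 5-3),
    and r/m (bits 2-0) fields, matching the tables' mod-reg-r/m notation."""
    mod = (byte >> 6) & 0b11    # 0b11 means a register operand ("11-...")
    reg = (byte >> 3) & 0b111   # opcode extension or source register
    rm = byte & 0b111           # per note 1, selects the integer register
    return mod, reg, rm

# 0xEE = 11-101-110: a register form matching the "11-101-xxx" pattern
print(decode_modrm(0xEE))  # (3, 5, 6)
```

For example, 0x05 (00-000-101) decodes as a memory form, matching the "mm-000-xxx" pattern.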
 
Table 20. MMX™ Instructions (Continued)

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)   Byte   Byte        Type
PANDN mmreg1, mmreg2          0Fh       DFh    11-xxx-xxx  DirectPath  FADD/FMUL
PANDN mmreg, mem64            0Fh       DFh    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQB mmreg1, mmreg2        0Fh       74h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQB mmreg, mem64          0Fh       74h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQD mmreg1, mmreg2        0Fh       76h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQD mmreg, mem64          0Fh       76h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQW mmreg1, mmreg2        0Fh       75h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPEQW mmreg, mem64          0Fh       75h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTB mmreg1, mmreg2        0Fh       64h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTB mmreg, mem64          0Fh       64h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTD mmreg1, mmreg2        0Fh       66h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTD mmreg, mem64          0Fh       66h    mm-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTW mmreg1, mmreg2        0Fh       65h    11-xxx-xxx  DirectPath  FADD/FMUL
PCMPGTW mmreg, mem64          0Fh       65h    mm-xxx-xxx  DirectPath  FADD/FMUL
PMADDWD mmreg1, mmreg2        0Fh       F5h    11-xxx-xxx  DirectPath  FMUL
PMADDWD mmreg, mem64          0Fh       F5h    mm-xxx-xxx  DirectPath  FMUL
PMULHW mmreg1, mmreg2         0Fh       E5h    11-xxx-xxx  DirectPath  FMUL
PMULHW mmreg, mem64           0Fh       E5h    mm-xxx-xxx  DirectPath  FMUL
PMULLW mmreg1, mmreg2         0Fh       D5h    11-xxx-xxx  DirectPath  FMUL
PMULLW mmreg, mem64           0Fh       D5h    mm-xxx-xxx  DirectPath  FMUL
POR mmreg1, mmreg2            0Fh       EBh    11-xxx-xxx  DirectPath  FADD/FMUL
POR mmreg, mem64              0Fh       EBh    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLD mmreg1, mmreg2          0Fh       F2h    11-xxx-xxx  DirectPath  FADD/FMUL
PSLLD mmreg, mem64            0Fh       F2h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLD mmreg, imm8             0Fh       72h    11-110-xxx  DirectPath  FADD/FMUL
PSLLQ mmreg1, mmreg2          0Fh       F3h    11-xxx-xxx  DirectPath  FADD/FMUL
PSLLQ mmreg, mem64            0Fh       F3h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLQ mmreg, imm8             0Fh       73h    11-110-xxx  DirectPath  FADD/FMUL
PSLLW mmreg1, mmreg2          0Fh       F1h    11-xxx-xxx  DirectPath  FADD/FMUL
PSLLW mmreg, mem64            0Fh       F1h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSLLW mmreg, imm8             0Fh       71h    11-110-xxx  DirectPath  FADD/FMUL
Table 20. MMX™ Instructions (Continued)

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)   Byte   Byte        Type
PSRAW mmreg1, mmreg2          0Fh       E1h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRAW mmreg, mem64            0Fh       E1h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRAW mmreg, imm8             0Fh       71h    11-100-xxx  DirectPath  FADD/FMUL
PSRAD mmreg1, mmreg2          0Fh       E2h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRAD mmreg, mem64            0Fh       E2h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRAD mmreg, imm8             0Fh       72h    11-100-xxx  DirectPath  FADD/FMUL
PSRLD mmreg1, mmreg2          0Fh       D2h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRLD mmreg, mem64            0Fh       D2h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRLD mmreg, imm8             0Fh       72h    11-010-xxx  DirectPath  FADD/FMUL
PSRLQ mmreg1, mmreg2          0Fh       D3h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRLQ mmreg, mem64            0Fh       D3h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRLQ mmreg, imm8             0Fh       73h    11-010-xxx  DirectPath  FADD/FMUL
PSRLW mmreg1, mmreg2          0Fh       D1h    11-xxx-xxx  DirectPath  FADD/FMUL
PSRLW mmreg, mem64            0Fh       D1h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSRLW mmreg, imm8             0Fh       71h    11-010-xxx  DirectPath  FADD/FMUL
PSUBB mmreg1, mmreg2          0Fh       F8h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBB mmreg, mem64            0Fh       F8h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBD mmreg1, mmreg2          0Fh       FAh    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBD mmreg, mem64            0Fh       FAh    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBSB mmreg1, mmreg2         0Fh       E8h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBSB mmreg, mem64           0Fh       E8h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBSW mmreg1, mmreg2         0Fh       E9h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBSW mmreg, mem64           0Fh       E9h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSB mmreg1, mmreg2        0Fh       D8h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSB mmreg, mem64          0Fh       D8h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSW mmreg1, mmreg2        0Fh       D9h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBUSW mmreg, mem64          0Fh       D9h    mm-xxx-xxx  DirectPath  FADD/FMUL
PSUBW mmreg1, mmreg2          0Fh       F9h    11-xxx-xxx  DirectPath  FADD/FMUL
PSUBW mmreg, mem64            0Fh       F9h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHBW mmreg1, mmreg2      0Fh       68h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHBW mmreg, mem64        0Fh       68h    mm-xxx-xxx  DirectPath  FADD/FMUL
Table 20. MMX™ Instructions (Continued)

Instruction Mnemonic          Prefix    First  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)   Byte   Byte        Type
PUNPCKHDQ mmreg1, mmreg2      0Fh       6Ah    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHDQ mmreg, mem64        0Fh       6Ah    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHWD mmreg1, mmreg2      0Fh       69h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKHWD mmreg, mem64        0Fh       69h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLBW mmreg1, mmreg2      0Fh       60h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLBW mmreg, mem64        0Fh       60h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLDQ mmreg1, mmreg2      0Fh       62h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLDQ mmreg, mem64        0Fh       62h    mm-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLWD mmreg1, mmreg2      0Fh       61h    11-xxx-xxx  DirectPath  FADD/FMUL
PUNPCKLWD mmreg, mem64        0Fh       61h    mm-xxx-xxx  DirectPath  FADD/FMUL
PXOR mmreg1, mmreg2           0Fh       EFh    11-xxx-xxx  DirectPath  FADD/FMUL
PXOR mmreg, mem64             0Fh       EFh    mm-xxx-xxx  DirectPath  FADD/FMUL

Table 21. MMX™ Extensions

Instruction Mnemonic            Prefix    First  ModR/M      Decode      FPU Pipe(s)
                                Byte(s)   Byte   Byte        Type
MASKMOVQ mmreg1, mmreg2         0Fh       F7h                VectorPath  FADD/FMUL/FSTORE
MOVNTQ mem64, mmreg             0Fh       E7h                DirectPath  FSTORE
PAVGB mmreg1, mmreg2            0Fh       E0h    11-xxx-xxx  DirectPath  FADD/FMUL
PAVGB mmreg, mem64              0Fh       E0h    mm-xxx-xxx  DirectPath  FADD/FMUL
PAVGW mmreg1, mmreg2            0Fh       E3h    11-xxx-xxx  DirectPath  FADD/FMUL
PAVGW mmreg, mem64              0Fh       E3h    mm-xxx-xxx  DirectPath  FADD/FMUL
PEXTRW reg32, mmreg, imm8       0Fh       C5h                VectorPath
PINSRW mmreg, reg32, imm8       0Fh       C4h                VectorPath
PINSRW mmreg, mem16, imm8       0Fh       C4h                VectorPath
PMAXSW mmreg1, mmreg2           0Fh       EEh    11-xxx-xxx  DirectPath  FADD/FMUL
PMAXSW mmreg, mem64             0Fh       EEh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMAXUB mmreg1, mmreg2           0Fh       DEh    11-xxx-xxx  DirectPath  FADD/FMUL
PMAXUB mmreg, mem64             0Fh       DEh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMINSW mmreg1, mmreg2           0Fh       EAh    11-xxx-xxx  DirectPath  FADD/FMUL
Table 21. MMX™ Extensions (Continued)

Instruction Mnemonic            Prefix    First  ModR/M      Decode      FPU Pipe(s)  Notes
                                Byte(s)   Byte   Byte        Type
PMINSW mmreg, mem64             0Fh       EAh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMINUB mmreg1, mmreg2           0Fh       DAh    11-xxx-xxx  DirectPath  FADD/FMUL
PMINUB mmreg, mem64             0Fh       DAh    mm-xxx-xxx  DirectPath  FADD/FMUL
PMOVMSKB reg32, mmreg           0Fh       D7h                VectorPath
PMULHUW mmreg1, mmreg2          0Fh       E4h    11-xxx-xxx  DirectPath  FMUL
PMULHUW mmreg, mem64            0Fh       E4h    mm-xxx-xxx  DirectPath  FMUL
PSADBW mmreg1, mmreg2           0Fh       F6h    11-xxx-xxx  DirectPath  FADD
PSADBW mmreg, mem64             0Fh       F6h    mm-xxx-xxx  DirectPath  FADD
PSHUFW mmreg1, mmreg2, imm8     0Fh       70h                DirectPath  FADD/FMUL
PSHUFW mmreg, mem64, imm8       0Fh       70h                DirectPath  FADD/FMUL
PREFETCHNTA mem8                0Fh       18h                DirectPath  -            1
PREFETCHT0 mem8                 0Fh       18h                DirectPath  -            1
PREFETCHT1 mem8                 0Fh       18h                DirectPath  -            1
PREFETCHT2 mem8                 0Fh       18h                DirectPath  -            1
SFENCE                          0Fh       AEh                VectorPath  -
Notes:
1. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
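The note above means a prefetch operates on an entire 64-byte cache line, not on a single byte: the effective address is aligned down to the start of the line it falls in, and the whole line is fetched. A minimal sketch of that alignment, assuming the 64-byte line size stated in the note (the helper name is illustrative):

```python
LINE_SIZE = 64  # cache line size stated in the prefetch note

def line_base(addr):
    """Return the base address of the 64-byte line containing addr --
    the whole line is what PREFETCHNTA/T0/T1/T2 actually fetches."""
    return addr & ~(LINE_SIZE - 1)

print(hex(line_base(0x1234)))  # 0x1200: any address in 0x1200-0x123F maps here
```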
Table 22. Floating-Point Instructions

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
F2XM1                       D9h    F0h                 VectorPath
FABS                        D9h    E1h                 DirectPath  FMUL
FADD ST, ST(i)              D8h            11-000-xxx  DirectPath  FADD              1
FADD [mem32real]            D8h            mm-000-xxx  DirectPath  FADD
FADD ST(i), ST              DCh            11-000-xxx  DirectPath  FADD              1
FADD [mem64real]            DCh            mm-000-xxx  DirectPath  FADD
FADDP ST(i), ST             DEh            11-000-xxx  DirectPath  FADD              1
FBLD [mem80]                DFh            mm-100-xxx  VectorPath
FBSTP [mem80]               DFh            mm-110-xxx  VectorPath
FCHS                        D9h    E0h                 DirectPath  FMUL
FCLEX                       DBh    E2h                 VectorPath
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
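Note 1 means the register forms are encoded by placing i in the low three bits of the ModR/M byte. FADD ST, ST(i), for example, has first byte D8h and ModR/M pattern 11-000-xxx, so FADD ST, ST(1) assembles to D8h C1h. A sketch of that encoding (the helper name is illustrative, not from the manual):

```python
def encode_fadd_st_sti(i):
    """Encode FADD ST, ST(i): first byte D8h, then a ModR/M byte of the
    form 11-000-xxx where the last three bits select ST(i) (note 1)."""
    assert 0 <= i <= 7, "the x87 stack has eight entries, ST(0)-ST(7)"
    modrm = 0b11_000_000 | i   # mod=11 (register form), reg=000, r/m=i
    return bytes([0xD8, modrm])

print(encode_fadd_st_sti(1).hex())  # 'd8c1'
```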
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FCMOVB ST(0), ST(i)         DAh    C0-C7h              VectorPath  FADD
FCMOVE ST(0), ST(i)         DAh    C8-CFh              VectorPath  FADD
FCMOVBE ST(0), ST(i)        DAh    D0-D7h              VectorPath  FADD
FCMOVU ST(0), ST(i)         DAh    D8-DFh              VectorPath  FADD
FCMOVNB ST(0), ST(i)        DBh    C0-C7h              VectorPath  FADD
FCMOVNE ST(0), ST(i)        DBh    C8-CFh              VectorPath  FADD
FCMOVNBE ST(0), ST(i)       DBh    D0-D7h              VectorPath  FADD
FCMOVNU ST(0), ST(i)        DBh    D8-DFh              VectorPath  FADD
FCOM ST(i)                  D8h            11-010-xxx  DirectPath  FADD              1
FCOMP ST(i)                 D8h            11-011-xxx  DirectPath  FADD              1
FCOM [mem32real]            D8h            mm-010-xxx  DirectPath  FADD
FCOM [mem64real]            DCh            mm-010-xxx  DirectPath  FADD
FCOMI ST, ST(i)             DBh    F0-F7h              VectorPath  FADD
FCOMIP ST, ST(i)            DFh    F0-F7h              VectorPath  FADD
FCOMP [mem32real]           D8h            mm-011-xxx  DirectPath  FADD
FCOMP [mem64real]           DCh            mm-011-xxx  DirectPath  FADD
FCOMPP                      DEh            11-011-001  DirectPath  FADD
FCOS                        D9h    FFh                 VectorPath
FDECSTP                     D9h    F6h                 DirectPath  FADD/FMUL/FSTORE
FDIV ST, ST(i)              D8h            11-110-xxx  DirectPath  FMUL              1
FDIV ST(i), ST              DCh            11-111-xxx  DirectPath  FMUL              1
FDIV [mem32real]            D8h            mm-110-xxx  DirectPath  FMUL
FDIV [mem64real]            DCh            mm-110-xxx  DirectPath  FMUL
FDIVP ST(i), ST             DEh            11-111-xxx  DirectPath  FMUL              1
FDIVR ST, ST(i)             D8h            11-111-xxx  DirectPath  FMUL              1
FDIVR ST(i), ST             DCh            11-110-xxx  DirectPath  FMUL              1
FDIVR [mem32real]           D8h            mm-111-xxx  DirectPath  FMUL
FDIVR [mem64real]           DCh            mm-111-xxx  DirectPath  FMUL
FDIVRP ST(i), ST            DEh            11-110-xxx  DirectPath  FMUL              1
FFREE ST(i)                 DDh            11-000-xxx  DirectPath  FADD/FMUL/FSTORE  1
FFREEP ST(i)                DFh    C0-C7h              DirectPath  FADD/FMUL/FSTORE  1
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FIADD [mem32int]            DAh            mm-000-xxx  VectorPath
FIADD [mem16int]            DEh            mm-000-xxx  VectorPath
FICOM [mem32int]            DAh            mm-010-xxx  VectorPath
FICOM [mem16int]            DEh            mm-010-xxx  VectorPath
FICOMP [mem32int]           DAh            mm-011-xxx  VectorPath
FICOMP [mem16int]           DEh            mm-011-xxx  VectorPath
FIDIV [mem32int]            DAh            mm-110-xxx  VectorPath
FIDIV [mem16int]            DEh            mm-110-xxx  VectorPath
FIDIVR [mem32int]           DAh            mm-111-xxx  VectorPath
FIDIVR [mem16int]           DEh            mm-111-xxx  VectorPath
FILD [mem16int]             DFh            mm-000-xxx  DirectPath  FSTORE
FILD [mem32int]             DBh            mm-000-xxx  DirectPath  FSTORE
FILD [mem64int]             DFh            mm-101-xxx  DirectPath  FSTORE
FIMUL [mem32int]            DAh            mm-001-xxx  VectorPath
FIMUL [mem16int]            DEh            mm-001-xxx  VectorPath
FINCSTP                     D9h    F7h                 DirectPath  FADD/FMUL/FSTORE
FINIT                       DBh    E3h                 VectorPath
FIST [mem16int]             DFh            mm-010-xxx  DirectPath  FSTORE
FIST [mem32int]             DBh            mm-010-xxx  DirectPath  FSTORE
FISTP [mem16int]            DFh            mm-011-xxx  DirectPath  FSTORE
FISTP [mem32int]            DBh            mm-011-xxx  DirectPath  FSTORE
FISTP [mem64int]            DFh            mm-111-xxx  DirectPath  FSTORE
FISUB [mem32int]            DAh            mm-100-xxx  VectorPath
FISUB [mem16int]            DEh            mm-100-xxx  VectorPath
FISUBR [mem32int]           DAh            mm-101-xxx  VectorPath
FISUBR [mem16int]           DEh            mm-101-xxx  VectorPath
FLD ST(i)                   D9h            11-000-xxx  DirectPath  FADD/FMUL         1
FLD [mem32real]             D9h            mm-000-xxx  DirectPath  FADD/FMUL/FSTORE
FLD [mem64real]             DDh            mm-000-xxx  DirectPath  FADD/FMUL/FSTORE
FLD [mem80real]             DBh            mm-101-xxx  VectorPath
FLD1                        D9h    E8h                 DirectPath  FSTORE
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FLDCW [mem16]               D9h            mm-101-xxx  VectorPath
FLDENV [mem14byte]          D9h            mm-100-xxx  VectorPath
FLDENV [mem28byte]          D9h            mm-100-xxx  VectorPath
FLDL2E                      D9h    EAh                 DirectPath  FSTORE
FLDL2T                      D9h    E9h                 DirectPath  FSTORE
FLDLG2                      D9h    ECh                 DirectPath  FSTORE
FLDLN2                      D9h    EDh                 DirectPath  FSTORE
FLDPI                       D9h    EBh                 DirectPath  FSTORE
FLDZ                        D9h    EEh                 DirectPath  FSTORE
FMUL ST, ST(i)              D8h            11-001-xxx  DirectPath  FMUL              1
FMUL ST(i), ST              DCh            11-001-xxx  DirectPath  FMUL              1
FMUL [mem32real]            D8h            mm-001-xxx  DirectPath  FMUL
FMUL [mem64real]            DCh            mm-001-xxx  DirectPath  FMUL
FMULP ST(i), ST             DEh            11-001-xxx  DirectPath  FMUL              1
FNOP                        D9h    D0h                 DirectPath  FADD/FMUL/FSTORE
FPTAN                       D9h    F2h                 VectorPath
FPATAN                      D9h    F3h                 VectorPath
FPREM                       D9h    F8h                 DirectPath  FMUL
FPREM1                      D9h    F5h                 DirectPath  FMUL
FRNDINT                     D9h    FCh                 VectorPath
FRSTOR [mem94byte]          DDh            mm-100-xxx  VectorPath
FRSTOR [mem108byte]         DDh            mm-100-xxx  VectorPath
FSAVE [mem94byte]           DDh            mm-110-xxx  VectorPath
FSAVE [mem108byte]          DDh            mm-110-xxx  VectorPath
FSCALE                      D9h    FDh                 VectorPath
FSIN                        D9h    FEh                 VectorPath
FSINCOS                     D9h    FBh                 VectorPath
FSQRT                       D9h    FAh                 DirectPath  FMUL
FST [mem32real]             D9h            mm-010-xxx  DirectPath  FSTORE
FST [mem64real]             DDh            mm-010-xxx  DirectPath  FSTORE
FST ST(i)                   DDh            11-010-xxx  DirectPath  FADD/FMUL
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 22. Floating-Point Instructions (Continued)

Instruction Mnemonic        First  Second  ModR/M      Decode      FPU Pipe(s)       Note
                            Byte   Byte    Byte        Type
FSTCW [mem16]               D9h            mm-111-xxx  VectorPath
FSTENV [mem14byte]          D9h            mm-110-xxx  VectorPath
FSTENV [mem28byte]          D9h            mm-110-xxx  VectorPath
FSTP [mem32real]            D9h            mm-011-xxx  DirectPath  FADD/FMUL
FSTP [mem64real]            DDh            mm-011-xxx  DirectPath  FADD/FMUL
FSTP [mem80real]            DBh            mm-111-xxx  VectorPath
FSTP ST(i)                  DDh            11-011-xxx  DirectPath  FADD/FMUL
FSTSW AX                    DFh    E0h                 VectorPath
FSTSW [mem16]               DDh            mm-111-xxx  VectorPath
FSUB [mem32real]            D8h            mm-100-xxx  DirectPath  FADD
FSUB [mem64real]            DCh            mm-100-xxx  DirectPath  FADD
FSUB ST, ST(i)              D8h            11-100-xxx  DirectPath  FADD              1
FSUB ST(i), ST              DCh            11-101-xxx  DirectPath  FADD              1
FSUBP ST(i), ST             DEh            11-101-xxx  DirectPath  FADD              1
FSUBR [mem32real]           D8h            mm-101-xxx  DirectPath  FADD
FSUBR [mem64real]           DCh            mm-101-xxx  DirectPath  FADD
FSUBR ST, ST(i)             D8h            11-101-xxx  DirectPath  FADD              1
FSUBR ST(i), ST             DCh            11-100-xxx  DirectPath  FADD              1
FSUBRP ST(i), ST            DEh            11-100-xxx  DirectPath  FADD              1
FTST                        D9h    E4h                 DirectPath  FADD
FUCOM                       DDh            11-100-xxx  DirectPath  FADD
FUCOMI ST, ST(i)            DBh    E8-EFh              VectorPath  FADD
FUCOMIP ST, ST(i)           DFh    E8-EFh              VectorPath  FADD
FUCOMP                      DDh            11-101-xxx  DirectPath  FADD
FUCOMPP                     DAh    E9h                 DirectPath  FADD
FWAIT                       9Bh                        DirectPath
FXAM                        D9h    E5h                 VectorPath
FXCH                        D9h            11-001-xxx  DirectPath  FADD/FMUL/FSTORE
FXTRACT                     D9h    F4h                 VectorPath
FYL2X                       D9h    F1h                 VectorPath
FYL2XP1                     D9h    F9h                 VectorPath
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
Table 23. 3DNow!™ Instructions

Instruction Mnemonic          Prefix     imm8  ModR/M      Decode      FPU Pipe(s)       Note
                              Byte(s)          Byte        Type
FEMMS                         0Fh        0Eh               DirectPath  FADD/FMUL/FSTORE  2
PAVGUSB mmreg1, mmreg2        0Fh, 0Fh   BFh   11-xxx-xxx  DirectPath  FADD/FMUL
PAVGUSB mmreg, mem64          0Fh, 0Fh   BFh   mm-xxx-xxx  DirectPath  FADD/FMUL
PF2ID mmreg1, mmreg2          0Fh, 0Fh   1Dh   11-xxx-xxx  DirectPath  FADD
PF2ID mmreg, mem64            0Fh, 0Fh   1Dh   mm-xxx-xxx  DirectPath  FADD
PFACC mmreg1, mmreg2          0Fh, 0Fh   AEh   11-xxx-xxx  DirectPath  FADD
PFACC mmreg, mem64            0Fh, 0Fh   AEh   mm-xxx-xxx  DirectPath  FADD
PFADD mmreg1, mmreg2          0Fh, 0Fh   9Eh   11-xxx-xxx  DirectPath  FADD
PFADD mmreg, mem64            0Fh, 0Fh   9Eh   mm-xxx-xxx  DirectPath  FADD
PFCMPEQ mmreg1, mmreg2        0Fh, 0Fh   B0h   11-xxx-xxx  DirectPath  FADD
PFCMPEQ mmreg, mem64          0Fh, 0Fh   B0h   mm-xxx-xxx  DirectPath  FADD
PFCMPGE mmreg1, mmreg2        0Fh, 0Fh   90h   11-xxx-xxx  DirectPath  FADD
PFCMPGE mmreg, mem64          0Fh, 0Fh   90h   mm-xxx-xxx  DirectPath  FADD
PFCMPGT mmreg1, mmreg2        0Fh, 0Fh   A0h   11-xxx-xxx  DirectPath  FADD
PFCMPGT mmreg, mem64          0Fh, 0Fh   A0h   mm-xxx-xxx  DirectPath  FADD
PFMAX mmreg1, mmreg2          0Fh, 0Fh   A4h   11-xxx-xxx  DirectPath  FADD
PFMAX mmreg, mem64            0Fh, 0Fh   A4h   mm-xxx-xxx  DirectPath  FADD
PFMIN mmreg1, mmreg2          0Fh, 0Fh   94h   11-xxx-xxx  DirectPath  FADD
PFMIN mmreg, mem64            0Fh, 0Fh   94h   mm-xxx-xxx  DirectPath  FADD
PFMUL mmreg1, mmreg2          0Fh, 0Fh   B4h   11-xxx-xxx  DirectPath  FMUL
PFMUL mmreg, mem64            0Fh, 0Fh   B4h   mm-xxx-xxx  DirectPath  FMUL
PFRCP mmreg1, mmreg2          0Fh, 0Fh   96h   11-xxx-xxx  DirectPath  FMUL
PFRCP mmreg, mem64            0Fh, 0Fh   96h   mm-xxx-xxx  DirectPath  FMUL
PFRCPIT1 mmreg1, mmreg2       0Fh, 0Fh   A6h   11-xxx-xxx  DirectPath  FMUL
PFRCPIT1 mmreg, mem64         0Fh, 0Fh   A6h   mm-xxx-xxx  DirectPath  FMUL
PFRCPIT2 mmreg1, mmreg2       0Fh, 0Fh   B6h   11-xxx-xxx  DirectPath  FMUL
PFRCPIT2 mmreg, mem64         0Fh, 0Fh   B6h   mm-xxx-xxx  DirectPath  FMUL
PFRSQIT1 mmreg1, mmreg2       0Fh, 0Fh   A7h   11-xxx-xxx  DirectPath  FMUL
PFRSQIT1 mmreg, mem64         0Fh, 0Fh   A7h   mm-xxx-xxx  DirectPath  FMUL
PFRSQRT mmreg1, mmreg2        0Fh, 0Fh   97h   11-xxx-xxx  DirectPath  FMUL
Notes:
1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
2. The byte listed in the column titled "imm8" is actually the opcode byte.
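Per note 2, a 3DNow! instruction is encoded as the 0Fh 0Fh prefix, a ModR/M byte (plus any addressing bytes the memory form requires), and then the suffix opcode from the "imm8" column in the position an immediate byte would occupy. A sketch of the register form only (the helper name is illustrative, not from the manual):

```python
def encode_3dnow_reg(suffix, dst, src):
    """Encode a register-form 3DNow! instruction: 0Fh 0Fh prefix, a
    ModR/M byte of the form 11-dst-src, then the suffix opcode taken
    from the "imm8" column (note 2) -- e.g., B4h for PFMUL."""
    modrm = 0b11000000 | (dst << 3) | src  # mod=11: register operands
    return bytes([0x0F, 0x0F, modrm, suffix])

# PFMUL mm0, mm1
print(encode_3dnow_reg(0xB4, 0, 1).hex())  # '0f0fc1b4'
```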
   
Table 23. 3DNow!™ Instructions (Continued)

Instruction Mnemonic          Prefix     imm8  ModR/M      Decode      FPU Pipe(s)  Note
                              Byte(s)          Byte        Type
PFRSQRT mmreg, mem64          0Fh, 0Fh   97h   mm-xxx-xxx  DirectPath  FMUL
PFSUB mmreg1, mmreg2          0Fh, 0Fh   9Ah   11-xxx-xxx  DirectPath  FADD
PFSUB mmreg, mem64            0Fh, 0Fh   9Ah   mm-xxx-xxx  DirectPath  FADD
PFSUBR mmreg1, mmreg2         0Fh, 0Fh   AAh   11-xxx-xxx  DirectPath  FADD
PFSUBR mmreg, mem64           0Fh, 0Fh   AAh   mm-xxx-xxx  DirectPath  FADD
PI2FD mmreg1, mmreg2          0Fh, 0Fh   0Dh   11-xxx-xxx  DirectPath  FADD
PI2FD mmreg, mem64            0Fh, 0Fh   0Dh   mm-xxx-xxx  DirectPath  FADD
PMULHRW mmreg1, mmreg2        0Fh, 0Fh   B7h   11-xxx-xxx  DirectPath  FMUL
PMULHRW mmreg, mem64          0Fh, 0Fh   B7h   mm-xxx-xxx  DirectPath  FMUL
PREFETCH mem8                 0Fh        0Dh   mm-000-xxx  DirectPath  -            1, 2
PREFETCHW mem8                0Fh        0Dh   mm-001-xxx  DirectPath  -            1, 2
Notes:
1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
2. The byte listed in the column titled "imm8" is actually the opcode byte.
Table 24. 3DNow!™ Extensions

Instruction Mnemonic          Prefix     imm8  ModR/M      Decode      FPU Pipe(s)
                              Byte(s)          Byte        Type
PF2IW mmreg1, mmreg2          0Fh, 0Fh   1Ch   11-xxx-xxx  DirectPath  FADD
PF2IW mmreg, mem64            0Fh, 0Fh   1Ch   mm-xxx-xxx  DirectPath  FADD
PFNACC mmreg1, mmreg2         0Fh, 0Fh   8Ah   11-xxx-xxx  DirectPath  FADD
PFNACC mmreg, mem64           0Fh, 0Fh   8Ah   mm-xxx-xxx  DirectPath  FADD
PFPNACC mmreg1, mmreg2        0Fh, 0Fh   8Eh   11-xxx-xxx  DirectPath  FADD
PFPNACC mmreg, mem64          0Fh, 0Fh   8Eh   mm-xxx-xxx  DirectPath  FADD
PI2FW mmreg1, mmreg2          0Fh, 0Fh   0Ch   11-xxx-xxx  DirectPath  FADD
PI2FW mmreg, mem64            0Fh, 0Fh   0Ch   mm-xxx-xxx  DirectPath  FADD
PSWAPD mmreg1, mmreg2         0Fh, 0Fh   BBh   11-xxx-xxx  DirectPath  FADD/FMUL
PSWAPD mmreg, mem64           0Fh, 0Fh   BBh   mm-xxx-xxx  DirectPath  FADD/FMUL
   
Appendix G
DirectPath versus VectorPath Instructions
Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execution efficiency because they minimize the number of operations per x86 instruction. This includes "register ← register op memory" as well as "register ← register op register" forms of instructions.
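One practical way to apply this guidance is to audit an instruction trace against the decode-type tables. The sketch below is illustrative only: it hand-copies a few rows from Tables 19 and 25 into a lookup (the tables themselves remain the authority) and flags the VectorPath instructions, which are candidates for replacement with DirectPath equivalents:

```python
# A few decode types copied by hand from Tables 19 and 25; the full
# tables in this appendix are the authoritative source.
DECODE_TYPE = {
    "MOV reg16/32, mreg16/32": "DirectPath",
    "ADD mreg16/32, reg16/32": "DirectPath",
    "XCHG EAX, ECX": "VectorPath",
    "XLAT": "VectorPath",
}

def vectorpath_ops(trace):
    """Return the instructions in a trace that decode as VectorPath
    and are therefore candidates for DirectPath replacements."""
    return [op for op in trace if DECODE_TYPE.get(op) == "VectorPath"]

print(vectorpath_ops(["ADD mreg16/32, reg16/32", "XCHG EAX, ECX", "XLAT"]))
# ['XCHG EAX, ECX', 'XLAT']
```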
DirectPath Instructions

The following tables contain the DirectPath instructions, which should be used in the AMD Athlon processor wherever possible. All 3DNow! instructions, including the 3DNow! extensions, are DirectPath and are listed in Table 23, "3DNow!™ Instructions", and Table 24, "3DNow!™ Extensions".
Table 25. DirectPath Integer Instructions  
Instruction Mnemonic  
AND mreg16/32, reg16/32  
AND mem16/32, reg16/32  
AND reg8, mreg8  
Instruction Mnemonic  
ADC mreg8, reg8  
ADC mem8, reg8  
ADC mreg16/32, reg16/32  
ADC mem16/32, reg16/32  
ADC reg8, mreg8  
AND reg8, mem8  
AND reg16/32, mreg16/32  
AND reg16/32, mem16/32  
AND AL, imm8  
ADC reg8, mem8  
ADC reg16/32, mreg16/32  
ADC reg16/32, mem16/32  
ADC AL, imm8  
AND EAX, imm16/32  
AND mreg8, imm8  
AND mem8, imm8  
ADC EAX, imm16/32  
AND mreg16/32, imm16/32  
AND mem16/32, imm16/32  
AND mreg16/32, imm8 (sign extended)  
AND mem16/32, imm8 (sign extended)  
BSWAP EAX  
ADC mreg8, imm8  
ADC mem8, imm8  
ADC mreg16/32, imm16/32  
ADC mem16/32, imm16/32  
ADC mreg16/32, imm8 (sign extended)  
ADC mem16/32, imm8 (sign extended)  
ADD mreg8, reg8  
BSWAP ECX  
BSWAP EDX  
BSWAP EBX  
ADD mem8, reg8  
BSWAP ESP  
ADD mreg16/32, reg16/32  
ADD mem16/32, reg16/32  
ADD reg8, mreg8  
BSWAP EBP  
BSWAP ESI  
BSWAP EDI  
ADD reg8, mem8  
BT mreg16/32, reg16/32  
BT mreg16/32, imm8  
ADD reg16/32, mreg16/32  
ADD reg16/32, mem16/32  
ADD AL, imm8  
BT mem16/32, imm8  
CBW/CWDE  
ADD EAX, imm16/32  
CLC  
ADD mreg8, imm8  
CMC  
ADD mem8, imm8  
CMOVA/CMOVBE reg16/32, reg16/32  
CMOVA/CMOVBE reg16/32, mem16/32  
CMOVAE/CMOVNB/CMOVNC reg16/32, mem16/32  
CMOVAE/CMOVNB/CMOVNC mem16/32, mem16/32  
CMOVB/CMOVC/CMOVNAE reg16/32, reg16/32  
CMOVB/CMOVC/CMOVNAE mem16/32, reg16/32  
ADD mreg16/32, imm16/32  
ADD mem16/32, imm16/32  
ADD mreg16/32, imm8 (sign extended)  
ADD mem16/32, imm8 (sign extended)  
AND mreg8, reg8  
AND mem8, reg8  
220  
DirectPath Instructions  
Download from Www.Somanuals.com. All Manuals Search And Download.  
 
22007E/0November 1999  
AMD AthlonProcessor x86 Code Optimization  
Table 25. DirectPath Integer Instructions (Continued)
Instruction Mnemonic  
CMOVBE/CMOVNA reg16/32, reg16/32  
CMOVBE/CMOVNA reg16/32, mem16/32  
CMOVE/CMOVZ reg16/32, reg16/32  
CMOVE/CMOVZ reg16/32, mem16/32  
CMOVG/CMOVNLE reg16/32, reg16/32  
CMOVG/CMOVNLE reg16/32, mem16/32  
CMOVGE/CMOVNL reg16/32, reg16/32  
CMOVGE/CMOVNL reg16/32, mem16/32  
CMOVL/CMOVNGE reg16/32, reg16/32  
CMOVL/CMOVNGE reg16/32, mem16/32  
CMOVLE/CMOVNG reg16/32, reg16/32  
CMOVLE/CMOVNG reg16/32, mem16/32  
CMOVNE/CMOVNZ reg16/32, reg16/32  
CMOVNE/CMOVNZ reg16/32, mem16/32  
CMOVNO reg16/32, reg16/32  
CMOVNO reg16/32, mem16/32  
CMOVNP/CMOVPO reg16/32, reg16/32  
CMOVNP/CMOVPO reg16/32, mem16/32  
CMOVNS reg16/32, reg16/32  
CMOVNS reg16/32, mem16/32  
CMOVO reg16/32, reg16/32  
CMOVO reg16/32, mem16/32  
CMOVP/CMOVPE reg16/32, reg16/32  
CMOVP/CMOVPE reg16/32, mem16/32  
CMOVS reg16/32, reg16/32  
CMOVS reg16/32, mem16/32  
CMP mreg8, reg8  
CMP mem8, reg8  
CMP mreg16/32, reg16/32  
CMP mem16/32, reg16/32  
CMP reg8, mreg8  
CMP reg8, mem8  
CMP reg16/32, mreg16/32  
CMP reg16/32, mem16/32  
CMP AL, imm8  
CMP EAX, imm16/32  
CMP mreg8, imm8  
CMP mem8, imm8  
CMP mreg16/32, imm16/32  
CMP mem16/32, imm16/32  
CMP mreg16/32, imm8 (sign extended)  
CMP mem16/32, imm8 (sign extended)  
CWD/CDQ  
DEC EAX  
DEC ECX  
DEC EDX  
DEC EBX  
DEC ESP  
DEC EBP  
DEC ESI  
DEC EDI  
DEC mreg8  
DEC mem8  
DEC mreg16/32  
DEC mem16/32  
INC EAX  
INC ECX  
INC EDX  
INC EBX  
INC ESP  
INC EBP  
INC ESI  
INC EDI  
INC mreg8  
INC mem8  
INC mreg16/32  
INC mem16/32  
JO short disp8  
JNO short disp8  
JB/JNAE short disp8  
JNB/JAE short disp8  
JZ/JE short disp8  
JNZ/JNE short disp8  
JBE/JNA short disp8  
JNBE/JA short disp8  
JS short disp8  
JNS short disp8  
JP/JPE short disp8  
JNP/JPO short disp8  
JL/JNGE short disp8  
JNL/JGE short disp8  
JLE/JNG short disp8  
JNLE/JG short disp8  
JO near disp16/32  
JNO near disp16/32  
JB/JNAE near disp16/32  
JNB/JAE near disp16/32  
JZ/JE near disp16/32  
JNZ/JNE near disp16/32  
JBE/JNA near disp16/32  
JNBE/JA near disp16/32  
JS near disp16/32  
JNS near disp16/32  
JP/JPE near disp16/32  
JNP/JPO near disp16/32  
JL/JNGE near disp16/32  
JNL/JGE near disp16/32  
JLE/JNG near disp16/32  
JNLE/JG near disp16/32  
JMP near disp16/32 (direct)  
JMP far disp32/48 (direct)  
JMP disp8 (short)  
JMP near mreg16/32 (indirect)  
JMP near mem16/32 (indirect)  
LEA reg32, mem16/32  
MOV mreg8, reg8  
MOV mem8, reg8  
MOV mreg16/32, reg16/32  
MOV mem16/32, reg16/32  
MOV reg8, mreg8  
MOV reg8, mem8  
MOV reg16/32, mreg16/32  
MOV reg16/32, mem16/32  
MOV AL, mem8  
MOV EAX, mem16/32  
MOV mem8, AL  
MOV mem16/32, EAX  
MOV AL, imm8  
MOV CL, imm8  
MOV DL, imm8  
MOV BL, imm8  
MOV AH, imm8  
MOV CH, imm8  
MOV DH, imm8  
MOV BH, imm8  
MOV EAX, imm16/32  
MOV ECX, imm16/32  
MOV EDX, imm16/32  
MOV EBX, imm16/32  
MOV ESP, imm16/32  
MOV EBP, imm16/32  
MOV ESI, imm16/32  
MOV EDI, imm16/32  
MOV mreg8, imm8  
MOV mem8, imm8  
MOV mreg16/32, imm16/32  
MOV mem16/32, imm16/32  
MOVSX reg16/32, mreg8  
MOVSX reg16/32, mem8  
MOVSX reg32, mreg16  
MOVSX reg32, mem16  
MOVZX reg16/32, mreg8  
MOVZX reg16/32, mem8  
MOVZX reg32, mreg16  
MOVZX reg32, mem16  
NEG mreg8  
NEG mem8  
NEG mreg16/32  
NEG mem16/32  
NOP (XCHG EAX, EAX)  
NOT mreg8  
NOT mem8  
NOT mreg16/32  
NOT mem16/32  
OR mreg8, reg8  
OR mem8, reg8  
OR mreg16/32, reg16/32  
OR mem16/32, reg16/32  
OR reg8, mreg8  
OR reg8, mem8  
OR reg16/32, mreg16/32  
OR reg16/32, mem16/32  
OR AL, imm8  
OR EAX, imm16/32  
OR mreg8, imm8  
OR mem8, imm8  
OR mreg16/32, imm16/32  
OR mem16/32, imm16/32  
OR mreg16/32, imm8 (sign extended)  
OR mem16/32, imm8 (sign extended)  
PUSH EAX  
PUSH ECX  
PUSH EDX  
PUSH EBX  
PUSH ESP  
PUSH EBP  
PUSH ESI  
PUSH EDI  
PUSH imm8  
PUSH imm16/32  
RCL mreg8, imm8  
RCL mreg16/32, imm8  
RCL mreg8, 1  
RCL mem8, 1  
RCL mreg16/32, 1  
RCL mem16/32, 1  
RCL mreg8, CL  
RCL mreg16/32, CL  
RCR mreg8, imm8  
RCR mreg16/32, imm8  
RCR mreg8, 1  
RCR mem8, 1  
RCR mreg16/32, 1  
RCR mem16/32, 1  
RCR mreg8, CL  
RCR mreg16/32, CL  
ROL mreg8, imm8  
ROL mem8, imm8  
ROL mreg16/32, imm8  
ROL mem16/32, imm8  
ROL mreg8, 1  
ROL mem8, 1  
ROL mreg16/32, 1  
ROL mem16/32, 1  
ROL mreg8, CL  
ROL mem8, CL  
ROL mreg16/32, CL  
ROL mem16/32, CL  
ROR mreg8, imm8  
ROR mem8, imm8  
ROR mreg16/32, imm8  
ROR mem16/32, imm8  
ROR mreg8, 1  
ROR mem8, 1  
ROR mreg16/32, 1  
ROR mem16/32, 1  
ROR mreg8, CL  
ROR mem8, CL  
ROR mreg16/32, CL  
ROR mem16/32, CL  
SAR mreg8, imm8  
SAR mem8, imm8  
SAR mreg16/32, imm8  
SAR mem16/32, imm8  
SAR mreg8, 1  
SAR mem8, 1  
SAR mreg16/32, 1  
SAR mem16/32, 1  
SAR mreg8, CL  
SAR mem8, CL  
SAR mreg16/32, CL  
SAR mem16/32, CL  
SBB mreg8, reg8  
SBB mem8, reg8  
SBB mreg16/32, reg16/32  
SBB mem16/32, reg16/32  
SBB reg8, mreg8  
SBB reg8, mem8  
SBB reg16/32, mreg16/32  
SBB reg16/32, mem16/32  
SBB AL, imm8  
SBB EAX, imm16/32  
SBB mreg8, imm8  
SBB mem8, imm8  
SBB mreg16/32, imm16/32  
SBB mem16/32, imm16/32  
SBB mreg16/32, imm8 (sign extended)  
SBB mem16/32, imm8 (sign extended)  
SETO mreg8  
SETO mem8  
SETNO mreg8  
SETNO mem8  
SETB/SETC/SETNAE mreg8  
SETB/SETC/SETNAE mem8  
SETAE/SETNB/SETNC mreg8  
SETAE/SETNB/SETNC mem8  
SETE/SETZ mreg8  
SETE/SETZ mem8  
SETNE/SETNZ mreg8  
SETNE/SETNZ mem8  
SETBE/SETNA mreg8  
SETBE/SETNA mem8  
SETA/SETNBE mreg8  
SETA/SETNBE mem8  
SETS mreg8  
SETS mem8  
SETNS mreg8  
SETNS mem8  
SETP/SETPE mreg8  
SETP/SETPE mem8  
SETNP/SETPO mreg8  
SETNP/SETPO mem8  
SETL/SETNGE mreg8  
SETL/SETNGE mem8  
SETGE/SETNL mreg8  
SETGE/SETNL mem8  
SETLE/SETNG mreg8  
SETLE/SETNG mem8  
SETG/SETNLE mreg8  
SETG/SETNLE mem8  
SHL/SAL mreg8, imm8  
SHL/SAL mem8, imm8  
SHL/SAL mreg16/32, imm8  
SHL/SAL mem16/32, imm8  
SHL/SAL mreg8, 1  
SHL/SAL mem8, 1  
SHL/SAL mreg16/32, 1  
SHL/SAL mem16/32, 1  
SHL/SAL mreg8, CL  
SHL/SAL mem8, CL  
SHL/SAL mreg16/32, CL  
SHL/SAL mem16/32, CL  
SHR mreg8, imm8  
SHR mem8, imm8  
SHR mreg16/32, imm8  
SHR mem16/32, imm8  
SHR mreg8, 1  
SHR mem8, 1  
SHR mreg16/32, 1  
SHR mem16/32, 1  
SHR mreg8, CL  
SHR mem8, CL  
SHR mreg16/32, CL  
SHR mem16/32, CL  
STC  
SUB mreg8, reg8  
SUB mem8, reg8  
SUB mreg16/32, reg16/32  
SUB mem16/32, reg16/32  
SUB reg8, mreg8  
SUB reg8, mem8  
SUB reg16/32, mreg16/32  
SUB reg16/32, mem16/32  
SUB AL, imm8  
SUB EAX, imm16/32  
SUB mreg8, imm8  
SUB mem8, imm8  
SUB mreg16/32, imm16/32  
SUB mem16/32, imm16/32  
SUB mreg16/32, imm8 (sign extended)  
SUB mem16/32, imm8 (sign extended)  
TEST mreg8, reg8  
TEST mem8, reg8  
TEST mreg16/32, reg16/32  
TEST mem16/32, reg16/32  
TEST AL, imm8  
TEST EAX, imm16/32  
TEST mreg8, imm8  
TEST mem8, imm8  
TEST mreg16/32, imm16/32  
TEST mem16/32, imm16/32  
WAIT  
XCHG EAX, EAX  
XOR mreg8, reg8  
XOR mem8, reg8  
XOR mreg16/32, reg16/32  
XOR mem16/32, reg16/32  
XOR reg8, mreg8  
XOR reg8, mem8  
XOR reg16/32, mreg16/32  
XOR reg16/32, mem16/32  
XOR AL, imm8  
XOR EAX, imm16/32  
XOR mreg8, imm8  
XOR mem8, imm8  
XOR mreg16/32, imm16/32  
XOR mem16/32, imm16/32  
XOR mreg16/32, imm8 (sign extended)  
XOR mem16/32, imm8 (sign extended)  
Table 26. DirectPath MMX™ Instructions  
Instruction Mnemonic  
EMMS  
MOVD mmreg, mem32  
MOVD mem32, mmreg  
MOVQ mmreg1, mmreg2  
MOVQ mmreg, mem64  
MOVQ mmreg2, mmreg1  
MOVQ mem64, mmreg  
PACKSSDW mmreg1, mmreg2  
PACKSSDW mmreg, mem64  
PACKSSWB mmreg1, mmreg2  
PACKSSWB mmreg, mem64  
PACKUSWB mmreg1, mmreg2  
PACKUSWB mmreg, mem64  
PADDB mmreg1, mmreg2  
PADDB mmreg, mem64  
PADDD mmreg1, mmreg2  
PADDD mmreg, mem64  
PADDSB mmreg1, mmreg2  
PADDSB mmreg, mem64  
PADDSW mmreg1, mmreg2  
PADDSW mmreg, mem64  
PADDUSB mmreg1, mmreg2  
PADDUSB mmreg, mem64  
PADDUSW mmreg1, mmreg2  
PADDUSW mmreg, mem64  
PADDW mmreg1, mmreg2  
PADDW mmreg, mem64  
PAND mmreg1, mmreg2  
PAND mmreg, mem64  
PANDN mmreg1, mmreg2  
PANDN mmreg, mem64  
PCMPEQB mmreg1, mmreg2  
PCMPEQB mmreg, mem64  
PCMPEQD mmreg1, mmreg2  
PCMPEQD mmreg, mem64  
PCMPEQW mmreg1, mmreg2  
PCMPEQW mmreg, mem64  
PCMPGTB mmreg1, mmreg2  
PCMPGTB mmreg, mem64  
PCMPGTD mmreg1, mmreg2  
PCMPGTD mmreg, mem64  
PCMPGTW mmreg1, mmreg2  
PCMPGTW mmreg, mem64  
PMADDWD mmreg1, mmreg2  
PMADDWD mmreg, mem64  
PMULHW mmreg1, mmreg2  
PMULHW mmreg, mem64  
PMULLW mmreg1, mmreg2  
PMULLW mmreg, mem64  
POR mmreg1, mmreg2  
POR mmreg, mem64  
PSLLD mmreg1, mmreg2  
PSLLD mmreg, mem64  
PSLLD mmreg, imm8  
PSLLQ mmreg1, mmreg2  
PSLLQ mmreg, mem64  
PSLLQ mmreg, imm8  
PSLLW mmreg1, mmreg2  
PSLLW mmreg, mem64  
PSLLW mmreg, imm8  
PSRAW mmreg1, mmreg2  
PSRAW mmreg, mem64  
PSRAW mmreg, imm8  
PSRAD mmreg1, mmreg2  
PSRAD mmreg, mem64  
PSRAD mmreg, imm8  
PSRLD mmreg1, mmreg2  
PSRLD mmreg, mem64  
PSRLD mmreg, imm8  
PSRLQ mmreg1, mmreg2  
PSRLQ mmreg, mem64  
PSRLQ mmreg, imm8  
PSRLW mmreg1, mmreg2  
PSRLW mmreg, mem64  
PSRLW mmreg, imm8  
PSUBB mmreg1, mmreg2  
PSUBB mmreg, mem64  
PSUBD mmreg1, mmreg2  
PSUBD mmreg, mem64  
PSUBSB mmreg1, mmreg2  
PSUBSB mmreg, mem64  
PSUBSW mmreg1, mmreg2  
PSUBSW mmreg, mem64  
PSUBUSB mmreg1, mmreg2  
PSUBUSB mmreg, mem64  
PSUBUSW mmreg1, mmreg2  
PSUBUSW mmreg, mem64  
PSUBW mmreg1, mmreg2  
PSUBW mmreg, mem64  
PUNPCKHBW mmreg1, mmreg2  
PUNPCKHBW mmreg, mem64  
PUNPCKHDQ mmreg1, mmreg2  
PUNPCKHDQ mmreg, mem64  
PUNPCKHWD mmreg1, mmreg2  
PUNPCKHWD mmreg, mem64  
PUNPCKLBW mmreg1, mmreg2  
PUNPCKLBW mmreg, mem64  
PUNPCKLDQ mmreg1, mmreg2  
PUNPCKLDQ mmreg, mem64  
PUNPCKLWD mmreg1, mmreg2  
PUNPCKLWD mmreg, mem64  
PXOR mmreg1, mmreg2  
PXOR mmreg, mem64  
Table 27. DirectPath MMX™ Extensions  
Instruction Mnemonic  
MOVNTQ mem64, mmreg  
PAVGB mmreg1, mmreg2  
PAVGB mmreg, mem64  
PAVGW mmreg1, mmreg2  
PAVGW mmreg, mem64  
PMAXSW mmreg1, mmreg2  
PMAXSW mmreg, mem64  
PMAXUB mmreg1, mmreg2  
PMAXUB mmreg, mem64  
PMINSW mmreg1, mmreg2  
PMINSW mmreg, mem64  
PMINUB mmreg1, mmreg2  
PMINUB mmreg, mem64  
PMULHUW mmreg1, mmreg2  
PMULHUW mmreg, mem64  
PSADBW mmreg1, mmreg2  
PSADBW mmreg, mem64  
PSHUFW mmreg1, mmreg2, imm8  
PSHUFW mmreg, mem64, imm8  
PREFETCHNTA mem8  
PREFETCHT0 mem8  
PREFETCHT1 mem8  
PREFETCHT2 mem8  
Table 28. DirectPath Floating-Point Instructions  
Instruction Mnemonic  
FABS  
FADD ST, ST(i)  
FADD [mem32real]  
FADD ST(i), ST  
FADD [mem64real]  
FADDP ST(i), ST  
FCHS  
FCOM ST(i)  
FCOMP ST(i)  
FCOM [mem32real]  
FCOM [mem64real]  
FCOMP [mem32real]  
FCOMP [mem64real]  
FCOMPP  
FDECSTP  
FDIV ST, ST(i)  
FDIV ST(i), ST  
FDIV [mem32real]  
FDIV [mem64real]  
FDIVP ST, ST(i)  
FDIVR ST, ST(i)  
FDIVR ST(i), ST  
FDIVR [mem32real]  
FDIVR [mem64real]  
FDIVRP ST(i), ST  
FFREE ST(i)  
FFREEP ST(i)  
FILD [mem16int]  
FILD [mem32int]  
FILD [mem64int]  
FIMUL [mem32int]  
FIMUL [mem16int]  
FINCSTP  
FIST [mem16int]  
FIST [mem32int]  
FISTP [mem16int]  
FISTP [mem32int]  
FISTP [mem64int]  
FLD ST(i)  
FLD [mem32real]  
FLD [mem64real]  
FLD [mem80real]  
FLD1  
FLDL2E  
FLDL2T  
FLDLG2  
FLDLN2  
FLDPI  
FLDZ  
FMUL ST, ST(i)  
FMUL ST(i), ST  
FMUL [mem32real]  
FMUL [mem64real]  
FMULP ST, ST(i)  
FNOP  
FPREM  
FPREM1  
FSQRT  
FST [mem32real]  
FST [mem64real]  
FST ST(i)  
FSTP [mem32real]  
FSTP [mem64real]  
FSTP [mem80real]  
FSTP ST(i)  
FSUB [mem32real]  
FSUB [mem64real]  
FSUB ST, ST(i)  
FSUB ST(i), ST  
FSUBP ST, ST(i)  
FSUBR [mem32real]  
FSUBR [mem64real]  
FSUBR ST, ST(i)  
FSUBR ST(i), ST  
FSUBRP ST(i), ST  
FTST  
FUCOM  
FUCOMP  
FUCOMPP  
FWAIT  
FXCH  
VectorPath Instructions  
The following tables list the VectorPath instructions, which  
should be avoided in code optimized for the AMD Athlon™ processor:  
Table 29. VectorPath Integer Instructions  
Instruction Mnemonic  
AAA  
AAD  
AAM  
AAS  
ARPL mreg16, reg16  
ARPL mem16, reg16  
BOUND  
BSF reg16/32, mreg16/32  
BSF reg16/32, mem16/32  
BSR reg16/32, mreg16/32  
BSR reg16/32, mem16/32  
BT mem16/32, reg16/32  
BTC mreg16/32, reg16/32  
BTC mem16/32, reg16/32  
BTC mreg16/32, imm8  
BTC mem16/32, imm8  
BTR mreg16/32, reg16/32  
BTR mem16/32, reg16/32  
BTR mreg16/32, imm8  
BTR mem16/32, imm8  
BTS mreg16/32, reg16/32  
BTS mem16/32, reg16/32  
BTS mreg16/32, imm8  
BTS mem16/32, imm8  
CALL full pointer  
CALL near imm16/32  
CALL mem16:16/32  
CALL near mreg32 (indirect)  
CALL near mem32 (indirect)  
CLD  
CLI  
CLTS  
CMPSB mem8, mem8  
CMPSW mem16, mem16  
CMPSD mem32, mem32  
CMPXCHG mreg8, reg8  
CMPXCHG mem8, reg8  
CMPXCHG mreg16/32, reg16/32  
CMPXCHG mem16/32, reg16/32  
CMPXCHG8B mem64  
CPUID  
DAA  
DAS  
DIV AL, mreg8  
DIV AL, mem8  
DIV EAX, mreg16/32  
DIV EAX, mem16/32  
ENTER  
IDIV mreg8  
IDIV mem8  
IDIV EAX, mreg16/32  
IDIV EAX, mem16/32  
IMUL reg16/32, imm16/32  
IMUL reg16/32, mreg16/32, imm16/32  
IMUL reg16/32, mem16/32, imm16/32  
IMUL reg16/32, imm8 (sign extended)  
IMUL reg16/32, mreg16/32, imm8 (signed)  
IMUL reg16/32, mem16/32, imm8 (signed)  
IMUL AX, AL, mreg8  
IMUL AX, AL, mem8  
IMUL EDX:EAX, EAX, mreg16/32  
IMUL EDX:EAX, EAX, mem16/32  
IMUL reg16/32, mreg16/32  
IMUL reg16/32, mem16/32  
IN AL, imm8  
IN AX, imm8  
IN EAX, imm8  
IN AL, DX  
IN AX, DX  
IN EAX, DX  
INVD  
INVLPG  
JCXZ/JECXZ short disp8  
JMP far disp32/48 (direct)  
JMP far mem32 (indirect)  
JMP far mreg32 (indirect)  
LAHF  
LAR reg16/32, mreg16/32  
LAR reg16/32, mem16/32  
LDS reg16/32, mem32/48  
LEA reg16, mem16/32  
LEAVE  
LES reg16/32, mem32/48  
LFS reg16/32, mem32/48  
LGDT mem48  
LGS reg16/32, mem32/48  
LIDT mem48  
LLDT mreg16  
LLDT mem16  
LMSW mreg16  
LMSW mem16  
LODSB AL, mem8  
LODSW AX, mem16  
LODSD EAX, mem32  
LOOP disp8  
LOOPE/LOOPZ disp8  
LOOPNE/LOOPNZ disp8  
LSL reg16/32, mreg16/32  
LSL reg16/32, mem16/32  
LSS reg16/32, mem32/48  
LTR mreg16  
LTR mem16  
MOV mreg16, segment reg  
MOV mem16, segment reg  
MOV segment reg, mreg16  
MOV segment reg, mem16  
MOVSB mem8, mem8  
MOVSW mem16, mem16  
MOVSD mem32, mem32  
MUL AL, mreg8  
MUL AL, mem8  
MUL AX, mreg16  
MUL AX, mem16  
MUL EAX, mreg32  
MUL EAX, mem32  
OUT imm8, AL  
OUT imm8, AX  
OUT imm8, EAX  
OUT DX, AL  
OUT DX, AX  
OUT DX, EAX  
POP ES  
POP SS  
POP DS  
POP FS  
POP GS  
POP EAX  
POP ECX  
POP EDX  
POP EBX  
POP ESP  
POP EBP  
POP ESI  
POP EDI  
POP mreg16/32  
POP mem16/32  
POPA/POPAD  
POPF/POPFD  
PUSH ES  
PUSH CS  
PUSH FS  
PUSH GS  
PUSH SS  
PUSH DS  
PUSH mreg16/32  
PUSH mem16/32  
PUSHA/PUSHAD  
PUSHF/PUSHFD  
RCL mem8, imm8  
RCL mem16/32, imm8  
RCL mem8, CL  
RCL mem16/32, CL  
RCR mem8, imm8  
RCR mem16/32, imm8  
RCR mem8, CL  
RCR mem16/32, CL  
RDMSR  
RDPMC  
RDTSC  
RET near imm16  
RET near  
RET far imm16  
RET far  
SAHF  
SCASB AL, mem8  
SCASW AX, mem16  
SCASD EAX, mem32  
SGDT mem48  
SIDT mem48  
SHLD mreg16/32, reg16/32, imm8  
SHLD mem16/32, reg16/32, imm8  
SHLD mreg16/32, reg16/32, CL  
SHLD mem16/32, reg16/32, CL  
SHRD mreg16/32, reg16/32, imm8  
SHRD mem16/32, reg16/32, imm8  
SHRD mreg16/32, reg16/32, CL  
SHRD mem16/32, reg16/32, CL  
SLDT mreg16  
SLDT mem16  
SMSW mreg16  
SMSW mem16  
STD  
STI  
STOSB mem8, AL  
STOSW mem16, AX  
STOSD mem32, EAX  
STR mreg16  
STR mem16  
SYSCALL  
SYSENTER  
SYSEXIT  
SYSRET  
VERR mreg16  
VERR mem16  
VERW mreg16  
VERW mem16  
WBINVD  
WRMSR  
XADD mreg8, reg8  
XADD mem8, reg8  
XADD mreg16/32, reg16/32  
XADD mem16/32, reg16/32  
XCHG reg8, mreg8  
XCHG reg8, mem8  
XCHG reg16/32, mreg16/32  
XCHG reg16/32, mem16/32  
XCHG EAX, ECX  
XCHG EAX, EDX  
XCHG EAX, EBX  
XCHG EAX, ESP  
XCHG EAX, EBP  
XCHG EAX, ESI  
XCHG EAX, EDI  
XLAT  
Table 30. VectorPath MMX™ Instructions  
Instruction Mnemonic  
MOVD mmreg, mreg32  
MOVD mreg32, mmreg  
Table 31. VectorPath MMX™ Extensions  
Instruction Mnemonic  
MASKMOVQ mmreg1, mmreg2  
PEXTRW reg32, mmreg, imm8  
PINSRW mmreg, reg32, imm8  
PINSRW mmreg, mem16, imm8  
PMOVMSKB reg32, mmreg  
SFENCE  
Table 32. VectorPath Floating-Point Instructions  
Instruction Mnemonic  
F2XM1  
FBLD [mem80]  
FBSTP [mem80]  
FCLEX  
FCMOVB ST(0), ST(i)  
FCMOVE ST(0), ST(i)  
FCMOVBE ST(0), ST(i)  
FCMOVU ST(0), ST(i)  
FCMOVNB ST(0), ST(i)  
FCMOVNE ST(0), ST(i)  
FCMOVNBE ST(0), ST(i)  
FCMOVNU ST(0), ST(i)  
FCOMI ST, ST(i)  
FCOMIP ST, ST(i)  
FCOS  
FIADD [mem32int]  
FIADD [mem16int]  
FICOM [mem32int]  
FICOM [mem16int]  
FICOMP [mem32int]  
FICOMP [mem16int]  
FIDIV [mem32int]  
FIDIV [mem16int]  
FIDIVR [mem32int]  
FIDIVR [mem16int]  
FIMUL [mem32int]  
FIMUL [mem16int]  
FINIT  
FISUB [mem32int]  
FISUB [mem16int]  
FISUBR [mem32int]  
FISUBR [mem16int]  
FLD [mem80real]  
FLDCW [mem16]  
FLDENV [mem14byte]  
FLDENV [mem28byte]  
FPATAN  
FPTAN  
FRNDINT  
FRSTOR [mem94byte]  
FRSTOR [mem108byte]  
FSAVE [mem94byte]  
FSAVE [mem108byte]  
FSCALE  
FSIN  
FSINCOS  
FSTCW [mem16]  
FSTENV [mem14byte]  
FSTENV [mem28byte]  
FSTP [mem80real]  
FSTSW AX  
FSTSW [mem16]  
FUCOMI ST, ST(i)  
FUCOMIP ST, ST(i)  
FXAM  
FXTRACT  
FYL2X  
FYL2XP1  
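Many VectorPath instructions in the tables above have simple DirectPath equivalents. As one illustration, the VectorPath LOOP instruction can be replaced by a DEC/JNZ pair, both of which appear in Table 25 as DirectPath integer instructions. This is a minimal sketch: the label $loop_top is hypothetical, the counter is assumed to be in ECX, and the loop body must tolerate the flag updates performed by DEC, which LOOP does not perform:

```
        ; VectorPath form (avoid):
        ;       loop    $loop_top
        ;
        ; DirectPath replacement, counter assumed in ECX:
        dec     ecx             ;DirectPath: decrement loop counter
        jnz     $loop_top       ;DirectPath: repeat while ECX != 0
```

The same pattern generalizes to other counter registers, which LOOP cannot use.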
Index  
A  
AMD Athlon™ Processor  
B  
Blended Code, AMD-K6 and AMD Athlon Processors  
Branches  
D  
DirectPath  
F  
Floating-Point  
I  
Instruction  
L  
Loops  
M  
Memory  
Multiplication  
P  
Pointers  
Prefetch  
S  
Stack  