Optimal Coding 3 – Processor Architecture

This is where it gets interesting.
Removing redundant code is not too hard, although it is a bit surprising that compilers of this age (MS VC2008, BDS2006) do such a lazy job. But it is nothing compared to writing code while keeping an eye on processor pipelines, branch prediction, cache levels… What makes it more difficult is that there are a lot of different processors out there (486, PPro, PII, PIII, P4, AMD, AMDXP, AMD64… not to mention PowerPC, ARM…) and they all require different optimizations.

Let’s start with integer instruction pairing.
More info at: http://www.agner.org/assem/
Mike Schmit’s Top Ten Rules for Pairing Pentium Instructions
Pentium Optimization Cross-Reference by Instruction
Optimization Strategies for the Pentium Processor
3DNow! Instruction Porting Guide
AMD Athlon™ Processor x86 Code Optimization Guide

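The Pentium has two pipelines for executing instructions, called the U-pipe and the V-pipe. Two consecutive instructions can execute at the same time, one in each pipe, if they satisfy the pairing rules.
There are many rules you need to know (you can read them at the links above).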

So let’s take a look at our previous example:
function AddInt64_A(A, B : Int64) : Int64;
004089E0 55 push ebp
004089E1 8BEC mov ebp,esp
004089E3 8B4510 mov eax,[ebp+$10]
004089E6 8B5514 mov edx,[ebp+$14]
004089E9 034508 add eax,[ebp+$08]
004089EC 13550C adc edx,[ebp+$0c]
004089EF 5D pop ebp
004089F0 C21000 ret $0010

mov eax,dword ptr [esp+04h]//[A]
add eax,dword ptr [esp+0ch]
mov edx,dword ptr [esp+08h]
adc edx,dword ptr [esp+10h]
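Both listings do the same arithmetic: add sums the low dwords, and adc sums the high dwords plus the carry left behind by the add. Here is a minimal Python model of that, as a sketch only. The helper names add_adc and add_int64 are hypothetical and not part of the original code.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def add_adc(a_lo, a_hi, b_lo, b_hi):
    """Model the add/adc pair on 32-bit halves; returns (eax, edx)."""
    lo = a_lo + b_lo            # add eax, [low dword of the other operand]
    carry = lo >> 32            # carry flag produced by the add (0 or 1)
    hi = a_hi + b_hi + carry    # adc edx, [high dword of the other operand]
    return lo & MASK32, hi & MASK32

def add_int64(a, b):
    """Split two 64-bit values into dwords, add them, rejoin as edx:eax."""
    a &= MASK64
    b &= MASK64
    eax, edx = add_adc(a & MASK32, a >> 32, b & MASK32, b >> 32)
    return (edx << 32) | eax

if __name__ == "__main__":
    # The carry out of the low dword ends up in the high dword.
    print(hex(add_int64(0xFFFFFFFF, 1)))
```

The carry is the only link between the two halves, which is why adc has to follow the add and why nothing placed between them may change the carry flag.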

MOV register, memory, or immediate into register or memory are pairable in either pipe
PUSH register or immediate, POP register are pairable in either pipe
INC, DEC, ADD, SUB, CMP, AND, OR, XOR are pairable in either pipe
ADC, SBB are pairable in the U-pipe only
near call, short and near [conditional] jump are only pairable when in the V-pipe

So “push ebp / mov ebp,esp” should pair, although the stack frame is not needed here.
“mov eax,[ebp+$10] / mov edx,[ebp+$14]” can also be paired.
“add eax,[ebp+$08] / ” runs alone, because adc cannot run in the V-pipe.
“adc edx,[ebp+$0c] / pop ebp” pair, with adc in the U-pipe.

In the second example:
“mov eax,dword ptr [esp+04h] / ” the add instruction won’t pair because it would violate the “second instruction does not read or write a register which the first instruction writes to” rule.
“add eax,dword ptr [esp+0ch] / mov edx,dword ptr [esp+08h]” pair.
“adc edx,dword ptr [esp+10h]” runs alone.

There are empty V-pipe slots in both cases that we could use for other instructions, like inc ecx if we also need to increment a counter…
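The pairing walk-through above can be reproduced with a small Python model. This is a rough sketch under stated assumptions, not a Pentium simulator: it only knows the few rules quoted in this post, it ignores the implicit ESP update of push/pop (which matches the reading above that push ebp / mov ebp,esp pairs), it ignores flags, AGI stalls and instruction latencies, and it treats ret as not pairable because the rule list does not mention it. The names Insn, can_pair and issue_groups are made up for the illustration.

```python
from collections import namedtuple

# pipe: "UV" = pairable in either pipe, "U" = U-pipe only,
#       "V" = pairable only when in the V-pipe, "NP" = not pairable
Insn = namedtuple("Insn", "text pipe writes reads")

def can_pair(first, second):
    """True if `first` may issue in U and `second` in V in the same slot."""
    if first.pipe not in ("UV", "U"):
        return False
    if second.pipe not in ("UV", "V"):
        return False
    # The second instruction must not read or write a register
    # that the first instruction writes to.
    return not (first.writes & (second.reads | second.writes))

def issue_groups(code):
    """Greedily pair neighbouring instructions, in program order."""
    groups, i = [], 0
    while i < len(code):
        if i + 1 < len(code) and can_pair(code[i], code[i + 1]):
            groups.append((code[i].text, code[i + 1].text))
            i += 2
        else:
            groups.append((code[i].text,))
            i += 1
    return groups

first = [
    Insn("push ebp", "UV", set(), {"ebp"}),        # implicit esp ignored
    Insn("mov ebp,esp", "UV", {"ebp"}, {"esp"}),
    Insn("mov eax,[ebp+$10]", "UV", {"eax"}, {"ebp"}),
    Insn("mov edx,[ebp+$14]", "UV", {"edx"}, {"ebp"}),
    Insn("add eax,[ebp+$08]", "UV", {"eax"}, {"eax", "ebp"}),
    Insn("adc edx,[ebp+$0c]", "U", {"edx"}, {"edx", "ebp"}),
    Insn("pop ebp", "UV", {"ebp"}, set()),         # implicit esp ignored
    Insn("ret $0010", "NP", set(), set()),         # assumed not pairable
]

second = [
    Insn("mov eax,[esp+04h]", "UV", {"eax"}, {"esp"}),
    Insn("add eax,[esp+0ch]", "UV", {"eax"}, {"eax", "esp"}),
    Insn("mov edx,[esp+08h]", "UV", {"edx"}, {"esp"}),
    Insn("adc edx,[esp+10h]", "U", {"edx"}, {"edx", "esp"}),
]

if __name__ == "__main__":
    for name, code in (("first", first), ("second", second)):
        print(name)
        for group in issue_groups(code):
            print("   ", " / ".join(group))
```

Appending Insn("inc ecx", "UV", {"ecx"}, {"ecx"}) to the second listing does not add an issue slot in this model: it lands in the empty V-pipe slot next to the adc.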

Extended instructions: optimizing with processor-specific instructions and code paths.
Some random opcodes:
MMX:
movq mm0, [ebp+$10]
movq mm1, [ebp+$08]
SSE2:
paddq mm0, mm1
movd eax, mm0
psrlq mm0, 32
movd edx, mm0
emms

Moving data between MMX register and IA32 registers is expensive.

I love “lea eax, [eax+edx*4+5]”…
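What makes that instruction attractive is that a single lea computes base + index*scale + displacement and leaves the flags untouched. A tiny Python model of the address arithmetic, assuming 32-bit registers (the function name lea is just for illustration):

```python
MASK32 = 0xFFFFFFFF

def lea(base, index, scale, disp):
    """Model of lea dst, [base + index*scale + disp] with 32-bit wraparound."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 addressing only allows a scale of 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32

if __name__ == "__main__":
    # lea eax, [eax+edx*4+5] with eax = 10 and edx = 3
    print(lea(10, 3, 4, 5))
```

Without lea the same result would take a shift and two adds, each of which modifies the flags.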
