Optimal Coding 2 – Borland vs. MS vs. GNU

Delphi (Borland Developer Studio 2006):

function AddInt64_1(A, B : Int64) : Int64;
begin
  Result := A + B;
end;

The BDS2006 compiler generated this:
OptimizeMathFunctions.dpr.74: writeln(IntToHex(AddInt64_1($1111000022223333,$1234567890123456),16));
0040915F 6800001111 push $11110000
00409164 6833332222 push $22223333
00409169 6878563412 push $12345678
0040916E 6856341290 push $90123456
00409173 E844F8FFFF call AddInt64_1
...
OptimizeMathFunctions.dpr.9: begin
004089BC 55 push ebp
004089BD 8BEC mov ebp,esp
004089BF 83C4F8 add esp,-$08
OptimizeMathFunctions.dpr.10: Result := A + B;
004089C2 8B4510 mov eax,[ebp+$10]
004089C5 8B5514 mov edx,[ebp+$14]
004089C8 034508 add eax,[ebp+$08]
004089CB 13550C adc edx,[ebp+$0c]
004089CE 8945F8 mov [ebp-$08],eax
004089D1 8955FC mov [ebp-$04],edx
OptimizeMathFunctions.dpr.11: end;
004089D4 8B45F8 mov eax,[ebp-$08]
004089D7 8B55FC mov edx,[ebp-$04]
004089DA 59 pop ecx
004089DB 59 pop ecx
004089DC 5D pop ebp
004089DD C21000 ret $0010

What we wanted is this:

function AddInt64_A(A, B : Int64) : Int64;
asm
  mov eax,[ebp+$10]
  mov edx,[ebp+$14]
  add eax,[ebp+$08]
  adc edx,[ebp+$0c]
end;

Which looks like this in action:

OptimizeMathFunctions.dpr.82: writeln(IntToHex(AddInt64_A($1111000022223333,$1234567890123456),16));
004091A1 6800001111 push $11110000
004091A6 6833332222 push $22223333
004091AB 6878563412 push $12345678
004091B0 6856341290 push $90123456
004091B5 E826F8FFFF call AddInt64_A
...
OptimizeMathFunctions.dpr.14: asm
004089E0 55 push ebp
004089E1 8BEC mov ebp,esp
004089E3 8B4510 mov eax,[ebp+$10]
004089E6 8B5514 mov edx,[ebp+$14]
004089E9 034508 add eax,[ebp+$08]
004089EC 13550C adc edx,[ebp+$0c]
OptimizeMathFunctions.dpr.19: end;
004089EF 5D pop ebp
004089F0 C21000 ret $0010

Much better (the Delphi optimizer looks lazy).
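
The add/adc pair in the listings above is how a 32-bit x86 CPU adds two 64-bit values: add sums the low halves and sets the carry flag, then adc sums the high halves plus that carry. Here is a minimal C++ sketch of the same arithmetic, assuming a C++14 compiler. The name add64_via_halves is hypothetical and used only for illustration; it is not part of the test program.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original test program): adds two 64-bit
// values using only 32-bit operations, mirroring the add/adc instruction pair.
constexpr std::uint64_t add64_via_halves(std::uint64_t a, std::uint64_t b) {
    const std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    const std::uint32_t lo = a_lo + b_lo;                // add eax, <low half>
    const std::uint32_t carry = (lo < a_lo) ? 1u : 0u;   // carry flag after add
    const std::uint32_t hi = a_hi + b_hi + carry;        // adc edx, <high half>

    return (static_cast<std::uint64_t>(hi) << 32) | lo;  // result in edx:eax
}

// The values used in the post: the low halves do not overflow here.
static_assert(add64_via_halves(0x1111000022223333ULL, 0x1234567890123456ULL)
                  == 0x23455678B2346789ULL,
              "sum of the post's test values");

// A case where the low halves do overflow, so the carry reaches the high half.
static_assert(add64_via_halves(0x00000000FFFFFFFFULL, 1ULL)
                  == 0x0000000100000000ULL,
              "carry propagates into the high half");
```

With the test values used here the low halves do not overflow, so the carry is 0; the second static_assert covers the case where adc actually matters.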

Let's see Visual Studio 2008 Pro C++:

typedef long long int64;

int64 AddInt64_1(int64 A, int64 B)
{
    return (A+B);
}

printf("%I64x\n",AddInt64_1(0x1111000022223333UL,0x1234567890123456UL));
004135AC push 12345678h
004135B1 push 90123456h
004135B6 push 11110000h
004135BB push 22223333h
004135C0 call AddInt64_1 (4111D1h)
...
004111D1 jmp AddInt64_1 (4113A0h)
...
004113A0 push ebp
004113A1 mov ebp,esp
004113A3 sub esp,0C0h
004113A9 push ebx
004113AA push esi
004113AB push edi
004113AC lea edi,[ebp-0C0h]
004113B2 mov ecx,30h
004113B7 mov eax,0CCCCCCCCh
004113BC rep stos dword ptr es:[edi]
return (A+B);
004113BE mov eax,dword ptr [A]
004113C1 add eax,dword ptr [B]
004113C4 mov edx,dword ptr [ebp+0Ch]
004113C7 adc edx,dword ptr [ebp+14h]
}
004113CA pop edi
004113CB pop esi
004113CC pop ebx
004113CD mov esp,ebp
004113CF pop ebp
004113D0 ret

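I would have expected something more efficient. I thought MS VS2008 would surely produce better code than BDS2006. I was wrong. (To be fair, the 0CCCCCCCCh stack fill and the jump thunk are typical of an unoptimized debug build with runtime checks, so an optimized release build may look quite different.)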

What I wanted looks something like this:

int64 __declspec(naked) AddInt64_AN2(int64 A, int64 B)
{
    __asm {
        mov eax,dword ptr [esp+04h] //[A]
        add eax,dword ptr [esp+0ch]
        mov edx,dword ptr [esp+08h]
        adc edx,dword ptr [esp+10h]
        ret
    }
}

Assembly is still king, and so is Pascal/Delphi with assembler routines.

Next is gcc…

MinGW32 – gcc 3.4.5 (mingw-vista special r3)

int64 AddInt64_1(int64 A, int64 B)
{
    return (A+B);
}
...
printf("%I64x\n",AddInt64_1(0x1111000022223333LL,0x1234567890123456LL));

The result of gcc -S:

    movl $-1877855146, 8(%esp)
    movl $305419896, 12(%esp)
    movl $572666675, (%esp)
    movl $286326784, 4(%esp)
    call __Z10AddInt64_1xx
...
__Z10AddInt64_1xx:
    pushl %ebp
    movl %esp, %ebp
    movl 16(%ebp), %eax
    movl 20(%ebp), %edx
    addl 8(%ebp), %eax
    adcl 12(%ebp), %edx
    popl %ebp
    ret

That's what I call nice code (apart from the GAS syntax).

2 thoughts on “Optimal Coding 2 – Borland vs. MS vs. GNU”

szir on 2008.10.31. at 17:25:26 said:

There were some intermediate steps in the optimization process:

First of all, the standard prolog and epilog in VS C++ are sort of unacceptable, so this is no good:
int64 AddInt64_A(int64 A, int64 B)
{
    __asm {
        mov eax,dword ptr [A]
        add eax,dword ptr [B]
        mov edx,dword ptr [ebp+0Ch]
        adc edx,dword ptr [ebp+14h]
    }
}
because it produces this insane code:
00413510 push ebp
00413511 mov ebp,esp
00413513 sub esp,0C0h
00413519 push ebx
0041351A push esi
0041351B push edi
0041351C lea edi,[ebp-0C0h]
00413522 mov ecx,30h
00413527 mov eax,0CCCCCCCCh
0041352C rep stos dword ptr es:[edi]
__asm {
0041352E mov eax,dword ptr [A]
00413531 add eax,dword ptr [B]
00413534 mov edx,dword ptr [ebp+0Ch]
00413537 adc edx,dword ptr [ebp+14h]
}
}
0041353A pop edi
0041353B pop esi
0041353C pop ebx
0041353D add esp,0C0h
00413543 cmp ebp,esp
00413545 call @ILT+320(__RTC_CheckEsp) (411145h)
0041354A mov esp,ebp
0041354C pop ebp
0041354D ret

To trim this code down we need naked functions, so that we can write exactly the code we want and keep it efficient. Here it is with the suggested prolog and epilog.
Actually, the MSDN reference includes the “sub esp, __LOCAL_SIZE” line, which is plain wrong without the matching “mov esp,ebp” instruction in the epilog: it corrupts the stack frame and ends in an access violation.

int64 __declspec(naked) AddInt64_AN(int64 A, int64 B)
{
    // Naked functions must provide their own prolog...
    __asm {
        push ebp
        mov ebp, esp
        //sub esp, __LOCAL_SIZE /// !!!
        mov eax,dword ptr [A]
        add eax,dword ptr [B]
        mov edx,dword ptr [ebp+0Ch]
        adc edx,dword ptr [ebp+14h]
        // ... and epilog
        pop ebp
        ret
    }
}

If we look at the code closely, we realise that we don’t need the prolog at all if we address the arguments through esp instead of ebp. The MSDN manual states “When using __asm to write assembly language in C/C++ functions, you don’t need to preserve the EAX, EBX, ECX, EDX, ESI, or EDI registers.”, “by using EBX, ESI or EDI in inline assembly code, you force the compiler to save and restore those registers in the function prologue and epilogue” and “You should preserve the ESP and EBP registers”.
You should also notice that in the final version I changed the argument offsets, since there is no push at the beginning of the procedure: without “push ebp”, every argument sits 4 bytes closer to the stack pointer.
mov eax,dword ptr [esp+04h] //[A]
add eax,dword ptr [esp+0ch]
mov edx,dword ptr [esp+08h]
adc edx,dword ptr [esp+10h]
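
The offset change can be pictured by modelling the stack as a struct. This is only a sketch, assuming the 32-bit cdecl layout seen in the VS listings (return address on top, then A, then B, low dword first). FrameWithProlog and FrameNaked are hypothetical names for illustration, not anything the compiler defines.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model: the stack as seen from ebp after "push ebp / mov ebp, esp".
struct FrameWithProlog {
    std::uint32_t saved_ebp;    // [ebp+00h] pushed by the prolog
    std::uint32_t return_addr;  // [ebp+04h] pushed by call
    std::uint32_t a_lo;         // [ebp+08h] low dword of A
    std::uint32_t a_hi;         // [ebp+0Ch] high dword of A
    std::uint32_t b_lo;         // [ebp+10h] low dword of B
    std::uint32_t b_hi;         // [ebp+14h] high dword of B
};

// Hypothetical model: the same stack as seen from esp on entry to a naked
// function, where nothing has been pushed after the return address.
struct FrameNaked {
    std::uint32_t return_addr;  // [esp+00h]
    std::uint32_t a_lo;         // [esp+04h]
    std::uint32_t a_hi;         // [esp+08h]
    std::uint32_t b_lo;         // [esp+0Ch]
    std::uint32_t b_hi;         // [esp+10h]
};

// Offsets used by the version with a prolog.
static_assert(offsetof(FrameWithProlog, a_lo) == 0x08, "A low at ebp+08h");
static_assert(offsetof(FrameWithProlog, b_hi) == 0x14, "B high at ebp+14h");

// Offsets used by the prolog-free version: each one is 4 bytes smaller.
static_assert(offsetof(FrameNaked, a_lo) == 0x04, "A low at esp+04h");
static_assert(offsetof(FrameNaked, b_hi) == 0x10, "B high at esp+10h");
```

The only difference between the two structs is the saved ebp slot, which is why each offset shrinks by exactly 4.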

szir on 2008.10.31. at 23:53:06 said:

Let’s try optimizing with processor-specific instructions and code paths.
http://dennishomepage.gugs-cats.dk/BASM-filer/BASMForBeginners3.htm

MMX:
movq mm0, [ebp+$10]
movq mm1, [ebp+$08]
SSE2:
paddq mm0, mm1

movd eax, mm0
psrlq mm0, 32
movd edx, mm0
emms

Moving data between MMX registers and IA-32 general-purpose registers is expensive.
In this example, using the extended instructions is not worth it.
A 64-bit CPU with an x64 OS is another matter.


Foton's weBLOG
