Optimal Coding 5 – Algorithms – Mul

Posted on 2008.11.14. by szir

The Fastcode Project is mostly about optimizing for architecture. I was looking for similar project, but sadly i couldn’t find any. Maybe I wasn’t usign the right phrases in google.
A few days ago I remembered the LibTom Library for C I downloaded a while ago. The LibTomMath is a multiple precision integer library. I was reading the code of mp_mul function. Then I looked up the referenced algorithms and more…

Comba multiplication
Karatsuba multiplication
Toom-Cook multiplication
Strassen algorithm for polynomial multiplication
Multiplication using the Fast Fourier Transform
Martin Fürer FFT algorithm

the time complexity of conventional long multiplication is O(n^2)
Karatsuba Multiplication: O(n^lg2(3)) or O(n^1.585)
Toom-Cook 3-way: O(n^1.464)
FFT: O(n log n)

A promising library: GNU Multiple Precision Arithmetic Library

And here is some instruction timings for optimizing for architectures I forgot to put up last time: http://swox.com/doc/x86-timing.pdf

Optimal Coding 4 – Code Libraries

Posted on 2008.11.03. by szir

AMD – ACML 4.1.0
AMD Core Math Library (ACML) for Windows®. Built with Intel® Fortran. 32bit/64bit
You can use Freepascal’s h2pas.exe to convert the acml.h to acml.pas. While it’s not perfect it helps.
procedure acmlversion(major:Plongint; minor:Plongint; patch:Plongint);cdecl; external 'libacml_dll.dll'; procedure acmlinfo; cdecl; external 'libacml_dll.dll'; ... var a,b,c:integer; begin acmlversion(@a,@b,@c); ...

Intel
Intel® Performance Libraries
Intel® Integrated Performance Primitives (Intel® IPP) for Windows
Intel® Math Kernel Library (Intel® MKL) for Windows
Intel® Software Evaluation Center
Intel® Compiler Suite Professional Edition for Windows
Intel® Visual Fortran Compiler Professional Edition for Windows

Optimal Coding 3 – Processor Architecture

Posted on 2008.11.01. by szir

This is where it gets interresting.
Removing redundant code is not too hard, although it is a bit surprising that compilers of this age do such a lazy job (MS VC2008, BDS2006). But it is nothing compared to writting code with keeping an eye on processor pipelines, branch predictions, cache levels… Which makes it more difficult is that there is a lot of different processors out there (486,PPro,PII,PIII,P4,AMD,AMDXP,AMD64… not mentioning PowerPC, ARM…) and thely all require different optimization.

Let’s start with integer instruction paring
More info at: http://www.agner.org/assem/
Mike Schmit’s Top Ten Rules for Pairing Pentium Instructions
Pentium Optimization Cross-Reference by Instruction
Optimization Stratagies for the Pentium Processor
3DNow! Instruction Porting Guide
AMD Athlon™ Processor x86 Code Optimization Guide

There are two pipelines for executing instructions, called the U-pipe and the V-pipe
There are many rules you need to know (you can read them at the links above)

So lets take a look at our previous example
function AddInt64_A(A, B : Int64) : Int64; 004089E0 55 push ebp 004089E1 8BEC mov ebp,esp 004089E3 8B4510 mov eax,[ebp+$10] 004089E6 8B5514 mov edx,[ebp+$14] 004089E9 034508 add eax,[ebp+$08] 004089EC 13550C adc edx,[ebp+$0c] 004089EF 5D pop ebp 004089F0 C21000 ret $0010

mov eax,dword ptr [esp+04h]//[A] add eax,dword ptr [esp+0ch] mov edx,dword ptr [esp+08h] adc edx,dword ptr [esp+10h]

MOV register, memory, or immediate into register or memory are pairable in either pipe
PUSH register or immediate, POP register are pairable in either pipe
INC, DEC, ADD, SUB, CMP, AND, OR, XOR are pairable in either pipe
ADC, SBB are pairable in the U-pipe only
near call, short and near [conditional] jump are only pairable when in the V-pipe

So “push ebp / mov ebp,esp” should pair although not needed
“mov eax,[ebp+$10] / mov edx,[ebp+$14]” can also be paired
“add eax,[ebp+$08] /” adc cannot run in V pipe
“adc edx,[ebp+$0c] / pop ebp”

In the second example
“mov eax,dword ptr [esp+04h] / ” the add instruction won’t pair beacause it would violate the “second instruction does not read or write a register which the first instruction writes to” rule
“add eax,dword ptr [esp+0ch] / mov edx,dword ptr [esp+08h]”
“adc edx,dword ptr [esp+10h]”

There are empty V-pipe slots in booth cases we could use for other instructions like inc ecx if we also need to increment a counter…

Extended instructions, optimizing with processor specificc instructions and code paths.
some random opcodes :
MMX: movq mm0, [ebp+$10] movq mm1, [ebp+$08] SSE2: paddq mm0, mm1 movd eax, mm0 psrlq mm0, 32 movd edx, mm0 emms

Moving data between MMX register and IA32 registers is expensive.

I love “lea eax, [eax+edx*4+5]”…

Optimal Coding 2 – Borland vs. MS vs. GNU

Posted on 2008.10.31. by szir

Delphi (Borland Developer Studio 2006):

function AddInt64_1(A, B : Int64) : Int64; begin Result := A + B; end;

The BDS2006 compiler generated this:
OptimizeMathFunctions.dpr.74: writeln(IntToHex(AddInt64_1($1111000022223333,$1234567890123456),16)); 0040915F 6800001111 push $11110000 00409164 6833332222 push $22223333 00409169 6878563412 push $12345678 0040916E 6856341290 push $90123456 00409173 E844F8FFFF call AddInt64_1 ... OptimizeMathFunctions.dpr.9: begin 004089BC 55 push ebp 004089BD 8BEC mov ebp,esp 004089BF 83C4F8 add esp,-$08 OptimizeMathFunctions.dpr.10: Result := A + B; 004089C2 8B4510 mov eax,[ebp+$10] 004089C5 8B5514 mov edx,[ebp+$14] 004089C8 034508 add eax,[ebp+$08] 004089CB 13550C adc edx,[ebp+$0c] 004089CE 8945F8 mov [ebp-$08],eax 004089D1 8955FC mov [ebp-$04],edx OptimizeMathFunctions.dpr.11: end; 004089D4 8B45F8 mov eax,[ebp-$08] 004089D7 8B55FC mov edx,[ebp-$04] 004089DA 59 pop ecx 004089DB 59 pop ecx 004089DC 5D pop ebp 004089DD C21000 ret $0010

What We wanted is this:

function AddInt64_A(A, B : Int64) : Int64; asm mov eax,[ebp+$10] mov edx,[ebp+$14] add eax,[ebp+$08] adc edx,[ebp+$0c] end;

Which looks like this in action:

OptimizeMathFunctions.dpr.82: writeln(IntToHex(AddInt64_A($1111000022223333,$1234567890123456),16)); 004091A1 6800001111 push $11110000 004091A6 6833332222 push $22223333 004091AB 6878563412 push $12345678 004091B0 6856341290 push $90123456 004091B5 E826F8FFFF call AddInt64_A ... OptimizeMathFunctions.dpr.14: asm 004089E0 55 push ebp 004089E1 8BEC mov ebp,esp 004089E3 8B4510 mov eax,[ebp+$10] 004089E6 8B5514 mov edx,[ebp+$14] 004089E9 034508 add eax,[ebp+$08] 004089EC 13550C adc edx,[ebp+$0c] OptimizeMathFunctions.dpr.19: end; 004089EF 5D pop ebp 004089F0 C21000 ret $0010

Much better, (the delphi optimizer looks lazy).

Lets see the Visual Studio 2008 Pro C++:

typedef long long int64; int64 AddInt64_1(int64 A, int64 B) { return (A+B); }

printf("%I64xn",AddInt64_1(0x1111000022223333UL,0x1234567890123456UL)); 004135AC push 12345678h 004135B1 push 90123456h 004135B6 push 11110000h 004135BB push 22223333h 004135C0 call AddInt64_1 (4111D1h) ... 004111D1 jmp AddInt64_1 (4113A0h) ... 004113A0 push ebp 004113A1 mov ebp,esp 004113A3 sub esp,0C0h 004113A9 push ebx 004113AA push esi 004113AB push edi 004113AC lea edi,[ebp-0C0h] 004113B2 mov ecx,30h 004113B7 mov eax,0CCCCCCCCh 004113BC rep stos dword ptr es:[edi] return (A+B); 004113BE mov eax,dword ptr [A] 004113C1 add eax,dword ptr [B] 004113C4 mov edx,dword ptr [ebp+0Ch] 004113C7 adc edx,dword ptr [ebp+14h] } 004113CA pop edi 004113CB pop esi 004113CC pop ebx 004113CD mov esp,ebp 004113CF pop ebp 004113D0 ret

I would have expected something more efficient. I thought MS VS2008 would produce better code than BDS2006 for sure. I was wrong.

What I wanted looks something like this:

int64 __declspec(naked) AddInt64_AN2(int64 A, int64 B) { __asm { mov eax,dword ptr [esp+04h]//[A] add eax,dword ptr [esp+0ch] mov edx,dword ptr [esp+08h] adc edx,dword ptr [esp+10h] ret } }

Assembly is still king. And so is pascal/delphi with assembler rutines.

Next is gcc…

MinGW32 – gcc 3.4.5 (mingw-vista special r3)

int64 AddInt64_1(int64 A, int64 B) { return (A+B); } ... printf("%I64xn",AddInt64_1(0x1111000022223333LL,0x1234567890123456LL));

the result of gcc -S :

movl $-1877855146, 8(%esp) movl $305419896, 12(%esp) movl $572666675, (%esp) movl $286326784, 4(%esp) call __Z10AddInt64_1xx ... __Z10AddInt64_1xx: pushl %ebp movl %esp, %ebp movl 16(%ebp), %eax movl 20(%ebp), %edx addl 8(%ebp), %eax adcl 12(%ebp), %edx popl %ebp ret

Thats what I call nice code (apart from the GAS syntax)

Optimalis kod iras

Posted on 2008.10.31. by szir

Regen volt mar amikor a HelpPC es hasonlo dos-os programokbol neztem ki az utasitas vegrehajtasi idoket az OP kod mereteket.
I love xor ax,ax !

Azota jo par ev eltelt, egyre tobb fajta processzor es egyre bonyolultabb pipeline es cache szervezes van a mai PC architekturakban.

Sajnos a Delphi fordito nem eppen produkal optimalis kodot alapbol. Ez nem jelenti azt, hogy delphiben akrar csak pascalban ne lehetne optimalizalt kodot irni. Igenis lehet. Mi sem jobb bizonyitek erre mint a fast code project.
FastMM, BASM
http://dennishomepage.gugs-cats.dk/CodingForSpeedInDelphi.doc
http://fastcode.sourceforge.net/
http://mikedimmick.blogspot.com/2008/03/what-heck-does-ret-mean.html

Nehany altalanos hasznos anyag:
http://www.agner.org/optimize/

AMD
AMD Athlon™ Processor x86 Code Optimization Guide
Software Optimization Guide for AMD64 Processors
AMD64 Architecture Programmer’s Manual Volume 1: Application Programming
AMD64 Architecture Programmer’s Manual Volume 2: System Programming
AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions
AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit Media Instructions
AMD64 Architecture Programmer’s Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions
Performance Guidelines for AMD Athlon™ and AMD Opteron™ ccNUMA Multiprocessor Systems
Software Optimization Guide for AMD Family 10h Rev C Processors
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739,00.html
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_3748,00.html
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_7044,00.html
http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_875_7044,00.html

Intel
IA-32 Intel® Architecture Optimization Reference Manual
IA-32 Intel® Architecture Software Developer’s Manual Volume 1: Basic Architecture
IA-32 Intel Architecture Software Developer’s Manual Volume 2: Instruction Set Reference

Assembly
Sok jo konyv es leiras van ebben a temaban az egyik ilyen az
The Art of Assembly Language Programming

Foton's weBLOG

Tag Archives: develop

Optimal Coding 5 – Algorithms – Mul

Optimal Coding 4 – Code Libraries

Optimal Coding 3 – Processor Architecture

Optimal Coding 2 – Borland vs. MS vs. GNU

Optimalis kod iras