First up, let me say I don't like writing in assembler.  It is not portable,
dependent on the particular CPU architecture release and is generally a pig
to debug and get right.  Having said that, the x86 architecture is probably
the most important for speed due to the number of boxes and since
it appears to be the worst architecture to get
good C compilers for.  So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c.  Replace the following
4 functions
des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules.   The only difference between the 4 functions is that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES) and des_encrypt3() and des_decrypt3()
perform triple DES.  The triple DES routines are in here because it does
make a big difference to have them located near the des_encrypt2 function
at link time.

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as the other people's.
The 4 main formats I know of are
Microsoft       Windows 95/Windows NT
Elf             Includes Linux and FreeBSD(?).
a.out           The older Linux.
Solaris         Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.  This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

Now the ugly part.  I acquired my copy of Intel's
"Optimizations for Intel's 32-Bit Processors" and found a few interesting
things.  First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup.  This involves getting the byte from
the 4 locations in the word and moving it to a new word and doing the lookup.
The most obvious way to do this is
        xor     eax, eax                                # clear word
        mov     al, cl                                  # get low byte
        xor     edi, DWORD PTR 0x100+des_SP[eax]        # xor in word
        mov     al, ch                                  # get next byte
        xor     edi, DWORD PTR 0x300+des_SP[eax]        # xor in word
        shr     ecx, 16
which seems ok.
For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.

Now the crunch.  When a full register is used after a partial write, eg.
        mov     al, cl
        xor     edi, DWORD PTR 0x100+des_SP[eax]
386     - 1 cycle stall
486     - 1 cycle stall
586     - 0 cycle stall
686     - at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a pentium, according to
the documentation, will produce hideous results on a pentium pro.

To get around this, des686.pl will generate code that is not as fast on
a pentium, but should be very good on a pentium pro.
        mov     eax, ecx                                # copy word
        shr     ecx, 8                                  # line up next byte
        and     eax, 0fch                               # mask byte
        xor     edi, DWORD PTR 0x100+des_SP[eax]        # xor in array lookup
        mov     eax, ecx                                # get word
        shr     ecx, 8                                  # line up next byte
        and     eax, 0fch                               # mask byte
        xor     edi, DWORD PTR 0x300+des_SP[eax]        # xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good.  This is the type of output
Visual C++ generates.

There is a third option.  Instead of using
        mov     al, ch
which is bad on the pentium pro, one may be able to use
        movzx   eax, ch
which may not incur the partial write penalty.  On the pentium,
this instruction takes 4 cycles so is not worth using but on the
pentium pro it appears it may be worthwhile.  I need access to one to
experiment :-).
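
For readers who do not think in x86, the two extraction strategies above
can be sketched in C.  This is only an illustration, not code from libdes:
sp_table is a hypothetical stand-in for des_SP (8 tables of 64 32-bit
words, so 0x100 bytes per table) filled with arbitrary values, and all
function names are made up for the example.  It assumes the byte taken from
the word is used as a byte offset into a table, which is what the 0fch mask
suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for des_SP: 8 tables x 64 entries x 4 bytes.
 * Offsets 0x100 and 0x300 in the assembler select tables 1 and 3. */
uint32_t sp_table[8][64];

/* Fill the table with arbitrary (non-DES) values so lookups are testable. */
void sp_fill(void)
{
    uint32_t x = 0x9e3779b9u;
    for (int t = 0; t < 8; t++)
        for (int i = 0; i < 64; i++) {
            x = x * 1664525u + 1013904223u;
            sp_table[t][i] = x;
        }
}

/* des586-style: write only the low byte of eax ("mov al, cl" and
 * "mov al, ch"), then use the full register as a byte offset. */
uint32_t lookup_byte_regs(uint32_t acc, uint32_t u)
{
    uint32_t eax = 0;                               /* xor eax, eax */
    eax = (eax & ~0xffu) | (u & 0xffu);             /* mov al, cl   */
    acc ^= sp_table[1][eax >> 2];                   /* 0x100 table  */
    eax = (eax & ~0xffu) | ((u >> 8) & 0xffu);      /* mov al, ch   */
    acc ^= sp_table[3][eax >> 2];                   /* 0x300 table  */
    return acc;
}

/* des686-style: copy the whole word, shift, and mask with 0xfc, so no
 * partial register write is ever followed by a full register read. */
uint32_t lookup_shift_mask(uint32_t acc, uint32_t u)
{
    uint32_t eax;
    eax = u; u >>= 8; eax &= 0xfcu;                 /* mov/shr/and  */
    acc ^= sp_table[1][eax >> 2];
    eax = u; u >>= 8; eax &= 0xfcu;
    acc ^= sp_table[3][eax >> 2];
    return acc;
}

/* Both variants must compute the same value; only the timing differs. */
void check_variants_agree(void)
{
    for (uint32_t u = 0; u < 0x10000u; u += 4)
        assert(lookup_byte_regs(0, u) == lookup_shift_mask(0, u));
}
```

Call sp_fill() once, then check_variants_agree().  The C compiler is free
to pick either instruction sequence for both functions, so this shows the
logic being compared, not the cycle counts.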

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the intel documentation is wrong.  The
mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c; it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
pentiums.  It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second.  He had a previous version at 250,000 and the best
I was able to get was 203,000.  The content has not changed, this is all
due to instruction sequencing (and actual instruction choice) which is able
to keep both functional units of the pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the pentium it has been replaced by evil instruction ordering tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
raw DES at 281,000 per second on a pentium 100.