First up, let me say I don't like writing in assembler. It is not portable,
dependent on the particular CPU architecture release and is generally a pig
to debug and get right. Having said that, the x86 architecture is probably
the most important for speed due to the number of boxes, and since
it appears to be the worst architecture to get
good C compilers for. So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c. Replace the following
4 functions
des_encrypt(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules. The differences between the 4 functions are that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES), and that des_encrypt3() and
des_decrypt3() perform triple DES. The triple DES routines are in here
because it does make a big difference to have them located near the
des_encrypt2 function at link time.

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as other people's.
The 4 main formats I know of are
Microsoft       Windows 95/Windows NT
Elf             Includes Linux and FreeBSD(?).
a.out           The older Linux.
Solaris         Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type. This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

Now the ugly part. I acquired my copy of Intel's
"Optimizations For Intel's 32-Bit Processors" and found a few interesting
things. First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup. This involves getting the byte from
each of the 4 locations in the word, moving it to a new word and doing the
lookup.
The most obvious way to do this is
xor     eax,    eax                             # clear word
mov     al,     cl                              # get low byte
xor     edi,    DWORD PTR 0x100+des_SP[eax]     # xor in word
mov     al,     ch                              # get next byte
xor     edi,    DWORD PTR 0x300+des_SP[eax]     # xor in word
shr     ecx,    16                              # line up the upper 2 bytes
which seems ok. For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.

Now the crunch. When a full register is used after a partial write, eg.
mov     al,     cl
xor     edi,    DWORD PTR 0x100+des_SP[eax]
the cost depends on the CPU:
386     - 1 cycle stall
486     - 1 cycle stall
586     - 0 cycle stall
686     - at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a pentium, according to
the documentation, will produce hideous results on a pentium pro.

To get around this, des686.pl will generate code that is not as fast on
a pentium, but should be very good on a pentium pro.
mov     eax,    ecx                             # copy word
shr     ecx,    8                               # line up next byte
and     eax,    0fch                            # mask byte
xor     edi,    DWORD PTR 0x100+des_SP[eax]     # xor in array lookup
mov     eax,    ecx                             # get word
shr     ecx,    8                               # line up next byte
and     eax,    0fch                            # mask byte
xor     edi,    DWORD PTR 0x300+des_SP[eax]     # xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good. This is the type of output
Visual C++ generates.

There is a third option. Instead of using
mov     al,     ch
which is bad on the pentium pro, one may be able to use
movzx   eax,    ch
which may not incur the partial write penalty. On the pentium,
this instruction takes 4 cycles so is not worth using, but on the
pentium pro it appears it may be worthwhile. I need access to one to
experiment :-).
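
For readers who do not read x86 assembler, here is a rough C sketch of the
two lookup styles discussed above. It is only an illustration, not code from
libdes: sp_table is a hypothetical stand-in for des_SP filled with arbitrary
values (not the real DES tables), and the function names are made up. The
0xfc mask is applied in both variants so that they agree on any input; the
first assembler listing above shows no mask.

```c
#include <stdint.h>

/* Hypothetical stand-in for des_SP: 8 tables of 64 32-bit words, so
 * table n starts n*0x100 bytes in (0x100 -> table 1, 0x300 -> table 3).
 * The contents are arbitrary filler, NOT the real DES tables. */
uint32_t sp_table[8][64];

void fill_sp_table(void)
{
    uint32_t x = 0x12345678u;
    int t, i;

    for (t = 0; t < 8; t++)
        for (i = 0; i < 64; i++) {
            x = x * 1664525u + 1013904223u;     /* simple LCG filler */
            sp_table[t][i] = x;
        }
}

/* Byte-extraction style: pick out the low byte, then the next byte,
 * in the spirit of "mov al, cl" / "mov al, ch". */
uint32_t lookup_by_byte(uint32_t w)
{
    uint8_t lo = (uint8_t)w;            /* plays the role of cl */
    uint8_t hi = (uint8_t)(w >> 8);     /* plays the role of ch */
    uint32_t acc = 0;

    /* A masked byte is a byte offset into a table of 4-byte words,
     * so shifting right by 2 turns it into an array index. */
    acc ^= sp_table[1][(lo & 0xfc) >> 2];
    acc ^= sp_table[3][(hi & 0xfc) >> 2];
    return acc;
}

/* Copy/shift/mask style: only whole registers are written, in the
 * spirit of the "mov eax, ecx / shr ecx, 8 / and eax, 0fch" sequence. */
uint32_t lookup_by_shift_mask(uint32_t w)
{
    uint32_t acc = 0, a;

    a = w;  w >>= 8;  a &= 0xfc;  acc ^= sp_table[1][a >> 2];
    a = w;  w >>= 8;  a &= 0xfc;  acc ^= sp_table[3][a >> 2];
    return acc;
}
```

Both functions return the same value for any input word once
fill_sp_table() has been called. The difference the text cares about is only
in how a compiler or hand-written assembler gets the byte into a register,
which C does not let you control directly.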

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the intel documentation is wrong. The
mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 and moved these
functions into des_enc.c, because it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
pentiums. It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second. He had a previous version at 250,000 and the best
I was able to get was 203,000. The content has not changed, this is all
due to instruction sequencing (and actual instruction choice) which is able
to keep both functional units of the pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the pentium it has been replaced by evil instruction ordering tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
raw DES at 281,000 per second on a pentium 100.