First up, let me say I don't like writing in assembler.  It is not portable,
is dependent on the particular CPU architecture release, and is generally a pig
to debug and get right.  Having said that, the x86 architecture is probably
the most important for speed, due to the number of boxes and because
it appears to be the worst architecture to get
good C compilers for.  So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c.  Replace the following
4 functions:
des_encrypt(DES_LONG data[2], des_key_schedule ks, int encrypt);
des_encrypt2(DES_LONG data[2], des_key_schedule ks, int encrypt);
des_encrypt3(DES_LONG data[2], des_key_schedule ks1, des_key_schedule ks2,
	des_key_schedule ks3);
des_decrypt3(DES_LONG data[2], des_key_schedule ks1, des_key_schedule ks2,
	des_key_schedule ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules.  The only differences between the 4 functions are that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES), and that des_encrypt3() and
des_decrypt3() perform triple DES.  The triple DES routines are in here
because it does make a big difference to have them located near the
des_encrypt2() function at link time.
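The IP()/FP() shortcut can be sketched in C.  This is a toy model, not
libdes code: toy_ks, toy_ip(), toy_fp(), toy_f() and the toy_encrypt*()
functions are hypothetical stand-ins, and the round function is an arbitrary
mixer rather than real DES.  It only illustrates why a core routine that
skips the permutations lets triple DES apply IP and FP once each instead of
three times each.

```c
#include <stdint.h>

#define TOY_ROUNDS 4

/* Hypothetical stand-in for des_key_schedule: one word per round. */
typedef struct { uint32_t k[TOY_ROUNDS]; } toy_ks;

/* Rotate left; only called with n in 1..31. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Toy initial permutation: any invertible mixing of the two halves will do. */
static void toy_ip(uint32_t d[2])
{
	uint32_t l = d[0], r = d[1];
	d[0] = rotl32(r, 3);
	d[1] = rotl32(l, 5);
}

/* Toy final permutation: the exact inverse of toy_ip(). */
static void toy_fp(uint32_t d[2])
{
	uint32_t l = d[0], r = d[1];
	d[0] = rotl32(r, 27);
	d[1] = rotl32(l, 29);
}

/* Arbitrary round function (NOT the DES f function). */
static uint32_t toy_f(uint32_t x, uint32_t k)
{
	return rotl32(x + k, 7) ^ (x >> 3);
}

/* Plays the role of des_encrypt2(): Feistel rounds only, no IP/FP. */
void toy_encrypt2(uint32_t d[2], const toy_ks *ks, int encrypt)
{
	uint32_t l = d[0], r = d[1];
	for (int i = 0; i < TOY_ROUNDS; i++) {
		uint32_t k = ks->k[encrypt ? i : TOY_ROUNDS - 1 - i];
		uint32_t t = l ^ toy_f(r, k);
		l = r;
		r = t;
	}
	d[0] = r;	/* final swap, so reversed keys undo the rounds */
	d[1] = l;
}

/* Plays the role of des_encrypt(): IP, rounds, FP. */
void toy_encrypt(uint32_t d[2], const toy_ks *ks, int encrypt)
{
	toy_ip(d);
	toy_encrypt2(d, ks, encrypt);
	toy_fp(d);
}

/* Plays the role of des_encrypt3(): one IP, three cores (E-D-E), one FP. */
void toy_encrypt3(uint32_t d[2], const toy_ks *k1, const toy_ks *k2,
	const toy_ks *k3)
{
	toy_ip(d);
	toy_encrypt2(d, k1, 1);
	toy_encrypt2(d, k2, 0);
	toy_encrypt2(d, k3, 1);
	toy_fp(d);
}

/* Plays the role of des_decrypt3(): the same stages in reverse (D-E-D). */
void toy_decrypt3(uint32_t d[2], const toy_ks *k1, const toy_ks *k2,
	const toy_ks *k3)
{
	toy_ip(d);
	toy_encrypt2(d, k3, 0);
	toy_encrypt2(d, k2, 1);
	toy_encrypt2(d, k1, 0);
	toy_fp(d);
}
```

Chaining toy_encrypt() three times (encrypt, decrypt, encrypt) gives the same
block as a single toy_encrypt3() call, because each intermediate FP is undone
by the IP that follows it.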

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as the other people's.
The 4 main formats I know of are
Microsoft	Windows 95/Windows NT
Elf		Includes Linux and FreeBSD(?).
a.out		The older Linux.
Solaris		Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.  This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

Now the ugly part.  I acquired my copy of Intel's
"Optimizations For Intel's 32-Bit Processors" and found a few interesting
things.  First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup.  This involves getting the byte from
the 4 locations in the word and moving it to a new word and doing the lookup.
The most obvious way to do this is
xor	eax,	eax				# clear word
mov	al,	cl				# get low byte
xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in word
mov	al,	ch				# get next byte
xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in word
shr	ecx,	16				# line up the upper 2 bytes
which seems ok.  For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.

Now the crunch.  When a full register is used after a partial write, e.g.
mov	al,	cl
xor	edi,	DWORD PTR 0x100+des_SP[eax]
386	- 1 cycle stall
486	- 1 cycle stall
586	- 0 cycle stall
686	- at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a pentium, according to
the documentation, will produce hideous results on a pentium pro.

To get around this, des686.pl will generate code that is not as fast on
a pentium but should be very good on a pentium pro.
mov	eax,	ecx				# copy word
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in array lookup
mov	eax,	ecx				# get word
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good.  This is the type of output
Visual C++ generates.
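For readers who do not read x86, the two lookup styles can be sketched in C.
This is an illustration under assumptions, not the generated code:
lookup_bytes(), lookup_masked() and sp_load() are hypothetical names, the
table is a flat array standing in for des_SP (8 tables of 0x100 bytes each),
and the byte-register style is assumed to receive a word already masked with
0xfcfcfcfc so that every byte is a valid 4-byte-aligned offset.  A C compiler
is free to emit either instruction sequence for either function; the point
is only that both compute the same lookups.

```c
#include <stdint.h>

/* Read one 32-bit entry at a byte offset into the flat table,
 * mirroring "DWORD PTR offset+des_SP[eax]".  The offset is assumed
 * to be a multiple of 4. */
static uint32_t sp_load(const uint32_t *sp, uint32_t byte_offset)
{
	return sp[byte_offset / 4];
}

/* Byte-register style: take the low two bytes of w directly.
 * Assumes w was masked with 0xfcfcfcfc by the caller. */
uint32_t lookup_bytes(const uint32_t *sp, uint32_t w, uint32_t acc)
{
	uint32_t lo = w & 0xffu;		/* mov al, cl */
	uint32_t hi = (w >> 8) & 0xffu;		/* mov al, ch */

	acc ^= sp_load(sp, 0x100u + lo);
	acc ^= sp_load(sp, 0x300u + hi);
	return acc;
}

/* Copy/shift/mask style: never writes part of a register. */
uint32_t lookup_masked(const uint32_t *sp, uint32_t w, uint32_t acc)
{
	uint32_t idx;

	idx = w;				/* mov eax, ecx */
	w >>= 8;				/* shr ecx, 8   */
	idx &= 0xfcu;				/* and eax, 0fch */
	acc ^= sp_load(sp, 0x100u + idx);

	idx = w;
	w >>= 8;
	idx &= 0xfcu;
	acc ^= sp_load(sp, 0x300u + idx);
	return acc;
}
```

With a table whose entries equal their own index, the word 0x0000a4c8 selects
entries 0x72 and 0xe9 in both styles.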

There is a third option.  Instead of using
mov	al,	ch
which is bad on the pentium pro, one may be able to use
movzx	eax,	ch
which may not incur the partial write penalty.  On the pentium,
this instruction takes 4 cycles so is not worth using but on the
pentium pro it appears it may be worth while.  I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the intel documentation is wrong.  The
mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c, since it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
pentiums.  It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second.  He had a previous version at 250,000 and the best
I was able to get was 203,000.  The content has not changed; this is all
due to instruction sequencing (and actual instruction choice), which is able
to keep both functional units of the pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the pentium they have been replaced by evil instruction ordering tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
Raw DES at 281,000 per second on a pentium 100.
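
To put that last figure in perspective, here is a back-of-envelope sketch.
It assumes "pentium 100" means a 100 MHz clock and that each raw DES call
processes one 8-byte block; cycles_per_block() and bytes_per_sec() are
hypothetical helper names, not part of libdes.

```c
/* Clock cycles spent per DES block at a given call rate. */
double cycles_per_block(double clock_hz, double blocks_per_sec)
{
	return clock_hz / blocks_per_sec;
}

/* Data rate, given that one DES block is 64 bits (8 bytes). */
double bytes_per_sec(double blocks_per_sec)
{
	return blocks_per_sec * 8.0;
}
```

Under those assumptions, 281,000 calls per second works out to roughly 356
cycles per block and 2,248,000 bytes per second.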