First up, let me say I don't like writing in assembler.  It is not portable,
is dependent on the particular CPU architecture release and is generally a pig
to debug and get right.  Having said that, the x86 architecture is probably
the most important for speed, due to the number of boxes and because
it appears to be the worst architecture to get
good C compilers for.  So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c.  Replace the following
4 functions:
void des_encrypt1(DES_LONG data[2], des_key_schedule ks, int encrypt);
void des_encrypt2(DES_LONG data[2], des_key_schedule ks, int encrypt);
void des_encrypt3(DES_LONG data[2], des_key_schedule ks1,
	des_key_schedule ks2, des_key_schedule ks3);
void des_decrypt3(DES_LONG data[2], des_key_schedule ks1,
	des_key_schedule ks2, des_key_schedule ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules.  The only difference between the 4 functions is that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES) and des_encrypt3() and des_decrypt3()
perform triple DES.  The triple DES routines are in here because it does
make a big difference to have them located near the des_encrypt2 function
at link time.
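A minimal sketch of why skipping IP() and FP() is safe between the passes of
triple DES: FP is the inverse of IP, so the FP that would end one pass and the
IP that would start the next cancel each other.  This is an illustration only,
not libdes code.  It uses the standard DES IP table on a single 64-bit value,
and the helper names (permute, invert, ip_fp_cancel) are made up for the
example.

```c
#include <assert.h>
#include <stdint.h>

/* Standard DES initial permutation: output bit i, counting from 1 at
 * the most significant end, is input bit IP[i-1]. */
static const unsigned char IP[64] = {
    58, 50, 42, 34, 26, 18, 10, 2, 60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6, 64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17,  9, 1, 59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7
};

/* Hypothetical helper: apply a 64-entry bit-selection table. */
static uint64_t permute(uint64_t in, const unsigned char table[64])
{
    uint64_t out = 0;
    int i;

    for (i = 0; i < 64; i++) {
        uint64_t bit = (in >> (64 - table[i])) & 1u;
        out |= bit << (63 - i);
    }
    return out;
}

/* Hypothetical helper: build the inverse selection table. */
static void invert(const unsigned char table[64], unsigned char inv[64])
{
    int i;

    for (i = 0; i < 64; i++) {
        assert(table[i] >= 1 && table[i] <= 64);
        inv[table[i] - 1] = (unsigned char)(i + 1);
    }
}

/* Returns 1 if FP undoes IP and IP undoes FP for the block x. */
static int ip_fp_cancel(uint64_t x)
{
    unsigned char FP[64];

    invert(IP, FP);
    return permute(permute(x, IP), FP) == x &&
           permute(permute(x, FP), IP) == x;
}
```

Since ip_fp_cancel() should return 1 for any block, doing IP once, three
des_encrypt2()-style passes and FP once gives the same answer as three full
passes, with four fewer permutations.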

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as other people's.
The 4 main formats I know of are
Microsoft	Windows 95/Windows NT
Elf		Includes Linux and FreeBSD(?).
a.out		The older Linux.
Solaris		Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.  This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

Now the ugly part.  I acquired my copy of Intel's
"Optimizations for Intel's 32-Bit Processors" and found a few interesting
things.  First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup.  This involves getting the byte from
the 4 locations in the word and moving it to a new word and doing the lookup.
The most obvious way to do this is
xor	eax,	eax				# clear word
mov	al,	cl				# get low byte
xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in word
mov	al,	ch				# get next byte
xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in word
shr	ecx,	16
which seems ok.  For the Pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.
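For anyone who finds C easier to read than assembler, the sequence above can
be sketched roughly as follows.  This is a model, not libdes code: tbl stands
in for des_SP viewed as a flat array of 32-bit words, and the function name is
made up.  It also assumes each byte of the word already holds a 6-bit index
shifted left by 2 (for example because the word was masked with 0xfcfcfcfc
beforehand), which the fragment above does not show.

```c
#include <assert.h>
#include <stdint.h>

/*
 * tbl models des_SP as 8 tables of 64 32-bit entries, so each table is
 * 0x100 bytes long and 0x100/0x300 select tables 1 and 3.
 * Assumption: every byte of *w is a multiple of 4 and can be used
 * directly as a byte offset into a table.
 */
static uint32_t lookup_low_two_bytes(const uint32_t *tbl, uint32_t *w)
{
    uint32_t acc = 0;
    uint32_t off;

    off = *w & 0xffu;                   /* xor eax,eax ; mov al,cl    */
    assert((off & 3u) == 0);
    acc ^= tbl[(0x100u + off) / 4];     /* xor edi,0x100+des_SP[eax]  */

    off = (*w >> 8) & 0xffu;            /* mov al,ch                  */
    assert((off & 3u) == 0);
    acc ^= tbl[(0x300u + off) / 4];     /* xor edi,0x300+des_SP[eax]  */

    *w >>= 16;                          /* shr ecx,16                 */
    return acc;
}
```

The return value plays the part of what gets xored into edi, and the two
remaining bytes are left in the low half of *w ready for the same treatment.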

Now the crunch.  When a full register is used after a partial write, e.g.
mov	al,	cl
xor	edi,	DWORD PTR 0x100+des_SP[eax]
386	- 1 cycle stall
486	- 1 cycle stall
586	- 0 cycle stall
686	- at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a Pentium will, according
to the documentation, produce hideous results on a Pentium Pro.

To get around this, des686.pl will generate code that is not as fast on
a Pentium but should be very good on a Pentium Pro.
mov	eax,	ecx				# copy word 
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in array lookup
mov	eax,	ecx				# get word
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in array lookup

Due to the execution units in the Pentium, this actually works quite well.
For a Pentium Pro it should be very good.  This is the type of output
Visual C++ generates.
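In C terms, the des686.pl style above only ever touches whole registers: copy
the word, shift the original along, mask the copy, look up.  A rough sketch,
again a model rather than libdes code, with tbl standing in for des_SP as a
flat array of 32-bit words and a made-up function name:

```c
#include <stdint.h>

/*
 * tbl models des_SP as 8 tables of 64 32-bit entries (0x100 bytes each).
 * No partial register writes are needed: the 0xfc mask both isolates the
 * byte and keeps the offset a multiple of 4.
 */
static uint32_t lookup_shift_mask(const uint32_t *tbl, uint32_t *w)
{
    uint32_t acc = 0;
    uint32_t off;

    off = *w;                           /* mov eax,ecx               */
    *w >>= 8;                           /* shr ecx,8                 */
    off &= 0xfcu;                       /* and eax,0fch              */
    acc ^= tbl[(0x100u + off) / 4];     /* xor edi,0x100+des_SP[eax] */

    off = *w;                           /* mov eax,ecx               */
    *w >>= 8;                           /* shr ecx,8                 */
    off &= 0xfcu;                       /* and eax,0fch              */
    acc ^= tbl[(0x300u + off) / 4];     /* xor edi,0x300+des_SP[eax] */

    return acc;
}
```

Unlike the byte-register version, this one does not need the word to be masked
in advance, at the cost of an extra mov and and per lookup.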

There is a third option.  Instead of using
mov	al,	ch
which is bad on the Pentium Pro, one may be able to use
movzx	eax,	ch
which may not incur the partial write penalty.  On the Pentium,
this instruction takes 4 cycles so is not worth using, but on the
Pentium Pro it appears it may be worthwhile.  I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on Pentium
Pros and it appears that the Intel documentation is wrong.  The
mov al,bh is still faster on a Pentium Pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c; it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
Pentiums.  It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ECB DES calls at
273,000 per second.  He had a previous version at 250,000 and the best
I was able to get was 203,000.  The content has not changed; this is all
due to instruction sequencing (and actual instruction choice) which is able
to keep both functional units of the Pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the Pentium they have been replaced by evil instruction ordering tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
raw DES at 281,000 per second on a Pentium 100.
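To put that figure in perspective, here is a back-of-envelope sketch, taking
"Pentium 100" to mean a 100 MHz clock (an assumption on my part).  At 281,000
calls per second that is roughly 356 cycles per raw DES call, or about 22
cycles for each of the 16 rounds, and about 2.2 million bytes per second.
The function names are made up for the example.

```c
/* Cycles spent per DES call at a given clock rate. */
static double cycles_per_call(double clock_hz, double calls_per_sec)
{
    return clock_hz / calls_per_sec;
}

/* Throughput in bytes per second; one raw DES call handles 8 bytes. */
static double bytes_per_sec(double calls_per_sec)
{
    return calls_per_sec * 8.0;
}
```

For example, cycles_per_call(100e6, 281000.0) gives the per-call cycle count
quoted above.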