#!/usr/bin/env perl
#
# ====================================================================
# Written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# July 2004
#
# 2.22x RC4 tune-up:-) It should be noted though that my hand [as in
# "hand-coded assembler"] doesn't stand for the whole improvement
# coefficient. It turned out that eliminating RC4_CHAR from config
# line results in ~40% improvement (yes, even for C implementation).
# Presumably it has everything to do with AMD cache architecture and
# RAW or whatever penalties. Once again! The module *requires* config
# line *without* RC4_CHAR! As for coding "secret," I bet on partial
# register arithmetics. For example instead of 'inc %r8; and $255,%r8'
# I simply 'inc %r8b'. Even though optimization manual discourages
# operating on partial registers, it turned out to be the best bet.
# At least for AMD... How IA32E would perform remains to be seen...

# November 2004
#
# As was shown by Marc Bevand reordering of a couple of load operations
# results in even higher performance gain of 3.3x:-) At least on
# Opteron... For reference, 1x in this case is RC4_CHAR C-code
# compiled with gcc 3.3.2, which performs at ~54MBps per 1GHz clock.
# The latter means that if you want to *estimate* what to expect from
# *your* Opteron, then multiply 54 by 3.3 and clock frequency in GHz.

# November 2004
#
# Intel P4 EM64T core was found to run the AMD64 code really slowly...
# The only way to achieve comparable performance on P4 was to keep
# RC4_CHAR. Kind of ironic, huh? As it's apparently impossible to
# compose blended code, which would perform even within a 30% margin
# on either AMD or Intel platforms, I implement both cases. See
# rc4_skey.c for further details...

# April 2005
#
# P4 EM64T core appears to be "allergic" to 64-bit inc/dec. Replacing
# those with add/sub results in 50% performance improvement of folded
# loop...

# May 2005
#
# As was shown by Zou Nanhai loop unrolling can improve Intel EM64T
# performance by >30% [unlike P4 32-bit case that is]. But this is
# provided that loads are reordered even more aggressively! Both code
# paths, AMD64 and EM64T, reorder loads in essentially same manner
# as my IA-64 implementation. On Opteron this resulted in modest 5%
# improvement [I had to test it], while final Intel P4 performance
# achieves respectable 432MBps on 2.8GHz processor now. For reference,
# if executed on Xeon, current RC4_CHAR code-path is 2.7x faster than
# RC4_INT code-path. While if executed on Opteron, it's only 25%
# slower than the RC4_INT one [meaning that if CPU µ-arch detection
# is not implemented, then this final RC4_CHAR code-path should be
# preferred, as it provides better *all-round* performance].

# March 2007
#
# Intel Core2 was observed to perform poorly on both code paths:-( It
# apparently suffers from some kind of partial register stall, which
# occurs in 64-bit mode only [as virtually identical 32-bit loop was
# observed to outperform 64-bit one by almost 50%]. Adding two movzb to
# cloop1 boosts its performance by 80%! This loop appears to be optimal
# fit for Core2 and therefore the code was modified to skip cloop8 on
# this CPU.

# May 2010
#
# Intel Westmere was observed to perform suboptimally. Adding yet
# another movzb to cloop1 improved performance by almost 50%! Core2
# performance is improved too, but nominally...

# May 2011
#
# The only code path that was not modified is the P4-specific one. Non-P4
# Intel code path optimization is heavily based on a submission by Maxim
# Perminov, Maxim Locktyukhin and Jim Guilford of Intel. I've used
# some of the ideas even in an attempt to optimize the original RC4_INT
# code path... Current performance in cycles per processed byte (less
# is better) and improvement coefficients relative to previous
# version of this module are:
#
# Opteron	5.3/+0%(*)
# P4		6.5
# Core2		6.2/+15%(**)
# Westmere	4.2/+60%
# Sandy Bridge	4.2/+120%
# Atom		9.3/+80%
#
# (*)	But corresponding loop has fewer instructions, which should have
#	positive effect on upcoming Bulldozer, which has one less ALU.
#	For reference, Intel code runs at 6.8 cpb rate on Opteron.
# (**)	Note that Core2 result is ~15% lower than corresponding result
#	for 32-bit code, meaning that it's possible to improve it,
#	but more than likely at the cost of the others (see rc4-586.pl
#	to get the idea)...

$flavour = shift;
$output  = shift;
if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }

$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
die "can't locate x86_64-xlate.pl";

open OUT,"| \"$^X\" $xlate $flavour $output";
*STDOUT=*OUT;

$dat="%rdi";	    # arg1
$len="%rsi";	    # arg2
$inp="%rdx";	    # arg3
$out="%rcx";	    # arg4

{
$code=<<___;
.text
.extern	OPENSSL_ia32cap_P

.globl	RC4
.type	RC4,\@function,4
.align	16
RC4:	or	$len,$len
	jne	.Lentry
	ret
.Lentry:
	push	%rbx
	push	%r12
	push	%r13
.Lprologue:
	mov	$len,%r11
	mov	$inp,%r12
	mov	$out,%r13
___
my $len="%r11";		# reassign input arguments
my $inp="%r12";
my $out="%r13";

my @XX=("%r10","%rsi");
my @TX=("%rax","%rbx");
my $YY="%rcx";
my $TY="%rdx";

$code.=<<___;
	xor	$XX[0],$XX[0]
	xor	$YY,$YY

	lea	8($dat),$dat
	mov	-8($dat),$XX[0]#b
	mov	-4($dat),$YY#b
	cmpl	\$-1,256($dat)
	je	.LRC4_CHAR
	mov	OPENSSL_ia32cap_P(%rip),%r8d
	xor	$TX[1],$TX[1]
	inc	$XX[0]#b
	sub	$XX[0],$TX[1]
	sub	$inp,$out
	movl	($dat,$XX[0],4),$TX[0]#d
	test	\$-16,$len
	jz	.Lloop1
	bt	\$30,%r8d	# Intel CPU?
	jc	.Lintel
	and	\$7,$TX[1]
	lea	1($XX[0]),$XX[1]
	jz	.Loop8
	sub	$TX[1],$len
.Loop8_warmup:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$TX[1]
	jnz	.Loop8_warmup

	lea	1($XX[0]),$XX[1]
	jmp	.Loop8
.align	16
.Loop8:
___
for ($i=0;$i<8;$i++) {
$code.=<<___ if ($i==7);
	add	\$8,$XX[1]#b
___
$code.=<<___;
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	`4*($i==7?-1:$i)`($dat,$XX[1],4),$TX[1]#d
	ror	\$8,%r8				# ror is redundant when $i=0
	movl	$TY#d,4*$i($dat,$XX[0],4)
	add	$TX[0]#b,$TY#b
	movb	($dat,$TY,4),%r8b
___
push(@TX,shift(@TX)); #push(@XX,shift(@XX));	# "rotate" registers
}
$code.=<<___;
	add	\$8,$XX[0]#b
	ror	\$8,%r8
	sub	\$8,$len

	xor	($inp),%r8
	mov	%r8,($out,$inp)
	lea	8($inp),$inp

	test	\$-8,$len
	jnz	.Loop8
	cmp	\$0,$len
	jne	.Lloop1
	jmp	.Lexit

.align	16
.Lintel:
	test	\$-32,$len
	jz	.Lloop1
	and	\$15,$TX[1]
	jz	.Loop16_is_hot
	sub	$TX[1],$len
.Loop16_warmup:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$TX[1]
	jnz	.Loop16_warmup

	mov	$YY,$TX[1]
	xor	$YY,$YY
	mov	$TX[1]#b,$YY#b

.Loop16_is_hot:
	lea	($dat,$XX[0],4),$XX[1]
___
sub RC4_loop {
  my $i=shift;
  my $j=$i<0?0:$i;
  my $xmm="%xmm".($j&1);

    $code.="	add	\$16,$XX[0]#b\n"		if ($i==15);
    $code.="	movdqu	($inp),%xmm2\n"			if ($i==15);
    $code.="	add	$TX[0]#b,$YY#b\n"		if ($i<=0);
    $code.="	movl	($dat,$YY,4),$TY#d\n";
    $code.="	pxor	%xmm0,%xmm2\n"			if ($i==0);
    $code.="	psllq	\$8,%xmm1\n"			if ($i==0);
    $code.="	pxor	$xmm,$xmm\n"			if ($i<=1);
    $code.="	movl	$TX[0]#d,($dat,$YY,4)\n";
    $code.="	add	$TY#b,$TX[0]#b\n";
    $code.="	movl	`4*($j+1)`($XX[1]),$TX[1]#d\n"	if ($i<15);
    $code.="	movz	$TX[0]#b,$TX[0]#d\n";
    $code.="	movl	$TY#d,4*$j($XX[1])\n";
    $code.="	pxor	%xmm1,%xmm2\n"			if ($i==0);
    $code.="	lea	($dat,$XX[0],4),$XX[1]\n"	if ($i==15);
    $code.="	add	$TX[1]#b,$YY#b\n"		if ($i<15);
    $code.="	pinsrw	\$`($j>>1)&7`,($dat,$TX[0],4),$xmm\n";
    $code.="	movdqu	%xmm2,($out,$inp)\n"		if ($i==0);
    $code.="	lea	16($inp),$inp\n"		if ($i==0);
    $code.="	movl	($XX[1]),$TX[1]#d\n"		if ($i==15);
}
	RC4_loop(-1);
$code.=<<___;
	jmp	.Loop16_enter
.align	16
.Loop16:
___

for ($i=0;$i<16;$i++) {
    $code.=".Loop16_enter:\n"		if ($i==1);
	RC4_loop($i);
	push(@TX,shift(@TX)); 		# "rotate" registers
}
$code.=<<___;
	mov	$YY,$TX[1]
	xor	$YY,$YY			# keyword to partial register
	sub	\$16,$len
	mov	$TX[1]#b,$YY#b
	test	\$-16,$len
	jnz	.Loop16

	psllq	\$8,%xmm1
	pxor	%xmm0,%xmm2
	pxor	%xmm1,%xmm2
	movdqu	%xmm2,($out,$inp)
	lea	16($inp),$inp

	cmp	\$0,$len
	jne	.Lloop1
	jmp	.Lexit

.align	16
.Lloop1:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$len
	jnz	.Lloop1
	jmp	.Lexit

.align	16
.LRC4_CHAR:
	add	\$1,$XX[0]#b
	movzb	($dat,$XX[0]),$TX[0]#d
	test	\$-8,$len
	jz	.Lcloop1
	jmp	.Lcloop8
.align	16
.Lcloop8:
	mov	($inp),%r8d
	mov	4($inp),%r9d
___
# unroll 2x4-wise, because 64-bit rotates kill Intel P4...
for ($i=0;$i<4;$i++) {
$code.=<<___;
	add	$TX[0]#b,$YY#b
	lea	1($XX[0]),$XX[1]
	movzb	($dat,$YY),$TY#d
	movzb	$XX[1]#b,$XX[1]#d
	movzb	($dat,$XX[1]),$TX[1]#d
	movb	$TX[0]#b,($dat,$YY)
	cmp	$XX[1],$YY
	movb	$TY#b,($dat,$XX[0])
	jne	.Lcmov$i			# Intel cmov is sloooow...
	mov	$TX[0],$TX[1]
.Lcmov$i:
	add	$TX[0]#b,$TY#b
	xor	($dat,$TY),%r8b
	ror	\$8,%r8d
___
push(@TX,shift(@TX)); push(@XX,shift(@XX));	# "rotate" registers
}
for ($i=4;$i<8;$i++) {
$code.=<<___;
	add	$TX[0]#b,$YY#b
	lea	1($XX[0]),$XX[1]
	movzb	($dat,$YY),$TY#d
	movzb	$XX[1]#b,$XX[1]#d
	movzb	($dat,$XX[1]),$TX[1]#d
	movb	$TX[0]#b,($dat,$YY)
	cmp	$XX[1],$YY
	movb	$TY#b,($dat,$XX[0])
	jne	.Lcmov$i			# Intel cmov is sloooow...
	mov	$TX[0],$TX[1]
.Lcmov$i:
	add	$TX[0]#b,$TY#b
	xor	($dat,$TY),%r9b
	ror	\$8,%r9d
___
push(@TX,shift(@TX)); push(@XX,shift(@XX));	# "rotate" registers
}
$code.=<<___;
	lea	-8($len),$len
	mov	%r8d,($out)
	lea	8($inp),$inp
	mov	%r9d,4($out)
	lea	8($out),$out

	test	\$-8,$len
	jnz	.Lcloop8
	cmp	\$0,$len
	jne	.Lcloop1
	jmp	.Lexit
___
$code.=<<___;
.align	16
.Lcloop1:
	add	$TX[0]#b,$YY#b
	movzb	$YY#b,$YY#d
	movzb	($dat,$YY),$TY#d
	movb	$TX[0]#b,($dat,$YY)
	movb	$TY#b,($dat,$XX[0])
	add	$TX[0]#b,$TY#b
	add	\$1,$XX[0]#b
	movzb	$TY#b,$TY#d
	movzb	$XX[0]#b,$XX[0]#d
	movzb	($dat,$TY),$TY#d
	movzb	($dat,$XX[0]),$TX[0]#d
	xorb	($inp),$TY#b
	lea	1($inp),$inp
	movb	$TY#b,($out)
	lea	1($out),$out
	sub	\$1,$len
	jnz	.Lcloop1
	jmp	.Lexit

.align	16
.Lexit:
	sub	\$1,$XX[0]#b
	movl	$XX[0]#d,-8($dat)
	movl	$YY#d,-4($dat)

	mov	(%rsp),%r13
	mov	8(%rsp),%r12
	mov	16(%rsp),%rbx
	add	\$24,%rsp
.Lepilogue:
	ret
.size	RC4,.-RC4
___
}

$idx="%r8";
$ido="%r9";

$code.=<<___;
.globl	private_RC4_set_key
.type	private_RC4_set_key,\@function,3
.align	16
private_RC4_set_key:
	lea	8($dat),$dat
	lea	($inp,$len),$inp
	neg	$len
	mov	$len,%rcx
	xor	%eax,%eax
	xor	$ido,$ido
	xor	%r10,%r10
	xor	%r11,%r11

	mov	OPENSSL_ia32cap_P(%rip),$idx#d
	bt	\$20,$idx#d	# RC4_CHAR?
	jc	.Lc1stloop
	jmp	.Lw1stloop

.align	16
.Lw1stloop:
	mov	%eax,($dat,%rax,4)
	add	\$1,%al
	jnc	.Lw1stloop

	xor	$ido,$ido
	xor	$idx,$idx
.align	16
.Lw2ndloop:
	mov	($dat,$ido,4),%r10d
	add	($inp,$len,1),$idx#b
	add	%r10b,$idx#b
	add	\$1,$len
	mov	($dat,$idx,4),%r11d
	cmovz	%rcx,$len
	mov	%r10d,($dat,$idx,4)
	mov	%r11d,($dat,$ido,4)
	add	\$1,$ido#b
	jnc	.Lw2ndloop
	jmp	.Lexit_key

.align	16
.Lc1stloop:
	mov	%al,($dat,%rax)
	add	\$1,%al
	jnc	.Lc1stloop

	xor	$ido,$ido
	xor	$idx,$idx
.align	16
.Lc2ndloop:
	mov	($dat,$ido),%r10b
	add	($inp,$len),$idx#b
	add	%r10b,$idx#b
	add	\$1,$len
	mov	($dat,$idx),%r11b
	jnz	.Lcnowrap
	mov	%rcx,$len
.Lcnowrap:
	mov	%r10b,($dat,$idx)
	mov	%r11b,($dat,$ido)
	add	\$1,$ido#b
	jnc	.Lc2ndloop
	movl	\$-1,256($dat)

.align	16
.Lexit_key:
	xor	%eax,%eax
	mov	%eax,-8($dat)
	mov	%eax,-4($dat)
	ret
.size	private_RC4_set_key,.-private_RC4_set_key

.globl	RC4_options
.type	RC4_options,\@abi-omnipotent
.align	16
RC4_options:
	lea	.Lopts(%rip),%rax
	mov	OPENSSL_ia32cap_P(%rip),%edx
	bt	\$20,%edx
	jc	.L8xchar
	bt	\$30,%edx
	jnc	.Ldone
	add	\$25,%rax
	ret
.L8xchar:
	add	\$12,%rax
.Ldone:
	ret
.align	64
.Lopts:
.asciz	"rc4(8x,int)"
.asciz	"rc4(8x,char)"
.asciz	"rc4(16x,int)"
.asciz	"RC4 for x86_64, CRYPTOGAMS by <appro\@openssl.org>"
.align	64
.size	RC4_options,.-RC4_options
___

# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
#		CONTEXT *context,DISPATCHER_CONTEXT *disp)
if ($win64) {
$rec="%rcx";
$frame="%rdx";
$context="%r8";
$disp="%r9";

$code.=<<___;
.extern	__imp_RtlVirtualUnwind
.type	stream_se_handler,\@abi-omnipotent
.align	16
stream_se_handler:
	push	%rsi
	push	%rdi
	push	%rbx
	push	%rbp
	push	%r12
	push	%r13
	push	%r14
	push	%r15
	pushfq
	sub	\$64,%rsp

	mov	120($context),%rax	# pull context->Rax
	mov	248($context),%rbx	# pull context->Rip

	lea	.Lprologue(%rip),%r10
	cmp	%r10,%rbx		# context->Rip<prologue label
	jb	.Lin_prologue

	mov	152($context),%rax	# pull context->Rsp

	lea	.Lepilogue(%rip),%r10
	cmp	%r10,%rbx		# context->Rip>=epilogue label
	jae	.Lin_prologue

	lea	24(%rax),%rax

	mov	-8(%rax),%rbx
	mov	-16(%rax),%r12
	mov	-24(%rax),%r13
	mov	%rbx,144($context)	# restore context->Rbx
	mov	%r12,216($context)	# restore context->R12
	mov	%r13,224($context)	# restore context->R13

.Lin_prologue:
	mov	8(%rax),%rdi
	mov	16(%rax),%rsi
	mov	%rax,152($context)	# restore context->Rsp
	mov	%rsi,168($context)	# restore context->Rsi
	mov	%rdi,176($context)	# restore context->Rdi

	jmp	.Lcommon_seh_exit
.size	stream_se_handler,.-stream_se_handler

.type	key_se_handler,\@abi-omnipotent
.align	16
key_se_handler:
	push	%rsi
	push	%rdi
	push	%rbx
	push	%rbp
	push	%r12
	push	%r13
	push	%r14
	push	%r15
	pushfq
	sub	\$64,%rsp

	mov	152($context),%rax	# pull context->Rsp
	mov	8(%rax),%rdi
	mov	16(%rax),%rsi
	mov	%rsi,168($context)	# restore context->Rsi
	mov	%rdi,176($context)	# restore context->Rdi

.Lcommon_seh_exit:

	mov	40($disp),%rdi		# disp->ContextRecord
	mov	$context,%rsi		# context
	mov	\$154,%ecx		# sizeof(CONTEXT)
	.long	0xa548f3fc		# cld; rep movsq

	mov	$disp,%rsi
	xor	%rcx,%rcx		# arg1, UNW_FLAG_NHANDLER
	mov	8(%rsi),%rdx		# arg2, disp->ImageBase
	mov	0(%rsi),%r8		# arg3, disp->ControlPc
	mov	16(%rsi),%r9		# arg4, disp->FunctionEntry
	mov	40(%rsi),%r10		# disp->ContextRecord
	lea	56(%rsi),%r11		# &disp->HandlerData
	lea	24(%rsi),%r12		# &disp->EstablisherFrame
	mov	%r10,32(%rsp)		# arg5
	mov	%r11,40(%rsp)		# arg6
	mov	%r12,48(%rsp)		# arg7
	mov	%rcx,56(%rsp)		# arg8, (NULL)
	call	*__imp_RtlVirtualUnwind(%rip)

	mov	\$1,%eax		# ExceptionContinueSearch
	add	\$64,%rsp
	popfq
	pop	%r15
	pop	%r14
	pop	%r13
	pop	%r12
	pop	%rbp
	pop	%rbx
	pop	%rdi
	pop	%rsi
	ret
.size	key_se_handler,.-key_se_handler

.section	.pdata
.align	4
	.rva	.LSEH_begin_RC4
	.rva	.LSEH_end_RC4
	.rva	.LSEH_info_RC4

	.rva	.LSEH_begin_private_RC4_set_key
	.rva	.LSEH_end_private_RC4_set_key
	.rva	.LSEH_info_private_RC4_set_key

.section	.xdata
.align	8
.LSEH_info_RC4:
	.byte	9,0,0,0
	.rva	stream_se_handler
.LSEH_info_private_RC4_set_key:
	.byte	9,0,0,0
	.rva	key_se_handler
___
}

sub reg_part {
my ($reg,$conv)=@_;
    if ($reg =~ /%r[0-9]+/)	{ $reg .= $conv; }
    elsif ($conv eq "b")	{ $reg =~ s/%[er]([^x]+)x?/%$1l/; }
    elsif ($conv eq "w")	{ $reg =~ s/%[er](.+)/%$1/; }
    elsif ($conv eq "d")	{ $reg =~ s/%[er](.+)/%e$1/; }
    return $reg;
}

$code =~ s/(%[a-z0-9]+)#([bwd])/reg_part($1,$2)/gem;
$code =~ s/\`([^\`]*)\`/eval $1/gem;

print $code;

close STDOUT;