From dean@arctic.org Tue Dec 28 02:54:54 2004
X-Original-To: openssl-dev@openssl.org
Date: Mon, 27 Dec 2004 18:54:11 -0800 (PST)
From: dean gaudet <dean@arctic.org>
To: openssl-dev@openssl.org
Cc: crypt@bis.doc.gov
Subject: aes improvements (TSU NOTIFICATION)
Reply-To: openssl-dev@openssl.org
X-List-Manager: OpenSSL Majordomo [version 1.94.5]
X-List-Name: openssl-dev

On Thu, 23 Dec 2004, Andy Polyakov wrote:

> aes-586.pl module is committed to CVS now [see
> http://cvs.openssl.org/rlog?f=openssl/crypto/aes/asm/aes-586.pl]. Take
> "Special note about instruction choice" in commentary section for
> consideration even for AMD64. Merry Christmas to everybody:-) A.

hmmm... i seem to have done better by switching back to scaling :)

with the patch below i'm getting the following throughput improvements for 
aes-128-cbc 8192B buffer:

                     patch delta

        p4-2            + 3.8%
        p4-3            +11%
        p-m             + 8.8%
        k8              +12%
        efficeon        + 4.3%

the code is 229 bytes smaller in $small_footprint=1 ... i didn't look to 
see how much smaller it is for the fully unrolled variety (i would assume 
1145 bytes or so).  unfortunately this space improvement is hidden by the 
alignment pain caused by the placement of AES_Te and AES_Td :)  i suggest 
moving both those tables to the top of the module so that their 64 byte 
alignment is taken care of once only.
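
to put a number on the alignment pain, here's a minimal C sketch (not part of the patch; `align_pad` is a made-up helper name, and the offsets in the usage note are hypothetical, not measured from the module).  it computes how many padding bytes are needed before a table that must start on a 64 byte boundary:

```c
#include <stddef.h>

/*
 * Padding bytes needed to bring `offset` up to the next multiple of
 * `align`.  `align` is assumed to be a power of two (64 here).
 */
static size_t align_pad(size_t offset, size_t align)
{
    return (align - (offset & (align - 1))) & (align - 1);
}
```

for example align_pad(229, 64) is 27 and align_pad(1, 64) is 63, so each aligned table that sits between pieces of code can cost anywhere from 0 to 63 bytes of padding.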

here's an updated comparison versus the gladman code -- this is in 
cycles/byte for 8192 byte buffer (smaller is better):

		       openssl w/patch
			small	large	gladman

	p4-2		31.7	26.1	  27.3
	p4-3		32.3	32.9	  18.7
	p-m		23.8	23.3	  16.9
	k8		21.8	21.5	  18.1
	efficeon	25.1	22.6	  17.8

damn the p4 is a weird beast -- notice how the gladman code is better 
everywhere except p4-2 ... and p4-3 gladman is nearly twice as good as the 
openssl code.  i'm a bit disappointed with efficeon but i know what the 
problems are (efficeon lacks native bswap, so your "1%" estimation on the 
bswaps is more painful for efficeon, and the loop could be rotated 
differently).  fixing that is a more significant effort -- so i figured 
i'd checkpoint by sending you a patch now.

i made the following changes:

o	shifts by 24 don't need to be followed by and $0xff:

	shr $22,%esi			-->	shr $24,%esi
	and $0x3fc,%esi				mov offs(%ebp,%esi,4),%esi
	mov offs(%ebp,%esi,1),%esi	

o	movzbl is 3 bytes shorter than and $imm:

	shr $14,%eax			-->	shr $16,%eax
	and $0x3fc,%eax				movzbl %al,%eax
	mov offs(%ebp,%eax,1),%eax		mov offs(%ebp,%eax,4),%eax

	there's no perf degradation by making this change (in fact it
	improves on p4-2, p-m and efficeon).  for consistency i made
	the same change to all the "and $0x3fc,%edi" ... unfortunately
	movzbl isn't an option there, but there was no negative perf
	impact anywhere with the change (and efficeon internally uses
	movzbl for this case so i'm slightly biased).

o	movzbl is 3 bytes shorter than and $imm (part 2):

	movl mem,%reg			-->	movzbl mem,%reg
	and $0xff,%reg

	this occurs several times loading from tables -- it's a space
	win and a perf win everywhere.
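
the index arithmetic behind the three changes above can be checked with a small C sketch.  this is an illustration only, not part of the patch: the function names are made up, the `* 4` stands in for the scale factor in `offs(%ebp,%reg,4)`, and the memory image is laid out little-endian by hand to mimic x86.

```c
#include <stdint.h>

/* old style: shift two bits short, mask with 0xFF<<2, result is a byte offset */
static uint32_t offset_prescaled(uint32_t x, unsigned shift)
{
    return (x >> (shift - 2)) & (0xFFu << 2);
}

/* new style: isolate the byte and let the addressing mode scale it by 4 */
static uint32_t offset_scaled(uint32_t x, unsigned shift)
{
    return ((x >> shift) & 0xFFu) * 4u;
}

/*
 * Returns 1 if both styles give the same table offset for the shifts
 * by 16 and 24, and if the shift by 24 already yields a clean byte
 * (so the "and" after it is redundant).
 */
static int offsets_agree(uint32_t x)
{
    return offset_prescaled(x, 24) == offset_scaled(x, 24)
        && offset_prescaled(x, 16) == offset_scaled(x, 16)
        && (x >> 24) == ((x >> 24) & 0xFFu);
}

/* models "movl mem,%reg; and $0xff,%reg" on a little-endian 4 byte image */
static uint32_t load_dword_and_mask(const unsigned char *mem)
{
    uint32_t v = (uint32_t)mem[0] | ((uint32_t)mem[1] << 8)
               | ((uint32_t)mem[2] << 16) | ((uint32_t)mem[3] << 24);
    return v & 0xFFu;
}

/* models "movzbl mem,%reg": a zero-extending byte load */
static uint32_t load_byte_zx(const unsigned char *mem)
{
    return mem[0];
}
```

offsets_agree() should return 1 for any 32-bit input, and the two load helpers should return the same value for any 4 byte image, which is why the shorter encodings are safe substitutions.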

o	used a "gladman trick" on %edx in encode because it was easy
	enough -- during encode we finish with the low half of %edx
	before we need the high half, so after finishing with the low
	half i inserted a "shr $16,%edx" which lets us use movzbl %dl/%dh
	to get the top two bytes (similarly for %ecx during decode).

	i think i gave up a bit on p4-2 with this step, but i figured
	it was worth it because it helped everywhere else and p4-2 has
	been superseded by p4-3.  plus this transform saves code space.

	it's not easy to transform the other 3 registers in this way 
	without major surgery around loop edges ... which will have to
	wait for another rainy day.
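
the "gladman trick" can be modelled in C as below.  again this is only a sketch under stated assumptions: `reg_lb`, `reg_hb` and `split_bytes` are hypothetical names standing in for `movzbl %dl`, `movzbl %dh` and the order in which the bytes get consumed, and the real code interleaves these extractions with table lookups.

```c
#include <stdint.h>

/* models "movzbl %dl,%reg": bits 0-7 of the register */
static uint32_t reg_lb(uint32_t r) { return r & 0xFFu; }

/* models "movzbl %dh,%reg": bits 8-15 of the register */
static uint32_t reg_hb(uint32_t r) { return (r >> 8) & 0xFFu; }

/*
 * Pull all four bytes out of one register: use the low half first,
 * then a single shift by 16 exposes the top two bytes through the
 * same low/high byte accessors.
 */
static void split_bytes(uint32_t edx, uint32_t out[4])
{
    out[0] = reg_lb(edx);
    out[1] = reg_hb(edx);
    edx >>= 16;               /* the inserted "shr $16,%edx" */
    out[2] = reg_lb(edx);
    out[3] = reg_hb(edx);
}
```

the shift destroys the low half, which is why it only works once the low two bytes are no longer needed.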

-dean

SUBMISSION TYPE: TSU
SUBMITTED BY: dean gaudet
SUBMITTED FOR: dean gaudet
POINT OF CONTACT: dean@arctic.org
PHONE and/or FAX: (408) 919-3086
MANUFACTURER: openssl
PRODUCT NAME/MODEL #: 0.9.8-dev
ECCN: 5D002

Index: crypto/aes/asm/aes-586.pl
===================================================================
RCS file: /home/dean/openssl/Repository/openssl/crypto/aes/asm/aes-586.pl,v
retrieving revision 1.1
diff -u -r1.1 aes-586.pl
--- crypto/aes/asm/aes-586.pl	23 Dec 2004 21:32:34 -0000	1.1
+++ crypto/aes/asm/aes-586.pl	28 Dec 2004 02:31:39 -0000
@@ -4,6 +4,8 @@
 # Written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
 # project. Rights for redistribution and usage in source and binary
 # forms are granted according to the OpenSSL license.
+#
+# Additional contributions by dean gaudet <dean@arctic.org>.
 # ====================================================================
 #
 # You might fail to appreciate this module performance from the first
@@ -15,15 +17,6 @@
 # more than *twice* as fast! Yes, all this buzz about PIC means that
 # [unlike other implementations] this module was explicitly designed
 # to be safe to use even in shared library context...
-#
-# Special note about instruction choice. Do you recall RC4_INT code
-# performing poorly on P4? It might be the time to figure out why.
-# RC4_INT code implies effective address calculations in base+offset*4
-# form. Trouble is that it seems that offset scaling turned to be
-# critical path... At least eliminating scaling resulted in 2.8x RC4
-# performance improvement [as you might recall]. As AES code is hungry
-# for scaling too, I [try to] avoid the latter by favoring off-by-2
-# shifts and masking the result with 0xFF<<2 instead of "boring" 0xFF.
 
 push(@INC,"perlasm","../../perlasm");
 require "x86asm.pl";
@@ -41,26 +34,30 @@
 { my ($i,$te,@s) = @_;
   my $tmp,$out;
 
-	if ($i==3)  {	$out=$s[0]; &mov ("edi",&DWP(12,"esp"));}
-	else        {	$out="esi"; &mov ($out,$s[0]);		}
-			&shr	($out,24-2);
-			&and	($out,0xFF<<2);
-			&mov	($out,&DWP(1024*0,$te,$out));
-
-	if ($i==3)  {	$tmp=$s[1];				}
-	else        {	$tmp="edi"; &mov ($tmp,$s[1]);		}
-			&shr	($tmp,16-2);
-			&and	($tmp,0xFF<<2);
-			&xor	($out,&DWP(1024*1,$te,$tmp));
+	if ($i==3)  {	$out=$s[0]; &movz($out,&HB($s[0]));
+			&mov	("edi",&DWP(12,"esp"));		}
+	else        {	$out="esi"; &mov ($out,$s[0]);
+			&shr	($out,24);			}
+			&mov	($out,&DWP(1024*0,$te,$out,4));
+
+	if ($i==2)  {	$tmp="edi"; &movz ($tmp,&LB($s[1]));	}
+	elsif ($i==3){	$tmp=$s[1];
+			&shr	($tmp,16);
+			&movz	($tmp,&LB($tmp));		}
+	else        {	$tmp="edi"; &mov ($tmp,$s[1]);
+			&shr	($tmp,16);
+			&and	($tmp,0xFF);			}
+			&xor	($out,&DWP(1024*1,$te,$tmp,4));
 
 	if ($i==3)  {	$tmp=$s[2]; &mov ($s[1],&DWP(0,"esp"));	}
 	else        {	$tmp="edi";				}
 			&movz	($tmp,&HB($s[2]));
+	if ($i==1)  {	&shr	($s[2],16);			}
 			&xor	($out,&DWP(1024*2,$te,$tmp,4));
 
-	if ($i==3)  {	$tmp=$s[3]; &mov ($s[2],&DWP(4,"esp"));	}
-	else        {	$tmp="edi"; &mov ($tmp,$s[3]);		} 
-			&and	($tmp,0xFF);
+	if ($i==3)  {	$tmp=$s[3]; &movz ($tmp,&LB($s[3]));
+			&mov 	($s[2],&DWP(4,"esp"));		}
+	else        {	$tmp="edi"; &movz ($tmp,&LB($s[3]));	} 
 			&xor	($out,&DWP(1024*3,$te,$tmp,4));
 	if ($i<2)   {	&mov	(&DWP(4*$i,"esp"),$out);	}
 	if ($i==3)  {	&mov	($s[3],"esi");			}
@@ -70,33 +67,36 @@
 { my ($i,$te,@s)=@_;
   my $tmp,$out;
 
-	if ($i==3)  {	$out=$s[0]; &mov ("edi",&DWP(12,"esp"));}
-	else        {	$out="esi"; &mov ($out,$s[0]);		}
-			&shr	($out,24-2);
-			&and	($out,0xFF<<2);
-			&mov	($out,&DWP(0,$te,$out));
+	if ($i==3)  {	$out=$s[0]; &movz ($out,&HB($s[0]));
+			&mov ("edi",&DWP(12,"esp"));		}
+	else        {	$out="esi"; &mov ($out,$s[0]);
+			&shr	($out,24);			}
+			&mov	($out,&DWP(0,$te,$out,4));
 			&and	($out,0xff000000);
 
-	if ($i==3)  {	$tmp=$s[1];				}
-	else        {	$tmp="edi"; &mov ($tmp,$s[1]);		}
-			&shr	($tmp,16-2);
-			&and	($tmp,0xFF<<2);
-			&mov	($tmp,&DWP(0,$te,$tmp));
+	if ($i==2)  {	$tmp="edi"; &movz ($tmp,&LB($s[1]));	}
+	elsif ($i==3){	$tmp=$s[1];
+			&shr	($tmp,16);
+			&movz	($tmp,&LB($tmp));		}
+	else        {	$tmp="edi"; &mov ($tmp,$s[1]);
+			&shr	($tmp,16);
+			&and	($tmp,0xFF);			}
+			&mov	($tmp,&DWP(0,$te,$tmp,4));
 			&and	($tmp,0x00ff0000);
 			&xor	($out,$tmp);
 
 	if ($i==3)  {	$tmp=$s[2]; &mov ($s[1],&DWP(0,"esp"));	}
 	else        {	$tmp="edi"; 				}
 			&movz	($tmp,&HB($s[2]));
+	if ($i==1)  {	&shr	($s[2],16);			}
 			&mov	($tmp,&DWP(0,$te,$tmp,4));
 			&and	($tmp,0x0000ff00);
 			&xor	($out,$tmp);
 
-	if ($i==3)  {	$tmp=$s[3]; &mov ($s[2],&DWP(4,"esp"));	}
-	else        {	$tmp="edi"; &mov ($tmp,$s[3]);		} 
-			&and	($tmp,0xFF);
-			&mov	($tmp,&DWP(0,$te,$tmp,4));
-			&and	($tmp,0x000000ff);
+	if ($i==3)  {	$tmp=$s[3]; &movz ($tmp,&LB($s[3]));
+			&mov ($s[2],&DWP(4,"esp"));		}
+	else        {	$tmp="edi"; &movz ($tmp,&LB($s[3]));	} 
+			&movz	($tmp,&BP(0,$te,$tmp,4));
 			&xor	($out,$tmp);
 	if ($i<2)   {	&mov	(&DWP(4*$i,"esp"),$out);	}
 	if ($i==3)  {	&mov	($s[3],"esi");			}
@@ -565,26 +565,29 @@
 { my ($i,$td,@s) = @_;
   my $tmp,$out;
 
-	if ($i==3)  {	$out=$s[0]; &mov ("edi",&DWP(12,"esp"));}
-	else        {	$out="esi"; &mov ($out,$s[0]);		}
-			&shr	($out,24-2);
-			&and	($out,0xFF<<2);
-			&mov	($out,&DWP(1024*0,$td,$out));
-
-	if ($i==3)  {	$tmp=$s[1];				}
-	else        {	$tmp="edi"; &mov ($tmp,$s[1]);		}
-			&shr	($tmp,16-2);
-			&and	($tmp,0xFF<<2);
-			&xor	($out,&DWP(1024*1,$td,$tmp));
+	if ($i==2)  {	$out="esi"; &movz ($out,&HB($s[0]));	}
+	else {
+	  if ($i==3){	$out=$s[0]; &mov ("edi",&DWP(12,"esp"));}
+	  else      {	$out="esi"; &mov ($out,$s[0]);		}
+			&shr	($out,24);
+	}
+			&mov	($out,&DWP(1024*0,$td,$out,4));
+
+	if ($i==3)  {	$tmp=$s[1]; &movz ($tmp,&LB($s[1]));	}
+	else        {	$tmp="edi"; &mov ($tmp,$s[1]);
+			&shr	($tmp,16);
+			&and	($tmp,0xFF);			}
+			&xor	($out,&DWP(1024*1,$td,$tmp,4));
 
 	if ($i==3)  {	$tmp=$s[2]; &mov ($s[1],"esi");		}
 	else        {	$tmp="edi";				}
 			&movz	($tmp,&HB($s[2]));
 			&xor	($out,&DWP(1024*2,$td,$tmp,4));
 
-	if ($i==3)  {	$tmp=$s[3]; &mov ($s[2],&DWP(4,"esp"));	}
-	else        {	$tmp="edi"; &mov ($tmp,$s[3]);		} 
-			&and	($tmp,0xFF);
+	if ($i==3)  {	$tmp=$s[3]; &movz ($tmp,&LB($s[3]));
+			&mov	($s[2],&DWP(4,"esp"));		}
+	else        {	$tmp="edi"; &movz ($tmp,&LB($s[3]));	} 
+	if ($i==1)  {	&shr	($s[3],16);			}
 			&xor	($out,&DWP(1024*3,$td,$tmp,4));
 	if ($i<2)   {	&mov	(&DWP(4*$i,"esp"),$out);	}
 	if ($i==3)  {	&mov	($s[3],&DWP(0,"esp"));		}
@@ -594,18 +597,20 @@
 { my ($i,$td,@s)=@_;
   my $tmp,$out;
 
-	if ($i==3)  {	$out=$s[0]; &mov ("edi",&DWP(12,"esp"));}
-	else        {	$out="esi"; &mov ($out,$s[0]);		}
-			&shr	($out,24-2);
-			&and	($out,0xFF<<2);
-			&mov	($out,&DWP(0,$td,$out));
+	if ($i==2)  {	$out="esi"; &movz ($out,&HB($s[0]));	}
+	else {
+	  if ($i==3){	$out=$s[0]; &mov ("edi",&DWP(12,"esp"));}
+	  else      {	$out="esi"; &mov ($out,$s[0]);		}
+			&shr	($out,24);
+	}
+			&mov	($out,&DWP(0,$td,$out,4));
 			&and	($out,0xff000000);
 
-	if ($i==3)  {	$tmp=$s[1];				}
-	else        {	$tmp="edi"; &mov ($tmp,$s[1]);		}
-			&shr	($tmp,16-2);
-			&and	($tmp,0xFF<<2);
-			&mov	($tmp,&DWP(0,$td,$tmp));
+	if ($i==3)  {	$tmp=$s[1]; &movz ($tmp,&LB($s[1]));	}
+	else        {	$tmp="edi"; &mov ($tmp,$s[1]);
+			&shr	($tmp,16);
+			&and	($tmp,0xFF);			}
+			&mov	($tmp,&DWP(0,$td,$tmp,4));
 			&and	($tmp,0x00ff0000);
 			&xor	($out,$tmp);
 
@@ -616,11 +621,11 @@
 			&and	($tmp,0x0000ff00);
 			&xor	($out,$tmp);
 
-	if ($i==3)  {	$tmp=$s[3]; &mov ($s[2],&DWP(4,"esp"));	}
-	else        {	$tmp="edi"; &mov ($tmp,$s[3]);		} 
-			&and	($tmp,0xFF);
-			&mov	($tmp,&DWP(0,$td,$tmp,4));
-			&and	($tmp,0x000000ff);
+	if ($i==3)  {	$tmp=$s[3]; &movz ($tmp,&LB($s[3]));
+			&mov	($s[2],&DWP(4,"esp"));		}
+	else        {	$tmp="edi"; &movz ($tmp,&LB($s[3]));	} 
+	if ($i==1)  {	&shr	($s[3],16);			}
+			&movz	($tmp,&BP(0,$td,$tmp,4));
 			&xor	($out,$tmp);
 	if ($i<2)   {	&mov	(&DWP(4*$i,"esp"),$out);	}
 	if ($i==3)  {	&mov	($s[3],&DWP(0,"esp"));		}

