Here is some performance data on the latest implementation.  All numbers
are cycles per byte for an 8192-byte buffer (lower is better):

                openssl         SSE2            SSE2            SSE2
                cvs head        gcc-cvs         gcc34           icc71

sha1:

p4 model 3       9.59           16.9            21.4            14.2
p4 model 2      10.6            15.4            28.4            13.5
p-m             10.3            15.0            14.4            13.3
k8               8.18           10.4            10.3             8.70
efficeon         9.40            7.1             7.04            6.20

sha256:

p4 model 3      31.8            51.8            38.8            31.3
p4 model 2      38.6            46.5            38.1            39.2
p-m             32.7            34.3            32.0            29.0
k8              25.9            29.2            22.2            21.6
efficeon        27.9            20.9            15.4            16.4

notes:

- openssl cvs head as of 20041218, 32-bit x86 code only (even on k8 and
  p4 model 3)
- gcc-cvs: gcc cvs head as of 20041218
- gcc34: gcc-3.4.2-3 debian package
- icc71: intel C compiler 7.1 build 20030307Z
- p4 model 2 is the most widespread p4; p4 model 3 covers only the recent
  chips -- all EM64T are p4 model 3
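
For anyone who wants to produce numbers of the same kind: a cycles-per-byte
figure can be measured roughly as in the sketch below.  This is not the
harness from the archive, just a minimal illustration.  The names
read_ticks, cycles_per_byte and toy_checksum are made up for this example,
and toy_checksum is only a stand-in for a real sha1/sha256 routine.  It
assumes gcc or clang; on x86 it reads the time stamp counter, which on newer
chips ticks at a fixed rate rather than at the actual core clock, so treat
the result as approximate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* x86: read the time stamp counter (no serialization, so only approximate) */
uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* assumption: non-x86 fallback returns nanoseconds, NOT cycles */
uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* hypothetical signature for the routine under test */
typedef void (*hash_fn)(const unsigned char *buf, size_t len);

static volatile uint32_t sink;  /* keeps the compiler from dropping the work */

/* stand-in workload; replace with the real hash routine */
void toy_checksum(const unsigned char *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc = acc * 31u + buf[i];
    sink = acc;
}

/* run fn over buf several times and report the best ticks-per-byte seen */
double cycles_per_byte(hash_fn fn, const unsigned char *buf, size_t len,
                       int trials)
{
    uint64_t best = UINT64_MAX;

    assert(len > 0 && trials > 0);
    fn(buf, len);  /* warm-up pass to pull code and data into cache */
    for (int i = 0; i < trials; i++) {
        uint64_t start = read_ticks();
        fn(buf, len);
        uint64_t elapsed = read_ticks() - start;
        if (elapsed < best)
            best = elapsed;  /* minimum filters out interrupts and the like */
    }
    return (double)best / (double)len;
}
```

Calling cycles_per_byte(toy_checksum, buf, 8192, 16) on a filled 8192-byte
buffer gives a figure of the same kind as the table entries.  Taking the
minimum over several trials is one common choice; an average or median would
also be defensible.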

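Notice how all over the map gcc is: on the p4s, gcc34 is much slower than
gcc-cvs for sha1 (28.4 vs 15.4 on model 2) but faster for sha256 (38.1 vs
46.5).  Efficeon generally doesn't care much because we
reschedule/reoptimize everything anyhow.  But lacking any sort of sane
results from gcc, i'm not going to advocate even my sha256 code for
general consumption yet.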

The latest archive is here:
http://arctic.org/~dean/crypto/sha-sse2-20041218.tar.bz2

-dean

p.s. there's a description of the technique used here:
http://arctic.org/~dean/crypto/sha1.html