Here is some performance data on the latest implementation. These are
all cycles per byte for an 8192-byte buffer (lower is better):

                openssl    SSE2      SSE2     SSE2
                cvs head   gcc-cvs   gcc34    icc71
sha1:
  p4 model 3    9.59       16.9      21.4     14.2
  p4 model 2    10.6       15.4      28.4     13.5
  p-m           10.3       15.0      14.4     13.3
  k8            8.18       10.4      10.3     8.70
  efficeon      9.40       7.1       7.04     6.20

sha256:
  p4 model 3    31.8       51.8      38.8     31.3
  p4 model 2    38.6       46.5      38.1     39.2
  p-m           32.7       34.3      32.0     29.0
  k8            25.9       29.2      22.2     21.6
  efficeon      27.9       20.9      15.4     16.4

notes:

- openssl cvs head as of 20041218, 32-bit x86 only (even on k8 and p4-3)
- gcc cvs head as of 20041218
- gcc-3.4.2-3 debian package
- intel C compiler 7.1 build 20030307Z
- p4 model 2 is the most widespread p4; p4 model 3 are the recent chips
  only -- all EM64T are p4 model 3

Notice how all over the map gcc is. Efficeon generally doesn't care much
because we reschedule/reoptimize everything anyhow.

But lacking any sort of sane results from gcc, i'm not going to advocate
even my sha256 code for general consumption yet.

The latest archive is here:

http://arctic.org/~dean/crypto/sha-sse2-20041218.tar.bz2

-dean

p.s. there's a description of the technique used here:

http://arctic.org/~dean/crypto/sha1.html
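
One rough way to put a number on how much the compiler choice matters is to take, for each CPU, the slowest SSE2 build divided by the fastest. The sketch below does that using only the SSE2 columns of the table in this post. The dictionary layout and the names `SSE2_CPB`, `compiler_spread` and `spread_table` are made up for illustration; none of this is part of the sha-sse2 archive.

```python
# Cycles per byte copied from the table in this post, SSE2 columns only,
# in the order (gcc-cvs, gcc34, icc71). The data layout is an assumption
# made for this sketch.
SSE2_CPB = {
    "sha1": {
        "p4 model 3": (16.9, 21.4, 14.2),
        "p4 model 2": (15.4, 28.4, 13.5),
        "p-m":        (15.0, 14.4, 13.3),
        "k8":         (10.4, 10.3, 8.70),
        "efficeon":   (7.1, 7.04, 6.20),
    },
    "sha256": {
        "p4 model 3": (51.8, 38.8, 31.3),
        "p4 model 2": (46.5, 38.1, 39.2),
        "p-m":        (34.3, 32.0, 29.0),
        "k8":         (29.2, 22.2, 21.6),
        "efficeon":   (20.9, 15.4, 16.4),
    },
}


def compiler_spread(cpb):
    """Slowest / fastest cycles per byte across compilers (1.0 = no spread)."""
    return max(cpb) / min(cpb)


def spread_table(data=SSE2_CPB):
    """Map hash name -> CPU -> spread ratio rounded to two decimals."""
    return {
        hash_name: {cpu: round(compiler_spread(cpb), 2) for cpu, cpb in rows.items()}
        for hash_name, rows in data.items()
    }


if __name__ == "__main__":
    for hash_name, rows in spread_table().items():
        print(hash_name)
        for cpu, ratio in rows.items():
            print(f"  {cpu:<12} {ratio:.2f}x")
```

For sha1 this gives a spread of about 2.1x on p4 model 2 (28.4 vs 13.5) but only about 1.13x on p-m (15.0 vs 13.3), which is one way of seeing how uneven the results are from one chip to the next.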