From dean-list-lmbench-users@arctic.org Mon Apr 26 20:01:19 2004
From: dean gaudet <dean-list-lmbench-users@arctic.org>
To: lmbench-users@bitmover.com
Subject: [Lmbench-users] [patch] lat_ops fp division results misleading
Date: Mon, 26 Apr 2004 19:57:46 -0700 (PDT)

it's not uncommon for hardware to shortcircuit various special division
cases... division by 1 being one of the most obvious.  the quick fix below
uses 3.14159 in several cases... but i didn't do a full analysis to make
sure it hasn't caused over/underflow.
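
a minimal sketch of the effect, for illustration only (this is not the
lmbench code): a dependent chain of divisions timed with clock().  the
names div_chain and ns_per_div, and the f = g / f loop shape, are
assumptions made for this example.

```c
#include <assert.h>
#include <time.h>

/* dependent chain: each quotient becomes the next divisor, so the loop
 * is bound by division latency, not throughput.  for nonzero f the
 * values alternate between roughly f and g/f, so they neither overflow
 * nor underflow. */
static double div_chain(double f, double g, long iters)
{
	while (iters-- > 0)
		f = g / f;
	return f;
}

/* average nanoseconds per division, measured with clock() (cpu time). */
static double ns_per_div(double f, double g, long iters)
{
	clock_t t0, t1;
	volatile double sink;	/* keeps the compiler from dropping the loop */

	assert(iters > 0);
	t0 = clock();
	sink = div_chain(f, g, iters);
	t1 = clock();
	(void)sink;
	return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)iters;
}
```

comparing ns_per_div(1.0, 1.0, n) against ns_per_div(3.14159, 1.0, n)
should show a gap on hardware with the shortcircuit, though compiler
flags and the resolution of clock() will affect the numbers.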

here are before / after timings in nanoseconds on two processors which i
know to have the fdiv-by-1 shortcircuit, and one which doesn't:

                        pentium-m 1GHz     k8 1.8GHz     p4-2 2.4GHz
float div               23.78 / 38.47    9.78 / 13.38   17.92 / 17.96
double div              24.59 / 38.49    9.77 / 13.39   17.92 / 17.95
float bogomflops:        8.33 / 37.55    5.73 / 13.61   17.97 / 17.96
double bogomflops:       8.55 / 37.90    5.06 / 12.04   17.97 / 17.96

the new results correspond exactly with latencies i've measured for x87
80-bit precision divisions... (all x87 divisions are at fcw.pc precision
and you need to change this global setting to see shorter latencies for
32-bit or 64-bit operations -- linux defaults to 80-bit, windows to
64-bit).

if i recompile -msse2 -mfpmath=sse then i get numbers corresponding to
sse/sse2 latencies i've measured through other means.

note the bogomflops benchmark isn't all that useful on its own -- the
division in there basically means that the benchmark shows whether or not
the hardware is capable of overlapping an fp division with other
operations.  this is definitely interesting information -- as you can see
above, all three processors are capable of overlapping division completely
(or nearly completely) with other operations, and the division is the
dominating cost.

but a critical benchmark is missing -- fp muladd.  in general it's very
interesting to know how well a processor does on a balanced sequence of fp
multiplications and adds (e.g. polynomial expansion/approximation,
matrix multiply, dot product, ...).  the processors above have quite
different capabilities for pairing x87, sse, and sse2 muls and adds.
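
to make the muladd idea concrete, here is a rough sketch of two kernels
such a benchmark might time.  neither is part of lmbench or of the patch
below; the names horner and dot2 are hypothetical.  the first is a single
dependent mul/add chain (latency-bound), the second keeps two independent
chains going.

```c
#include <stddef.h>

/* horner's rule: coef[0] is the highest-order coefficient.  each step
 * is one multiply and one add, and each depends on the previous one. */
static double horner(const double *coef, size_t n, double x)
{
	double acc = 0.0;
	size_t i;

	for (i = 0; i < n; ++i)
		acc = acc * x + coef[i];
	return acc;
}

/* dot product with two independent accumulators, so a processor that
 * can pair muls and adds has two chains to overlap. */
static double dot2(const double *a, const double *b, size_t n)
{
	double s0 = 0.0, s1 = 0.0;
	size_t i;

	for (i = 0; i + 1 < n; i += 2) {
		s0 += a[i] * b[i];
		s1 += a[i + 1] * b[i + 1];
	}
	if (i < n)	/* odd element left over */
		s0 += a[i] * b[i];
	return s0 + s1;
}
```

timing each of these per mul+add pair, the same way the division loops
are timed, would give one number for dependent and one for independent
muladd sequences.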

-dean

--- lat_ops.c.orig	2003-01-13 03:16:13.000000000 -0800
+++ lat_ops.c	2004-04-26 19:21:45.000000000 -0700
@@ -187,7 +187,7 @@
 do_float_div(iter_t iterations, void* cookie)
 {
 	struct _state *pState = (struct _state*)cookie;
-	register float f = (float)pState->N;
+	register float f = 3.14159*(float)pState->N;
 	register float g = (float)pState->M;

 	while (iterations-- > 0) {
@@ -240,7 +240,7 @@
 do_double_div(iter_t iterations, void* cookie)
 {
 	struct _state *pState = (struct _state*)cookie;
-	register double f = (double)pState->N;
+	register double f = 3.14159*(double)pState->N;
 	register double g = (double)pState->M;

 	while (iterations-- > 0) {
@@ -264,7 +264,7 @@

 	pState->data = (double*)x;
 	for (i = 0; i < pState->M; ++i) {
-		x[i] = 1.;
+		x[i] = 3.14159;
 	}
 }

@@ -276,7 +276,7 @@

 	pState->data = (double*)malloc(pState->M * sizeof(double));
 	for (i = 0; i < pState->M; ++i) {
-		pState->data[i] = 1.;
+		pState->data[i] = 3.14159;
 	}
 }
