Floating Point Performance

Hello Folks,
   I built nbench for my beagleboard and compiled with flags that one would be led to believe would enable floating point operation:

-mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon

   My nbench results are below (just to share around). My question is... are these results consistent with what folks expect? Why is the FP performance still so low even with the floating point hardware "engaged"? Any thoughts? Compiler maturity? What would one expect the BB to achieve on simple FP benchmarks like this?

root@beagleboard:~/Tools/nbench-byte-2.2.3# ./nbench

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          236.56  :       6.07  :       1.99
STRING SORT         :          23.696  :      10.59  :       1.64
BITFIELD            :      8.0473e+07  :      13.80  :       2.88
FP EMULATION        :          50.999  :      24.47  :       5.65
FOURIER             :          886.67  :       1.01  :       0.57
ASSIGNMENT          :          3.3465  :      12.73  :       3.30
IDEA                :          583.89  :       8.93  :       2.65
HUFFMAN             :          289.46  :       8.03  :       2.56
NEURAL NET          :         0.95786  :       1.54  :       0.65
LU DECOMPOSITION    :          37.915  :       1.96  :       1.42
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 11.026
FLOATING-POINT INDEX: 1.450
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU :
L2 Cache :
OS : Linux 2.6.26-omap1
C compiler : gcc version 4.3.1 (GCC)
libc : libc-2.6.1.so
MEMORY INDEX : 2.499
INTEGER INDEX : 2.957
FLOATING-POINT INDEX: 0.804
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

rtpack@comcast.net writes:

Hello Folks,
I built nbench for my beagleboard and compiled with flags that one
would be led to believe would enable floating point operation:

-mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon

Try adding -ffast-math -fno-math-errno

My nbench results are below (just to share around). My question
is... are these results consistent with what folks expect? Why is
the FP performance still so low even with the floating point
hardware "engaged"? Any thoughts? Compiler maturity? What would one
expect the BB to achieve on simple FP benchmarks like this?

On the Cortex-A8, double-precision floating-point maths is not
pipelined, and neither is single-precision if full IEEE compliance is
required. The flags above should let the compiler generate
floating-point code that can execute in the pipelined NEON unit for
single-precision maths.

Honestly, how often does anyone run code even resembling those
benchmarks?

Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

That baseline is hardly relevant these days.

Måns Rullgård wrote:

Honestly, how often does anyone run code even resembling those
benchmarks?

That baseline is hardly relevant these days.

I'm really interested in finding/measuring useful performance numbers
for the boards/cpus in my hand including bb/cortex-A8. Do you have any
pointers for a head start? Even a methodology can be helpful.

Thanks,
Caglar

I get these results:

root@beagleboard:/data/src/nbench-byte-2.2.3# ./nbench

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          286.92  :       7.36  :       2.42
STRING SORT         :          27.893  :      12.46  :       1.93
BITFIELD            :      1.2647e+08  :      21.69  :       4.53
FP EMULATION        :          61.671  :      29.59  :       6.83
FOURIER             :            1064  :       1.21  :       0.68
ASSIGNMENT          :          4.6512  :      17.70  :       4.59
IDEA                :          700.21  :      10.71  :       3.18
HUFFMAN             :          364.96  :      10.12  :       3.23
NEURAL NET          :          1.1517  :       1.85  :       0.78
LU DECOMPOSITION    :          48.921  :       2.53  :       1.83
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX : 14.139
FLOATING-POINT INDEX: 1.784
Baseline (MSDOS*) : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU :
L2 Cache :
OS : Linux 2.6.27-rc7-omap1
C compiler : gcc version 4.3.2 (GCC)
libc :
MEMORY INDEX : 3.424
INTEGER INDEX : 3.609
FLOATING-POINT INDEX: 0.989
Baseline (LINUX) : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.
root@beagleboard:/data/src/nbench-byte-2.2.3#

That's with -ffast-math -fno-math-errno -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize -fomit-frame-pointer -funroll-loops -O3

-ftree-vectorize makes the "NUMERIC SORT" 10% faster, no real impact
on the rest

regards,

Koen

Yusuf Caglar AKYUZ <caglarakyuz@gmail.com> writes:

I'm really interested in finding/measuring useful performance numbers
for the boards/cpus in my hand including bb/cortex-A8. Do you have any
pointers for a head start? Even a methodology can be helpful.

Run real applications. If you have specific uses in mind, run those.

To stress double-precision floating-point, try POVRay. Many audio
codecs use floating-point, usually single precision. Vorbis is an
example.

You also need to consider whether the tests are CPU-bound or
memory-bound. Most artificial benchmarks are CPU-bound, except those
specifically designed to stress the memory subsystem.