softfp vs. hardfp - povray benchmark

Hello,

Being curious about hardfp, I've used the POV-Ray benchmark to get some numbers. I used POV-Ray 3.6.1; see http://www.povray.org/download/benchmark.php for an explanation. I think this will give an impression of how much applications with heavy floating-point usage might gain from hardfp.

I ran those tests on a BeagleBoard C4 (without xM, 720 MHz) using the same (vanilla) kernel 2.6.37.3; both systems were on the same USB hard drive, using different ext4 partitions of the same size.

The whole softfp-system was compiled using

CFLAGS="-Os -pipe -mtune=cortex-a8 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -fomit-frame-pointer"
CXXFLAGS="${CFLAGS} -std=gnu++0x -fvisibility-inlines-hidden"
CFLAGS="${CFLAGS} -std=gnu99"
LDFLAGS="-Wl,-O1 -Wl,--enable-new-dtags -Wl,--sort-common -Wl,--as-needed"

and the hardfp-system was compiled using

CFLAGS="-Os -pipe -mtune=cortex-a8 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard -fomit-frame-pointer"
CXXFLAGS="${CFLAGS} -std=gnu++0x -fvisibility-inlines-hidden"
CFLAGS="${CFLAGS} -std=gnu99"
LDFLAGS="-Wl,-O1 -Wl,--enable-new-dtags -Wl,--sort-common -Wl,--as-needed"

All package versions were the same and the same patches (if any) were used. The gcc version was 4.5.2, binutils was 2.21 and glibc was 2.11.2.
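
The difference between the two ABIs is in how floating-point values cross function boundaries: with -mfloat-abi=softfp the VFP/NEON unit is still used inside a function, but arguments and return values travel through the integer registers at every call, while -mfloat-abi=hard passes them directly in VFP registers. A minimal sketch (hypothetical file name, not part of the actual POV-Ray build; assumes a toolchain with hard-float support):

// vec.cpp - a tiny float-heavy function, handy for comparing the two ABIs.
// Compile it both ways and compare the generated assembly, e.g.:
//   g++ -Os -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -S vec.cpp -o vec-softfp.s
//   g++ -Os -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard   -S vec.cpp -o vec-hard.s
//
// With softfp the three doubles arrive in r0/r1, r2/r3 and on the stack,
// and the result goes back in r0/r1, so the callee has to move everything
// into VFP registers before it can calculate and back out afterwards.
// With hard the arguments arrive directly in d0-d2 and the result is
// returned in d0.

double length_squared(double x, double y, double z)
{
    return x * x + y * y + z * z;
}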

Here are the times for "time povray benchmark.ini":

softfp:
Total Time: 10 hours 39 minutes 23 seconds (38363 seconds)
real 639m23.292s
user 639m17.914s
sys 0m0.430s

hardfp:
Total Time: 10 hours 3 minutes 25 seconds (36205 seconds)
real 603m24.803s
user 603m21.188s
sys 0m0.422s

Being curious about the compiler optimisations, I've done the same benchmark on the same systems, just using -O3 instead of -Os to compile POV-Ray:

softfp:
Total Time: 9 hours 49 minutes 29 seconds (35369 seconds)
real 589m29.634s
user 589m24.016s
sys 0m0.422s

hardfp:
Total Time: 9 hours 22 minutes 13 seconds (33733 seconds)
real 562m12.603s
user 562m9.320s
sys 0m0.469s

So it looks like using hardfp instead of softfp might gain about 5-6 % for applications which make heavy use of floating point (here (38363 - 36205) / 38363 is about 5.6 % with -Os and (35369 - 33733) / 35369 about 4.6 % with -O3).

I don't want to judge whether -Os, -O2 or -O3 is better for your use case; those optimizations can have heavy implications, especially with regard to floating point, and the fastest optimization level won't always be the right fit.

Regards,

Alexander Holler

PS: Before someone asks why I'm using -std=gnu++0x: I use it because C++0x offers some nice new features, especially with regard to "perfect forwarding", and I think almost all C++ programs might benefit from that if those new features are used, e.g. by the STL. I haven't checked whether those new features are already used somewhere in the standard libraries (or templates), but ...
Be aware that using -std=gnu++0x actually breaks compilation of a few C++ programs.
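
For anyone wondering what perfect forwarding means in practice, here is a minimal sketch (a made-up example, not code from POV-Ray or the standard library): a single template wrapper that passes its argument on without losing the value category, so lvalues get copied and rvalues get moved.

#include <string>
#include <utility>
#include <vector>

// Forward 'value' to push_back exactly as it was passed in. Before C++0x
// this required separate overloads for const and non-const references;
// with rvalue references and std::forward a single template does it.
template <typename Container, typename T>
void add(Container& c, T&& value)
{
    c.push_back(std::forward<T>(value));
}

int main()
{
    std::vector<std::string> v;
    std::string s = "hello";

    add(v, s);                     // T deduced as std::string& -> copied
    add(v, std::string("world"));  // T deduced as std::string  -> moved
    return 0;
}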

Alexander Holler <holler@ahsoftware.de> writes:

> So it looks like using hardfp instead of softfp might gain about 5-6 %
> for applications which make heavy use of floating point.

For applications passing floats to and from functions a lot, yes.
Povray appears to be one of these.
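
A rough illustration of that call pattern (hypothetical helpers, not actual POV-Ray source): a ray tracer spends most of its time in small vector routines like these, so several doubles cross a function boundary on almost every operation.

struct Vec3 { double x, y, z; };

static double dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static Vec3 scale(Vec3 v, double s)
{
    Vec3 r;
    r.x = v.x * s;
    r.y = v.y * s;
    r.z = v.z * s;
    return r;
}

// Reflect a direction about a surface normal. Unless the calls get
// inlined, softfp shuffles these doubles through the integer registers
// and the stack at every call, while the hard ABI passes them in VFP
// registers (d0-d7).
Vec3 reflect(Vec3 dir, Vec3 n)
{
    Vec3 twice = scale(n, 2.0 * dot(dir, n));
    Vec3 r;
    r.x = dir.x - twice.x;
    r.y = dir.y - twice.y;
    r.z = dir.z - twice.z;
    return r;
}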

> I don't want to judge whether -Os, -O2 or -O3 is better for your use
> case; those optimizations can have heavy implications, especially with
> regard to floating point, and the fastest optimization level won't
> always be the right fit.

If you want fast, get a Panda:

Total Time: 1 hours 36 minutes 38 seconds (5798 seconds)

For reference, on an Intel Core i7 940 (2.93GHz):

Total Time: 0 hours 19 minutes 8 seconds (1148 seconds)

Hello,

It's hard to believe that the Beagle is more than 4 times slower than the Panda. Moreover, the difference between hardfp and softfp is so small that I would bet the Beagle does not use hardfp at all. But I might be mistaken there ...

Mans, do you confirm the Beagle figures? Just out of curiosity, do you have numbers for an Atom-based system?

Laurent GONZALEZ <macmanus38@gmail.com> writes:

> It's hard to believe that the Beagle is more than 4 times slower than
> the Panda. Moreover, the difference between hardfp and softfp is so
> small that I would bet the Beagle does not use hardfp at all. But I
> might be mistaken there ...
>
> Mans, do you confirm the Beagle figures?

Yes, I got similar figures on a Beagle C3. The huge difference is due
to the non-pipelined VFP in Cortex-A8. The hard vs softfp difference
should also be much smaller on A9.

The Nvidia Tegra2 with VFP3-D16 gets this result, also at 1GHz:

Total Time: 1 hours 43 minutes 48 seconds (6228 seconds)

It is a little slower than the Panda but not much.

> Just out of curiosity, do you have numbers for an Atom-based system?

I do not.