Update On Floating Point Performance

Hello Folks,

   Just following up:

Thanks to folks for the feedback and suggestions. I tried the suggested options, and I even "hacked" the nbench benchmarks (which use doubles throughout their C code) to use floats everywhere internally. It might be of some interest that none of this had any real effect: FP performance still lagged an old x86 clone running at half the clock speed.

Floating point performance is important for many of the applications in 3D graphics and robotics for which I had been considering the OMAP 3. I often have to write code that handles LU decompositions, 3D transformations, etc. in real time. So, the fact that the processor is so slow at floating point (relative to its integer performance) seems odd. I'm grateful that the Beagleboard is helping me evaluate it thoroughly.

Any other ideas? Is there a compiler branch somewhere that will let this new "SIMD 128bit pipelined FP unit" that is in there somewhere beat out an AMD K6/233 from 12 years ago? It would seem that, with such a touted hardware FP unit (going by ARM's website), the gap between FP performance and INT performance would not be so large.

So, I'm still a bit puzzled, unless compiler support for NEON is so immature that we're not seeing anything like the real performance.

-Sincerely,
Todd Pack

rtpack@comcast.net writes:

> Any other ideas? Is there a compiler branch somewhere that will let
> this new "SIMD 128bit pipelined FP unit" that is in there somewhere
> beat out an AMD K6/233 from 12 years ago? It would seem that, with
> such a touted hardware FP unit (going by ARM's website), the gap
> between FP performance and INT performance would not be so large.

You have to make sure that what ARM calls runfast mode is enabled for
normal FP instructions to execute in the NEON pipeline. This includes
disabling FP exceptions and selecting the proper rounding mode. The
details should be in the manual.
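
As a rough illustration of the point above, here is a minimal sketch of
what switching to RunFast mode could look like in C. It assumes the
usual ARMv7 VFP FPSCR layout (DN in bit 25, FZ in bit 24, rounding mode
in bits 23:22, trap enables in bits 15 and 12:8). The helper names
fpscr_runfast_value and enable_runfast are hypothetical, and the
register access is only compiled when building for ARM with VFP. Like
the rest of this thread, it has not been tested on a Beagleboard.

```c
#include <stdint.h>

/* FPSCR fields, assuming the ARMv7 VFP register layout. */
#define FPSCR_DN           (1u << 25)  /* default NaN mode           */
#define FPSCR_FZ           (1u << 24)  /* flush-to-zero mode         */
#define FPSCR_RMODE        (3u << 22)  /* 00 = round to nearest      */
#define FPSCR_TRAP_ENABLES ((1u << 15) | (0x1Fu << 8))
                                       /* IDE, IXE, UFE, OFE, DZE, IOE */

/* Hypothetical helper: given the current FPSCR value, return the value
 * with exception traps disabled, round-to-nearest selected, and
 * flush-to-zero plus default-NaN enabled. Other bits are preserved. */
uint32_t fpscr_runfast_value(uint32_t fpscr)
{
    fpscr &= ~(FPSCR_TRAP_ENABLES | FPSCR_RMODE);
    fpscr |= FPSCR_FZ | FPSCR_DN;
    return fpscr;
}

#if defined(__arm__) && defined(__VFP_FP__) && !defined(__SOFTFP__)
/* Hypothetical wrapper: read-modify-write the real FPSCR.
 * Older assemblers may want the fmrx/fmxr mnemonics instead. */
static inline void enable_runfast(void)
{
    uint32_t fpscr;
    __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
    fpscr = fpscr_runfast_value(fpscr);
    __asm__ volatile("vmsr fpscr, %0" : : "r"(fpscr));
}
#endif
```

The idea would be to call enable_runfast() once at program start,
before the benchmark loops run. Note that flush-to-zero and default-NaN
behaviour are not strictly IEEE 754 compliant, so results may differ
slightly for denormals and NaNs.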

> So, I'm still a bit puzzled, unless compiler support for NEON is so
> immature that we're not seeing anything like the real performance.

Compilers are certainly not very good at using the vector operations
the NEON unit is capable of.

This is described here:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0133c/index.html

Note I did not test it.

This was discussed on IRC yesterday:
http://www.beagleboard.org/irclogs/index.php?date=2008-10-28#T17:54:14

Laurent

The Cortex-A8 was designed for high-performance vector processing
using the new NEON engine, including vectorized single-precision
floating point, but the NEON instructions must actually be used. Using
NEON means some extra effort for the software engineer: identifying
the critical sections, then modifying them to ensure NEON instructions
are used. Both gcc and armcc offer auto-vectorization (some source
code changes may be required), but there are also other approaches
such as intrinsics. These can yield substantial performance benefits
on Cortex-A8 for typical applications where there are a small number
of critical routines.
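
To make the intrinsics approach a little more concrete, here is a small
sketch of a single-precision "y += a * x" loop. The function name saxpy
is just an illustrative choice. The NEON path uses intrinsics from
arm_neon.h and is only compiled when the compiler defines __ARM_NEON or
__ARM_NEON__ (for gcc on a 32-bit ARM target that typically means
building with -mfpu=neon); otherwise only the plain scalar loop is
used. The scalar loop is also the kind of code an auto-vectorizer might
handle, although gcc generally needs -ftree-vectorize and -ffast-math
before it will vectorize floats for NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Computes y[i] += a * x[i] for i in [0, n). */
void saxpy(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four floats per iteration in a 128-bit NEON register. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_n_f32(vy, vx, a);      /* vy = vy + vx * a */
        vst1q_f32(y + i, vy);
    }
#endif

    /* Scalar loop: handles the tail, or everything without NEON. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

A 3D transformation or the inner update of an LU decomposition reduces
to loops of roughly this shape, so these are the sort of critical
routines where hand-written NEON code would be expected to pay off.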

The next-gen Cortex-A9 increases FP performance with a fully
pipelined scalar floating point unit, and its NEON unit retains the
same pipeline as on the A8.