Very slow FPU performance with NEON enabled

Hi,

I'm testing FFTW performance on the BeagleBoard, and the
performance is very poor.

The kernel is NEON enabled:
uname -a
Linux beagleboard 2.6.29-omap1 #1 Fri Apr 24 07:19:09 COT 2009 armv7l unknown

cat /proc/cpuinfo
Processor : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 480.01
Features : swp half thumb fastmult vfp edsp thumbee neon
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x1
CPU part : 0xc08
CPU revision : 3

Hardware : OMAP3 Beagle Board
Revision : 0020
Serial : 0000000000000000

FFTW version: fftw-3.2.1

configuration of the fftw:
CFLAGS="-O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon
-ftree-vectorize -mfloat-abi=softfp" \
./configure \
--prefix=/tmp/beagle \
--build=i686-linux \
--host=arm-angstrom-linux-gnueabi \
--enable-single

Some results (with the fftw bench):
N=64 mflops=29.606
N=256 mflops=47.514
N=512 mflops=43.451
N=1024 mflops=32.265

The performance is very close to what I would expect from a softfloat implementation.

Any recommendation?

Regards,

Andrés Calderón
Cel: +57 (300) 275 3666
Email: andres.calderon@emqbit.com
Web: www.emqbit.com

Are you sure it's using neon and not the unpipelined vfplite?

regards,

Koen

Koen Kooi <koen@beagleboard.org> writes:

Hi,

I'm testing FFTW performance on the BeagleBoard, and the
performance is very poor.

configuration of the fftw:
CFLAGS="-O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon
-ftree-vectorize -mfloat-abi=softfp" \

Drop -ftree-vectorize. It is VERY buggy.

./configure \
--prefix=/tmp/beagle \
--build=i686-linux \
--host=arm-angstrom-linux-gnueabi \
--enable-single

Some results (with the fftw bench):
N=64 mflops=29.606
N=256 mflops=47.514
N=512 mflops=43.451
N=1024 mflops=32.265

The performance is very close to what I would expect from a softfloat
implementation.

Compile it with softfloat and compare.

Are you sure it's using neon and not the unpipelined vfplite?

It's almost certainly not using NEON.

Any recommendation?

Look in the directory simd in the fftw source and add NEON support.
I'd be glad to test any patches you come up with!

Philip
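
To make "add NEON support" a little more concrete: since gcc is not expected to emit NEON code by itself here, the SIMD code has to be written explicitly. Below is a minimal sketch of what that can look like with the arm_neon.h intrinsics. It is not FFTW's simd codelet interface; mul_f32 and HAVE_NEON are hypothetical names used only for illustration, and the scalar branch is there so the same file still builds when NEON is not enabled.

```c
#include <assert.h>
#include <stddef.h>

/* GCC defines __ARM_NEON__ (newer compilers also __ARM_NEON) when NEON
   code generation is enabled, e.g. with -mfpu=neon -mfloat-abi=softfp. */
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical helper macro */
#else
#define HAVE_NEON 0
#endif

/* out[i] = a[i] * b[i] for n single-precision floats.
   Assumption for this sketch: n is a multiple of 4. */
void mul_f32(float *out, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
#if HAVE_NEON
    /* Process four floats per iteration in 128-bit NEON registers. */
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vmulq_f32(va, vb));
    }
#else
    /* Scalar fallback: this is what runs when NEON is not enabled. */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
#endif
}
```

Disassembling the resulting object file and looking for NEON instructions on q/d registers is one way to confirm which branch was actually compiled.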

Hello,
I am running the FFTW bench application on the BeagleBoard. I used -O3
in place of -O2 in the makefile.

Command: bench -s if2048

I got results like this:
size   setup-time (ms)   time (ms)   mflops   transform
2048   152.80            2.38        47.24    In-place, backward, complex data
2048   399.60            1.47        38.35    In-place, backward, real data
2048    83.53            3.54        31.82    Out-of-place, backward, complex data
2048   218.14            1.48        38.05    Out-of-place, backward, real data

Are these results correct?
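
One way to sanity-check numbers like these: the FFTW benchmark reports a nominal "mflops" figure of 5 N log2(N) / (time in microseconds) for complex transforms and half that for real transforms. It is a scaled speed measure, not an actual operation count. The sketch below recomputes it from the size and time columns; fftw_mflops and ilog2 are hypothetical helper names, and the integer log2 assumes N is a power of two.

```c
#include <assert.h>

/* Integer log2. Assumption for this sketch: n is a power of two. */
static int ilog2(unsigned n)
{
    int k = 0;
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Nominal FFTW benchmark "mflops" for one transform of size n that
   takes time_ms milliseconds:
     complex: 5   * n * log2(n) / time_in_microseconds
     real:    2.5 * n * log2(n) / time_in_microseconds */
double fftw_mflops(unsigned n, double time_ms, int is_real)
{
    assert(n > 0 && time_ms > 0.0);
    double flops = 5.0 * (double)n * (double)ilog2(n);
    if (is_real)
        flops *= 0.5;
    return flops / (time_ms * 1000.0);
}
```

For the out-of-place rows, fftw_mflops(2048, 3.54, 0) and fftw_mflops(2048, 1.48, 1) reproduce the reported 31.82 and 38.05 to within rounding, so the table is at least internally consistent. Whether about 30 to 50 mflops is reasonable for this hardware is a separate question.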

Hi!

In the kernel boot log, check that "floating point emulation" is not
enabled and that you really are using hardware VFP.
As Philip implied, don't expect gcc to generate NEON instructions; in
most cases someone (like Mans for ffmpeg) has written the assembly
source by hand.
If you find NEON assembly source, check that it is enabled during the
configure step: look at how fftw enables NEON instructions; there must
be a test on 'host' or some '--enable-xxxx' option.
As Mans said, compare your results with the soft-float computation
(change your flag to -mfloat-abi=soft).

Good luck and let us know!

As Mans said, compare your results with the soft-float computation
(change your flag to -mfloat-abi=soft).

My understanding is that the flag only changes calling conventions, not
the actual code generated; both -mfloat-abi=[soft,hard] use the
hardware floating point capability, just not very well :)

Philip

You mean -mfloat-abi=softfp combined with -mfpu={neon,vfpv3}, right? The 'soft' bit will give you some surprises if you want to use the FPU.

regards,

Koen

Philip Balister <philip.balister@gmail.com> writes:

As Mans said, compare your results with the soft-float computation
(change your flag to -mfloat-abi=soft).

My understanding is that the flag only changes calling conventions, not
the actual code generated; both -mfloat-abi=[soft,hard] use the
hardware floating point capability, just not very well :)

Not quite so simple. The -mfloat-abi flag takes three different
values:

1. soft - all floating-point emulated
2. hard - real FPU instructions used
3. softfp - real FPU instructions used with softfloat calling convention.

In addition, the -mfpu flag specifies which FPU to generate code for,
or in the case of -mfloat-abi=soft, which one to emulate. Legal
values include vfp, vfpv2, vfpv3, and neon.

Only gcc-csl 2009q1 supports the VFP (hard) calling convention, and to
use it, you must rebuild everything on the system using floating-point.

If your code makes many calls to functions taking or returning
floating-point numbers, using -mfloat-abi=hard can save expensive
register transfers, so could be worth trying too.
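
A quick way to see which of these settings a given build actually got is to test the compiler's predefined macros. The sketch below assumes GCC's ARM macro names (__SOFTFP__ for -mfloat-abi=soft, __ARM_PCS_VFP for the hard-float calling convention, __ARM_NEON__ for NEON); older toolchains may not define all of them, so treat the result as a hint only. float_abi_name, built_with_neon and describe_build are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Which -mfloat-abi setting was this file compiled with?
   Assumes GCC's predefined ARM macros. */
const char *float_abi_name(void)
{
#if !defined(__arm__)
    return "not-arm";   /* not a 32-bit ARM build at all */
#elif defined(__SOFTFP__)
    return "soft";      /* floating point done in software */
#elif defined(__ARM_PCS_VFP)
    return "hard";      /* FPU instructions, FP args in VFP registers */
#else
    return "softfp";    /* FPU instructions, soft-float calling convention */
#endif
}

/* 1 if the compiler was allowed to generate NEON code, else 0. */
int built_with_neon(void)
{
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* Write a one-line summary such as "float-abi=softfp neon=yes" into buf. */
void describe_build(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "float-abi=%s neon=%s",
             float_abi_name(), built_with_neon() ? "yes" : "no");
}
```

Note that this only reports what the compiler was permitted to use. It says nothing about whether the generated code contains any NEON instructions.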

I may be repeating what Mans and Philip said.
The '-mfpu=' option specifies which hardware (including hardware emulation) to use.
Concerning '-mfloat-abi=soft', the gcc manpage says: "soft and hard are equivalent to -msoft-float and -mhard-float respectively. softfp allows the generation of floating point instructions, but still uses the soft-float calling conventions."
If -msoft-float is specified, functions in libgcc are used to perform floating-point operations, and the '-mfpu' option then only specifies the format of floating-point values.

Lastly, I found the following in a BB image: "support for NEON instruction", and in the kernel trace I found "VFP emulation".
My interpretation was that NEON was vectorizing INTEGER instructions while VFP was emulated in software (in that kernel version).
That's why you should check that you really are using the hardware FPU.

And you should really check that fftw contains NEON code; if it doesn't, I don't give gcc much chance of generating it (others may know better).