Using ARM Neon Intrinsics in Code Sourcery Toolchain for Beagle Board-xM

Hi,

I want to use the ARM NEON intrinsics for optimization.
I want to build my code with the Code Sourcery
toolchain and then port it to the Beagle Board-xM. How do I use
NEON intrinsics in my code? Will the Beagle Board-xM support them?

Thanks,
Rajiv.

Rajiv Biswas wrote:

  Hi,

  I want to use the ARM NEON intrinsics for optimization.

Don't do that; write NEON assembly code directly instead.

  I want to build my code with the Code Sourcery
  toolchain and then port it to the Beagle Board-xM. How do I use
  NEON intrinsics in my code? Will the Beagle Board-xM support them?

Yes, the CPU on the BB-xM does have NEON support.

Why? What are the drawbacks of using intrinsics when they are documented in
ARM's compiler optimization guide? There we can find the entire set
of intrinsics we can use in our C code.

Assembly would definitely be faster, but porting issues will crop up, since the assembler
syntax differs between RVDS and Code Sourcery, for that matter. Hence I want to use
intrinsics with this issue in mind, hoping they will be faster than the plain C code.
The only problem is that I am unable to use them, since the compiler is throwing “undefined reference”
errors once I use the intrinsics in my code.

Thanks,
Rajiv.

The problem is that gcc does not generate good code for NEON
intrinsics. It is actually a known bug in gcc, documented as "some neon
functions generate an obsessive amount of code" :)

A good reference is:
http://hilbert-space.de/?p=22

From this post, the results are:

  C-version: 15.1 cycles per pixel.
  NEON-version: 9.9 cycles per pixel.
  Assembler: 2.0 cycles per pixel.

I also ran a test. I wrote a stereo vision program that takes 329 ms to
run. Using NEON (assembler), it runs in 77 ms. That's a bit over 4 times
faster, but you could get even more depending on the application and
code.

It seems that armcc from RVDS optimizes the code very well, so that could
be a good solution. I tried to compile code for Linux with that
toolchain and never got it working. The support from ARM Inc. was not
helpful either. They sent me an obscure document that helped me compile the
code and get binaries, but they did not run on Linux.

Rafael

Great! Great info, thanks a lot. Yes, I now have to think about
writing my motion compensation module in NEON assembly. But going by the
figures, the NEON intrinsics version would still be faster than plain C code, where nothing
beyond algorithm-level changes or loop unrolling can be done.

What I take from the article is this: “Use them to get your algorithm
working and then rewrite the NEON-parts of it in assembler.” So, a first level of
optimization with intrinsics and, if that works out, maybe small parts of the big
motion compensation algorithm in assembly.
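
As a rough idea of what such a first intrinsics pass could look like, here is a minimal sketch of one small building block often found in motion compensation: the rounded average of two rows of 8-bit pixels. The function name avg_rows_u8 and the choice of this particular operation are assumptions made for illustration; nothing in this thread specifies the actual algorithm. The scalar loop doubles as the fallback when NEON is not enabled and as the reference to check the NEON path against.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>   /* declares the NEON intrinsics */
#endif

/* Hypothetical helper (name is an assumption, not from any codec):
 * dst[i] = (p[i] + q[i] + 1) >> 1 for n pixels. */
static void avg_rows_u8(uint8_t *dst, const uint8_t *p,
                        const uint8_t *q, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    /* NEON path: 16 pixels per iteration.
     * vrhaddq_u8 is a rounding halving add, i.e. (a + b + 1) >> 1 per lane. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t vp = vld1q_u8(p + i);
        uint8x16_t vq = vld1q_u8(q + i);
        vst1q_u8(dst + i, vrhaddq_u8(vp, vq));
    }
#endif

    /* Scalar tail, and the full fallback when NEON is unavailable. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)((p[i] + q[i] + 1) >> 1);
    }
}
```

Keeping the scalar loop around makes it easy to compare results first and only later decide which parts are worth rewriting in assembly.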

But how do I resolve the errors I get once I use the intrinsics, like “undefined reference to
func_yyy”?

Thanks,
Rajiv.
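
On the “undefined reference” errors: the thread does not show the source or the build command, so this is only a guess at the cause. With gcc the NEON intrinsics are declared in arm_neon.h. If that header is not included, the C compiler may treat a call such as vqaddq_u8 as an ordinary undeclared function, and the link then fails with an undefined reference. The header in turn needs NEON enabled on the command line. A minimal sketch, where the helper name add_u8_sat is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* gcc defines __ARM_NEON__ (newer compilers also __ARM_NEON) only when
 * NEON is enabled, e.g. with -mfpu=neon -mfloat-abi=softfp. */
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>   /* without this include the intrinsics are undeclared */
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: saturating add of two byte arrays,
 * dst[i] = min(a[i] + b[i], 255). */
static void add_u8_sat(uint8_t *dst, const uint8_t *a,
                       const uint8_t *b, size_t n)
{
    size_t i = 0;

#if HAVE_NEON
    /* 16 bytes per iteration using 128-bit Q registers. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));  /* saturating add */
    }
#endif

    /* Scalar tail, and the full fallback when NEON is not enabled. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Assuming the Linux variant of the Code Sourcery toolchain, where the compiler is usually named arm-none-linux-gnueabi-gcc, a build line along these lines should enable NEON for the Cortex-A8 on the board (the file name is a placeholder):

    arm-none-linux-gnueabi-gcc -O2 -Wall -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -c neon_test.c

Adding -Wall is worthwhile here because it makes gcc warn about implicitly declared functions, which points straight at a missing include.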