Neon Intrinsics

I am compiling a filter like this:

  for(j=0; j<nNumElementData-nNumElementFilter; j++){
    nHelp = 0;
    for(k=0; k<nNumElementFilter; k++){
      nHelp += pnDataSrc[j+k]*pnFilter[k];
    }
    pnDataDest[j] = nHelp;
  }

To my surprise it performs with
OPT =-O3 -march=armv7-a -mtune=cortex-a8
much better than
OPT =-O3 -fpromote-loop-indices -funroll-loops -ftree-vectorize \
  -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp

108 vs 140 ms (2009q1 compiler). Is this normal?

I would try to optimize this with intrinsics, but don't know how to use
the commands in
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
There is no explanation of their usage.

Is there a command that adds the 4 integers within a single SIMD word,
to do this "nHelp +=" step -
so summing the four 32-bit values in a 128-bit SIMD register?
Until now I have found and tried these:
vld1q_u32, vmulq_u32, vst1q_u32
but this is not exactly what I need.

Thanks for any helpful comments.

Arno <arno.steffen@web.de> writes:

I am compiling a filter like this:

  for(j=0; j<nNumElementData-nNumElementFilter; j++){
    nHelp = 0;
    for(k=0; k<nNumElementFilter; k++){
      nHelp += pnDataSrc[j+k]*pnFilter[k];
    }
    pnDataDest[j] = nHelp;
  }

To my surprise it performs with
OPT =-O3 -march=armv7-a -mtune=cortex-a8
much better than
OPT =-O3 -fpromote-loop-indices -funroll-loops -ftree-vectorize \
  -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp

108 vs 140 ms (2009q1 compiler). Is this normal?

With those options you'll be lucky if the code runs at all, let alone
does the right thing. The -ftree-vectorize option is known to
introduce bugs and massive slowdowns. Do not use it. What's worse,
it's on by default at -O3, so you should use -fno-tree-vectorize
whenever you use -O3.

I would try to optimize this with intrinsics, but don't know how to use
the commands in
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
There is no explanation of their usage.

Intrinsics are almost as bad as the vectoriser. Do not use them.

Is there a command that adds the 4 integers within a single SIMD word,
to do this "nHelp +=" step -
so summing the four 32-bit values in a 128-bit SIMD register?

There is no such instruction. You have to do a VADD followed by a
VPADD instead.

Måns Rullgård wrote:

Arno <arno.steffen@web.de> writes:

I would try to optimize this with intrinsics, but don't know how to use
the commands in
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
There is no explanation of their usage.

Intrinsics are almost as bad as the vectoriser. Do not use them.

My experience with the intrinsics isn't *that* bad. I got a nice
factor-6 speedup for my FFT code just by replacing the complex
arithmetic with intrinsics. Most of the speedup came from getting rid of
the NEON stalls for functions that pass back floats in integer
registers, though.

I haven't found much documentation on the intrinsics either, so I took
what I knew about other SIMD architectures and just played with NEON for
a weekend. That was enough to get a good feeling for which instructions
exist and which don't.

Here's the online documentation:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204h/CIHEJBIE.html

It could be much better, but as said, it was enough to get going. I
still search the GCC online docs when I need instructions whose exact
syntax I've forgotten.

Cheers,
    Nils

np <np@planetarc.de> writes:

Måns Rullgård wrote:

Arno <arno.steffen@web.de> writes:

I would try to optimize this with intrinsics, but don't know how to use
the commands in
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
There is no explanation of their usage.

Intrinsics are almost as bad as the vectoriser. Do not use them.

My experience with the intrinsics isn't *that* bad. I got a nice
factor-6 speedup for my FFT code just by replacing the complex
arithmetic with intrinsics.

In FFmpeg we got a 12x speedup compared to C by writing the FFT in
pure assembler.

Most of the speedup came from getting rid of the NEON stalls for
functions that pass back floats in integer registers, though.

Sounds like you're doing far more function calls than you should be.

Måns Rullgård wrote:

np <np@planetarc.de> writes:

In FFmpeg we got a 12x speedup compared to C by writing the FFT in
pure assembler.
  

Nice!

Do you remember what made the difference? Was it just proper scheduling
of the instructions, or was there something else? I always take a close
look at the compiled output, and besides the fact that GCC seems to
be unaware of the dual-issue capabilities, it all looks fine so far.

Sounds like you're doing far more function calls than you should be.
  

Don't ask :-)

Cheers,
    Nils Pipenbrinck

np <np@planetarc.de> writes:

Måns Rullgård wrote:

np <np@planetarc.de> writes:

In FFmpeg we got a 12x speedup compared to C by writing the FFT in
pure assembler.
  

Nice!

Do you remember what made the difference? Was it just proper scheduling
of the instructions or was there something else?

Proper scheduling always makes a difference, but choosing the right
instructions is obviously necessary too.

I always take a close look at the compiled output, and besides the
fact that GCC seems to be unaware of the dual-issue capabilities,
it all looks fine so far.

That's reason enough not to use it. The dual-issue alone can give a
50% speedup if properly utilised. Another problem with intrinsics is
that they provide no way of specifying alignment for loads and stores.
The gcc register allocator isn't very good either, so if you're
running low on registers, it is likely to start using memory, which
you don't want.

What does the dual issue thing mean?

I ended up from:
  for(j=0; j<nNumElementData-nNumElementFilter; j++){
    nHelp = 0;
    for(k=0; k<nNumElementFilter; k++){
      nHelp += pnDataSrc[j+k]*pnFilter[k];
    }
    pnDataDest[j] = nHelp;
  }

to

  for(j=0; j<nNumElementData-nNumElementFilter; j++){
    nHelp = 0;
    z4_sum = vdupq_n_u32(0); // intrinsic to zero the accumulator
    for(k=0; k<nNumElementFilter/4; k++){
      x4 = vld1q_u32(pnDataSrc); // intrinsic to load x4 with 4 values from x
      y4 = vld1q_u32(pnFilter); // intrinsic to load y4
      z4 = vmulq_u32(x4,y4); // intrinsic to mul z4=x4*y4
      z4_sum = vaddq_u32(z4_sum,z4); // intrinsic to add to z4_sum
      pnDataSrc+=4; // advance pointers
      pnFilter+=4;
    }
    pnDataSrc-=11; // step source forward by one sample overall
    pnFilter-=12; // reset filter pointer (12-tap filter)
    vst1q_u32(results, z4_sum);
    for (i=0; i<4; i++)
      nHelp += results[i];
    pnDataDest[j] = nHelp;
  }

Arno wrote:

What the dual issue thing means?
  

Hi Arno,

Dual issue: the Cortex-A8 integer core is able to execute two
instructions in the same cycle if certain requirements are met. The
exact details are somewhere in the reference manual. The NEON unit is
able to do something similar; afaik it can do a memory or shuffle
instruction in parallel with ordinary arithmetic.

In theory a compiler should be able to optimize the code in a way that
most instructions meet the dual-issue requirements. Unfortunately GCC
doesn't do that good a job, so hand-written assembler has the
potential to run at twice the speed.

  for(j=0; j<nNumElementData-nNumElementFilter; j++){
    nHelp = 0;
    z4_sum = vdupq_n_u32(0); // intrinsic to zero the accumulator
    for(k=0; k<nNumElementFilter/4; k++){
      x4 = vld1q_u32(pnDataSrc); // intrinsic to load x4 with 4 values
      y4 = vld1q_u32(pnFilter); // intrinsic to load y4
      z4 = vmulq_u32(x4,y4); // intrinsic to mul z4=x4*y4
      z4_sum = vaddq_u32(z4_sum,z4); // intrinsic to add to z4_sum
      pnDataSrc+=4; // advance pointers
      pnFilter+=4;
    }
    pnDataSrc-=11; // step source forward by one sample overall
    pnFilter-=12; // reset filter pointer
    vst1q_u32(results, z4_sum);
    for (i=0; i<4; i++)
      nHelp += results[i];
    pnDataDest[j] = nHelp;
  }

Looks good. Some comments though:

You can save one instruction in the inner loop if you use vmlaq_u32
instead of vmulq_u32 and vaddq_u32.

Also you do something *very* evil on the cortex-a8 in this code:

  vst1q_u32(results, z4_sum);
  for (i=0; i<4; i++)
    nHelp += results[i];

Here you do a data transfer from the NEON unit to the general purpose
core. This will result in a huge pipeline stall, because the NEON unit
always lags behind in execution by about 20 cycles. Since you also move
your data via memory, you'll most likely get an even longer stall,
because the data has to be written to memory before the non-NEON
registers can access it.

In general, always avoid transfers from NEON to the general purpose
registers. The other way (CPU -> NEON) is fast, though.

You could replace your code with something like this (untested):

  // add upper and lower part of z4_sum:
  uint32x2_t temp = vadd_u32 (vget_high_u32(z4_sum),
        vget_low_u32(z4_sum));

  // pairwise add:
  temp = vpadd_u32 (temp,temp);

  // store single integer:
  vst1_lane_u32 (&pnDataDest[j], temp, 0);

There are a couple of other things you could do. It's for example
possible to load more than 4 integers at a time - check out the vld2q
to vld4q intrinsics. (Warning: these instructions not only load data
from RAM but also de-interleave the elements! Watch out for this.)

Btw - any benchmark numbers how much difference your NEON changes made?

Cheers,
    Nils Pipenbrinck

Thanks all of you, I learned a lot.
The best results (CS lite 2009q1) I got with
OPT = -O3 -march=armv7-a -mtune=cortex-a8 -mfloat-abi=softfp \
  -mfpu=neon -fno-tree-vectorize

The C version takes around 130 ms, versus 86 ms optimized according to
Nils' recommendations.
That's only about a 50% speedup, but my first NEON examples took longer
than the C version.
Also the inner loop runs just 3 iterations (12 filter elements).

Nevertheless, optimizing still seems a mystery to me. Sometimes I see
performance depend on whether memory is allocated with new or malloc or
calloc or malloc+memset - and this differs from kernel to kernel and
compiler to compiler ... an endless matrix of cases. Some functions do
better with other OPT settings - no fun with that.