Arno wrote:
What the dual issue thing means?
Hi Arno,
Dual issue: The Cortex-A8 integer-core is able to execute two
instructions in the same cycle if some requirements are met. The exact
details are somewhere in the reference manual. The Neon-unit is able to
to something similar. Afaik it can do a memory or shuffle instruction in
parallel with ordinary arithmetic.
In theory a compiler should be able to optimize the code in a way that
most instructions met the dual issue requirements. Unfortunately the
GCC doesn't do that much of a good job, so hand written assembler has
the potential to run at twice the speed.
for(j=0; j<nNumElementData-nNumElementFilter; j++){
nHelp = 0;
z4_sum = vld1q_u32(null);
for(k=0; k<nNumElementFilter/4; k++){
x4 = vld1q_u32(pnDataSrc); // intrinsic to load x4 with 4 values.
y4 = vld1q_u32(pnFilter); // intrinsic to load y4
z4 = vmulq_u32(x4,y4); // intrinsic to mul z4=x4*y4
z4_sum = vaddq_u32(z4_sum,z4); // intrinsic to add z4_sum
pnDataSrc+=4; // increment pointers
pnFilter+=4;
}
pnDataSrc-=11; // increment pointers
pnFilter-=12;
vst1q_u32(results, z4_sum);
for (i=0; i<4; i++)
nHelp += results[i];
pnDataDest[j] = nHelp;
}
Looks good. Some comments though:
You can save one instruction in the innerloop if you use vmlaq_u32
instead of vmulq_u32 and vaddq_u32.
Also you do something *very* evil on the cortex-a8 in this code:
vst1q_u32(results, z4_sum);
for (i=0; i<4; i++)
nHelp += results[i];
Here you do a data transfer from the NEON-unit to the general purpose core. This will result in an huge pipeline-stall because the NEON-unit always laggs behind about 20 cycles in execution. Since you also move your data via memory you'll most likely also get an even longer stall because the data has to be written to memory before the non-neon registers can access it.
In general always avoid transfers from NEON to the general purpose registers. The other way (CPU -> NEON) is fast though.
You could replace your code with something like this (untested):
// add upper and lower part of z4_sum:
uint32x2_t temp = vadd_u32 (vget_high_u32(z4_sum),
vget_low_u32(z4_sum));
// pairwise add:
temp = vpadd_u32 (temp,temp);
// store single integer:
vst1_lane_u32 (&pnDataDest[j], temp, 0);
There are a couple of other things that you could do. It's for example possible to load more than 4 integers at a time. Check out the vldq2 to vldq4 instrinsics. (warning: these instructions not only load data from RAM but also do a transpose the elements! Watch out for this).
Btw - any benchmark numbers how much difference your NEON changes made?
Cheers,
Nils Pipenbrinck