Hello everyone,
I may be asking a stupid question, but I'm having all sorts of trouble
using inline assembly to speed up my software with NEON. Would
anyone be able to tell me why the NEON code below breaks my
program?
#ifdef __ARM_NEON__
void FastMat4x4x4Mul(float* out, const float* a, const float* b)
{
    __asm__ __volatile__
    ("vldmia %[A], { q4-q7 } \n\t"
     "vldmia %[B], { q8-q11 } \n\t"
     "vmul.f32 q0, q8, d8[0] \n\t"
     "vmul.f32 q1, q8, d10[0] \n\t"
     "vmul.f32 q2, q8, d12[0] \n\t"
     "vmul.f32 q3, q8, d14[0] \n\t"
     "vmla.f32 q0, q9, d8[1] \n\t"
     "vmla.f32 q1, q9, d10[1] \n\t"
     "vmla.f32 q2, q9, d12[1] \n\t"
     "vmla.f32 q3, q9, d14[1] \n\t"
     "vmla.f32 q0, q10, d9[0] \n\t"
     "vmla.f32 q1, q10, d11[0] \n\t"
     "vmla.f32 q2, q10, d13[0] \n\t"
     "vmla.f32 q3, q10, d15[0] \n\t"
     "vmla.f32 q0, q11, d9[1] \n\t"
     "vmla.f32 q1, q11, d11[1] \n\t"
     "vmla.f32 q2, q11, d13[1] \n\t"
     "vmla.f32 q3, q11, d15[1] \n\t"
     "vstmia %[R], { q0-q3 } \n\t"
     ::[R]"r" (out), [A]"r" (a), [B]"r" (b)
     :
     "memory","q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11");
}
#else
void FastMat4x4x4Mul(float* output, const float* a, const float* b)
{
    int i, j, k;
    int r1=4, c1r2=4, c2=4;
    float sum;
    for(i=0; i < c2; i++) {
        for(j=0; j < r1; j++) {
            sum = 0.0;
            for(k=0; k < c1r2; k++) {
                sum += a[j*c1r2+k] * b[k*c2+i];
            }
            output[j*c2+i] = sum;
        }
    }
}
#endif
Specifically, the result of the function looks fine, but the program
does not behave the same way afterwards. It's as if some clobbered
register is not being restored, and I don't understand why.
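A check along the following lines might help narrow this down. It is only a sketch: mat4_mul_ref, mat4_mul_fn and check_mat4_mul are hypothetical names that do not appear in the code above. It compares a routine's output against a plain C multiply and also verifies that a few float values computed before the call are unchanged after it. Whether the compiler keeps those values in registers across the call is its own decision, so a pass does not prove that nothing was clobbered.

```c
/* Signature shared by the NEON routine and the plain C fallback. */
typedef void (*mat4_mul_fn)(float *out, const float *a, const float *b);

/* Hypothetical reference: row-major 4x4 multiply, out = a * b. */
void mat4_mul_ref(float *out, const float *a, const float *b)
{
    int row, col, k;
    for (row = 0; row < 4; row++) {
        for (col = 0; col < 4; col++) {
            float sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += a[row * 4 + k] * b[k * 4 + col];
            out[row * 4 + col] = sum;
        }
    }
}

/*
 * Returns 0 if fn looks fine, 1 if its result differs from the
 * reference, 2 if float values held across the call have changed.
 */
int check_mat4_mul(mat4_mul_fn fn)
{
    float a[16], b[16], got[16], want[16];
    float k0, k1, k2, k3;
    volatile float seed = 3.0f;  /* volatile: k0..k3 must be computed at run time */
    int i;

    /* Small integer inputs, so every product and sum is exact in float. */
    for (i = 0; i < 16; i++) {
        a[i] = (float)(i + 1);
        b[i] = (float)(16 - i);
    }
    mat4_mul_ref(want, a, b);

    /* Values that have to survive the call (in registers or on the stack). */
    k0 = seed + 1.0f;
    k1 = seed + 2.0f;
    k2 = seed + 3.0f;
    k3 = seed + 4.0f;

    fn(got, a, b);

    if (k0 != 4.0f || k1 != 5.0f || k2 != 6.0f || k3 != 7.0f)
        return 2;
    for (i = 0; i < 16; i++)
        if (got[i] != want[i])
            return 1;
    return 0;
}
```

Calling check_mat4_mul(FastMat4x4x4Mul) in a build with the NEON path enabled would then show whether the result itself is wrong (1) or whether something outside the result was disturbed (2).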
I use the Gumstix OpenEmbedded GCC toolchain on the Overo Earth:
~/overo-oe/tmp/cross/armv7a/bin/arm-angstrom-linux-gnueabi-gcc
with options:
-Wall -g -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp
and the image is the default Gumstix v0.92.
With CodeSourcery Lite 2009q3, I get different results again (a
pthread locks up somewhere).
Cheers,
Michele