Looking for NEON optimization example

A friend uses the following code to do some benchmarking. Does anybody have a short (inline assembly?) example of how to optimize the multiply/accumulate in the loop for Beagle/Cortex/NEON?

i = 0;
for(j = 0; j < 512; j++) {
   i += a[j] * b[j];
}

Initialization is done by

volatile float i;
float a[512], b[512];
int j;

for(j = 0; j < 512; j++) {
    a[j] = j * 0.1;
    b[j] = j * 0.1;
}

Thanks

Dirk

You can try using "-O3 -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize" as compile options and see whether gcc automatically generates NEON code for you in this case.

arm-2007q3\lib\gcc\arm-none-linux-gnueabi\4.2.1\include\arm_neon.h defines the NEON intrinsics (http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html).
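
I haven't benchmarked it, but a dot product written with those intrinsics might look roughly like this (a rough sketch, assuming n is a multiple of 4; the function name dot_neon is just for illustration):

#include <arm_neon.h>

/* Rough sketch: dot product of a[] and b[] using NEON intrinsics.
   Assumes n is a multiple of 4. */
float dot_neon(const float *a, const float *b, unsigned n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);
    float32x2_t sum2;
    unsigned i;

    for (i = 0; i < n; i += 4) {
        /* acc += a[i..i+3] * b[i..i+3], four lanes at once */
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }

    /* reduce the four accumulator lanes to a single float */
    sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum2 = vpadd_f32(sum2, sum2);
    return vget_lane_f32(sum2, 0);
}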

I was using the 2007-q3 version and it generated the following NEON code for the C function below:

float a[256], b[256], c[256];

void foo(int x)
{
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = x * b[i] + c[i];
    }
}

foo:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        fmsr s14, r0 @ int
        fsitos s15, s14
        stmfd sp!, {r4, lr}
        ldr ip, .L8
        ldr r4, .L8+4
        ldr lr, .L8+8
        sub sp, sp, #8
        mov r0, #0
        fsts s15, [sp, #4]
        fsts s15, [sp, #0]
        fldd d5, [sp, #0]
.L2:
        add r1, r0, ip
        add r3, r0, r4
        add r2, r0, lr
        add r0, r0, #8
        cmp r0, #1024
        fldd d7, [r3, #0]
        vmul.f32 d7, d5, d7
        fldd d6, [r2, #0]
        vadd.f32 d7, d7, d6
        fstd d7, [r1, #0]
        bne .L2
        add sp, sp, #8
        ldmfd sp!, {r4, pc}
.L9:
        .align 2
.L8:
        .word a
        .word b
        .word c

Regards,
Pratheesh

Dirk Behme <dirk.behme@googlemail.com> writes:

> A friend uses the following code to do some benchmarking. Does
> anybody have a short (inline assembly?) example of how to optimize
> the multiply/accumulate in the loop for Beagle/Cortex/NEON?
>
> i = 0;
> for(j = 0; j < 512; j++) {
>    i += a[j] * b[j];
> }
>
> Initialization is done by
>
> volatile float i;
> float a[512], b[512];
> int j;
>
> for(j = 0; j < 512; j++) {
>     a[j] = j * 0.1;
>     b[j] = j * 0.1;
> }

Here's a simple NEON version, unrolled 8 times:

float
vmac_neon(const float *a, const float *b, unsigned n)
{
    float s = 0;

    asm ("vmov.f32 q8, #0.0 \n\t"
         "vmov.f32 q9, #0.0 \n\t"
         "1: \n\t"
         "subs %3, %3, #8 \n\t"
         "vld1.32 {d0,d1,d2,d3}, [%1]! \n\t"
         "vld1.32 {d4,d5,d6,d7}, [%2]! \n\t"
         "vmla.f32 q8, q0, q2 \n\t"
         "vmla.f32 q9, q1, q3 \n\t"
         "bgt 1b \n\t"
         "vadd.f32 q8, q8, q9 \n\t"
         "vpadd.f32 d0, d16, d17 \n\t"
         "vadd.f32 %0, s0, s1 \n\t"
         : "=w"(s), "+r"(a), "+r"(b), "+r"(n)
         :: "q0", "q1", "q2", "q3", "q8", "q9");

    return s;
}
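
The loop assumes n is a non-zero multiple of 8, which the 512-element benchmark satisfies. A rough usage sketch (the wrapper name run_benchmark is hypothetical, mirroring the init loop from the original post):

/* Hypothetical wrapper: fill the arrays as in the original init loop,
   then call vmac_neon(); 512 is a multiple of 8 as required. */
float run_benchmark(void)
{
    static float a[512], b[512];
    int j;

    for (j = 0; j < 512; j++) {
        a[j] = j * 0.1f;
        b[j] = j * 0.1f;
    }

    return vmac_neon(a, b, 512);
}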

For comparison, I used this C function:

float
vmac_c(const float *a, const float *b, unsigned n)
{
    float s = 0;
    unsigned i;

    for(i = 0; i < n; i++) {
        s += a[i] * b[i];
    }

    return s;
}

Using gcc csl 2007q3 with flags -O3 -fomit-frame-pointer
-mfloat-abi=softfp -mfpu=neon -mcpu=cortex-a8 -ftree-vectorize
-ffast-math, the NEON version is about twice as fast as the C version.
Dropping -ffast-math makes the C version about 7 times slower, and
gives a slightly different result (differing in the 8th decimal
digit).

I was surprised to see that gcc actually managed to vectorise the code
a bit, even though hand-crafted assembly easily outperforms it.