Weekly Report

Hi all, it seems that the weekly report was not automatically picked
up by the mailer this week, so here is mine below. For a prettier
version, just follow this link instead.

http://gsoc2010-fftw-neon.blogspot.com/2010/07/fftw-weekly-report-week-7.html

FFTW: Weekly Report - Week 7

Over the last week, I've spent a fair amount of time implementing the
"standard" fftw simd interface for neon using intrinsics, and one
thing is certainly clear: there is absolutely no throughput gained at
all, as you can see in the graph below. And yes, I was definitely
using neon code for fftw3n. These results were more or less expected
(although maybe not so exactly), since I was really only using
intrinsics for verification purposes, and obviously inline C functions
produce some undesirable effects. As far as numerical accuracy goes,
it's identical to the non-simd version, which is good. The benchmark
was made via benchfft for consistency, and you can grab a copy of it

<graphics were here>

Below is an example of how the VMUL and VZMUL inline functions appear
in the disassembled .so file. The best way to eliminate all of those
useless branches, pushes and pops is to rewrite each inline function
in simd-neon.h with __asm__ statements in C-macro format, just as the
codesourcery folks did for their mips port. You might be able to tell
from the dump below that only the simd code was compiled with
-mfpu=neon.

000dc268 :
vmul.f32 q0, q0, q1
bx lr
...
...
00149cf0 :
vorr q8, q0, q0
vtrn.32 q0, q8
push {lr} ; (str lr, [sp, #-4]!)
vpush {d8-d13}
sub sp, sp, #68 ; 0x44
vstr d16, [sp, #16]
vstr d17, [sp, #24]
vstmia sp, {d0-d1}
mov lr, sp
add ip, sp, #32 ; 0x20
ldm lr!, {r0, r1, r2, r3}
vorr q6, q1, q1
stmia ip!, {r0, r1, r2, r3}
vldr d0, [sp, #32]
vldr d1, [sp, #40]
ldm lr, {r0, r1, r2, r3}
stm ip, {r0, r1, r2, r3}
vldr d8, [sp, #48]
vldr d9, [sp, #56]
bl 149c30
vorr q5, q0, q0
vorr q0, q6, q6
bl 149c38
vorr q1, q0, q0
vorr q0, q4, q4
bl 149c30
vorr q1, q0, q0
vorr q0, q5, q5
bl 149ce8
add sp, sp, #68 ; 0x44
vpop {d8-d13}
pop {pc}
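
As a rough illustration of the macro rewrite mentioned above the dump,
here is a minimal sketch of VMUL as a GCC statement-expression macro
wrapping a single vmul.f32. This is an assumption about how such a
macro could look, not the actual simd-neon.h code: the names V,
v_load, v_store and mul4 are made up for the example, and the
non-neon branch is only a scalar stand-in so the file also builds on
a host without neon.

```c
#include <assert.h>

#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t V;

/* Hypothetical macro form: the vmul.f32 is emitted in place, so there is
   no call, no bx lr, and no stack traffic around it. %q0..%q2 print the
   q-register names chosen by the compiler for the "w" operands. */
#define VMUL(a, b)                                   \
    __extension__ ({                                 \
        V vmul_r_;                                   \
        __asm__("vmul.f32 %q0, %q1, %q2"             \
                : "=w"(vmul_r_)                      \
                : "w"(a), "w"(b));                   \
        vmul_r_;                                     \
    })

static inline V v_load(const float *p) { return vld1q_f32(p); }
static inline void v_store(float *p, V v) { vst1q_f32(p, v); }

#else
/* Scalar stand-in (assumption): same interface, no neon required. */
typedef struct { float f[4]; } V;

static inline V vmul_scalar(V a, V b)
{
    V r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] * b.f[i];
    return r;
}
#define VMUL(a, b) vmul_scalar((a), (b))

static inline V v_load(const float *p)
{
    V v;
    for (int i = 0; i < 4; ++i)
        v.f[i] = p[i];
    return v;
}
static inline void v_store(float *p, V v)
{
    for (int i = 0; i < 4; ++i)
        p[i] = v.f[i];
}
#endif

/* Multiply two 4-float arrays lane by lane through the VMUL macro. */
static void mul4(float *out, const float *x, const float *y)
{
    assert(out && x && y);
    v_store(out, VMUL(v_load(x), v_load(y)));
}
```

Calling mul4(out, x, y) with x = {1, 2, 3, 4} and y = {2, 2, 2, 0.5}
fills out with the four lane-wise products. Whether the compiler then
keeps everything in q registers across several such macros is
something the disassembly would have to confirm.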

On a side note, I did discover that some as-of-yet unidentified effect
of the armv7 cycle-counter was making my benchmarks hang, so it is
currently configured out in my test libraries
(--enable-armv7-cycle-counter is not set).

On another side note, all of my simulations seem to indicate that simd
transforms are always out-of-place. This is probably a good thing in
any case because out-of-place transformations tend to be faster, but
it also allows me to implement pointer/register auto-incrementing.

Also, in pure C / neon intrinsics, there is no way to specify
load-store alignment or pointer auto-increment, which just means more
arm instructions interleaved with the neon code. Ideally, there would
only be a few, infrequent conditional branches in arm code while most
of the work would be done on the neon coprocessor.
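
To make the alignment and auto-increment point concrete, here is a
minimal sketch of a load that carries a 128-bit alignment hint and
post-increments its pointer in the same instruction, written as an
__asm__ macro. The names VLD_ALIGNED_POSTINC, lane_sum and
sum_aligned are hypothetical, the operand syntax is what I would
expect GCC for 32-bit arm to accept, and the non-neon branch is only
a scalar stand-in for building on another host.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t V;

/* Load 4 floats from p with a :128 alignment hint; the trailing '!'
   advances p by 16 bytes as part of the same vld1 instruction.
   %e0 / %f0 name the low / high d-register halves of the q operand. */
#define VLD_ALIGNED_POSTINC(v, p)                          \
    __asm__ volatile("vld1.32 {%e0, %f0}, [%1,:128]!"      \
                     : "=w"(v), "+r"(p)                    \
                     :                                     \
                     : "memory")

static inline float lane_sum(V v)
{
    return vgetq_lane_f32(v, 0) + vgetq_lane_f32(v, 1)
         + vgetq_lane_f32(v, 2) + vgetq_lane_f32(v, 3);
}

#else
/* Scalar stand-in (assumption): same load-then-advance behaviour. */
typedef struct { float f[4]; } V;

#define VLD_ALIGNED_POSTINC(v, p)                          \
    do {                                                   \
        for (int i_ = 0; i_ < 4; ++i_)                     \
            (v).f[i_] = (p)[i_];                           \
        (p) += 4;                                          \
    } while (0)

static inline float lane_sum(V v)
{
    return v.f[0] + v.f[1] + v.f[2] + v.f[3];
}
#endif

/* Sum nvec consecutive 4-float vectors starting at a 16-byte aligned p.
   The loop body contains no separate pointer arithmetic: the load
   itself moves p forward. */
static float sum_aligned(const float *p, int nvec)
{
    assert(((uintptr_t)p & 15u) == 0);  /* the :128 hint requires this */
    float total = 0.0f;
    for (int k = 0; k < nvec; ++k) {
        V v;
        VLD_ALIGNED_POSTINC(v, p);
        total += lane_sum(v);
    }
    return total;
}
```

The input buffer has to be 16-byte aligned (for example declared with
_Alignas(16)), since a violated alignment hint faults on the real
hardware rather than silently falling back.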

In case anyone would like to test the library out on their own, I'm
configuring fftw3 with

CFLAGS="-Os -pipe -mcpu=cortex-a8 -mfloat-abi=softfp" ./configure \
  --prefix=/usr --host=arm-none-linux-gnueabi --enable-float --with-pic \
  --enable-neon --enable-shared

I would highly suggest adding -mfpu=neon to the cflags above as well
(for all code), otherwise configure.ac only adds -mfpu=neon to
simd-specific compiles.
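
One way to check which translation units actually saw -mfpu=neon is
to test the compiler's predefined macro. The sketch below relies on
GCC defining __ARM_NEON__ (newer compilers also define __ARM_NEON)
when neon is enabled; the names BUILT_WITH_NEON, built_with_neon and
require_neon_build are made up for the example.

```c
#include <assert.h>

/* Set at compile time, per translation unit: 1 only if this file was
   compiled with neon enabled (e.g. -mfpu=neon with softfp or hard). */
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#define BUILT_WITH_NEON 1
#else
#define BUILT_WITH_NEON 0
#endif

/* Report the build setting of this file, e.g. for a debug printout. */
static int built_with_neon(void)
{
    return BUILT_WITH_NEON;
}

/* Optional guard for code that must not run from a non-neon build. */
static inline void require_neon_build(void)
{
    assert(BUILT_WITH_NEON && "compile this file with -mfpu=neon");
}
```

Dropping built_with_neon() into both a simd and a non-simd source
file and printing the results should show the split described above
when -mfpu=neon is left out of CFLAGS.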

Although I did get quite a bit done this week, it's been slower than I
would have liked for two reasons: 1) Canada Day (daycare is closed),
and 2) my significant other was on the other side of the continent for
an academic visit, and so I've been a single parent this week.
Although I did get to spend lots of extra time with my son, which is
always welcome, I think I lost about one or two hours a day getting to
the daycare and back.

Plans for next week:

1) rewrite inline functions as __asm__ blocks in C macros.
2) speed-ups!
3) continue investigating codelet-free approaches (i.e. sacrificing
the fftw methodology for speed)
4) fix cycle counter!