GCC inline assembly for NEON

mic · October 26, 2009, 9:05am

Hello everyone,
I may be asking a stupid question, but I'm having all sorts of trouble
using the inline assembly to speed up my software with NEON. Would
anyone be able to tell me why the NEON piece of code breaks my
program?

#ifdef __ARM_NEON__

void FastMat4x4x4Mul(float* out, const float* a, const float* b)
{
  __asm__ __volatile__
    ("vldmia %[A], { q4-q7 } \n\t"
     "vldmia %[B], { q8-q11 } \n\t"
     "vmul.f32 q0, q8, d8[0] \n\t"
     "vmul.f32 q1, q8, d10[0] \n\t"
     "vmul.f32 q2, q8, d12[0] \n\t"
     "vmul.f32 q3, q8, d14[0] \n\t"
     "vmla.f32 q0, q9, d8[1] \n\t"
     "vmla.f32 q1, q9, d10[1] \n\t"
     "vmla.f32 q2, q9, d12[1] \n\t"
     "vmla.f32 q3, q9, d14[1] \n\t"
     "vmla.f32 q0, q10, d9[0] \n\t"
     "vmla.f32 q1, q10, d11[0] \n\t"
     "vmla.f32 q2, q10, d13[0] \n\t"
     "vmla.f32 q3, q10, d15[0] \n\t"
     "vmla.f32 q0, q11, d9[1] \n\t"
     "vmla.f32 q1, q11, d11[1] \n\t"
     "vmla.f32 q2, q11, d13[1] \n\t"
     "vmla.f32 q3, q11, d15[1] \n\t"
     "vstmia %[R], { q0-q3 } \n\t"
     ::[R]"r" (out), [A]"r" (a), [B]"r" (b)
     :
"memory","q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11");
}

#else

void FastMat4x4x4Mul(float* output, const float* a, const float* b)
{
  int i, j, k;
  int r1=4, c1r2=4, c2=4;
  float sum;

  for(i=0; i < c2; i++) {
    for(j=0; j < r1; j++) {
      sum = 0.0;
      for(k=0; k < c1r2; k++) {
        sum += a[j*c1r2+k] * b[k*c2+i];
      }
      output[j*c2+i] = sum;
    }
  }
}

#endif

Specifically, it looks like the result of the function is fine, but
the program does not execute in the same way afterwards.. it's like
some clobbered register is not restored.. I don't understand.
I use the Gumstix OpenEmbedded GCC Toolchain on the Overo Earth:

~/overo-oe/tmp/cross/armv7a/bin/arm-angstrom-linux-gnueabi-gcc

with options:

-Wall -g -O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp

and the image is the default Gumstix v0.92.
Using CodeSourcery Lite 2009q3, I get different results again (a
pthread locks up somewhere).

Cheers,
Michele

mansr · October 26, 2009, 2:24pm

mic <michele.bavaro@gmail.com> writes:

> Hello everyone,
> I may be asking a stupid question, but I'm having all sorts of trouble
> using the inline assembly to speed up my software with NEON. Would
> anyone be able to tell me why the NEON piece of code breaks my
> program?

You shouldn't use inline asm with NEON. It's almost impossible to get
it right.

[...]

> Specifically, it looks like the result of the function is fine, but
> the program does not execute in the same way afterwards.. it's like
> some clobbered register is not restored.. I don't understand.

Look at the assembler generated by the compiler. That should tell you
what's going wrong.

balister · October 26, 2009, 2:49pm

mic <michele.bavaro@gmail.com> writes:

>> Hello everyone,
>> I may be asking a stupid question, but I'm having all sorts of trouble
>> using the inline assembly to speed up my software with NEON. Would
>> anyone be able to tell me why the NEON piece of code breaks my
>> program?
>
> You shouldn't use inline asm with NEON. It's almost impossible to get
> it right.

Any particular reason? Why does this stuff have to be so hard ...

Philip

mansr · October 26, 2009, 3:23pm

Philip Balister <philip.balister@gmail.com> writes:

Koen_Kooi · October 26, 2009, 3:34pm

I suspect register allocation, but I'm only repeating what people say on IRC.

regards,

Koen

Laurent_Desnogues · October 26, 2009, 3:43pm

And instruction scheduling too.

Laurent

Siarhei_Siamashka · October 26, 2009, 3:46pm

> Hello everyone,
> I may be asking a stupid question, but I'm having all sorts of trouble
> using the inline assembly to speed up my software with NEON. Would
> anyone be able to tell me why the NEON piece of code breaks my
> program?
>
> #ifdef __ARM_NEON__
>
> void FastMat4x4x4Mul(float* out, const float* a, const float* b)
> {
>   __asm__ __volatile__
>     ("vldmia %[A], { q4-q7 } \n\t"
>      "vldmia %[B], { q8-q11 } \n\t"
>      "vmul.f32 q0, q8, d8[0] \n\t"
>      "vmul.f32 q1, q8, d10[0] \n\t"
>      "vmul.f32 q2, q8, d12[0] \n\t"
>      "vmul.f32 q3, q8, d14[0] \n\t"
>      "vmla.f32 q0, q9, d8[1] \n\t"
>      "vmla.f32 q1, q9, d10[1] \n\t"
>      "vmla.f32 q2, q9, d12[1] \n\t"
>      "vmla.f32 q3, q9, d14[1] \n\t"
>      "vmla.f32 q0, q10, d9[0] \n\t"
>      "vmla.f32 q1, q10, d11[0] \n\t"
>      "vmla.f32 q2, q10, d13[0] \n\t"
>      "vmla.f32 q3, q10, d15[0] \n\t"
>      "vmla.f32 q0, q11, d9[1] \n\t"
>      "vmla.f32 q1, q11, d11[1] \n\t"
>      "vmla.f32 q2, q11, d13[1] \n\t"
>      "vmla.f32 q3, q11, d15[1] \n\t"
>      "vstmia %[R], { q0-q3 } \n\t"

^^^^^^^^^

This instruction updates %[R] register.

> ::[R]"r" (out), [A]"r" (a), [B]"r" (b)

^^^^^^^^^

And this tells gcc that %[R] is a constant input argument. Same for [A] and
[B].

> "memory","q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11");

^^^^^^^^^^^
CodeSourcery toolchain 2007q3 has a bug with handling 'q' registers in the
clobber list. Not sure if it can affect you, but to be on the safe side, it is
better to replace them with equivalent 'd' registers.
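
For anyone unsure what the equivalent 'd' registers are: on NEON each 128-bit register qN aliases the 64-bit pair d(2N) and d(2N+1), so q0-q11 correspond to d0-d23. A minimal sketch of that mapping in plain C follows; the helper names `q_to_d` and `format_d_clobbers` are made up for illustration and are not part of any toolchain.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: qN overlaps d(2N) (low half) and d(2N+1) (high half). */
static void q_to_d(int q, int *d_lo, int *d_hi)
{
    assert(q >= 0 && q <= 15);   /* NEON has q0..q15 */
    *d_lo = 2 * q;
    *d_hi = 2 * q + 1;
}

/* Hypothetical helper: write the quoted d-register names covering
 * q_first..q_last into buf, comma separated, in the form used in a
 * clobber list. Returns the number of characters written. */
static size_t format_d_clobbers(char *buf, size_t size, int q_first, int q_last)
{
    size_t used = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (int q = q_first; q <= q_last; q++) {
        int lo, hi;
        q_to_d(q, &lo, &hi);
        int n = snprintf(buf + used, size - used, "%s\"d%d\",\"d%d\"",
                         q > q_first ? "," : "", lo, hi);
        assert(n > 0 && (size_t)n < size - used);  /* buffer must be big enough */
        used += (size_t)n;
    }
    return used;
}
```

Calling `format_d_clobbers(buf, sizeof buf, 4, 7)` yields the names d8 through d15, which are the registers the matrix code above takes its scalar operands from.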

> Specifically, it looks like the result of the function is fine, but
> the program does not execute in the same way afterwards.. it's like
> some clobbered register is not restored.. I don't understand.

Yes, the constraints are wrong.

Dale_Weber · October 26, 2009, 4:00pm

Greetings everyone,

From a suggestion Gerald made, I've created the RoboticsBus project on
beagleboard.org. As I think about this and consider it more, it occurs to me
that this project does not have to be just for robotics expansion boards, but
can be used as a generalized expansion bus for Beagle. For now, though, I won't
change the name of the project (RoboticsBus).

You can find the URL to the project wiki in my signature. Please feel free to
read, comment, add to, and generally contribute in any way you can. I want
this to be a completely Open Source project, both in hardware and software.

When this project is far enough along to start working on software, I'll
create a repository on github.com and add whoever wants to help with software
as a contributor. I believe we can even use this for schematics to track
versions.

I've been writing quite a lot on this project on my Wiki, and have been
trying to keep things in a logical format.

8-Dale

Radha_Krishna_Sriman · October 26, 2009, 4:10pm

Thanks Dale, I will be watching this space…

mic · October 26, 2009, 4:16pm

To Siarhei Siamashka:
In theory it's right, but changing the last piece to

:[R]"+r" (out), [A]"+r" (a), [B]"+r" (b)
::"memory","q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11");

makes no difference at all. Please note that I've tried CS 2009q3, not
2007q3.

To Mans Rullgard:
Oh dear! Should I not use inline assembly with NEON??? Should I use
gcc intrinsics then? Is writing .S files the only (scary) option?

Thanks everyone for the contributions,
Michele

mansr · October 26, 2009, 5:42pm

mic <michele.bavaro@gmail.com> writes:

> To Siarhei Siamashka:
> In theory it's right, but changing the last piece to
>
> :[R]"+r" (out), [A]"+r" (a), [B]"+r" (b)
> ::"memory","q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11");
>
> makes no difference at all. Please note that I've tried CS 2009q3, not
> 2007q3.
>
> To Mans Rullgard:
> Oh dear! Should I not use inline assembly with NEON??? Should I use
> gcc intrinsics then? Is writing .S files the only (scary) option?

Intrinsics are worse than inline asm. Use .S files. They have the
additional advantage of allowing you to compile the C code with any
compiler, not only gcc.
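
For comparison with the inline asm earlier in the thread, the same row-major 4x4 product can be written with the intrinsics from `arm_neon.h`. This is only a sketch of what that option looks like, not a claim about the quality of the code gcc generates from it. The function name `Mat4x4MulSketch` is made up, the scalar branch exists only so the file also builds without NEON, and `out` is assumed not to overlap `a` or `b`.

```c
#include <assert.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>

/* out = a * b for row-major 4x4 float matrices (NEON intrinsics path). */
void Mat4x4MulSketch(float *out, const float *a, const float *b)
{
    assert(out && a && b);

    /* One row of b per vector, loaded once up front. */
    float32x4_t b0 = vld1q_f32(b + 0);
    float32x4_t b1 = vld1q_f32(b + 4);
    float32x4_t b2 = vld1q_f32(b + 8);
    float32x4_t b3 = vld1q_f32(b + 12);

    for (int i = 0; i < 4; i++) {
        /* Row i of a supplies the four scalars for output row i. */
        float32x4_t ai = vld1q_f32(a + 4 * i);
        float32x2_t lo = vget_low_f32(ai);   /* a[i][0], a[i][1] */
        float32x2_t hi = vget_high_f32(ai);  /* a[i][2], a[i][3] */

        float32x4_t r = vmulq_lane_f32(b0, lo, 0);
        r = vmlaq_lane_f32(r, b1, lo, 1);
        r = vmlaq_lane_f32(r, b2, hi, 0);
        r = vmlaq_lane_f32(r, b3, hi, 1);
        vst1q_f32(out + 4 * i, r);
    }
}

#else

/* Scalar fallback so the sketch compiles and runs without NEON. */
void Mat4x4MulSketch(float *out, const float *a, const float *b)
{
    assert(out && a && b);

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a[4 * i + k] * b[4 * k + j];
            out[4 * i + j] = sum;
        }
    }
}

#endif
```

The multiply and accumulate steps mirror the vmul.f32/vmla.f32 sequence of the asm version, but register allocation and scheduling are left to the compiler, which is exactly the part being debated here.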

Boireau_Laurent · October 27, 2009, 8:53am

A first-step suggestion for this would be to put together a clear wiki page describing the steps to bring out as many I2C, SPI, GPIO/interrupt and UART signals as possible to the BeagleBoard expansion connector, through kernel configuration/OE or u-boot config. That would define an "ideal Beagle configuration" for robotics, which would be a good starting point for any robotics project, including this one, and independent of future hardware developments. You can already do quite a lot from Beagle with I2C and SPI components, without any extra microcontroller, including sensors, ADCs, driving several servos, etc.
Something like http://elinux.org/BeagleBoardPinMux, with much more detail, would be most welcome. Providing a "robotics" u-boot.bin that includes those settings would also be nice, and usable with any Angstrom demo, for those more interested in "high level" applications than kernel configs.
A standard API to access those resources and others (timers, PWM) would also be interesting, as would sharing device drivers for common sensors, servo drivers, etc.
This is a nice endeavour you're undertaking, Dale. Hope I can help,

Laurent


Dale_Weber · October 27, 2009, 7:12pm

I'm in the process of looking at exactly which signals and in what
combinations they might be used. I'm still brand new to Beagle, so still have
quite a learning curve to go through. I know for sure we want I2C and SPI,
along with as many UARTs as can be made usable. I'm more interested in the
various communication methods available on Beagle than in working directly
with GPIOs and such, although that might be an option also. I think Beagle
would be better used for things like its main processing power, graphics,
vision processing, etc.

My original idea is to use smaller micros, like expansion boards with AVRs,
to handle most of the direct interfacing with sensors and GPIOs. We may need
some of Beagle's GPIOs for control signals. I haven't gotten far enough into
thinking about all this yet to know for sure. I'm proposing more of a
communication bus for communication between Beagle and smart expansion boards.

To effectively use the Beagle's GPIOs for digital I/Os and sensors in robotics
would require adding 3-pin headers with power and ground buses next to the
signal pins. I'm not sure I want to use board real estate for these connected
to Beagle, but it might be an option if there is interest.

Expansion boards based on micros like AVRs can easily interface with the 3.3V
and 5V components most often found and used in robotics. They also have much
needed things like analog inputs, more UARTs, etc. I'd connect these to Beagle
through a bus type interface and let Beagle do all the heavy processing where
it's required to use the data these boards provide. I don't think it would
really be appropriate to connect servos directly to a Beagle, for instance.

We can certainly discuss topics like GPIOs and such though, and see where
things go from there. This expansion bus is to be completely open and Open
Source, both in software and hardware, so anyone can feel free to contribute
in any way they want and add anything they want as long as everything works
together. I won't be able to actually get Linux up and running on my Beagle
for a couple weeks, because I need to get a USB Hub and SDHC card reader. I
think I have everything else I want for my Beagle except a Zippy.

I do have a Wiki started for this as well as other projects of mine. I've
also set up forums that allow attachments and code to be included in
postings. I also have my Blog, which I've been writing on for a while now. They
are all available at http://www.thedynaplex.info now. Unfortunately, there is
no interaction between the three different packages, so you have to create an
account on each one you want access to. I'm also considering putting something
like Drupal online for this, which has all these features rolled into a single
package.

8-Dale

mic · October 29, 2009, 11:53am

For everyone's reference, here is the assembly output of gcc for
the following piece of code:

void FastMat4x4x4Mul(float* out, const float* a, const float* b)
{
  __asm__ __volatile__
    ("vldmia %[A], { q4-q7 } \n\t"
     "vldmia %[B], { q8-q11} \n\t"
     "vmul.f32 q0, q8, d8[0] \n\t"
     "vmul.f32 q1, q8, d10[0] \n\t"
     "vmul.f32 q2, q8, d12[0] \n\t"
     "vmul.f32 q3, q8, d14[0] \n\t"
     "vmla.f32 q0, q9, d8[1] \n\t"
     "vmla.f32 q1, q9, d10[1] \n\t"
     "vmla.f32 q2, q9, d12[1] \n\t"
     "vmla.f32 q3, q9, d14[1] \n\t"
     "vmla.f32 q0, q10, d9[0] \n\t"
     "vmla.f32 q1, q10, d11[0] \n\t"
     "vmla.f32 q2, q10, d13[0] \n\t"
     "vmla.f32 q3, q10, d15[0] \n\t"
     "vmla.f32 q0, q11, d9[1] \n\t"
     "vmla.f32 q1, q11, d11[1] \n\t"
     "vmla.f32 q2, q11, d13[1] \n\t"
     "vmla.f32 q3, q11, d15[1] \n\t"
     "vstmia %[R], { q0-q3 } \n\t"
     :[R]"+r" (out), [A]"+r" (a), [B]"+r" (b)
     ::"memory","d0","d1","d2","d3","d4","d5","d6","d7",
      "d8","d9","d10","d11","d12","d13","d14","d15",
      "d16","d17","d18","d19","d20","d21","d22","d23");
}

Here it is:

  .global FastMat4x4x4Mul
  .type FastMat4x4x4Mul, %function
FastMat4x4x4Mul:
.LFB2:
  .file 1 "./src/fastmat.c"
  .loc 1 11 0
  @ args = 0, pretend = 0, frame = 0
  @ frame_needed = 0, uses_anonymous_args = 0
  @ link register save eliminated.
.LVL0:
  fstmfdd sp!, {d8, d9, d10, d11, d12, d13, d14, d15}
.LCFI0:
  .loc 1 12 0
#APP
@ 12 "./src/fastmat.c" 1
  vldmia r1, { q4-q7 }
  vldmia r2, { q8-q11}
  vmul.f32 q0, q8, d8[0]
  vmul.f32 q1, q8, d10[0]
  vmul.f32 q2, q8, d12[0]
  vmul.f32 q3, q8, d14[0]
  vmla.f32 q0, q9, d8[1]
  vmla.f32 q1, q9, d10[1]
  vmla.f32 q2, q9, d12[1]
  vmla.f32 q3, q9, d14[1]
  vmla.f32 q0, q10, d9[0]
  vmla.f32 q1, q10, d11[0]
  vmla.f32 q2, q10, d13[0]
  vmla.f32 q3, q10, d15[0]
  vmla.f32 q0, q11, d9[1]
  vmla.f32 q1, q11, d11[1]
  vmla.f32 q2, q11, d13[1]
  vmla.f32 q3, q11, d15[1]
  vstmia r0, { q0-q3 }

@ 0 "" 2
.LVL1:
  .loc 1 36 0
  fldmfdd sp!, {d8, d9, d10, d11, d12, d13, d14, d15}
  bx lr

Cheers,
Michele

Siarhei_Siamashka · October 29, 2009, 2:08pm

As an additional experiment, try to execute the following command as root
before running your test:

# echo 4 > /proc/cpu/alignment

Kernel may behave really funny when it tries to emulate unaligned NEON memory
accesses.

mansr · October 29, 2009, 3:44pm

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:

Siarhei_Siamashka · October 29, 2009, 4:00pm

It spins if you disable emulation.

But when emulation is enabled, it just decodes NEON instructions wrong and
interprets them as something else. Sometimes with weird side effects (like ARM
registers getting modified).

mansr · October 29, 2009, 4:12pm

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:

Siarhei_Siamashka · October 30, 2009, 1:02pm

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
>> Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:
>> >> As a reference to all I attach the assembly output of gcc compiling
>> >> the following piece of code:
>> >>
>> >>
>> >> void FastMat4x4x4Mul(float* out, const float* a, const float* b)
>> >
>> > As an additional experiment, try to execute the following command as
>> > root before running your test:
>> >
>> > # echo 4 > /proc/cpu/alignment
>> >
>> > Kernel may behave really funny when it tries to emulate unaligned NEON
>> > memory accesses.
>>
>> It doesn't try, it just spins on the faulting instruction.
>
> It spins if you disable emulation.
>
> But when emulation is enabled, it just decodes NEON instructions
> wrong and interprets them as something else. Sometimes with weird
> side effects (like ARM registers getting modified).

> Ouch.

I haven't checked the latest kernels yet, so don't know whether this issue
still exists. But it was present in 2.6.28 at least.

Anyway, considering the use of VLDM instructions instead of VLD1 in the posted
code snippet, alignment related problems could be potentially involved. Let's
wait for a reply from Michele to see if it really was the case.
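
One way to rule this out before calling the routine is to check the pointers being passed in. The sketch below assumes that word (4-byte) alignment is what VLDM/VSTM need, whereas VLD1 without an alignment qualifier is more tolerant; the ARM Architecture Reference Manual has the exact rules. The helper names `is_aligned` and `mat_args_word_aligned` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of `alignment` bytes, 0 otherwise.
 * `alignment` must be a non-zero power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical guard for the three matrix pointers: word alignment is
 * assumed here to be sufficient for vldmia/vstmia. */
static int mat_args_word_aligned(const float *out, const float *a, const float *b)
{
    return is_aligned(out, 4) && is_aligned(a, 4) && is_aligned(b, 4);
}
```

A failing check would point at the caller, for example a matrix embedded at an odd offset in a packed struct or a byte buffer cast to `float *`.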

> I always run with unaligned fixups entirely disabled using that
> patch RMK refuses to talk about. I want that SIGBUS.

I also have fixups disabled (from one of the scripts early at boot, did not
bother to patch kernel). I hope that even if kernel keeps having alignment
fixup as a default setup, at least linux distros will use something more
reasonable.