Alignment trap handling for ARMv6 and ARMv7

Hi All,

Does anyone know how much hardware support ARMv6 or ARMv7 [Cortex-A8,
OMAP3, Beagleboard] has for unaligned memory access [alignment trap
fault]?
I recently saw that there is a patch for it [1], but I am not sure how much
it affects performance when an unaligned memory access occurs.
I think this patch sets /proc/cpu/alignment to 2 [fixup] as the default.

For ARMv6, I saw some information at [2], section "4.2.5. Support for
unaligned data access in ARMv6 (U=1)", about what happens if the U bit
is set in the control register.

Do ARMv6 or ARMv7 behave almost like x86 in performance if the U bit
is set to 1?

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blobdiff;f=arch/arm/mm/alignment.c;h=2d5884ce0435fb436a57bee6d314284b9101e87e;hp=133e65d166b315b0e54aba959846a162643bc927;hb=baa745a3378046ca1c5477495df6ccbec7690428;hpb=794baba637999b81aa40e60fae1fa91978e08808

[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211h/Cdffhdje.html

Thanks for your help.

Thanks and Regards,
Shivdas Gujare

Hi All,

Does anyone know how much hardware support ARMv6 or ARMv7 [Cortex-A8,
OMAP3, Beagleboard] has for unaligned memory access [alignment trap
fault]?

You already found a relevant section in ARM documentation (your link [2]),
you can get all the details there.

I recently saw that there is a patch for it [1], but I am not sure how much
it affects performance when an unaligned memory access occurs.
I think this patch sets /proc/cpu/alignment to 2 [fixup] as the default.

That's not a very wise default in my opinion. Better would be 4 (signal) or
at least 3 (fixup+warn). But you can change this behavior at runtime. I
remember a kernel patch was also submitted somewhere for making the initial
'/proc/cpu/alignment' setup configurable in the kernel config.
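For reference, the runtime switch mentioned above can be exercised roughly like this (a sketch; the /proc/cpu/alignment file only exists on ARM kernels built with alignment trap handling):

```shell
# Show the current mode and the trap counters (ARM kernels only).
# Mode values: 0 = ignore, 1 = warn, 2 = fixup, 3 = fixup+warn,
#              4 = signal, 5 = signal+warn
cat /proc/cpu/alignment

# Switch to signal mode so offending programs fault visibly
# instead of being silently (and slowly) fixed up; requires root.
echo 4 > /proc/cpu/alignment
```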

For ARMv6, I saw some information at [2], section "4.2.5. Support for
unaligned data access in ARMv6 (U=1)", about what happens if the U bit
is set in the control register.

IIRC the U bit is always set in Linux for the ARM chips which support it. And
on ARMv7 (the Beagleboard uses ARMv7), unaligned access support can't even
be turned off (the CPU only supports U=1 mode).

Do ARMv6 or ARMv7 behave almost like x86 in performance if the U bit
is set to 1?

Not quite; there are some tricky things. One of the pitfalls is that not all
instructions support unaligned accesses; some generate exceptions on unaligned
memory accesses. Only the instructions dealing with data sizes up to
32 bits fix up the alignment automagically, plus some NEON instructions. There
is a full table in the ARM documentation of which combinations are supported.

To make everything even more fun, if you are a C programmer, you can't freely
use unaligned memory accesses even if you deal with data types not larger than
int.

Let's have a look at the following example (bad code!):

/********************/
#include <stdio.h>

int __attribute__((noinline)) f(int *x)
{
    return x[0] + x[1];
}

int main()
{
    int buffer[3] = {0x12345678, 0x90ABCDEF, 0x12345678};
    printf("%08X\n", f((int *)((char *)buffer + 1)));
    return 0;
}
/********************/

If it is compiled with -Os optimizations, the following code is emitted by gcc
for 'f' function:

00000000 <f>:
   0: e8900009 ldm r0, {r0, r3}
   4: e0830000 add r0, r3, r0
   8: e12fff1e bx lr

It uses LDM instruction here (load multiple) to load 2 sequential ints into a
pair of ARM registers at once, so this is effectively a 64-bit load operation.
The LDM instruction does not support unaligned reads and will generate an
exception if the address is not properly aligned. Depending on the value
in /proc/cpu/alignment, this program will:

0: freeze, constantly triggering exceptions that are not handled right in
the kernel, so it constantly jumps between userspace and kernelspace,
loading the CPU at 100%

2: give you the same result as on x86, but silently spend a huge
amount of time handling the exception and emulating the unaligned access
in the kernel

4: die with SIGBUS

As I mentioned before, configuration 2 (fixup) is a bad choice in general.
The average Joe "x86 programmer" can insert lots of nonportable code (with
respect to alignment) into his programs. Even worse, as ARMv6 and
ARMv7 are supposed to support unaligned memory accesses based on the
information published here and there, he would probably even think that he is
doing the right thing :)

Configuration 4 (signal) at least lets you find such bugs in the code and
fix them.

As to gcc generating such code with -Os optimization in the first place: it
is doing the right thing. The code example is buggy and invokes undefined
behavior according to the C standard; it just happens to work seemingly
right on x86.

If you compile the example with '-Wcast-align' option, gcc will even issue a
warning on the problematic line. Such warnings may be handy sometimes when
porting applications to the platforms where alignment is more strict than on
x86.

Hi Siarhei,

Thanks, this is really a great help.

Hi All,

Does anyone know how much hardware support ARMv6 or ARMv7 [Cortex-A8,
OMAP3, Beagleboard] has for unaligned memory access [alignment trap
fault]?

You already found a relevant section in ARM documentation (your link [2]),
you can get all the details there.

I recently saw that there is a patch for it [1], but I am not sure how much
it affects performance when an unaligned memory access occurs.
I think this patch sets /proc/cpu/alignment to 2 [fixup] as the default.

That's not a very wise default in my opinion. Better would be 4 (signal) or
at least 3 (fixup+warn). But you can change this behavior at runtime. I
remember a kernel patch was also submitted somewhere for making the initial
'/proc/cpu/alignment' setup configurable in the kernel config.

I agree, using 3 (fixup+warn) instead of 2 (fixup) would be better,
or 4/5 if 2 affects performance. Handling this in the kernel
configuration sounds like a good idea.

For ARMv6, I saw some information at [2], section "4.2.5. Support for
unaligned data access in ARMv6 (U=1)", about what happens if the U bit
is set in the control register.

IIRC the U bit is always set in Linux for the ARM chips which support it. And
on ARMv7 (the Beagleboard uses ARMv7), unaligned access support can't even
be turned off (the CPU only supports U=1 mode).

Experts' comments would be more helpful here.

Do ARMv6 or ARMv7 behave almost like x86 in performance if the U bit
is set to 1?

Not quite; there are some tricky things. One of the pitfalls is that not all
instructions support unaligned accesses; some generate exceptions on unaligned
memory accesses. Only the instructions dealing with data sizes up to
32 bits fix up the alignment automagically, plus some NEON instructions. There
is a full table in the ARM documentation of which combinations are supported.

To make everything even more fun, if you are a C programmer, you can't freely
use unaligned memory accesses even if you deal with data types not larger than
int.

I tried your program, and it behaves just as you explained.
Thanks a lot for the detailed description.

If I run with 1/5 I get the following warning:
Alignment trap: a.out (299) PC=0x00008360 Instr=0xe8900009
Address=0xbef99be5 FSR 0x011

I tried to understand this because I need to trace back who caused the
unaligned access.

#arm-none-linux-gnueabi-objdump -Dxls a.out | less

00008360 <f>:
f():
    8360: e8900009 ldm r0, {r0, r3}
    8364: e0830000 add r0, r3, r0
    8368: e12fff1e bx lr

This gives me details about the PID (299), the application (a.out),
PC=0x00008360 and Instr=0xe8900009, but I am still trying to find out
what "Address=0xbef99be5 FSR 0x011" means. Do you have any idea?
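As a side note, the PC reported by the alignment trap message can be mapped back to a function name and source line with the cross toolchain's addr2line (a sketch; it assumes the binary was built with -g and a toolchain prefix matching the objdump invocation above):

```shell
# Translate the faulting PC from the trap message into a function
# name and source line (requires debug info in a.out).
arm-none-linux-gnueabi-addr2line -f -e a.out 0x00008360
```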



If anyone needs more help, refer to:
http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment
http://netwinder.osuosl.org/users/b/brianbr/public_html/alignment.html

Thanks and Regards,
Shivdas Gujare

The simplest way to debug these problems is to set /proc/cpu/alignment to 4
and run the program in gdb; it will break exactly at the right place. You can
also enable core dump generation and analyze the core dumps afterwards.

Maybe for some large-scale whole-system analysis, something more automated and
convenient can be created, so that you just run the system normally but get
preprocessed statistics with the names of the modules and functions which
perform unaligned memory accesses, sorted by frequency of occurrence.

The easiest way to achieve this would be to patch the kernel to report
unaligned memory accesses as special events to oprofile :)

Hi Siarhei,

I am using VLDM and VSTM instructions in my routine to access memory
which is aligned to 16 bytes.

The Cortex-A8 TRM (ARM DDI 0344J) mentions NEON alignment and its
qualifiers.

How do I give the alignment qualifier in an instruction to avoid extra
CPU cycles?

Thanks,
Rachit

This is all a bit confusing because the GNU assembler uses its own syntax
flavour.

You can generally find the needed information by checking:
- info as
- some existing ARM NEON assembly optimizations
- binutils sources

Regarding your particular question, you need to use something like this:

VLD1.8 {d0, d1}, [r0, :128]

The ":128" part specifies the alignment in bits.

Good luck

Siarhei Siamashka <siarhei.siamashka@gmail.com> writes:

Hi Siarhei,

Thanks for guiding me.

Where can I get the existing ARM NEON assembly optimizations and binutils
sources to get info about optimization?

I am looking to change the instruction in the way you have mentioned.

Thanks,
Rachit

Hi Siarhei,

Thanks for guiding me.

Where can I get existing ARM NEON assembly optimizations

Some of the open source projects have NEON optimizations already, they
are listed here: http://elinux.org/BeagleBoard#ARM_NEON

binutils sources

http://www.gnu.org/software/binutils/

In the open source world, the availability of sources can sometimes compensate
for the absence of (good) documentation. I just assume that anybody
seriously interested in assembly optimizations is already an
experienced software developer, quite familiar with the C language. That's why
I also suggested this option. It may be the last resort if you don't find the
needed information in some easier way.

to get info about optimization?

Cortex-A8 TRM (the one that you already have) contains "Instruction Cycle
Timing" section.

Additionally I suggest checking the following document. It has some nice
pictures and Cortex-A8 pipeline overview:
http://www.arm.com/miscPDFs/24588.pdf

And the last thing: always try to benchmark everything yourself. The TRM has a
warning notice: "Detailed descriptions of all possible instruction
interactions and all possible events taking place in the processor is beyond
the scope of this document. Only a cycle-accurate model of the processor can
produce precise timings for a particular instruction sequence."
So the TRM describes a somewhat simplified model, which more or less
correlates with reality. But it can always happen that those tiny omitted
details have a major impact on your code if they manifest themselves in a
performance-critical tight loop.

Hi Siarhei,

Thanks for your great support.

I will check the sources you have mentioned.

I have checked by changing the instruction from VLDM to VLD1, and I got a bit
of an improvement in performance in CPU cycles.

I am using the clock_gettime function to benchmark my routine. I have
tried it with oprofile, but there I didn't get the execution time
directly, only statistics of sampled CPU time.

Do you know any other way to benchmark code?

Again, thank you very much.
Rachit

Hi Siarhei,

How do I get unaligned memory accesses as special events into oprofile?

I have tried to set NEON_CYCLE and CYCLE_INST_STALL using the opcontrol
command, but I didn't get those event counter results. I only got the
CPU_CYCLE result from oprofile.

I have also checked by setting the Cortex-A8 PMNC register value, and I only
see the CPU cycle counter value. I am not getting the other event counter
values through the PMNC register.

Do you have any idea how to get other event info in oprofile or via the
Cortex-A8 PMNC register settings?

I also want to check Dcache and Icache misses.

Thanks,
Rachit