Hi All,
Does anyone know how much hardware support ARMv6 or ARMv7 [Cortex-A8,
OMAP3, Beagleboard] has for unaligned memory access [Alignment trap
fault].
You have already found a relevant section in the ARM documentation (your link [2]);
you can get all the details there.
I saw recently that there is a patch for it [1], but I'm not sure how much it
affects performance when an unaligned memory access occurs.
I think this patch sets /proc/cpu/alignment to 2 [fixup] by default.
That's not a very wise default in my opinion. Better would be 4 (signal) or
at least 3 (fixup+warn). But you can change this behavior at runtime. I
remember there was also a kernel patch submitted somewhere that makes the
initial '/proc/cpu/alignment' setting configurable in the kernel config.
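For reference, the value in /proc/cpu/alignment is a small bit mask, as described in the kernel's ARM mem_alignment documentation: 1 = warn, 2 = fixup, 4 = signal, so 3 is fixup+warn and 5 is signal+warn. A minimal C sketch of decoding it follows; the macro, enum and function names are hypothetical, chosen for illustration, and not a kernel API.

```c
#include <stdbool.h>

/* Bit values of the mode written to /proc/cpu/alignment, following the
 * kernel's ARM mem_alignment documentation. The macro names are ours. */
#define ALIGN_MODE_WARN   1  /* log each user-space alignment fault */
#define ALIGN_MODE_FIXUP  2  /* emulate the unaligned access in the kernel */
#define ALIGN_MODE_SIGNAL 4  /* deliver a signal to the faulting process */

/* What happens to a trapping user-space access (hypothetical enum). */
enum align_action { ACTION_IGNORE, ACTION_FIXUP, ACTION_SIGNAL, ACTION_INVALID };

/* Map a mode value to its action; only the documented values 0..5
 * are treated as valid here. */
static enum align_action alignment_action(int mode)
{
    if (mode < 0 || mode > 5)
        return ACTION_INVALID;
    if (mode & ALIGN_MODE_SIGNAL)
        return ACTION_SIGNAL;
    if (mode & ALIGN_MODE_FIXUP)
        return ACTION_FIXUP;
    return ACTION_IGNORE;
}

/* True if the mode additionally logs a warning for each fault. */
static bool alignment_warns(int mode)
{
    return mode >= 0 && mode <= 5 && (mode & ALIGN_MODE_WARN) != 0;
}
```

For example, `alignment_action(3)` yields `ACTION_FIXUP` and `alignment_warns(3)` is true, which is the "fixup+warn" setting suggested above.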
For ARMv6, I saw some information in [2], section "4.2.5. Support for
unaligned data access in ARMv6 (U=1)". Does that apply if the U bit is set
in the control register?
IIRC the U bit is always set in Linux on ARM chips that support it. And on
ARMv7 (the Beagleboard uses ARMv7), unaligned access support can't even be
turned off (the CPU only supports U=1 mode).
Do ARMv6 and ARMv7 behave almost like x86 in terms of performance if the U bit
is set to 1?
Not quite; there are some tricky things. One of the pitfalls is that not all
instructions support unaligned accesses; some generate exceptions on unaligned
memory accesses. Only the instructions that load or store data of up to 32 bits
(for example LDR/STR and LDRH/STRH) handle misalignment automatically in
hardware, plus some NEON instructions. The ARM documentation has a full table
listing which combinations are supported.
To make things even more fun, if you are a C programmer, you can't freely use
unaligned memory accesses even when dealing with data types no larger than
int.
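Whether an address is "unaligned" is plain arithmetic: it must be a multiple of the type's alignment requirement (4 bytes for int on 32-bit ARM). A small sketch using C11 `_Alignof` and `uintptr_t`; `is_aligned_for_int()` is a hypothetical helper name. It takes a `const void *` on purpose, since in C even creating a misaligned `int *` is already undefined behavior, so the check should happen on the raw address before any cast.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the raw address p satisfies int's
 * alignment requirement. Working on a void pointer means no
 * misaligned int pointer is ever formed. */
static bool is_aligned_for_int(const void *p)
{
    return ((uintptr_t)p % _Alignof(int)) == 0;
}
```

Applied to the example below, `is_aligned_for_int((char *)buffer + 1)` is false, which is exactly the address that ends up in the trapping load.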
Let's have a look at the following example (bad code!):
/********************/
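#include <stdio.h>

int __attribute__((noinline)) f(int *x)
{
    /* Reads two consecutive ints; gcc may merge them into one LDM */
    return x[0] + x[1];
}

int main(void)
{
    int buffer[3] = {0x12345678, 0x90ABCDEF, 0x12345678};
    /* BAD: buffer + 1 byte is not 4-byte aligned */
    printf("%08X\n", f((int *)((char *)buffer + 1)));
    return 0;
}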
/********************/
When compiled with -Os, gcc emits the following code for the 'f' function:
00000000 <f>:
0: e8900009 ldm r0, {r0, r3}
4: e0830000 add r0, r3, r0
8: e12fff1e bx lr
Here it uses the LDM (load multiple) instruction to load two sequential ints
into a pair of ARM registers at once, so this is effectively a 64-bit load.
LDM does not support unaligned reads and generates an exception if the
address is not properly aligned. Depending on the value in
/proc/cpu/alignment, this program will:
0: freeze, constantly triggering exceptions that are not handled properly in
the kernel, so it keeps jumping between userspace and kernelspace with the
CPU loaded at 100%
2: give you the same result as on x86, but silently spend a huge amount of
time handling the exception and emulating the unaligned access in the kernel
4: die with SIGBUS
As I mentioned before, configuration 2 (fixup) is a bad choice in general.
The average Joe "x86 programmer" can put lots of nonportable code (with
respect to alignment) into his programs. Even worse, since ARMv6 and ARMv7 are
supposed to support unaligned memory accesses according to the information
published here and there, he would probably even think that he is doing the
right thing.
Configuration 4 (signal) at least lets you find such bugs in the code and
fix them.
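Switching to signal mode at runtime only needs the number written to the proc file as root (the same as `echo 4 > /proc/cpu/alignment`). A minimal C sketch; `set_alignment_mode()` is a hypothetical helper, and the path is a parameter so it can be tried against an ordinary file on machines that lack /proc/cpu/alignment (it is not present on x86, for instance).

```c
#include <stdio.h>

/* Hypothetical helper: write an alignment mode (0..5) to the given
 * file, e.g. "/proc/cpu/alignment". Returns 0 on success, -1 on error.
 * Writing the real proc file requires root. */
static int set_alignment_mode(const char *path, int mode)
{
    if (mode < 0 || mode > 5)
        return -1;                  /* reject undocumented values */

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;                  /* e.g. not root, or no such file */

    int ok = fprintf(f, "%d\n", mode) > 0;
    if (fclose(f) != 0)             /* buffered write is flushed here */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling `set_alignment_mode("/proc/cpu/alignment", 4)` early in a test run would make programs like the one above die immediately instead of being silently fixed up.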
As for gcc generating such code with -Os in the first place: it is doing the
right thing. The example code is buggy; accessing an int through a misaligned
pointer is undefined behavior according to the C standard. It just happens to
seem to work on x86.
If you compile the example with the '-Wcast-align' option, gcc will even warn
about the problematic line. Such warnings can be handy when porting
applications to platforms where alignment is stricter than on x86.
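One portable way to fix the example is to keep the misaligned location as a `char` pointer and copy the bytes into properly aligned objects with `memcpy`. A sketch follows; `f_fixed` is a hypothetical name, and it uses `unsigned` rather than `int` so the addition wraps instead of risking signed overflow and so the value matches the `%08X` format. With optimization, gcc on targets that allow unaligned word loads can turn such small `memcpy` calls into plain loads, but that is a compiler detail, not a guarantee.

```c
#include <string.h>

/* Sum two consecutive 32-bit values stored at a possibly misaligned
 * address. memcpy copies the bytes into aligned locals, so no
 * misaligned int pointer is ever created or dereferenced. */
static unsigned f_fixed(const char *p)
{
    unsigned a, b;
    memcpy(&a, p, sizeof a);
    memcpy(&b, p + sizeof a, sizeof b);
    return a + b;                   /* unsigned arithmetic wraps */
}
```

In the original `main`, `printf("%08X\n", f_fixed((const char *)buffer + 1));` should print the same value the buggy version prints on x86, without any alignment trap on ARM.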