beagleboard/cortex-a8 performance

Hi,

I'm trying to do some tests to see how the cortex-a8 performs with
video and I'm getting very strange results with mplayer:

The test:

# wget http://samples.mplayerhq.hu/benchmark/testsuite1/matrixbench_normdivx_vbrmp3.avi
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK

On a nokia n800 (300MHz omap2420):

BENCHMARKs: VC: 122.543s VO: 0.162s A: 0.000s Sys: 1.416s = 124.120s

So it can decode the complete video in ~2 minutes. The beagle:

BENCHMARKs: VC: 193.856s VO: 0.153s A: 0.000s Sys: 2.718s = 196.727s

Wow! That's a *lot* slower than nokia n800. A CPU with twice the
megahertz is 50% slower!
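
To put a number on the gap, the two totals above can be divided directly. This is only a small sketch using awk; the helper name `ratio` is made up for illustration and is not part of mplayer.

```shell
# Ratio of two benchmark totals (in seconds), rounded to two decimals.
# 'ratio' is a hypothetical helper name.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Beagle total divided by N800 total, both taken from the runs above.
ratio 196.727 124.120
```

On these figures the beagle needs a bit under 1.6 times as long as the N800 for the same decode.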

The mplayer used is the one from https://garage.maemo.org/projects/mplayer/
because that has armv6 simd and armv6 vfp optimizations.

The CFLAGS used:

-march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp -fexpensive-optimizations -ftree-vectorize -fomit-frame-pointer -O4 -ffast-math

I wondered why that is and got a hint from this:

"Clocking rate (Crystal/DPLL/ARM core): 26.0/266/381 MHz"

So the cpu is not running at 600MHz, but at 381MHz, is that expected?
But even at 381 MHz it should be faster than an omap2.

Does anyone have any ideas or hints on this? I'll try running the
test-idct and test-unquantize programs later this week.

I'm trying to do some tests to see how the cortex-a8 performs with
video and I'm getting very strange results with mplayer:

The test:

# wget http://samples.mplayerhq.hu/benchmark/testsuite1/matrixbench_normdivx_vbrmp3.avi
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK

This command line option forces ARMv5TE IDCT (useful for ARM9E and old XScale
cores without IWMMXT support). ARMv6 IDCT can be enabled using
'-lavdopts idct=17', it may work better.

On a nokia n800 (300MHz omap2420):

AFAIK N800 runs at 330MHz with OS2007 and at up to 400MHz with OS2008.
In order to disable frequency scaling in OS2008 and keep it running
at 400MHz for more reliable results, you can use:

# echo null > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo 0 > /sys/power/op_active
  

BENCHMARKs: VC: 122.543s VO: 0.162s A: 0.000s Sys: 1.416s = 124.120s

So it can decode the complete video in ~2 minutes. The beagle:

BENCHMARKs: VC: 193.856s VO: 0.153s A: 0.000s Sys: 2.718s = 196.727s

Wow! That's a *lot* slower than nokia n800. A CPU with twice the
megahertz is 50% slower!

From Cortex-A8 TRM. Instructions Cycle Timing:
Halfword: SMULxx and SMLAxx - 2 cycles
but Dual halfword: SMUAD, SMUSD - 1 cycle

ARMv5TE IDCT heavily uses SMULxx and SMLAxx instructions which take 1 cycle on
ARM9E, ARM11 and XScale.

Anyway, I suspect that the best results can be obtained when using NEON SIMD
optimizations :)

The mplayer used is the one from https://garage.maemo.org/projects/mplayer/
because that has armv6 simd and armv6 vfp optimizations.

The CFLAGS used:

-march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp -fexpensive-optimizations -ftree-vectorize -fomit-frame-pointer -O4 -ffast-math

I wondered why that is and got a hint from this:

"Clocking rate (Crystal/DPLL/ARM core): 26.0/266/381 MHz"

So the cpu is not running at 600MHz, but at 381MHz, is that expected?
But even at 381 MHz it should be faster than an omap2.

Does anyone have any ideas or hints on this? I'll try running the
test-idct and test-unquantize programs later this week.

That would also be interesting. I'm especially interested in 'test-vfp',
because looking at the TRM, it seems like VFP also got a major slowdown on
Cortex-A8.

But Cortex-A9 claims to double VFP performance when compared with the previous
generation :)

On 21 Apr 2008, at 02:00, Siarhei Siamashka wrote:

I'm trying to do some tests to see how the cortex-a8 performs with
video and I'm getting very strange results with mplayer:

The test:

# wget http://samples.mplayerhq.hu/benchmark/testsuite1/matrixbench_normdivx_vbrmp3.avi
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK

This command line option forces ARMv5TE IDCT (useful for ARM9E and old XScale
cores without IWMMXT support). ARMv6 IDCT can be enabled using
'-lavdopts idct=17', it may work better.

with idct=17:

BENCHMARKs: VC: 186.421s VO: 0.143s A: 0.000s Sys: 2.025s = 188.588s
BENCHMARK%: VC: 98.8504% VO: 0.0760% A: 0.0000% Sys: 1.0736% = 100.0000%

That would also be interesting. I'm especially interested in 'test-vfp',
because looking at the TRM, it seems like VFP also got a major slowdown on
Cortex-A8.

root@beagleboard:~/test# ./test-vfp --freq=$(dmesg | grep MHz | grep ARM | awk -F/ '{print $5}' | awk '{print $1}')

Function: 'vector_fmul_vfp', time=123.040
Function: 'vector_fmul_reverse_vfp', time=116.570
Function: 'float_to_int16_vfp', time=143.864
Function: 'ff_float_to_int16_c', time=38.269
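
The `--freq` argument in that command is pulled out of the boot log. As a sketch of what the awk pipeline does, here it is applied to the "Clocking rate" line quoted earlier in the thread instead of live dmesg output; the helper name `arm_mhz` is hypothetical.

```shell
# Extract the ARM core clock (in MHz) from a "Clocking rate" boot message.
# Splitting on "/" leaves "381 MHz" in the 5th field; the second awk
# keeps only the number. 'arm_mhz' is a hypothetical helper name.
arm_mhz() {
    printf '%s\n' "$1" | awk -F/ '{print $5}' | awk '{print $1}'
}

line='Clocking rate (Crystal/DPLL/ARM core): 26.0/266/381 MHz'
arm_mhz "$line"
```

The field index depends on the label itself containing two slashes ("Crystal/DPLL/ARM core"), so a differently worded boot message would need a different index.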

root@beagleboard:~/test# ./test-unquantize --freq=$(dmesg | grep MHz | grep ARM | awk -F/ '{print $5}' | awk '{print $1}')
no cpu clock frequency specified, trying to autodetect it...
... detected as 469.6MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.05625 usec per element, or 26.4 cycles (469.6MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01772 usec per element, or 8.3 cycles (469.6MHz)
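
The cycle figures above are just time per element multiplied by the clock. A small sketch of that arithmetic follows; `cycles` is a hypothetical helper name. Note that the tool autodetected 469.6 MHz here while the boot log reports 381 MHz, so the cycle counts are only as good as that frequency guess.

```shell
# cycles per element = microseconds per element * clock in MHz
# 'cycles' is a hypothetical helper name.
cycles() {
    awk -v usec="$1" -v mhz="$2" 'BEGIN { printf "%.1f\n", usec * mhz }'
}

cycles 0.05625 469.6   # C helper
cycles 0.01772 469.6   # ARMv5TE helper
```

Passing 381 as the second argument instead shows what the same timings would mean if the boot log figure is the correct one.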

root@beagleboard:~/test# ./test-idct --freq=$(dmesg | grep MHz | grep ARM | awk -F/ '{print $5}' | awk '{print $1}') --enable-armv6
avg=-0.08, stddev=36.96, min=-168.00, max=149.00
Assuming cpu clock frequency 381MHz (ARMv6 enabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=535.2
simple_idct_put_armv5te cache=no, time=668.3
simple_idct_put_armv5te cache=yes, time=662.9
simple_idct_add_armv5te cache=no, time=890.5
simple_idct_add_armv5te cache=yes, time=744.9
simple_idct_armv5te_ref time=935.8
simple_idct_put_armv5te_ref cache=no, time=1190.6
simple_idct_put_armv5te_ref cache=yes, time=1171.2
simple_idct_add_armv5te_ref cache=no, time=1372.2
simple_idct_add_armv5te_ref cache=yes, time=1229.4
simple_idct_armv6 time=665.1
simple_idct_put_armv6 cache=no, time=934.0
simple_idct_put_armv6 cache=yes, time=754.6
simple_idct_add_armv6 cache=no, time=999.4
simple_idct_add_armv6 cache=yes, time=854.8
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1235.1
simple_idct_put_armv5te cache=no, time=1375.2
simple_idct_put_armv5te cache=yes, time=1367.0
simple_idct_add_armv5te cache=no, time=1617.9
simple_idct_add_armv5te cache=yes, time=1472.9
simple_idct_armv5te_ref time=1616.1
simple_idct_put_armv5te_ref cache=no, time=1863.3
simple_idct_put_armv5te_ref cache=yes, time=1843.0
simple_idct_add_armv5te_ref cache=no, time=2041.1
simple_idct_add_armv5te_ref cache=yes, time=1899.8
simple_idct_armv6 time=1038.1
simple_idct_put_armv6 cache=no, time=1299.8
simple_idct_put_armv6 cache=yes, time=1119.5
simple_idct_add_armv6 cache=no, time=1383.3
simple_idct_add_armv6 cache=yes, time=1234.1
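
For comparing the bare IDCT variants it helps to strip the listing down to the lines without a cache= column. A minimal sketch, assuming the output format shown above; `idct_times` is a hypothetical helper name and the sample lines are copied from the random-coefficient run.

```shell
# Print "<name> <time>" for test-idct lines that have no cache= column.
# 'idct_times' is a hypothetical helper name.
idct_times() {
    awk '$2 ~ /^time=/ { sub(/^time=/, "", $2); print $1, $2 }'
}

printf '%s\n' \
    'simple_idct_armv5te time=1235.1' \
    'simple_idct_put_armv5te cache=no, time=1375.2' \
    'simple_idct_armv6 time=1038.1' | idct_times
```

In the listing above the ARMv6 IDCT is the faster one on random coefficients (1038.1 against 1235.1), while the ARMv5TE one is faster on zero coefficients (535.2 against 665.1).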

regards,

Koen

Currently, due to power issues, we are running the ARM MPU at 381 MHz with the L2 cache off. We have to raise this frequency to 500 MHz or more and also enable the L2 cache.

I should be able to work on a board with the power mods soon, and will update you all on this ASAP.

Regards,
Khasim

Currently, due to power issues, we are running the ARM MPU at 381 MHz with the L2
cache off. We have to raise this frequency to 500 MHz or more and also enable
the L2 cache.

What kind of power issues, hardware or software?

I should be able to work on a board with the power mods soon, and will update you all on
this ASAP.

Great, is there anything I can test on my board?

regards,

Koen

Koen,

You already have all of these mods on your board.

Gerald

My understanding is he needs a new xloader though, is this correct?

Is there a replacement for the DVFlasher to install the xloader easily?

Philip

xloader is loaded from SD by the rom thingy, so I only need a new MLO
(signed xloader) binary.

regards,

Koen

koen wrote:

My understanding is he needs a new xloader though, is this correct?

Is there a replacement for the DVFlasher to install the xloader easily?

xloader is loaded from SD by the rom thingy, so I only need a new MLO
(signed xloader) binary.

Have you tried to generate the MLO on your own? I haven't tried it yet, but

http://code.google.com/p/beagleboard/wiki/BeagleSourceCode

tells us:

-- cut --
Convert x-load.bin to MLO (required for MMC Boot)

1. Use the "SignGP" tool to sign the x-loader image. (The "x-load.bin.ift" file is generated in the same folder.)

           ./signGP x-load.bin

2. Rename x-load.bin.ift to MLO
-- cut --
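
The two quoted steps can be wrapped in one small function. This is only a sketch under the wiki's description: it assumes signGP writes x-load.bin.ift next to its input, and both the `make_mlo` name and the `SIGNGP` override variable are made up for illustration.

```shell
# Sign x-load.bin and rename the result to MLO, following the quoted steps.
# SIGNGP (hypothetical variable) defaults to ./signGP; 'make_mlo' is a
# hypothetical helper name. Run it in the directory holding x-load.bin.
make_mlo() {
    signgp="${SIGNGP:-./signGP}"
    "$signgp" x-load.bin || return 1   # expected to produce x-load.bin.ift
    mv x-load.bin.ift MLO
}
```

Calling `make_mlo` in the x-loader build directory should leave an MLO file ready to copy to the SD card.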

X-Loader source is available via

http://elinux.org/BeagleBoard#Git

As X-Loader is a stripped-down U-Boot, its include directory links to U-Boot. So you need a recent U-Boot with

http://groups.google.com/group/beagleboard/browse_thread/thread/3473b44af1e6e326#

on top. Have a look at omap3530beagle.h. Currently, PRCM_CLK_CFG2_266MHZ is configured there. PRCM_CLK_CFG2_332MHZ can be enabled instead.

I don't know how to enable the L2 cache and/or other frequencies, though. It seems that the public code has no preparation for other (higher?) frequency configurations yet?

Dirk

CONFIG_L2_OFF looks suspiciously like a cache disable option :)

regards,

Koen

The POWER MODS I was referring to were hardware modifications, and Gerald
confirmed that they are already in place on your boards.

For Enabling L2 Cache:

1. I have not disabled it in X-loader, so no changes to x-loader for this.
However, it is currently disabled in the kernel; to enable it you have to deselect
the option "Disable L2 Cache".

2. For running the MPU at 500 MHz, I can give out the u-boot and x-loader changes, but
I am waiting for everyone to get their boards modified, otherwise it might
block others. For now, I have attached the MLO and u-boot.bin for testing.
Just try this out, boot the kernel and read out the MPU clock by doing
  cat /proc/omap_clocks | grep -i "MPU"

Regards,
Khasim

MLO (16.3 KB)

u-boot.bin (151 KB)

For future reference:

root@beagleboard:/media/mmcblk0p1# md5sum mlo
6a9f907d630de81f0b8ee8398cf94cf6 mlo
root@beagleboard:/media/mmcblk0p1# md5sum u-boot.bin
2408dd1757856d52e71c110aa653c178 u-boot.bin

For Enabling L2 Cache:

1. I have not disabled it in X-loader, so no changes to x-loader for this.
However, it is currently disabled in the kernel; to enable it you have to deselect
the option "Disable L2 Cache".

For 2.6.22-beagle:

koen@lieve:/OE/angstrom-tmp/work/beagleboard-angstrom-linux-gnueabi/2.6_kernel$ grep CACHE ./arch/arm/configs/omap3_beagle_defconfig
CONFIG_CPU_CACHE_V7=y
CONFIG_CPU_CACHE_VIPT=y
# CONFIG_CPU_ICACHE_DISABLE is not set
# CONFIG_CPU_DCACHE_DISABLE is not set
CONFIG_CPU_L2CACHE_DISABLE=y
# CONFIG_OUTER_CACHE is not set

For linux-omap2 2.6.25:
koen@lieve:/OE/angstrom-tmp/work/beagleboard-angstrom-linux-gnueabi/linux-omap2-2.6.25-r4/git$ grep CACHE .config
CONFIG_CPU_CACHE_V7=y
CONFIG_CPU_CACHE_VIPT=y
# CONFIG_CPU_ICACHE_DISABLE is not set
# CONFIG_CPU_DCACHE_DISABLE is not set
# CONFIG_OUTER_CACHE is not set
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
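
The two listings above differ in one line that matters here: CONFIG_CPU_L2CACHE_DISABLE=y is present for 2.6.22-beagle and absent for 2.6.25. A minimal sketch of checking for it; `l2_state` is a hypothetical helper name, and treating a missing option as "not disabled" is an assumption rather than something the kernel guarantees.

```shell
# Report whether a kernel config file disables the L2 cache.
# 'l2_state' is a hypothetical helper name. A missing option is
# reported as "not disabled", which is an assumption.
l2_state() {
    if grep -q '^CONFIG_CPU_L2CACHE_DISABLE=y' "$1"; then
        echo "L2 disabled"
    else
        echo "L2 not disabled"
    fi
}
```

For example, `l2_state ./arch/arm/configs/omap3_beagle_defconfig` would report "L2 disabled" for the 2.6.22-beagle tree shown above.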

2. For running the MPU at 500 MHz, I can give out the u-boot and x-loader changes, but
I am waiting for everyone to get their boards modified, otherwise it might
block others. For now, I have attached the MLO and u-boot.bin for testing.
Just try this out, boot the kernel and read out the MPU clock by doing
cat /proc/omap_clocks | grep -i "MPU"

With 2.6.22-beagle

root@beagleboard:~# cat /proc/omap_clocks | grep mpu ; uname -a
mpu_ck 0 381000000 0
Linux beagleboard 2.6.22.1-omap1 #2 Wed Mar 26 16:39:33 IST 2008 armv7l unknown unknown GNU/Linux
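
To read the clock as MHz, the mpu_ck line can be converted directly. This sketch assumes the third column of /proc/omap_clocks is the rate in Hz, which fits the 381000000 shown above but is not confirmed anywhere in this thread; `mpu_mhz` is a hypothetical helper name.

```shell
# Convert an mpu_ck line from /proc/omap_clocks to MHz.
# Assumes column 3 is the clock rate in Hz (an assumption).
# 'mpu_mhz' is a hypothetical helper name.
mpu_mhz() {
    printf '%s\n' "$1" | awk '$1 == "mpu_ck" { printf "%d\n", $3 / 1000000 }'
}

mpu_mhz 'mpu_ck 0 381000000 0'
```

On the board itself the argument would come from `grep mpu /proc/omap_clocks`.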

With 2.6.25-omap1:
root@beagleboard:~# cat /proc/cpuinfo ; uname -a
Processor : ARMv7 Processor rev 2 (v7l)
BogoMIPS : 378.14
Features : swp half thumb fastmult vfp edsp
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x1
CPU part : 0xc08
CPU revision : 2
Cache type : write-through
Cache clean : not required
Cache lockdown : not supported
Cache format : Unified
Cache size : 768
Cache assoc : 1
Cache line length : 8
Cache sets : 64

Hardware : OMAP3 Beagle Board
Revision : 34301000
Serial : 0000000000000000
Linux beagleboard 2.6.25-omap1 #3 PREEMPT Mon Apr 21 08:55:10 CEST 2008 armv7l unknown unknown GNU/Linux

So both still run at 381MHz, but 2.6.25 should have L2 enabled.

regards,

Koen

Did you try with my latest u-boot.bin and MLO files?

2.6.25 doesn't have the omap_clocks entry in proc, so try 2.6.22 with my latest u-boot.bin and MLO. You should get the MPU at 500 MHz, and then you can run your demos on 2.6.22.

We can then add the other peripherals to 2.6.25.

Regards,
Khasim

Did you try with my latest u-boot.bin and MLO files?

Yes:

root@beagleboard:/media/mmcblk0p1# md5sum mlo
6a9f907d630de81f0b8ee8398cf94cf6 mlo
root@beagleboard:/media/mmcblk0p1# md5sum u-boot.bin
2408dd1757856d52e71c110aa653c178 u-boot.bin

2.6.25 doesn't have the omap_clocks entry in proc, so try 2.6.22 with my
latest u-boot.bin and MLO. You should get the MPU at 500 MHz, and then you can
run your demos on 2.6.22.

root@beagleboard:~# cat /proc/omap_clocks | grep mpu ; uname -a
mpu_ck 0 381000000 0
Linux beagleboard 2.6.22.1-omap1 #2 Wed Mar 26 16:39:33 IST 2008 armv7l unknown unknown GNU/Linux

Still 381MHz :(, could you md5sum your working MLO and see if it
matches?
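
The comparison asked for here can be scripted with md5sum's check mode rather than eyeballing the hashes. A sketch assuming GNU md5sum; `check_md5` is a hypothetical helper name.

```shell
# Succeed only if FILE has the expected MD5 checksum.
# Usage: check_md5 EXPECTED_HASH FILE
# 'check_md5' is a hypothetical helper name; relies on 'md5sum -c'
# reading "hash  filename" lines from stdin.
check_md5() {
    printf '%s  %s\n' "$1" "$2" | md5sum -c - >/dev/null 2>&1
}
```

For example, `check_md5 6a9f907d630de81f0b8ee8398cf94cf6 mlo && echo match` run in /media/mmcblk0p1 would confirm the MLO listed above.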

root@beagleboard:~# uname -a
Linux beagleboard 2.6.26-rc1-omap1 #1 Wed May 7 10:25:34 CEST 2008 armv7l unknown unknown GNU/Linux
root@beagleboard:/media/mmcblk0p1# mplayer -nosound -vo null -quiet -benchmark -loop 12 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK
BENCHMARKs: VC: 59.906s VO: 0.067s A: 0.000s Sys: 1.255s = 61.228s
BENCHMARKs: VC: 56.150s VO: 0.133s A: 0.000s Sys: 1.043s = 57.327s

That's 3.5 times faster with L2 cache enabled! That's a nice
improvement :D
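
The speedup can be recomputed from the BENCHMARKs lines themselves. A minimal sketch; `bench_total` is a hypothetical helper name, and the "before" line is the original 381 MHz run from the start of the thread. That run used -lavdopts idct=16 while the new one did not, so the two are not strictly like for like.

```shell
# Pull the total (the value after "=") out of a BENCHMARKs line.
# 'bench_total' is a hypothetical helper name.
bench_total() {
    printf '%s\n' "$1" | awk '{ sub(/s$/, "", $NF); print $NF }'
}

before=$(bench_total 'BENCHMARKs: VC: 193.856s VO: 0.153s A: 0.000s Sys: 2.718s = 196.727s')
after=$(bench_total 'BENCHMARKs: VC: 56.150s VO: 0.133s A: 0.000s Sys: 1.043s = 57.327s')

# Speedup factor, one decimal place.
awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1fx\n", b / a }'
```

Using the slower of the two new runs (61.228s) as the "after" value gives a somewhat lower factor.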

regards,

Koen

Have you tried compiling with armv7a or neon instructions yet?

Have you tried compiling with armv7a or neon instructions yet?

This mplayer was compiled with

-march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp -fexpensive-optimizations -ftree-vectorize -fomit-frame-pointer -O4 -ffast-math

I haven't seen any patches that add NEON optimizations to
mplayer yet. This mplayer does have Siarhei's armv6 stuff.

regards,

Koen