memory bandwidth problem

Hi, group,

I am using the following little program to check BB memory bandwidth.
The numbers I get are about 31 MB/s for the C version and 83 MB/s for
the SIMD version. This seems too slow; I had expected something like
10x faster. Any suggestions on where to look: x-loader, u-boot, the
kernel, or just my compiler flags?

Thanks,
Guo

/*
On the host, cross-compile for the OMAP target as:
arm-none-linux-gnueabi-gcc -O2 -o membench membench.c
*/

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>   /* memalign() */
#include <time.h>

int main(int argc, char** argv)
{
  int* pbuf1, *pbuf2;
  const int bufSize = 8*1024*1024;
  int i,j;
  clock_t t1, t2;
  double tdiff;
  const int ITER =100;
  double rate;
  typedef int v4si __attribute__ ((vector_size(16)));
  v4si *p1, *p2;

  pbuf1 = (int*)memalign(16, bufSize*sizeof(int));
  pbuf2 = (int*)memalign(16, bufSize*sizeof(int));
  for (i=0; i<bufSize; i++)
  {
    pbuf2[i] = i;
  }
  t1 = clock();
  for(j=0; j<ITER; j++)
  {
    for (i=0; i<bufSize; i++)
    {
      pbuf1[i] = pbuf2[i];
    }
  }
  t2 = clock();
  tdiff = (double)(t2) - (double)t1;
  rate = ITER*bufSize*sizeof(int)/(tdiff/CLOCKS_PER_SEC);
  rate /= (1024.0*1024.0);
  printf("rate(MB/S) = %.3f, clocks_per_sec %d\n", rate, CLOCKS_PER_SEC);
/*
  for(i=900; i<910; i++)
  {
    printf("%d\n", pbuf1[i]);
  }
*/

  //SIMD version of the memory benchmark
  t1 = clock();
  for(j=0; j<ITER; j++)
  {
    p1 = (v4si*)(pbuf1);
    p2 = (v4si*)(pbuf2);
    for(i=0; i<bufSize/4; i++)
    {
      *p1 = *p2;
      p1++;
      p2++;
    }
  }
  t2 = clock();

  tdiff = (double)(t2) - (double)t1;
  rate = ITER*bufSize*sizeof(int)/(tdiff/CLOCKS_PER_SEC);
  rate /= (1024.0*1024.0);
  printf("SIMD rate(MB/S) = %.3f, clocks_per_sec %d\n", rate, CLOCKS_PER_SEC);

  free(pbuf1);
  free(pbuf2);

  return 0;
}

Guo,

> I am using the following little program to check BB memory bandwidth.
> The numbers I get are about 31 MB/s for the C version and 83 MB/s for
> the SIMD version. This seems too slow; I had expected something like
> 10x faster. Any suggestions on where to look: x-loader, u-boot, the
> kernel, or just my compiler flags?

[snip]

>   for(j=0; j<ITER; j++)
>   {
>     p1 = (v4si*)(pbuf1);
>     p2 = (v4si*)(pbuf2);
>     for(i=0; i<bufSize/4; i++)
>     {
>       *p1 = *p2;
>       p1++;
>       p2++;
>     }
>   }

Out of interest, what results do you see if your inner loop(s) above
do only reads, or only writes? (It strikes me that reading in bursts
and writing in bursts will be less harsh on write buffers, etc., than
read-one-write-one-read-one-write-one...)
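
Something like this (untested) is what I have in mind, timing each
loop separately with clock() as you already do; the sum and the
volatile sink are only there so the compiler cannot throw the read
loop away:

    /* read-only pass */
    {
      volatile int sink;
      int sum = 0;
      for (i = 0; i < bufSize; i++)
        sum += pbuf2[i];
      sink = sum;
    }

    /* write-only pass */
    for (i = 0; i < bufSize; i++)
      pbuf1[i] = 0;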

Cheers,

Matt

If I do:

arm-angstrom-linux-gnueabi-gcc -O2 -march=armv7-a -mtune=cortex-a8 \
  -mfpu=neon -mfloat-abi=softfp -o membench membench.c

I get:

root@beagleboard:~# ./membench
rate(MB/S) = 32.035, clocks_per_sec 1000000
SIMD rate(MB/S) = 172.973, clocks_per_sec 1000000

koen@bitbake:/OE/angstrom-dev/work/armv7a-angstrom-linux-gnueabi$ arm-angstrom-linux-gnueabi-gcc -v
Using built-in specs.
Target: arm-angstrom-linux-gnueabi
<snip>
gcc version 4.3.1 (GCC)

root@beagleboard:~# uname -a
Linux beagleboard 2.6.26-omap1 #7 Mon Aug 11 16:38:06 CEST 2008 armv7l
unknown unknown GNU/Linux

regards,

Koen

The numbers are better with your flags.
rate(MB/S) = 31.156, clocks_per_sec 1000000
SIMD rate(MB/S) = 154.739, clocks_per_sec 1000000

[root@beagleboard c]# uname -a
Linux beagleboard.org 2.6.22.18-omap3 #1 Thu Jul 24 15:29:36 IST 2008
armv7l unknown

host: arm-none-linux-gnueabi-gcc -v
Using built-in specs.
Target: arm-none-linux-gnueabi
Configured with: /scratch/paul/lite/linux/src/gcc-4.2/configure
--build=i686-pc-linux-gnu --host=i686-pc-linux-gnu
--target=arm-none-linux-gnueabi --enable-threads --disable-libmudflap
--disable-libssp --disable-libgomp --disable-libstdcxx-pch --with-gnu-as
--with-gnu-ld --enable-languages=c,c++ --enable-shared
--enable-symvers=gnu --enable-__cxa_atexit --with-pkgversion=CodeSourcery
Sourcery G++ Lite 2007q3-51
--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls
--prefix=/opt/codesourcery
--with-sysroot=/opt/codesourcery/arm-none-linux-gnueabi/libc
--with-build-sysroot=/scratch/paul/lite/linux/install/arm-none-linux-gnueabi/libc
--enable-poison-system-directories
--with-build-time-tools=/scratch/paul/lite/linux/install/arm-none-linux-gnueabi/bin
--with-build-time-tools=/scratch/paul/lite/linux/install/arm-none-linux-gnueabi/bin
Thread model: posix
gcc version 4.2.1 (CodeSourcery Sourcery G++ Lite 2007q3-51)

These numbers still seem too low to be right.

regards,
Guo

Matt,

If I replace *p1 with a local variable, the SIMD version goes up from
155 MB/s to 168 MB/s. With a similar change, the C version goes up
from about 31 MB/s to 190 MB/s.

Disassembling the code shows that the C version's local variable ends
up in a register, while the SIMD version's local variable is still a
stack slot (probably in L1 cache). So I guess the slow speed is mostly
due to the DDR->L2->L1->CPU read path. I don't know ARM assembly, so I
cannot force the SIMD version to load into a register.

I am new to the ARM architecture. Does ARM have explicit cache control
instructions, something that would let me overlap cache loads with
other computation?
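
For example, would GCC's __builtin_prefetch, which I understand is
turned into the ARM PLD hint, let the loads run ahead of the copy?
Something like this (untested, just a sketch):

    for (i = 0; i < bufSize; i++)
    {
      if ((i & 15) == 0)                    /* once per 64-byte cache line */
        __builtin_prefetch(&pbuf2[i + 64]); /* hint: fetch a few lines ahead */
      pbuf1[i] = pbuf2[i];
    }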

regards,
Guo

Guo Tang <tangguo77@gmail.com> writes:

[snip]

> The numbers are better with your flags.
> rate(MB/S) = 31.156, clocks_per_sec 1000000
> SIMD rate(MB/S) = 154.739, clocks_per_sec 1000000
>
> These numbers still seem too low to be right.

Are you running anything else on your Beagle? Using a cleaned up
version of your test (attached), I get quite different numbers. I
added timings for plain memcpy() and a hand-written assembler
function, and the result looks like this:

memcpy 192566305 B/s
INT32 163817378 B/s
C SIMD 163537932 B/s
ASM SIMD 280814532 B/s

[attachments: memspeed.c (1.65 KB), neoncpy.S (729 Bytes)]

Måns Rullgård <mans@mansr.com> writes:

[snip]

> Are you running anything else on your Beagle? Using a cleaned up
> version of your test (attached), I get quite different numbers. I
> added timings for plain memcpy() and a hand-written assembler
> function, and the result looks like this:
>
> memcpy    192566305 B/s
> INT32     163817378 B/s
> C SIMD    163537932 B/s
> ASM SIMD  280814532 B/s

The very different figures for the naive C loop prompted me to dig a
little deeper, and I found something strange. It appears that
addresses 0x2001000 (32M+4k) apart use the same cache line or
something similar, severely degrading the throughput of the copy.
Your test just happens to allocate the buffers with this magic
interval.
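
One quick way to test this is to stagger one of the allocations by a
page and see whether the numbers change, e.g. in your test (untested):

  char *raw2 = memalign(16, bufSize*sizeof(int) + 4096);
  pbuf2 = (int*)(raw2 + 4096);  /* shift the second buffer by 4 kB */
  ...
  free(raw2);                   /* free the original pointer, not pbuf2 */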

Hi, Måns,

Nice find. Could you elaborate on the cache line problem? I haven't
quite understood it.

Is this what is happening: the two buffers are 32M+4K apart, so in
the copy the source and destination map to the same cache line? But
then the copy would go from L1 cache to L1 cache, and it should be
faster rather than slower, right?

Is there any way to avoid this problem in a real application?

Thanks,
Guo

Guo Tang <tangguo77@gmail.com> writes:

[snip]

> Nice find. Could you elaborate on the cache line problem? I haven't
> quite understood it.
>
> Is this what is happening: the two buffers are 32M+4K apart, so in
> the copy the source and destination map to the same cache line? But
> then the copy would go from L1 cache to L1 cache, and it should be
> faster rather than slower, right?

If the cache is write-allocate, and the source and destination
addresses, for whatever reason, must use the same cache-line, only one
of them can be in cache at any time. Copying a word at a time under
such conditions will result in constant cache misses.

I'm a bit surprised that this is happening, since the Cortex-A8 L1
cache is 4-way set associative, and the L2 cache is 8-way set
associative.

> Is there any way to avoid this problem in a real application?

Profile carefully, looking for unexpected cache misses.

Måns Rullgård <mans@mansr.com> writes:

[snip]

> The very different figures for the naive C loop prompted me to dig a
> little deeper, and I found something strange. It appears that
> addresses 0x2001000 (32M+4k) apart use the same cache line or
> something similar, severely degrading the throughput of the copy.
> Your test just happens to allocate the buffers with this magic
> interval.

I did some tweaks to the code, and disabled the framebuffer. The
result in numbers, using 8MB buffers:

copy memcpy 225595776 B/s
copy ASM ARM 301156146 B/s
copy ASM NEON 343882833 B/s
copy ASM A+N 352340617 B/s
write memset 530244447 B/s
write ASM ARM 530860509 B/s
write ASM NEON 531750947 B/s
write ASM A+N 590044870 B/s

This is running on a rev C prototype with ES3.0 silicon, in case it
matters. The kernel is l-o head with some patches.

Here's the improved ARM+NEON memcpy:

memcpy_armneon:
        push    {r4-r11}
        mov     r3,  r0                 @ keep dst in r3 so r0 is untouched
1:      subs    r2,  r2,  #128          @ copy 128 bytes per iteration
        pld     [r1, #64]               @ prefetch the source well ahead
        pld     [r1, #256]
        pld     [r1, #320]
        ldm     r1!, {r4-r11}           @ 32 bytes through the ARM pipeline
        vld1.64 {d0-d3},   [r1,:128]!   @ 3 x 32 bytes through NEON
        vld1.64 {d4-d7},   [r1,:128]!
        vld1.64 {d16-d19}, [r1,:128]!
        stm     r3!, {r4-r11}
        vst1.64 {d0-d3},   [r3,:128]!
        vst1.64 {d4-d7},   [r3,:128]!
        vst1.64 {d16-d19}, [r3,:128]!
        bgt     1b
        pop     {r4-r11}
        bx      lr

The super-fast ARM+NEON memset looks like this:

memset_armneon:
        push    {r4-r11}
        mov     r3,  r0                 @ keep dst in r3 so r0 is untouched
        vdup.8  q0,  r1                 @ replicate the fill byte into q0/q1
        vmov    q1,  q0
        orr     r4,  r1,  r1,  lsl #8   @ replicate the fill byte into a word
        orr     r4,  r4,  r4,  lsl #16
        mov     r5,  r4                 @ ... and across r4-r11 for stm
        mov     r6,  r4
        mov     r7,  r4
        mov     r8,  r4
        mov     r9,  r4
        mov     r10, r4
        mov     r11, r4
        add     r12, r3,  r2,  lsr #2   @ ARM fills the first quarter, NEON the rest
1:      subs    r2,  r2,  #128          @ 128 bytes per iteration in total
        pld     [r3, #64]
        stm     r3!, {r4-r11}           @ 32 bytes through the ARM pipeline
        vst1.64 {d0-d3}, [r12,:128]!    @ 3 x 32 bytes through NEON
        vst1.64 {d0-d3}, [r12,:128]!
        vst1.64 {d0-d3}, [r12,:128]!
        bgt     1b
        pop     {r4-r11}
        bx      lr
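
For completeness, the corresponding C prototypes (arguments passed in
r0/r1/r2 per the AAPCS) would be something like:

  void memcpy_armneon(void *dst, const void *src, size_t len);
  void memset_armneon(void *dst, int c, size_t len);

Both assume len is a non-zero multiple of 128 bytes and that the
buffers are 16-byte aligned; the :128 qualifiers on vld1/vst1 will
fault otherwise. Only the low byte of c should be set.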

Måns Rullgård <mans@mansr.com> writes:

[snip]

> I did some tweaks to the code, and disabled the framebuffer. The
> result in numbers, using 8MB buffers:
>
> copy memcpy     225595776 B/s
> copy ASM ARM    301156146 B/s
> copy ASM NEON   343882833 B/s
> copy ASM A+N    352340617 B/s
> write memset    530244447 B/s
> write ASM ARM   530860509 B/s
> write ASM NEON  531750947 B/s
> write ASM A+N   590044870 B/s

The Cortex A8 CPU has an L2 cache preload engine (PLE) [1], which can
be used to preload large blocks of data into the L2 cache. Using
this, I was able to push the copy throughput even higher:

copy PLE+NEON 415596200 B/s

I also coded up some pure read tests:

read ASM ARM 637178403 B/s
read ASM NEON 719075707 B/s
read PLE+NEON 741113693 B/s

The preload engine seems like it can be useful. As with everything,
however, it takes some fine-tuning of parameters to maximise
performance.

[1] Documentation – Arm Developer

Hi,

Can someone tell me whether the memory allocated for the OMAP frame
buffer on the kernel side is configured as cacheable?
I am interested in this for performance reasons.

Thanks,

Laurent

"DOUAT Laurent" <ldouat22@free.fr> writes:

> Can someone tell me whether the memory allocated for the OMAP frame
> buffer on the kernel side is configured as cacheable?
> I am interested in this for performance reasons.

It should be mapped as non-cached, write-combining.

OK, thanks Måns.

Another question: on ARMv7, when the data cache is disabled, does that
mean, as on other CPUs, that memory accesses are made word by word
instead of in "burst" mode (one data cache line at a time)?

In that case, all frame buffer allocations, including offscreen
surfaces, are uncached, which may hurt graphics performance for any
kind of pixmap operation.

Two suggestions:
- Only the final frame buffer surface could be left uncached, not the
offscreen buffers.
- Or the whole frame buffer partition could be cached, with a data
cache flush at each VBL.

In that case your NEON-accelerated memcpy/memset functions could be
put to good use in omap_fb to speed up blit copies.

I am not criticizing the current omap_fb implementation; I understand
that this model keeps cache coherency simple. I work for a company
making a graphics engine for embedded devices, and the frame buffer
implementations we have seen (ST7109, Sigma Designs, TI DaVinci) all
look the same.

My strong feeling is that using cached memory, with a more complex
frame buffer module, could speed up the graphics path and save some
CPU bandwidth.

Laurent

"DOUAT Laurent" <ldouat22@free.fr> writes:

> Another question: on ARMv7, when the data cache is disabled, does
> that mean, as on other CPUs, that memory accesses are made word by
> word instead of in "burst" mode (one data cache line at a time)?

For single-word reads, I suppose that would be the case. For writes,
you can still have a write-combining buffer. I don't know how
multi-word reads are handled in this case.

> In that case, all frame buffer allocations, including offscreen
> surfaces, are uncached, which may hurt graphics performance for any
> kind of pixmap operation.

> Two suggestions:
> - Only the final frame buffer surface could be left uncached, not
>   the offscreen buffers.
> - Or the whole frame buffer partition could be cached, with a data
>   cache flush at each VBL.

Flushing the cache also takes time. Which is quicker, writing to
uncached memory or writing to cache and then flushing, depends on the
precise access patterns in each case. If the access is mostly writes,
as is typically the case with a framebuffer, a write-allocate cache
would waste time reading from memory to fill the cache lines as they
are allocated. Without write-allocate, there will be no difference
compared to uncached.

The only way to know for sure is to benchmark specific cases.

> In that case your NEON-accelerated memcpy/memset functions could be
> put to good use in omap_fb to speed up blit copies.

Any kind of copy within the framebuffer is probably best done with the
DMA engine.

> I am not criticizing the current omap_fb implementation; I
> understand that this model keeps cache coherency simple. I work for
> a company making a graphics engine for embedded devices, and the
> frame buffer implementations we have seen (ST7109, Sigma Designs,
> TI DaVinci) all look the same.

I've been doing embedded graphics for a few years using various chips,
and I've had opportunities to experiment with various approaches.

> My strong feeling is that using cached memory, with a more complex
> frame buffer module, could speed up the graphics path and save some
> CPU bandwidth.

Working with uncached memory certainly requires a little extra
attention, or there will be consequences. Also keep in mind that
allowing the framebuffer to be cached means there will be less room
for other data in the cache. The increased thrashing can more than
cancel any gain from having the framebuffer cached. Again, benchmarks
are the only way to know for sure.

Dear Mans,

I'm working on a TI OMAP processor with a Cortex-A8. I'm trying to
achieve the maximum ARM bandwidth when copying data from one location
to another. I'm able to set up the MMU tables and enable the L1, L2
and instruction caches, but I'm struggling to enable the PLE. Could
you pass on the code you used to configure the preload engine?

You can mail it to my following IDs
yum2000@gmail.com
ajayk@ti.com

Your response will be highly appreciated

Thanks
Ajay

yum2000@gmail.com wrote:

> I'm struggling to enable the PLE. Could you pass on the code you
> used to configure the preload engine?

Here's a patch to enable userspace access to the PLE:
http://git.mansr.com/?p=linux-omap;a=commitdiff;h=3e1afa3

Here's some code that uses it:
http://thrashbarg.mansr.com/~mru/mem.S

Thanks Mans