PRU writing to DDR RAM takes a long time

Hi,

I would like to transfer data at high rates from the PRU to my userspace C program, so I need a buffer that is larger than the 12 kB of shared RAM. I tried using cmicali's example, which he mentioned here: https://groups.google.com/d/msg/beagleboard/iqo3csrF93E/tIzWhN0CB9QJ

My problem:
After the PRU writes something to DDR, it takes too long until I can read this value in my C program. The delay varies from nearly 0 up to 5 s.

I set up a small example based on cmicali's code to show the problem. I basically write a value from my C program to the PRU data RAM and measure how long it takes until the PRU has written this value to DDR. The PRU does this in an endless loop, so it should take only a few cycles.
This is the output of my example:

`

set: 0; data: 0; ddr: 0; time: 0ms
set: 1; data: 1; ddr: 1; time: 2167ms
set: 2; data: 2; ddr: 2; time: 0ms
set: 3; data: 3; ddr: 3; time: 0ms
set: 4; data: 4; ddr: 4; time: 439ms
set: 5; data: 5; ddr: 5; time: 425ms
set: 6; data: 6; ddr: 6; time: 546ms
set: 7; data: 7; ddr: 7; time: 0ms

`
As you can see, the time varies a lot.

The important parts of my example are these:

`

while (info.pru_params->run_flag) {
    timer = 0;
    // hand the new value to the PRU via its data RAM
    info.pru_params->counterIn = counter;
    // wait until the PRU has copied the value to DDR, counting ~1 ms steps
    while (ddr[0] != counter) {
        usleep(1000);
        timer++;
    }
    printf("set: %2u; data:%2u; ddr:%2u; time: %4ums\n", counter, info.pru_params->counterOut, ddr[0], timer);
    fflush(stdout);
    counter++;
    usleep(1000);
}

`
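
A side note on the measurement itself: counting `usleep(1000)` calls only approximates elapsed time, and the polled location should be read through a `volatile` pointer so the compiler cannot keep the value in a register. Below is a minimal sketch of a polling helper, assuming a POSIX system; `wait_for_value` is a hypothetical name that is not part of the original example, and it does nothing about the caching delay itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: poll *addr until it equals `expected` or until
 * `timeout_ms` has elapsed. Returns the elapsed time in milliseconds,
 * or -1 on timeout. */
static long wait_for_value(const volatile uint32_t *addr, uint32_t expected,
                           long timeout_ms)
{
    struct timespec start, now;
    const struct timespec pause = { 0, 1000000L };   /* 1 ms between polls */

    assert(addr != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        int64_t ns = (int64_t)(now.tv_sec - start.tv_sec) * 1000000000LL
                   + (now.tv_nsec - start.tv_nsec);
        long elapsed_ms = (long)(ns / 1000000LL);

        if (*addr == expected)        /* volatile read: re-read on every pass */
            return elapsed_ms;
        if (elapsed_ms >= timeout_ms)
            return -1;
        nanosleep(&pause, NULL);
    }
}
```

In the loop above this could replace the inner `while`, for example `long ms = wait_for_value(ddr, counter, 10000);`, provided `ddr` is declared as a pointer to `volatile uint32_t`.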
PRU:

`

MAIN_LOOP:

LBBO r0, ADDR_PRURAM, OFFSET(AppParams.CounterIn), 4 //load value

SBBO r0, ADDR_DDR, 0, 4 //write value to DDR
SBBO r0, ADDR_PRURAM, OFFSET(AppParams.CounterOut), 4 //write value to dataram

// Check to see if the host is still running
LBBO r0, ADDR_PRURAM, OFFSET(AppParams.RunFlag), 4
// If not, jump to exit
QBEQ EXIT, r0, 0

// Do the loop again
JMP MAIN_LOOP

`
The full programs can be found here: https://github.com/nils-z/am335x_pru_package/tree/master/pru_sw/example_apps/ddr_access_timing

When I measure how long a write to the data RAM takes, it is always very short, as I expect. The problem occurs only when using DDR.

Does anyone have an explanation for this behaviour? Perhaps it is a problem with some kind of cache; how could I get around it? I’d like to implement a ring buffer, but it would have to be very large if it has to hold several seconds of data.
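
For reference, the kind of ring buffer meant here could look roughly like the sketch below: a single-producer, single-consumer queue with free-running indices. Everything in it is an assumption for illustration (the `struct ring` layout, `RB_SLOTS`, and the `rb_push`/`rb_pop` names); on real hardware the producer side would live in the PRU firmware, and a real implementation may also need memory barriers or an uncached mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SLOTS 1024u   /* hypothetical capacity; must be a power of two */

/* Hypothetical layout, placed at the start of the shared DDR region. */
struct ring {
    volatile uint32_t head;             /* written only by the producer (PRU) */
    volatile uint32_t tail;             /* written only by the consumer (ARM) */
    volatile uint32_t slot[RB_SLOTS];
};

/* Producer side. Written in C only so the logic can be tried on the host;
 * the PRU would do the equivalent with SBBO/LBBO. Returns 0 if full. */
static int rb_push(struct ring *rb, uint32_t value)
{
    uint32_t head = rb->head;
    if (head - rb->tail == RB_SLOTS)
        return 0;                               /* buffer full */
    rb->slot[head & (RB_SLOTS - 1)] = value;
    rb->head = head + 1;                        /* publish after storing data */
    return 1;
}

/* Consumer side (userspace program). Returns 0 if nothing is available. */
static int rb_pop(struct ring *rb, uint32_t *out)
{
    uint32_t tail = rb->tail;
    assert(out != NULL);
    if (tail == rb->head)
        return 0;                               /* buffer empty */
    *out = rb->slot[tail & (RB_SLOTS - 1)];
    rb->tail = tail + 1;                        /* free the slot */
    return 1;
}
```

Because each index is written by only one side, neither side needs a lock; the required capacity is then the data rate multiplied by the longest delay the consumer can have before it sees new data.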

Thanks, Nils

You are encountering caching effects on the ARM side. The proper way to
deal with this is to set the memory flags appropriately in the MMU,
typically with kernel-level code that's setting up to do DMA. The PRU
in this use case basically looks like a "smart" peripheral that can
modify the contents of memory directly. That is the same as any other
DMA enabled I/O device, and requires the same care with handling of the
memory regions.

Okay, my problem seems to be this: http://en.wikipedia.org/wiki/Direct_memory_access#Cache_coherency

Is there a way to do this without programming kernel-level code? I’ve never done this before and don’t really know what to do exactly.
I found the option to drop the cache by writing a “1” to /proc/sys/vm/drop_caches. I used this in my program with these few lines of code:

`
// needs <fcntl.h> and <unistd.h>; open() returns an int descriptor, not a FILE *
int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
if (fd >= 0) {
    write(fd, "1", 1);
    close(fd);
}

`

It works most of the time, but sometimes I have to write the “1” a few more times until the cache is really flushed. Perhaps this is only a problem with another cache when writing to this file.

I have some more questions now:

This is a workaround, but I don’t think it is a good solution. Is there a nicer way to drop the cache in C?
Could I run into performance problems elsewhere if I drop the cache too often (later on, about once per second)?
Is it possible to drop the cache only for my memory section, or to disable caching for this part of memory?
Is there a function like memcpy that reads directly from memory without going through the cache?

Thank you for the help!

Nils

I think I found a better solution now.

Previously /dev/mem was opened with the first call below; I changed it to the second:

`
// before
info->mem_fd = open("/dev/mem", O_RDWR);

// after
info->mem_fd = open("/dev/mem", O_RDWR | O_SYNC);

`

Now it works with the speed I expected and I don’t have to drop the cache.
I’m not really sure whether I should also use the O_DIRECT flag. When I do (with #define _GNU_SOURCE, see http://www.titov.net/2006/01/02/using-o_largefile-or-o_direct-on-linux/), I get a segfault when reading from the mapped buffer, and I don’t know why. With only O_SYNC it works as it should.
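
For anyone who wants to try this, here is a minimal sketch of the open-and-map step as a standalone helper. The name `map_uncached` and its parameters are my own, not from the example code. As far as I understand it, the /dev/mem driver uses a non-cacheable mapping when the file is opened with O_SYNC, which would explain why the delay disappears, but treat that as an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes starting at `offset` of `path`
 * (normally "/dev/mem", with `offset` being the page-aligned physical
 * address of the DDR buffer). Returns NULL on failure. */
static volatile uint32_t *map_uncached(const char *path, off_t offset,
                                       size_t len)
{
    assert(path != NULL && len > 0);

    /* O_SYNC is the flag that made the difference in this thread. */
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);                  /* the mapping stays valid after close() */

    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}
```

Usage would be something like `volatile uint32_t *ddr = map_uncached("/dev/mem", ddr_phys_addr, ddr_size);`, where `ddr_phys_addr` and `ddr_size` stand for whatever address and size your PRU driver reports, followed by `munmap((void *)ddr, ddr_size)` when done.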

What do you think about this solution?

This solution works fantastically for me! I couldn’t figure out why my write speeds were so slow.

Thanks for posting this!