Slow reading speed from DDR allocated for PRU (uio_pruss driver)

Hello,
I’ve encountered a problem with very slow reading speed from memory allocated by the PRU kernel driver uio_pruss, compared to reading from ordinary address space. Here are some performance tests on my BeagleBone Black:

Average memcpy from pru DDR start address to application virtual address (300 kB of data): 10.4781ms
Average cv::Mat.copyTo (300 kB of data): 11.0681ms
Average memcpy from one virtual address to another (300 kB of data): 0.510001ms
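
For reference, each average was taken with a simple std::chrono loop along these lines (a sketch of the measurement, not the exact harness I ran):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>

//Time a single copy of n bytes and return the duration in milliseconds.
static double timeCopyMs(void *dst, const void *src, size_t n)
{
    auto t0 = std::chrono::steady_clock::now();
    memcpy(dst, src, n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t n = 300 * 1024; //300 kB, as in the numbers above
    char *src = new char[n];
    char *dst = new char[n];
    double total = 0.0;
    for (int i = 0; i < 100; ++i)
        total += timeCopyMs(dst, src, n);
    printf("average: %g ms\n", total / 100);
    delete[] src;
    delete[] dst;
    return 0;
}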

Kernel version is 4.4.12-bone11

Can somebody explain the issue? Maybe I should have used the new PRU rpmsg/remoteproc driver?

I do not think anyone would be able to answer this question properly
without at least doing a code review. Also, you're not really
giving enough information as to what exactly it is that you're doing. So
to me your numbers and times make sense, but your qualifiers do not mean
anything to me.

You also need to be aware that memcpy() is notoriously slow . . .

One thing that did just pop into my head while thinking about your situation: perhaps you’re experiencing something of a flash media speed bottleneck. So, if you could set up and use a ramdisk to run your test application from, the results of those tests could be enlightening.

The code is simple. I’m using the PRU Linux Application Loader API.

#include <prussdrv.h>
#include <sys/types.h>
#include <cstring>

int main(void)
{
    //pru initialization...

    u_int32_t ddrSizeInBytes = prussdrv_extmem_size(); //size of the mapped region
    u_int32_t *sharedDdr;
    prussdrv_map_extmem((void **)&sharedDdr); //map the PRU external DDR into this process

    u_int32_t *destination = new u_int32_t[76800]; //640*480*1 = 307200 bytes
    memcpy(destination, sharedDdr, sizeof(u_int32_t) * 76800); //PRU DDR -> heap, takes approx 10-12 ms

    u_int32_t *localSource = new u_int32_t[76800];
    memcpy(localSource, destination, sizeof(u_int32_t) * 76800); //heap -> heap, takes approx 0.5-1 ms

    delete[] destination;
    delete[] localSource;

    //close pru...
    return 0;
}

Like William said, we can't really answer your question without more
detail, but I'll take a guess. The DRAM that's shared with the PRU is
marked as non-cacheable memory since the PRU can modify it. That means
for a typical memory copy loop *EACH* word read from DRAM is going to
turn into a full round-trip CPU to DRAM to CPU read latency rather
than the first read triggering a cache-line fill.

You probably want to use a memory copy that uses a bunch more
registers and does burst reads from the PRU memory region (as big as
you can for performance, but at least a cache line long). There are
several useful routines from the ARM folks themselves:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

...along with the benefits and drawbacks of each.
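
As a rough illustration of the idea (my own sketch, not one of the app note's routines): pull several quad words into registers before storing them, so the bus sees wide back-to-back reads instead of single-word round trips. The BBB's Cortex-A8 has NEON, but whether wide NEON loads are permitted, and whether they actually burst, depends on the memory type the uio_pruss mapping ends up with, so benchmark before trusting it:

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

//Copy len bytes (len a multiple of 64, both pointers 16-byte aligned)
//using four 16-byte NEON loads per iteration, so each load pulls 16
//bytes from the uncached region instead of 4.
static void burstCopy64(void *dst, const void *src, size_t len)
{
    uint8_t *d = static_cast<uint8_t *>(dst);
    const uint8_t *s = static_cast<const uint8_t *>(src);
    for (size_t i = 0; i < len; i += 64) {
        uint8x16_t q0 = vld1q_u8(s + i);
        uint8x16_t q1 = vld1q_u8(s + i + 16);
        uint8x16_t q2 = vld1q_u8(s + i + 32);
        uint8x16_t q3 = vld1q_u8(s + i + 48);
        vst1q_u8(d + i, q0);
        vst1q_u8(d + i + 16, q1);
        vst1q_u8(d + i + 32, q2);
        vst1q_u8(d + i + 48, q3);
    }
}

(Compile with -mfpu=neon; the LDM-based variants in the app note get a similar effect with plain ARM registers.)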

I’d have to see the actual code myself before I could even think about what the problem might be. What I mentioned earlier about running the executable from memory is really a stab in the dark, and most likely not the case. My reasoning was that if the executable is running from flash media, part of it could potentially still be loading from flash while the program is running. But probably not.

Another thing: where are you copying to? To a file on disk? Or are you just dumping to printf()? Because printf() will also slow you down considerably. The thing with printf(), though, is that you can just pipe the output of the executable to a file, and the bottleneck will mostly go away. So that can be tested for easily.

Anyway, if you’re storing the data on disk, you’re probably going to want to avoid that by creating a tmpfs file, then letting another process deal with the data outside of the main application loop. I know this may sound a bit odd, but it can actually increase performance. You’ll pay a small process context-switching penalty, but it’ll barely be perceptible. Meanwhile, the data-collecting process will be allowed to just plow right through the data.
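
As a minimal sketch of that handoff (the file name is just a placeholder; /dev/shm is tmpfs-mounted on the stock Debian images):

#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

//Dump a captured buffer into a file on tmpfs so a separate consumer
//process can pick it up without the main loop ever touching flash.
static bool dumpToTmpfs(const void *buf, size_t len)
{
    int fd = open("/dev/shm/pru_frame.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return false; }
    ssize_t written = write(fd, buf, len);
    close(fd);
    return written == (ssize_t)len;
}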

So your problem is you. I’m not saying you’re doing anything wrong, just that if you need more performance, you need to do things differently than you’re doing now. First off, the uio_pruss driver makes, or can make, that memory region visible to userspace. I forget the exact sysfs file location, but it’s there.

Secondly, and perhaps this is preferred: you could mmap() that memory region, read the data out, and dump it directly into a tmpfs file. The benefit is that you get mem-to-mem copy speeds, and once the data hits the tmpfs, any application that cares to access it can. The second benefit is that if you mmap() that memory region and then mmap() the tmpfs file, it becomes a direct object-to-object copy ( a = b; ) type of situation. Or it can be.
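
Something along those lines, assuming the usual uio_pruss layout where the external DDR block is exported as map 1 of /dev/uio0 (check /sys/class/uio/uio0/maps/ on your board; the 256 kB size below is just the driver's default pool size, read the real value from map1/size):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main()
{
    const size_t ddrSize = 0x40000; //placeholder, read map1/size instead

    int uioFd = open("/dev/uio0", O_RDWR | O_SYNC);
    if (uioFd < 0) { perror("open /dev/uio0"); return 1; }

    //uio devices select which region to map via the offset argument:
    //map N lives at offset N * page size.
    void *pruDdr = mmap(NULL, ddrSize, PROT_READ | PROT_WRITE,
                        MAP_SHARED, uioFd, 1 * sysconf(_SC_PAGESIZE));
    if (pruDdr == MAP_FAILED) { perror("mmap uio"); return 1; }

    //Back the destination with a tmpfs file so any other process can
    //mmap() the same data.
    int shmFd = open("/dev/shm/pru_frame.bin", O_RDWR | O_CREAT, 0644);
    if (shmFd < 0) { perror("open tmpfs file"); return 1; }
    if (ftruncate(shmFd, ddrSize) < 0) { perror("ftruncate"); return 1; }
    void *shmBuf = mmap(NULL, ddrSize, PROT_READ | PROT_WRITE,
                        MAP_SHARED, shmFd, 0);
    if (shmBuf == MAP_FAILED) { perror("mmap tmpfs"); return 1; }

    //Plain mem-to-mem move. It still pays the uncached-read latency
    //discussed above, so pair it with a burst-friendly copy if needed.
    memcpy(shmBuf, pruDdr, ddrSize);

    munmap(shmBuf, ddrSize);
    munmap(pruDdr, ddrSize);
    close(shmFd);
    close(uioFd);
    return 0;
}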