PRU to DMA DDR caching issues

Hello,

I am running into a caching issue with my application. I will describe what I want to do, how I planned to do it, and what goes wrong:

WHAT:

Using the PRU on my BBB, I want to timestamp a periodic rising edge on one input pin on a nanosecond scale and signal it to a Linux kernel module on the ARM.

HOW:

To receive an interrupt in my kernel module, I bridged the pin carrying the rising edge to a second pin (timer4 interrupt).
This interrupt fires a few microseconds after the event happens.

To read values from the PRU with the best determinism and lowest latency, I allocated some DDR memory with dma_alloc_coherent() in my kernel module and hand the address out to the PRU via debugfs.
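
For reference, the allocation looks roughly like this (the names and the buffer size here are placeholders, not my exact module):

```c
#include <linux/dma-mapping.h>
#include <linux/debugfs.h>

#define SHARED_BUF_SIZE 4096            /* placeholder size */

static void *shared_virt;               /* kernel virtual address used by the module */
static dma_addr_t shared_phys;          /* bus address that gets handed to the PRU */
static u64 shared_phys_dbg;             /* copy exported through debugfs */

/* dev should be the device doing the transfers, e.g. the PRUSS platform device */
static int setup_shared_buffer(struct device *dev)
{
	shared_virt = dma_alloc_coherent(dev, SHARED_BUF_SIZE,
					 &shared_phys, GFP_KERNEL);
	if (!shared_virt)
		return -ENOMEM;

	/* Publish the physical address so the PRU firmware can be pointed at it */
	shared_phys_dbg = shared_phys;
	debugfs_create_x64("pru_shared_phys", 0444, NULL, &shared_phys_dbg);
	return 0;
}
```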

The PRU runs in an endless loop:

wait for the rising edge, read out the PRU cycle counter, and write the cycle counter to the DDR memory address.
This works like a charm, and I get the event's cycle counter snapshot in my kernel module!
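
The PRU loop is roughly the following, sketched here in PRU C; the control-register addresses, the input bit and the DDR address are placeholders that would have to match the real setup (check the AM335x TRM):

```c
#include <stdint.h>

volatile register uint32_t __R31;       /* GPI pin states are visible in R31 */

/* Assumed PRU0-local addresses (please check the TRM):
 * CONTROL at 0x22000, CYCLE counter at 0x2200C, CTR_EN is bit 3 of CONTROL. */
#define PRU0_CONTROL (*(volatile uint32_t *)0x22000)
#define PRU0_CYCLE   (*(volatile uint32_t *)0x2200C)

#define EDGE_BIT     (1u << 15)         /* placeholder: the R31 bit my input is wired to */
#define DDR_SHARED   ((volatile uint32_t *)0x90000000) /* placeholder: address from debugfs */

void main(void)
{
	PRU0_CONTROL |= (1u << 3);      /* enable the cycle counter */

	while (1) {
		while (!(__R31 & EDGE_BIT))   /* wait for the rising edge */
			;
		DDR_SHARED[0] = PRU0_CYCLE;   /* snapshot for the kernel module */
		while (__R31 & EDGE_BIT)      /* wait for the line to drop again */
			;
	}
}
```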

The kernel module interrupt is firing a few microseconds after the event and has some jitter I want to avoid.

So I decided it would be best to have the PRU burst the current cycle counter to a second RAM address ten thousand times over, so that when the kernel module reads this DDR location, it knows the difference between the event's cycle counter and the cycle counter now.

This does not work!

WRONG:

Initially everything appeared to be working. I was reading out, for example:

event cycles: 1000
now cycles: 4300

great!

But to test the “now cycles” counter, I changed the kernel module to read it a thousand times in a loop. Guess what? It is the same value a thousand times.

I tried a few options, for example having the kernel and the PRU write and read two alternating memory addresses for this “now counter”, but nothing gives the results I expected.
It seems like something is caching the results…

Any help would be appreciated. Many thanks… I hope the problem is understandable.
Tom

Unless you carefully write kernel code to treat your DDR memory buffer
as DMA memory, you are almost certainly encountering caching effects.
The ARM core reads the memory location once, and will not do so again
as long as the data remains in the cache. The more often you read the
DDR memory location, the more likely the data is to stay in the cache.

I recommend instead of using a buffer in DDR memory, use the PRU data
memories. They are accessible by both the ARM and PRU cores, and have
the proper memory flags set up so the ARM core will not cache reads.
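
For example, on the AM335x the PRU-ICSS lives at 0x4A300000 and PRU0's data RAM sits at offset 0, so a kernel module can map it uncached roughly like this (addresses from memory, verify against the TRM):

```c
#include <linux/io.h>
#include <linux/errno.h>

/* Assumed AM335x addresses: PRU-ICSS at 0x4A300000, PRU0 data RAM at offset 0 */
#define PRU0_DRAM_PHYS  0x4a300000
#define PRU0_DRAM_SIZE  0x2000

static void __iomem *pru_dram;

static int map_pru_dram(void)
{
	/* ioremap() gives a device (uncached) mapping on ARM */
	pru_dram = ioremap(PRU0_DRAM_PHYS, PRU0_DRAM_SIZE);
	return pru_dram ? 0 : -ENOMEM;
}

/* Offset 0 is just a convention the PRU firmware and this module agree on */
static u32 read_event_cycles(void)
{
	return ioread32(pru_dram);
}
```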

Tom,

I had a hard time understanding exactly what your problem is, so I really don't have an answer for you. I'll just try to address the "what you are trying to do" part, because it seems you may be making things harder on yourself.

Firstly, I take it that what you mean by "a nanosecond scale" is actually "on the order of nanoseconds". A PRU instruction takes 5 ns to execute, and I'm counting at least 4 or 5 instructions in that loop.

A solution that completely avoids the issues you get from using another pin (polled by the Linux kernel, which will give you some non-deterministic timing) is to do everything in the PRU. First, use the dedicated RAM in the PRUs: the bridge to the rest of the ARM core (OCP) is not in the real-time domain of the PRUs and has to deal with clashes and with resources being occupied by the kernel. Second, use the PRU-to-ARM interrupt to signal that there was an edge. You'll get lower latency and lower jitter, although "lower" is as good as it gets here. Also, the PRUs have two RAM banks, so you can do a ping pong buffer too.
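
Just to illustrate the ping pong idea, a minimal layout could look like this (the struct, the offsets and the bank addresses are purely illustrative):

```c
#include <stdint.h>

/* One record per captured edge, laid out identically on the PRU and ARM sides */
struct edge_record {
	uint32_t event_cycles;   /* cycle counter at the rising edge */
	uint32_t sequence;       /* written last, tells the ARM the record is complete */
};

/* Assumed PRU0-local addresses: own data RAM at 0x0000, the other bank at 0x2000 */
#define BANK0 ((volatile struct edge_record *)0x0000)
#define BANK1 ((volatile struct edge_record *)0x2000)

static void publish(volatile struct edge_record *bank, uint32_t cycles, uint32_t seq)
{
	bank->event_cycles = cycles;
	bank->sequence     = seq;   /* sequence last, so the ARM never sees a half-written record */
}
```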

If your problem requires you to do some action on the rising edges (and any delay or jitter is unacceptable in your application), I would suggest you do everything in the PRUs (if it is not too complicated; after all, there are some code size limitations).

Sorry I couldn't be of more help. If you give me more details on your application, I can try again... :wink:

Cheers!

Hello, and thank you for the fast help. Here are my answers to your comments:

Unless you carefully write kernel code to treat your DDR memory buffer
as DMA memory, you are almost certainly encountering caching effects.

I thought I had done this by getting the memory space from dma_alloc_coherent(). I will research whether more is needed to disable the caching, but my understanding was that a DMA-flagged space will never be cached, because the ARM core cannot know if something has changed.

I recommend instead of using a buffer in DDR memory, use the PRU data memories.

According to my tests, writing from the PRU to the DDR memory only takes 3 cycles on the PRU, and never more over a few million tries (L3 fast interconnect). However, I do not know how long it takes for this memory to become visible to the ARM…

I cannot do all the work in the PRU code because I need to tag the rising edge with the Linux kernel time. Therefore I need to find the most deterministic way to get the counter value into the kernel.

I will do further tests. Maybe there is someone here who has been down the same road. I think tagging an event with the PRU (5 ns) and relating it to Linux kernel time without losing too many nanoseconds should be one of the great PRU benefits.
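
What I have in mind for the interrupt handler is basically this (a sketch, assuming the PRU runs at 200 MHz so one cycle is 5 ns, and that both counter values come out of the shared buffer):

```c
#include <linux/ktime.h>
#include <linux/types.h>

/* event_cycles: snapshot taken at the edge, now_cycles: snapshot the PRU keeps refreshing */
static ktime_t edge_to_kernel_time(u32 event_cycles, u32 now_cycles)
{
	u64 age_ns = (u64)(now_cycles - event_cycles) * 5;  /* 200 MHz PRU -> 5 ns per cycle */

	/* the edge happened age_ns before "now" in kernel time */
	return ktime_sub_ns(ktime_get(), age_ns);
}
```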

Thanks all

I tracked down the issue a bit more:

If I insert something between two reads of the DDR memory in my kernel module (I inserted a pr_info("test")), the value is refreshed.

Maybe there is a possibility to invalidate the cache? I will investigate more.
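
In the meantime I will also rule out the compiler keeping the value in a register; that is only my guess from the pr_info() side effect, but the test loop with READ_ONCE() would look like this:

```c
#include <linux/compiler.h>
#include <linux/printk.h>
#include <linux/types.h>

/* now_ptr points into the buffer returned by dma_alloc_coherent() */
static void dump_now_counter(u32 *now_ptr)
{
	int i;

	for (i = 0; i < 1000; i++) {
		/* READ_ONCE() forces a real load from memory on every iteration */
		pr_info("now cycles[%d]: %u\n", i, READ_ONCE(*now_ptr));
	}
}
```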

Thanks

Hello, and thank you for the fast help. Here are my answers to your comments:

Unless you carefully write kernel code to treat your DDR memory buffer
as DMA memory, you are almost certainly encountering caching effects.

I thought I had done this by getting the memory space from dma_alloc_coherent().
I will research whether more is needed to disable the caching, but my
understanding was that a DMA-flagged space will never be cached, because the
ARM core cannot know if something has changed.

There's more to successfully using memory for DMA than just allocating
it (and DMA is basically what's happening here: the PRU is an
independent mechanism that modifies memory outside the context of the
ARM core). This is very non-trivial to implement correctly, and
drastic overkill for what you need to do unless you're moving very
large amounts of data (more than will fit in the PRU data memories).
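
To give an idea of what "treating it as DMA memory" means with the streaming API, a read would look roughly like this (a sketch only, and as said, probably overkill for a few words of data):

```c
#include <linux/dma-mapping.h>

/* dev is the device the mapping was created for, handle came from dma_map_single() */
static u32 read_counter_streaming(struct device *dev, void *buf, dma_addr_t handle)
{
	u32 val;

	/* Hand ownership to the CPU: performs the cache maintenance the hardware needs */
	dma_sync_single_for_cpu(dev, handle, sizeof(u32), DMA_FROM_DEVICE);
	val = *(u32 *)buf;
	/* Hand ownership back to the "device" (the PRU) before it writes again */
	dma_sync_single_for_device(dev, handle, sizeof(u32), DMA_FROM_DEVICE);

	return val;
}
```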

I recommend instead of using a buffer in DDR memory, use the PRU data
memories.

According to my tests, writing from the PRU to the DDR memory only takes 3
cycles on the PRU, and never more over a few million tries (L3 fast
interconnect). However, I do not know how long it takes for this memory to
become visible to the ARM...

Several hundred ns at the very least. It looks like the transaction
only takes three cycles on the PRU because the write is posted. To
get a better idea of the actual transaction time, try doing a read on
the PRU side!

I cannot do all the work in the PRU code because I need to tag the rising edge
with the Linux kernel time. Therefore I need to find the most
deterministic way to get the counter value into the kernel.

I will do further tests. Maybe there is someone here who has been down the
same road. I think tagging an event with the PRU (5 ns) and relating it to
Linux kernel time without losing too many nanoseconds should be one of the
great PRU benefits.

Have the PRU sample the pin and record the data you want into the
shared data memory. Then have the PRU send an interrupt to the ARM
core indicating the data is available.
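
On the PRU side that is roughly the following; the system event number and the R31 encoding follow the common PRU0-to-ARM interrupt examples and need to match your interrupt controller setup:

```c
#include <stdint.h>

volatile register uint32_t __R31;

/* PRU0's own data RAM starts at local address 0; offset 0 is just a convention */
#define SHARED ((volatile uint32_t *)0x0000)

/* Conventional PRU0-to-ARM system event number used in the common examples */
#define PRU0_ARM_INTERRUPT 19

static void publish_and_interrupt(uint32_t event_cycles)
{
	SHARED[0] = event_cycles;                       /* data first... */
	__R31 = (1u << 5) | (PRU0_ARM_INTERRUPT - 16);  /* ...then strobe the system event */
}
```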

An alternate method that should produce similar quality results
without needing the PRU is to use the capture timers. You can
configure the hardware timers to capture on the rising and/or falling
edge of a signal, and have the timer interrupt the ARM core. This
provides cycle-level accuracy for timing, and you should be able to
implement everything using the existing Linux kernel drivers for the
timer hardware.

This training material from Free Electrons explains the cache effects and how to deal with them, starting at slide 440:

http://free-electrons.com/doc/training/linux-kernel/linux-kernel-slides.pdf

Regards,
John