Why use PRU shared DRAM (12k) over individual core DRAM (8k)?

I’m working on using the PRU for critical signal timing, paired with userland code that loads data into the PRU local memory.

I’ve done a lot of research on latency with respect to local and system resources, L3/L4 fabric delays, etc. In each case, I haven’t seen any difference in the read/write time (load/store, in the PRU sense, measured in PRU cycles) between the PRU core-specific 8k DRAM and the shared 12k DRAM. I also see that PRU0 can access PRU1’s 8k DRAM, and vice versa, so in effect the 8k DRAM is also shared between the two PRU cores. So other than providing more memory than the 8k available to each PRU core, why would one use the 12k DRAM?

[Note: I haven’t seen any posts measuring latency of load/store from one PRU core to the other PRU core’s DRAM. Could that be the advantage of the 12k shared DRAM - same timing when being accessed by either PRU core? If both cores are accessing either the shared 12k or core-specific 8k DRAM concurrently, would that cause one to stall?]

The available TRM is a bit light on specifics, but I would expect the
8K DRAM for each core is single port memory and is highly likely to
incur wait states if both PRU cores are trying to read from the same
memory bank. I would expect the 12K shared memory to be at least
2-port (so both PRUs could access it simultaneously without incurring
wait states) and possibly triple ported (so the ARM core could access
the memory without potential wait states).

...and of course 12K is bigger than 8K! :slight_smile:

That makes a lot of sense. Unfortunately, many things that make sense to me ultimately turn out to not be true. Does the definitive answer here lie with Gerald? I wonder if he (still) monitors this group…

Regarding the PRUSS I have many questions. I’m really trying to nail down the timing/latency/determinism of PRU execution and IO. Your work measuring latency of IO and # PRU clocks has been very helpful (Thank you for your posts). Still, I wish for validation of what has been measured with architectural/design detail. Much of the data available (even from TI) seems to be conflicting.

Continued review of documentation has caused me to wonder if I’ve missed a fundamental error in my thinking about what is and isn’t deterministic when using the PRUs. The PRU-local 32-bit interconnect bus is itself a shared resource. If one PRU writes to its own DRAM, and the other PRU writes to its own DRAM, won’t that potentially cause one to stall waiting for the other to complete (particularly with a burst load/store)? That would make the dual/triple porting of the shared DRAM also less valuable. If the PRUs are being used to get data from the ARM core/main memory and then bit bang pins, that too is subject to competition for control of the 32-bit bus. Does this make sense or am I still missing something?

What is it that you're asking? There is no "shared DRAM". There is only
"the DRAM" that is used by the main ARM processor, which the PRUs can
access through the interconnect you speak of. Writing to and reading
from that DRAM, as I understand it, is not deterministic; e.g. the L4
interconnect can incur a latency penalty.

However, the 8k memory used by each PRU core, as well as the shared 12k
memory each PRU has access to, is supposed to be single-cycle read/write
access. In fact, each PRU core as I understand it has the ability to
"broadside" all of its 32-bit registers in a single cycle over to the 12k
shared memory.

Now, if you're looking for a way to move data out of the PRUs' memory,
maybe have a userspace mmap() + /dev/mem application read directly
from the 12k PRU shared RAM? You're going to incur a penalty one way or
another, so there's no need to bog down the PRUs trying to solve that
problem. What do you think, Charles? Is this reasonable?

You cannot "broadside" store the register file into the 8k or 12k data
rams, only into one of the three scratch pad locations or directly
into the other PRU's register file. Table 4-21 (of the AM335x TRM
version spruh73o) lists what happens when you encounter collisions or
stalls with the XIN/XOUT commands.
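
For what it's worth, from C the transfers look something like the sketch below when built with TI's clpru compiler. This is only a sketch: the device IDs (10/11/12 for the three scratchpad banks, 14 for the other PRU's register file) and the __xin()/__xout() argument order are my reading of the TRM and the PRU compiler manual, so verify against your document revisions.

```c
#include <stdint.h>

/* Payload mapped onto a contiguous run of registers starting at R5
 * (base_register = 5 below). Size and layout are just for illustration. */
typedef struct {
    uint32_t timestamp;
    uint32_t samples[4];
} xfer_t;

void main(void)
{
    xfer_t out = { 0 };

    /* XOUT: broadside-copy 'out' into scratchpad bank 0 (device id 10).
     * Banks 1 and 2 would be device ids 11 and 12; device id 14 targets
     * the other PRU's register file directly (per my reading of the TRM). */
    __xout(10, 5, 0, out);

    /* XIN: pull the same registers back from scratchpad bank 0. */
    __xin(10, 5, 0, out);

    __halt();
}
```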

Thanks Charles. Yeah, the whole time I was thinking to myself "but there's
three areas . . ." Anyway, you actually have hands-on experience; everything
I know is just from reading. Also good to know there is a newer TRM with PRU
information in it that I was unaware of.

Anyway, my idea was something like this. In the past I’ve written a couple of different applications that directly accessed peripheral registers through mmap() + /dev/mem. So I would imagine (again, no hands-on here) that accessing the PRU’s memory directly from user space Linux could potentially be just as simple. Whether or not the scratchpad areas could be accessed the same way from user space Linux, I’m not sure; I’m also not even sure that would be important. However, since a user space C application is used to load the PRU’s executable binary in the first place, it’s not much of a stretch to imagine one could, or should, be able to access all of the PRU’s periphery. I could be wrong, I suppose.
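
Something along these lines is what I have in mind. It is entirely a sketch (no hands-on): the 0x4A300000 PRU-ICSS base address and the 0x10000 offset of the 12k shared RAM within it are taken from my reading of the AM335x TRM memory map, and /dev/mem access needs root.

```c
/* Sketch: access the 12k PRU shared RAM from userspace Linux via /dev/mem.
 * Addresses are assumptions from the AM335x TRM; verify for your part. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PRUICSS_PHYS_BASE  0x4A300000UL   /* AM335x PRU-ICSS base (assumed)  */
#define PRUICSS_MAP_LEN    0x80000UL      /* map the whole PRU-ICSS region   */
#define SHARED_RAM_OFFSET  0x10000UL      /* 12k shared DRAM within PRU-ICSS */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *base = mmap(NULL, PRUICSS_MAP_LEN, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, PRUICSS_PHYS_BASE);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint32_t *shared =
        (volatile uint32_t *)((char *)base + SHARED_RAM_OFFSET);

    shared[0] = 0xDEADBEEF;                       /* hand a word to the PRU */
    printf("shared[1] = 0x%08x\n", shared[1]);    /* read a word back       */

    munmap(base, PRUICSS_MAP_LEN);
    close(fd);
    return 0;
}
```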

But the point is really this: if you need to get data out of the PRUs into userland Linux as quickly as possible, maybe the way to pull that data out of the PRU’s memory is from the ARM (Linux) side of things?

No, you want to have the PRU doing writes.

In modern systems, writes are fast (they can get posted so they
complete at the initiator side and can take their time working through
the various interconnect fabrics to make their way to their ultimate
destination). Reads typically stall the initiator until the data is
received.
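
A very rough sketch of the "let the PRU do the writes" idea in TI's PRU C looks something like the following. The CFG/SYSCFG offsets and the STANDBY_INIT bit are my reading of the TRM, and the DDR buffer address is just a placeholder the ARM side would have to reserve and hand to the PRU (e.g. through the shared RAM), so treat it as an assumption, not a recipe.

```c
#include <stdint.h>

/* PRU-ICSS CFG module as seen from the PRU's local bus (assumed 0x26000;
 * SYSCFG at offset 0x04). */
#define PRUSS_CFG_SYSCFG    (*(volatile uint32_t *)0x00026004)
#define SYSCFG_STANDBY_INIT (1u << 4)

/* Placeholder: physical DDR address of a buffer the ARM side has reserved
 * and communicated to the PRU (for example via the 12k shared RAM). */
#define DDR_BUF   ((volatile uint32_t *)0x9E000000)

void main(void)
{
    uint32_t i;

    /* Enable the OCP master port so the PRU can reach system memory. */
    PRUSS_CFG_SYSCFG &= ~SYSCFG_STANDBY_INIT;

    /* Writes are posted, so the PRU usually keeps going while the
     * interconnect fabric drains them toward DDR. */
    for (i = 0; i < 256; i++)
        DDR_BUF[i] = i;

    __halt();
}
```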

If you need to move data quickly from the PRU to the ARM, reference
the BeagleLogic code. That moves data pretty much as quickly as the
hardware physically allows (which requires a kernel module):

https://github.com/abhishek-kakkar/BeagleLogic

I have a project down the road that will require fast writes from PRU to ARM/system DRAM. But I’m not there yet.

For this project, my focus is on reading data (from SD card, eMMC, USB stick, network, etc.) into DDR, pushing it to the PRUs, and then bit-banging it out with precise timing (using EGP). I am trying to avoid external circuit support and thus need deterministic timing. That’s what got me very interested in the BBB. Perhaps others as well - what a great, low-cost, small-footprint combination of the scope/breadth/content/flexibility of Linux with these embedded real-time units.
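
For context, the kind of bit-banging I mean is something like this minimal TI PRU C sketch using the direct R30 output register. The pin bit and cycle counts are placeholders (the chosen bit has to be pinmuxed to the pr1_pru*_pru_r30_* mode), and the timing only stays deterministic as long as the loop touches nothing outside the PRU core.

```c
#include <stdint.h>

volatile register uint32_t __R30;   /* PRU direct GPO register */

#define PIN_BIT  (1u << 0)          /* placeholder: whichever R30 bit is pinmuxed out */

void main(void)
{
    while (1) {
        __R30 |=  PIN_BIT;          /* drive high */
        __delay_cycles(100);        /* 100 PRU cycles = 500 ns at 200 MHz */
        __R30 &= ~PIN_BIT;          /* drive low  */
        __delay_cycles(100);
    }
}
```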

Eventually it dawned on me that there will be some latency/non-deterministic timing unless I use the PRUs completely fenced-off from the system (ARM, DDR, etc). So I’m trying to identify when/where that non-determinism can occur (and conversely, where it cannot).

When I referenced “shared DRAM” I was sloppy, thinking it was clear in context. I mean the 12k shared DRAM that is part of the PRU-ICSS. I see that, or the (2) individual 8k DRAMs, as the “portal” to the ARM core (along with interrupts). I haven’t coded it yet, but I think I’m pretty clear on pushing the data from userland to the PRUs (mmap() & /dev/mem, as was offered above). I already have a use planned for the three scratchpad areas and the broadside interface for single-instruction transfers; they appear to not be subject to any conflict other than from the other PRU.

The point I’m trying to make is that from the TRM, it appears there is the possibility of some non-deterministic latency whenever using anything connected to the 32-bit PRU-ICSS bus. That is because the system (ARM) can access that bus through the OCP slave - and it will have to do that if it’s going to be pushing data to the 12k or 8k PRU-ICSS DRAM. I think I can manage that (using interrupts to trigger the ARM to write the data and not start any timing critical steps until I can determine that write is complete). But when thinking this through, the question it has raised is this:

If I have both PRUs executing, won’t they be (potentially) competing for access to the single 32-bit PRU-ICSS bus each time they access their “own” 8k DRAM or the “shared” 12k DRAM? Both PRUs can access all three of these memory locations, and the diagram seems to indicate there is only one path to them. And if this is true, then other than 12k being bigger than 8k, I don’t see any advantage (or difference at all, other than having the same address in memory for either of the PRUs) between using the 12k or 8k DRAM from either PRU.

That’s what I’m trying to verify, or be disabused of whatever mistake I’ve made.

To be specific, this is what I think will (can) happen:

- ARM writing to the 12k PRU shared DRAM can affect the timing of a PRU’s read/write to its own 8k DRAM, the other PRU’s 8k DRAM, as well as the PRU-ICSS 12k shared DRAM.
- PRU0 reading/writing to either 8k DRAM or the 12k DRAM can affect the timing of PRU1 reading/writing to either 8k DRAM or the 12k DRAM, even if the source/target of PRU0 is not the same as the source/target of PRU1.
- Any reads from system resources (through the OCP master) are subject to stalls (e.g. peripherals, GPIO, ARM DDR).
- Any writes to system resources (through the OCP master) are also subject to stalls (though less likely) if the interconnect fabric has been saturated. (I was hoping to get some rough idea of how much it takes to “saturate the interconnect fabric” - and whether only writes contribute, or reads as well.)

I will look at that BeagleLogic code and see if I can see how that was done. I’d still like to understand the underlying operation in more detail. Thanks.

tl;dr

Is it correct that whenever the PRU cores access any resource through the 32-bit system bus, the access is subject to varying delay, since the other PRU core and even the ARM core (through the OCP slave, for instance if the ARM is pushing data to the PRU 8k or 12k DRAM) may also be contending for control of that bus?

TI’s “VBUS” interconnects aren’t actual buses; they use crossbar switches, which means stalls should only happen if two initiators simultaneously try to access the same target. There are probably priority rules in that case, similar to when two cores try to access the same scratchpad. This is something you could test fairly easily, e.g. let both cores repeatedly access one of the memories and compare their stall counts. Using multi-word vs single-word accesses might also have an influence.
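
A rough sketch of that experiment for each core, in TI's PRU C, could look like the following. The control-register offsets, the COUNTER_ENABLE bit, and the 0x10000 local address of the shared RAM are my reading of the AM335x TRM, so double-check them; for the PRU1 build, change the control base to 0x24000 and use different result offsets so the two cores don't overwrite each other.

```c
#include <stdint.h>

/* PRU0 control registers as seen from the PRU local bus (PRU1: 0x24000). */
#define PRU_CTRL        ((volatile uint32_t *)0x00022000)
#define CTRL_CTR_EN     (1u << 3)     /* CONTROL.COUNTER_ENABLE       */
#define REG_CYCLE       3             /* CYCLE counter at offset 0x0C */
#define REG_STALL       4             /* STALL counter at offset 0x10 */

/* 12k shared DRAM, local view; both cores hammer the same region. */
#define SHARED_RAM      ((volatile uint32_t *)0x00010000)

void main(void)
{
    uint32_t i, sink = 0;

    PRU_CTRL[0] |= CTRL_CTR_EN;              /* start cycle/stall counters */

    for (i = 0; i < 10000; i++)
        sink += SHARED_RAM[i & 0xFF];        /* contended accesses         */

    /* Park the counts where the ARM (or the other core) can read them;
     * use different offsets for the PRU1 build. */
    SHARED_RAM[0x400] = PRU_CTRL[REG_CYCLE];
    SHARED_RAM[0x401] = PRU_CTRL[REG_STALL];
    SHARED_RAM[0x402] = sink;                /* keep the loop from being optimized away */

    __halt();
}
```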

There is also an undocumented priority config register for the PRU interconnect you can try playing with, at offset 0x24 in the PRUSS CFG module. Bits 0-7 are four 2-bit values presumably related to the four initiators (PRU core 0, PRU core 1, L3, and the unused PRUSS-to-PRUSS port), and bits 8-21 are 1-bit values presumably related to the targets. I don’t know what the values actually mean :slight_smile:
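
If you want to poke at it from Linux, something like this should dump it. The 0x4A300000 base and the 0x26000 CFG module offset are assumptions from the AM335x TRM memory map; 0x24 is the offset mentioned above.

```c
/* Sketch: read the undocumented PRUSS interconnect priority register
 * from userspace. Addresses are assumptions; verify against your TRM. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *base = mmap(NULL, 0x80000, PROT_READ, MAP_SHARED, fd, 0x4A300000UL);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    uint32_t prio = *(volatile uint32_t *)((char *)base + 0x26000 + 0x24);
    printf("PRUSS CFG +0x24 = 0x%08x\n", prio);

    munmap(base, 0x80000);
    close(fd);
    return 0;
}
```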

Matthijs