Fastest way to copy large amounts of data from PRU to ARM

I’m working on a project where I am using the PRU to record a bunch of data with its fast IO (at about 50 to 60 MHz). It is in some ways similar to the BeagleLogic project, but with some additional control code on the PRU side. The problem was that I couldn’t compile that project from source code, so I built my own thing from the ground up and got it working. The only issue I have now is the amount of data that I am recording.

For my measurement I need to collect data on the order of about 2 MB. I know that the PRU also has access to some fast memory, but only 28 KB (8 KB for PRU0 + 8 KB for PRU1 + 12 KB shared RAM), which is not even close. The obvious solution was to treat this memory as a buffer and push data out at regular intervals with RPMsg, which I was going to use to get the final data anyway.

This, however, turned out to be impractical, as it takes one of the cores approximately 800 cycles (it varies a little) to move 496 bytes to a mailbox. Smaller chunks are less efficient because there is a roughly 400-cycle overhead, and bigger ones are not supported. Both PRUs are needed for the recording process: PRU1 does the control work and gathers the IO data into its registers, then moves it through the broadside interface to PRU0, which does a bit of formatting and saves it to the RAM. There is only a handful of free cycles during recording, and about 150 every 3 KB of data (other hardware is busy and both PRUs are free to do anything while waiting).
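
For anyone curious, the handoff looks roughly like this in clpru C (a simplified sketch, not my actual code; the device ID for the direct PRU-to-PRU transfer and the choice of base register are assumptions that should be checked against the TRM and the PRU compiler manual):

```c
/* Simplified sketch of the PRU1 -> PRU0 broadside handoff (TI clpru).
 * Device ID 14 (direct transfer to the other PRU's register file) and the
 * base register r5 are assumptions to verify against the AM335x TRM. */
#include <stdint.h>

typedef struct {
    uint32_t word[16];                 /* 64 bytes, mapped onto r5..r20 */
} xfer_block_t;

#define XFR_TO_OTHER_PRU 14            /* assumed broadside device ID */

/* PRU1: push one block of gathered IO data across to PRU0. */
void pru1_send(xfer_block_t blk)
{
    __xout(XFR_TO_OTHER_PRU, 5, 0, blk);
}

/* PRU0: pull the block, then format it and store it to PRU RAM. */
void pru0_receive(void)
{
    xfer_block_t blk;
    __xin(XFR_TO_OTHER_PRU, 5, 0, blk);
    /* ... formatting and store to data RAM happens here ... */
    (void)blk;
}
```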

Under these circumstances I can only see three solutions: either I somehow enlarge the PRU RAM (I am not sure if it is just a memory region or actually special hardware that can’t be changed), make the PRU copy the data directly into a large section of RAM (hoping that the speed doesn’t suffer too much), or have the PRUs do their thing and task the ARM with grabbing the data from PRU memory, via some sort of memory mapping.

On that note, I actually found a neat library (libpruio) that does exactly that (I didn’t test how fast it actually reads, though). There are a couple of things that I don’t particularly like:

  • It takes forever to process an interrupt (between 3,000 and 10,000 cycles, presumably because it’s userspace), which means that I’d have to grab data blindly, hoping that it has already been written
  • I can only get it to run on the older 4.19 kernel (mostly due to my inability to get the UIO drivers working on 5.4, or some other mystery reason)
  • It’s third-party software that will only be maintained for as long as the maker is willing to. Sure, it’s GPL, but the thought of maintaining a decade-old project and learning CMake and FreeBASIC is almost as frightening as a devicetree
  • Getting it to work requires some ancient software incantations. I am afraid that one day I won’t find them anywhere on the web.

I would prefer some TI-endorsed tech: some way to get the ARM to directly access the PRU memory, or to get the PRU to write data into a bigger memory chunk. Or maybe there is some other way that I am oblivious to?

In any case, thank you for reading my long ramblings. I’d be grateful for any tips, directions, or reading material!

Hi,

My understanding is that RPMsg was designed for control transfers, not for high-bandwidth data. For 2 MB/s you probably need to use shared memory instead.

When the PRU writes data it typically uses posted writes, hence writing to a shared memory region in DRAM can appear rather fast from the PRU’s point of view. See https://www.ti.com/lit/pdf/sprace8 for more information.

You could use /dev/mem to access any physical memory address range from your userspace application. Make sure that your kernel config has enabled that feature.
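
Something along these lines (a minimal sketch; the physical address and length are placeholders for whatever region you actually reserve):

```c
/* Minimal sketch of mapping a physical memory region through /dev/mem.
 * SHARED_PA / SHARED_LEN are placeholders for whatever region you reserve
 * for the PRU. Needs CONFIG_DEVMEM (and CONFIG_STRICT_DEVMEM may still
 * block access to system RAM), plus root privileges. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_PA   0x9e000000UL    /* placeholder physical address */
#define SHARED_LEN  0x00200000UL    /* 2 MB */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, SHARED_LEN, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, SHARED_PA);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    volatile uint8_t *buf = map;
    printf("first byte: 0x%02x\n", buf[0]);

    munmap(map, SHARED_LEN);
    close(fd);
    return 0;
}
```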

You cannot enlarge pru ram. It’s special SRAM, not a region in the system DRAM.

If you require long-term support then I would suggest using kernel 5.10 or later. The PRU remoteproc driver received one last incompatible update before being merged into mainline. Now that it is in Linus’ tree, I don’t think there will be more breaking changes.

There has been a long-standing GSoC task to port the am335x-pru package to remoteproc. Keep an eye on it - I think it could match your requirements, should it be completed sometime in the future.

Lastly, a shameless plug: I had a similar challenge to yours and ended up writing a kernel driver. I still used RPMsg, but only for control transfers. The actual data is written by the PRU into a shared memory region in system DRAM.

I hope my ramblings have been useful to you :)

Regards,
Dimitar


You already mentioned BeagleLogic, which is actually a great example since it dumps up to 200,000,000 bytes/second from PRU.

Just allocate a ringbuffer in DDR3 memory, since PRU can write to there just as fast as it can write to its local ram (in all but the most extreme circumstances). Also allocate a shared variable in DDR3 where PRU writes a copy of its write-pointer for the ARM core to read (using the same memory for both avoids write-ordering hazards). Having PRU read DDR3 memory is very slow however (and has variable timing), so communicating the ARM’s read-pointer to PRU (to enable PRU to perform overrun detection or flow control) is optimally done via PRUSS memory rather than DDR3 memory.
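
To make that concrete, here’s a rough PRU-side sketch of such a ringbuffer writer (all addresses and sizes are placeholders, and the byte-wise copy is only for clarity; a real writer would use wide stores):

```c
/* Rough PRU-C sketch of the ring buffer scheme described above. The DDR
 * address and size are placeholders: in a real setup the region comes from
 * some reservation mechanism and its address has to be handed to the PRU. */
#include <stdint.h>

#define DDR_BUF_PA     0x9e000000UL            /* placeholder DDR ring buffer */
#define DDR_BUF_SIZE   0x00200000UL            /* 2 MB, power of two */

#define DDR_BUF        ((volatile uint8_t  *)DDR_BUF_PA)
/* Write pointer lives in DDR right after the data, so both travel the same
 * path and the ARM cannot observe the pointer advance before the data. */
#define DDR_WPTR       ((volatile uint32_t *)(DDR_BUF_PA + DDR_BUF_SIZE))
/* Read pointer lives in PRUSS shared RAM, which the PRU can poll cheaply and
 * with predictable timing (0x00010000 is the PRU-local address). */
#define SHARED_RPTR    ((volatile uint32_t *)0x00010000)

void push_block(const uint8_t *src, uint32_t len)
{
    static uint32_t wptr;                      /* local write offset, starts at 0 */

    /* Flow control / overrun avoidance: wait until the ARM has freed room.
     * (One byte of slack distinguishes "full" from "empty".) */
    while ((((wptr - *SHARED_RPTR) & (DDR_BUF_SIZE - 1)) + len) >= DDR_BUF_SIZE)
        ;

    for (uint32_t i = 0; i < len; i++)
        DDR_BUF[(wptr + i) & (DDR_BUF_SIZE - 1)] = src[i];

    wptr = (wptr + len) & (DDR_BUF_SIZE - 1);
    *DDR_WPTR = wptr;                          /* publish progress to the ARM */
}
```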

My py-uio project actually has an example of using such a ringbuffer to dump data from PRU to ARM (specifically a stream of messages, but it could work equally well for a stream of bytes):

As the name implies this uses uio rather than remoteproc, but that doesn’t affect the basic concept; it only affects details like how to allocate the memory in the first place. Also, the reader on the ARM side is obviously not very efficient (in no small part due to being written in Python), but it’s meant to demonstrate the concept.

DRam and SRam are RAM areas on the PRU-ICSS module, directly connected to the 32-bit PRU interconnect bus. There is no way to extend them.

The PRU can read/write DDR3 RAM over the L3 bus, but with unpredictable latency; i.e. while the ARM reads a cell, the PRU is blocked from writing. There is no way to fulfill your hard real-time requirements that way (especially when future kernels put additional load on the L3).

The rproc driver got a further incompatible change. It’s “the last one”, just like all the previous ones were supposed to be the last. That driver will be finished when it works like the uio_pruss driver.

The statements about libpruio are nonsense (i.e. it doesn’t need an interrupt to sync chunk transfers, and it isn’t limited to FreeBASIC, since examples in C or Python also work). Anyhow, it’s your point of view.

Solution:

Let the PRUs write data to DRam and transfer chunks to DDR3 RAM by DMA, triggered either by the ARM or - if possible - by one of the PRUs. Neither rproc nor libpruio has anything off-the-shelf for this; it’s up to you to develop the DMA transfer. AFAIR BeagleLogic works that way, so you may find inspiration in its source.
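
For what it’s worth, here is a very rough, untested sketch of what such a manually triggered EDMA3 copy could look like. The register offsets, the channel number, and the assumption that the channel is already mapped to its own PaRAM set are taken from memory of the AM335x TRM and must be verified; also note the source address has to be the global L3 address of the PRU DRam, not the PRU-local one.

```c
/* Untested sketch of one manually triggered EDMA3 transfer on the AM335x.
 * Offsets/addresses are assumptions to check against the TRM, as is the
 * idea that channel CH already maps to PaRAM set CH (DCHMAP). */
#include <stdint.h>

#define EDMA3CC_BASE   0x49000000UL    /* EDMA3 channel controller (assumed) */
#define EDMA3CC_ESR    0x1010UL        /* event set register: manual trigger */
#define EDMA3CC_PARAM  0x4000UL        /* PaRAM sets, 32 bytes each */
#define CH             20u             /* placeholder: a free DMA channel */

typedef struct {                       /* layout of one PaRAM set */
    uint32_t opt;
    uint32_t src;
    uint32_t a_b_cnt;                  /* ACNT in [15:0], BCNT in [31:16] */
    uint32_t dst;
    uint32_t src_dst_bidx;
    uint32_t link_bcntrld;             /* LINK in [15:0] */
    uint32_t src_dst_cidx;
    uint32_t ccnt;
} edma_param_t;

/* Copy one chunk (< 64 KB) from PRU DRam (global address!) to DDR3. */
void edma_copy_chunk(uint32_t src_pa, uint32_t dst_pa, uint16_t bytes)
{
    volatile edma_param_t *p =
        (volatile edma_param_t *)(EDMA3CC_BASE + EDMA3CC_PARAM + 32 * CH);

    p->opt          = 0;               /* A-synchronized, no completion IRQ */
    p->src          = src_pa;
    p->dst          = dst_pa;
    p->a_b_cnt      = (1u << 16) | bytes;  /* BCNT = 1, ACNT = bytes */
    p->src_dst_bidx = 0;
    p->link_bcntrld = 0xFFFF;          /* null link: stop after this set */
    p->src_dst_cidx = 0;
    p->ccnt         = 1;

    /* One trigger moves ACNT bytes in A-sync mode. */
    *(volatile uint32_t *)(EDMA3CC_BASE + EDMA3CC_ESR) = 1u << CH;
}
```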

BeagleLogic writes directly to DDR3 from the PRU, which, as I said, is generally just as fast as writing to PRUSS local memory: 1+n cycles per n-word write. There are two ways you could see worse performance:

  1. It is possible to saturate the FIFO of the PRUSS → L3 bridge, since this bridge needs 2+n cycles to transfer an n-word write to the L3 interconnect. If you perform a sufficient number of back-to-back writes (without any other instructions in between), they will eventually take 2+n cycles per write instead of 1+n. Inserting any other instruction between two writes to the L3 allows the bridge to catch up and avoids the extra stall cycle (see the sketch after this list). The number of back-to-back writes you can perform without getting these additional stalls depends on the number of words per write: you need 25 back-to-back single-word writes to get the first stall, but only two back-to-back 13-word (or larger) writes.

  2. In theory the DDR3 memory controller could get sufficiently congested that writes could back up all the way to PRUSS and cause further writes to stall. This will depend on interfering traffic from other system components (ARM, DMA, display controller), and is a bit hard to estimate since the DDR3 interface has a ton of bandwidth (up to 1.6 GB/s (1.49 GiB/s) shared between read and write) but some access patterns can leave the bus idle for many cycles due to access latency and bank switching. If I remember correctly, both the memory controller and the L3 interconnect have some tuning knobs that could be used to prioritize PRU traffic. I wouldn’t worry about this unless you run into actual problems.
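
To illustrate point 1, here is a sketch of a copy loop where the loop bookkeeping between wide stores gives the bridge time to drain. The DDR address is a placeholder, and whether the compiler really emits one wide store per iteration needs to be checked in the generated assembly (or the inner copy written in hand assembly).

```c
/* Sketch of point 1: copy to DDR in wide bursts with at least one other
 * instruction (here the loop bookkeeping) between consecutive writes, so
 * the PRUSS -> L3 bridge FIFO gets a chance to drain. */
#include <stdint.h>

#define DDR_DST_PA  0x9e000000UL               /* placeholder DDR destination */

typedef struct { uint32_t w[8]; } burst_t;     /* 32 bytes per store */

void drain_to_ddr(const burst_t *src, uint32_t nbursts)
{
    volatile burst_t *dst = (volatile burst_t *)DDR_DST_PA;

    for (uint32_t i = 0; i < nbursts; i++)
        dst[i] = src[i];                       /* one burst per iteration; the
                                                  loop overhead between bursts
                                                  keeps the bridge from stalling */
}
```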

I would not recommend using DMA to transfer data from PRUSS local memory to DDR3 memory. It is unnecessarily complicated, uses unnecessary resources, and the maximum achievable bandwidth is much lower than with the direct approach.