How to reliably push data from ARM host to PRU (shared) memory with predictable (low) latency?

I have an application running on the ARM host that writes to the PRU shared memory. The PRU core then manipulates that data and sends it out the EGPIO pins with exact timing.

For this to work, I need a steady supply of data bursts available to the PRU core - about 32 KiB per burst. This won’t fit into the PRU shared memory (12 KiB), the per-PRU data RAM (2x 8 KiB), or even spread out over all PRU memory (28 KiB total). So it’s not possible to load all the data into PRU memory up front and then kick off the PRU core to send it out the pins; additional data will necessarily have to be transferred from the ARM host to PRU memory after the PRU has started sending data out through the EGPIO pins.

I’ve instrumented the PRU PASM code using the CYCLE register, and see that there is variable latency when the PRU is waiting for a memory block to be received from the ARM host. This can be upwards of 5 ms, which won’t work for this application. I’ve tried using ionice to set the class to realtime and the priority to 0, but this had no appreciable effect.

Is there some way to reduce the latency of writing from ARM/Linux to the PRU memory? I’ve heard that some projects use DMA to transfer data from the PRU to host (ARM, system) DDR (e.g. BeagleLogic project) but nothing about the reverse direction. Does this even make sense? Will the kernel already be invoking DMA during a memcpy from user virtual address space to the mmap’d physical PRU memory address?
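For reference, my current write path is essentially a memcpy() into an mmap’d window over the PRU shared RAM, roughly like this minimal sketch (the AM335x PRU-ICSS base 0x4A300000 and the 12 KiB shared RAM at offset 0x10000 are from the TRM; the /dev/mem approach and the names here are illustrative):

/* Sketch: map PRU shared RAM through /dev/mem and copy a block in. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PRUSS_BASE  0x4A300000u   /* AM335x PRU-ICSS base (per the TRM) */
#define SHARED_OFF  0x00010000u   /* 12 KiB shared data RAM offset      */
#define SHARED_LEN  (12 * 1024)

int push_block(const void *src, size_t len)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    uint8_t *pruss = mmap(NULL, SHARED_OFF + SHARED_LEN,
                          PROT_READ | PROT_WRITE, MAP_SHARED,
                          fd, PRUSS_BASE);
    close(fd);                    /* the mapping survives the close()   */
    if (pruss == MAP_FAILED)
        return -1;
    memcpy(pruss + SHARED_OFF, src, len);  /* ordinary memcpy into the
                                              mapped window             */
    munmap(pruss, SHARED_OFF + SHARED_LEN);
    return 0;
}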

I need to provide about 32 KiB to the PRU within 5 ms, repeating every 20 ms. This seems like it should be easily accomplished, if a USB driver can sustain 480 Mbps data rates. I must be approaching this the wrong way. Any suggestions on how this should be architected will be greatly appreciated.

On Tue, 21 Mar 2017 22:08:17 -0700 (PDT), ags
<alfred.g.schmidt@gmail.com> declaimed the
following:

> I need to provide about 32 KiB to the PRU within 5 ms, repeating every 20 ms. This seems like it should be easily accomplished, if a USB driver can sustain 480 Mbps data rates. I must be approaching this the wrong way. Any suggestions on how this should be architected will be greatly appreciated.

  If you've got a USB system that actually manages 480 Mbps for more than
a few bytes, you've got something miraculous.

  USB is a polling-intensive protocol, with lots of turn-arounds (host: are you ready to receive? client: ready. host: sends data packet).

  High-speed USB has a data payload size of 1 kB plus a few bytes for sync/PID/CRC16... or about 8k bits per transaction. Sure, those 8k bits go out at 480 Mbps... and then they're followed by polls of the connected devices to find out which is the next to be serviced.

  The effective rate for high-speed USB is only around 280 Mbps (USB 3 SuperSpeed is rated 5 Gbps, but the spec is considered met with an effective rate of 3.2 Gbps -- USB 3 is full-duplex signalling, the others are half-duplex).

  I've not encountered any protocol that requires something like 32 kB as a continuous stream with no subdivisions for handshake/error-checking. Ethernet breaks data up into (with overhead) ~1.5 kB chunks; TCP may be able to send multiple chunks before getting an ACK back on the first chunk, but it is still chunked...

I’d say you most likely have a flaw in your code, because what you describe is only around 1.6 MiB/s (32 KiB every 20 ms = 1600 KiB/s, or roughly 13.4 Mbit/s).

I’d also like to point out that you will rarely, if ever, see a USB interface achieve the full 480 Mbit/s. For example, the g_ether network gadget driver at best usually only achieves 105-115 Mbit/s, though that’s partly due to how the code is written.

That's not much data. I recommend you just make a circular buffer in
the PRU data memory, and run a periodic task on the ARM side to keep the
buffer filled. Using the 12K shared data RAM you can store almost 2 ms
worth of data at your burst rate (32 KiB per 5 ms), which ought to be
plenty. By way of example, the default Machinekit ARM-side thread period
is 1 ms, and it could easily be faster for something simple like this.
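A minimal sketch of that idea, assuming the ring lives at a known spot in the mmap'd shared RAM (the struct layout and names are hypothetical, and a real implementation needs memory barriers between filling the data and publishing the new head):

/* Single-producer (ARM) / single-consumer (PRU) ring in shared RAM. */
#include <stdint.h>

#define RING_SIZE 8192u               /* power of two, fits in 12 KiB  */

struct ring {
    volatile uint32_t head;           /* advanced by the ARM producer  */
    volatile uint32_t tail;           /* advanced by the PRU consumer  */
    uint8_t data[RING_SIZE];
};

/* Call from the periodic ARM task (e.g. every 1 ms) to top it up;
 * returns how many bytes were actually queued. */
static uint32_t ring_fill(struct ring *r, const uint8_t *src, uint32_t n)
{
    uint32_t head = r->head, tail = r->tail;
    uint32_t space = RING_SIZE - 1u - ((head - tail) & (RING_SIZE - 1u));
    uint32_t count = (n < space) ? n : space;

    for (uint32_t i = 0; i < count; i++)
        r->data[(head + i) & (RING_SIZE - 1u)] = src[i];
    r->head = (head + count) & (RING_SIZE - 1u);  /* publish last */
    return count;
}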

Note you might need an -rt or Xenomai kernel to achieve reliable
operation; I've seen the non-rt kernels occasionally "wander off into
the weeds" for several hundred ms at a time.

"Wander off into the weeds . . ." I get a kick out of that expression every
time I see it in this context.

I do agree with Charles, and would like to add that you need to pay close
attention to which C and Linux API function calls you use in your
application. Function calls such as printf(), which can be handy for quick
and dirty text debugging, can slow your code down considerably. However, if
you redirect the output of such an application into a file, you'll notice a
huge performance improvement from that single trick alone. Anything related
to threading or file polling (poll(), etc.) through Linux API calls is also
going to slow you down. Certainly there is more, but these are the three
things I've personally tested and can think of off the top of my head.
Also, under certain conditions, using usleep() where you have a busy-wait
loop can help some, but at other times it could potentially backfire,
depending on how busy your system is. Either way, a busy-wait loop that
never calls usleep() or otherwise gives CPU time back to the system will
wind up using ~95% processor time until preempted. Just remember that there
is only one core / thread to work with.

You may also need to slim down unneeded processes, services, and kernel
modules that are loaded / running by default on a stock BeagleBone Linux
image, as all of these will compete for CPU time, which you may not be
able to afford if you want your application to perform as well as you'd
like. Basically, you need to profile your system and see what you can get
away with.

So from personal experience, I can say with reasonable confidence that the
maximum latency with an RT kernel is going to be around 50 ms, and that
number assumes your system is constantly "busy". If your system is
extremely busy, it can be more. I've had an application that was doing a
lot of processing in code but was only using up to 5% processor time,
because I was giving processor time back to the system with usleep().
Anyway, if you need "real-time", an RT kernel could work fine, depending
on your definition of the term. If you need deterministic, you may need to
use Xenomai, move into the kernel, or potentially both.

I would probably start by profiling your system to see what is running in
the background, and whether everything you do have running is necessary.
After that, try installing an RT kernel.

You’ve hit the nail on the head. The issue (IMO) is Linux “wandering off into the weeds”. It comes back, eventually… but while gone, bad things happen.

  1. I am using a handshake approach between the PRU and the ARM, using interrupts. When the PRU wants more data, it raises an ARM interrupt. The userspace application listens for that interrupt (using select()) and, when it arrives, sends more data; the PRU is then told the data is ready via an interrupt sent back to the PRU.
  2. I am using a ring (though with only two compartments, it seems more like a “line”) to send the data. I think of it as a tick/tock, or ping/pong approach: when one “side” (half) of the data space has been read by the PRU, it signals the ARM host to send another (half) buffer full of data, so the PRU is always reading from one buffer while the ARM is loading the other (see the sketch after this list).
  3. While the average data rate I need to sustain is about 13 Mbit/s (not a problem), the challenge is ensuring, under all conditions, that I can send 262 kbits of data from ARM to PRU, in chunks small enough to fit into the 12K PRU shared RAM, in a “timely manner”. With my current design, this means sending 4 KiB from ARM to PRU shared RAM, completing the transaction within 960 µs of the request for more data. The limiting factors are the timing (I can’t starve the PRU of data, otherwise the output bitstream will have gaps that corrupt the content for the external, off-BBB client) and the size of the PRU memory (if I could load a full “frame buffer” at once to ensure not starving the PRU, that would work - but the PRU shared RAM only holds 1/8 of the data the PRU needs for each burst).
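To make item 2 concrete, the layout is roughly the following (a sketch only - the names, sizes, and flag convention are illustrative):

/* Ping/pong halves plus one 'owner' flag per half in PRU shared RAM. */
#include <stdint.h>

#define HALF_SIZE 4096                /* one 4 KiB chunk per half      */

struct pingpong {
    volatile uint32_t owner[2];   /* 0 = ARM may fill, 1 = PRU may read */
    uint8_t buf[2][HALF_SIZE];
};

/* ARM side: refill a half only once the PRU has released it. */
static int arm_fill(struct pingpong *pp, int half, const uint8_t *src)
{
    if (pp->owner[half] != 0)
        return -1;                /* PRU is still draining this half   */
    for (int i = 0; i < HALF_SIZE; i++)
        pp->buf[half][i] = src[i];
    pp->owner[half] = 1;          /* hand the half to the PRU          */
    return 0;
}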

I thought using select() to wait for notification of an event (by “listening” on the uio device files) would free the ARM CPU to do other things while waiting, yet provide the most immediate path for the userspace application to send more data. Is there a better way?
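The wait itself is the standard uio pattern - select() on the uio device, then a 4-byte read that returns the interrupt count (a sketch, assuming the PRU event shows up as /dev/uio0):

/* Block until the PRU raises its event, then collect it. */
#include <stdint.h>
#include <sys/select.h>
#include <unistd.h>

int wait_for_pru(int uio_fd)     /* fd from open("/dev/uio0", O_RDONLY) */
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(uio_fd, &rfds);

    if (select(uio_fd + 1, &rfds, NULL, NULL, NULL) < 0)
        return -1;

    uint32_t count;              /* running interrupt count */
    if (read(uio_fd, &count, sizeof(count)) != sizeof(count))
        return -1;
    /* The event still has to be cleared/re-armed in the PRUSS INTC,
     * e.g. with prussdrv_pru_clear_event() if using the prussdrv lib. */
    return 0;
}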

So that select() is probably your whole problem - unless you're using other
system calls as well. But I've already discussed with you the best and
fastest way to achieve your goal, several times in fact: use a bit in
memory, *somewhere*.

PRU side (pseudocode - there is no usleep() on the PRU; with the TI PRU C
compiler you would spin on the flag, e.g. with __delay_cycles()):

volatile uint32_t *somewhere = ...;   /* flag word in PRU shared RAM */

while (*somewhere & 0x1)
    ;                                 /* wait for our turn */
/* Do our work after the while() falls through */

Userspace side:

while (!(*somewhere & 0x1))
    usleep(1000);                     /* yield the CPU ~1 ms per check */
/* Do our work after the while() falls through */

No need for select(), no need for fancy threading calls or other magical
hand-waving. Just two simple busy-wait loops, each waiting for its turn.
But don't forget to toggle the bit back when you're done. Anyway, it's not
really Linux that's off in the weeds. Well, perhaps it is, but your
application is pushing it into the weeds.

You’re also not going to be able to use alarm(), or any function that gets time from the system - not if you want to stay reasonably deterministic. Write your userspace app to do one thing, fast, and handle the rest elsewhere. For instance, if you’re sending your data to some remote location, have that remote location timestamp the data.

OK, I will use the busy-wait loop with usleep() and test. The reason I used select() was that I thought it would allow me to do other things (I need to have another process, thread, or loop in this same application serving audio data out to another client, synchronized with this data). My understanding was that blocking on select() would free the CPU for other things, yet allow a quick wake-up to refresh the buffer as needed.

BTW, I have only mentioned the problems - but it does almost work. In my tests, I ran 12,500 4 KiB buffers from ARM to PRU and measured (on the PRU side, using the precise CYCLE counter) whether the PRU ever had to wait for the next buffer fill. It turns out the PRU had to wait about 180 times, or on about 1.5% of the buffer-fill events. The worst-case wait (stall) time was ~5 ms.

> OK, I will use the busy-wait loop with usleep() and test. The reason I used select() was that I thought it would allow me to do other things (I need to have another process, thread, or loop in this same application serving audio data out to another client, synchronized with this data). My understanding was that blocking on select() would free the CPU for other things, yet allow a quick wake-up to refresh the buffer as needed.

I thought that select() and all that should work too, initially. But you
have to remember, we're talking about an OS here that has an "expected"
latency of 100 ms or more, depending. One could easily experiment and find
out for oneself: one of the easiest tests would be to run a loop for 10,000
iterations, once using select() and once using a busy-wait loop, then run
each under the command-line time utility to see the difference. This is of
course not a super accurate test, but it should be good enough to show a
huge difference in completion time. *If* you're more of the scientific
type, get the system time in your test app before and after the test code,
then output the difference between those two times.
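Something like this would do for the "scientific" version (a rough harness only - swap the nanosleep() for whatever primitive you want to measure):

/* Time 10,000 iterations of a wait primitive with clock_gettime(). */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 10000; i++) {
        struct timespec ts = { 0, 1000000 };  /* 1 ms */
        nanosleep(&ts, NULL);                 /* primitive under test */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("10000 iterations: %.3f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}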

Anyway, using an RT kernel or a Xenomai kernel may improve this latency
*some*, but it is said that this comes at the expense of *some* other
performance aspects of the OS. I've not actually tested that myself, only
read about it.

> BTW, I have only mentioned the problems - but it does almost work. In my tests, I ran 12,500 4 KiB buffers from ARM to PRU and measured (on the PRU side, using the precise CYCLE counter) whether the PRU ever had to wait for the next buffer fill. It turns out the PRU had to wait about 180 times, or on about 1.5% of the buffer-fill events. The worst-case wait (stall) time was ~5 ms.

One has to be very careful what one uses in code when writing an executable
that requires some degree of determinism from userspace. I can't recall the
specific articles I've read in the past that led me to understand all this,
but they're out there. Pretty much anything that is a system call will
incur a latency penalty, because you end up switching processor context
from userspace to kernelspace and back to userspace. This in and of itself
may not be too bad, but any data the call needs will end up being copied
back and forth as well. In these cases you can incur huge latency spikes
that you may not have anticipated.

Personally, I've run into this problem a couple of times across two
different projects. So my style of coding is to just get something working
first, then refactor the code to perform to my expectations. Basically,
starting with really "simple" stuff like printf(), select(), etc., then
refactoring those out when / if needed. Many times it's not needed, but
when it is, one should understand the consequences of using such function
calls, so one has at least a rough idea where to start "trimming the fat".
But everyone falls into this "trap" at least once or twice when entering
the embedded arena.

My understanding of calls like select() is that when they're used, you're
yielding the processor back to the system with the "promise" that
eventually the system will notify you when something related to that call
has changed. But with a busy-wait loop, you're defining the period for
which you allow the processor to be yielded back to the system - in the
case of my example, approximately 1 ms. Just be aware that with any
non-real-time OS, intervals much faster than 1 ms will yield varying
results, i.e. the system will (or may) not be able to keep up with your
code. If your code is super efficient, you can potentially get hundreds of
thousands of iterations per second. This is of course not guaranteed, but
I've done it personally with the ADC, so I know it can be possible. At that
performance level you're almost certainly using mmap() - and almost
certainly using a lot of processor time as well, 80%+.

Also, my code was pseudocode that I picked apart myself after I posted. On
the PRU side of things, you're probably going to want to do things a bit
differently. For starters, you're probably going to want to time your data
transfers from the PRU; that is, every 20 ms you're going to kick off a new
data set. However, this has to be done smartly, as you do not want to
override the userspace side's file lock. So perhaps a double buffer will be
needed? That will depend on the outcome of your given situation. Another
technique that could be used would be data packing, as plain-text data can
take a lot more memory than a packed data structure. But it would also
require a lot of thought on how to do this smartly, as well as a strong
understanding of struct / union "data objects" plus data alignment, for the
best results.
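To illustrate the data-packing point (the field names are made up): the same record that takes 20+ bytes as plain text fits in a handful of bytes as a packed struct:

/* GCC-style packed struct: no padding between members. */
#include <stdint.h>

struct __attribute__((packed)) sample {
    uint32_t timestamp;        /* 4 bytes                              */
    uint16_t channel_mask;     /* 2 bytes                              */
    uint8_t  flags;            /* 1 byte                               */
};                             /* 7 bytes packed; mind that unaligned
                                  access costs extra on some buses     */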

There could potentially be a lot more to consider down the road. Just pick
away at it one thing at a time. Eventually you'll be done with it.

One other thing I did not think to mention: I was recently watching a video on YouTube from Jason Turner, a person who is known for talking about performance-related C++ coding. Now I’m not exactly a huge fan of C++, but I do like to keep up with the language. One of the things he mentions in this video, which I completely agree with, is: “Simple code is 99% more likely to perform better than complex code”, or something to that effect. Which may seem obvious initially, but compare a simple two-line busy-wait loop to the select() system call. Is select() only two lines of easy-to-read, easy-to-understand code? You know, I cannot say with 100% certainty that it is not. But I seriously doubt it.

@William Hermans I thought I’d share the result of my efforts to reliably stream data from ARM host (Linux userspace) to PRU.

I instrumented the PRU ASM code to use the CYCLE register for very precise measurements. I ran tests that kept track of how many times the PRU stalled waiting for data from the ARM host, for how long, and which stall was the "worst offender". I used this to test my current implementation using select(), then replaced select() with usleep() (and nanosleep()), and then again with a loop with no sleep at all - a brute-force busy wait that never released the CPU. As it turns out, the results were surprising. Using usleep() (and similar methods), the number of stalls, the overall stall time, and the worst-case stall time were all significantly worse than with the implementation using select(). Even the busy-wait loop without sleep was worse.

I did a bit of research: sleep() and related methods are implemented via a syscall (sleep() used to be built on alarm() in the olden days, so I read). So the trip through the call gate and the context swap happen with sleep() just as with select(). My theory is that select() is more efficient precisely because of this: one call to select() incurs one system call / context swap per interrupt. The process is put on the not-running list and the OS continues on; when the trigger event happens, the OS returns the process to the running list and hands control back to userspace. With the sleep() method, there are many calls per "interrupt", polling some memory location looking for the signal from the PRU. So what is handled by one userspace -> kernelspace -> userspace transition with select() could require dozens of such transitions using sleep().

I don’t claim to be an expert, and if there is a flaw in this theory, I’m open to hearing what it is. But this is my theory at the moment.

So what I ended up doing is compressing the data so that one “frame” fits in PRU memory at once. The PRU needs to send a full “frame” out with precise (microsecond-level) timing for all data in that frame; between frames there is slack. By compressing the data, I can load a full frame into the PRU0/1 DRAMs and shared RAM, and then kick off writing out the frame. Now everything is (or appears to be) deterministic in the timing of all transfers between registers, scratch, and PRU DRAM. So I’ve sidestepped the problem of unpredictable latency waiting for data from the ARM host.

I hope this might help someone else with similar requirements.

> @William Hermans I thought I'd share the result of my efforts to reliably stream data from ARM host (Linux userspace) to PRU.
>
> I instrumented the PRU ASM code to use the CYCLE register for very precise measurements. I ran tests that kept track of how many times the PRU stalled waiting for data from the ARM host, for how long, and which stall was the "worst offender". I used this to test my current implementation using select(), then replaced select() with usleep() (and nanosleep()), and then again with a loop with no sleep at all - a brute-force busy wait that never released the CPU. As it turns out, the results were surprising. Using usleep() (and similar methods), the number of stalls, the overall stall time, and the worst-case stall time were all significantly worse than with the implementation using select(). Even the busy-wait loop without sleep was worse.
>
> I did a bit of research: sleep() and related methods are implemented via a syscall (sleep() used to be built on alarm() in the olden days, so I read). So the trip through the call gate and the context swap happen with sleep() just as with select(). My theory is that select() is more efficient precisely because of this: one call to select() incurs one system call / context swap per interrupt. The process is put on the not-running list and the OS continues on; when the trigger event happens, the OS returns the process to the running list and hands control back to userspace. With the sleep() method, there are many calls per "interrupt", polling some memory location looking for the signal from the PRU. So what is handled by one userspace -> kernelspace -> userspace transition with select() could require dozens of such transitions using sleep().
>
> I don't claim to be an expert, and if there is a flaw in this theory, I'm open to hearing what it is. But this is my theory at the moment.

I've honestly no idea how you're implementing what I suggested, so I can't
really comment on what's going on. sleep() won't work, though, and I'm not
sure how usleep() is implemented on your particular OS (Debian Linux), but
usleep() on bare-metal microcontrollers is usually less than 10 lines of
code. I want to say less than 6 lines, but it's been a while since I've
looked through an implementation.

Your findings are also surprising to me, but I cannot help but feel that
you did not implement the busy-wait loop how I expected, or perhaps there
is something else going on that we haven't discussed. If you used
"interrupts" with the busy-wait loop, then that's not how I intended it to
be used - the busy-wait loop was meant to replace your interrupt code - and
that would explain why it could have been slower. To be sure, there are
potentially many other things that could have been the culprit / co-culprit
in your situation. It's not always easy to talk about these things at a
high level without making sure everything is understood on both sides of
the conversation. Without seeing your code, I can't really say any more
with surety.

> So what I ended up doing is compressing the data so that one "frame" fits in PRU memory at once. The PRU needs to send a full "frame" out with precise (microsecond-level) timing for all data in that frame; between frames there is slack. By compressing the data, I can load a full frame into the PRU0/1 DRAMs and shared RAM, and then kick off writing out the frame. Now everything is (or appears to be) deterministic in the timing of all transfers between registers, scratch, and PRU DRAM. So I've sidestepped the problem of unpredictable latency waiting for data from the ARM host.
>
> I hope this might help someone else with similar requirements.

Yeah, there is usually more than one way to do the same thing. That's why I
mentioned data packing, as I had a feeling it could at least be useful for
you.

On a bare-metal microcontroller, sleep() is a busy loop, but on Linux
sleep()/usleep()/nanosleep() results in a system call, which explains the
latency differences. BTW, a busy loop on Linux can still be interrupted
and result in latency.

The only problem I have with that train of thought is that I've written
code that literally handled all 200 ksps of the ADC, and it used usleep().
Prior to that, I implemented nearly exactly the same thing I was trying to
explain here, but with both sides of my project in userspace: one half
reading from the CAN bus and decoding PGNs in real time, the other half
taking that data and putting it out to a web page via WebSockets. When I
tested this with redundant data, I was getting 2000+ WebSocket messages a
second to the web client, where various other methods like select() and
poll() were achieving fewer than 20 messages a second.

So, I'm not arguing, but rather confused as to why this would work for me,
and not for someone else.