GPIO from PRU via OCP on a Beaglebone Black - Timing Issues

For those who may remember the old LEDScape library: I’ve been working on an updated version of that library, one that focuses on strips instead of matrices, uses rproc instead of UIO PRUSS, and updates the PRU assembly from pasm to clpru.


The key thing you need to know is that we hook up 32 addressable LED strips and then use the PRU to bitbang out the RGB(W) data. We use the PRU because our timings need to be pretty precise: a few hundred nanoseconds for each key phase of the operation.
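For concreteness, here’s roughly the timing budget we’re up against for a WS2812B-style strip (nominal values from the WS2812B datasheet; exact figures vary by part, tolerance is about ±150ns):

```c
/* Nominal WS2812B bit timings in nanoseconds (datasheet values).
 * Every bit is a high phase followed by a low phase; the split
 * between the two encodes whether the bit is a 0 or a 1. */
#define T0H_NS 400u   /* "0" bit: high time */
#define T0L_NS 850u   /* "0" bit: low time  */
#define T1H_NS 800u   /* "1" bit: high time */
#define T1L_NS 450u   /* "1" bit: low time  */

/* Both encodings share the same 1.25 us bit period, so a single
 * delayed phase skews every later edge in the frame. */
#define BIT_PERIOD_NS (T0H_NS + T0L_NS)
```

With phases this short, a few hundred nanoseconds of added latency is enough to turn an intended 0 into a 1 on the wire.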

Here’s the important issue: we need to address all 32 GPIO pins from the PRU, but not all of them are bound to the r30 register. So we need to go through the OCP port. This is exactly how LEDScape worked, and continues to work, just fine. We’ve never been able to get LEDScape working under 4.x kernels, mostly because of UIO problems (which is what kicked off this whole “move to rproc” thing).
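For anyone unfamiliar with the trick: the pins that aren’t in r30 get driven by storing a pin mask into the GPIO module’s set/clear registers across the OCP port. A minimal C sketch of that store (the register offsets are the AM335x GPIO module’s `SETDATAOUT`/`CLEARDATAOUT` from the TRM; taking the mapped base as a parameter is my own framing for illustration, not code lifted from either library):

```c
#include <stdint.h>

/* AM335x GPIO module register offsets (from the TRM). Writing a mask
 * to SETDATAOUT drives those pins high; writing it to CLEARDATAOUT
 * drives them low. Untouched bits are unaffected, so no
 * read-modify-write cycle is needed. */
#define GPIO_CLEARDATAOUT 0x190u
#define GPIO_SETDATAOUT   0x194u

/* Drive the pins in `mask` high or low through the GPIO module mapped
 * at `gpio_base`. On the PRU, this single store is the only part of
 * the bit-bang loop that leaves PRU-local memory and crosses the OCP
 * port, so it is the only part without deterministic timing. */
static inline void gpio_write(volatile uint32_t *gpio_base,
                              uint32_t mask, int high)
{
    gpio_base[(high ? GPIO_SETDATAOUT : GPIO_CLEARDATAOUT) / 4u] = mask;
}
```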

My upgrade, ThrowingBagels, uses basically the same core logic on the PRU, just ported to clpru assembly and running on a 4.19 kernel. Seemingly at random, the timings hitch, which causes the LEDs to flicker to the wrong color. Phases of our bitbang operation will sometimes take almost twice as long as they should: a sleep that should have been 600ns ends up taking 1100ns. The only operation without guaranteed timings is writing to the GPIO pins via OCP; everything else we do happens entirely in PRU DRAM. Since everything else is deterministic and the hitches appear random, I assume the latency is coming from that OCP step.

In support of that hypothesis: if I upgrade from the kernel that ships with the “AM3358 Debian 10.3 2020-04-06 4GB eMMC IoT Flasher” image to the most recent 4.19 kernel, the problem becomes a lot less frequent. We’re blasting this data out at 30fps, like video, and when I cut down the number of running services and update the kernel, I can get the glitches down from happening every few seconds to happening every few tens of seconds.

My suspicion, though I can’t quite prove it, is that on 4.19 there’s something about the kernel or its configuration that sometimes adds latency to OCP writes, something that wasn’t there on 3.16. So my key question is: how do I improve the timing consistency when the PRU uses OCP to write to the GPIO registers? I understand it will never have guaranteed timing, but sometimes it’s hitting me with latencies of up to 500ns. Anything I can do to minimize that latency would be a huge help.

TL;DR: how can I make PRU->OCP->GPIO more consistent in its timing under a 4.19 kernel?

Wow… You should have contacted us before doing a lot of that. I completely re-wrote most of the LEDScape code over the last couple of years to optimize things in an attempt to reduce some of the timing issues. Porting to clpru and rproc was already part of that. All my updates are in FPP ( ).

Anyway, to answer your question: the issue is specific to GPIO0. GPIO1-3 are not affected by the massive latency issues. Thus, the best option is to choose pins on GPIO1-3 and not use the GPIO0 pins. That wasn’t an option for me, as we needed to output 48 strings. In the FPP code, if nothing is using the second PRU (it could be used for DMX or Pixelnet output), we divide the work and have one PRU do GPIO1-3 and the other do GPIO0. If something IS using the other PRU, and the strings are short enough, then we split it on the one PRU and do GPIO1-3 first, then do the GPIO0 pins. For the most part, that keeps the GPIO0 problems from affecting all the strings, so the random flashes would really just be on the GPIO0 strings. In the case where the second PRU is used for something else AND the strings are longer, we do have to do all 4 GPIOs at once and all of them can be affected, so it’s definitely not a perfect solution.
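If it helps make sense of why only GPIO0 suffers: it lines up with the AM335x memory map, where GPIO0 is the only GPIO module sitting on the L4_WAKEUP interconnect; GPIO1-3 live on L4_PER. A quick sanity check of the base addresses (values from the AM335x TRM memory map):

```c
#include <stdint.h>

/* AM335x GPIO module base addresses (AM335x TRM memory map). */
#define GPIO0_BASE 0x44E07000u
#define GPIO1_BASE 0x4804C000u
#define GPIO2_BASE 0x481AC000u
#define GPIO3_BASE 0x481AE000u

/* The L4_WAKEUP interconnect owns 0x44C00000-0x44FFFFFF. Everything
 * in that window (GPIO0, the power-management Cortex-M3, clock and
 * reset control, ...) shares the bus, which is why power-management
 * activity can stall GPIO0 accesses while GPIO1-3 on L4_PER are
 * unaffected. */
#define L4_WAKEUP_START 0x44C00000u
#define L4_WAKEUP_END   0x44FFFFFFu

static inline int on_l4_wakeup(uint32_t addr)
{
    return addr >= L4_WAKEUP_START && addr <= L4_WAKEUP_END;
}
```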

To minimize the issues (but not entirely eliminate them), I now build a custom 4.19 kernel that disables most of the devices on the L4_WAKEUP interconnect. Any power-management and frequency-scaling stuff causes huge issues with GPIO0 latencies, so those are the most important things to disable. I think my notes are at:

Not sure if that helps enough for you. Feel free to ask more questions. :slight_smile:

That is super helpful. Thanks a bunch. The pin layout we’re using on our boards uses a lot of GPIO0 already, so it’s definitely too late to change that on this project. The way we’re banging things out, all the GPIOs are being hit at the same time, so the latency does appear to hit all our strings. I’ll try giving your kernel a shot, though; that’ll definitely help. And maybe I’ll move the GPIO0 bits over to the other PRU. I hate to have to do that, but if it’s what needs done, it’s what needs done.

Also, off topic, but poking at FPP: SK6812s support the WS281x protocol, so you mostly already support them, but if you poke around at the ThrowingBagels approach a little, it’s not a big push to get 32-bit support for RGBW LEDs (we use a lot of SK6812s with the warm-white LED, and it looks great). We’ve been running a custom hack of LEDScape for ages, so ThrowingBagels is sort of a consolidation of the features we use, stripped down to the bare minimum.

The debs for the kernel are at:
so you should be able to update to our kernel fairly easily. If you need to start building your own kernel, I’d suggest grabbing a BeagleBone AI and building on that. It’s WAY faster for kernel builds. :) You can cross-compile from a Debian x86_64 machine, but I was never able to get that to produce proper .deb files that could be installed cleanly on the BBB, so I pretty much just use the AI for kernel builds. (It’s actually the ONLY thing I use my AI for.)

FPP provides a complete UI frontend for configuring the pixel strings and such, and we do allow the various 4-channel types. It does a lot of other things as well. That said, most of these things are done on the ARM side and not the PRU. Part of trying to figure out the latency issue was seeing what makes sense to do on the ARM side and what works best on the PRU side. If you actually wanted to try FPP and see if FPP’s optimized PRU code and kernel combination would work, you could use the FPP 4.6.1 image on an SD card (see the release assets on GitHub). You would just need to create a small JSON file in /opt/fpp/capes/bbb/strings to describe the pinout of your cape (use any of the files in that directory as a starting point) and it should then “just work”. You would need to configure e1.31/ArtNet input universes on the Channel Input tab, put FPP in “bridge” mode, and then it should work like a normal light controller and accept pixel data. (Or use the DDP protocol, which doesn’t require configuring the input.)

Personal plug: I’d be happy to sell capes that don’t use gpio0.

Personal plug: I’d be happy to sell capes that don’t use gpio0.
Heh, we’ve already got all the boards for this project. Maybe we’ll revisit that design in future projects, though.

Your software is definitely doing a lot more than ours, and certainly much more than we need; we just listen for RGB data on a UDP socket. We, uh… don’t really treat them like lights, but as very large pixels in a screen. All the mapping/direction/orientation stuff is handled in the render stack, which we build custom for pretty much every project. The last one was a Unity app that was part of a kiosk connected to a gigantic chandelier. Our current project is kind of an architectural-scale video wall with a C++/OpenCV app driving the pixels.

I might give the FPP image a shot, though, if building my own custom kernel doesn’t help. Your guidance on that was super helpful, although your FPP kernel did not get along with our software (the LEDs just didn’t work; rather than diagnose that, I opted to compile my own kernel, which is going on… right now). Thanks a bunch!

One more thing you can try with the TI kernel if your kernel build has issues:

The biggest problem is the CPU going in and out of idle states. Thus, you can install the cpufrequtils and linux-cpupower packages and then at boot run:

cpufreq-set -g performance

cpupower idle-set -d 1

(and maybe cpupower idle-set -d 0 )

The first will lock the CPU frequency at 1GHz. The second will disable the very costly idle states in the processor. I believe the Cortex-M3 on the L4_WAKEUP interconnect is in charge of the power-management stuff, which includes the CPU idle settings. Flipping the CPU out of idle seems to take a long time and blocks the bus while it waits, so disabling that state helped a lot. Alternatively, you can install the “bone” kernel, which doesn’t have the idle driver (or at least didn’t early last year; not sure anymore). Anyway, those had a huge impact, but still weren’t 100%, which is why we decided to compile our own kernel, completely disabling everything on the L4_WAKEUP.


I’ve been playing with building my own kernel, and I did get cross-compilation working, even with the bindeb-pkg target, which is cool. The weird side effect of disabling various bits and modules is that… the PRU cycle counter stops working. I have no idea how or why that happens, but if I use my own kernel and ensure that all the appropriate modules are loaded: bupkis. I suspect I have the same problem with the FPP kernel (I threw it on, tried running our software, saw it didn’t work, and just went to work on my own kernel without digging into the root cause, chalking it up as an incompatibility).
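For what it’s worth, the cycle counter itself is PRU-local state: it only counts once the CTR_EN bit in that PRU core’s CONTROL register is set, and it ticks at the 200 MHz PRU clock, so 5ns per count. A sketch of the enable-and-convert logic, written to take the control-register block as a parameter so it can be exercised off-target (register offsets and the CTR_EN bit position are from the AM335x PRU-ICSS documentation; double-check them against your TRM revision):

```c
#include <stdint.h>

/* PRU-ICSS per-core control-register word offsets (AM335x PRU-ICSS
 * documentation; verify against your TRM revision). */
#define PRU_CTRL_CONTROL  (0x00u / 4u)   /* CONTROL register        */
#define PRU_CTRL_CYCLE    (0x0Cu / 4u)   /* free-running CYCLE count */
#define PRU_CTRL_CTR_EN   (1u << 3)      /* counter-enable bit      */

/* The PRU core clock is 200 MHz, so one CYCLE tick is 5 ns. */
#define PRU_NS_PER_CYCLE  5u

/* Set CTR_EN so the CYCLE register starts counting. */
static inline void pru_counter_enable(volatile uint32_t *ctrl)
{
    ctrl[PRU_CTRL_CONTROL] |= PRU_CTRL_CTR_EN;
}

/* Convert a CYCLE-counter delta into nanoseconds. */
static inline uint32_t pru_cycles_to_ns(uint32_t cycles)
{
    return cycles * PRU_NS_PER_CYCLE;
}
```

At that rate, the 600ns sleep mentioned earlier should read back as 120 cycles and the 1100ns hitches as ~220. If the counter reads zero even with CTR_EN set, that suggests something about the state the core was left in rather than the counter logic itself.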

So yeah, playing with cpufrequtils is probably going to be the next thing to try. Thanks again!