Load and execute PRU code from bare-metal application

Hi,

Now that I have some experimental bare-metal code running smoothly on my BBB, I plan to use the PRU for some real-time tasks (mainly bit-banging some GPIO outputs).

Unfortunately, documentation and examples seem to be very rare, and the TRM is very detailed, sometimes much too detailed to give an overview of the whole story. There is some PRU code available at https://github.com/beagleboard/am335x_pru_package but the host-side code seems to expect a running Linux system.

So my question: are there any examples/documentation available out there that show/describe how to

  • enable the PRU
  • load code into the PRU
  • execute that code on the PRU
  • exchange data between the CPU and the PRU (seems to be via some shared memory?)

from within bare-metal code? Starterware itself seems to ignore the PRU part completely; there is nothing helpful there apart from an unused header file…

Thanks for all ideas, tips and suggestions!

Hi Satz,
Did you ever make any progress with your questions?

I have successfully used Starterware on the BBB and a TI PRU-Cape with its supporting software to load and run PRU program examples.
However, I haven’t found any examples on communicating between the ARM and the PRUs via shared memory. Have you?
I did get a simple transfer to work but am not sure about ensuring mutual exclusion or the best way to set up the shared memory.
Bryan

There is no single correct or "best" way to implement communications
between the ARM core and the two PRUs. This is a standard problem in
all multiple core machines, and you will find a lot of material with a
quick Google search. Depending on your application you may want to use
things like mailbox registers, lockless queues, interrupt signaling,
req/ack handshakes, etc.

Thanks for your reply Charles. My question was not well formulated.
I have an understanding of the theory of mutual exclusion. However, applying the theory using Starterware on the BeagleBone Black is where I am making slow progress.
Examples of ANY method would be useful. One simple method I have used in the past is spinlocks and shared memory, but I haven’t been able to get that working yet on the BBB with Starterware.
I posted a more detailed question on the TI E2E Starterware forum but haven’t received any replies as yet (perhaps it was also a badly formulated question!).
https://e2e.ti.com/support/embedded/starterware/f/790/t/410442
Thanks again.

Pretty much all of the memory is shared, in that both the ARM core and
the PRUs can see it.

Using the DDR system memory is problematic, however. It is both more
cumbersome to use on the PRU side (accessed via the interconnect bus it
stalls the PRU while reading), and on the ARM side you won't ever see
any changes unless you're careful about your cache management (typically
using kernel-mode code and the same sort of memory semantics required
when doing DMA transactions).

I'd recommend just using the PRU data memory space for anything like
semaphores or mailboxes. The memory is already mapped with the proper
access flags to avoid ARM side caching issues, and the PRUs can access
the memory without stalling. The only real reason to use anything but
the PRU shared memory is if your data set is larger than the 8K/12K
RAMs will support.

I looked at your TI post, but it doesn't make much sense to me. I'm not
very familiar with starterware, or with the spinlock register you refer
to. I can say that the ARM atomic bus transactions are unlikely to work
properly between the PRU and the ARM if you're using the PRU shared
memory, but there are many other synchronizing constructs you can use.
The mechanisms I've used in my code rely on unidirectional atomic
access, which works well. The ARM writes values into the PRU shared
memory which the PRU reads, and the PRU writes to *DIFFERENT* locations
which the ARM side reads. With only a single writer for each memory
address, there is no need for an atomic "read-modify-write" as needed
for a traditional spinlock. It's possible to directly build lockless
work queues and req/ack handshakes out of this sort of primitive, and if
you really want a spinlock, you could build it on top of req/ack.
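
To make the single-writer idea concrete, here is a minimal sketch of a req/ack mailbox. The `mailbox_t` layout, the function names, and the "double the argument" stand-in for real work are all hypothetical; the PRU side is written in C only for illustration and would in practice live in the PRU firmware. The one property it is meant to show is that every field has exactly one writer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mailbox placed somewhere in PRU data or shared RAM.
 * Each field has exactly one writer, so no atomic read-modify-write
 * is needed on either side. */
typedef struct {
    volatile uint32_t req;    /* written by the ARM only: request sequence number   */
    volatile uint32_t arg;    /* written by the ARM only: request payload           */
    volatile uint32_t ack;    /* written by the PRU only: last completed sequence   */
    volatile uint32_t result; /* written by the PRU only: reply payload             */
} mailbox_t;

/* ARM side: post a request. Returns 1 on success, 0 if one is still pending. */
static int arm_post(mailbox_t *mb, uint32_t arg)
{
    assert(mb != 0);
    if (mb->req != mb->ack)
        return 0;              /* previous request not acknowledged yet */
    mb->arg = arg;             /* payload first ...                     */
    /* On real hardware a memory barrier belongs here so the payload
     * is visible before the request is published. */
    mb->req = mb->req + 1u;    /* ... then publish (ARM is the only writer of req) */
    return 1;
}

/* ARM side: fetch the reply. Returns 1 if no request is outstanding. */
static int arm_collect(mailbox_t *mb, uint32_t *result)
{
    assert(mb != 0 && result != 0);
    if (mb->req != mb->ack)
        return 0;              /* PRU has not finished yet */
    *result = mb->result;
    return 1;
}

/* PRU side: service one pending request, if any. Returns 1 if work was done. */
static int pru_service(mailbox_t *mb)
{
    assert(mb != 0);
    uint32_t seq = mb->req;    /* read-only on this side */
    if (seq == mb->ack)
        return 0;              /* nothing new */
    mb->result = mb->arg * 2u; /* stand-in for the real work */
    mb->ack = seq;             /* acknowledge last (PRU is the only writer of ack) */
    return 1;
}
```

A spinlock-like "one request in flight" discipline falls out of comparing `req` and `ack`; a work queue would use the same trick with head and tail indices owned by different sides.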

I’d suggest to have a look into this thread: https://groups.google.com/forum/#!category-topic/beagleboard/pru/rCO-2nKynVE

Greetings Charles,

Does the PRU stall when writing to memory outside of the PRU address space? I am working on interfacing a cheap camera to the PRU and want to have it write to a 640 x 480 buffer. The PRU will only ever write to the buffer, and the ARM core will only ever read it, so I don’t see contention being an issue, but the amount of space I need is bigger than what all the PRU memory combined offers, so I definitely need to use DDR. My concern is that the PRU won’t be able to write the data from the camera out fast enough, as there will be 8 parallel bits coming in every cycle at 12 MHz. I can shift 4 bytes in at a time and write them out a DWORD at a time (which I guess would make the best use of the bus), but that is still a 3 MHz pace. Should the OCP bus be able to handle this? Any info appreciated.

Thanks,
Bill Merryman

Here is something for you to look at Bill. http://comments.gmane.org/gmane.comp.hardware.beagleboard.user/59975

Charles and a couple of other people talk there about how many cycles reading from and writing to various addresses takes. Not sure whether this will answer your question thoroughly. One user suggests using PRU0 to write to the PRU shared RAM while PRU1 takes this data and writes it to DDR, instead of using DMA.

In addition to the other thread, I'd suggest looking at the
BeagleLogic code. It's possible to move _large_ amounts of data
through the PRU to the DRAM, but it requires some finesse.

A few additional comments in-line, below.

Greetings Charles,

Does the PRU stall when writing to memory outside of the PRU
address space?

It depends. Writes are posted, so they won't stall for long if you
aren't saturating the internal SoC bus.

I am working on interfacing a cheap camera to the PRU and want
to have it write to a 640 x 480 buffer. So the PRU will only ever
write to the buffer, and the ARM core will only ever read the
buffer, so I don't see contention being an issue, but the amount
of space I will need is bigger than what all the PRU memory
combined offers so I definitely need to use DDR. My concern is
that the PRU won't be able to write the data from the camera out
fast enough, as there will be 8 parallel bits coming in every
cycle at 12 MHz. I can shift 4 bytes in at a time and write it out
DWORD at a time (which I guess would make the best use of the
bus), but that is still a 3 MHz pace. Should the OCP bus be able
to handle this?

It's possible to move a *LOT* more data than that (again, see the
BeagleLogic code). Note that you will generally get better results
with burst transfers (ie: moving many 32-bit words at a time) than by
writing individual DWORDS. Since there are two PRUs, for maximum
throughput it makes sense to have one PRU doing the data acquisition
and the other PRU writing the data to system memory. You can
communicate up to the entire PRU register set "broadside" between the
two PRU cores in one clock using the exchange instructions.

I don't know if this is a good idea. I think this would lock both PRUs when
accessing shared RAM, because only one can access it at a time. As
far as I remember from the TRM, a PRU writing to DRAM would halt the main
core, not vice versa. If this is correct, the additional write/read
operation on shared RAM not only wastes a full PRU core but also adds
some delay without gaining anything.

On the other hand: how much data do you really retrieve from your camera?
And how long would the data transfer to DDR really take compared to the
remaining time between two pictures?

William, Charles, and Karl,

I can’t thank you enough for all of your input. My intention is to use a cheap OV7670 camera to capture a video stream for a robotics project (I’ve seen other projects that suggest at least image capturing from the camera is possible by direct output, as opposed to using I2C).

I would like to keep the other PRU free to run a half duplex UART out to some Robotis Dynamixel servos. I originally tried to read the camera from a program running on the main core. I had TIMER7 putting out a 12 MHz clock to the camera, and the VSYNC, HREF, PCLK, and the 8 bit parallel video lines coming into one of the GPIO banks. The VSYNC line appeared to be signaling 15 times a second, which was expected. An oscilloscope reading suggested the other lines were signaling at about the right intervals. It just seemed like something in the process of reading the GPIO pins was not keeping up. I thought since the main core runs at 1 GHz and this is bare metal I would have plenty of cycles between PCLK signals to read and handle the data, but I was only getting the expected data every so often, with a lot of garbage coming in between. So I decided to go the PRU route hoping the more direct GPIO access and determinism would make for a reliable process.

Since the camera is running at 15 frames a second at 640 * 480 (YUV, so 2 bytes per pixel), I would have to pump about 9.2 MB a second to wherever this is getting stored, with at least 614 KB to store one frame (and I would kind of like to back buffer it for computer vision processing, so double that). If this is just crazy, please let me know.
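
As a quick check on those figures, the arithmetic can be spelled out in a few lines of C. The helper names are made up for this sketch; the inputs are the ones given above (640 x 480, 2 bytes per pixel, 15 frames per second), and the units are decimal bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold one frame. */
static uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    assert(width > 0u && height > 0u && bytes_per_pixel > 0u);
    return width * height * bytes_per_pixel;
}

/* Sustained bytes per second for a given frame size and frame rate. */
static uint32_t stream_bytes_per_second(uint32_t bytes_per_frame, uint32_t fps)
{
    assert(fps > 0u);
    return bytes_per_frame * fps;
}

/* Memory needed when keeping a front and a back buffer. */
static uint32_t double_buffer_bytes(uint32_t bytes_per_frame)
{
    return 2u * bytes_per_frame;
}
```

With the numbers above this gives 614,400 bytes per frame, 9,216,000 bytes per second, and 1,228,800 bytes for a double buffer, which is far beyond the 8K/12K PRU RAMs and so does point at DDR.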

BTW, I haven’t actually written the code to read the PRU GPIO pins yet. Do I have to set the pinmux up in the regular pad control registers, or is their muxing controlled completely by the PRU registers?

Thanks again for all of your help!

Can you not use one of the many USART peripherals on the SoC for this?

In addition to the other thread, I’d suggest looking at the
BeagleLogic code. It’s possible to move large amounts of data
through the PRU to the DRAM, but it requires some finesse.

My first intuition would be using EDMA.

on the ARM side you won’t ever see any changes unless you’re careful about your cache management (typically using kernel-mode code and the same sort of memory semantics required when doing DMA transactions).

Mapping the memory as uncacheable and using appropriate barriers suffices (privilege is not needed, though typically baremetal code tends to run privileged anyway).

I’d recommend just using the PRU data memory space for anything like
semaphores or mailboxes. The memory is already mapped with the proper
access flags to avoid ARM side caching issues, and the PRUs can access
the memory without stalling.

Well in a baremetal application nothing is “already mapped”. In fact an issue here is that the PRU memory space resides within the same 1MB section as peripherals, so if both are accessed from the A8 you’d need to set up a page table for that section to make PRU memory space normal uncacheable while making the peripheral space device-type. (Well, you don’t have to, but if you care about performance…)
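
A rough sketch of what such a second-level table could look like, using the ARMv7-A short-descriptor format with TEX remap disabled. The function names are hypothetical, and the address layout used in the usage note (PRU-ICSS section at 0x4A300000, with the data and shared RAMs inside its first 128 KB) is my reading of the memory map and should be checked against the TRM. Enabling the MMU, setting up the domain register, and TLB maintenance are not shown.

```c
#include <assert.h>
#include <stdint.h>

#define L2_ENTRIES 256u     /* 256 small pages of 4 KB cover one 1 MB section */
#define PAGE_BYTES 0x1000u

/* Small-page descriptor bits (ARMv7-A short-descriptor, TEX remap off). */
#define PTE_XN      (1u << 0)          /* execute never                       */
#define PTE_SMALL   (1u << 1)          /* descriptor type: small page         */
#define PTE_B       (1u << 2)
#define PTE_AP_RW   (3u << 4)          /* AP[1:0]=11, AP[2]=0: full access    */
#define PTE_TEX(x)  ((uint32_t)(x) << 6)

#define PTE_DEVICE          (PTE_B)      /* TEX=000 C=0 B=1: shareable Device   */
#define PTE_NORMAL_UNCACHED (PTE_TEX(1)) /* TEX=001 C=0 B=0: Normal, uncacheable */

/* Fill a second-level table for one 1 MB section: pages inside
 * [ram_offset, ram_offset + ram_bytes) become Normal uncacheable,
 * everything else in the section becomes Device memory. */
static void build_l2_table(uint32_t *l2, uint32_t section_base,
                           uint32_t ram_offset, uint32_t ram_bytes)
{
    assert(l2 != 0);
    assert((section_base & 0x000FFFFFu) == 0u);  /* must be 1 MB aligned */
    for (uint32_t i = 0u; i < L2_ENTRIES; i++) {
        uint32_t offset = i * PAGE_BYTES;
        int in_ram = offset >= ram_offset && (offset - ram_offset) < ram_bytes;
        uint32_t type = in_ram ? PTE_NORMAL_UNCACHED : PTE_DEVICE;
        l2[i] = (section_base + offset) | type | PTE_AP_RW | PTE_SMALL | PTE_XN;
    }
}

/* First-level "page table" descriptor pointing at a 1 KB aligned L2 table. */
static uint32_t l1_page_table_entry(uint32_t l2_phys, uint32_t domain)
{
    assert((l2_phys & 0x3FFu) == 0u);
    return (l2_phys & 0xFFFFFC00u) | ((domain & 0xFu) << 5) | 0x1u;
}
```

Under the assumptions above, a call like `build_l2_table(table, 0x4A300000u, 0u, 0x20000u)` would mark the PRU RAM pages as normal uncacheable and leave the PRUSS control registers device-type; the first-level entry for that section is then replaced with the value from `l1_page_table_entry()`.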

I do agree with having the PRUs stick to PRU memory as much as possible.

I looked at your TI post, but it doesn’t make much sense to me. I’m not
very familiar with starterware, or with the spinlock register you refer
to. I can say that the ARM atomic bus transactions are unlikely to work
properly between the PRU and the ARM if you’re using the PRU shared
memory, but there are many other synchronizing constructs you can use.

I’m pretty sure that ARM atomic bus transactions (whether the old locked SWP or the newer load/store-exclusive) will not accomplish anything useful, if they get anywhere beyond the CPU boundary at all.

TI is however referring to the hwspinlock peripheral which (along with the mailbox peripheral) is specifically intended for inter-core synchronization. I personally would not be eager to use hwspinlock though.

I’d also try to go for unidirectional messaging like you’re saying. The mailbox peripheral looks quite reasonable for inter-core notification, especially if this is infrequent and you want to avoid polling, though I think you also may be able to do that purely with the rather elaborate PRUSS interrupt controller.

When writing from the A8 to uncacheable memory, do remember to finish off with a memory barrier instruction since the A8 is allowed to (and does) buffer writes indefinitely otherwise.
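
A minimal sketch of that pattern, assuming GCC-style inline assembly and a build targeting ARMv7-A. The helper names are hypothetical, and I am assuming DSB is the barrier you want here; the non-ARM branch exists only so the snippet can also be compiled on a host for testing.

```c
#include <assert.h>
#include <stdint.h>

/* Drain outstanding writes. On the A8 this uses DSB; on any other
 * target it falls back to the compiler's generic full barrier. */
static inline void write_barrier(void)
{
#if defined(__arm__)
    __asm__ volatile ("dsb" ::: "memory");
#else
    __sync_synchronize();   /* host-build fallback, not the A8 code path */
#endif
}

/* Store a value to an uncacheable location (for example a word in
 * PRU shared RAM) and make sure it has actually left the write buffer. */
static void publish_u32(volatile uint32_t *slot, uint32_t value)
{
    assert(slot != 0);
    *slot = value;
    write_barrier();
}
```

When several words are written as one message, a single barrier after the last store is enough as long as the word the PRU polls on is written last.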

Greetings All,

Wow, I’m glad this has generated so much conversation. Thanks to everyone who has chimed in.

To Rick M, one of the things that attracted me to the BBB was that it has several available UARTs, but I also need things to run in a deterministic fashion, since I need to control an array of servos and updating needs to happen 128 times a second, which means a several dozen byte packet going out that frequently. After reading through a bit more in the TRM about the PRU UART, I don’t think a PRU UART will be feasible since it looks like they top out at around 300 kbit/s, and I need a megabit. I’m hoping things will be sufficiently deterministic since I’m running bare metal, and will drive the update loop with a timer interrupt and have the UART just feed things out as fast as the line will consume it. I know things will run more slowly if I don’t use caching, but if I disable caching, does that eliminate any pipelining? I’m a noob when it comes to pipelining and caching, since I’ve only ever hacked on AVR microcontrollers and a Cortex M3, where those weren’t considerations. I’m a line of business programmer in my day job :(.

Matthijs, does EDMA offer that big a performance boost? Most of my background up to this point has been just coding things for handling hardware and timer interrupts and UART communication. I’m an extreme noob when it comes to the more involved hardware stuff like DMA. Does going from the PRU to DDR pass over the L3 interconnect whether it’s DMA or regular DWORD by DWORD assignment? I’m figuring this will have to pump about 9 MB a second to DDR, but with each write being a DWORD, this should only be about one write every 430 clock cycles of the main core at 1 GHz (assuming my math is correct). I have to admit, my head is swimming with some of what you wrote, so I definitely need to crack the books harder. If you know of any good references on MMUs, caching, and pipelining for beginners, let me know (I also need to educate myself more on kernel programming). I just imagine there has to be some good way to get good throughput from the PRUs to the rest of the system, otherwise the PRUs wouldn’t be very useful to the rest of the system, but again, I may just be naïve.

In the meantime, I’m just finishing getting my development environment set up for everything, since I’m using GCC with Eclipse as my IDE (up until now, while learning my way around Starterware and the PRU tools, I’ve just been using notepad and the command line >_<). I’ve got it set up now to build the PRU code, convert it to C header files, and include it in my main code, which it can compile into the final bin and use memcpy to load and start the PRU at run time (primitive, but it works for my purposes right now). Now I’m going to start writing code to try reading the camera, and I’ll report back my results. Maybe I’ll eventually take the dive into Linux (I’ve only waded in up to this point).
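
For anyone following along, the load-and-start step might look roughly like the sketch below. The function name is hypothetical, the offsets are my reading of the AM335x PRU-ICSS memory map (verify against the TRM), and the control values 1 and 2 mirror what I recall the Linux-side loader in am335x_pru_package writing to halt and to enable a PRU. It assumes the PRU-ICSS clock has already been enabled and the subsystem released from reset, and it uses a word loop as a stand-in for memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets inside the PRU-ICSS (assumed, check the TRM). */
#define PRU0_CTRL_OFFSET 0x22000u   /* PRU0 CONTROL register             */
#define PRU0_IRAM_OFFSET 0x34000u   /* PRU0 instruction RAM              */
#define PRU_IRAM_BYTES   0x2000u    /* 8 KB of instruction RAM per PRU   */

#define PRU_CTRL_SOFT_RST_N (1u << 0)
#define PRU_CTRL_ENABLE     (1u << 1)

/* Copy a firmware image (an array of 32-bit opcodes, e.g. from a
 * generated C header) into PRU0 instruction RAM and start it.
 * `pruss` is the base of the PRU-ICSS as seen by the caller.
 * Returns 0 on success, -1 if the image does not fit. */
static int pru0_load_and_start(volatile uint8_t *pruss,
                               const uint32_t *code, size_t words)
{
    assert(pruss != 0 && code != 0);
    volatile uint32_t *ctrl = (volatile uint32_t *)(pruss + PRU0_CTRL_OFFSET);
    volatile uint32_t *iram = (volatile uint32_t *)(pruss + PRU0_IRAM_OFFSET);

    if (words > PRU_IRAM_BYTES / sizeof(uint32_t))
        return -1;

    *ctrl = PRU_CTRL_SOFT_RST_N;    /* ENABLE=0: halt the core before touching IRAM */

    for (size_t i = 0; i < words; i++)
        iram[i] = code[i];          /* word-wise copy into instruction RAM */

    *ctrl = PRU_CTRL_ENABLE;        /* ENABLE=1 with SOFT_RST_N=0: reset and run */
    return 0;
}
```

On the board, `pruss` would be the physical base of the PRU-ICSS (0x4A300000 on the AM335x, as far as I know); in a host test it can simply point at a large enough array.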

Thanks, again to everyone!

Hey Bill,

If you need determinism, and if you decide to run Linux (or maybe just experiment), you can always look into Xenomai.

Keep in mind that I have no hands-on experience with it personally. But it could be as easy as writing a current image to an SD card and apt-get installing one of the latest Xenomai kernels, followed by learning about Xenomai of course . . . something I’ve been wanting to do myself, but have not gotten around to.

Even easier, the Machinekit images run Xenomai "out of the box", so no
messing with installing kernels and configuring the run-time
environment, just boot and start playing:

http://elinux.org/Beagleboard:BeagleBoneBlack_Debian#BBW.2FBBB_.28All_Revs.29_Machinekit

Hey Charles,

Is there any in depth documentation on machinekit ? Preferably all in one source . . . As the documentation implementers for machinekit do not seem to get that developers do not enjoy a seemingly endless round-robin of pointless links . . .

<heh>

The documentation is very much a work in progress, but what's
available in the form of official docs are in the github repo
(separate from the code):

https://github.com/machinekit/machinekit-docs

Much of this is still from LinuxCNC and while it's getting updated,
given the speed of code changes at the moment the docs are a bit behind.

If you're looking for details on the BBB/Xenomai install, that's not
really within the realm of the Machinekit docs repo. The best place
to look for the details and "secret sauce" of building a working image
is to actually grab the build scripts from github. Robert Nelson is
now building the Machinekit images as part of his "universal SoC build
farm", so the Machinekit build scripts are right next to (and
virtually identical to) the scripts used to craft the other BeagleBone
images:

https://github.com/RobertCNelson/omap-image-builder

To make a Machinekit image, just:

./RootStock-NG.sh -c machinekit-debian-wheezy

...like it says at the bottom of the readme.md file.

Thanks for your answer Charles. However, what I would like to find out is how Machinekit is different from, say, Debian. Not so much the difference between distros (because I’m thinking it’s “just” a kernel with some tools), or determinism, but how one uses it to its full advantage.

So for all I know, one would use it like you’d use Linux in general. My guess would be this is not the case however. Also, knowing some guidelines while developing deterministic code would be very handy too.

So basically, stuff that an experienced developer should know when using machinekit, but doesn’t from lack of experience with machinekit. Which libc is expected . . . etc.

Does that make any sense ? Maybe I’m looking in the wrong place so far ?

I'm not sure exactly what you're using the UART for. Are your servos controlled via serial packets of some kind? Or are they typical hobby PWM servos? If the former, then I would have thought using a UART on the ARM core (not the PRU) would be the best way to go. I'm assuming they can do a megabit, although that probably requires DMA.

It sounds like you're using the UART to communicate with the servos, and at a high rate. I can see why you'd want the timing to be right in that case. I don't really have any idea what the caching effects are.

Good luck!