Load and execute PRU code from bare-metal application

There are serial hobby servos nowadays. Futaba SBus is one example.

If you're looking for details on the BBB/Xenomai install, that's not
really within the realm of the Machinekit docs repo. The best place
to look for the details and "secret sauce" of building a working image
is to actually grab the build scripts from GitHub. Robert Nelson is
now building the Machinekit images as part of his "universal SoC build
farm", so the Machinekit build scripts are right next to (and
virtually identical to) the scripts used to craft the other BeagleBone
images:

https://github.com/RobertCNelson/omap-image-builder

To make a Machinekit image, just:

./RootStock-NG.sh -c machinekit-debian-wheezy

...like it says at the bottom of the readme.md file.

Thanks for your answer Charles. However, what I would like to find out is
how Machinekit is different from, say, Debian. Not so much the difference
between distros (because I'm thinking it's "just" a kernel with *some*
tools), or determinism, but how one uses it to their full advantage.

The Machinekit BBB image *IS* Debian, just with a Xenomai capable
kernel and some packages to make use of it pre-installed.

So for all I know, one would use it like you'd use Linux in general. My
guess would be this is not the case however. Also, knowing some guidelines
while developing deterministic code would be very handy too.

So basically, stuff that an experienced developer should know when using
Machinekit, but doesn't know for lack of experience *with* Machinekit. Which
libc is expected . . . etc.

Does that make any sense ? Maybe I'm looking in the wrong place so far ?

That makes sense, but is _way_ beyond the scope of a simple email,
particularly since I don't know how much you do or don't know about
coding for real-time.

If you're wanting to easily write deterministic code, you might want
to use PREEMPT_RT, which works really well on the x86 architecture and
is coming along on the ARM architecture. This allows you to write
"normal" C code, including making kernel syscalls (directly or via
libraries like libc) without losing real-time performance.

Xenomai runs in its own domain, and while you can call routines in
the Linux kernel, doing so breaks any guarantee of hard real-time
performance. So you have to write Xenomai drivers or directly talk to
any hardware you're expecting to have real-time performance.

Note that Machinekit is a project to control motors and other physical
things (ie: machines) that runs under several possible real-time
environments (currently Xenomai, PREEMPT_RT, RTAI, and even plain
Posix w/o real-time guarantees). The Machinekit images for the BBB
are simply a ready-to-run version of the RCN's BBB Debian builds with
the Xenomai kernel and Machinekit packages pre-installed for ease-of-use.

Thanks Charles. Your answer pretty much answered all my questions. I guess I could have been more succinct in saying that I just wished to know if looking into Xenomai or Machinekit was a waste of time for my own purposes, which it now does seem to be. For now.

Pretty much all I wanted was some form of Linux, that ran on a “tighter schedule”. PREEMPT_RT sounds like where I may want to be.

I do know a bit about real-time coding, but would definitely not consider myself an expert. In the context of Linux . . . all I know is by reading. No hands on.

Hi William,

Thanks for the suggestion. Actually, I had looked a little bit at Xenomai, and also ChibiOS and a couple of other RTOSes. I was worried about adding complexity for myself, not really having experience in compiling kernels and not having much experience programming within Linux or using other tools within Linux. Also, I’m kind of a control freak. But I do like to try to learn something new with each project I work on, so I’ll probably fight with the bare metal a little while longer, and then seriously consider moving to an OS based environment. Since Linux controls a lot more of the world than most people give it credit for, I really should try to get fluent with it (I could add it to my resume then). It would also be nice to abstract away other things that an OS already has code for. For example, I would like to add a wireless USB NIC to the BBB, and if there is one that already has a driver and I can just write code to use it, that would be awesome.

Thanks Charles, I hadn’t seen that yet. I’m going to fight with the bare metal a little while longer, and then probably start exploring this.

Hi Rick,

The servos I am using are Robotis Dynamixel servos. The servos themselves have (I believe) Atmega8 controllers in them to handle the actual PWM details. They use a 3-wire interface: 1 power, 1 ground, and 1 half-duplex 1 Mbps serial line. I don’t know how time-variation tolerant my setup would be. The way I’ve done it on my other setups is to use a timer-based interrupt that fires about every 8 milliseconds that updates the target values in memory, then pushes them to a circular buffer that feeds the UART, with a transmit-register-empty interrupt to pull the next byte. With the 1 Mbps line used by the servos being an order of magnitude slower than the CPU updating things, I was thinking maybe that would have the effect of smoothing out any hiccups, so if there was a millisecond or two of variation, it might not have a big impact. The other controllers I am using probably interface more immediately with their memory, but they are also running much more slowly (72 MHz being the fastest) and I haven’t had to resort to anything like DMA for them, so I’m hoping that since the AM3359 runs so much faster than that, I won’t have to here either (although I would like to get familiar with the workings of that at some point, and as the complexity of the code grows and the demands on the system increase, I may have to).
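In code, that scheme looks roughly like this (a sketch only; the uart_* and servo helpers are hypothetical placeholders for the real hardware layer, and the buffer is single-producer/single-consumer so no locking is needed):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical hardware-layer hooks: */
    extern void uart_tx_irq_enable(void);
    extern void uart_tx_irq_disable(void);
    extern void uart_write_thr(uint8_t b);      /* write transmit hold register */
    extern void update_servo_targets(void);     /* recompute goal positions */
    extern void queue_dynamixel_packet(int id); /* builds a packet via tx_enqueue() */
    #define NUM_SERVOS  18

    #define TX_BUF_SIZE 256  /* power of two makes the index wrap a cheap mask */
    static volatile uint8_t  tx_buf[TX_BUF_SIZE];
    static volatile uint16_t tx_head, tx_tail;  /* head: producer, tail: ISR */

    /* Queue one byte for transmission; returns false if the buffer is full. */
    static bool tx_enqueue(uint8_t b)
    {
        uint16_t next = (tx_head + 1) & (TX_BUF_SIZE - 1);
        if (next == tx_tail)
            return false;               /* full: drop or retry, caller's choice */
        tx_buf[tx_head] = b;
        tx_head = next;
        uart_tx_irq_enable();           /* make sure the drain interrupt is live */
        return true;
    }

    /* ~8 ms periodic timer ISR: refresh targets, queue the servo packets. */
    void timer_isr(void)
    {
        update_servo_targets();
        for (int id = 0; id < NUM_SERVOS; id++)
            queue_dynamixel_packet(id);
    }

    /* Transmit-register-empty ISR: feed the next byte onto the 1 Mbps line. */
    void uart_tx_isr(void)
    {
        if (tx_tail == tx_head) {
            uart_tx_irq_disable();      /* nothing left to send */
            return;
        }
        uart_write_thr(tx_buf[tx_tail]);
        tx_tail = (tx_tail + 1) & (TX_BUF_SIZE - 1);
    }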

Thanks!

After reading through a bit more in the TRM about the PRU UART, I don't
think a PRU UART will be feasible, since it looks like they top out at
around 300 kbps.

Hmm, where'd you get that number? The PRU UART looks like the highest
performance UART: it receives a 192 MHz functional clock and the datasheet
specs 12 Mbps max (that would be using a /1 divider and 16x oversampling).
The other UARTs receive a 48 MHz functional clock and spec max 3.6864 Mbps
(/1 divider and 13x oversampling, so that would get you 3.6923 Mbps to be
precise).
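For reference, the arithmetic is just baud = fclk / (divisor * oversampling); a quick sketch with the numbers above:

    #include <stdint.h>

    /* baud = fclk / (divisor * oversampling), so for a target baud rate: */
    static inline uint32_t uart_divisor(uint32_t fclk, uint32_t oversampling,
                                        uint32_t baud)
    {
        return fclk / (oversampling * baud);
    }

    /* PRUSS UART:  192000000 / (16 * 1)        = 12000000 -> 12 Mbps max    */
    /* other UARTs:  48000000 / (13 * 1)        =  3692307 -> ~3.69 Mbps max */
    /* 1 Mbps Dynamixel on a 48 MHz UART, 16x:                               */
    /*   divisor = 48000000 / (16 * 1000000)    = 3 (exact, no baud error)   */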

I've also noticed that UART0 cannot cope with too many consecutive writes, even
if there's enough fifo space: the fifo pointers seem to get corrupted or
something (I'm guessing a bug in the synchronization logic between the
interface and functional clock domains). This only appears as an issue when
trying to rapidly fill the UART fifo from the cortex-a8 in a tight loop
(using posted writes). Inserting some dummy register write between
consecutive data bytes fixes the issue, as does slowing down the loop in
some other way. Using EDMA would probably also solve the problem.
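A sketch of that workaround, assuming the 16550-style register layout from the TRM (THR at offset 0x00, scratchpad at 0x1C; double-check the offsets for your UART):

    #include <stdint.h>

    #define UART0_BASE 0x44E09000u          /* AM335x UART0 base, per the TRM */
    #define UART_THR   (*(volatile uint32_t *)(UART0_BASE + 0x00))
    #define UART_SPR   (*(volatile uint32_t *)(UART0_BASE + 0x1C))  /* scratchpad */

    /* Fill the TX FIFO from a buffer, with a dummy register write between
       data bytes to avoid the back-to-back posted-write issue above. */
    static void uart_fifo_fill(const uint8_t *buf, uint32_t n)
    {
        while (n--) {
            UART_THR = *buf++;
            UART_SPR = 0;       /* dummy write: spaces out the data stores */
        }
    }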

I haven't tested the other UARTs, but I'd guess the other UARTs will have
the same behaviour except for the PRUSS UART (due to ick/fck ratio).

I know things will run more slowly if I don't use caching, but if I disable
caching, does that eliminate any pipelining? I'm a noob when it comes to
pipelining and caching, since I've only ever hacked on AVR microcontrollers
and a Cortex M3, where those weren't considerations.

Heh, yeah I personally went from ARM7TDMI-based microcontrollers to the
DM814x, a Cortex-A8 based TI SoC closely related to the AM335x... quite a
bit of culture-shock there. "Wait, is this still an ARM processor?" o.O

I'm not sure what you mean by "eliminate any pipelining": pipelining is an
intrinsic part of the design of almost any modern CPU, even the AVR
(2-stage) and Cortex-M3 (3-stage), although they pale in comparison to the
Cortex-A8 (14-stage, plus 10 more for NEON instructions). The PRU is a
notable exception for being non-pipelined, which is deeply impressive
considering it runs at 200 MHz and has 32-bit compare-and-branch
instructions. In general pipelining becomes most visible in unpredictable
branches, which take 1 cycle on PRU, 2 on AVR, 3 on the M3, and 14 on the
A8.

However, especially since the A8 executes strictly in-order, memory
accesses can stall the pipeline for quite a while, and I suspect this is
what you mean. This is highly dependent on memory region attributes
(including cacheability), which also means setting up MMU and caches is
absolutely essential on the Cortex-A8. This isn't very hard: for a
bare-metal application it typically suffices to set up the section
translation table with the desired attributes (see
http://community.arm.com/docs/DOC-10098 for an example), set the L2 cache
enable bit in the Auxiliary Control Register (if not already set), and set the
M, C, Z, and I bits of the Control Register (Z and I are already set by
the boot ROM, iirc).
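Condensed into code, the sequence looks something like this (a bare-metal sketch assuming a flat mapping, 512 MB of DDR at 0x80000000 as on the BBB, and caches left invalidated by the ROM; adapt the descriptor attributes to your memory map):

    #include <stdint.h>

    /* First-level translation table: 4096 x 1 MB sections, 16 KB aligned. */
    static uint32_t __attribute__((aligned(16384))) ttb[4096];

    void mmu_and_caches_on(void)
    {
        uint32_t r;

        for (uint32_t i = 0; i < 4096; i++) {
            if (i >= 0x800 && i < 0xA00)
                /* DDR: normal memory, write-back cacheable (TEX=001, C=1, B=1) */
                ttb[i] = (i << 20) | 0x1C0E;
            else
                /* everything else (peripherals, OCMC, PRUSS...): shared device */
                ttb[i] = (i << 20) | 0x0C06;
        }

        asm volatile("mcr p15, 0, %0, c2, c0, 0" :: "r"(ttb));  /* TTBR0 = table  */
        asm volatile("mcr p15, 0, %0, c3, c0, 0" :: "r"(1));    /* domain 0: client */
        asm volatile("mcr p15, 0, %0, c8, c7, 0" :: "r"(0));    /* invalidate TLBs */

        asm volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(r));    /* ACTLR */
        r |= (1u << 1);                                         /* L2EN */
        asm volatile("mcr p15, 0, %0, c1, c0, 1" :: "r"(r));

        asm volatile("dsb");
        asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(r));    /* SCTLR */
        r |= (1u << 0) | (1u << 2) | (1u << 11) | (1u << 12);   /* M, C, Z, I */
        asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(r));
        asm volatile("isb");
    }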

One of the easiest ways to murder write-performance is by marking memory as
"strongly ordered", which is the default for data access if the MMU is
disabled. This makes the cpu wait synchronously on writes, so then you're
looking at about 150-200 ns (= cycles @ 1 GHz) for each write, depending on
the "ping time" from the cpu to the target. In contrast, writes to device
or normal memory are buffered and therefore take 1 execution cycle as long
as the buffer isn't full. The limiting factor in draining the buffer is
that the cpu can only have one device write and one normal uncacheable (or
write-through) write outstanding on the AXI bus, but almost immediately
(afaik as soon as the write is accepted by the async bridge to the L3) the
write is "posted" (i.e. becomes fire-and-forget) and acked to the cpu.

In case of normal memory, small writes to sequential addresses are
automatically coalesced to larger writes when possible. This isn't done for
device and strongly-ordered memory, so using aligned dword (strd) and
quadword (neon) writes when possible will get you significant performance
gain there.
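For example (sketches; whether GCC actually emits a single strd for the plain 64-bit store is worth verifying in the disassembly):

    #include <stdint.h>
    #include <arm_neon.h>

    /* One 8-byte bus write instead of two 4-byte ones; dst must be
       8-byte aligned. */
    static inline void write_dword(volatile uint64_t *dst, uint64_t v)
    {
        *dst = v;   /* in ARM state GCC normally emits strd here */
    }

    /* One 16-byte NEON store. */
    static inline void write_qword(uint8_t *dst, uint8x16_t v)
    {
        vst1q_u8(dst, v);
    }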

In case of non-Neon reads, the cpu has to wait for the data to become
available, so caches obviously have a huge impact: L1 cache hit = 1 cycle,
L1 miss L2 hit = 9 cycles, L2 miss (or uncacheable) = ping time to target.
If they miss the caches, reads from normal memory still have the benefit of
overtaking buffered writes, while device reads aren't allowed to overtake
device writes. The situation with Neon is more complicated and I never
fully figured out what goes on there. For example, some timings for a
simplistic memory copy using Neon (vld1, subs, vst1, bne) on a DM814x (A8 @
600 MHz) targeting DDR3-800:

from strongly ordered to strongly ordered: 17.76 cycles/byte
from device to device: 12.77 cycles/byte
from device to uncacheable: 9.02 cycles/byte
from uncacheable to uncacheable: 1.31 cycles/byte
from uncacheable to device: 1.10 cycles/byte
from L2 miss to uncacheable: 1.06 cycles/byte
from L2 miss to device: 0.99 cycles/byte
from L2 hit to device or uncacheable: 0.50 cycles/byte

"L2 miss" refers to the first access of each cacheline (i.e. one out of
four loads).
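The measured loop was essentially this (GCC inline-asm sketch; len must be a non-zero multiple of 16):

    #include <stddef.h>

    /* Copies 16 bytes per iteration: vld1 / subs / vst1 / bne. */
    static void neon_copy(void *dst, const void *src, size_t len)
    {
        asm volatile(
            "1:  vld1.8  {q0}, [%1]!   \n"   /* load 16 bytes, post-increment */
            "    subs    %2, %2, #16   \n"
            "    vst1.8  {q0}, [%0]!   \n"   /* store 16 bytes, post-increment */
            "    bne     1b            \n"
            : "+r"(dst), "+r"(src), "+r"(len)
            :
            : "d0", "d1", "cc", "memory");
    }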

Of course for most peripheral targets caching is not an option. You could
probably often get away with marking them normal uncacheable instead of
device, though this may require introducing memory barriers and I don't
know how expensive they are. It would also be highly Cortex-A8 specific:
architecturally an ARM CPU is allowed to perform arbitrary reads from normal
memory, and many perform speculative reads for example.

Matthijs, does EDMA offer that big a performance boost?

After giving it more thought I'm actually not sure whether EDMA would
achieve higher throughput than writes by a PRU core, since PRU is a direct
initiator on the L3F while EDMA has to go through the L4HS to reach PRUSS.
Having EDMA perform the transfer would however free up PRU's precious time.
After setting things up, PRU could request EDMA transfers with a single
write to EDMA, or using the PRUSS interrupt controller.
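The single-write trigger looks like this from either the A8 or PRU side (base address and ESR offset as in the AM335x TRM, but verify; assumes the channel's PaRAM set is already configured):

    #include <stdint.h>

    #define EDMA3CC_BASE 0x49000000u   /* AM335x TPCC */
    #define EDMA3CC_ESR  (*(volatile uint32_t *)(EDMA3CC_BASE + 0x1010))

    /* Kick a previously configured DMA channel: one posted write. PRU can
       do the same, since its OCP master port sees the global address map.
       ESRH at +0x1014 covers events 32-63. */
    static inline void edma_trigger(unsigned ch)
    {
        EDMA3CC_ESR = 1u << ch;
    }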

Another point of some importance is that since EDMA uses non-posted writes
you would actually be sure the data has reached its destination when it
signals completion. If PRU writes data to RAM, then signals the A8 using an
interrupt, which subsequently proceeds to read from the same location, it
is not guaranteed to actually read the data written by PRU: this data may
still be in some queue en route from PRUSS to EMIF, while the A8 has a
private hotline to EMIF that bypasses it.

For other situations the benefits are more clear: for example it can read
data from a peripheral in response to its dma request and directly deliver
it into PRUSS, and send notification to PRU when a certain amount of data
has been transferred. This can save PRU from having to perform reads over
the L3 interconnect.

EDMA also has a staggering amount of bandwidth. While its reads are
limited by latency just like other initiators, the max size of a single
access by EDMA is 64 bytes, so for example it can slurp the whole content
of an ADC FIFO with a single read access. It is synchronous to the L3,
avoiding the latency of an async bridge. Although it uses non-posted
writes, it can have four writes + a read outstanding simultaneously. And
all this describes a single Transfer Controller (TC), EDMA has three of
these. Total theoretical bandwidth is just under 8 gigabytes per second,
though I don't know how much is achievable in practice.

I think I had more stuff I wanted to say, but this email is already long
enough and been sitting in Drafts for too long, so I'll just press "Send"
now ;-)

Matthijs

Impressive analysis. Thanks.

BTW there is some interesting BBB/Xenomai/PRU stuff here as well: the Bela repository on soundsoftware.ac.uk.

I am looking at a robotics project where I need commercial (full
industrial) reliability and and and.

Just getting into this stuff so I have spent most of my time so far just
reading (hopefully learning and not asking superfluous questions! grin!!).

TIA

Dee

This should not be a problem, these guys here are offering something very
similar: http://halaser.eu/e1701m.php

Very interesting.

I have just started working through the documentation.
Do you know of any other boards like this?

Dee

Wow. At this point I feel like I should be paying you tuition ^_^.

Apparently while I was falling asleep reading the TRM in bed late at night, I totally misread and misinterpreted the UART divisor tables on pg. 238. Thanks for pointing that out, and for the heads-up about the pointer corruption issue. I’ll probably still try to use one of the non-PRU UARTs first (in case I want to dedicate the other PRU to other sensors or processing), and fall back to the PRU one if I’m having too much trouble getting smooth real-time operation.

Before getting into microcontroller programming for robots about 2 years ago, I hadn’t done any hardware level programming since I was a kid 30 years ago on 6502 processors. Didn’t really have to think much about pipelining, caching, or memory management back then ^_^. I do line of business desktop and web programming for my day job.

I’m probably using the term pipelining too casually/incorrectly. I know the hardware will simultaneously execute one instruction while decoding the next one and fetching the one after that. I was kind of including dealing with what is loaded in cache, how things slow down with cache misses, etc… My first couple of ‘hello world’ type programs I’ve written for this didn’t even use caching, and even now I’m only using instruction caching (since the SDK code for that is super easy and enabling it sped things up considerably). I tried to set up the MMU, but it was hanging my program, and I didn’t want to get bogged down in trying to debug that yet, at least not until I learn a LOT more.

The way I am trying to set things up now, just so I can see if the camera is working or will work, the PRU will only ever write to the picture memory, and the main core will only ever read it. So if the main core stalls while reading it, that is no big deal. What will be critical is that the PRU can write the data coming from the camera (at about 9MB a second) to memory dependably.

I have a lot more to say/ask too, and I can’t thank you enough for all the help and info you’ve given me so far, but I’m writing this from work and I think if I want to keep this job for a while longer I better get back to it. Talk more soon…

SUCCESS!! I was able to get the OV7670 camera connected, get the PRU reading it, and get the results over to my PC for display (although because I’m pushing to the PC via serial port, I can only see stills and not video, and the stills take about 20 seconds to transfer (640 * 480 greyscale (I’ll work on the color later) image going byte at a time across a 115200 serial connection)). I discovered some of my initial (and persistent) problems were with poor terminations in my wiring (makes me inclined to want to go back and try it again as main core code with the GPIOs). It appears the L3 can consume writes from the PRU fast enough to move to memory (although since this is the ONLY thing running on the device right now, I don’t know how it would degrade the operation of other code). This is really giving me the itch now to try to port this to run under Debian, so I can use the other facilities of the OS.

If anyone wants to look at (or laugh at) my code, you can see it here: http://sourceforge.net/p/bioloidfirmware/code/ci/master/tree/
in the ‘Beaglebone Firmware’ folder. Also, there is a Windows program in the Utilities folder called ‘UARTImageReceiver’ that I am using on the PC side to fetch the image. It transmits a character to the BBB, and when the BBB receives it, it dumps back the contents of the array, which the PC program builds into an image and displays.

The setup is really sloppy right now (as is the code), I’ll try to clean it up soon. Also, I’m using GCC and Eclipse. The way it is set up, you have to run the makefile with ‘pru_bin’ as the argument to build the PRU part, then run it without the argument to build the main program. I’m running the TI ‘bintoc’ program to convert the PRU program to arrays that I include in the main program, which I then load into PRU memory before starting the PRU. I tweaked ‘bintoc’ to take an extra argument to use as the name for the generated array. Because of this, and since I am directly writing to the address of my array in main memory from the PRU, there is some more compile-time craziness that is necessary. I have to compile everything, check the map file to see where the array gets placed, put that address into the PRU code, then compile it again. That also only makes either the debug or release version usable, but not both (since I’m not using a debugger anyway, I might just ditch the debug build).
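For anyone curious, the load-and-start step itself is tiny; a sketch using the PRU0 addresses from the AM335x TRM (assumes the PRU-ICSS clock and reset are already handled elsewhere; the pru0_code array names are stand-ins for the bintoc output):

    #include <stdint.h>

    #define PRUSS_BASE   0x4A300000u
    #define PRU0_IRAM    ((volatile uint32_t *)(PRUSS_BASE + 0x34000)) /* 8 KB */
    #define PRU0_CONTROL (*(volatile uint32_t *)(PRUSS_BASE + 0x22000))

    /* Arrays generated by bintoc from the assembled PRU binary (hypothetical): */
    extern const uint32_t pru0_code[];
    extern const uint32_t pru0_code_len;   /* bytes */

    void pru0_load_and_run(void)
    {
        PRU0_CONTROL = 0;                  /* hold PRU0 in reset (SOFT_RST_N=0) */
        for (uint32_t i = 0; i < pru0_code_len / 4; i++)
            PRU0_IRAM[i] = pru0_code[i];   /* word-sized writes into IRAM */
        PRU0_CONTROL = (1u << 1) | 1u;     /* ENABLE=1, release reset: run from PC 0 */
    }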

Finally, I included a huge chunk of the Starterware code directly in the project so I wouldn’t pollute my Starterware install (because I want to keep working through the examples), and so I could move the project around without breaking stuff. TI, please don’t sue me. If I need to remove something, let me know. I stole, er, borrowed liberally from a bunch of people, and will try to attribute properly as soon as possible. In case you don’t notice, I’m a slob and miss a lot.

Awesome Bill, that sounds great.

One thing I was thinking, and have been playing around with myself lately is . . . You could use websockets to push images / video out over the ethernet port. How one would implement that “bare metal” I am not sure. From within Linux it is pretty easy, with a few good libraries/APIs to play with. The one I’ve been experimenting with lately is Mongoose (https://github.com/cesanta/mongoose), an embedded web server, and they have another library called “Fossa” (https://github.com/cesanta/fossa), an async non-blocking multi-protocol networking library for C/C++, which is supposed to be cleaner.

Anyway, websockets are a protocol that can be used with or without a browser on the client side. So, if you wanted to write your own client for manipulating the video, you could. Fairly easily.
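A minimal sketch with the current Mongoose API (the event-handler signature has shifted between Mongoose versions, so treat this as the shape rather than copy-paste; frame_buf/frame_len are hypothetical stand-ins for the camera frame):

    #include "mongoose.h"

    extern const uint8_t frame_buf[];   /* hypothetical: latest captured frame */
    extern size_t frame_len;

    /* Upgrade requests on /video to websocket connections. */
    static void ev_handler(struct mg_connection *c, int ev, void *ev_data)
    {
        if (ev == MG_EV_HTTP_MSG) {
            struct mg_http_message *hm = (struct mg_http_message *) ev_data;
            if (mg_strcmp(hm->uri, mg_str("/video")) == 0)
                mg_ws_upgrade(c, hm, NULL);     /* HTTP -> websocket handshake */
            else
                mg_http_reply(c, 404, "", "not found\n");
        }
    }

    int main(void)
    {
        struct mg_mgr mgr;
        mg_mgr_init(&mgr);
        mg_http_listen(&mgr, "http://0.0.0.0:8000", ev_handler, NULL);

        for (;;) {
            mg_mgr_poll(&mgr, 30);              /* service all sockets */
            /* push the newest frame to every connected websocket client */
            for (struct mg_connection *c = mgr.conns; c != NULL; c = c->next)
                if (c->is_websocket)
                    mg_ws_send(c, frame_buf, frame_len, WEBSOCKET_OP_BINARY);
        }
    }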

Hi Bill. I want to do this too. I checked your sourceforge repo and could not find your OV7670 interface.
Maybe we can collaborate. Contact me through www.baremetal.tech

later…dd