Equivalent of PRU on main CPU

Hello everyone,

I find it a pity that the PRU runs at only 200 MHz rather than the full 1 GHz. I was wondering whether there are any Linux distributions (or non-Linux systems) that allow running real-time code on the main CPU. Of course, this would mean disabling every Linux feature not explicitly implemented in the code, but in many cases I think that could be desirable. Has anyone ever done something like that? I guess it boils down to using an extremely minimized kernel, or removing the OS entirely and running only the necessary code, but I am by no means an expert in low-level Linux programming and was wondering whether there is existing code out there that would make it easier for me to get started.
In the end, I would like to use the more powerful instructions of the main CPU, such as floating-point arithmetic, together with its higher speed, to implement PRU-like functionality for real-time DSP.

Thanks in advance!

rt linux: https://rt.wiki.kernel.org/index.php/Main_Page

sudo apt-get update
sudo apt-get install linux-image-4.1.3-ti-rt-r7
sudo reboot

but you're going to find the PRU is just faster and more deterministic...

Regards,

Lenny,

Hello, so no hands on but I have read about several technologies you may be interested in.

First, TI’s StarterWare; bare metal should be possible for this hardware.

Also, you could directly run an executable from U-Boot, which is the first (second?) stage bootloader for the official BeagleBone Debian images.

Thirdly . . . remoteproc could be used on a dual-core processor to run the second core bare-metal while the first core runs Linux. Or so I think I’ve read.

There are also RTOSes out there, and TI has one of its own, TI-RTOS I think it’s called. I’m not sure whether it can run on the Sitara processors; I would think it could, but I have not looked into it.

So I know of all this stuff, but not much in the way of “how”. remoteproc for the BeagleBone specifically has been discussed in the past on these groups, but there is not much information out there: a couple of Linux Documentation text files is all, and of course the kernel source itself.

Anyway, is this the kind of info you’re after, or is this too basic ?

Thanks Robert,

that RT Linux project seems very interesting, but as you say, it is still too high-level / multitasking to beat the PRU. So one would have to strip basically all the remaining functionality from RT Linux to arrive at something fast and deterministic at the ns timescale, I guess…

Thanks anyways!

Even that's not enough.

The "A" in the Cortex-A9 CPU means "Application". The processor core
is designed for high speed and *NOT* deterministic operation. There
are lots of tradeoffs in application processors made to allow high
speeds including pipelining, caching, branch prediction, out-of-order
execution, that all have a negative impact on the determinism of code
execution times. You will find the PRU, which was intentionally
designed for fixed operation times, will be _far_ more deterministic
than the ARM A9 core even though it's running at only 20% of the speed
(due in large part to the intentional lack of the advanced features
that _allow_ the A9 core to run at its GHz clock rate).

You can probably come close to the PRU performance determinism on a
Cortex-A core if you're willing to dedicate one core of a multi-core
part strictly to doing real-time, but the PRU will still probably work
better. Especially if you need high-speed I/O, where the single-cycle
latency to the direct PRU I/O pins will run _circles_ around anything
the ARM core can do trying to talk to the GPIO pins.
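For the "dedicate one core" approach, a common recipe on a multi-core Cortex-A part (a sketch only; the AM335x itself is single-core, the config file path varies by image, and `my_rt_task` is a placeholder name) is to isolate a core at boot and pin the real-time task onto it:

```shell
# Sketch: reserve CPU 1 for a single real-time task on a multi-core part.
# 1. Append to the kernel command line (e.g. via /boot/uEnv.txt on
#    BeagleBoard images; the exact file varies by distro):
#      isolcpus=1 nohz_full=1 rcu_nocbs=1
# 2. After rebooting, pin the task to the isolated core with FIFO priority:
taskset -c 1 chrt -f 80 ./my_rt_task
```

Even then, the caches, branch predictors, and interconnect arbitration described above still apply, so this narrows the jitter rather than eliminating it.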

Thanks for your post William,
the idea of starting an executable from U-Boot sounds very close to what I want, I guess. My question here would be: which document is the equivalent of the PRU Reference Guide, so that I can find out how to talk to the various hardware pieces such as memory, the inputs/outputs, the NEON core, etc., and which compiler would I have to use (ideally a C compiler with inline-assembly support)? And if available, a library with some useful functions such as accessing the serial port (USB) and maybe even Ethernet (though I guess that would require interrupts and all sorts of other overhead) would be just perfect!

But actually I have just now looked into StarterWare (http://processors.wiki.ti.com/index.php/StarterWare_02.00.01.01_User_Guide#Serial_Peripherals), and it really looks amazingly close to what I had in mind. There are lots of examples and I guess I’ll start testing some of them.

@Charles: Thanks for the warning :slight_smile: I’m still a noob when it comes to processor architecture. The application I have in mind (an FIR filter) is computationally intensive but does not need huge data throughput (a few MS/s would be enough, which I know I can delegate to the PRU if necessary). I found the idea of using the main processor appealing, as I read somewhere about its SIMD capability (doing 16 or 32 multiply-accumulates simultaneously, which would theoretically allow something like 16-32 Gflops, right?) and its floating-point arithmetic.

So if you confirm that all those advantages are lost somewhere in the communication between the core and the dedicated modules, that would be a pity, but it would indeed save me a lot of time :slight_smile:

And out of curiosity / for ease of later implementation / the number of available input-output ports: what delay and how many instructions can I expect for exchanging one or multiple bits between the main processor and a GPIO port? More than 10 cycles?

Thanks so much for your suggestions and help!

Hello again Lenny,

Yeah, I have not looked into U-Boot very deeply. But it does have at least some Ethernet, USB, and serial functionality. I say “some” because I’m not sure how extensive it is. For example, with the built-in Ethernet support you can boot over TFTP and NFS, but beyond that . . . not sure what can be done with it. Also I’m pretty sure there is some I2C functionality built in too. No personal hands-on with it though . . .

Anyway, yeah, I’d listen to Charles and Robert, as they’ve probably both had their hands in the lower-level stuff more than I have. I was just tossing some things out there for you to read about when and if you get the time . . . as I was in much the same boat you are in, but a couple of years ago. I was also thinking maybe super-low-level bare metal, but decided a good while ago that it was not really worth it for me. PRUs? Sure, but bare-metal Sitara . . . yeah, not for me, hehe.

I just thought I’d toss this out there though.

Over the last 2-3 months I’ve been working on a project that involves SocketCAN, forming NMEA 2000 fast packets from SocketCAN frames in real time, then pushing the data out as JSON through a webserver (web sockets), also in real time.

So up until last night I was working on all this in a quad-core, 4 GB RAM virtual machine (VirtualBox). canplayer was eating up around 58% of a single core, while my two processes were taking around 8% between the two. Quad core, 3 GHz.

Imagine my surprise last night when canplayer only used 2% CPU, and the two processes I’m working on used 0% - in fact, the webserver only shows up in atop about half the time - heh.

My point is: the BeagleBone may have only 1/3 the CPU frequency and a LOT less memory. It may also do some things, such as compiling large projects, slower - sometimes a lot slower. But do not let that fool you. It is still a beast.

Hehe, a beast indeed :slight_smile:

I downloaded the StarterWare software and I like it. I’ll summarize my current understanding, and if someone wants to correct me where necessary, I’ll be glad:
As far as I understand, StarterWare does not add any OS overhead, so you get to execute your code directly on the MPU - bare-metal access, so to say. I imagine the same can be accomplished by properly embedding your compiled file into a bootloader at the right place. The provided examples are reasonably clear; for example, to set a GPIO pin you find the instruction

        GPIOPinWrite(GPIO_INSTANCE_ADDRESS,
                     GPIO_INSTANCE_PIN_NUMBER,
                     GPIO_PIN_LOW);

Checking what is behind it, it is really a single instruction:
        HWREG(baseAdd + GPIO_CLEARDATAOUT) = (1 << pinNumber);
where the macro HWREG merely casts the address in brackets to a properly typed pointer. Using these examples is equivalent to a painful and time-consuming study of the TRM, where you can find the addresses of all those registers.

So as far as I understand, this operation should compile down to the equivalent of a single assembler instruction. Two questions now remain:

  1. How many cycles does it take the MPU to execute this instruction (or any other one)? This is not specified in the TRM, but I am sure it is somewhere in the ARM documentation.
  2. How long does it take until the value arrives at the output pin?

The second question aims at Charles’ concern. Again, from the TRM I deduce that, for example, the GPIO modules are connected to the MPU through the L4 interconnect. The interface clock rate is specified in the GPIO chapter of the TRM to be 100 MHz. Now, I do not understand how this bus works in detail, but the fact that it can handle several sources and destinations simultaneously raises the concern that there is a buffer involved that comes with some extra latency. But I would assume that by running only the one code snippet that I define, with no OS processes in the background, all other devices are disabled and the bus is therefore only used when my program uses it. So my packets should get top-priority handling and arrive with minimal delay that is deterministic up to clock desynchronization. I would estimate a delay of a few interface clock cycles, so a latency of less than - say - 1/(20 MHz) = 50 ns. Is my reasoning correct here, or am I forgetting something?

I guess my next steps will be reading up on the MPU itself, to see whether one can hope to implement really fast algorithms at a very low level here. If I’m not mistaken, this is the document to read. A first glance tells me that maybe I’ll understand what the A in Cortex-A9 actually means at a low level :slight_smile:

If there is a catch that I am not aware of - thanks for letting me know!

> @Charles: Thanks for the warning :slight_smile: I’m still a noob when it comes
> to processor architecture. The application I have in mind (an FIR filter)
> is computationally intensive but does not need huge data throughput (a few
> MS/s would be enough, which I know I can delegate to the PRU if necessary).
> I found the idea of using the main processor appealing, as I read somewhere
> about its SIMD capability (doing 16 or 32 multiply-accumulates
> simultaneously, which would theoretically allow something like 16-32 Gflops,
> right?) and its floating-point arithmetic.
>
> So if you confirm that all those advantages are lost somewhere in the
> communication between the core and the dedicated modules, that would be a
> pity, but it would indeed save me a lot of time :slight_smile:

The Cortex-A9 core should be great at running a FIR filter,
particularly if you can use the NEON SIMD instructions. The problem
with application-style processors (and the optimizations that make
them fast) is that they create uncertainty and variable delay in
responding to a real-world event (like an interrupt for a new chunk
of data).

For the BBB, around 75 µs worst-case interrupt latency is a good
estimate. If you have a mechanism to DMA samples into main memory (or
use the PRU to collect and write them), the ARM should be fine at
running the FIR filter, but you should batch samples together and
only fire an interrupt for processing every N samples (or you're
wasting a *LOT* of time in IRQ overhead).

> And out of curiosity / for ease of later implementation / the number of
> available input-output ports: what delay and how many instructions can I
> expect for exchanging one or multiple bits between the main processor and
> a GPIO port? More than 10 cycles?

The ARM core should see about the same latency as the PRU when talking
to the GPIO. Writes will typically be posted and won't "cost" time on
the CPU as long as you don't write so fast you saturate the
interconnect. Reads will generally stall the CPU and should take on
the order of a couple hundred nanoseconds.

The interconnect may be somewhat faster for the ARM core, but talking
to the GPIO is going to be *WAY* slower than talking to main memory,
which is itself much slower than the CPU core frequency.

> Hehe, a beast indeed :slight_smile:
>
> I downloaded the StarterWare software and I like it. I'll summarize my
> current understanding, and if someone wants to correct me where
> necessary, I'll be glad:
> As far as I understand, StarterWare does not add any OS overhead, so you
> get to execute your code directly on the MPU - bare-metal access, so to
> say. I imagine the same can be accomplished by properly embedding your
> compiled file into a bootloader at the right place. The provided
> examples are reasonably clear; for example, to set a GPIO pin you find
> the instruction
>
>         GPIOPinWrite(GPIO_INSTANCE_ADDRESS,
>                      GPIO_INSTANCE_PIN_NUMBER,
>                      GPIO_PIN_LOW);
>
> Checking what is behind it, it is really a single instruction:
>         HWREG(baseAdd + GPIO_CLEARDATAOUT) = (1 << pinNumber);
> where the macro HWREG merely casts the address in brackets to a properly
> typed pointer. Using these examples is equivalent to a painful and
> time-consuming study of the TRM, where you can find the addresses of all
> those registers.
>
> So as far as I understand, this operation should compile down to the
> equivalent of a single assembler instruction. Two questions now remain:
> 1) How many cycles does it take the MPU to execute this instruction (or
> any other one)? This is not specified in the TRM, but I am sure it is
> somewhere in the ARM documentation.
> 2) How long does it take until the value arrives at the output pin?
>
> The second question aims at Charles' concern. Again, from the TRM I
> deduce that, for example, the GPIO modules are connected to the MPU
> through the L4 interconnect. The interface clock rate is specified in
> the GPIO chapter of the TRM to be 100 MHz. Now, I do not understand how
> this bus works in detail, but the fact that it can handle several
> sources and destinations simultaneously raises the concern that there is
> a buffer involved that comes with some extra latency. But I would assume
> that by running only the one code snippet that I define, with no OS
> processes in the background, all other devices are disabled and the bus
> is therefore only used when my program uses it. So my packets should get
> top-priority handling and arrive with minimal delay that is
> deterministic up to clock desynchronization. I would estimate a delay of
> a few interface clock cycles, so a latency of less than - say -
> 1/(20 MHz) = 50 ns. Is my reasoning correct here, or am I forgetting
> something?

See my previous mail; the numbers for the PRU will likely closely
match what you can do on the ARM core, since the interconnect is
going to be the limiting factor for performance.

tl;dr:
Writes will go fast but won't show up at the pin for a while.
Reads will take about 165 ns.

> I guess my next steps will be reading up on the MPU itself, to see
> whether one can hope to implement really fast algorithms at a very low
> level here. If I'm not mistaken, this
> <https://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf>
> is the document to read. A first glance tells me that maybe I'll
> understand what the A in Cortex-A9 actually means at a low level :slight_smile:

Um... that's the -M manual; you want the -A manual, specifically for
the Cortex-A9. Just get it straight from the source:

...click the "resources" tab.

Sounds promising… Anyway, I have a feeling that it will not beat the PRU speed, for the reasons already explained in this discussion.
If you are willing to get rid of the OS to run your application directly on the BeagleBone, maybe using an FPGA would be a better idea. I don’t know FPGAs very well, but I know you can even “program” some FPGA devices to work as a CPU/FPU. No idea about the performance or the costs involved; this is just a guess on an interesting topic.