How many PRU clock cycles does an LBBO instruction take?


I am using a BeagleBone Black. When I measured the number of PRU clock cycles needed to execute various assembler instructions, I found surprisingly large values for memory access. The list below gives my results; one cycle corresponds to a delay of 5 ns, as expected:

Most operations, such as ADD, SUB, QBxx, MOV, JMP, etc.: 1 cycle

LBBO 1,2,4 Bytes from PRU DRAM: 3 cycles
LBBO 8 Bytes from PRU DRAM: 4 cycles
LBBO 12 Bytes from PRU DRAM: 5 cycles
LBBO 16 Bytes from PRU DRAM: 6 cycles

LBCO 4 Bytes from DDR: 43 cycles
LBCO 8 Bytes from DDR: 44 cycles
LBCO 12 Bytes from DDR: 45 cycles
LBCO 16 Bytes from DDR: 46 cycles

By PRU DRAM, I mean any address between 0x00000000 and 0x00004000, plus the shared PRU RAM (12 kB starting at 0x00010000). Every other address I tried showed the delay listed for “DDR”.
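
The figures above fit a simple linear pattern: each additional 32-bit word adds one cycle on top of a fixed cost. Here is a minimal Python sketch of that fit. The base costs of 2 and 42 cycles are read off the list above, not taken from TI documentation, and the function names are only illustrative. It also assumes the pattern holds for sizes that were not measured.

```python
PRU_CLOCK_NS = 5  # one PRU cycle, as stated above

# Fixed cost in cycles, derived from the measurements above.
# Assumption: the linear pattern also holds for other transfer sizes.
BASE_CYCLES = {"pru_ram": 2, "ddr": 42}


def load_cycles(num_bytes, region):
    """Estimate the cycles for an LBBO/LBCO of num_bytes from a region."""
    words = max(1, -(-num_bytes // 4))  # round up to whole 32-bit words
    return BASE_CYCLES[region] + words


def cycles_to_ns(cycles):
    """Convert PRU cycles to nanoseconds."""
    return cycles * PRU_CLOCK_NS


if __name__ == "__main__":
    for region in ("pru_ram", "ddr"):
        for size in (4, 8, 12, 16):
            c = load_cycles(size, region)
            print(f"{region:8s} {size:2d} bytes: {c:2d} cycles = {cycles_to_ns(c)} ns")
```

Seen this way, a DDR read costs roughly 40 cycles (200 ns) more than a local read, regardless of size, so reading more bytes per instruction spreads that fixed cost.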

Can anybody confirm the long DDR readout times (and the other delays, if possible) that I have measured? Does anybody have an explanation for these large delays?

Thanks in advance! Lenny

Yes, I did some testing of the read and write latency to GPIO pins; the
results are in my PRU source code for LinuxCNC:

The write latency is caused by interfacing to the chip interconnect
fabric instead of the local PRU resources. Note that the fabric
includes write posting, so if you're not saturating the fabric you can
do a write in two PRU cycles (10 ns). If you are saturating the fabric,
the maximum rate is one write every 8 PRU cycles (40 ns).

For reads, the PRU stalls until the data is returned by the chip
interconnect fabric, which takes around 33 PRU cycles (165 ns) when
reading from the GPIO ports. If you are actually reading from DDR
memory (instead of on-chip resources), you also have the DRAM read
latency on top of this.

Note that almost all of these values include one or more clock
synchronization delays, so they can vary somewhat in either direction
(shorter or longer) depending on clock phasing and configured operating
frequencies.
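
To put those cycle counts in terms of access rates, here is a rough Python sketch. It assumes a 200 MHz PRU clock (5 ns per cycle) and back-to-back accesses with no other instructions in between. The function name is illustrative, and the results are upper bounds rather than measured throughput.

```python
PRU_CLOCK_HZ = 200_000_000  # 5 ns per cycle


def max_rate_hz(cycles_per_access):
    """Upper bound on accesses per second for a given per-access cost."""
    return PRU_CLOCK_HZ / cycles_per_access


if __name__ == "__main__":
    cases = [
        ("posted write, fabric not saturated", 2),
        ("write, fabric saturated", 8),
        ("GPIO read (PRU stalls)", 33),
    ]
    for label, cycles in cases:
        print(f"{label:36s} {cycles:2d} cycles -> {max_rate_hz(cycles) / 1e6:6.2f} M/s")
```

Under these assumptions, sustained writes top out at about 25 million per second and GPIO reads at about 6 million per second.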

Thanks a lot. These delays are quite disappointing. Is it possible to shorten them, for example by using DMA to transfer data from a peripheral (in my case the TSC_ADC) directly to PRU memory?

It should be possible to DMA into the PRU data memory, and indeed
checking section 10 of the TRM (Interconnects) shows that all TPTC
(Third Party Transfer Controller, a.k.a. DMA) initiators connect to the
L4_Fast slave where the PRU-ICSS lives.

Charles, great answer.

Lenny - the rule of thumb is that if you have to go outside of the PRU, it will be both slow and non-deterministic, because you have to go over the L3 interconnect. Because of this, if you need tight timing behavior, you should not make any memory accesses outside the PRU. I got around this by having PRU0 do the “realtime” work and put data in PRU memory, and then having PRU1 stream that data to DDR.
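
The post does not say how the data is handed from PRU0 to PRU1. One common way to do it is a single-producer/single-consumer ring buffer in PRU shared RAM. The Python model below sketches that idea only; the class, its fields, and the capacity are hypothetical, and a Python list stands in for the shared memory.

```python
class RingBuffer:
    """Model of a single-producer/single-consumer ring buffer.

    On real hardware the slots and both indices would live in PRU shared
    RAM (assumption); here a Python list stands in for that memory.
    """

    def __init__(self, capacity):
        self.slots = [0] * capacity
        self.capacity = capacity
        self.head = 0  # next slot the producer (PRU0 role) writes
        self.tail = 0  # next slot the consumer (PRU1 role) reads

    def push(self, sample):
        """Producer side: store one sample, or return False if full."""
        nxt = (self.head + 1) % self.capacity
        if nxt == self.tail:
            return False  # full: the consumer has fallen behind
        self.slots[self.head] = sample
        self.head = nxt
        return True

    def pop(self):
        """Consumer side: return the oldest sample, or None if empty."""
        if self.tail == self.head:
            return None
        sample = self.slots[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        return sample
```

Each index is written by only one side, so the producer touches only fast local memory and never waits on the interconnect; only the consumer pays the slow DDR access cost.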

I briefly looked at DMA but could not spend the time to figure out how to initiate DMA transfers from the PRU. If anyone figures that out, a post on it would be fantastic.


Very nice findings!
Do you have a broader list of instructions and their duration?
And/or is there any official or unofficial documentation where these delays could be taken from?
Also, you mentioned the PRU DRAM only - does it take the same time for the shared memory?
Thanks a lot!

Sorry, just saw that you actually mentioned that the shared memory has the same performance as the DRAM.
Also, I found this:
where it is said that LBBO should take (1 + word count) cycles. If that’s right, an LBBO instruction of up to 4 bytes should take 2 cycles for VBUS and 3 cycles for VBUSP. I need to study more to understand which one applies here, but VBUSP matches your findings.

You could also use the code snippet in this article to calculate clock cycles for individual instructions:

Thanks, Kumar!
I ended up writing a slightly different program before reading your comment. I used the STALL register to find out how many clock cycles an instruction stalls, which means the instruction actually takes 1 + stall cycles. I came up with these values on my BBB rev. A5A for the instructions that matter for my application (all tests used 32-bit accesses only):

LBBO/LBCO = 3 clocks for DRAM and Shared RAM

SBBO/SBCO = 2 clocks for DRAM and Shared RAM
LBBO = 43+ clocks for DDR reading (43.3 on average over 10000 tries)

LBBO = 42 (or 43) clocks for ADC FIFO0DATA reading (41.0001 on average over 10000 tries)
The ADC clock had no impact here (tried with 3 MHz and 8 MHz by changing the ADC_CLKDIV register).

I believe I will have to adapt my programs to always consider the difference of the CYCLE register… It seems to be the only way to be deterministic. Up to now I was manually counting instructions and subtracting them from the number of delay loops.
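
Here is a minimal Python sketch of that bookkeeping, under some stated assumptions. The function names are illustrative. The default of 2 cycles per delay-loop iteration assumes a two-instruction loop made of single-cycle instructions. The CYCLE counter is assumed to have been cleared recently enough that it has not reached its maximum value.

```python
def instruction_cycles(stall_count):
    """Cycles for one instruction, using the rule above: 1 + stall."""
    return 1 + stall_count


def remaining_delay_iterations(period_cycles, cycle_start, cycle_now,
                               cycles_per_iteration=2):
    """Delay-loop iterations still needed to fill out a fixed period.

    cycle_start and cycle_now are two readings of the CYCLE register.
    cycles_per_iteration=2 is an assumption about the delay loop.
    """
    elapsed = cycle_now - cycle_start
    remaining = period_cycles - elapsed
    if remaining <= 0:
        return 0  # period already overrun; nothing left to wait for
    return remaining // cycles_per_iteration


if __name__ == "__main__":
    # A DDR read with 42 stall cycles inside a 1000-cycle period.
    used = instruction_cycles(42)
    print(used, remaining_delay_iterations(1000, 100, 100 + used))
```

Because the wait is computed from the measured difference rather than from a hand count of instructions, a read that takes 43 cycles one time and 44 the next still ends up in the same overall period, to within one loop iteration.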

Would someone kindly decode what VBUS and VBUSP are? I searched but could not find any other relevant references. Thanks.

I also don't see any references to VBUS/VBUSP, but based on the context
they are internal PRU buses.

To answer your original question, access to anything outside the PRU
block (the two PRU cores, data memories, and local peripherals) requires
communicating over the SoC's internal interconnect fabric. You can
perform zero wait state writes to these resources (at least until you
saturate the posted write logic), but reads will stall the PRU until
data is returned from the far end.

I characterized the performance when accessing the GPIO registers from
the PRU, and got results similar to your DDR memory timings:

Note that all timings are approximate. The exact number of PRU cycles
it will take to complete a write or read will depend on things like
internal bus utilization, various clock crossing latencies (which by
nature will have a varying amount of latency) and how quickly the far
end can respond. The DDR DRAM controller in particular needs to
schedule the read request and there are many factors that can cause the
read latency to vary.