Unexpected PRU I/O Delay

Geoffrey_Messier · March 8, 2017, 9:19pm

Hi everyone,

I’m using the PRU on a BeagleBone Black to interface to a digital-to-analog converter (DAC). The DAC generates a clock signal that I have connected to pin 16 of R31. Each time a transition on that pin is detected, I write new data to the DAC via R30. The quicker the PRU reacts to a clock transition and updates the DAC data lines, the faster I can run the DAC.

To start my testing, I’ve written the following simplified assembly code that just toggles the data lines when a DAC clock transition is detected.

```
.origin 0
.entrypoint start

start:
    // Start by waiting for a high to low transition on DCLKIO
    // - DCLKIO is connected to pin 16 of R31
    mov r0, 0x10000

    // Dummy data values
    mov r2, 0xffff
    mov r3, 0

    jmp wait_one_to_zero

wait_one_to_zero:
    qbeq wait_one_to_zero, r0, r31
    mov r30, r2

    // Now, wait for a low to high transition.
    mov r0, 0
    jmp wait_zero_to_one

wait_zero_to_one:
    qbeq wait_zero_to_one, r0, r31
    mov r30, r3

    // Now, wait for a high to low transition.
    mov r0, 0x10000
    jmp wait_one_to_zero
```

This works fine, but when I look at the data lines on my logic analyzer, I see a delay of approximately 25 ns between a transition on my DAC clock and the update of the data lines. Since the PRU runs at 200 MHz, I was expecting something closer to 10 ns (1 clock cycle for qbeq and 1 clock cycle for mov). I should mention that if I run some assembly code that just does “mov r30, 0; mov r30, 0xffff; mov r30, 0”, I do see 5 ns transitions on the output pins, as advertised. I’ve also looked at the waveforms on an oscilloscope; they ring a bit, but not enough to obscure where the logic transitions occur. I use “config-pin PX.XX pruin/pruout” to configure my pins for use by the PRU and I’m running 4.9.12-bone4.
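As a quick sanity check on these numbers, here is a minimal Python sketch of the cycle arithmetic. It assumes only what the post states: a 200 MHz PRU clock and one cycle per instruction. The helper names `cycles_to_ns` and `ns_to_cycles` are made up for illustration.

```python
PRU_CLOCK_HZ = 200_000_000  # PRU clock rate stated in the post


def cycles_to_ns(cycles, clock_hz=PRU_CLOCK_HZ):
    """Convert a number of PRU clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


def ns_to_cycles(ns, clock_hz=PRU_CLOCK_HZ):
    """Convert a duration in nanoseconds to PRU clock cycles."""
    return ns * clock_hz / 1e9


# Expected reaction time: one cycle for qbeq plus one cycle for mov.
expected_ns = cycles_to_ns(2)

# Delay measured on the logic analyzer, expressed in PRU cycles.
observed_cycles = ns_to_cycles(25)

print(f"expected: {expected_ns} ns, observed: {observed_cycles} cycles")
```

Expressed in cycles, the measured delay is several cycles longer than the two instructions alone account for, which is the gap the rest of the thread tries to explain.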

I’ve read through the PRU section in the (massive) AM335x TI Reference Manual and it seems that both R30 and R31 should be operating in “direct input” and “direct output” modes by default. The diagrams and documentation in the reference manual seem to indicate a direct connection between the R30/R31 registers and the I/O pins, but my measurements suggest there’s some kind of buffering with an additional clocking delay hiding in there somewhere. The fact that I can still toggle the output pins at 200 MHz with “mov r30, 0; mov r30, 0xffff; mov r30, 0” perhaps means that the buffering is only on the input lines, or that it exists on the output as well and the output transitions initiated by the assembly code move through that buffering in a pipelined manner.

Any comments on this problem would be greatly appreciated. As it stands now, this delay means I will have to run my DAC quite a bit more slowly than I had hoped.

Thanks in advance!
Geoff

Charles_Steinkuehler · March 9, 2017, 1:48am

You are wasting two instructions loading r0 with your compare value.
Remove these and use a bit test instruction instead (QBBS/QBBC), or,
since you're reading r31, use wait until bit set/clear (WBS/WBC).

You don't need to jump to a label that represents the next
instruction. Just go ahead and fall through and execute it.

wait_one_to_zero:
qbbs wait_one_to_zero, r31, 16
mov r30, r2

// Now, wait for a low to high transition.
wait_zero_to_one:

qbbc wait_zero_to_one, r31, 16
mov r30, r3

// Now, wait for a high to low transition.
jmp wait_one_to_zero

You can get marginally better performance if you unroll the loop (so
you have 8 or 16 or however many cycles before you perform the jmp
back to wait_one_to_zero), but you may or may not be able to do that.

Also, as you noted, there is likely some latency between the PRU and
the input and output pins. If you are only using one direction, the
latency doesn't matter and you can see or generate 5 ns wide pulses,
but if you need to loop input to output (or output to input) any
latency becomes important. I have not measured this latency, but I
would expect it to be at least one clock cycle each way. If you do
measure this in-circuit, please share your results.

Geoffrey_Messier · March 9, 2017, 4:20am

Charles, thanks very much for the quick reply. I’ve tried the following two assembly language programs based on your suggestion:

```
.origin 0
.entrypoint start

start:
    wbs r31, 16
    ldi r30, 0
    wbc r31, 16
    ldi r30, 0xff
    jmp start
```

and

```
.origin 0
.entrypoint start

start:
    mov r2, 0
    mov r3, 0xff

wait_one_to_zero:
    qbbs wait_one_to_zero, r31, 16
    mov r30, r2

wait_zero_to_one:
    qbbc wait_zero_to_one, r31, 16
    mov r30, r3
    jmp wait_one_to_zero
```

Both programs still show the 25 ns delay between the clock edge on R31 and the resulting transition on R30. At 5 ns per cycle that is 5 clock cycles in total, which seems to indicate a 2 clock cycle delay for the instructions and a ~3 clock cycle combined latency on the input and output pins. Probably some kind of buffering or latching delay. Too bad they don’t mention this in the reference manual. Fig. 4.8 in Section 4.4.1.2.3.1 is a bit misleading.
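The split between instruction time and pin latency, and what it could mean for the DAC clock, can be sketched in Python. The 25 ns delay, the 5 ns cycle and the 2 instruction cycles come from the measurements above. The clock-rate bound is only a rough assumption for illustration: it supposes the data lines must change before the next clock edge arrives, which is not a DAC timing requirement stated in this thread. The function names are hypothetical.

```python
PRU_CYCLE_NS = 5.0  # one PRU cycle at 200 MHz


def io_latency_cycles(measured_delay_ns, instruction_cycles):
    """Cycles of the measured delay not explained by the instructions."""
    total_cycles = measured_delay_ns / PRU_CYCLE_NS
    return total_cycles - instruction_cycles


def max_clock_mhz(reaction_delay_ns):
    """Rough upper bound on the DAC clock rate.

    ASSUMPTION (not from the thread): the PRU reacts to both clock edges,
    so each half period must be at least as long as the reaction delay.
    """
    min_period_ns = 2 * reaction_delay_ns
    return 1e3 / min_period_ns  # a period in ns maps to 1000 / period MHz


latency = io_latency_cycles(measured_delay_ns=25.0, instruction_cycles=2)
print(f"combined pin latency: {latency} cycles")
print(f"clock bound at 25 ns: {max_clock_mhz(25.0)} MHz")
print(f"clock bound at 10 ns: {max_clock_mhz(10.0)} MHz")
```

Under that assumption, the bound at the measured 25 ns is well below the bound at the 10 ns originally hoped for, which is why the extra latency matters for the DAC rate.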

Dennis_Lee_Bieber · March 9, 2017, 2:18pm

Likely one clocking delay to ensure the signal only changes state
between instructions (wouldn't want the state to transition /while/ it was
being read), and if the inputs are Schmitt-triggered (to "debounce"
electrical noise during transition), that likely adds a second cycle... If
not more.