PRU I/O max speed

Paul_Beam1 · February 25, 2021, 5:26pm

I am, unfortunately, bit-banging SPI with the PRU, and I seem to be running into a speed limit below the 50 MHz I want. I can certainly create a clock that fast, but reading data seems to be delayed. On the logic analyzer I can see a "0" clearly being read as a "1", so there is a delay in my clock output, a delay in my input, or both. I would like to think that r30 and r31 are tied directly to the outside world, but now I suspect there is something in between that is either clocked or just has significant output delays. Has anyone else encountered this?

Paul_Beam1 · February 25, 2021, 8:15pm

With a sample size of one, r31 appears to be 4 instructions behind the state of the pin.

Andrew_P_Lentvorski · April 22, 2021, 9:00am

I would be stunned if the GPIOs don’t have synchronizer flip-flops as they are sampling a signal asynchronous to the 200MHz clock. That would account for 2 clocks. You probably need one extra to clock data into R31. And then one clock to read R31.

50 MHz is a pretty smoking speed for SPI. At that rate you normally need to start thinking about series termination and some basic signal integrity, and you normally need the clock to drive the capture flop directly if you want things to work.

I suspect you probably need to use the 16-bit Parallel Capture Mode while feeding your clock out back as clock in. You’ll still probably be 4 clocks behind when the data hits R31, but the data will get captured by the PRU_CLOCKIN edge properly so the delay will now be deterministic if you are generating the 50MHz clock yourself.

Paul_Beam1 · April 22, 2021, 1:53pm

I got it working, and I hope to never revisit it. It was kind of a surprise. I selected a 1 MS/s 16-bit SPI ADC and assumed a 16 MHz SPI clock to get the data out. I totally missed that the ADC can't sample, convert, and send at the same time, so I basically have 300 ns to get my 16 bits out. Everything else I had done with the PRU monitored and responded to an external clock, so this is the first time I was generating the clock and sampling the incoming data. I had noticed a previous oddity where I had some debugging statements (setting an output pin), and when I removed them things stopped working. There is definitely a speed limit.

Gerhard_Hoffmann · April 22, 2021, 3:02pm

I think I can read in about three LT2500-32 via the PRU in software.

The LT2500 ADC delivers 32-bit results via SPI, and with the capture and conversion time slots it needs 100 MHz SPI to process each 1 MHz sample. The ADC feeds its data to a shift register in a Xilinx 2c64 Coolrunner. The PRU then reads it bytewise and writes the collected 32-bit words into a ring buffer in the shared RAM.

Up to now I have tested one ADC, but the bandwidth should be just about enough for three.

regards, Gerhard

jkridner · April 24, 2021, 6:16am

https://pub.pages.cba.mit.edu/ring/

Gerhard_Hoffmann · April 24, 2021, 7:48am

It was really a ping-pong buffer, not a ring.

I did check the timing with an Agilent 54846B scope.

This is snipped from a backup copy; I have since re-purposed the BBB. I already commented on this here a year or so ago, and my memory of it is getting fuzzy…

volatile register unsigned int __R30; // CPU register R30 connects directly to some output pins
volatile register unsigned int __R31; // CPU register R31 connects directly to some input pins

#define SELECT 8 /* 2 bits addressing the 4 bytes in the CPLD: bit 8 = P8.27 and bit 9 = P8.29 */
#define PROG_CLK 10 /* P8.28 prog_dat green wire to SDI of ADC */
#define PROG_DAT 11 /* that works unexpectedly. probably the BBB handbook is wrong/incomplete. */
/* should be possible according to CPU data sheet. */
#define DAT_AVAIL 16 /* Pin P9.26, the only PRU1 pin on P9; input busy or drl, depending on output used */

// GPIO, Clearing or setting takes about 40 nsec
#define PROG_ENA (1<<2)
#define USE_CHAN_B (1<<4)

// variables in main end up on the stack. We only have 0x100 bytes by default.
// global variables are on the heap. Stack and heap are on the bottom of the PRU data RAM.

volatile int heapmarker = 0x22222222; // Easy to find in a memory dump
volatile char *bla = "HEAP @ @ @ ";
int i;
volatile int *pipo_pointer;
int pipo_offset;

// data avail is either not busy or not drl. It is high active.
// When the ADC is busy, it is low for 600 nsec.
// The CPLD then takes a little more than 32 Clocks
// to get the 32 bits. Then we can read them out, bytewise.
// It is probably harmless if that extends slightly into the next
// conversion since the read activity is decoupled from the ADC core.
// inline saves 20 nsec.

inline void wait_data_avail(void){

while ( __R31 & (1 << DAT_AVAIL)) {}; // wait for the high time of p9.26 = data_avail
while (!(__R31 & (1 << DAT_AVAIL))){}; // wait for the low time

// now the ADC transaction window opens.
// next 320 ns we will read the data into the CPLD or program the ADC
}

// read 4 bytes from the CPLD, mask them, shift them & convert to one int.
// I must read at least 3 times so that the results are right (for address setup time)
// removing a single read makes it 60 nsec faster, 15 nsec per read. Should be 5 nsec???
// reading 3 times takes 40 nsec per bit. That should be enough.
// reading 4 times takes 60 nsec per bit. Reading __R31 takes abt. 20 ns.
// From the rising edge of data_available on P9 to the return takes 725 nsec.
// kill 320 nsec, the time the CPLD needs to fill the shift register
// that is 340 nsec for 5 loop iterations. SLOW??
// 1 = 42 ns 2 = the same?? 11 = 104 ns 31 = 208 ns 51 = 304 ns
// 61 = 355 nsec 60 = 350 ns

// Once through the empty loop costs 5 nsec.
// for( retval=60; retval; retval--){};

// I.e. 250 nsec remain for storing into the ping-pong buffer. The CPLD could signal
// when the data for the BBB is ready. Then there would be 350 nsec more time for other things.
// That is done now. Data_rdy_bb now goes low for 330 nsec while the CPLD drains the ADC.
// When data_rdy_bbb goes high, we can start emptying the CPLD right away.
// Enough time for 2 more channels.

inline int read_adc(void){

int retval;

// Without volatile it runs 3 times as fast, even though __R31 is volatile
volatile unsigned int byte0, byte1, byte2, byte3;

wait_data_avail();

// __R30 &= ~(3 << SELECT); // address 0 Trigger
// __R30 |= (3 << SELECT); // address 3

// from here to parking address wires at return it takes 350 nsec.

__R30 &= ~(3 << SELECT); // address 0
byte0 = __R31; // address setup time for byte 0
byte0 = __R31; // 5 nsec each line
// byte0 = __R31;
byte0 = __R31;

__R30 |= (1 << SELECT); // address 1
byte1 = __R31;
byte1 = __R31;
// byte1 = __R31;
byte1 = __R31;

__R30 &= ~(3 << SELECT); // address 2, remove old bit field
__R30 |= (2 << SELECT); // insert new bit field
byte2 = __R31;
byte2 = __R31;
// byte2 = __R31;
byte2 = __R31;

__R30 |= (1<< SELECT); // increment to address 3
byte3 = __R31; //
byte3 = __R31;
// byte3 = __R31;
byte3 = __R31; // get the last byte

retval = ((byte0 & 0xff) )
| ((byte1 & 0xff) << 8 )
| ((byte2 & 0xff) << 16)
| ((byte3 & 0xff) << 24);

__R30 &= ~(3 << SELECT); // park address at 0, may be removed.

return retval;
}