PRUSS GPIO speed for reading state ?

Cedric_Malitte · September 19, 2014, 8:46pm

Hi all,

I had a few hours to play with the PRUSS, but I came to a dead end…

My goal is to read ADCs, the ADS8326 to be precise.
It’s a kind of SPI ADC with one clock, one select, and one data out.

I’d like to use 4 in parallel, which means only one clock, one select and 4 inputs on the PRUSS.
I drive the CLK line high, then read each input and shift the bits into variables to be sent to the main app.

When I look at the CLK line on a scope, reading the input states and shifting takes way too much time, even though the asm code should only take a few cycles.
I’m lazy, so I write the PRUSS code in C, but the generated asm looks nice.

Here’s the code in C

#define ADC1 (1 << 14)
#define ADC2 (1 << 15)
#define ADC3 (1 << 16)
#define ADC4 (1 << 17)

#define SOC_GPIO_1_REGS (0x4804C000)
#define SOC_GPIO_3_REGS (0x481AE000)
#define GPIO_PIN_LOW (0x0)
#define GPIO_PIN_HIGH (0x1)

#define GPIO_CLEARDATAOUT (0x190)
#define GPIO_SETDATAOUT (0x194)
#define GPIO_DATAIN (0x138)
#define GPIO_DATAOUT (0x13C)
#define GPIO_OE (0x134)

#define HWREG(x) (*((volatile unsigned int *)(x)))

#define ADC_CLK_PIN 12
#define ADC_CS_PIN 13

#define ADC_CLK_HI (HWREG(SOC_GPIO_1_REGS + GPIO_SETDATAOUT) = (1 << ADC_CLK_PIN))
#define ADC_CLK_LOW (HWREG(SOC_GPIO_1_REGS + GPIO_CLEARDATAOUT) = (1 << ADC_CLK_PIN))

#define ADC_CS_HI (HWREG(SOC_GPIO_1_REGS + GPIO_SETDATAOUT) = (1 << ADC_CS_PIN))
#define ADC_CS_LOW (HWREG(SOC_GPIO_1_REGS + GPIO_CLEARDATAOUT) = (1 << ADC_CS_PIN))

#define PRU0_ARM_INTERRUPT 19
#define SYSCFG (*(&C4+0x01))
int C4 __attribute__((cregister("MEM",near),peripheral)); //only compatible with v1.1.0B1 +
//add following lines to MEMORY{} in lnk.cmd
//PAGE 2:
// MEM : o = 0x00026000 l = 0x00002000 CREGISTER=4
volatile register unsigned int __R31;

void main()
{
/* Initialise OCP Master port for accessing external memories */
SYSCFG&=0xFFFFFFEF;
ocp_init();
shm_init();
/* Start Main Code */
int i,j;
unsigned int sensor_1=0, sensor_2=0, temp=0;
HWREG(SOC_GPIO_1_REGS + GPIO_OE) &= ~(1 << ADC_CLK_PIN); // output
HWREG(SOC_GPIO_1_REGS + GPIO_OE) &= ~(1 << ADC_CS_PIN); // output
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC1; // input
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC2; // input
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC3; // input
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC4; // input
ADC_CLK_HI;
DELAY1;
while (1)
{
ADC_CLK_HI;
asm volatile
(
" NOP \n"
" NOP \n"
" NOP \n"
" NOP \n"
" NOP \n"
);
//READ
sensor_1 |= (HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC1);
sensor_1 |= ((HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC2)<<16);
sensor_2 |= (HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC3);
sensor_2 |= ((HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC4)<<16);
ADC_CLK_LOW;
if (j!=15)
{
// shift bits
sensor_1 = sensor_1 << 1;
sensor_2 = sensor_2 << 1;
}
delay_100();
}

Relevant part reading sensor_1 in asm:

.dwpsn file "adc_pru.c",line 154,column 4,is_stmt,isa 0
LDI r0, 0x4000 ; [] |154|
LDI32 r1, 0x481ae138 ; [] |154|
LBBO &r1, r1, 0, 4 ; [] |154|
AND r0, r1, r0 ; [] |154|
LBBO &r1, r2, 8, 4 ; [] |154| sensor_1
OR r0, r1, r0 ; [] |154|
SBBO &r0, r2, 8, 4 ; [] |154| sensor_1

.dwpsn file "adc_pru.c",line 155,column 4,is_stmt,isa 0
LDI r0, 0x8000 ; [] |155|
LDI32 r1, 0x481ae138 ; [] |155|
LBBO &r1, r1, 0, 4 ; [] |155|
AND r0, r1, r0 ; [] |155|
LSL r0, r0, 0x10 ; [] |155|
LBBO &r1, r2, 8, 4 ; [] |155| sensor_1
OR r0, r1, r0 ; [] |155|
SBBO &r0, r2, 8, 4 ; [] |155| sensor_1

My big problem is that it takes too much time, in fact way too much.

Using this code, the CLK line runs at 757 kHz.
CLK high is around 1 µs and low is the rest…

I’d like to achieve at least 2 MHz on the CLK line.

I might have misread the doc, but isn’t an instruction supposed to take 5 ns?
That should be 35 ns for the first part and 40 ns for the second part.

Any clue or help?

The learning curve is a bit steeper than I thought

Thanks

Cedric_Malitte · September 19, 2014, 8:51pm

Well, I misread the doc… not all instructions are created equal.

Even so, it’s still slow as hell to read the inputs…

Charles_Steinkuehler · September 19, 2014, 9:22pm

The *INSTRUCTION* takes 5 nS (or maybe 10-15, depending on exactly what
you're doing), but since you're reading data from outside the PRU
domain, the round-trip time for each GPIO read is killing your
performance. You need to use the direct PRU inputs, and not general
purpose I/O accessed through the AXI fabric.

I have some details on read/write timings to the GPIO via the
interconnect fabric in the comments of my PRU code for Machinekit:

https://github.com/machinekit/machinekit/blob/master/src/hal/drivers/hal_pru_generic/pru_generic.p#L135-L163

          .entrypoint START
          
          // PRU GPIO Write Timing Details
          // The actual write instruction to a GPIO pin using SBBO takes two 
          // PRU cycles (10 nS).  However, the GPIO logic can only update every 
          // 40 nS (8 PRU cycles).  This meas back-to-back writes to GPIO pins 
          // will eventually stall the PRU, or you can execute 6 PRU instructions 
          // for 'free' when burst writing to the GPIO.
          //
          // Latency from the PRU write to the actual I/O pin changing state
          // (normalized to PRU direct output pins = zero latency) when the
          // PRU is writing to GPIO1 and L4_PERPort1 is idle measures 
          // 95 nS or 105 nS (apparently depending on clock synchronization)
          //
          // PRU GPIO Posted Writes
          // When L4_PERPort1 is idle, it is possible to burst-write multiple
          // values to the GPIO pins without stalling the PRU, as the writes 
          // are posted.  With an unrolled loop (SBBO to GPIO followed by a 
          // single SET/CLR to R30), the first 20 write cycles (both 
          // instructions) took 15 nS each, at which point the PRU began

[Excerpt truncated; the full comment is in the linked file.]

Note that *WRITES* from the PRU to the GPIO are fairly quick, but
*READS* are very slow. This is because the write can be posted allowing
the PRU to continue on executing code, but on reads the PRU stalls until
the data is returned.

Executive Summary of PRU <-> GPIO timing:

Peak GPIO write speed : 10 nS (100 MHz)
Sustained GPIO write speed : 40 nS ( 25 MHz)
GPIO Read speed : ~165 nS ( ~6 MHz)
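
As a rough back-of-the-envelope check of what those numbers mean for the loop in the original post, the time lost to GPIO reads alone can be tallied as below. This is an estimate, not a measurement: the function names are made up for illustration, and only the ~165 nS read figure and the 757 kHz clock come from this thread.

```c
/* Rough timing budget for GPIO reads made through the interconnect.
 * GPIO_READ_NS is the approximate read time quoted above; the helper
 * names are hypothetical and exist only for this illustration. */
#define GPIO_READ_NS 165u

/* Time the PRU spends stalled on n back-to-back GPIO reads, in ns. */
static unsigned int gpio_read_stall_ns(unsigned int n_reads)
{
    return n_reads * GPIO_READ_NS;
}

/* Clock period in ns for a frequency given in kHz (integer division). */
static unsigned int period_ns_from_khz(unsigned int khz)
{
    return 1000000u / khz;
}
```

With four reads per clock, the stall is about 660 nS, which already exceeds the 500 nS period of a 2 MHz clock before any other instruction runs. A single read per clock costs about 165 nS, which leaves room in that budget.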

You are then making things much worse by reading from the GPIO bank
multiple times in your code. You should factor all the
HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) accesses into a single read to a
local variable, then use the local variable to do the bit manipulations,
rather than performing the expensive read four times.
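
A minimal sketch of that single-read approach, assuming the ADC1..ADC4 masks from the original post (GPIO3 bits 14 to 17). The `adc_bits` struct and the `adc_bits_from_sample` helper are hypothetical names; on the PRU, the `sample` argument would come from one HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) read per clock.

```c
#include <stdint.h>

/* Input masks from the original post (GPIO3 bits 14..17). */
#define ADC1 (1u << 14)
#define ADC2 (1u << 15)
#define ADC3 (1u << 16)
#define ADC4 (1u << 17)

/* Hypothetical container: one data bit (0 or 1) per ADC. */
struct adc_bits {
    uint8_t b1, b2, b3, b4;
};

/* Extract all four ADC data bits from ONE snapshot of GPIO_DATAIN.
 * Intended use on the PRU (assumed, not tested on hardware):
 *   uint32_t sample = HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN);  (the only slow read)
 *   struct adc_bits bits = adc_bits_from_sample(sample);
 */
static struct adc_bits adc_bits_from_sample(uint32_t sample)
{
    struct adc_bits bits;
    bits.b1 = (sample & ADC1) ? 1 : 0;
    bits.b2 = (sample & ADC2) ? 1 : 0;
    bits.b3 = (sample & ADC3) ? 1 : 0;
    bits.b4 = (sample & ADC4) ? 1 : 0;
    return bits;
}
```

Besides cutting three of the four slow reads, this also means all four inputs are sampled at the same instant rather than a few hundred nS apart.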

Also, don't blame the compiler for not optimizing this for you. If you
are wondering why this didn't get optimized, the compiler cannot treat a
GPIO register read as a generic (ie: cachable) memory read since the
value read can potentially be different each time (ie: the access is
volatile). Therefore, it's up to you to integrate any read or write
combining that is acceptable, the compiler can't do it for you. Also,
even standard memory reads from DDR via the PRU are really volatile,
since the ARM core is running in the background and could potentially be
changing the values between each PRU access.

Clean up your code a bit and I expect you'll see much better results!

Cedric_Malitte · September 19, 2014, 11:39pm

Thanks a lot, Charles, for the insight.

Yes, my code is not really optimized; it’s only a draft to play with the I/Os.
I do coding for microchips and I thought (beat me) that the PRUSS would behave the same concerning I/Os, I mean direct access.

When I saw the delays, I also thought of reading the whole register and then bitmasking it to get the pins I need.
You confirmed this: it will be a lot faster with a single read and bitmasking.
As long as I can get under 250 ns for reading, it will be fine.

I had a quick look at your code, and will dig into it later.
I think I’ll code directly in asm, as I do not have that much to do.
Just an infinite loop to clock 2 pins, read the others and send the value over RAM.

Regards,

Cedric

Charles_Steinkuehler · September 20, 2014, 8:42pm

The PRU does behave that way, if you use the direct PRU inputs (ie: read register R31).

It's when you use the GPIO registers via the SoC interconnect that reads
become very slow.

Cedric_Malitte · September 22, 2014, 3:06am

Thanks Charles, I think I got it !

I will try tomorrow if I get time, but it should look like this:

MOV r3, 0 // Sensor_1 and Sensor_2 16 bits data
MOV r4, 0 // Sensor_3 and Sensor_4 16 bits data

CLR r30.t14 // CS low
MOV r0, 0 // bit counter
LOOP1:
SET r30.t15 // CLK high
MOV r2, r31.b0 // Read in the data

QBBC ChkSensor2, r2.b0.t0 // if r2.0, set r3.0
SET r3, 0
ChkSensor2:
QBBC ChkSensor3, r2.b0.t1 // if r2.1, set r3.16 (bit 16, so Sensor_2 lands in bits 31..16 and never overlaps Sensor_1)
SET r3, 16
ChkSensor3:
QBBC ChkSensor4, r2.b0.t2 // if r2.2, set r4.0
SET r4, 0
ChkSensor4:
QBBC EndCheck, r2.b0.t3 // if r2.3, set r4.16 (bit 16, so Sensor_4 lands in bits 31..16 and never overlaps Sensor_3)
SET r4, 16

EndCheck:
CLR r30.t15 // CLK low
QBEQ DataDone, r0, 15

RotateData:
LSL r3, r3, 1
LSL r4, r4, 1
ADD r0, r0, 1
JMP LOOP1

DataDone:
SET r30.t14 // CS high

I could also do bitmasking for the data in, as timing should be equal whatever the bit value is.
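
For illustration, that branch-free variant can be modelled on a host in C. This is a sketch, not PRU code: `pack_step` is a made-up name, `in` stands for the low byte read from R31, and the layout (sensors 1 and 3 in bits 15..0, sensors 2 and 4 in bits 31..16) is an assumption chosen so the two 16-bit values in each register cannot overlap.

```c
#include <stdint.h>

/* One clock step of the packing loop, with no data-dependent branches.
 * Both accumulators are shifted left by one, then the new bits are ORed in:
 *   in bit 0 -> r3 bit 0     in bit 1 -> r3 bit 16
 *   in bit 2 -> r4 bit 0     in bit 3 -> r4 bit 16
 * After 16 steps the first bit clocked in is the MSB of each 16-bit half. */
static void pack_step(uint32_t in, uint32_t *r3, uint32_t *r4)
{
    *r3 = (*r3 << 1) | (in & 1u) | (((in >> 1) & 1u) << 16);
    *r4 = (*r4 << 1) | ((in >> 2) & 1u) | (((in >> 3) & 1u) << 16);
}
```

Every step performs the same operations whatever the input bits are, so the CLK high and low times stay constant from bit to bit.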

I’ll check what I see on the scope and decide later.

Regards,

Cedric