PRUSS GPIO speed for reading state?

Hi all,

I had a few hours to play with the PRUSS, but I came to a dead end…

My goal is to read ADCs, the ADS8326 to be precise.
It’s a kind of SPI ADC with one clock, one select and one data output.

I’d like to use 4 in parallel, which means only one clock, one select and 4 inputs on the PRUSS.
I drive the CLK line high, then read each input and shift the bits into variables to be sent to the main app.

When I look at the CLK line on a scope, reading the input states and shifting takes far too long, even though the asm code should only take a few cycles.
I’m lazy, so I write the PRUSS code in C, but the generated asm looks fine.

Here’s the code in C

#define ADC1 (1 << 14)
#define ADC2 (1 << 15)
#define ADC3 (1 << 16)
#define ADC4 (1 << 17)

#define SOC_GPIO_1_REGS (0x4804C000)
#define SOC_GPIO_3_REGS (0x481AE000)
#define GPIO_PIN_LOW (0x0)
#define GPIO_PIN_HIGH (0x1)

#define GPIO_CLEARDATAOUT (0x190)
#define GPIO_SETDATAOUT (0x194)
#define GPIO_DATAIN (0x138)
#define GPIO_DATAOUT (0x13C)
#define GPIO_OE (0x134)

#define HWREG(x) (*((volatile unsigned int *)(x)))

#define ADC_CLK_PIN 12
#define ADC_CS_PIN 13

#define ADC_CLK_HI (HWREG(SOC_GPIO_1_REGS + GPIO_SETDATAOUT) = (1 << ADC_CLK_PIN))
#define ADC_CLK_LOW (HWREG(SOC_GPIO_1_REGS + GPIO_CLEARDATAOUT) = (1 << ADC_CLK_PIN))

#define ADC_CS_HI (HWREG(SOC_GPIO_1_REGS + GPIO_SETDATAOUT) = (1 << ADC_CS_PIN))
#define ADC_CS_LOW (HWREG(SOC_GPIO_1_REGS + GPIO_CLEARDATAOUT) = (1 << ADC_CS_PIN))

#define PRU0_ARM_INTERRUPT 19
#define SYSCFG (*(&C4+0x01))
int C4 __attribute__((cregister("MEM", near), peripheral)); // only compatible with v1.1.0B1+
// add the following lines to MEMORY{} in lnk.cmd:
// PAGE 2:
// MEM : o = 0x00026000 l = 0x00002000 CREGISTER=4
volatile register unsigned int __R31;

void main()
{
/* Initialise OCP master port for accessing external memories */
SYSCFG&=0xFFFFFFEF;
ocp_init();
shm_init();
/* Start main code */
int i,j;
unsigned int sensor_1=0, sensor_2=0, temp=0;
HWREG(SOC_GPIO_1_REGS + GPIO_OE) &= ~(1 << ADC_CLK_PIN); // output
HWREG(SOC_GPIO_1_REGS + GPIO_OE) &= ~(1 << ADC_CS_PIN); // output
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC1; // input
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC2; // input
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC3; // input
HWREG(SOC_GPIO_3_REGS + GPIO_OE) |= ADC4; // input
ADC_CLK_HI;
DELAY1;
while (1)
{
ADC_CLK_HI;
asm volatile
(
" NOP \n"
" NOP \n"
" NOP \n"
" NOP \n"
" NOP \n"
);
//READ
sensor_1 |= (HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC1);
sensor_1 |= ((HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC2)<<16);
sensor_2 |= (HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC3);
sensor_2 |= ((HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) & ADC4)<<16);
ADC_CLK_LOW;
if (j!=15)
{
// shift bits
sensor_1 = sensor_1 << 1;
sensor_2 = sensor_2 << 1;
}
delay_100();
}
}

Relevant part reading sensor_1 in asm:

.dwpsn file "adc_pru.c",line 154,column 4,is_stmt,isa 0
LDI r0, 0x4000 ; [] |154|
LDI32 r1, 0x481ae138 ; [] |154|
LBBO &r1, r1, 0, 4 ; [] |154|
AND r0, r1, r0 ; [] |154|
LBBO &r1, r2, 8, 4 ; [] |154| sensor_1
OR r0, r1, r0 ; [] |154|
SBBO &r0, r2, 8, 4 ; [] |154| sensor_1

.dwpsn file "adc_pru.c",line 155,column 4,is_stmt,isa 0
LDI r0, 0x8000 ; [] |155|
LDI32 r1, 0x481ae138 ; [] |155|
LBBO &r1, r1, 0, 4 ; [] |155|
AND r0, r1, r0 ; [] |155|
LSL r0, r0, 0x10 ; [] |155|
LBBO &r1, r2, 8, 4 ; [] |155| sensor_1
OR r0, r1, r0 ; [] |155|
SBBO &r0, r2, 8, 4 ; [] |155| sensor_1

My big problem is that it takes too much time, in fact way too much.

Using this code, the CLK line runs at 757 kHz.
CLK high lasts around 1 µs and low takes the rest…

I’d like to achieve at least 2 MHz on the CLK line.

I might have misread the doc, but isn’t an instruction supposed to take 5 ns?
That should be 35 ns for the first part and 40 ns for the second part.

Any clue or help?

The learning curve is a bit steeper than I thought :)

Thanks

Well, I misread the doc… not all instructions are created equal :)

Even so, it’s still slow as hell to read the inputs…

The *INSTRUCTION* takes 5 ns (or maybe 10-15 ns, depending on exactly
what you're doing), but since you're reading data from outside the PRU
domain, the round-trip time for each GPIO read is killing your
performance. You need to use the direct PRU inputs, not general-purpose
I/O accessed through the SoC interconnect fabric.

I have some details on read/write timings to the GPIO via the
interconnect fabric in the comments of my PRU code for Machinekit:

https://github.com/machinekit/machinekit/blob/master/src/hal/drivers/hal_pru_generic/pru_generic.p#L135-L163

Note that *WRITES* from the PRU to the GPIO are fairly quick, but
*READS* are very slow. This is because the write can be posted allowing
the PRU to continue on executing code, but on reads the PRU stalls until
the data is returned.

Executive summary of PRU <-> GPIO timing:

Peak GPIO write speed      :   10 ns (100 MHz)
Sustained GPIO write speed :   40 ns ( 25 MHz)
GPIO read speed            : ~165 ns ( ~6 MHz)

You are then making things much worse by reading from the GPIO bank
multiple times in your code. You should factor all the
HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN) accesses into a single read to a
local variable, then use the local variable to do the bit manipulations,
rather than performing the expensive read four times.
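
As a minimal sketch of that refactoring (not the poster's actual code), the helper below takes one snapshot of the DATAIN register and derives both sensor bits from it. It keeps the packing used in the original listing, with one ADC in the low 16 bits of a word and a second ADC in the high 16 bits. The function name `shift_in_pair` and its parameters are hypothetical; on the PRU, `datain` would come from a single `HWREG(SOC_GPIO_3_REGS + GPIO_DATAIN)` read per clock.

```c
#include <stdint.h>

/* Bit positions of the ADC data lines in the GPIO bank, as in the
 * original listing (ADC1 = 1 << 14 ... ADC4 = 1 << 17). */
#define ADC1_BIT 14u
#define ADC2_BIT 15u
#define ADC3_BIT 16u
#define ADC4_BIT 17u

/* Hypothetical helper: shift one new sample bit into each 16-bit half of
 * `word`. `datain` is ONE snapshot of the GPIO DATAIN register; both bits
 * are extracted from it, so the slow bus read happens only once. */
static uint32_t shift_in_pair(uint32_t word, uint32_t datain,
                              unsigned lo_bit, unsigned hi_bit)
{
    uint32_t lo = (datain >> lo_bit) & 1u;  /* new bit for the low half  */
    uint32_t hi = (datain >> hi_bit) & 1u;  /* new bit for the high half */

    /* Shift both halves left by one. The mask clears bit 0 and bit 16 so
     * the old MSB of the low half cannot leak into the high half. */
    return ((word << 1) & 0xFFFEFFFEu) | lo | (hi << 16);
}
```

In the acquisition loop this would be called twice per clock with the same snapshot, for example `sensor_1 = shift_in_pair(sensor_1, datain, ADC1_BIT, ADC2_BIT);` and `sensor_2 = shift_in_pair(sensor_2, datain, ADC3_BIT, ADC4_BIT);`, which replaces four bus reads with one.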

Also, don't blame the compiler for not optimizing this for you. If you
are wondering why this didn't get optimized: the compiler cannot treat a
GPIO register read as a generic (i.e. cacheable) memory read, since the
value read can potentially be different each time (i.e. the access is
volatile). Therefore it's up to you to do any read or write combining
that is acceptable; the compiler can't do it for you. Also, even
standard memory reads from DDR via the PRU are really volatile, since
the ARM core is running in the background and could potentially be
changing the values between each PRU access.

Clean up your code a bit and I expect you'll see much better results!

Thanks a lot, Charles, for the insight.

Yes, my code is not really optimized; it’s only a draft to play with the I/Os.
I do coding for microchips and I thought (beat me) that the PRUSS would behave the same way for I/Os, I mean with direct access.

When I saw the delays, I also thought of reading the whole register once and then bitmasking it to get the pins I need.
You confirmed this: it will be a lot faster with bitmasking.
As long as I can get the read done in under 250 ns, it will be fine.
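
A back-of-the-envelope check of that budget, using only figures quoted in this thread (about 165 ns per GPIO read, a 2 MHz clock target), might look like the sketch below. The function names are hypothetical and the numbers are approximations rather than measurements.

```c
#include <stdint.h>

/* Approximate cost of one GPIO read through the interconnect, taken from
 * the timing summary quoted in this thread. */
#define GPIO_READ_NS 165u

/* Period of a clock in nanoseconds (integer division). */
static uint32_t clk_period_ns(uint32_t hz)
{
    return 1000000000u / hz;
}

/* Time spent on GPIO reads alone for `reads` reads per clock cycle. */
static uint32_t read_cost_ns(uint32_t reads)
{
    return reads * GPIO_READ_NS;
}

/* Nonzero if the reads alone fit within one clock period at `hz`. */
static int reads_fit(uint32_t reads, uint32_t hz)
{
    return read_cost_ns(reads) <= clk_period_ns(hz);
}
```

With these figures, four reads per clock cost about 660 ns, which already exceeds the 500 ns period of a 2 MHz clock, while a single read at about 165 ns stays under the 250 ns target mentioned above.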

I had a quick look at your code and will dig into it later.
I think I’ll code directly in asm, as I do not have that much to do:
just an infinite loop to clock 2 pins, read the others and send the values over RAM.

Regards,

Cedric

The PRU does behave that way, if you use the direct PRU inputs (i.e.
read register R31).

It's only when you go through the GPIO registers via the SoC
interconnect that reads become very slow.
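
For illustration, here is a rough C sketch of one clock step using the direct PRU pins. The pin assignments are assumptions (clock on R30 bit 15, ADC data on R31 bits 0 to 3), the pins would also have to be muxed to their PRU functions, and `clock_in_bit` is a hypothetical name. The two variables are host stand-ins so the sketch compiles off-target; with TI's PRU C compiler they would be the register variables `__R30` and `__R31`, declared the way `__R31` is in the original listing.

```c
#include <stdint.h>

/* Host stand-ins so the sketch compiles off-target. On the PRU these
 * would be:
 *     volatile register unsigned int __R30;   (direct outputs)
 *     volatile register unsigned int __R31;   (direct inputs)
 */
static volatile uint32_t r30_out;
static volatile uint32_t r31_in;

#define CLK_BIT   15u   /* assumed: ADC clock on R30 bit 15      */
#define DATA_MASK 0xFu  /* assumed: ADC data on R31 bits 0..3    */
#define NUM_ADC   4u

/* Clock one bit out of every ADC and shift it into that ADC's
 * accumulator. The inputs are sampled once per clock. */
static void clock_in_bit(uint16_t acc[NUM_ADC])
{
    uint32_t pins;
    unsigned i;

    r30_out |= (1u << CLK_BIT);      /* CLK high */
    pins = r31_in & DATA_MASK;       /* single sample of all four inputs */
    r30_out &= ~(1u << CLK_BIT);     /* CLK low */

    for (i = 0; i < NUM_ADC; i++)
        acc[i] = (uint16_t)((acc[i] << 1) | ((pins >> i) & 1u));
}
```

Calling `clock_in_bit` sixteen times would collect one 16-bit word per ADC. Which clock edge to sample on, and whether a short delay is needed before the read, depends on the ADC's timing and should be checked against its datasheet.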

Thanks Charles, I think I’ve got it!

I will try tomorrow if I have time, but it should look like this:

MOV r3, 0 // Sensor_1 and Sensor_2 16 bits data
MOV r4, 0 // Sensor_3 and Sensor_4 16 bits data

CLR r30.t14 // CS low
MOV r0, 0 // bit counter
LOOP1:
SET r30.t15 // CLK high
MOV r2, r31.b0 // Read in the data

QBBC ChkSensor2, r2.b0.t0 // if r2.0, set r3.0
SET r3, 0
ChkSensor2:
QBBC ChkSensor3, r2.b0.t1 // if r2.1, set r3.16 (upper 16-bit half)
SET r3, 16
ChkSensor3:
QBBC ChkSensor4, r2.b0.t2 // if r2.2, set r4.0
SET r4, 0
ChkSensor4:
QBBC EndCheck, r2.b0.t3 // if r2.3, set r4.16 (upper 16-bit half)
SET r4, 16

EndCheck:
CLR r30.t15 // CLK low
QBEQ DataDone, r0, 15

RotateData:
LSL r3, r3, 1
LSL r4, r4, 1
ADD r0, r0, 1
JMP LOOP1

DataDone:
SET r30.t14 // CS high

I could also use bitmasking for the data in, so that the timing is the same whatever the bit values are.

I’ll check what I see on the scope and decide later.

Regards,

Cedric