unexpected "low speed" of PRU 1

Kasimir · May 12, 2021, 1:35pm

Hi,
I’m working on a sine-triangle modulator that runs on PRU 1 of a BeagleBone Black.
On the Linux/ARM side I calculate the pattern for one period as a data structure:
the pattern to output and the time to the next event.
The output is PRU 1 __R30, bits 0, 1, 2 and 3 (bit 4 is for debugging only, as an oscilloscope trigger).
It works … but the speed is not what I expected.
The output loop of the PRU is written in a few lines of assembler.
Frequencies: the triangle should be 400 kHz, better 800 kHz;
the sine wave is between 20 kHz and 100 kHz.
The BeagleBone has to drive a high-speed GaN H-bridge.

The data transport and handshake between Linux and the PRU work fine.
A C program on the PRU watches for new data. The new data (the pattern-time structure)
is then copied into local RAM to get the best speed (lowest latency).
Once the data is in local RAM, the assembler routine is called to output the given pattern. First the arguments are saved in registers,
then the output runs in a loop:
pick up the pattern from local RAM and output it,
load the delay count from local RAM,
run the delay loop,
update the index register,
check for possible new data,
if there is none, go back to the top and output the next period.

As I said … it works. But with a cycle time of 5 ns (1/200 MHz) and 1 cycle for most of the (ASM) instructions, I don’t see that speed.
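To put numbers on that expectation, here is a minimal sketch of the cycle budget, assuming the 200 MHz PRU clock (5 ns per cycle) quoted above. The function names are made up for illustration and are not part of any PRU library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: PRU core clock of 200 MHz, i.e. 5 ns per cycle. */
#define PRU_CLOCK_HZ 200000000u

/* Number of PRU cycles available in one period of a signal. */
uint32_t cycles_per_period(uint32_t signal_hz)
{
    assert(signal_hz != 0u);
    return PRU_CLOCK_HZ / signal_hz;
}

/* Duration of a number of PRU cycles in nanoseconds. */
uint32_t cycles_to_ns(uint32_t cycles)
{
    return cycles * (1000000000u / PRU_CLOCK_HZ);
}
```

Under these assumptions, cycles_per_period(400000) gives 500 cycles and cycles_per_period(800000) gives 250 cycles for one triangle period, so every event in the pattern has to fit into a small share of that budget.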

So there is something wrong in my setup or code.
If somebody would like to help with debugging, let me know.
Sources with Makefile etc. are available.

Everything is based on the latest Debian image, all updates are installed, and HDMI is off.

So let me know; I think it only makes sense to upload all of that if somebody is really able to help.

Thanks in advance
Kasimir

Mark_Lazarewicz · May 12, 2021, 3:54pm

The memory accesses will add some cycles. Post your assembler code with comments; you’re correct that it doesn’t make sense, and maybe someone will see the issue. The PRU labs discuss measuring cycle times in CCS if you have JTAG, but toggling a GPIO and measuring with a scope is probably easier.

Kasimir · May 12, 2021, 7:07pm

Hi Mark,
thank you very much for the quick response.
I’m going to post the ASM. Looking forward to your input.
Kasimir

Kasimir · May 12, 2021, 7:49pm

This is my code to output the pattern on __R30:
; ********************************************

        .global ausgabe
ausgabe:
        ldi     r18, 0                  ; initial value
        ldi     r30, 0x10               ; debug
        ldi     r17, 0x00               ; debug
        mov     r13, r15                ; R15 contains start address, save in R13
        mov     r12, r14                ; R14 contains number of data points
naechster:
        lbbo    &r30, r15, 4, 1         ; (r15) = pattern
        lbbo    &r17, r15, 0, 2         ; (r17) = time to wait before the next pattern is output
warte:
        sub     r17, r17, 1             ; delay loop
        qbne    warte, r17, 0           ;
        add     r15, r15, 5             ; next element, update pointer
        sub     r14, r14, 1             ; number of remaining elements - 1
        qbne    naechster, r14, 0       ; was it the last one?
        mov     r15, r13                ; yes, load address pointer with saved value
        mov     r14, r12                ; and load loop counter with saved number of elements
        lbbo    &r18, r16, 0, 1         ; load variable, if 0 run again, if != 0 exit
        or      r30, r30, (1<<4)        ; debug, trigger signal for oscilloscope
        qbeq    naechster, r18, 0       ; repeat as long as handshake[0] == 0
        jmp     r3.w2                   ; r3 contains return address
;*****************************************************************

The data structure:
typedef struct Event Event_t;
struct Event
{
    unsigned int  time;     // number of loops to the next event
    unsigned char pattern;  // Bit |  7   |  6   |  5   |  4   |  3   |  2   |  1   |  0   |
                            // ----+------+------+------+------+------+------+------+------+
                            //     |      |      |      |  d   | ~z34 | z34  | ~z12 | z12  |
                            // ----+------+------+------+------+------+------+------+------+
};

int main( int argc, char *argv[])
{
    int i;
    int j;
    Event_t event_knoten[500];
    …
    …
    ausgabe(pattern_liste.anzahl, &event_knoten[0].time, &handshake[0]); // asm routine, writes the pattern
                                                                         // as long as handshake[0] == 0

It works fine, except that the delay loop needs better resolution; at the moment the time for a single loop iteration is too long.
I have no idea how to optimize it.

Also from
or r30, r30, (1<<4) ; debug, trigger signal for oscilloscope
to
naechster:
lbbo &r30, r15, 4, 1 ; (r15) = pattern
I measure 250 ns … I was expecting 25 ns …

I can see some jitter on my oscilloscope (Tektronix THS730A). It has nothing to do with
the GND connection, long wires etc.; all of that is perfect, and the oscilloscope works fine.

Is it possible that something on the Linux/ARM side is disturbing my timing?

Thanks again for any helpful input.
Kasimir

Mark_Lazarewicz · May 12, 2021, 11:25pm

Hello Kasimir

I will take a look, and hopefully others who are using the PRU can also help. I began programming in asm many, many years ago but haven’t used the PRU assembler. Can you reply whether you have an oscilloscope or a high-speed logic analyzer? This is what we used to debug many years ago.

You could remove all memory accesses by hard-coding the data (modify your code), then just run a tight loop toggling a GPIO and measure the frequency.

This will tell you the maximum frequency of your GPIO.

Perhaps write some test code doing just that and share the results. Staring at source code isn’t always the fastest way to find an error, especially since we don’t have your exact setup.

In the meantime, hopefully someone sees something obvious. I’m sure the maximum frequency of what you are attempting has been discussed.

Maybe someone will comment on what they have achieved and share their solution.

Break the problem into pieces and resist the temptation to be drawn into detours, which can be challenging when getting input.

By running experiments you can stay busy while waiting for input from group members.

I hope that’s helpful

Mark

Kasimir · May 13, 2021, 6:36pm

Hi Mark,
I tried to use the LOOP instruction …

        .global ausgabe
ausgabe:
        ldi     r18, 0                  ; initialisation
        ldi     r30, 0x10               ; debug
        ldi     r17, 0x00               ; debug
        mov     r20, r15                ; save start address
        mov     r21, r14                ; save number of patterns
naechster:
        loop    next_pattern, r14       ; for each pattern
        lbbo    &r30, r15, 4, 1         ; output (r15) = pattern
        lbbo    &r17, r15, 0, 2         ; load number of delay loops
        loop    weiter, R17             ; delay loop
weiter:
        add     r15, r15, 5             ; increment address pointer by 5 ( next data structure element )
next_pattern:
        mov     r15, r20                ; load saved start address into address pointer
        mov     r14, r21                ; load saved number of patterns into pattern counter
        lbbo    &r18, r16, 0, 1         ; check for stop request
        or      r30, r30, (1<<4)        ; debug
        qbeq    naechster, r18, 0       ; if handshake[0] == 0 continue
        jmp     r3.w2                   ; otherwise return, r3 contains return address

DTJF · May 13, 2021, 6:45pm

It works fine, except that the delay loop needs better resolution; at the moment the time for a single loop iteration is too long.
I have no idea how to optimize it.

Twice as fast:

LOOP EndWait, R17.w0 // note: max 16 bit counter
EndWait:

Also from
or r30, r30, (1<<4) ; debug, trigger signal for oscilloscope
to
naechster:
lbbo &r30, r15, 4, 1 ; (r15) = pattern
I measure 250 ns … I was expecting 25 ns …

I can see some jitter on my oscilloscope (Tektronix THS730A). It has nothing to do with
the GND connection, long wires etc.; all of that is perfect, and the oscilloscope works fine.

Is it possible that something on the Linux/ARM side is disturbing my timing?

The LBBO &r30, r15, 4, 1 instruction needs at least 3+1 cycles (as long as the address in R15 is not in the PRU local memory map). And it may take additional cycles in case of heavy traffic on the L3 bus.

Note: for counting cycles you don’t need an oscilloscope. Instead you can use the CYCLE register (offset 0xC) in the PRUSS_PRU_CTRL register space.
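A minimal sketch of reading that counter from C, assuming only the 0xC offset mentioned above. The helper names are hypothetical; the base address of the control register block and the bit that enables the counter are not given in this thread and have to be taken from the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the CYCLE register inside the PRU control register block (0xC, as noted above). */
#define PRU_CTRL_CYCLE_OFFSET 0x0Cu

/* Read the cycle counter. ctrl_base must point at the start of the PRU control
 * register block; that address and the counter-enable bit are assumptions left
 * to the caller and must be checked against the reference manual. */
uint32_t pru_read_cycle(const volatile uint32_t *ctrl_base)
{
    assert(ctrl_base != 0); /* sanity check, mainly useful when testing on a host */
    return ctrl_base[PRU_CTRL_CYCLE_OFFSET / sizeof(uint32_t)];
}

/* Cycles elapsed between two readings (assumes the counter kept running in between). */
uint32_t pru_cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Reading the counter before and after the code under test and subtracting gives the elapsed cycles; at 200 MHz each cycle is 5 ns. The reads themselves also take a few cycles, so very short sections are best measured over several repetitions.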

Dimitar_Dimitrov · May 13, 2021, 6:46pm

Which assembler are you using? It should have warned you that “loop weiter” body must be at least two instructions, whereas you have zero.

Also, you cannot nest HW-assisted loops.

Regards,
Dimitar

DTJF · May 13, 2021, 6:50pm

Hi Kasimir, sorry my post overlapped.

Kasimir · May 13, 2021, 7:46pm

Hi, thanks to all.
So, here is a picture (from the first posted asm). The delay (r17) is always 1, so there is always exactly 1 loop iteration.
The patterns are 0-1-0-1-0-1- …

Channel 1 is __R30 bit 0 (pattern).
Channel 2 is __R30 bit 4, used as trigger.
The high time of bit 4 is > 200 ns … I can’t understand why.
The high/low time of the pattern is 450 ns … why?
If the cycle count for a register-to-register operation is 1 and for a DRAM access is 3 … it should be 45 ns …
I do not understand why I can’t see the 200 MHz speed of the PRU.
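As a rough cross-check of that estimate, the following sketch counts the cycles for one pass through the "naechster" loop of the first listing. It is a model under the assumptions stated in the comments, not a measurement, and the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (taken from the discussion, not measured):
 *   - every register-to-register instruction and branch costs 1 cycle
 *   - each lbbo costs load_cycles cycles
 *   - 1 cycle = 5 ns (200 MHz)
 */
#define NS_PER_CYCLE 5u

/* Estimated cycles for one event: 2 loads, delay_loops * (sub + qbne),
 * then add + sub + qbne to advance to the next element. */
uint32_t event_cycles(uint32_t delay_loops, uint32_t load_cycles)
{
    assert(delay_loops >= 1u); /* the sub/qbne loop always runs at least once */
    return 2u * load_cycles + 2u * delay_loops + 3u;
}

/* The same estimate expressed in nanoseconds. */
uint32_t event_ns(uint32_t delay_loops, uint32_t load_cycles)
{
    return event_cycles(delay_loops, load_cycles) * NS_PER_CYCLE;
}
```

With a delay of 1 and 3-cycle loads this model gives 11 cycles, or 55 ns, which is the same order as the 45 ns estimated above and far below the measured 450 ns. To reach roughly 450 ns in this model, each load would have to cost about 42 to 43 cycles, which points at the memory accesses rather than the register instructions.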

What do you think?
Kasimir

Dennis_Bieber · May 13, 2021, 7:46pm

Which assembler are you using? It should have warned you that "loop weiter"
body must be at least two instructions, whereas you have zero.

Sliding into the thread...

From the manual:
"""
Hardware Loop Assist (LOOP) Defines a hardware-assisted loop operation. The
loop is noninterruptible (LOOP). The loop operation works by detecting when
the instruction pointer would normal hit the instruction at the designated
target label, and instead decrementing a loop counter and jumping back to
the instruction immediately following the loop instruction.
"""

So, yes... the loop encounters the target label... and jumps back to...
the target label as there is no intervening opcode to use as a target for
the jump. Might be optimized out completely unless one puts at least a NOP
instruction inside -- though the next comment probably voids all
consideration.

Not sure of the "at least two instructions" -- seems one, with label on
the next (outside of loop) instruction, would be viable. PC would hit
label, so jump back to the (one) instruction following LOOP statement.

Also, you cannot nest HW-assisted loops.

A critical item to consider...

Kasimir · May 13, 2021, 8:03pm

Hi Dennis,
thanks for the information … I’m currently using the first version of the asm, without LOOP,
because something else is wrong here: the timing is off by a factor of 10 to 15 …
I think I can use only one LOOP, for the timing. If I have to insert a NOP … then there is no advantage.
I have been stuck on this point for a week now with no progress.

I’m thinking about a hardware solution with two DDS devices from Analog Devices: one for the triangle and one for sine and -sine, plus a comparator … done.
But then the BeagleBone / Sitara CPU no longer makes sense.
I like the BeagleBone; I have made a lot of nice things with it and it works fine. But now I need the power of the PRU, and I do not see the performance.
Maybe my code is not placed in internal memory … there are many ways to do things wrong …

Thanks again
Kasimir

Kasimir · May 13, 2021, 8:34pm

Just a moment ago I was standing on the cliff’s edge; now I have made a big step forward …

I’m able to generate a 10 ns trigger pulse on __R30 bit 4 :-)).
I added an AND instruction to clear bit 4. Now it’s clear: both indirect loads (lbbo &r…) are
responsible for the unexpected delay. I was expecting both to operate from DRAM with a
latency of 3 cycles. What is wrong? The data structure is expected to be in local RAM, to get the best latency.
In C it’s declared this way:
typedef struct Event Event_t;
struct Event
{
    unsigned int  time;     // number of loops
    unsigned char pattern;  // Bit |  7   |  6   |  5   |  4   |  3   |  2   |  1   |  0   |
                            // ----+------+------+------+------+------+------+------+------+
                            //     |      |      |      |      | ~z34 | z34  | ~z12 | z12  |
                            // ----+------+------+------+------+------+------+------+------+
};
int main( int argc, char *argv[])
{
    int i;
    int j;
    unsigned char u;
    Event_t event_knoten[100]; // later on, r15 is pointing to that address
    …
    …
    …
    ausgabe(pattern_liste.anzahl, &event_knoten[0].time, &handshake[0]);

***************** change to debug the delay in assembler *******************

naechster:
        and     r30, r30, 0xEF          ; debug
        lbbo    &r30, r15, 4, 1         ; (r15) = pattern <= slow
        lbbo    &r17, r15, 0, 2         ; load number of loops <= slow

Any hint on how to make the lbbo &r… faster?
I’m looking forward to your replies.

Kasimir

Mark_Lazarewicz · May 13, 2021, 8:48pm

Have you seen the PRU Support Package examples?
I saw examples of linker placement in shared RAM.

In the example below, the C variable is placed in local RAM by default.

What is the smallest pulse period you require for your application?

#include <stdint.h>
#include <pru_cfg.h>

volatile register uint32_t __R30;

void main(void)
{
    volatile uint32_t gpio;

    /* Clear SYSCFG[STANDBY_INIT] to enable OCP master port */
    CT_CFG.SYSCFG_bit.STANDBY_INIT = 0;

    /* Toggle GPO pins */
    /* Note: 0xFFFF_FFFF toggles all GPO pins */
    gpio = 0xFFFFFFFF;

    /* TODO: Create stop condition, else it will toggle indefinitely */
    while (1) {
        __R30 ^= gpio;
        __delay_cycles(100000000);
    }
}


Kasimir · May 13, 2021, 10:24pm

Hi all,

it’s SOLVED.
Thanks for all your input.
The problem was in the memory allocation.

I was not using the PRU DRAM. The external RAM is very slow, and I also saw some jitter.
Now it’s running at the expected speed and I’m happy.

I was expecting local variables to be in local memory by default. That’s not the case.
Thanks again
Kasimir

Mark_Lazarewicz · May 13, 2021, 10:44pm

Great news

Can you share how it ended up in external RAM?
An incorrect linker command file?

Mark

Kasimir · May 13, 2021, 10:56pm

Hi Mark,
simpler than that … it was in the C source.
My data structure was not in internal RAM.
volatile Event_t *event_knoten = (Event_t *) (PRU0_DRAM + 0x200);
and in main
event_knoten = (Event_t *)malloc(100*sizeof(Event_t));

solved it.
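The idea of pointing the event table at a fixed offset inside a known RAM block can be sketched as follows. This is only an illustration under assumptions: the packed attribute, the fixed-width types and the helper name event_table_at are not from the original code, and the thread does not show how the PRU build lays out the struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed so that one element is 5 bytes, matching the "add r15, r15, 5" stride
 * in the assembler. The packed attribute is a GCC/Clang extension and an
 * assumption made for this sketch. */
typedef struct __attribute__((packed)) Event {
    uint32_t time;     /* number of delay loops to the next event */
    uint8_t  pattern;  /* output bits for __R30 */
} Event_t;

/* Return a pointer to an event table that starts `offset` bytes into a RAM block.
 * On the PRU, ram_base would be whatever base the build uses for the data RAM
 * (PRU0_DRAM in the post above); on a host it can be any buffer. */
volatile Event_t *event_table_at(void *ram_base, size_t offset)
{
    assert(ram_base != NULL);
    return (volatile Event_t *)((uint8_t *)ram_base + offset);
}
```

With the table addressed this way, the pattern byte of element n sits at byte offset + 5 * n + 4 from the base, which is what the lbbo with offset 4 in the assembler loop expects.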

Kasimir

Mark_Lazarewicz · May 14, 2021, 12:11am

Hi Kasimir

What’s wrong with the line below?

My datastructure was not in internal ram.
volatile Event_t *event_knoten = (Event_t *) (PRU0_DRAM + 0x200);

IMO

I think placing anything in a guaranteed memory area is best done with sections from the linker command file.

There are examples of placing data in PRU shared RAM in the examples I mentioned.

Yes, external DDR RAM, yikes; the ARM is caching it.

Glad you’re rolling.

Mark_Lazarewicz · May 14, 2021, 12:40am

Never mind, I think I understand now. PRU0_DRAM needs to be an address from the linker command file; then that statement might work.

Anyway, linker command files have always been a murky science that I might play around with. I use JTAG, so an incorrect address is something you can catch quickly.
Unfortunately, from my recent research, using RPMsg isn’t a good match with JTAG debugging.

I’m interested in what memory RPMsg carves out for its use.

It seems that between the PRU shared RAM and the unused PRU0_DRAM (if only one PRU is in use), one can squeeze out additional RAM when resources are tight. That’s why I’m interested in researching the linker command files further.

These PRUs are limited in resources; it’s like using a small 8-bit processor from 20 years ago and squeezing every possible byte out.

Back in the day some guys got job security by using so many tricks to steal memory that their code was unmaintainable. They liked that the boss couldn’t get rid of them, because changing the software would break the entire application.

Ahh, I digress.

Mark

Kasimir · May 14, 2021, 8:42am

Hi Mark,

prudebug helped a lot. I’m missing a good debug environment for PRU development.
Up to now it has been time-consuming trial and error.
It’s easier to use an FPGA on top of a Raspberry Pi, or to use the second core of an ESP32 for dedicated high-speed functions. In the end I want to use the CPU in my own hardware; the BeagleBone is my “emulator” and debug environment.
The big value of the Sitara CPU is its PRUs. I think prudebug should be enhanced.
Have a great day
Kasimir