PRU Pattern bigger than 8k

Le_Costaouec_Vincent · June 23, 2016, 10:55am

Hello,

I’m trying to generate patterns with the PRU on several channels at the same time.
The objective is to send a different sequence of values on each of several PRU channels, with all channels updated simultaneously.

So far I have succeeded.
My program works like this:

1. I initialize the PRU (from a C file).
2. I load the 8k of data into the PRU RAM (from a C file).
3. I start the program on the PRU (from a C file).
4. The PRU code runs, and at the end control returns to the C file to close everything.

However, I have reached the 8k limit of the PRU RAM (which I understand).

Now I would like to load more than 8k of data, for instance 24k.

My idea is to initialize, send data into the PRU RAM, and execute the program.
Then, on an interrupt, return to the C file, reload data into the RAM, and re-execute the program.
I would repeat this routine until I reach the end of my 24k of data.

However, loading 8k of data into the RAM is really slow: it takes between 3 and 8 s.

So I would like to know if someone knows a quicker/easier way to send more than 8k of data to the PRU?

Thanks in advance
Regards
Vincent
" Enjoy life no matter what ! "

Charles_Steinkuehler · June 23, 2016, 1:55pm

There are lots of different ways you can do this. I'd probably put
the 24K of data into DDR memory (the PRUSS driver allocates a chunk of
memory you can use for this), use one of the PRUs to move the bulk
data to a PRU local ring buffer in the 12K shared PRU data memory, and
use the second PRU to do whatever it is you need to do with the data.

This assumes you don't need both PRUs for your task.

Charles_Steinkuehler · June 23, 2016, 2:02pm

Thinking about this a bit more, it's probably easier to just have a C
program on the ARM side do the writing into the PRU memory
ring-buffer. That leaves both PRUs available, and should be easier to
code as well. The only drawback is the ARM side code can get
interrupted by the kernel for potentially a long time (hundreds of
milliseconds on a non-real-time kernel). You don't mention your data
rate requirement (other than it's faster than "seconds"), but a
standard kernel should work if your 12K buffer holds about 0.2 seconds
worth of data. If the 12K buffer represents less time than that, you
will probably need a PREEMPT_RT (easiest) or Xenomai kernel to
guarantee the ARM side will keep up.

Le_Costaouec_Vincent · June 23, 2016, 2:43pm

First, thanks for your reply.

To answer your question, I am trying to load one data value every 25 ns.

I have a few questions regarding your reply:

Can I write into the PRU memory from a C program on the ARM side at that kind of speed? (So far, at least, I have managed it.)
In addition, can your solution load several data values at the same time, as I do using the properties of the PRU's R30 register?
Also, I am reading data at the same time (I didn’t mention it previously, to keep things clear). Can I do that with the solution “C program on the ARM side do the writing into the PRU memory ring-buffer”? Or would it be easier with your DDR memory solution?
Moreover, I didn’t know that the PRU has a ring buffer. Is it linked to section 4.4.1.2.3.3, 28-Bit Shift In, of the Technical Reference Manual, or is it something else?
Also, with your DDR memory solution, I have observed some huge delays when loading data from there with the LBBO command. Or have I just done something wrong?

Thanks again for your answer.

Regards
Vincent

DTJF · June 23, 2016, 6:35pm

Hello Vincent!

Charles_Steinkuehler · June 23, 2016, 7:02pm

First, thanks for your reply.

To answer your question, I am trying to load one data value every 25 ns.

I have a few questions regarding your reply:

  * Can I write into the PRU memory from a C program on the ARM side at that
    kind of speed? (So far, at least, I have managed it.)
  * In addition, can your solution load several data values at the same time, as
    I do using the properties of the PRU's R30 register?

Probably. If your data is Byte sized (pun intended), that's 40 MB/s,
or about 10 million 32-bit writes/sec (10 MHz). IIRC, the L4_Fast bus
used to communicate with the PRU runs at 100 MHz, so there's lots of
head room.
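The arithmetic behind these figures can be checked with a short sketch. It assumes, as the reply does, one byte per 25 ns sample; the function names are made up for illustration. Under that same assumption, a 12 KB buffer would hold roughly 307 µs of data, which is worth comparing against the 0.2 second figure mentioned earlier for a standard kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Samples per second for a given sample period in nanoseconds. */
uint64_t samples_per_second(uint64_t period_ns)
{
    assert(period_ns > 0);
    return 1000000000ULL / period_ns;
}

/* Bytes per second when every sample is bytes_per_sample wide
   (one byte per sample is an assumption taken from the reply). */
uint64_t bytes_per_second(uint64_t period_ns, uint64_t bytes_per_sample)
{
    return samples_per_second(period_ns) * bytes_per_sample;
}

/* How long, in nanoseconds, a buffer of buffer_bytes lasts at that rate. */
uint64_t buffer_duration_ns(uint64_t buffer_bytes, uint64_t period_ns,
                            uint64_t bytes_per_sample)
{
    assert(bytes_per_sample > 0);
    return (buffer_bytes / bytes_per_sample) * period_ns;
}
```

With a 25 ns period and one byte per sample this gives 40,000,000 bytes/s, i.e. 10,000,000 32-bit writes/s, and `buffer_duration_ns(12 * 1024, 25, 1)` evaluates to 307,200 ns.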

  * Also, I am reading data at the same time (I didn't mention it previously,
    to keep things clear). Can I do that with the solution "C program on the
    ARM side do the writing into the PRU memory ring-buffer"? Or would it be
    easier with your DDR memory solution?

You can read and write from the PRU memories at the same time. There
is no problem with the ARM writing to the PRU data memory at the same
time the PRU is reading it.

  * Moreover, I didn't know that the PRU has a ring buffer. Is it linked to
    section 4.4.1.2.3.3, 28-Bit Shift In, of the Technical Reference Manual, or
    is it something else?

A ring buffer is a software mechanism that allows two asynchronous
processes to communicate without having to use locks or semaphores, so
it is very fast. There are many ways to construct these depending on
exactly what you need to do (how many writers and readers there are),
and what sort of atomic transactions are supported by the hardware
you're running on. Google "lock free queue" and "lock free ring
buffer" for lots of details. I expect you will probably be OK with a
degenerate single writer, single reader ring-buffer, but you haven't
fully explained what you're doing and you keep adding details, so
you'll have to figure that out for yourself.
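As a rough illustration of the single-writer, single-reader case, here is a minimal sketch in C11. It is not the code used by anyone in this thread: the names `ring_t`, `ring_push`, `ring_pop` and the size `RING_SLOTS` are hypothetical, and on real hardware the reader would be PRU assembly polling the indices in PRU data memory rather than C atomics. Only the index logic carries over.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 256u  /* hypothetical size; must be a power of two */
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0, "RING_SLOTS must be a power of two");

typedef struct {
    uint32_t slots[RING_SLOTS];
    _Atomic uint32_t head;  /* advanced only by the writer */
    _Atomic uint32_t tail;  /* advanced only by the reader */
} ring_t;

void ring_init(ring_t *r)
{
    atomic_store(&r->head, 0u);
    atomic_store(&r->tail, 0u);
}

/* Writer side: returns false when the ring is full. */
bool ring_push(ring_t *r, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head & (RING_SLOTS - 1u)] = value;
    /* Publish the slot only after its contents have been written. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Reader side: returns false when the ring is empty. */
bool ring_pop(ring_t *r, uint32_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SLOTS - 1u)];
    /* Release the slot back to the writer. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock is needed: the writer never touches `tail` and the reader never touches `head`.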

  * Also, with your DDR memory solution, I have observed some huge delays when
    loading data from there with the LBBO command. Or have I just done
    something wrong?

PRU reads from DDR memory will stall until the data is returned, which
is one reason I suggested using the ARM core to write the data into
the PRU (the writes will post, so the ARM core can carry on doing
other things while the data actually gets written).

If you want to read data efficiently from DDR using the PRU, you
should use the LBBO command to read as large a block of data as
possible. The PRU ties into the same L3F on-chip fabric as the ARM
and GPU cores, so there's plenty of bandwidth, you just need to read
as much as possible at one time to reduce the read latency effects.

William_Hermans · June 23, 2016, 7:10pm

PRU reads from DDR memory will stall until the data is returned, which
is one reason I suggested using the ARM core to write the data into
the PRU (the writes will post, so the ARM core can carry on doing
other things while the data actually gets written).

I’ve never done this personally, but wouldn’t it be faster to write from the ARM core directly into the 12k PRU shared memory? Or is that what you’re proposing?

Charles_Steinkuehler · June 23, 2016, 7:12pm

That's what I'm proposing.

My comment about DDR reads was in response to a specific question of
why performance was so bad doing an LBBO from DDR memory.

William_Hermans · June 23, 2016, 7:27pm

That’s what I’m proposing.

My comment about DDR reads was in response to a specific question of
why performance was so bad doing an LBBO from DDR memory.

Cool. Yeah, I’ve done a lot of contemplation on this specific situation as well as other PRU-related things, and this just makes sense to me, after having seen you write that certain PRU operations can slow things down considerably. And going out over the L4 interconnect to DDR memory . . . seems like it’d be much slower. But you know, you have the hands-on experience, I do not. Sooner or later perhaps I can, and will, justify the time to start toying with the PRUs on a regular basis. But I’ve not needed anything “fast” yet.

I’d probably use the PRUs for a different purpose than what I imagine you’re using them for though, Charles. One idea I’ve had lately is to use two serial modules as an IPC mechanism between a supervisor-type service running as root and a regular-user serial side. If I implemented the serial ports in software by way of the PRUs, it could be really fast. Granted, I’m having a hard time thinking of anything that would require that speed (that I’d personally need).

John_Syn · June 23, 2016, 7:46pm

Yet another solution would be to place the 24K in DDR and use DMA to populate the 12K shared PRU ram using a ping pong arrangement. The PRU would read from one buffer while the DMA fills the other buffer. When the PRU has finished with the one buffer, it will switch to the other buffer and trigger the next DMA transfer. Starterware has EDMA code that can be adapted to work on the PRU.
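The ping-pong arrangement can be sketched in plain C. This is only a model of the control flow: `fill_half` stands in for an EDMA transfer (here a `memcpy`, which is not concurrent), the output array stands in for whatever the PRU does with each word, and `HALF_WORDS` is an arbitrary illustrative size. None of these names come from Starterware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HALF_WORDS 4u  /* hypothetical half-buffer size, in 32-bit words */

/* Stand-in for a DMA transfer: copies up to HALF_WORDS words from the
   large source (the "DDR" data) into one half-buffer and returns the
   number of words copied. */
size_t fill_half(uint32_t *half, const uint32_t *src, size_t src_len, size_t offset)
{
    assert(offset <= src_len);
    size_t n = src_len - offset;
    if (n > HALF_WORDS)
        n = HALF_WORDS;
    memcpy(half, src + offset, n * sizeof *src);
    return n;
}

/* Streams src through two half-buffers. Returns the number of words emitted. */
size_t stream_ping_pong(const uint32_t *src, size_t src_len, uint32_t *out)
{
    uint32_t buf[2][HALF_WORDS];
    size_t count[2] = {0, 0};
    size_t offset = 0, emitted = 0;
    unsigned active = 0;

    /* Prime the first half before the consumer starts. */
    count[active] = fill_half(buf[active], src, src_len, offset);
    offset += count[active];

    while (count[active] > 0) {
        unsigned idle = active ^ 1u;

        /* Start refilling the idle half (a real DMA would run in parallel). */
        count[idle] = fill_half(buf[idle], src, src_len, offset);
        offset += count[idle];

        /* Consume the active half. */
        for (size_t i = 0; i < count[active]; i++)
            out[emitted++] = buf[active][i];

        /* Swap roles: the freshly filled half becomes the active one. */
        active = idle;
    }
    return emitted;
}
```

In a real design the consumer would also have to confirm that the refill has completed before swapping, which this sequential model gets for free.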

Regards,
John

Le_Costaouec_Vincent · June 29, 2016, 12:53pm

First, thanks to all of you for your replies, and sorry for my delay.

I didn’t mention the reading part earlier because it seems to be working. To summarize, my code does only two things:

1. It sends a different sequence of values on each of several PRU channels, with all channels updated at the same time.
2. Once a value has been sent, it reads data (from register r31) and stores it in memory (in “/sys/class/uio/uio0/maps/map1/addr”).

→ It repeats this loop until all the data has been sent.

@Charles

So if it’s true that the L4_Fast bus used to communicate with the PRU runs at 100 MHz, I am really interested in it.
If I have understood correctly, that means I have to load the data into this location:

PRU_ICSS    0x4A30_0000    0x4A37_FFFF    512KB    PRU-ICSS Instruction/Data/Control Space

(extract from page 184 of the TRM : Table 2-4. L4 Fast Peripheral Memory Map (continued) )

Previously, I was using this instruction to load my data into the PRU RAM

prussdrv_pru_write_memory(PRUSS0_PRU0_DATARAM, 0, sequenceData, NUMBER_DATA);

So now I have to replace it with

prussdrv_map_extmem(sequenceData);
Is that right?

sequenceData is my data array, defined as follows:

unsigned int sequenceData[NUMBER_DATA];

Another question: in assembly, am I also supposed to load it from address 0x4A30_0000, or is there a padding that I need to add because of the PRU mapping?

When you say:

If you want to read data efficiently from DDR using the PRU, you
should use the LBBO command to read as large a block of data as
possible.

you mean use
LBBO (LBBO REG1, Rn2, OP(255), IM(124))

with IM=124, right?
Before I start, do you have an idea of the “read latency effects” that I should expect in this specific case?

@ TFJ
The message to Charles above should show you my method for loading the data into the DRam.
I measure the time between the moment I start loading the values into the DRam and the end of my program, and I subtract the time the sequence itself needs (observed on the oscilloscope).
I know this is not a really accurate measurement, but the result was already much more than I expected.

Charles_Steinkuehler · June 29, 2016, 4:06pm

@Charles

So if it's true that the L4_Fast bus used to communicate with the PRU runs at
100 MHz, I am really interested in it.
If I have understood correctly, that means I have to load the data into this
location:

PRU_ICSS    0x4A30_0000    0x4A37_FFFF    512KB    PRU-ICSS Instruction/Data/Control Space

(extract from page 184 of the TRM : Table 2-4. L4 Fast Peripheral Memory Map
(continued) )

Previously, I was using this instruction to load my data into the PRU RAM

prussdrv_pru_write_memory(PRUSS0_PRU0_DATARAM, 0, sequenceData, NUMBER_DATA);

So now I have to replace it with

prussdrv_map_extmem(sequenceData);

Is that right?

It depends on what you're trying to do. I don't typically use the PRU
library stuff for anything other than getting the code running on the
PRU, so I'm not the best person to ask.

sequenceData is my data array, defined as follows:

unsigned int sequenceData[NUMBER_DATA];

Another question: in assembly, am I also supposed to load it from address
0x4A30_0000, or is there a padding that I need to add because of the PRU
mapping?

I'm not sure what you're asking.

The PRU view of memory is documented in section 3.1.2, "Local Data
Memory Map", along with the documentation for the PMAO register in the
PRU CFG space.

The ARM view of the PRU subsystem is documented in section 3.2,
"Global Memory Map".
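To make the two views concrete, here is a small sketch for the 12 KB shared data RAM. The offsets are as I read them from the AM335x memory maps referenced above (data RAM 0 at offset 0x0000, data RAM 1 at 0x2000, shared RAM at 0x1_0000 in both the local and the global view); please verify them against the TRM for your device. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ARM-side base of the PRU-ICSS, from the L4 Fast memory map quoted above. */
#define PRUSS_ARM_BASE      0x4A300000u
/* Offsets inside the PRU-ICSS (assumed from the TRM memory maps). */
#define PRUSS_DRAM0_OFFSET  0x00000000u  /* 8 KB data RAM 0 */
#define PRUSS_DRAM1_OFFSET  0x00002000u  /* 8 KB data RAM 1 */
#define PRUSS_SHARED_OFFSET 0x00010000u  /* 12 KB shared data RAM */
#define PRUSS_SHARED_SIZE   0x00003000u

/* Physical address the ARM uses to reach a byte of the shared RAM. */
uint32_t shared_arm_address(uint32_t byte_offset)
{
    assert(byte_offset < PRUSS_SHARED_SIZE);
    return PRUSS_ARM_BASE + PRUSS_SHARED_OFFSET + byte_offset;
}

/* Address a PRU puts in the LBBO base register for the same byte:
   the PRU sees its subsystem memories at local addresses, without the
   0x4A30_0000 base. */
uint32_t shared_pru_address(uint32_t byte_offset)
{
    assert(byte_offset < PRUSS_SHARED_SIZE);
    return PRUSS_SHARED_OFFSET + byte_offset;
}
```

So the same word would be written by the ARM at 0x4A31_0000 and read by the PRU at local address 0x0001_0000, if these offsets hold.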

When you are saying :

    If you want to read data efficiently from DDR using the PRU, you
    should use the LBBO command to read as large a block of data as
    possible.

you mean use

LBBO (LBBO REG1, Rn2, OP(255), IM(124))

with IM=124, right?

Yes, the IM(124) value should be as large as possible to provide
high-speed reads. The maximum value will depend on how large a block
of free registers you can leave in your code.

Before I start, do you have an idea of the "read latency effects" that I
should expect in this specific case?

The read latency will basically be the same whether you're reading one
byte or the whole register file. That's why you want to read as much
at one time as you can. The actual read latency will vary depending
on the current state of the DDR memory (which pages are open) and how
heavily the ARM core and other on-chip resources are using the DDR.
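Why large bursts help can be seen with a toy cost model. The model is an assumption for illustration only: each LBBO pays a fixed latency, plus one cycle per 32-bit word moved. The latency value is a free parameter, not a measured number, and real latency varies as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated cycles to move total_bytes using bursts of burst_bytes each.
   latency_cycles is the fixed cost paid once per burst (placeholder value
   supplied by the caller); the per-word cost of one cycle is also an
   assumption of this model. */
uint64_t cycles_for_transfer(uint64_t total_bytes, uint64_t burst_bytes,
                             uint64_t latency_cycles)
{
    assert(burst_bytes > 0 && burst_bytes % 4 == 0);
    uint64_t bursts = (total_bytes + burst_bytes - 1) / burst_bytes;
    uint64_t words = (total_bytes + 3) / 4;
    return bursts * latency_cycles + words;
}
```

With a made-up latency of 40 cycles, moving 1240 bytes costs 12,710 cycles in 4-byte reads but only 710 cycles in 124-byte reads under this model; the fixed latency dominates unless each read is as large as possible.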

Le_Costaouec_Vincent · June 29, 2016, 4:25pm

It depends on what you’re trying to do. I don’t typically use the PRU
library stuff for anything other than getting the code running on the
PRU, so I’m not the best person to ask.

What I am trying to do is load the data into the L4_Fast bus, so that I can then read it from the assembly code.
So I would like to ask: how would you load data into the L4_Fast bus, if not with the library? Do you have an example, please?

Another question: in assembly, am I also supposed to load it from address
0x4A30_0000, or is there a padding that I need to add because of the PRU
mapping?

I’m not sure what you’re asking.

Here I was asking how you would load the data in assembly, because I’m guessing that you advise me to do it with LBBO (LBBO REG1,Rn2,OP(255),IM(124)).
I know my REG1, because it is where I will receive the result of this instruction, but with the L4_Fast bus I don’t know what I should pass as Rn2.
So I was wondering which address corresponds to the L4_Fast bus, so that I can use it with LBBO.

Thanks for your reply
If I can clarify anything else, do not hesitate to ask.

Thanks in advance
Regards
Vincent
" Enjoy life no matter what ! "

DTJF · June 29, 2016, 4:47pm

Not entirely correct. Pass a pointer and check for errors:

```c
if (prussdrv_map_extmem((void **)&sequenceData))
{ /* error handling here */ }

sequenceData[0] = 815; /* a leading zero would make this an invalid octal literal */
…
```

Note: the ERam has a default size of 256 kB. In order to get 1 MB, you’ll have to load the driver manually (8 MB max.). See this documentation (section ERam at the bottom) for details.

BR

DTJF · June 29, 2016, 4:58pm

You’ll need the physical memory address:

```c
unsigned int ESize = prussdrv_extmem_size() // optional
unsigned int EAddr = prussdrv_get_phys_addr(sequenceData)
```

Pass this EAddr to the PRU and load it into Rn2.

DTJF · June 29, 2016, 5:13pm

Just found another problem:

sequenceData is my data array, defined as follows:

unsigned int sequenceData[NUMBER_DATA];

Don’t declare an array; use a pointer instead:

```c
unsigned int *sequenceData;
```
And here’re the missing semicolons from my last post:
;;