What happens on a read/write conflict on the PRU scratchpad?

Hello again,

My application uses both PRUs:
PRU0 controls four PWM outputs and some digital outputs, and it has tight timing constraints. In short, the fewer assembly instructions it runs on the normal execution path, the better.
PRU1, on the other hand, reads some inputs and handles the interface between PRU0 and the ARM. It implements a sort of ring buffer with the ARM side and transfers new data to PRU0 via the scratchpad. Apart from the number of tasks, there is plenty of time to spare here.
PRU0 will send an interrupt to PRU1 every time it reads the scratchpad. I could also let PRU1 interrupt PRU0 to signal that new data is ready, but doing so would waste at least two or three cycles.

What I would like to know is what happens when there is a conflict on the scratchpad. Let's say PRU1 is writing some registers to scratchpad bank 0 while, at the same time, PRU0 is loading the same registers from the same bank.

Does anyone know what happens in this case?
Will PRU0 or PRU1 stall and wait for the other to complete the access?
Will PRU0 read the previous data?
Will PRU1 read corrupted data?

PS: Section 5.2.4.2 of am335xPruReferenceGuide.pdf explains what happens if the two PRUs try to write at the same time. That is not my case, as PRU0 will never write anything to the scratchpad.

Hey Carlos,

So, let me say up front that I have zero hands-on experience with the PRUs. But I experienced something similar recently with Linux shared memory using mmap, which may not be as fast as the PRUs but was fast enough to produce output that would “flood” and crash Firefox within seconds.

Will PRU0 or PRU1 stall and wait for the other to complete the access?
Will PRU0 read the previous data?
Will PRU1 read corrupted data?

I would think that none of these cases would be desirable. But in the case of the second and third questions, both could be possible, i.e. undefined behavior. As for the first question, I’m not sure exactly what you mean, but I did experience something similar with POSIX shared memory.

To explain further: there were two processes, one reading and one writing. The process doing the writing had more to do and hence was not as fast as the reading process. In this case, the reader literally locked out the writer, since it was able to access the file so fast. The end result was that the writer process was able to write to this memory once, after which it was completely locked out by the reader process.

PRU0 will send a interrupt to PRU1 every time it read the scratchpad. Also I can let PRU1 interrupt PRU0 to signal that the new data are ready, but doing so will waste at least two or three cycles.

Is ~15 ns that important, versus not knowing what may happen?

Anyway, there are a few people here who know the PRUs very well. If they answer your question, it may render what I say here moot. But if you really need deterministic operation, “waste” the 3 cycles per loop iteration. It will probably make your life much easier, and your data much more reliable :)

Hi Willian,
Thank you for your insights.

I am thinking about how to write test code to determine exactly what happens, but it is not trivial for me.

The point about POSIX shared memory is, in my understanding, valid only to some extent. It takes one cycle for the PRU to read from or write to the scratchpad. That is, just one cycle to back up the entire (or partial) contents of the registers and another cycle to get them back (on the same or on the other PRU).

To clarify, my first guess is that PRU0 will be blocked until PRU1 completes the write, so its read will take two cycles. The effect on the PWM outputs would be 5 ns of jitter.

The 15ns are important as long as my third guess is proved to be impossible. This conflict is unlikely to happen and so I can live with either a 5ns jitter or a repeated sample. But I can not allow for a corrupted reading that may even interpret a signal to shutdown my pwm module.
Well, I am just trying to push the PRU to its limits. I got four PWMs (10 bits) and three output pins driven at 19531 samples per second with no software-induced jitter (every execution path takes exactly the same time, so any jitter is hardware related). Including the interrupt control drops the sample rate to 15024 samples per second or a little less. Of course, this is all in vain if I cannot get the ARM, running at 1 GHz, to process the inputs and generate a control action at this rate, even with a high-priority task.
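
As a rough cross-check of these numbers: the two sample rates are consistent with a PWM step costing 10 PRU cycles without the interrupt handling and 13 cycles with it, assuming the usual 200 MHz PRU clock (5 ns per cycle) and 2^10 steps per PWM period. The 10- and 13-cycle figures are inferred from the quoted rates, not stated in the post, and the function names are only illustrative.

```python
PRU_CLOCK_HZ = 200_000_000  # assumed PRU clock: one cycle = 5 ns


def pwm_sample_rate(cycles_per_step, bits=10, clock_hz=PRU_CLOCK_HZ):
    """PWM periods per second when each of the 2**bits steps costs cycles_per_step."""
    return clock_hz / (cycles_per_step * 2 ** bits)


def cycles_to_ns(cycles, clock_hz=PRU_CLOCK_HZ):
    """Convert a number of PRU cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


if __name__ == "__main__":
    print(int(pwm_sample_rate(10)))  # 19531 (no interrupt handling)
    print(int(pwm_sample_rate(13)))  # 15024 (three extra cycles per step)
    print(cycles_to_ns(1), cycles_to_ns(3))  # 5.0 15.0
```

Seen this way, the three extra cycles cost 15 ns per PWM step, which adds up to roughly 23% of the sample rate.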

The 15ns are important as long as my third guess is proved to be impossible. This conflict is unlikely to happen and so I can live with either a 5ns jitter or a repeated sample. But I can not allow for a corrupted reading that may even interpret a signal to shutdown my pwm module.

And this is what could happen if one PRU is reading while another is writing.

As for blocking versus non-blocking, I honestly do not know. But my instinct is that if it is not explicitly documented as a blocking call, it will be non-blocking.

The way you put it made complete sense. I will probably do it with some interrupt handling; better safe than sorry.
But being a hardware guy, this raises a philosophical question: Why handle the case when two simultaneous writes are performed and not even consider this other case? Maybe because there are interrupts to handle it? Well, then why not use those same interrupts to block the simultaneous writes as well?

However well (or not) it is handled in hardware, the problem is well
known in operating systems, which have two (or more) effective
simultaneous processes going on at the same time. Synchronization
between processes becomes critical under certain circumstances. You
may have that kind of problem.

Consider what happens when you write a string of bytes/words rather
than a single one. What happens if that's interrupted halfway
through? In an operating system, it can happen.

Harvey

Yes, so for example it is possible to get half a good integer and half a corrupt integer. However, that should not be a problem on bare metal, as is (or may be) the case with the PRUs, since they operate completely independently of the A8’s OS.
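
To make the half-and-half case concrete, here is a minimal Python sketch that enumerates every way a two-step write and a two-step read of the same two-word value can interleave. It is a model of a transfer that takes several separate accesses. It does not describe a single XIN/XOUT, which according to this thread moves the registers in one cycle. The function name is made up for the example.

```python
from itertools import combinations


def interleaved_reads():
    """Return what the reader sees for each interleaving of 2 writes and 2 reads."""
    results = []
    # Choose which 2 of the 4 time slots belong to the writer.
    for writer_slots in combinations(range(4), 2):
        memory = ["old", "old"]  # two words forming one logical value
        seen = []
        written = 0
        for slot in range(4):
            if slot in writer_slots:
                memory[written] = "new"  # writer updates word 0, then word 1
                written += 1
            else:
                seen.append(memory[len(seen)])  # reader reads word 0, then word 1
        results.append(tuple(seen))
    return results


if __name__ == "__main__":
    for result in interleaved_reads():
        label = "torn" if result[0] != result[1] else "consistent"
        print(result, label)
```

In this model two of the six interleavings give the reader one new word and one old word, which is the mixed value described above.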

So Carlos, one option you may be able to use is a Boolean “tag”. The first byte in memory starts off as either 0 or 1. This value tells the PRUs which one has access “rights” to the shared memory. Once the PRU that holds the access right is done with its operation, it writes the value that allows the other PRU to access the memory.

However, this introduces a compare and an additional byte write, which may use up at least as much overhead as you’re trying to avoid. However this “overhead” is somewhat mitigated by the fact that it is only on the PRU that is waiting for access. A write should only be one cycle; a compare, I’m less sure about.

However this “overhead” is somewhat mitigated by the fact that it is only on the PRU that is waiting for access. Write should only be one cycle, a compare, I’m less sure about.

I was not very clear here. What I meant is that the cost of this kind of implementation is mitigated because half of the process is done on each PRU.

So if a write is 1 cycle and a compare is 1 cycle, you’re reducing the delay to 1 cycle per core instead of 3 cycles per core.
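
The ownership tag described above can be sketched as a small Python model. This is only an illustration of the hand-off logic, not PRU code: the names `WRITER`, `READER`, and `run_handoff` are invented here, and the schedule string simply says which side gets to run at each step.

```python
WRITER, READER = 0, 1  # values of the one-byte ownership tag


def run_handoff(schedule, values):
    """Simulate the tag protocol; schedule is a string of 'W' and 'R' steps."""
    shared = {"owner": WRITER, "data": None}
    pending = list(values)
    received = []
    for who in schedule:
        if who == "W" and shared["owner"] == WRITER and pending:
            shared["data"] = pending.pop(0)  # write the payload first
            shared["owner"] = READER         # then hand the memory over
        elif who == "R" and shared["owner"] == READER:
            received.append(shared["data"])  # read the payload first
            shared["owner"] = WRITER         # then hand the memory back
        # A side that does not own the memory does nothing this step.
    return received


if __name__ == "__main__":
    print(run_handoff("WWRRWRRWWR", [1, 2, 3]))  # [1, 2, 3]
```

Each value is delivered exactly once and in order, whatever the schedule. The price is that a reader that does not own the memory gets nothing at all, so a PWM loop using this scheme would have to keep its previous value itself.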

But being a hardware guy, this raises a philosophical question: Why handle the case when two simultaneous writes are performed and don’t even care in this other case? Maybe because there are interruptions to handle this? Well, then why not just use this same interrupts to also block the simultaneous writes?

Again, I do not know the hardware that well. I have a pretty good rough idea of what the PRUs are capable of, from a high level. That’s about it. It is entirely possible that each core can block the other while accessing this shared memory. I do not know one way or the other.

All I am saying is that from a software perspective, plus what little I do know about hardware of this type, this makes no sense, as operations of this sort could introduce added complexity and operational overhead.

However, one thing to keep in mind: simultaneous writes are guaranteed corruption, especially if another processor were somehow reading between the two writes. A simultaneous read and write will not necessarily introduce corruption in the data. For this, a couple of things to keep in mind: if writes and reads take the same amount of time, reads are nearly guaranteed to be “safe”, although perhaps redundant. It is also, as I mention above, fairly “trivial” to implement a read/write mechanism, or to give each PRU exclusive “rights” to the shared memory at any given point in time.

Hi Carlos,

have you looked at the PruReferenceGuide section 5.2.4.2 (p.34-35)? Let me copy paste here:

A collision occurs when two XOUT commands simultaneously access the same asset or device ID. Table 20 shows the priority assigned to each operation when a collision occurs. In direct connect mode (device ID 14), any PRU transaction will be terminated if the stall is greater than 1024 cycles. This will generate the event pr<1/0>_xfr_timeout that is connected to INTC.

Table 20. Scratch Pad XFR Collision Conditions

Operation: PRU<n> XOUT (→) bank[j]
Collision handling: If both PRU cores access the same bank simultaneously, PRU0 is given priority. PRU1 will temporarily stall until the PRU0 operation completes.

Operation: PRU<n> XOUT (→) PRU<m>
Collision handling: If PRU<n> executes XOUT before PRU<m> executes XIN, then PRU<n> will stall until either PRU<m> executes XIN or the stall is greater than 1024 cycles.

Operation: PRU<n> XIN (←) PRU<m>
Collision handling: If PRU<n> executes XIN before PRU<m> executes XOUT, then PRU<n> will stall until either PRU<m> executes XOUT or the stall is greater than 1024 cycles.
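
The direct connect rows of the table can be read as a simple rule: whichever PRU arrives first waits for the other, and the wait is cut off after 1024 cycles. The Python sketch below is a simplified model of that reading, not of the exact hardware timing. The function name and the returned tuple are assumptions made for the example.

```python
STALL_LIMIT = 1024  # cycles, from the reference guide text quoted above


def direct_connect_stall(xout_cycle, xin_cycle, limit=STALL_LIMIT):
    """Return (who_stalls, gap_cycles, timed_out) for one XOUT/XIN pair.

    xout_cycle: cycle at which the sending PRU executes XOUT with device 14
    xin_cycle:  cycle at which the receiving PRU executes XIN with device 14
    """
    if xout_cycle == xin_cycle:
        return (None, 0, False)  # both arrive together, nobody waits
    who = "sender" if xout_cycle < xin_cycle else "receiver"
    gap = abs(xin_cycle - xout_cycle)
    # A stall longer than the limit terminates the transaction and would
    # raise the xfr_timeout event described in the quoted text.
    return (who, gap, gap > limit)


if __name__ == "__main__":
    print(direct_connect_stall(0, 10))    # ('sender', 10, False)
    print(direct_connect_stall(2000, 0))  # ('receiver', 2000, True)
```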

I used the direct XOUT / XIN with device ID=14 to synchronize the two PRU’s. There were no unexpected problems, everything like described in the manual.

Let me know if this wasn't your problem. Bests, Lenny

Oh sorry, I didn’t see your PS :wink:

In case that one PRU reads from the scratchpad and in the same cycle the other PRU writes to it, I am pretty sure that there will be no conflicts. It is the standard behaviour when you program sequential logic with a hardware description language. However, the read operation will yield the “old” data from the scratchpad, that is the ones from before the write operation.
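
The sequential-logic behaviour described here can be modelled in a few lines of Python: within one clock cycle a read returns the value held before the clock edge, and a write in the same cycle only becomes visible afterwards. This is a sketch of the expectation stated above, under the assumption that the scratchpad behaves like a clocked register. The class and method names are invented for the example.

```python
class ClockedRegister:
    """Toy model of a register that is written on the clock edge."""

    def __init__(self, value=0):
        self.value = value

    def clock(self, write=None, read=False):
        """Advance one cycle; return the value read in this cycle, if any."""
        seen = self.value if read else None  # a read sees the pre-edge value
        if write is not None:
            self.value = write               # a write commits at the edge
        return seen


if __name__ == "__main__":
    scratchpad = ClockedRegister(1)
    print(scratchpad.clock(write=2, read=True))  # 1: the old data
    print(scratchpad.clock(read=True))           # 2: new data one cycle later
```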

But as you have lots of time to waste on one PRU, it might be a better idea to use direct PRU transfer (XIN/XOUT with device 14), without the scratchpad in between. Let's say your PRU0 (the time-critical one) executes N instructions per loop. Then just let PRU1 update the important registers every M cycles (with M<N), so that the data for PRU0 will always be fresh and PRU0 will never have to wait for PRU1. The advantage is that your PWM will run perfectly deterministically, that is, there will never be any irregularities in the output signal due to the interrupt. The data will be updated every PWM cycle, and all this only costs you one instruction per PWM loop cycle.

Hi Lenny. Sorry for the delay.
Usually, PRU1 will have new data received from the ARM before PRU0 completes one PWM sample; this is the key point. If the ARM occasionally cannot deliver a control action in time, the PWM should just repeat the last values, so reading the old data is desirable.
I really had not thought of using direct PRU transfer, and I don’t know why. Maybe because on my first readings it seemed unsafe. I will give it a try, thank you for the idea.

Hello everyone. This is just an update.

I tried the direct connection mode, but it is more suitable for syncing the two PRUs. Anyway, my previous approach will work. As Lenny said:

In case that one PRU reads from the scratchpad and in the same cycle the other PRU writes to it, I am pretty sure that there will be no conflicts. It is the standard behaviour when you program sequential logic with a hardware description language. However, the read operation will yield the “old” data from the scratchpad, that is the ones from before the write operation.

That’s exactly what happens. No extra delays or any type of conflict.

Thank you Lenny, and everyone else.

Carlos Novaes

PS: In case it is of interest to someone, here is my test experiment:
I had PRU0 and PRU1 running with the cycle register enabled and counting clock cycles. On each of twenty iterations, the cycle register was read and stored into one of the registers r0 to r19. This was done on both PRUs.
On PRU0 I also read r23 from the scratchpad and then stored its lower word (r23.w0) into the upper word (rx.w2) of one of r0 to r19. The total cycle count is 6 per iteration.
On PRU1 I stored the lower word of r23 into the upper word of one of r0 to r19 (according to the iteration), incremented r23, and stored it to the scratchpad. The total cycle count is 7 per iteration.
Then, on both PRUs, r0 to r19 are written to the shared RAM and the ARM is signaled.
On the ARM side, the program waits for the signals from PRU0 and PRU1, reads the shared RAM (data from both PRUs), calculates the cycle offset of each iteration, and prints the results. There is no stall on either PRU: all iterations take exactly 6 cycles on PRU0 and 7 cycles on PRU1. Every 7 iterations, PRU0 repeats the previous value of r23.
Here is the console output:
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

PRU0 :::::: PRU1 :
Value : Cycle : Cycle Offset :::::: Value : Cycle : Cycle Offset :
1 : 58960: ------------------ :::::: 1 : 16: ------------------ :
2 : 58966: 6:::::: 2 : 23: 7:
3 : 58972: 6:::::: 3 : 30: 7:
4 : 58978: 6:::::: 4 : 37: 7:
5 : 58984: 6:::::: 5 : 44: 7:
6 : 58990: 6:::::: 6 : 51: 7:
7 : 58996: 6:::::: 7 : 58: 7:
7 : 59002: 6:::::: 8 : 65: 7:
8 : 59008: 6:::::: 9 : 72: 7:
9 : 59014: 6:::::: 10 : 79: 7:
10 : 59020: 6:::::: 11 : 86: 7:
11 : 59026: 6:::::: 12 : 93: 7:
12 : 59032: 6:::::: 13 : 100: 7:
13 : 59038: 6:::::: 14 : 107: 7:
13 : 59044: 6:::::: 15 : 114: 7:
14 : 59050: 6:::::: 16 : 121: 7:
15 : 59056: 6:::::: 17 : 128: 7:
16 : 59062: 6:::::: 18 : 135: 7:
17 : 59068: 6:::::: 19 : 142: 7:
18 : 59074: 6:::::: 20 : 149: 7
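
The pattern in the Value column can be reproduced with a small cycle-by-cycle Python model: a reader that samples every 6 cycles, a writer that stores an incremented value every 7 cycles, and a read that returns the old value when both land on the same cycle. The relative phase between the two loops is not known from the output (the two cycle counters are not synchronised), so `read_phase=7` below is an assumption chosen to line up with the posted table.

```python
def simulate_reads(reads=20, read_period=6, write_period=7, read_phase=7):
    """Return the values a 6-cycle reader sees from a 7-cycle writer."""
    scratchpad = 0
    next_value = 1
    seen = []
    cycle = 0
    while len(seen) < reads:
        # The read is evaluated first, so on a shared cycle it sees old data.
        if cycle >= read_phase and (cycle - read_phase) % read_period == 0:
            seen.append(scratchpad)
        # The write commits at the end of the cycle.
        if cycle % write_period == 0:
            scratchpad = next_value
            next_value += 1
        cycle += 1
    return seen


if __name__ == "__main__":
    print(simulate_reads())
```

With this phase the repeated values (7 and 13) fall exactly on cycles where the read and the write coincide, and neither loop ever stalls. With other phases the reader still repeats one value every 7 reads simply because it runs faster than the writer, so a repeat alone does not prove that a collision took place.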

Hello Carlos,

Thanks for sharing. Personally I’m always interested in what others are doing, and like to see “progress reports”. Not that anyone has to report anything to me personally, but I still like reading about what others are doing.

I expect some day in the future I will be investing some time getting to know the PRU’s as well. But as a hobby, I have the luxury of doing so, when I get around to it :wink: Anyway, I’ve always found the PRUs interesting . . .maybe that will be my next “pet” project ?

Cool, thanks for the update, Carlos! I thought a little about your problem and was wondering whether you are really at the limit of the PRU's capability:
You say you have roughly 20 kSps with 10-bit resolution for four outputs, so roughly 20 million updates of the output pins per second. That means you need roughly 10 instructions for changing the PWM outputs to their next value. What do you think of the following implementation? It is not optimal, but I think its resolution should be comparable to yours while running roughly 10x faster:

START:
XIN 10, r0, 120 // fetch r0..r29 from scratchpad 0 - each register holds 8 x 4 bits, where bits 0, 4, 8, 12 define four consecutive values for output 0, bits 1, 5, 9, 13 for output 1, and so on
AND r30.b0, r0.b0, 0x0F // move bits 0:3 of r0.b0 to r30.b0 (the other bits of r30.b0 are set to zero by the AND) - here it is assumed that your output pins correspond to r30[0:3]; that can easily be adapted to any other consecutive 4 bits by using an LSL instruction
LSR r30.b0, r0.b0, 4 // move bits 4:7 of r0.b0 to r30.b0 (the other bits of r30.b0 are set to zero by the LSR)
AND r30.b0, r0.b1, 0x0F // same for r0.b1
LSR r30.b0, r0.b1, 4
// repeat these two instructions for all other bytes up to...
AND r30.b0, r29.b3, 0x0F
LSR r30.b0, r29.b3, 4
// up to here, 30*8 = 240 sets of 4 bits were written to the output
XIN 11, r0, 120 // fetch data from scratchpad 1
// repeat the write operations
XIN 12, r0, 120 // fetch data from scratchpad 2
// repeat the write operations
XIN 14, r0, 120 // fetch data from the other PRU
// repeat the write operations
JMP START
// here a total of 4*240 = 960 writes to the 4 digital outputs r30[0:3] have been executed. So we already have about 9.9 bits of resolution and - given an overhead of 5 cycles - an update rate of 200MHz*960/965 = roughly 199MHz. That is, the actual output will be roughly 200kSps.
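
The figures in that last comment can be checked with a short calculation. The Python sketch below uses only the numbers given in the post (30 registers per XIN, 8 nibbles per register, 4 data sources, 5 cycles of overhead, a 200 MHz clock). The function name and the dictionary keys are made up for the example.

```python
import math


def frame_budget(regs=30, nibbles_per_reg=8, sources=4,
                 overhead_cycles=5, clock_hz=200e6):
    """Steps, resolution, and loop rate for the unrolled AND/LSR output loop."""
    steps = regs * nibbles_per_reg * sources  # one output write per nibble
    loop_cycles = steps + overhead_cycles     # overhead figure from the post
    return {
        "steps": steps,
        "resolution_bits": math.log2(steps),
        "loop_cycles": loop_cycles,
        "frames_per_second": clock_hz / loop_cycles,
    }


if __name__ == "__main__":
    budget = frame_budget()
    print(budget["steps"])                       # 960
    print(round(budget["resolution_bits"], 1))   # 9.9
    print(int(budget["frames_per_second"]))      # 207253
```

So one pass through the loop takes 965 cycles, which gives about 207 thousand PWM periods per second, in line with the "roughly 200kSps" estimate.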

The 960 cycles leave plenty of time for PRU1 to perform four LBBO operations that are 120 bytes wide. Typically each one should take less than 80 cycles. So the code will look something like:
LBBO from DRAM filling all registers r0:r29
XOUT 10,r0,120
LBBO from DRAM filling all registers r0:r29
XOUT 11,r0,120
LBBO from DRAM filling all registers r0:r29
XOUT 12,r0,120
LBBO from DRAM filling all registers r0:r29
XOUT 14,r0,120
//here the code will stall until PRU0 requests the data with the corresponding XIN operation.
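
For this scheme the ARM side has to lay out the data in the order in which PRU0's AND/LSR sequence consumes it. The Python sketch below shows one possible packing. It assumes that byte 0 of each 120-byte block ends up in r0.b0, that the low nibble of each byte is output first (the AND) and the high nibble second (the LSR), and that bit c of every nibble drives output c. The function names are invented for the example, and the uneven weight of the time slots next to the XIN and JMP instructions is ignored.

```python
SLOTS_PER_BANK = 240  # 30 registers x 8 nibbles
BANKS = 4             # scratchpad banks 10, 11, 12 and the direct transfer


def pack_bank(duties, first_slot=0, slots=SLOTS_PER_BANK):
    """Pack 4 PWM duties (counted in time slots) into one 120-byte block."""
    block = bytearray(slots // 2)
    for t in range(slots):
        nibble = 0
        for channel, duty in enumerate(duties):
            if first_slot + t < duty:  # output is high for the first duty slots
                nibble |= 1 << channel
        # Even slots go to the low nibble, odd slots to the high nibble.
        block[t // 2] |= nibble << (4 * (t % 2))
    return bytes(block)


def pack_frame(duties):
    """Return the four 120-byte blocks that make up one 960-slot PWM period."""
    return [pack_bank(duties, first_slot=bank * SLOTS_PER_BANK)
            for bank in range(BANKS)]


if __name__ == "__main__":
    frame = pack_frame([960, 0, 1, 2])
    print(hex(frame[0][0]), hex(frame[0][1]))  # 0x9d 0x11
```

Here output 0 is always high, output 1 is always low, output 2 is high for one slot, and output 3 for two slots, which is why the first byte is 0x9D and every later byte is 0x11.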

This is the simplest implementation. You can make it more sophisticated by inserting some logic into the PRU1 code that modulates the output data over several cycles to get a higher output resolution. But even in this implementation, the only irregularity comes from the XINs and the final jump in PRU0, which should be taken into account in the algorithm that generates the data filling the registers (the value of r29[28:31] counts twice if it is in the scratchpad, and threefold if it is in PRU1). This is a complication, but it actually leads to a neat trick: if you want to invest some of the gained speed in higher resolution, you can increase the weight of any 4-bit data point by inserting a loop after it that just holds that particular output value for that number of cycles (you need to reserve a register for the counter in that case, but you will still gain in resolution). If one loop of the PRU0 code takes much more than 1024 cycles, you should also insert a loop of similar length into the PRU1 code so that PRU1 does not stall for more than 1024 cycles waiting for PRU0’s XIN command.

The PRUs are pretty powerful for driving the IO pins, and their deterministic nature is very useful. But it was a little hard for me to learn each topic: GPIO access, interrupts, memory mapping, and the scratchpad.
In the process of learning, I have written a small library that I called libpru. It is composed of a PRU assembly include file and a C++ counterpart for the ARM processor. Maybe it can be useful to other people. I do not know git very well, but maybe I will upload it somewhere.

Interesting idea, if it were only PWM. I did not mention it before, but PRU1 will also read five incremental encoders and report the positions back to the ARM side. So I have some free time on PRU1 to manage the PRU0 and ARM communication, but I can’t dedicate it exclusively to that. Also, there are two digital outputs and two digital inputs (on/off).

I do not know git very well either. Learning it is something I’ve been putting off for a while, but eventually I think all software developers need to learn and use git.

It’s been keeping me from sharing my current code for the project I’m working on, but it’s currently a mess anyway heh ! I have not even updated my blog in a couple years . . . which has been on my mind too. Lots of energy to invest in such thing though - When you would rather be doing something else like learning some new software / hardware aspect, etc.

I would not mind seeing your work sometime, but I could not say when I would get the time to look. If it were a blog post or similar, I probably read a couple a day, so that would not be a problem. But reading through and understanding someone's code is another story, especially since my ASM is very rusty and my ARM ASM knowledge is nonexistent.

Can you help me, I’ve installed UHD but Uhd_find_devices and Uhd_probe nothing :frowning:
Please help

Can you help me, I’ve installed UHD on beaglebone but Uhd_find_devices and Uhd_probe nothing :frowning:
Please help