PRU/ARM communication with shared memory

Hi

I am facing a strange problem with my application. I have some code running on PRU0 and PRU1 that shares data through the scratchpad. PRU1 also sends data to the ARM, 32 bytes at a time, with a counter value as a timestamp. Each time a new chunk of data is ready, it is written into a circular buffer in shared memory, a pointer is updated at a fixed position in the shared memory, and an interrupt is sent to the ARM. On the ARM side, each time the interrupt is received, the pointer is read so the ARM can fetch the latest chunk of data. So far so good.
Analyzing the data received on the ARM side, we noticed some samples are missing, e.g. timestamps are 1000, 1001, 1003, 1004, 1005… (1002 is missing), and this is expected; that is the point of having a ring buffer. The problem is that sometimes some samples are repeated, e.g. 1004, 1005, 1005, 1006, 1008… The only way I can think of for this to happen is that for some reason the PRU flushes data to the shared RAM but the write does not complete before the interrupt is sent to the ARM, so the ARM reads the shared memory and finds old data.
Here comes my question: I can live with it happening here and there, but where and how can I investigate further, and maybe ensure that this will not happen, or at least does not get worse?

Thanks,

Carlos Novaes

Hard to guess, but if I had to guess I would say the OS side is sometimes reading the memory faster than the PRU can build the buffer.

I have also experienced this using POSIX shared memory between two processes on Linux. How I dealt with it was pretty simplistic.

struct shared {
    char buffer[LENGTH];    /* fixed-size data buffer; LENGTH defined elsewhere */
    volatile int access;    /* 0 = writer's turn, 1 = reader's turn */
};

/* write side process */

struct shared *s;

s->access = 0;

. . .

while (s->access != 0)
    usleep(100);

/* Do stuff when this process has access */
. . .

s->access = 1;

/* Read side process */

struct shared *s;
. . .

while (s->access != 1)
    usleep(100);

/* Do stuff when this process has access */
. . .

s->access = 0;

*s in this case is a pointer to an mmap()'d region of shared memory, which both processes have access to. This virtually guarantees that each process only touches the buffer at the appropriate time. Meaning, the write process starts off with access, and does not relinquish it until it is done writing to the buffer. After that, the read process gains control, does its thing, and passes access back to the write process.

One caveat I can think of: the reading process should copy the buffer data and return control to the write process as fast as possible, since the write process will presumably be getting data in fairly fast and will have no place to store it, unless you use two buffers on the PRU side.

Well, buffer in my case was actually a fixed-size buffer[], not a pointer-type character array, but you get the idea I hope.

In addition to the above . . . It sounds as though you’re not clearing your data once you’re done with it.

So perhaps, and I'm speculating here: the value the PRU produces is put into a variable initially, and then you copy that value into a ring / circular buffer. If you're getting identical timestamps and data in your circular buffer, then you're not clearing that initial variable once you're done with that particular value.

Still, the whole situation seems more like a synchronization issue to me than anything. But you should probably be clearing your data fields as well, once done with them.

I think I got your point here. I used the signal/interrupt system to sync between the ARM and PRU, as it seemed to me a more professional approach (I am not a programmer anyway...) than repeatedly reading a status flag in shared memory. But it seems that the PRU functions that write to the shared RAM region are not blocking and will return even if the data has not actually been written to RAM.
I will try to set a flag to signal that a chunk of data has already been processed and will see how it performs.
Another thing to note is that each time new data is read by the ARM, another chunk of data is written to the same shared memory, just at a different location. Could this cause the bus to saturate under some conditions?

I think I got your point here. I used the signal/interrupt
system to sync between the ARM and PRU, as it seemed to me a more
professional approach (I am not a programmer anyway...) than
repeatedly reading a status flag in shared memory. But it seems
that the PRU functions that write to the shared RAM region are not
blocking and will return even if the data has not actually been
written to RAM.

Writes will always complete, but it's possible to see updates happen
out of order due to hardware or compiler optimizations (that's what
memory fences are for).

You also don't mention if "shared ram" is the PRU shared memory region
or a chunk carved out of SDRAM. The data will always be written, but
visibility rules (especially on the ARM side) will depend on the
specific code you write, the memory region flags (cache-able, I/O
region, etc), any memory barriers used, etc.

You'll have a much easier time using the PRU shared memory region if
you're not already.

I will try to set a flag to signal that a chunk of data has
already been processed and will see how it performs. Another thing
to note is that each time new data is read by the ARM, another
chunk of data is written to the same shared memory, just at a
different location. Could this cause the bus to saturate under
some conditions?

The description of your code so far hasn't sounded much like a
standard ring buffer. Writing lockless multi-threaded code is
non-trivial, and there are lots of ways to mess things up. I suggest
you either use a proper ring buffer (if you don't want dropped
samples), or a proper req/ack mechanism if you're just trying to get a
single value at a time across the PRU<>ARM boundary.

If you want more specific help, post a link to some code for review.

I think I got your point here. I used the signal/interrupt system to sync between the ARM and PRU, as it seemed to me a more professional approach (I am not a programmer anyway…) than repeatedly reading a status flag in shared memory. But it seems that the PRU functions that write to the shared RAM region are not blocking and will return even if the data has not actually been written to RAM.

Do whatever makes you happy as far as fixing your problem. But know that “professionalism” has nothing to do with it. There are ways to do some things, and ways not to.

It is far more troubling to leave data variables uncleared once done with them, and not to use memory synchronization. These can promote hard-to-nail-down problems, as you’re observing.

Anyway, stop worrying about what you think is professional, and focus on the things that are actually important.

I think I got your point here. I used the signal/interrupt
system to sync between the ARM and PRU, as it seemed to me a more
professional approach (I am not a programmer anyway…) than
repeatedly reading a status flag in shared memory. But it seems
that the PRU functions that write to the shared RAM region are not
blocking and will return even if the data has not actually been
written to RAM.

Writes will always complete, but it’s possible to see updates happen
out of order due to hardware or compiler optimizations (that’s what
memory fences are for).

I see

You also don’t mention if “shared ram” is the PRU shared memory region
or a chunk carved out of SDRAM. The data will always be written, but
visibility rules (especially on the ARM side) will depend on the
specific code you write, the memory region flags (cache-able, I/O
region, etc), any memory barriers used, etc.

You’ll have a much easier time using the PRU shared memory region if
you’re not already.

Yes, I am using it: the 12K region of RAM shared between the two PRUs, which can also be memory-mapped by the ARM.

I will try to set a flag to signal that a chunk of data has
already been processed and will see how it performs. Another thing
to note is that each time new data is read by the ARM, another
chunk of data is written to the same shared memory, just at a
different location. Could this cause the bus to saturate under
some conditions?

The description of your code so far hasn’t sounded much like a
standard ring buffer. Writing lockless multi-threaded code is
non-trivial, and there are lots of ways to mess things up. I suggest
you either use a proper ring buffer (if you don’t want dropped
samples), or a proper req/ack mechanism if you’re just trying to get a
single value at a time across the PRU<>ARM boundary.

In fact, it is not exactly a standard ring buffer implementation. There is just a region of memory, let’s say 32 contiguous structures of 32 bytes each. On the PRU side I keep a “pointer” which is really just an index from 0 to 31, incremented once per new sample and wrapping from 31 back to 0. This “pointer” is written to another region of the shared RAM so the ARM can read it. The ARM also has its own version of this same pointer, indicating the location of the last processed sample, and increments it after receiving an interrupt from the PRU. Usually the two versions of this pointer hold the same index value, but if the ARM fails to process one or more samples they will differ. If they differ, the ARM code knows that nothing can be done in response to the missed samples, but the old data may be (in fact will be) useful to compute a more accurate response to the current sample.

If you want more specific help, post a link to some code for review.

The code is messy and some variable names are unclear, but I will post it anyway, if anyone is interested or willing to help/comment.

Thank you very much for your help and hints.

I think I got your point here. I used the signal/interrupt system to sync between the ARM and PRU, as it seemed to me a more professional approach (I am not a programmer anyway…) than repeatedly reading a status flag in shared memory. But it seems that the PRU functions that write to the shared RAM region are not blocking and will return even if the data has not actually been written to RAM.

Do whatever makes you happy as far as fixing your problem. But know that “professionalism” has nothing to do with it. There are ways to do some things, and ways not to.

Nice and objective advice, I like it. In fact, the interrupt system works great for signaling new data to be exchanged by the two PRUs; there are just the PRU registers, no memory cache, and everything completes in one PRU cycle… a predictable world. I was lured into thinking that the same would apply to syncing data between the PRU and the ARM.
Now I see that when I wrote “professional approach”, even stating that I am not a programmer, it may have sounded offensive. So I ask anyone who may have felt uncomfortable, please forgive me. That was not my intention.

It is far more troubling to leave data variables uncleared once done with them, and not to use memory synchronization. These can promote hard-to-nail-down problems, as you’re observing.

Anyway, stop worrying about what you think is professional, and focus on the things that are actually important.

Thank you very much. I will re-implement the code with this kind of flagging… maybe it is the right time to dive into some POSIX threads reading.

Nice and objective advice, I like it. In fact, the interrupt system works great for signaling new data to be exchanged by the two PRUs; there are just the PRU registers, no memory cache, and everything completes in one PRU cycle… a predictable world. I was lured into thinking that the same would apply to syncing data between the PRU and the ARM.
Now I see that when I wrote “professional approach”, even stating that I am not a programmer, it may have sounded offensive. So I ask anyone who may have felt uncomfortable, please forgive me. That was not my intention.

It is far more troubling to leave data variables uncleared once done with them, and not to use memory synchronization. These can promote hard-to-nail-down problems, as you’re observing.

Anyway, stop worrying about what you think is professional, and focus on the things that are actually important.

Thank you very much. I will re-implement the code with this kind of flagging… maybe it is the right time to dive into some POSIX threads reading.

No no . . . what I meant by my comment was that it is more important to do the job right by you. Or put another way: don’t worry so much about what other people think about your code; do what works for you, and simpler code generally works best. Perhaps you’re familiar with the acronym K.I.S.S.? i.e. “keep it simple, stupid” . . . this is not meant as a denigrating remark. The meaning I take from it is that there are usually several ways to do things (in engineering), and generally the simplest way is the best. For many reasons, but not always the obvious ones.

So, about my POSIX shared memory comment . . . the code was more of a high-level concept than code you can actually use for your situation. Honestly I have no idea how you would implement the equivalent on the PRU side, but I’m sure it is possible.

The beauty of the code is its simplicity, going straight to the desired effect. No overhead, etc., and similar code could not get much faster . . . depending. One of the downsides, however, is that it is a “blocking lock”, which means the program cannot do anything past this check until it can do its thing with the shared data area. This can mean that one side or the other may stall while waiting on the other side to complete its task.

I also like the idea of using interrupts. However, there is probably a lot of system overhead involved on the OS side, by comparison. Really though, I have no experience with using interrupts on an OS, only bare metal. But the potential problem I’m seeing here is that Linux works in time slices, and your Linux process will only get the CPU time the kernel lets it have, which can be very non-deterministic at times. I’ve experienced this first-hand recently . . .

Also, I’m going to assume, and it is probably a very good assumption, that interrupts in Linux will require a system call at minimum, which can be bad for an executable that needs to be fast. As an example, look at the speed difference between accessing a file using sysfs compared to mmap(). sysfs uses system calls for reading/writing, where mmap() does not (initializing either can be very close to the same). Anyway, my personal experience here is with POSIX semaphores. Compared to my simplistic code, they were dog slow. POSIX semaphores use system calls; my code does not, being very simple code, and very little of it. It is also very deterministic: it virtually guarantees that only one side of the shared memory transaction can take place at a time, and in a certain order.

Interrupts between the two PRUs seem the way to go, especially since it appears to be a single-cycle operation on each side. For your Linux-side application . . . I’m not so sure. But the only real way to find out is to implement two or more different ways of doing the same thing and test it yourself.

Anyway . . . I’ve probably beat this horse to death :wink:

This can mean that one side or the other may stall while waiting on the other side to complete its task.

This is actually more of a cautionary remark than anything. If you use such a technique “smartly”, it can be very fast. As another example . . . the code I’ve been working on recently involves reading in data from the CAN peripheral and building multi-frame packets (varying in size) out of this data, then writing this data to a shared memory file. After this, the shared memory read side reads the data and puts it out over a WebSocket to a web browser client.

With this in mind, I’ve been able to send 280+ messages a second via a WebSocket to the browser. More realistically, I’m able to give the browser ~20 data samples a second using the technique I mentioned above. That, and the data is multiple packets, and most packets are only sent over the CAN bus every 500ms. I’m currently tracking only three data packets . . . one of which sends out information every 100ms, which I realistically cannot keep up with, even if that data packet is monitored by a separate thread. I’ve tested it . . .

Carlos, I forgot to mention before that you also seem to have a flaw in your code somewhere. I mentioned clearing your data variables already, but there is potentially more to it if you’re getting duplicate timestamps and data. The first impression I get from this is that

  • a) you’re not clearing your data variables between reads/writes on the ARM side, and . . .
  • b) you have no locking mechanism between the PRU and ARM to tell the ARM-side program it is OK to read the data.

This is what I was trying to get at last night, but I always feel like I’m in a hurry because I’m either coding or reading about something new related to what I’m coding . . . hence why I respond with so many darned posts . . . heh. I have a lot on my mind . . .

Hello William, first of all, thank you again for your suggestions and clarifications. I am convinced now that it will work better (and with simpler code) if I implement some locks and keep reading a flag bit to know when it is time to read/write the memory. As far as I can tell, I can do it in a non-blocking way on the PRU side. On the ARM side, it should be possible to spawn a new thread specialized in checking this, so the ARM program can do other things without having the main thread blocked. There will be nothing for the control task to do before a new sample arrives, but in the supervisory task I can fill in some buffers and send some data to a web server running on the BeagleBone… so anyone can plug in an Ethernet cable, open a web browser, and watch the “live” results, or choose to download all the data after the experiment is over.
Ah, and no, I have no locking mechanism aside from the interrupt system, which is not working as I expected in the ARM code.

Carlos Novaes

Hello everyone. Just an update: PROBLEM SOLVED!

There were a couple of lines from an ancient version of the code running on the PRU side that I forgot (failed) to remove. Under some conditions the timestamp was not incremented, or was incremented twice. Now everything is working well with the interrupt signaling, but based on this discussion I am aware that there may be other points to consider in order to prevent failures.

Thank you,

Carlos Novaes

Carlos,

I’d let what you have run for several days, to see if it works out for you, before making an absolute decision. It seems that in your eyes interrupts are the best solution, and I cannot say I disagree, assuming the code is fast enough.

Hi,

I would not say this is my absolute decision, but in its present state it seems that the erratic behavior is unlikely to happen. Anyway, it is something the ARM code must be able to deal with.
It ran for about ten minutes; just a few samples were missed (when I started a file transfer via sshfs), and none of them were repeated. It will perform the control of a walking robot, as an academic experiment, and will run for about twenty seconds to one minute at a time, just to record some data and for demonstration purposes. Of course, there is no control processing yet, so the final results will be worse; if so, there may be a case for a complete rewrite. Another thing I just found out, and that may have some influence, is that the ARM clock was set to 300MHz with the “ondemand” governor. I expect better results setting this to the maximum of 1000MHz.

Another thing I just found out, and that may have some influence, is that the ARM clock was set to 300MHz with the “ondemand” governor. I expect better results setting this to the maximum of 1000MHz.

Maybe, because ondemand will go up to 1GHz when CPU load exceeds ~66%. For my own usage here, I found that it did not make much difference until I went to a multi-process model (in addition to the two processes I have always had). That is to say, I was reading all values from the CAN bus via one executable before experimenting with using multiple processes for different data sets, using fork() with execv()-type processes.
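For reference, switching governors uses the standard Linux cpufreq sysfs interface (a config fragment; requires root, and the paths assume cpu0):

```shell
# Show the current governor (e.g. "ondemand")
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Pin the CPU at its top frequency instead of scaling on demand
echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Check the resulting clock (reported in kHz)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
```

On a 1GHz BeagleBone the last command should then report the top frequency.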

In the end I’d have to say that multiple processes using fork() and execv() did allow me to do more processing at once. But it also introduced more kernel cleanup, as is noticeable when running atop to monitor process activity. Meaning, I was able to sample more data in the same amount of time, but because the kernel gets involved a lot more in cleanup, I do miss some samples once in a while . . . Well, actually I’m not 100% sure what is happening behind the scenes, but I see kworker processes pop up more frequently. Either way, whatever the kworker processes are doing causes my applications to stall occasionally, for a short period of time.

Just one last note: these processes I create run for the lifetime of the whole application. So it’s not like I’m creating too many zombie processes or something . . . which I learned not to do very early on :wink: