Passing Mailbox messages between ARM and PRU

Chris_Grey · January 19, 2023, 9:11pm

I’m reading through the TRM for how the mailbox is supposed to be interfaced with, but it’s giving me more questions than answers. The J721E TRM (spruil1c.pdf Sec 7.1.4) talks about Global Initialization and Mailbox initialization.

Unfortunately it doesn’t give any more useful examples of how the mailbox is interacted with within the context of Linux or PRU programming.

Suspecting that the memory offset I’ll write values to is a function of the queue number, I started looking for documentation with more specifics. J721E_Register3.pdf lists a lot of mailbox registers. My interpretation of what’s documented is that there are 11 mailbox instances, each with 16 queues, each queue capable of holding up to 4 messages. The register I believe I want to read and write is MAILBOX_MESSAGE_y as defined in the doc.

So the first queue of the 1st instance would be at 31F8_0040h. This register appears to be the 4-byte FIFO. So if I write 2 values in and then read it, I should read out the 1st value written. Then read it again, and get the 2nd value. That’s what I expected anyway.

But when I tested this theory, it didn’t seem to work. Here’s what I tried from the terminal of my BBAI64:
sudo busybox devmem 0x31f80040 w 0xdeadbeef
followed by
sudo busybox devmem 0x31f80040

The value read back was all 0s. So there’s either more configuration required to activate the mailbox or I’ve gone off the rails somehow. Anybody have any ideas?

Having an example already written and working would probably be all I need. Anybody know where such code might live?

Chris_Grey · January 21, 2023, 3:17am

I’ve revisited this and just to see what happens, I tried a different mailbox, and what do you know? It worked. Sort of.

When I send messages to 31F8_1040h, I can read those messages back out in the order I put them in. I can even reference 31F8_1080h which will return 1 if the queue is full (has 4 messages). I can even call 31F8_10C0h to get a count of how many messages are in the queue.
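The addresses quoted so far suggest a regular layout. Here is a minimal sketch of the address arithmetic, assuming instance 0 starts at 0x31F80000, instances are 0x1000 apart, and each per-queue register is 4 bytes wide. These constants are inferred from the addresses in this thread, not copied from the TRM, and the helper name `mbox_reg` is made up for illustration.

```c
#include <stdint.h>

/* Assumptions inferred from the addresses in this thread; verify against the TRM. */
#define MBOX_BASE0        0x31F80000u  /* assumed base of mailbox instance 0 */
#define MBOX_INST_STRIDE  0x1000u      /* assumed spacing between instances */
#define MBOX_MESSAGE      0x40u        /* FIFO data: write pushes, read pops */
#define MBOX_FIFOSTATUS   0x80u        /* reads 1 when the queue is full */
#define MBOX_MSGSTATUS    0xC0u        /* number of messages in the queue */

/* Physical address of a per-queue register (hypothetical helper). */
static inline uint32_t mbox_reg(uint32_t inst, uint32_t reg, uint32_t queue)
{
    return MBOX_BASE0 + inst * MBOX_INST_STRIDE + reg + 4u * queue;
}
```

With these assumptions, `mbox_reg(1, MBOX_MSGSTATUS, 0)` gives 0x31F810C0, the message-count address used above.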

What is odd is when I 1st started testing, the default value if nothing is in the queue was 0x0. However after a while, one of my previous values wound up getting stuck in the register as the “default” value. So the queue would continue to return the values I put in, but once all messages were pulled, continuing to read the register no longer returned 0x0. It would return one of the previous values I’d sent. So it’s almost as though there’s an incantation you can do to set the default value OR the mailbox functionality is just flaky.

Once I’d played around with this a bit longer, I returned to the 0th mailbox, and it continues to not work. So I don’t know what’s up with that. Maybe there’s something in the background using that mailbox and as I’m putting garbage values in it, whatever is listening to the mailbox is pulling those values out faster than I can read them out.

But this is working well enough for me to use and start writing some test code to pass messages back and forth between the PRU and ARM so I can get performance numbers. I vaguely recall that there’s a PRU timer value that can be referenced. I’d like to use that as the value I push to the ARM from the PRU: the PRU would send a timer value before and after a task, and the ARM would compute the delta to work out how fast things like accessing DDR are and how much jitter there is in doing so. It’s a shame there are only 4 message positions in the mailbox. It’ll likely be too cumbersome to use mailboxes for bulk data transfers, so I’ll also need to test shared memory and use mailboxes to tell the PRU what the raw memory addresses are, as well as how many bytes have been written, so it can go read whatever I sent.
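One way to picture the "mailbox carries the pointer, shared memory carries the data" idea is a two-word descriptor. This is only a sketch: the `buf_desc_t` type and both helpers are hypothetical, and it assumes the address fits in 32 bits as seen by the receiver.

```c
#include <stdint.h>

/* Hypothetical descriptor: where the shared buffer is and how much was written. */
typedef struct {
    uint32_t addr;  /* buffer address as the receiver sees it */
    uint32_t len;   /* number of valid bytes */
} buf_desc_t;

/* Split a descriptor into the two 32-bit words pushed into the mailbox. */
static void desc_pack(const buf_desc_t *d, uint32_t words[2])
{
    words[0] = d->addr;
    words[1] = d->len;
}

/* Rebuild the descriptor from two words popped in the same order. */
static buf_desc_t desc_unpack(const uint32_t words[2])
{
    buf_desc_t d = { words[0], words[1] };
    return d;
}
```

Since a queue holds 4 messages, two such descriptors fit before the sender has to wait for the receiver to drain the FIFO.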

Anyway, it’s good to get confirmation on how this works and see that it does work just as simply as it is documented to albeit with some caveats.

If someone knows what is going on with the “default” value getting changed in the mailbox, let me know what it is I’m doing inadvertently to cause that…and how I can reset the mailbox to clear that out when/if it happens again.

Chris_Grey · January 22, 2023, 12:53pm

With this information, I was able to successfully send data back and forth between the ARM and code I wrote for the PRU. However I never could find a timer counter that would return a value other than 0.

What I could send is the number of times my while-loop looped. I was also able to send a “stop” message from the ARM to the PRU. And by monitoring the message count of the mailbox the PRU was pushing messages into, I could see that the PRU had, indeed, stopped, confirming I successfully attained 2-way comms.

If someone knows a value I can use that’ll indicate a count of clock changes, I’ll put that in my program and retest.

The value I was trying was ICSSG0_PRU1_CYCLE_COUNT @ 0x0B02400C (I’m loading my code into PRU1).

I even found that the IEP subsystems have a 64-bit timer in them (IEP_COUNT_REG0 & IEP_COUNT_REG1). And I tried activating the IEP (via IEP_GLOBAL_CFG_REG) and reading the lower-32 but it never changed value either. Perhaps there’s more to activating it than setting the config register to a value of 1?

There has to be a stand-alone free-running timer register somewhere that can be used for delta-timing in the PRUs. That’s a pretty fundamental aspect to realtime programming & control.

The only other thing I found in the PRU subsystem that looks promising is the ECAP registers. But dealing with it just seems far more heavy-weight than what I need. I just need a value that increments on a known clock frequency.

Perhaps I’ll start another thread asking if anybody knows of a free running timer that can be accessed anywhere in the system. As long as I know how fast the timer ticks, I can then correlate different values and get an accurate indication as to how long tasks take to perform.
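Whatever counter ends up being used, the delta itself is simple. A minimal sketch, assuming a free-running 32-bit counter that wraps at most once between the two samples (the function name is illustrative):

```c
#include <stdint.h>

/* Ticks between two samples of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so a single wraparound
 * between the samples still gives the right answer. */
static inline uint32_t ticks_elapsed(uint32_t before, uint32_t after)
{
    return after - before;
}
```

Dividing the result by the counter frequency gives elapsed time, which is why knowing the tick rate matters.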

Anyway I’m going to mark this subject as solved since I do have a working solution for the title. There are still unanswered questions, like why values get stuck in the FIFO as default values. But that’s easy enough to avoid: simply don’t read the FIFO if it’s empty.
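The "don’t read an empty FIFO" rule can be sketched as a guarded read. This assumes the register offsets discussed earlier in the thread (message at 0x40, message count at 0xC0, 4 bytes per queue) and takes a pointer to the start of one mailbox instance, however it was mapped. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Word offsets within one mailbox instance (assumed from this thread). */
#define MBOX_MESSAGE_W    (0x40u / 4u)
#define MBOX_MSGSTATUS_W  (0xC0u / 4u)

/* Pop one message only if the queue reports at least one pending.
 * Returns false and leaves *out untouched when the queue is empty. */
static bool mbox_try_read(volatile uint32_t *inst, uint32_t queue, uint32_t *out)
{
    if (inst[MBOX_MSGSTATUS_W + queue] == 0u)
        return false;
    *out = inst[MBOX_MESSAGE_W + queue];
    return true;
}
```

Taking the instance base as a pointer keeps the check independent of how the registers were mapped (for example `mmap` on `/dev/mem` on the ARM side).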

FredEckert · January 22, 2023, 4:03pm

How about the GTC?

TRM:
12.10.1 Global Timebase Counter (GTC)
12.10.1.1 GTC Overview
The GTC module provides a continuous running counter that can be used for time synchronization and debug trace time stamping.

   uint64_t t;
   t = (*((uint64_t volatile *)(0x00A90008)));
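On a 32-bit core a 64-bit load like the one above may be split into two 32-bit reads (an assumption about the compiler, not something verified here), so the halves could straddle a carry. A common guard is to read high, low, high and retry if the high word changed. The sketch takes the two word pointers as parameters; the assumption that the low word is at 0x00A90008 and the high word at 0x00A9000C should be checked against the TRM.

```c
#include <stdint.h>

/* Read a 64-bit counter exposed as two 32-bit words without tearing:
 * sample high, low, high and retry if the high word moved in between. */
static uint64_t read_counter64(const volatile uint32_t *lo,
                               const volatile uint32_t *hi)
{
    uint32_t h1, l, h2;
    do {
        h1 = *hi;
        l  = *lo;
        h2 = *hi;
    } while (h1 != h2);
    return ((uint64_t)h1 << 32) | l;
}
```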

Chris_Grey · January 22, 2023, 6:27pm

I didn’t know about that. I plugged it into the code…seems to work beautifully. Thank you.

Question though: I was trying to find something within the PRU subsystem. All the documentation suggests that once the PRU has to reach outside the subsystem, an access takes more than the standard opcode processing delay of 1-2 PRU clock cycles to complete, and the time it takes is also non-deterministic. Since GTC is “global”, are reads to it going to be subject to this non-deterministic delay?

I guess I’ll find out as I get into more realtime analysis, not manual terminal-checking the incoming mailbox on the ARM for values from the PRU.

FredEckert · January 23, 2023, 2:39am

I searched for some definitive information on this but couldn’t find it readily. Based on what I have gathered, there is a “few nanoseconds of jitter” caused by going through the system interconnect. I also read somewhere that it can vary based on the traffic on the interconnect.

Maybe this recent PRU timer post by user Juvinski could help you stay in the PRU subsystem:

BarryBeagle · February 15, 2023, 4:28pm

Hi @Chris_Grey

Found this from TI Training that suggests that local PRU memory access is 3 clicks and going to external is 36 clicks (best case).

This training material is for PRU-ICSS whereas now we have PRU-ICSSG, which I think means a clock speed increase from 200MHz to 250MHz?

Therefore back-of-envelope calcs (assuming 250MHz and 3 / 36 clicks respectively) are that local PRU memory access is about 12ns whereas accessing global RAM would be about 144ns - depending on how busy the bus is…
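The back-of-envelope numbers can be reproduced with a one-line conversion. A small sketch (the function name is illustrative) that converts a tick count at a given clock in MHz to nanoseconds, rounded to the nearest integer:

```c
#include <stdint.h>

/* Nanoseconds taken by `ticks` clock cycles at `mhz` MHz, rounded to nearest.
 * One tick lasts 1000/mhz ns, so the result is ticks * 1000 / mhz. */
static inline uint32_t ticks_to_ns(uint32_t ticks, uint32_t mhz)
{
    return (ticks * 1000u + mhz / 2u) / mhz;
}
```

At 250MHz this gives 12ns for 3 ticks and 144ns for 36 ticks. The same counts at 200MHz would be 15ns and 180ns.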

As an aside: All of this is extra confusing given that multiple TI docs emphatically state that J721E has no PRU support?! I don’t understand this confusion on TI’s part about this because some docs say no and yet others say yes. Similarly there are forum post on TI where TI engineers are saying it doesn’t support PRUs…yet we all know by direct experience that the PRUs are on our BB-AI64.

Chris_Grey · February 16, 2023, 1:55pm

To add more confusion to the mix, the recent webinar that was done on the BBAI64’s AI functionality suggested that the PRU-ICSSGs are capable of 333MHz. I called this out and requested that the person making this declaration double-check and update the forum posts where others had determined that it was 250MHz (and believed to be both default and MAX).

Point is, IF the PRUs are capable of being configured for 333MHz, does this delay track with the PRU clock speed? Or is it a fixed delay that is being correlated to the PRU clock rate? In other words, if the PRU is capable of and is increased to 333MHz, does this 3/36 tick still hold? Or do they increase?

My guess is speeding up the PRU clock does nothing for the delays related to traversing the system outside of the PRU. And as such, a 36 tick delay while the PRU is running 250MHz would result in a 48 tick delay when the PRU clock is 333MHz. Would that be your assessment too?

BarryBeagle · February 16, 2023, 2:49pm

Yes I guess the devil is in the details… at 250MHz each tick is 4ns, at 333MHz it falls to 3ns.

For internal access, it seems obvious that each tick scales with clock speed (so things go faster in wall-clock time).

In ICSSG the G is for Gigabit, but I’m not quite sure what that means…is that a Gigabit per PRU cluster? (So the 6 processors in PRU0 share 1Gbit and the 6 other processors in PRU1 share another 1Gbit)? Or do all 12 PRUs share a total of 1Gbit throughput?

To add even more confusion. In ICSSG there are 3 types of PRUs: PRU, RTU, and PRU-TX. From what I understand PRU are same as previous version (have access to everything) and PRU-TX are new concepts for quickly sending data to things like UART / Ethernet, etc. Whereas the RTU have no external access, they are internal processors for the cluster only.

Therefore, given the above, we can assume that only 4 of the units (2 x PRU + 2 x PRU-TX) have external access…so how does 1Gbit get shared among 4 units if each unit moves at 333MHz?

Perhaps it’s as simple as “you have 1Gbit available for external access, so you could theoretically have 333MHz access to external memory if something else isn’t using that bandwidth?”

Chris_Grey · February 16, 2023, 4:25pm

I took the “G” for Gigabit to mean the PRU_ICSSG is capable of managing a 1GBaseT datalink, not that it was executing at 1GHz.

As it relates to the difference between the various PRU cores, this was asked in a different thread:

See if that answers some of your questions as to the capabilities/limitations of each PRU core in each PRU_ICSSG subsystem. As you mentioned, they are NOT equal.

BarryBeagle · February 16, 2023, 5:46pm

ahh that makes sense.

Also looks like your theory of 48 click delay outside core is correct, see attached.

Chris_Grey · February 16, 2023, 6:25pm

It’s just a shame that they don’t tell you in that chart what clock speed the tick counts correspond to.

When I originally read that chart you have referenced, I assumed those counts were with the PRUs running at 250MHz. As far as I can tell, nothing in that doc suggests the PRUs are even capable of 333MHz. And right now, that’s only hearsay based on a vague reference in that webinar that, until I see documentation OR out-right experimentation from someone showing they’ve accomplished this on a TDA4VM, I’m dubious of. However if that rumor about the PRUs being capable of 333MHz is accurate, that would put the tick-count from 48@250 to 64@333MHz for external accesses.

BTW, I’m just dividing 333 by 250, then multiplying by the value @250 to get the corresponding value @333.
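That scaling can be written down as a small helper. It is a sketch that rests on the assumption made above: the external delay is fixed in wall-clock time, so its tick count grows in proportion to the PRU clock. The function name is illustrative.

```c
#include <stdint.h>

/* Re-express a latency measured in ticks at one clock as ticks at another,
 * assuming the latency is constant in wall-clock time. Rounded to nearest. */
static inline uint32_t scale_ticks(uint32_t ticks, uint32_t from_mhz, uint32_t to_mhz)
{
    return (ticks * to_mhz + from_mhz / 2u) / from_mhz;
}
```

With this, 36 ticks at 250MHz becomes 48 at 333MHz, and 48 at 250MHz becomes 64 at 333MHz.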

All that said, where did you get your numbers of 3 and 36 ticks from? Is that experimentation or documentation?

BarryBeagle · February 16, 2023, 6:52pm

My source for 36 was from here (Page 14):

The source for 333MHz is from here (page 5):

BUT that source for 333 doesn’t mention the J721E; it’s referring to AM64x/AM243x - however, elsewhere I’ve heard it claimed that J721E is the same as AM64 but I don’t know…

Chris_Grey · February 17, 2023, 12:22pm

Unfortunately, that Building Blocks PRU doc only covers PRU_ICSS (200MHz), not PRU_ICSSG (250MHz default, supposedly capable of 333MHz). And the only place that I saw a 36 for latency was for latency accessing CFG. Regardless, we can assume that 36 was for 200MHz PRUs.

So to correlate those numbers @200MHz to a 250MHz PRU, multiply that 36 by (250/200), which gives you 45@250MHz. But even this has the fallacy of trying to compare latency values from one platform to another. The path between PRUs and the other resources, such as DDR, are different from platform to platform.

For example, on the J721E, you often have to go through a Region Address Translator to convert 32-bit PRU pointers to 48-bit physical address pointers since this platform is 64-bit and thus capable of memory above 4GB. Some might think 64-bit is only important if you are using memory above 4GB and since there’s only 4GB RAM on the BBAI64, it shouldn’t matter, right? Not right. The platform is 64-bit regardless of whether you max its capabilities or not. Add to that, there’s a fair amount of the memory space taken up by memory-mapped configuration registers. So the starting point for the 4GB RAM is much higher than 0x00, thus pushing the highest physical RAM address above what’s reachable by 32 bits. Point is, the extra layer of RAT adds latency to things like accessing DDR from the 32-bit PRUs. And this is only 1 example of differences between platforms, although 32 vs 64-bit is a pretty significant one. So it’s no surprise that you are finding documented latency values closer to 48-51@250MHz for the PRU_ICSSG on platforms similar to ours.
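To make the translation step concrete, here is a simplified software model of one translation window. It is not the RAT register interface: the struct, field names, and example addresses are all hypothetical, and real hardware will have its own constraints on window size and alignment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of one address translation window (hypothetical layout). */
typedef struct {
    uint32_t local_base;   /* start of the window in the 32-bit address space */
    uint32_t size;         /* window size in bytes */
    uint64_t system_base;  /* where the window lands in the wider system map */
} rat_region_t;

/* Translate a 32-bit local address; returns false if it is outside the window. */
static bool rat_translate(const rat_region_t *r, uint32_t local, uint64_t *system)
{
    if (local < r->local_base || local - r->local_base >= r->size)
        return false;
    *system = r->system_base + (local - r->local_base);
    return true;
}
```

For example, a window with `local_base` 0x80000000, `size` 0x10000000 and `system_base` 0x880000000 would map local address 0x80001000 to system address 0x880001000.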

As for the similarities between AM64x/AM65x and J721E, I’ve noticed that too. But when you start looking closer at their specs, the J721E is clearly a more updated, and feature-packed offering. So I don’t know how safe it is to try to correlate specs from their documents to J721E. But when you don’t have complete documentation, you work with what you do have…so I understand the desire to try.

And since we are on the subject, here’s another document I dug up on the subject:
sprace8a - PRU Read Latencies
That doc covers for a few different products, but notice how they explicitly indicate the PRU frequency for most of the products with PRU_ICSS but conveniently don’t include that info for PRU_ICSSG equipped products. It seems as it relates to the PRU_ICSSGs, we can’t get a straight answer. Frustrating…

BarryBeagle · February 17, 2023, 1:17pm

The path between PRUs and the other resources, such as DDR, are different from platform to platform.

For example, on the J721E, you often have to go through a Region Address Translator to convert 32-bit…

I suppose the only thing we can really do is test empirically - back-to-back calls to remote mem and measure time between.

Chris_Grey · February 17, 2023, 2:12pm

I agree. And that’s exactly what I’ve been trying to code up to test. The test I envision (not completely coded up yet) is a round-trip test. My plan is to send a value down to the PRU, then have the PRU send it back up to the ARM. On the ARM, I time how long it takes for the round-trip to take, then do it again. Eventually I want to get where I keep sending different values, then as I receive values, compare the timestamp deltas between when it was sent and when it was received.

The logic running in the ARM could easily keep track of things like worst-case, best-case, and average. But to get a proper view of the data, I’d want to let the interaction run about a million times and then output the results to a CSV file. Then in Excel, I could run a T-test analysis of the data and visually determine the statistical likelihood of hitting worst-case and just how often the system stays “near” worst-case.

I think with these details, we can start to get a clear picture of not just the worst and best case scenarios, which illustrate how much jitter there is, but where the system resides most of the time. Hopefully the act of logging this data will add sufficient contention for DDR to give some real-world numbers, not best-case scenario numbers. If not, then I may need to execute a bunch of other random programs in Linux just to get more “activity” on the memory bus.
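The worst-case, best-case, and average bookkeeping on the ARM side could look roughly like this. It is a sketch with made-up names, assuming each round-trip time arrives as a 32-bit tick count.

```c
#include <stdint.h>

/* Running statistics over round-trip samples (hypothetical helper type). */
typedef struct {
    uint64_t count;
    uint64_t sum;
    uint32_t min;
    uint32_t max;
} rt_stats_t;

static void rt_stats_init(rt_stats_t *s)
{
    s->count = 0;
    s->sum = 0;
    s->min = UINT32_MAX;  /* any real sample will be <= this */
    s->max = 0;
}

static void rt_stats_add(rt_stats_t *s, uint32_t sample)
{
    s->count++;
    s->sum += sample;
    if (sample < s->min) s->min = sample;
    if (sample > s->max) s->max = sample;
}

/* Mean of all samples so far, or 0.0 if none were added. */
static double rt_stats_mean(const rt_stats_t *s)
{
    return s->count ? (double)s->sum / (double)s->count : 0.0;
}
```

Keeping only running totals avoids storing a million samples in memory; the raw samples can still be appended to the CSV file as they arrive for the later distribution analysis.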

Once I’ve gotten this successfully testing mailbox transfers, I want to do a similar test for shared DDR memory. So write a value to DDR from ARM. Look for a change in value from the PRU (polling it), then have the PRU write a change to DDR (different location) when the PRU detects the change, and have the ARM looking for the PRU’s change (again polling)…and test the round-trip detection time with this same statistical breakdown.
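The polling handshake described above can be sketched with two sequence counters, each written by only one side. This is a single-address-space model for illustration: the names are hypothetical, and on real hardware the struct would need to live in memory both cores can see, with caching and ordering handled appropriately.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two locations in shared memory, each with exactly one writer. */
typedef struct {
    volatile uint32_t arm_seq;  /* bumped by the ARM to signal a new request */
    volatile uint32_t pru_seq;  /* set by the PRU to acknowledge a request */
} handshake_t;

/* ARM side: post a new request (timestamp taken just before this call). */
static void arm_post(handshake_t *h)
{
    h->arm_seq = h->arm_seq + 1u;
}

/* PRU side: poll once; echo the sequence number if a new request is seen. */
static bool pru_poll_and_echo(handshake_t *h)
{
    uint32_t seen = h->arm_seq;
    if (seen == h->pru_seq)
        return false;
    h->pru_seq = seen;
    return true;
}

/* ARM side: poll once; true when the PRU has echoed the latest request. */
static bool arm_poll_done(const handshake_t *h)
{
    return h->pru_seq == h->arm_seq;
}
```

The round-trip time is then the difference between the timestamp taken before `arm_post` and the one taken when `arm_poll_done` first returns true.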

Finally do an ARM interrupt test where the ARM sends a message to the PRU, and the PRU triggers an ARM interrupt. And when the ARM interrupt is detected, take a timestamp and compare the round-trip time.

With multiple empirical tests like this, we can start determining what kinds of things we can reliably do and what is unsafe to rely on…and which transfers yield the best performance.

I’ve often heard the saying by Linux developers that you don’t need to be real time as long as you are real fast. Tests like these will tell us just how “fast” things are. And from a firmware-development perspective, we can get an idea as to what things are OK to do in Linux and what must be done in the PRU to meet tight deadlines. In my case, I have hard-deadlines that would cause another processor to go off in the weeds doing unpredictable things & eventually watchdog if my BB logic doesn’t keep up. So knowing this kind of empirical information even if I had trustworthy accurate documentation would still be necessary for peace of mind.

BarryBeagle · February 17, 2023, 2:23pm

That could be super useful - please post your code when you get to a point that it’s testable.

Probably also good to find some benchmark utils (sysbench?) to run in Linux userspace while the test happens; this way we can see what impact system load and bus contention have on these best/worst case figures.

BarryBeagle · February 17, 2023, 8:53pm

@Chris_Grey

What information are you using to program PRU on TDA4VM / J721E? The more I dig into it the more confused I become…

I have both the TRM for TDA4VM and the “pru-software-support-package-6.1.0”, but it’s very hard to make heads or tails of programming the PRUs for J721E…

For example, in the PRU Support Package, the include directory for AM335x has 10 header files (pru_cfg, pru_uart, etc.), and the one for AM65x has 12 header files. But there is only a single header file for J721E (pru_intc.h)?

Likewise, in the TRM itself there seems to be missing information compared to AM335x.

Are you perhaps using the files from AM65x?

Chris_Grey · February 18, 2023, 6:37pm

I started off with an Ubuntu VM running TIs Code Composer Studio (CCS) which supports the ability to compile C code for PRUs. But I found for the quick testing I was doing at the time, it was easier for me to setup the PRU cross compiler directly on the BBAI64 and perform the PRU cross-compile on the ARM. This allowed me to make a makefile to script the rebuild and even the shutdown, redeploy, and restart of the PRU being tested.

I wouldn’t recommend doing “big” development directly on the BBAI64, at least not until the Cloud9 IDE is officially supported on the BBAI64 the way it is on the BBB. Setting up complex C projects is not easy to do unless you are a very old school terminal developer OR someone with a fetish with using VI or the other terminal-based code editors. Using a proper IDE can be easier for non-terminal junkies. And if you are working in an organization where there’s a VCS involved, then you can often get VCS-integration into your IDE to make management of these things easier than what can be done on the terminal.

As far as I can tell, the code you write is the same. Now the PRU configuration registers will be different since this is a PRU_ICSSG and the memory pointers to memory-mapped configuration registers will be platform-specific, but that info is accessible in the docs. I did need help and some sanity checks from others on the forum when I was in the thick of doing that. And experimenting & asking questions, I got what I needed…

Now things like this I also find terribly confusing. And the only thing I can really do is take code, and “try stuff.” If there’s a header-file the code wants, I scour my VM or the BB for the file or a file containing similar code and try that. But yes, the differences between the PRU_ICSS and PRU_ICSSG are not as skin-deep as I’ve read elsewhere or maybe the tools are just different now than when the example code I was working with was made with??? I don’t have a good answer for you there other than to say, you just have to keep digging for documents/info, google-searching, and asking the questions. BTW, google-searching for TI documents often finds things you won’t find searching on the TI website. And even then, there may be some example code you just never get to working without digging into the code far more than you wanted to.

There also seems to have been a change in linker at some point and some files and syntax used for the older one don’t work on the new one (or vice-versa).

And the transition from UIO to remoteproc is another one of those changes that will cause older code to be confusing based on documentation.

All that said, I understand that for things to get better, they have to change. So I can’t say I wish things didn’t change so much. But I do get that when they do change, it can be confusing to correlate what is applicable to what, when, why, and where. And I do wish the migrations were better documented. All I can say is Welcome to embedded development.

I’m finding the entire TDA4VM documentation platform quite a disappointment in this respect. TI has done a bad job of this. The info available for PRU_ICSSG is out there, but it isn’t easy to find and it isn’t where you’d EXPECT to find it.

And a lot of this may be that a new generation of people at TI is simply in charge of the documentation, and they aren’t as thorough as their predecessors who documented the AM335x. Keep in mind the AM335x is coming up on 15 years old! It’s easy to imagine that there’s been quite a personnel turnover at TI in that time. Add to that, there was simply less to document for the AM335x. As complicated as that part is, the J721E/TDA4VM is significantly more complicated with more to document. That’s not an excuse. But it is a factor.

What is absolutely inexcusable is TI’s decision to abandon/ignore the PRUs in the TDA4VM. If they wanted to sell a variant that didn’t have PRUs, that’s fine. But release a sku with them and document that sku properly.

Back to the age of the AM335x, we are approaching the time period where TI may retire the part and they may not make a pin-for-pin compatible replacement for it. During the chip shortage, they did stop making the AM3352, forcing companies like mine to the much more expensive AM3358. And as far as I know, they haven’t announced when it will be available again. And from their perspective, retiring chips is the right thing to do even when companies are still buying them. The fab process they are based on is old and the real estate it takes to make them could be used to make newer and MORE parts with less silicon. Also the peripherals required for the AM335x are becoming outdated…specifically eMMC and DDR3. And while the chip may still be perfectly capable for the things we use them for, the components it uses cannot be expected to be made indefinitely. We’ve already entered the time where size-for-size, DDR4 is cheaper than DDR3. Due to DDR3’s lack of popularity, its price will start rising. Even DDR4 is showing the early signs of dropping in popularity as DDR5 begins to gain popularity in the server/desktop/cellphone market. So an AM335x replacement chip designed today should start with DDR5 and probably M.2 keyed for off-the-shelf NVME and another keyed for Wifi.

And while eMMC isn’t quite on that same trajectory since its close cousin SD-Flash keeps it alive, there are other aspects that are affecting it, one being you can’t find industrial eMMC memory that’s rated for 3,000 (30,000 pSLC) WECs anymore. The durability of industrial flash is also getting lower and lower because the size is getting larger and larger. The only way to get more durability is to buy more of it and wear-level across a larger space in order to attain the same total-write durability that eMMC suppliers like Micron used to supply years ago. With larger sizes, the flash itself is no longer the bottleneck to read/write speeds, the eMMC interface is. Relative to interfaces today, eMMC is SLOW!!! And to get support for larger/faster parts, we need a memory controller capable of UFS or NVME…and that means updated silicon.

So I really want to see a cost & feature-equivalent AM335x refreshed to work with modern-day memory components including an updated USB4.0 controller (AM335x is still USB 2.0).

Anyway sorry for that rant…

I reference AM65x documentation from time to time, but I can’t say I am explicitly using code or header files intended for AM65x. If the header file is one of those that contains pre-defined memory-mapped register address values, I’d say that would not be recommended. While you may find instances where memory addresses are the same across platforms, generally this is not something you can rely on. Although I would suspect there are some configuration registers that are more solidified. I don’t know the TI lineup or any of the memory address values well enough to know which peripherals are almost always at the same memory addresses.

BarryBeagle · February 19, 2023, 3:13pm

@Chris_Grey

This is a great writeup and thanks for doing it. It helps a lot.

I’m actually quite new to PRU programming (I’ve done simple bit-toggle “Hello World” level compiles on BBB and original AI) - but jumping into PRU programming on this specific chip is extra confusing.

I started off with an Ubuntu VM running TIs Code Composer Studio (CCS) which supports the ability to compile C code for PRUs…

Yes same. I’m just compiling directly on device now as (a) performance is actually quite good, and (b) CCS, for all its benefits, is adding another layer of complexity to an already confusing situation.

Now the PRU configuration registers will be different since this is a PRU_ICSSG and the memory pointers to memory-mapped configuration registers will be platform-specific, but that info is accessible in the docs.

This is the point where I’m failing.

I found a very helpful blog post outlining basic PRU programming concepts (AM335x based):

Blinky is no problem. However, moving on to the UART example, he relies on pru_uart.h to set all the registers…and I can’t seem to understand how to apply this approach to J721E by hand. I realize it’s probably all in the 6-volume TRM, but 30000+ pages of docs is asking a lot of poor me!

I’m not mentioning that to ask you for help; as per your post above I understand “this is how it is”. I’m really just documenting my situation for any other person who is doing Google searches looking for the same.

What is absolutely inexcusable is TI’s decision to abandon/ignore the PRUs in the TDA4VM. If they wanted to sell a variant that didn’t have PRUs, that’s fine. But release a sku with them and document that sku properly.

I’m really confused about how TI got into this situation… It’s like there is some internal battle going on inside of TI and one department wants to rip out PRU for this chip and the other is fighting to keep it in. I know that’s probably not really what’s going on, but my point is the documentation situation on PRU is almost schizophrenic: one doc (or one section of a doc) will list it as unavailable or unsupported, then another section provides documentation on another PRU interaction with a different subsystem.

Case in point, on page 361 of the TRM it lists the PRU “Control Registers” as a completely blank chart.

icssg_controlregs

Meanwhile there are thousands of other pages outlining all other PRU_ICSSG memory registers…how does one work without the other?

I really want to see a cost & feature-equivalent AM335x refreshed to work with modern-day memory components including an updated USB4.0 controller

That would be really great!