Passing Mailbox messages between ARM and PRU

That could be super useful - please post your code when you get to a point that its testable.

Probably also good to find some benchmark utils (sysbench?) to run in linux userspace while test happens, this way can see what impact system load and bus contention has these best/worst case figures.

@Chris_Grey

What information are you using to program PRU on TDA4VM / J721E? The more I dig into it the more confused I become…

I’ve have both the TRM for TDA4VM and the “pru-software-support-package-6.1.0”, but its very hard to make head or tails of programming the PRU’s for J721E…

For example in the PRU Support Package, the include .h files for AM335x has 10 header files (pru_cfg, pru_uart, etc), the section for AM65x has 12 header files. But, there is only a single header file for J721E (pru_intc.h) ?

Likewise, in the TRM itself there seems to be missing information compared to AM335x.

Are you perhaps using the files from AM65x?

I started off with an Ubuntu VM running TIs Code Composer Studio (CCS) which supports the ability to compile C code for PRUs. But I found for the quick testing I was doing at the time, it was easier for me to setup the PRU cross compiler directly on the BBAI64 and perform the PRU cross-compile on the ARM. This allowed me to make a makefile to script the rebuild and even the shutdown, redeploy, and restart of the PRU being tested.

I wouldn’t recommend doing “big” development directly on the BBAI64, at least not until the Cloud9 IDE is officially supported on the BBAI64 the way it is on the BBB. Setting up complex C projects is not easy to do unless you are a very old school terminal developer OR someone with a fetish with using VI or the other terminal-based code editors. Using a proper IDE can be easier for non-terminal junkies. And if you are working in an organization where there’s a VCS involved, then you can often get VCS-integration into your IDE to make management of these things easier than what can be done on the terminal.

As far as I can tell, the code you write is the same. Now the PRU configuration registers will be different since this is a PRU_ICSSG and the memory pointers to memory-mapped configuration registers will be platform-specific, but that info is accessible in the docs. I did need help and some sanity checks from others on the forum when I was in the thick of doing that. And experimenting & asking questions, I got what I needed…

Now things like this I also find terribly confusing. And the only thing I can really do is take code, and “try stuff.” If there’s a header-file the code wants, I scour my VM or the BB for the file or a file containing similar code and try that. But yes, the differences between the PRU_ICSS and PRU_ICSSG are not as skin-deep as I’ve read elsewhere or maybe the tools are just different now than when the example code I was working with was made with??? I don’t have a good answer for you there other than to say, you just have to keep digging for documents/info, google-searching, and asking the questions. BTW, google-searching for TI documents often finds things you won’t find searching on the TI website. And even then, there may be some example code you just never get to working without digging into the code far more than you wanted to.

There also seems to have been a change in linker at some point and some files and syntax used for the older one don’t work on the new one (or vice-versa).

And the transition from UIO to remoteproc is another one of those changes that will cause older code to be confusing based on documentation.

All that said, I understand that for things to get better, they have to change. So I can’t say I wish things didn’t change so much. But I do get that when they do change, it can be confusing to correlate what is applicable to what, when, why, and where. And I do wish the migrations were better documented. All I can say is Welcome to embedded development.

I’m finding the entire TDA4VM documentation platform quite a disappointment in this respect. TI has done a bad job of this. The info available for PRU_ICSSG is out there, but it isn’t easy to find and it isn’t where you’d EXPECT to find it.

And a lot of this may be a new generation of people at TI simply are in charge of the documentation, and they aren’t as thorough as their predecessors that documented the AM335x. Keep in mind the AM335x is coming up on 15 years old! It’s easy to imagine that there’s been quite a personnel turnover at TI in that time. Add to that, there was simply less to document for the AM335x. As complicated as that part is, the J721E/TDA4VM is significantly more complicated with more to document. That’s an excuse. But it is a factor.

What is absolutely inexcusable is TI’s decision to abandon/ignore the PRUs in the TDA4VM. If they wanted to sell a variant that didn’t have PRUs, that’s fine. But release a sku with them and document that sku properly.

Back to the age of the AM335x, we are approaching the time period where TI may retire the part and they may not make a pin-for-pin compatible replacement for it. During the chip shortage, they did stop making the AM3352 forcing companies like mine to the much more expensive AM3358. And as far as I know, they haven’t announced when it will be available again. And from their perspective, retiring chips is the right thing to do even when companies are still buying them. The fab they are based on is old and the realestate it takes to make them could be used to make newer and MORE parts with less silicon. Also the peripherals required for the AM335x are becoming outdated…specifically eMMC and DDR3. And while the chip may still be perfectly capable for the things we use them for, the components it uses cannot be expected to be made indefinitely. We’ve already entered the time where size-for-size, DDR4 is cheaper than DDR3. Due to DDR3’s lack of popularity, its price will start rising. Even DDR4 is showing the early signs of dropping in popularity as DDR5 begins to gain popularity in the server/desktop/cellphone market. So an AM335x replacement chip designed today should start with DDR5 and probably M.2 keyed for off-the-shelf NVME and another keyed for Wifi.

And while eMMC isn’t quite on that same trajectory since it’s close-cousin SD-Flash keeps it alive, there are other aspects that are affecting it, one being you can’t find industrial eMMC memory that’s rated for 3,000 (30,000 pSLC) WECs anymore.The durability of industrial flash is also getting lower and lower because the size is getting larger and larger. The only way to get more durability is to buy more of it and wear-level across a larger space in order to attain the same total-write durability that eMMC suppliers like Micron used to supply years ago. With larger sizes, the flash itself is no longer the bottleneck to read/write speeds, the eMMC interface is. Relative to interfaces today, eMMC is SLOW!!! And to get support for larger/faster parts, we need a memory controller capable of UFS or NVME…and that means updated silicon.

So I really want to see a cost & feature-equivalent AM335x refreshed to work with modern-day memory components including an updated USB4.0 controller (AM335x is still USB1.1).

Anyway sorry for that rant…

I reference AM65x documentation from time to time, but I can’t say I explicit am using code or header files intended for AM65x. If the header-file is one of those that contains per-defined memory-mapped register address values, I’d say that would not be recommended. While you may find instances where memory addresses are the same across platforms, generally this is not something you can rely on. Although I would suspect there are some configuration registers that are more solidified. I don’t know the TI lineup or any of the memory address values well enough to know which peripherals are almost always the same memory addresses.

@Chris_Grey

This is a great writeup and thanks for doing it. It helps alot.

I’m actually quite new to PRU programming (I’ve done simple bit-toggle “Hello World” level compiles on BBB and original AI) - but jumping into PRU programming on this specific chip is extra confusing.

I started off with an Ubuntu VM running TIs Code Composer Studio (CCS) which supports the ability to compile C code for PRUs…

Yes same. I’m just compiling directly on device now as (a) performance is actually quite good, and (b) CCS, for all its benefits, is adding another layer of complexity to an already confusing situation.

Now the PRU configuration registers will be different since this is a PRU_ICSSG and the memory pointers to memory-mapped configuration registers will be platform-specific, but that info is accessible in the docs.

This is the point where I’m failing.

I found a very helpful blog post outlining basic PRU programming concepts (AM335x based):

Blinky is no problem, however moving on to the UART example, well he relies on pru_uart.h to set all the registers…and I can’t seem to understand how do you apply this approach to J721E by hand. I realize its probably all in the 6 volume TRM, but 30000+ pages of docs is asking alot of poor me!

I don’t mention that asking you for help, as per your post above I understand “this is how it is”. I’m really just really documenting my situation for any other person who is doing google searches looking for same.

What is absolutely inexcusable is TI’s decision to abandon/ignore the PRUs in the TDA4VM. If they wanted to sell a variant that didn’t have PRUs, that’s fine. But release a sku with them and document that sku properly.

I’m really confused about how TI got in this situation… Its like there is some internal battle going on inside of TI and one department wants to rip out PRU for this chip and the other is fighting to keep it in. I know thats probably not really whats going on, but my point is the documentation situation on PRU is almost schizophrenic, one doc (or one section of a doc) will list it as unavailable non-supported, then another section provides documentation on another PRU interaction with different subsystem.

Case in point, on page 361 of TRM it list the PRU “Control Registers” as a completely blank chart.

icssg_controlregs

Meanwhile there are thousands of other pages outlining all other PRU_ICSSG memory registers…how does one work without the other?

I really want to see a cost & feature-equivalent AM335x refreshed to work with modern-day memory components including an updated USB4.0 controller

That would be really great!

1 Like

I’ve never done anything complex enough to interact with a UART within the PRU, however if you have existing code intended for the PRU_ICSS, I would think it wouldn’t be too terribly difficult to migrate that to a PRU in the PRU_ICSSG. The main thing(s) that would need to change are the memory addresses to the UART you are using. That’s just a matter of looking them up in the TRM. And just to give you some confidence, you might want to look them up in the TRM of the AM335x and verify the values you are changing are the same as what you identified in the AM335x’s documentation. Then change them to the address documented for J721E. It “should” be just that simple. And if TI was consistent about the UART values it exposes and the values that are to be written to those UART address locations, then I don’t know why it wouldn’t be this simple. However if there are interaction differences between the two platform’s UARTs, then it will get a little more complicated. For instance, lets say (and I’m making stuff up here) the AM335x wants you to write:
1 for 9600 baud
2 for 19200 baud
3 for 28800 baud
etc…

But the J721E’s UART expects the actual baud value be written. In this case, you’d have to update the code to write what the J721E expected. But again, that shouldn’t be horrible to figure out.

All that said, I personally wouldn’t try to communicate anything more than the absolute simplest coms out a UART from a PRU. My experience with UARTs has been with highly structured languages that require a full-up stack and multiple threads to perform asynchronous transmit/receive functions. Typical UART interactions are FAR easier done in an RTOS than a PRU. But if all the PRU code is doing is outputting human-readable characters as a debug mechanism, then that’s probably appropriate.

@Chris_Grey, @BarryBeagle, et. al:

I was just reviewing my old R5F gpio toggle experiment post. In that post, Chris stated that he was working on the BBAI-64 PRUs but hadn’t yet gotten an IO to toggle. I wanted to check back in with you guys: Have you been able to use the BBAI64 PRU_ICSSG to control IO?

Thanks and Best regards,
Fred Eckert

I have not experimented with physical I/O yet. I haven’t done much since mid January. I did most of this investigation over the Christmas holidays. And once “real” work (and life) kicked back in, I haven’t had a lot of time to dedicate to looking into this further. I still have the BBAI64 here on my desk. I just haven’t been able to break away to do much with it. Dealing with logical things was easy enough with just the board. But Physical I/O requires I connect things and get more involved. And while that isn’t impossible, I haven’t take the time to do that work. I do have one of those prototyping capes with the tiny breadboard and 2 LEDs and some buttons. I got that back when I was working with the BBB over a decade ago. I’m hoping it’ll plug in and let me do some of that I/O testing with simple things like the switch and LEDs it has. That just hasn’t happened. It’s still on the to-do list…but it may be Thanksgiving before I get back to it.

I’m also hoping there’s more maturity to the development environment to make my investigations a bit easier. I’m also hoping that some more support for sleep/quick-wake is developed. The product I’d hoped to use the BBAI64 as the heart of would REQUIRE that. When I investigated this, it seemed all the way up to TI, this had not really been explored or exercised to know how well it really works and certainly no code-support for it by TI or BB. It seems to only be a silicon possibility, not a supported feature. That, alone, was the biggest loss of wind from my sails for investigating anything on the BBAI64 further.

Do you really need that much horsepower for your product. Ti is not doing a good job at making friends with us, other options do exist so I have been looking at other vendors. Not sure why Beagleboard is so committed to Ti product.

So far it makes me feel like we have been chasing our tails, doing way too many deep dives into docs that are lacking content between point A and point B. Then so much of it is stale. Pretty sure if we were one of Ti’s top ten customers our journey would not be so difficult.

Do I need that much HP, the answer is yes and no.

To do the base-functionality of what I need, probably not. And if I was willing to have companion software on the PC that did a lot of the configuration, datalogging, and general UI, then I could probably get away with just a BBB…maybe.

But the product I envision is one that also hosts its own WebUI that will allow a user to make changes to the configuration as well as show datalogging and trending information in realtime using nothing more than a browser on their PC. And to get the rich experience without a lot of lag requires a fair amount of server-side heft, even if the browser is doing a lot of UI lifting.

And the reason I say this is because of my current experience doing EXACTLY THAT on the AM335x. The product I work on professionally has real-time logic running along-side a web UI. And when the WebUI is exercised, not only is it a SLOW user experience, it takes clock cycles away from the normal operation code (duhh, it’s a single core processor). Even with thread priority tinkering, it’s still not great.

So I really want something with more processing power in each core, to get a more usable user-experience. And I’d like multiple cores so I can do things like dedicate Linux processes to one processor and put all other system functionality and WebUI on the other.

The AI64 would be an excellent choice. I had to kill the desktop manager and run it as a headless server and it is solid. One is in production here and it has not caused any trouble. Its up against metal box dells and its holding its own and from a wall transformer. Its on NVMe and that puts it in a class of its own, kinda like a crossover server. We only need the reliability for internal connections and it is holding its own. Its not even on the backup supply and has been off from power outages and it recovered without intervention.

I was reading the docs on the AM64 and AM62 regarding the PRU and those two don’t play the same tune. Some functionality is on one variant and it does this and not that… It would be nice if they would explicitly indicate what it can and cannot do. Running down a rabbit hole then to discover it dead ends is not good.

Did get CCS up and tried the PRU demos out and it loads the those cores just fine with JTAG, when it wants to. Its flaky but still usable. Now I tried to load the A53 from CCS and it does not debug or load the cores, they spend so much time putting CCS together then the ball bounces. I made the wrong assumption when starting out, from all the hype it appeared to be a continuous solution, that is not the case.

If you spent the time to integrate GDB into CCS it would be okay, that would be about the only way to do it with Linux kernel.

Oh yeah, I have no complaints about the BBAI64’s performance. But with that performance comes a lot of heat, even when the processor is just sitting idle. And that’s a level of power consumption you cannot tolerate in an automotive environment when the engine is OFF. So a low-power mode is needed. And when the ignition is turned ON, a quick-power-from-sleep is needed to return the processor to full power and speed is needed to be ready to do the control necessary with minimal sleep->wake transition lag.

So as much as I like about the BBAI64, there are just as many things that are, essentially shortcomings. The scattered and incomplete documentation on TI’s side, the lack of support and example code for features that TI claims are possible, and as you said, the differences in peripheral capabilities. Just to give an example, look on the beagleboard.org website’s home page. The BBAI64 is listed to have 4 PRUs, which is not incorrect, but is incomplete. The J721E/TDA4VM has 2 PRU_ISCSSG subsystems, each of which have 2 PRUs, 2 RTUs, and 2 TX cores. Each of those cores are similar-but-different, and while I did find the documentation on what exactly those differences are, it’s not easy to come across…especially given the TDA4VM documentation doesn’t even mention the presence of those subsystems. (why TI, why?) And you have to just know to also look for J721E documentation (a processor that doesn’t actually exist as far as I can tell). And when you don’t know, you have to go glean that knowledge via forums like this.

So there’s a lot of good, bad, and ugly as it relates to the BBAI64, BB’s relationship with TI, and what TI has offered as support & guidance. And as good as the good is, the bad and ugly are where we, as developers live.

Yes, that is so very true. They do have a big TRM, we were going to use Khadas VIM boards but amlogic is way too secretive for us to use. No TRM and they would not let me have developer access to the good stuff. TI on the other hand does provide MANY points of entry into the device. It must be working very well for others or they would not be where they are at. If you have a large pool of researchers, developers, programmers, hardware people it would be much easier. When you have limited this and that plus trying to find an employee that even knows what an ARM processor is, the other major hurdle.

From what I have read so far the PRU is a home run in theory. All I am doing is getting tired of connecting the missing dots. This is actually way worse than in college. My prof. would give us a “black box” projects all the time. He would sketch an input waveform and an output waveform. We had to do the rest and also said every one’s homework better be different. Point I am getting at is you can always use a scope and signal generator to characterize and experiment with the device. This stuff is purely invisible. Back in the old days we had an ICE (in circuit emulator) that would plug in the SOCKET of the processor and then work on it. This SoC stuff is for the birds.

I read this thread for the first time today… the topic is very specific, and not one that I am particularly interested in. HOWEVER if you are interested in gomer’s $.02 you might abandon your quest to decipher remoteproc.

If your project is at the prototype / poc stage (just my guess), you might consider revisiting the trusty BBB. It still has a lot going for it despite its’ age, like coherent documentation.

At the risk of being a Johnny one note, I’m going to suggest that you abandon remotemsg, and look into ring buffer(s) for your communication between ARM and PRU. At minimum, you could put up your poc without dealing with the poor documentation for more current BB products.

if you take a look at the app that I’ve made available, you’ll see that its’ ARM utilization is barely measurable even though the most complex processing is done there. it takes very few ARM cycles to do the ring buffer processing. While this clock app has no feedback from PRU to ARM, it would be trivial to implement (as a poc).

Ring buffers have lot’s of advantages over even the vaporware speak of remotemsg. They are always async, low overhead, and don’t rely on interrupts.

this gif is the best explanation of how they work, the text of the wikipedia article sadly is meh.

@FredEckert has recently posted that he successfully implemented this application. This gives me confidence that I haven’t missed any critical code or instructions to get it running. Thanks again Fred.

For your timing needs, I’m going to suggest this approach that I use routinely in other projects:

  • set up a 1mhz pwm signal into a pru fast input
  • increment a register for each pulse of the pwm (accurate 1uS timer)
  • segment your pru code into instruction blocks < 200 instructions (run in < 1uS)
  • block (waste) the remainder of the uS waiting for the pwm pulse.

the math of this approach is 4G / 1M = 4K. 4K seconds is about 68minutes 15 seconds then you have to code for rollover.

good luck!
gomer

1 Like

Might very well work good in bare metal, but with a linux kernel. Just way too much going on. The part I like about the PRU is according to docs it should run totally autonomous from the main A53 cores running linux. Have I gotten to that point yet, no, other than playing with the “hello world” and testing JTAG has been my reach on the issue. It does seem like it is worth trying to get going. Just assuming the docs are being truth full.

For some reason the docs appear to be written from developers notes. Whomever is writing them does not understand what is going on or they would be able to fully articulate to others what is going on.

Your gif did not load, it seems like many years ago we did implement ring buffers using the old ttl / cmos logic chip designs. That stuff was locked in place so everything was predictable. Its has been many years ago so please forgive me if this is not correct.

it loads for me … strange

main wikipedia page : Circular buffer - Wikipedia

again, not a fan of this article, but the gif is excellent. The main advantage of a ring buffer is that it allows BOTH processors to run simultaneously if sufficiently large. The clock app demonstrates this in code, which is MUCH better than I can describe in english.

gomer

1 Like

HA! The whole reason I poked this thread was to see if it was feasible to port gomer’s PRU clock over to the BBAI-64. I figured it would be a good learning experience. However, as a wise man once said, I am not up to the skill level to be the BBAI-64 PRU pioneer… In fact, I hope I never have to use the PRUs. Hope to stay with the R5F processor.

Since I managed to derail yet another thread, we might as well keep the derailment going… As part of my RF5 research, I have been trying to find the time to study RemoteProc and RPMsg. Aren’t vrings ring buffers? I have gotten TI’s RPMsg ipc_echo_baremetal_test program to work Linux to R5F baremetal.

Gomer why are you so against remoteproc?

If I keep all my realtime code in the R5F, it shouldn’t matter. However, I need to dump over more that 512 bytes of data in some sort of a double buffering mechanism. My very preliminary research indicates that DMA-BUF may be what I may need… Any comments?

Gomer thinks not. ring buffers do not use interrupts to ‘kick’ their fellow processors, alerting them that they have a msg. This seems to gomer to be synchronous. Ring buffers are asynchronous, and if sized adequately, neither writer nor reader needs to concern itself with the status of the contra processor. Of course, gomer could be wrong about anything.

Gomer isn’t against remoteproc, although there was a period of annoyance porting from the earlier UIO code. To make the sample clock app turnkey, gomer uses remoteproc to load and start the PRU(s). Remoteproc does this fine.

Gomer doesn’t understand RPMsg and sometimes has thought RPMsg and written remoteproc or remotemsg. Sorry for the confusion gomer is.

Gomer likes the PRU asm coding and has not found any documentation to guide using RPMsg with asm.

R5F baremetal is a new term which gomer does not grok. BUT, since the raw url for this link includes the string ‘ccs’, gomer will ignore.

Realtime code and R5F and bare metal… gomer seems to have strayed into a realm that gomer has NO experience in. Better to remain silent and be thought a fool, than to open your mouth and remove all doubt.

ccs 12 has a demo and I have had that up and running. Have you looked into that one.

You’ve gotten a CCS based, TI provided, demo for the BBAI-64 running!? For the BBAI-64, I’ve only been able to use the makefile demos. Is there a particular device you selected in the wizard or did you have to manually load something? What OS is your host? I need to see this demo!

I tried using their Ubuntu 18 build instructions for ccs and yocto. Both turned into a headache.
Now I am running yocto on Ubuntu 22.04 build server that is a dedicated to TI builds. Did get it up on 20.04 then wiped it out and went upto 22.04. I am building images that boot beagleplay, AM62, BBB. Also have ccs up on 20.04 and 22.04 desktops. Not sure if I kept notes on the ccs and yocto. It will run, did have to hunt for some dependencies. I will check my cherrytree files and see I have any notes. Started a to do a write-up for 20.04 and then changed to 22.04 so that is only in a primative note state at the moment.

Pretty sure if you start out with ccs on 22.04 it will only be trivial issues. Keep in mind they have that loaded with tons of different products some of that stuff might not build. I have only tried it on AM62, also have the wifi evm too but have not gotten to that one yet.

I am doing it on the AM62 EVM board with JTAG probe XPS110, if you get the EVM board you don’t need the probe. The debugger is on the board so it is just a USB cable.

I should also learn to read, just noticed you are on AI64… we are currently with the Beagleplay / AM62.