BeagleBone AI-64 showing 2GB of RAM in htop

Hello all!

I have a BeagleBone AI-64, which should have 4GB of RAM, but in htop I only see 2GB.

Is this normal?

Yes, the rest of the memory is allocated to the co-processors.
There is an alternative dtb file that disables the shared memory.

See this post.
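If you want to confirm it yourself, here is a minimal sketch (only standard Linux /proc interfaces, nothing board-specific) that prints the total RAM the kernel sees and the reserved-memory nodes the device tree carves out:

import os

# Total memory visible to the kernel -- this is what htop reports.
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith("MemTotal"):
            print(line.strip())
            break

# Reserved-memory entries from the loaded device tree; on the BBAI-64 the
# carve-outs for the co-processors show up here. Node names vary by dtb.
rsv = "/proc/device-tree/reserved-memory"
if os.path.isdir(rsv):
    for node in sorted(os.listdir(rsv)):
        print(node)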

Thank you for the information!

I’m not really familiar with co-processors. Do they need memory for the system to be stable? Are they used automatically depending on the workload, or are they independent and only used when asked? If you have any information about it, I’d be happy to look at it :slight_smile: Thanks in any case!

They’re only used if asked. You can either manually load firmware onto them (e.g. Minimal Cortex-R5 example on BBAI-64) or use a library that comes with its own (e.g. TIDL/EdgeAI benchmarks on the AI-64). The coprocessors on the TDA4VM share memory space with the main application cores (at least, all but the MCU-domain ones do) so they can pass data back and forth with minimal overhead. It’s totally safe (as far as I know) to disable the reserved memory so long as you aren’t using TIDL or other libraries which need it.
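If you want to see which co-processors the kernel knows about and whether any firmware is assigned, the standard remoteproc sysfs interface is enough. A rough sketch (node names and attribute availability vary a bit by kernel version):

import glob, os

# Each remoteprocN directory is one co-processor core managed by Linux.
for rp in sorted(glob.glob("/sys/class/remoteproc/remoteproc*")):
    def read(attr):
        try:
            with open(os.path.join(rp, attr)) as f:
                return f.read().strip()
        except OSError:
            return "?"
    # 'state' is e.g. running/offline; 'firmware' is the image remoteproc
    # would load into the core.
    print(os.path.basename(rp), read("name"), read("state"), read("firmware"))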


Thanks! I’ve actually been playing with these TFLite TIDL examples, and I haven’t seen any speed difference between the 2GB and 4GB setups. I’ll make sure to pay attention to it though. Thanks again!

Huh, interesting. I tried using the device tree that disables the shared memory and TIDL failed to run. It wasn’t a thorough test though, just a cursory experiment before switching back.

Actually, never mind… I misinterpreted what j7 and am62 were (and I still don’t know what they are). I find the Python examples kind of messy and not very clear… Anyways, I’m just trying to use the TIDL delegate like so:

import tflite_runtime.interpreter as tflite  # provides Interpreter and load_delegate

# delegate_options is defined earlier, per TI's example scripts
interpreter = tflite.Interpreter(
    model_path="/path/to/example/model",  # placeholder path
    experimental_delegates=[tflite.load_delegate('libtidl_tfl_delegate.so',
                                                 delegate_options)]
)

But the RAM usage explodes and the process gets Killed each time.
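If it helps diagnose it, here is roughly how the growth can be watched from inside the process (plain /proc reading, nothing TIDL-specific; the call sites are placeholders):

# Print this process's resident set size (VmRSS) at interesting points,
# to see where allocation blows up before the process gets killed.
def print_rss(tag):
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS"):
                print(tag, line.strip())
                break

print_rss("before delegate load")
# ... build the interpreter / load the delegate here ...
print_rss("after interpreter creation")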

The TIDL framework relies on these co-processors. If they are not loaded at boot and not given RAM, specific functions of the TIDL framework will not work…

Regards,


Okay, thanks.

I’ve also been playing around with @kaelinl’s demo code, and it works better than the TFLite delegate. For some reason I haven’t been able to launch anything with TFLite, but with that demo code I was even able to convert YOLOv5s6 Lite (from the Edge AI Zoo) and it ran nicely!


“Killed” sounds like it might be the OOM killer detecting an out-of-memory condition. Regardless, as Robert suggested, I wouldn’t expect the coprocessors to be functional without being given firmware or memory (which is what the device tree does). So even if this particular error is something else, I wouldn’t expect inference to get very far.
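You can confirm it from the kernel log; something like this is enough (just filtering dmesg output; the exact message wording varies by kernel version, and dmesg may need root):

import subprocess

# Look for OOM-killer activity in the kernel ring buffer.
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "oom" in line.lower() or "Killed process" in line:
        print(line)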

Have you already “compiled” the model for TIDL, or is it just running unaccelerated inference? I’ve been working with the PyTorch/ONNX APIs, so I don’t know what it looks like to compile a tflite model with their tools. I think you’ve already found the thread in which I ran their benchmarks and then compiled a custom ONNX model (TIDL/EdgeAI benchmarks on the AI-64 - #13 by kaelinl), but it might be at least mildly helpful.
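One quick sanity check on whether the delegate is doing anything at all is to time the same model with and without it. A rough sketch using only the stock tflite_runtime API (the model path and delegate_options are placeholders for your own setup):

import time
import numpy as np
import tflite_runtime.interpreter as tflite

def mean_latency_ms(interpreter, runs=10):
    # Feed a dummy input and average the invoke() time over a few runs.
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"],
                           np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1000

delegate_options = {}  # placeholder -- take the real keys from TI's examples

cpu = tflite.Interpreter(model_path="model.tflite")
print("CPU only: %.1f ms" % mean_latency_ms(cpu))

tidl = tflite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[tflite.load_delegate("libtidl_tfl_delegate.so",
                                                 delegate_options)])
print("With TIDL delegate: %.1f ms" % mean_latency_ms(tidl))

If the two numbers come out basically the same, the delegate is probably falling back to the CPU.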

Fully agreed on the code quality. It’s a bit striking that a company could willingly go to market with a software ecosystem in this state… There is absolutely no documentation, and all they have for examples are poorly written omni-tools. And since their forum support strategy is to guide people into hiding their bugs via workarounds, none of it gets fixed. I’m continuing to play with my BBAI-64 but I don’t think I will be able to recommend them due to no fault of bb.org. Nonetheless, I’m still trying to get a custom model to run and demonstrate the performance :laughing:

Re: “j7”, it’s referring to “j721e”, which is another name for the same TDA4VM SoC. I haven’t figured out exactly what the name means, but I’ve yet to find a case where they weren’t equivalent, so I guess that’s good enough. In other words, “j7” is what you want. The alternative “am6x” series is a different SoC line with its own (older) hardware.

TI tends to use different names for different markets (sometimes with features eFused out, etc.)… In our case:

J721E and DRA829 are other TI names for the TDA4VM on this device.

It’s based on TI’s K3 Keystone family, which includes older (and newer) am6x-based devices…

Regards,


Yes, I’m using the compiled models from the Edge AI TIDL examples. I also tried the provided Python example, but it still gets Killed. I’m using the shared-memory setup (2GB), and I can see in htop that RAM and swap reach 100%. I’m not sure why, because it works well with the PyTorch/ONNX API :smile:

For the record:

  • YOLOv5s6 with TFLite, 2 CPU threads: 670 ms
  • YOLOv5s6 with the PyTorch/ONNX TIDL implementation: 278 ms

Which is not that bad, I guess, but I was hoping for something a bit faster from 8 TOPS!

Hopefully the framework will get better (and support more ops!). We’re testing different boards at work, and we can’t rely on something in this state.

Totally, seems fair. That performance is worse than I would have expected even for such a large model (I’d guess it’s swapping heavily or falling back to the CPU for some reason) but if you’ve eliminated the board for software support anyway it doesn’t much matter! Out of curiosity, what other boards are you considering? I assume the Jetson line is one?

Yes, the Jetson line is definitely one! We’ve used the Jetson boards in different projects, and we really love what they can offer. The overall software support is pretty good, I find it really easy to work with, since the code is basically the same as the one you would have on a regular PC with an Nvidia GPU!

The biggest constraint for us right now is their integration into our production systems. It has been reported that they cause interference with GPS modules and IMUs. We’ve also run into issues with JetPack, where flashing the card was a bit tricky, and the 5.0.2 release came out a bit too late for us.

Apart from that, we’ve been looking at the Raspberry Pi and this BeagleBone, and we’ve talked with Neural Magic about how they plan to support ARM chips sometime soon (the Khadas Edge 2 seems to be a good fit for their framework). Honestly, though, I haven’t found many other options, and most boards are hard to find in stock anyway :cry: There are some boards with older Intel CPUs and some with more recent AMD CPUs, but I prefer ARM boards for their lower power consumption.

I forgot to mention the Google Coral M.2 PCIe TPU, which we would also like to test, but it’s so hard to find in stock.

Nice, that’s a solid list! I’ve had quite a good experience with the Jetson line. I hadn’t heard of Neural Magic; that’s actually super slick, I’ll have to keep an eye on them. And I hadn’t seen the M.2 variant of the Coral TPU either!

My interest in the TDA4VM right now is fueled in large part by the realtime coprocessor cores, which I think make it an interesting platform for running bare-metal embedded control logic plus ML inference on the same chip. It’s an active area of experimentation for me :laughing: