PRU locking up BeagleBone AI

I’m getting this stack trace in dmesg, but I’m unsure what it means or how to figure that out. As far as I can tell, the code running on the PRU is working: I’m generating a 100 kHz signal on a direct output and am able to measure it successfully. The BeagleBone is locking up, though, and I believe this stack trace is being spammed so heavily that the logging takes over the CPU and my SSH shell gets locked out.

I’m using this device tree overlay: https://github.com/PocketNC/BeagleBoard-DeviceTrees/blob/pocketnc-ai-test/src/arm/am5729-beagleboneai-pocketnc-pro.dts

The code I’m running is implemented in PRU Assembly that is assembled with pasm. pasm outputs a .bin file and I need a .elf file for running it with remoteproc, so I’m jumping through some hoops to do that conversion. The elf file does seem to work, but I’m not sure if I need to do more to ensure I’m specifying what resources I need access to or something like that. I can go into more detail if need be.
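For what it’s worth, remoteproc normally looks for a .resource_table section in the ELF it loads. With firmware built using TI’s clpru toolchain, a minimal “empty” table along the lines of the pattern below is usually enough; a pasm .bin converted to ELF would need an equivalent section synthesized into it. This is just a sketch, assuming the rsc_types.h header from TI’s PRU software support package:

    #include <stddef.h>
    #include <stdint.h>
    #include <rsc_types.h>      /* struct resource_table, from the PRU support package */

    struct my_resource_table {
        struct resource_table base;
        uint32_t offset[1];     /* must have at least one entry even when unused */
    };

    /* Place the table in the .resource_table section so remoteproc can find it */
    #pragma DATA_SECTION(pru_remoteproc_ResourceTable, ".resource_table")
    #pragma RETAIN(pru_remoteproc_ResourceTable)
    struct my_resource_table pru_remoteproc_ResourceTable = {
        1,      /* resource table version */
        0,      /* number of entries: none, we only need the PRU loaded */
        0, 0,   /* reserved, must be zero */
        0       /* offset[0], unused */
    };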

The stack trace is below. Any ideas about what is going on are appreciated!

[ 168.153783] ------------[ cut here ]------------
[ 168.153829] WARNING: CPU: 0 PID: 0 at drivers/bus/omap_l3_noc.c:147 l3_interrupt_handler+0x27c/0x39c
[ 168.153851] 44000000.ocp:L3 Custom Error: MASTER PRUSS2 PRU1 TARGET L4_PER1_P3 (Idle): Data Access in Supervisor mode during Functional access
[ 168.153865] Modules linked in: xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 rpmsg_rpc rpmsg_proto bnep btsdio bluetooth ecdh_generic brcmfmac pvrsrvkm(O) brcmutil cfg80211 uio_pruss_shmem evdev joydev stmpe_adc omap_remoteproc virtio_rpmsg_bus rpmsg_core 8021q garp mrp stp llc iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat usb_f_acm nf_conntrack u_serial usb_f_ecm usb_f_mass_storage iptable_mangle iptable_filter usb_f_rndis u_ether libcomposite cmemk(O) uio_pdrv_genirq uio spidev pruss_soc_bus pru_rproc pruss pruss_intc ip_tables x_tables
[ 168.154474] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W O 4.14.108-ti-r119 #1
[ 168.154490] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 168.154538] [] (unwind_backtrace) from [] (show_stack+0x20/0x24)
[ 168.154575] [] (show_stack) from [] (dump_stack+0x80/0x94)
[ 168.154609] [] (dump_stack) from [] (__warn+0xf8/0x110)
[ 168.154636] [] (__warn) from [] (warn_slowpath_fmt+0x58/0x74)
[ 168.154667] [] (warn_slowpath_fmt) from [] (l3_interrupt_handler+0x27c/0x39c)
[ 168.154703] [] (l3_interrupt_handler) from [] (__handle_irq_event_percpu+0xbc/0x280)
[ 168.154734] [] (__handle_irq_event_percpu) from [] (handle_irq_event_percpu+0x3c/0x8c)
[ 168.154761] [] (handle_irq_event_percpu) from [] (handle_irq_event+0x48/0x6c)
[ 168.154792] [] (handle_irq_event) from [] (handle_fasteoi_irq+0xc8/0x17c)
[ 168.154822] [] (handle_fasteoi_irq) from [] (generic_handle_irq+0x34/0x44)
[ 168.154850] [] (generic_handle_irq) from [] (__handle_domain_irq+0x8c/0xfc)
[ 168.154879] [] (__handle_domain_irq) from [] (gic_handle_irq+0x4c/0x88)
[ 168.154908] [] (gic_handle_irq) from [] (__irq_svc+0x6c/0xa8)
[ 168.154925] Exception stack(0xc1501ed8 to 0xc1501f20)
[ 168.154946] 1ec0: 00000001 00000000
[ 168.154973] 1ee0: fe600000 00000000 c1500000 c1504e60 c1504dfc c14cbb78 c1501f48 00000000
[ 168.154997] 1f00: 00000000 c1501f34 c1501f14 c1501f28 c012fcb8 c0109768 600f0013 ffffffff
[ 168.155031] [] (__irq_svc) from [] (arch_cpu_idle+0x30/0x4c)
[ 168.155061] [] (arch_cpu_idle) from [] (default_idle_call+0x30/0x3c)
[ 168.155092] [] (default_idle_call) from [] (do_idle+0x180/0x214)
[ 168.155124] [] (do_idle) from [] (cpu_startup_entry+0x28/0x2c)
[ 168.155156] [] (cpu_startup_entry) from [] (rest_init+0xdc/0xe0)
[ 168.155194] [] (rest_init) from [] (start_kernel+0x434/0x45c)
[ 168.155217] ---[ end trace d9047b952a20ba7f ]---

What is the code running on PRUSS2 PRU1?

This line pretty much spells out an illegal access by that PRU (or to it):

MASTER PRUSS2 PRU1 TARGET L4_PER1_P3 (Idle): Data Access in Supervisor mode during Functional access

Looks like the error is from here: https://github.com/beagleboard/linux/blob/7a920684860a790099061b67961d0b5ffa033fdf/drivers/bus/omap_l3_noc.c#L135

Looks like a bus exception to me.

It’s the hal_pru_generic code. It definitely smells like a bus error. In fact, if I comment out the lines that write to the GPIO, it stops happening, so it seems like I have the wrong addresses in there, but I’m struggling to figure out how that could be.

These lines are where the GPIO ports are written to in memory:
https://github.com/PocketNC/machinekit-hal/blob/c8b38386d87abc45baa33593681cbae46d996980/src/hal/drivers/hal_pru_generic/pru_wait.p#L214-L217

Theoretically, the addresses should be set to the clear addresses of GPIO3, GPIO5, GPIO6 and GPIO7:

Addresses defined here:
https://github.com/PocketNC/machinekit-hal/blob/c8b38386d87abc45baa33593681cbae46d996980/src/hal/support/pru/pru.h#L303-L307

Loaded into registers here:
https://github.com/PocketNC/machinekit-hal/blob/c8b38386d87abc45baa33593681cbae46d996980/src/hal/drivers/hal_pru_generic/pru_generic.p#L261-L264

Even if I hardcode the addresses in there (to eliminate the possibility that my registers were getting overwritten somewhere), I still get the bus error. Does enabling the OCP master port work the same way as on the BBB? It’s supposedly being set here: https://github.com/PocketNC/machinekit-hal/blob/c8b38386d87abc45baa33593681cbae46d996980/src/hal/drivers/hal_pru_generic/pru_generic.p#L174-L176
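For comparison, in C firmware built with the TI support package the OCP master port is normally enabled by clearing STANDBY_INIT in the PRU-ICSS SYSCFG register, which should be the same mechanism those assembly lines use. A sketch, assuming the pru_cfg.h header:

    #include <pru_cfg.h>    /* defines CT_CFG, mapped onto the PRU-ICSS CFG registers */

    static void enable_ocp_master_port(void)
    {
        /* Clear STANDBY_INIT so the PRU can master the OCP port and
           reach addresses outside the PRU-ICSS (e.g. the GPIO banks). */
        CT_CFG.SYSCFG_bit.STANDBY_INIT = 0;
    }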

Are there any ramifications of the PRU writing 0 to both the set and clear addresses of GPIO8 (0x48053190 and 0x48053194), when the device tree has several overlapping pins allocated to being direct outputs on the PRU? The issue seems to arise when I write to those two addresses on the PRU, as well as the set and clear addresses of GPIO4 (0x48059190 and 0x48059194). What could cause that to trigger an exception in the kernel?
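For concreteness, this is essentially what the firmware does at those addresses, written out as plain C (bank bases are the ones quoted above; the 0x190/0x194 offsets are GPIO_CLEARDATAOUT and GPIO_SETDATAOUT in the TRM). As far as I understand the GPIO module, writing 0 to these registers is a no-op, since only the bits written as 1 take effect:

    #include <stdint.h>

    #define GPIO4_BASE          0x48059000u
    #define GPIO8_BASE          0x48053000u
    #define GPIO_CLEARDATAOUT   0x190u   /* 1 bits clear the corresponding outputs */
    #define GPIO_SETDATAOUT     0x194u   /* 1 bits set the corresponding outputs   */

    static inline void gpio_write(uint32_t base, uint32_t offset, uint32_t mask)
    {
        *(volatile uint32_t *)(base + offset) = mask;
    }

    void poke_gpio_banks(void)
    {
        /* Writing 0 should change nothing on the pins; the L3 error in dmesg
           points at the target bank being idle rather than at the data written. */
        gpio_write(GPIO8_BASE, GPIO_SETDATAOUT,   0);
        gpio_write(GPIO8_BASE, GPIO_CLEARDATAOUT, 0);
        gpio_write(GPIO4_BASE, GPIO_SETDATAOUT,   0);
        gpio_write(GPIO4_BASE, GPIO_CLEARDATAOUT, 0);
    }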

Ok, I have a more localized test that triggers the exception:
https://github.com/PocketNC/cloud9-examples/blob/test-pru-gpio-access/BeagleBone/AI/pru/doNothingToGPIO8.pru1_1.c

Two stack traces can be seen in dmesg after running that on the PRU. If it has something to do with the device tree, this is the overlay I’m using:
https://github.com/PocketNC/BeagleBoard-DeviceTrees/blob/pocketnc-ai-test/src/arm/am5729-beagleboneai-pocketnc-pro.dts

Using that test, I was able to quickly check which GPIO banks the PRU could write to. GPIO 1, 4 and 8 errored. GPIO 2 doesn’t have any pins mapped to the P8 or P9 headers, so that leaves GPIO 3, 5, 6 and 7 for hal_pru_generic. The P8 and P9 pins that are mapped to GPIO 4 and 8 can all be mapped as direct outputs on certain PRUs. I’ll need to document how each pin can be used, but it seems like just about all the P8 and P9 pins are usable as long as you know which PRU to run on for direct outputs and which pins are suitable as GPIO outputs.

Regarding your bus errors, I don't see anything in the TRM that indicates the PRU shouldn't be able to talk to all of the GPIO banks.

I have, however, seen bus errors on uninitialized GPIO banks which come up disabled by default. Check to make sure at least one GPIO pin is exported by the Linux Kernel (either manually or via device tree) and see if the bus errors go away.

I’ll double down on that feedback. I forwarded this to the team at TI, and they said it’s likely that the GPIO bank doesn’t have its clock enabled. I asked what the minimal action to enable the clock would be, but haven’t heard back yet. Enabling one of the GPIOs in the bank from Linux seems like a sure way to do it.

I’ll get back when I can find anything more minimal than that.

That did it! I exported pin 8.43:

echo 226 > /sys/class/gpio/export

And now the PRUs can write to the set and clear addresses for GPIO8.

I also exported 8.26:

echo 124 > /sys/class/gpio/export

And now the PRUs can write to the set and clear addresses for GPIO4.

This makes a whole lot more sense now, because I swear I had it working on those GPIO ports at one point. I think I had exported pins manually while testing things. Thanks for the help!

Got a more complete answer from the TI support team:

This seems like a classic case: the Remote core (the PRU in this case) is trying to access a GPIO bank shared with Linux.

Linux, by default, owns all GPIO banks, and if no GPIO lines are requested by Linux, the GPIO bank will be powered down.

Correct solution:

  • GPIOs used by the Remote core have to be grouped in some GPIO bank X, and that bank has to be removed from Linux
    (disabled or completely removed in the DT).
    The Remote core can then use GPIO bank X (including GPIO IRQs), but it needs to ensure the bank is enabled.

Possible option:

  • some lines from GPIO bank X are used by Linux and some by the Remote core
    !! but not as IRQ sources !!
    IF:
    (a) GPIOX_Y is used by some Linux driver and requested before the Remote core is loaded, or
    (b) an unused GPIOX_Z line is requested through the sysfs/GPIO ioctl interface before the Remote core is loaded,
    then the Remote core can access GPIOX_A, but only through GPIO_SETDATAOUT, GPIO_CLEARDATAOUT and GPIO_DATAIN

Not supported:

  • Linux and Remote core firmware can’t share a GPIO bank if both want to use GPIO IRQs.
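To illustrate option (b) above: on this 4.14 kernel, the “request an unused line before the Remote core loads” step can also be done from a small C program via the GPIO character-device interface instead of the sysfs export. This is just a sketch; the chip path and line offset are placeholders for whatever maps to the bank you need:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/gpio.h>

    /* Request one unused line from a bank so Linux keeps the bank powered,
       then keep the returned fd open for as long as the PRU firmware runs. */
    int hold_gpio_bank(const char *chip_path, unsigned int line_offset)
    {
        int chip_fd = open(chip_path, O_RDONLY);
        if (chip_fd < 0)
            return -1;

        struct gpiohandle_request req;
        memset(&req, 0, sizeof(req));
        req.lineoffsets[0] = line_offset;
        req.lines = 1;
        req.flags = GPIOHANDLE_REQUEST_INPUT;  /* an input request is enough to hold the bank */
        strncpy(req.consumer_label, "pru-bank-hold",
                sizeof(req.consumer_label) - 1);

        int ret = ioctl(chip_fd, GPIO_GET_LINEHANDLE_IOCTL, &req);
        close(chip_fd);                        /* the chip fd is no longer needed */
        return ret < 0 ? -1 : req.fd;          /* closing req.fd releases the line again */
    }

Which /dev/gpiochipN corresponds to which bank varies, so it’s worth checking with gpiodetect (from libgpiod) before picking a line offset.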