Kernel 5.10 device-tree overlay for using DMA from BBB PRU

Josh_MacDonald · August 17, 2023, 2:58pm

Hello!

I am working on updating the PRU/DMA example posted here to use kernel 5.10. That example was last seen working on kernel 4.19, but I was having stability issues, and now I think I understand why.

A lot changed with remoteproc between 4.19 and 5.10. One of the best clues I found was from the TI SDK, with a note about how to track the changes in 5.10.

Roughly, in 4.19 the remoteproc resource table contained an interrupt-controller resource entry, which listed all the incoming and outgoing interrupts that the PRUSS would send or receive. In 5.10, only the incoming interrupts are configured through a static table, which is a separate table of type pru_irq_rsc, and the PRU firmware is responsible for configuring its own outgoing interrupts.

To configure the outgoing interrupt, I followed the instructions here, binding system event 18 to channel 9 (a.k.a. host7, i.e. host_intr7, the last of the eight outgoing interrupts):

  CT_INTC.EISR_bit.EN_SET_IDX = SYSEVT_PRU_TO_EDMA; // sysevt 18 / channel 9
  CT_INTC.CMR4_bit.CH_MAP_18 = HOST_INTERRUPT_CHANNEL_PRU_TO_EDMA;
  CT_INTC.HMR2_bit.HINT_MAP_9 = HOST_INTERRUPT_CHANNEL_PRU_TO_EDMA;
  CT_INTC.HIEISR_bit.HINT_EN_SET_IDX = HOST_INTERRUPT_CHANNEL_PRU_TO_EDMA;

Also, the PRU has to use Shadow Region 1. This is the only shadow region wired to the PRU. The ARM core uses Shadow Region 0 (and the TRM makes ominous notes about not mixing the use of global and shadow regions–the prior example used the global region).

The following things have to change in the device tree for this configuration to work:

- The ARM core must avoid DMA channels 0 and 1, as these are hard-wired to the PRU interrupt channels 0 and 1 (i.e., these are the only channels for which the PRU can receive completion interrupts).
- The DMA controller needs several PaRAM slots reserved for the PRU too.
- To use the EDMA controller from the PRU, interrupt channel 9 has to be assigned to the PRU by telling the ARM core to ignore it (this is where my instability came from).

I was able to fiddle my way through this by checking out ti-linux-kernel-dev branch ti-linux-5.10.y, then editing the device tree as follows:


From 6f20fa5818ea7f51eba874fba4da3ce473e83e8e Mon Sep 17 00:00:00 2001
From: Josh MacDonald <jmacd@lightstep.com>
Date: Wed, 16 Aug 2023 21:35:41 -0700
Subject: [PATCH 2/2] Device tree changes for using EDMA from PRUSS

---
 arch/arm/boot/dts/am33xx-l4.dtsi | 8 ++++++--
 arch/arm/boot/dts/am33xx.dtsi    | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/am33xx-l4.dtsi b/arch/arm/boot/dts/am33xx-l4.dtsi
index 2d16b2d9c86b..f0ad10735ec1 100644
--- a/arch/arm/boot/dts/am33xx-l4.dtsi
+++ b/arch/arm/boot/dts/am33xx-l4.dtsi
@@ -846,6 +846,9 @@ pruss: pruss@0 {
 				#address-cells = <1>;
 				#size-cells = <1>;
 				ranges;
+                                /* 0 and 1 are fixed for PRU-initiated DMAs, see TRM ... */
+                                dmas = <&edma 0 2>, <&edma 1 2>;
+                                dma-names = "prucpy0", "prucpy1";
 
 				pruss_mem: memories@0 {
 					reg = <0x0 0x2000>,
@@ -902,13 +905,14 @@ pruss_mii_rt: mii-rt@32000 {
 				pruss_intc: interrupt-controller@20000 {
 					compatible = "ti,pruss-intc";
 					reg = <0x20000 0x2000>;
-					interrupts = <20 21 22 23 24 25 26 27>;
+					interrupts = <20 21 22 23 24 25 26>;
 					interrupt-names = "host_intr0", "host_intr1",
 							  "host_intr2", "host_intr3",
 							  "host_intr4", "host_intr5",
-							  "host_intr6", "host_intr7";
+							  "host_intr6";
 					interrupt-controller;
 					#interrupt-cells = <3>;
+					ti,irqs-reserved = /bits/ 8 <0x80>; /* BIT(7) */
 				};
 
 				pru0: pru@34000 {
diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
index 7f3ff48eb277..3acc11651faa 100644
--- a/arch/arm/boot/dts/am33xx.dtsi
+++ b/arch/arm/boot/dts/am33xx.dtsi
@@ -232,6 +232,7 @@ edma: dma@0 {
 					   <&edma_tptc2 0>;
 
 				ti,edma-memcpy-channels = <20 21>;
+				ti,edma-reserved-slot-ranges = <0 4>;
 			};
 		};
 
-- 
2.30.2

It worked! The most confusing part is disabling the interrupt, which takes three lines:

- remove interrupt 27 from interrupts
- remove host_intr7 from interrupt-names
- mask bit 7 with ti,irqs-reserved

I’ve tried to translate this into an overlay. Here’s the actual file I tried with: https://github.com/jmacd/nerve/blob/jmacd/lumberjack2023/pru/BB-PRU-DMA.dts

// SPDX-License-Identifier: Apache-2.0
/*
 * Copyright (C) Josh MacDonald
 */

/dts-v1/;
/plugin/;

&{/chosen} {
        overlays {
                PRUDMA.kernel = __TIMESTAMP__;
        };
};

/* in am33xx-l4.dtsi */
&pruss_tm {
	pruss {
                /* 0 and 1 are fixed for PRU-initiated DMAs, see TRM ... */
		dmas = <&edma 0 2>, <&edma 1 2>;
		dma-names = "prucpy0", "prucpy1";

	        pruss_intc {
			/* This is host_intr7 i.e., 8th bit to say that this
			* outgoing irq is not meant for the ARM core. */
			ti,irqs-reserved = /bits/ 8 <0x80>; /* BIT(7) */

                        interrupt-names = "host_intr0", "host_intr1",
                                         "host_intr2", "host_intr3",
                                         "host_intr4", "host_intr5",
                                         "host_intr6";
			interrupts = <20 21 22 23 24 25 26>;
	        };
	};
};

/* in am33xx.dtsi */
&edma {
	ti,edma-reserved-slot-ranges = <0 4>;
};

This looks like the same set of device-tree changes, but something’s not right.

I built the kernel w/ overlays (edited the Makefile, which lists each dtbo in the arch/arm/boot/dts/overlays directory), installed it, added the overlay to uEnv.txt, and booted, and I see the overlay was loaded. But the application doesn’t work, and now I’m looking for tips and tricks for debugging overlays.

Since the overlay is applied before the kernel starts, I wonder if there are any ways to get debugging messages from that moment in time, maybe? Are there other ways I can confirm that the device-tree overlay is doing what I think and/or whether it’s simply filling in nodes in the wrong location?

Thanks,
Josh

gomer · August 18, 2023, 11:04am

perhaps your overlay is fine… why do you suspect it?

I’ve progressed from complete ignorance of the ‘HUB75’ protocol to mere incompetence since yesterday. I haven’t found any timing specifications for the IO manipulation, do you have a reference?

// toggleClock raises and lowers the HUB75 clock signal.
void toggleClock() {
  clock(HI);
  clock(LO);
}

above from pru.c … my ‘C’ skills are meh, but with my projects, it has been necessary to have timing code that controls how much time to leave the clock HI for it to register the data (into the shift register) correctly. The way that you are doing it seems too simple, and I’d suspect this, before debugging the overlay.

knowing this timing info is fundamental to calculating how fast, and thus how many LEDs, the PRU can drive to meet the 100 to 200 Hz requirement.

I didn’t spend enough time on your code to determine whether I’d suspect the timing for the LATCH, nor do I understand well enough how ‘C’ PRU code manipulates the GPIO through the OCP.

I’d look into the more deterministic ‘fast’ IO available to the PRU. If you’d like to study a practical example, see Turnkey PRU deskclock application for BBB … the IO manipulation is in PRU ASM not ‘C’, and the DMA is limited to a simple transfer of 32bit words from a (linux) ‘C’ program into PRU data space.

any detailed documentation on these devices would be welcome.

gomer

Josh_MacDonald · August 21, 2023, 3:24pm

Thank you for the input.

I haven’t been able to find a good reference manual for HUB75 panels. From reading through what I could find (which are both LEDscape and the PRU Cookbook example), the answer seems to be “see what works”. I am only connecting two of the eight panels at this time, because I know there are timing issues when I connect more panels at once, but I’d like to have two panels working reliably before I go that direction. (There are improvements in the DMA logic I’ll pursue first, as well – the way this code is structured every 4th line is slightly brighter than the other 3 because of the additional time spent handling interrupts and restarting a DMA request.)

I’m fairly sure, in this case, that the parts about lighting up the HUB75 panel are working correctly, and I’ve been using visual debugging techniques–all of this was developed and working (unreliably) on a 4.19 kernel. For example, the flash method reliably conveys 3 bits of color information and I’ve been using the various 3-bit color sequences to diagnose the problem.

I’m thinking that the problem lies with the overlay (described above) because when I patch the kernel directly, the PRU firmware is able to send and receive DMA interrupts. When I apply what looks like the same device-tree change via the overlay, it appears not to work. I attached a serial console on header J1 to a different BBB booting the same kernel (the cape blocks J1) with the overlay, and I didn’t see any error messages from U-Boot.

When I boot a stock 5.10 kernel and run the same firmware, the PRU observes a channel 1 interrupt, which I have mapped to all the non-success-DMA-completion interrupts, so channel 1 receives tpcc_errint_pend_po1 (event 62), tptc_errint_pend_po1 (event 61), and the RPMsg interrupt from the ARM core. In my wait_dma handler, I observe the channel 1 interrupt and check which system event triggered it. Only, there’s a race and I can’t detect any of the conditions.

I enabled CONFIG_DYNAMIC_DEBUG and repeated the process. The ARM core is definitely receiving the CCERR interrupt which I’m expecting to receive on the PRU. The PRU firmware is stuck in the wait_dma loop trying to clear an interrupt, but the ARM is clearing it faster than the PRU can act. The PRU flashes cyan repeating this process until it gets lucky. Occasionally, it detects a CCERR and parks itself with the display green, but every time this happens the ARM has also crashed. I’m guessing that the ARM crashes first, such that the PRU can actually win the race.

Well, I don’t really need the overlay. The kernel patch does the trick! Can anyone think of a reason why patching the device-tree via an overlay would build a kernel that operates differently from the one w/ the device tree built in? Are there more techniques for debugging U-boot that would uncover an answer?

Thanks,
Josh

Josh_MacDonald · August 21, 2023, 3:29pm

What’s especially odd is that I print the device-tree in both scenarios, with the kernel patched and with the overlay, and it looks like the overlay was applied, meaning I see the expected annotations (i.e., ti,irqs-reserved, ti,edma-reserved-slot-ranges, the dma names prucpy0, prucpy1, and the interrupts and interrupt-names properties).

gomer · August 21, 2023, 7:27pm

// These are word-size offsets from the GPIO register base address.
#define GPIO_CLEARDATAOUT (0x190 / WORDSZ) // For clearing the GPIO registers
#define GPIO_SETDATAOUT (0x194 / WORDSZ) // For setting the GPIO registers
#define GPIO_DATAOUT (0x13C / WORDSZ) // For setting the GPIO registers

I don’t think this is correct. It looks to me that you are dividing 0x194 (404 base 10) by 32 which would be 12.625 (base 10) … this is neither a word boundary nor even a byte boundary… similarly 0x190 (400 base 10) divided by 32 is 12.5 … similarly 0x13C (316 base 10) divided by 32 is 9.875.

I think that without the division, the code might be correct, or less wrong.

I did look at more of your code that I thought was wrong, but I was mistaken (I think).

let me ask you whether your objective is to get your panel working, or is it to demonstrate an application with complex interrupts and DMA? Either is a worthy objective, but it is unclear to me what is being sent by rpmsg, and why.

I had to look up some old OCP code that I wrote long ago… I think that your GPIO code might be correct with the elimination of the division mentioned above assuming that all the wiring is correct … do you have a map (preferably in code) outlining the GPIO code to the P8 and P9 headers?

here is how I did it in pasm…:

RESET_HI:
mypin GPIO1_hi, P8_12_bitmap
RET

RESET_LOW:
mypin GPIO1_low, P8_12_bitmap
RET

gpio.h:#define P8_12_bitmap 0x00001000
gpio.h:#define GPIO1_hi 0x4804C194
gpio.h:#define GPIO1_low 0x4804C190

.macro mypin
.mparam mode, bitmap
MOV work1, mode
MOV work2, bitmap
SBBO work2, work1, 0, 4
.endm

I think that it is helpful to link the application code to the headers, just a suggestion.

gomer

Josh_MacDonald · August 21, 2023, 7:32pm

WORDSZ is 4 bytes

The TI pru-software-support-package examples use this approach.

gomer · August 21, 2023, 7:37pm

right, 32bits … 4 bytes … duh … thanks. I was wrong about that, and 316, 400 and 404 are all divisible by 4…

so … the indexing from the GPIO address is by words… I see.

sorry I wasn’t any help

Daniel_Kulp · August 25, 2023, 5:45pm

Have you looked into FPP? FPP on the Beaglebones can drive HUB75 panels fairly well. It can handle around 20 P5 panels, around 80-90 P10 panels. I’m not sure what you’re trying to do, but the code there might be something to look at.