Recovery from a Soft Brick

I was building gateware locally, using the shipping capes as a starting point, and I seem to have stumbled across an invalid configuration that locks up the boot process.

My serial debug output prints the following before stopping:

[    2.786184] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 249)
[    2.794420] io scheduler mq-deadline registered
[    2.801748] GPIO line 510 (sd_card_cs): no hogging state specified, bailing out
[    2.810689] gpio-494 (ADC_IRQn): hogged as input
[    2.815841] gpio-497 (USB_OCn): hogged as input
[    2.822486] GPIO line 472 (vio_enable): no hogging state specified, bailing out
[    2.830578] gpio-473 (SD_DET): hogged as input
[    2.836410] gpio gpiochip3: (41200000.gpio): not an immutable chip, please consider fixing it!
[    2.846522] gpio gpiochip4: (41100000.gpio): not an immutable chip, please consider fixing it!
[    2.856049] gpio gpiochip4: (41100000.gpio): detected irqchip that is shared with multiple gpiochips: please fix the driver.
[    2.870313] microchip-pcie 3000000000.pcie: host bridge /fabric-pcie-bus@3000000000/pcie@3000000000 ranges:
[    2.881130] microchip-pcie 3000000000.pcie:      MEM 0x3009000000..0x3017ffffff -> 0x0009000000
[    2.890832] microchip-pcie 3000000000.pcie:       IO 0x3008000000..0x3008ffffff -> 0x0008000000
[    2.900467] microchip-pcie 3000000000.pcie:      MEM 0x3018000000..0x3087ffffff -> 0x0018000000
[    2.910114] microchip-pcie 3000000000.pcie:   IB MEM 0x0080000000..0x0083ffffff -> 0x0080000000
[    2.919757] microchip-pcie 3000000000.pcie:   IB MEM 0x00c4000000..0x00c9ffffff -> 0x0084000000
[    2.929413] microchip-pcie 3000000000.pcie:   IB MEM 0x008a000000..0x0091ffffff -> 0x008a000000
[    2.939054] microchip-pcie 3000000000.pcie:   IB MEM 0x1412000000..0x1421ffffff -> 0x0092000000
[    2.948670] microchip-pcie 3000000000.pcie:   IB MEM 0x1022000000..0x107fffffff -> 0x00a2000000

Then after about 5 minutes I get the following output every 60 seconds or so:

[  337.681383] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  337.687908] rcu: 	0-...0: (14 ticks this GP) idle=3e3c/1/0x4000000000000000 softirq=38/39 fqs=2625
[  337.697781] 	(detected by 2, t=5255 jiffies, g=-1151, q=2 ncpus=4)
[  337.704586] Task dump for CPU 0:
[  337.708127] task:swapper/0       state:R  running task     stack:0     pid:1     ppid:0      flags:0x00000008
[  337.719046] Call Trace:
[  337.721729] [<ffffffff80a67ba0>] __schedule+0x27c/0x834

I have tried to use the DirectC JTAG programmer without success. I am using the shipping Gateware image BVF-0.4.0-27-g7078de9. I issue the programming command as documented in the DirectC repository.
I have tried programming in both the HSS prompt stage and after the boot process appears to hang.
In both cases I get this error code:

Identifying device...
Looking for MPF device...
ActID = 0 ExpID = F8531CF
ERROR_CODE: 8004
Error return code  6
Elapsed time = 00:00:00 Done.

I double checked my wiring and it looks correct.

So my questions are:
Can I use the shipping *.dat files or do I need to compile my own as the DirectC documentation suggests?
When should I perform the programming action? During the HSS stage?
Do I need to toggle the eMMC multiplexer to USB mode or anything like that?
Are there DirectC commands I can use to verify my wiring? I have tried “device_info” and “read_idcode” but I get the same error.
Is signal integrity a significant concern? I am using a Raspberry Pi 5 and have to use 3 inch jumpers from the Pi to the IDC header end of the TC2050-IDC.

Thank you in advance!

I think what is happening here is that the PCIe block is somehow not included in your gateware, but the device tree overlay still tells Linux to look for it. That causes a bus transaction to the FPGA fabric that locks up.
I did this to myself a couple of times.
The way I recovered from it was to replace U-Boot with a version that did not merge the gateware's device tree overlay before passing the device tree to the Linux kernel. This is a bit of a nuclear option, but it can be done with just the USB-C cable to the board and the HSS.

I’m sure there is a more intelligent approach with the software stack we have now. My guess is to use the HSS usbdmsc command to make the board show up as a USB mass storage device. This should let you retrieve the dtb and boot.scr. I think the key is to modify boot.scr to remove the step that applies the device tree overlay. @RobertCNelson what do you think? Does that make sense?

Yes, temporarily comment the following line in boot.scr:
run design_overlays;

The fun part: boot.scr is built with U-Boot's mkimage. The script source is:

setenv fdt_high 0xffffffffffffffff
setenv initrd_high 0xffffffffffffffff

load mmc 0:${distro_bootpart} ${scriptaddr} beaglev_fire.itb;
bootm start ${scriptaddr}#kernel_dtb;
bootm loados ${scriptaddr};
# Try to load a ramdisk if available inside fitImage
bootm ramdisk;
bootm prep;
fdt set /soc/ethernet@20112000 mac-address ${icicle_mac_addr0};
fdt set /soc/ethernet@20110000 mac-address ${icicle_mac_addr1};
run design_overlays;
bootm go;

You can cheat: stop it at the U-Boot console and manually copy its instructions, skipping run design_overlays;
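
For anyone following along, here is a rough Python sketch of that manual approach. It takes the script source quoted above and lists the commands to type at the U-Boot prompt, dropping comments, blank lines, and the overlay step. The helper name manual_commands and its skip parameter are made up for illustration; nothing like this ships with U-Boot or the board image.

```python
# Script source as quoted in this thread (copied verbatim).
BOOT_SCRIPT = """\
setenv fdt_high 0xffffffffffffffff
setenv initrd_high 0xffffffffffffffff

load mmc 0:${distro_bootpart} ${scriptaddr} beaglev_fire.itb;
bootm start ${scriptaddr}#kernel_dtb;
bootm loados ${scriptaddr};
# Try to load a ramdisk if available inside fitImage
bootm ramdisk;
bootm prep;
fdt set /soc/ethernet@20112000 mac-address ${icicle_mac_addr0};
fdt set /soc/ethernet@20110000 mac-address ${icicle_mac_addr1};
run design_overlays;
bootm go;
"""


def manual_commands(script, skip=("run design_overlays",)):
    """Hypothetical helper: return the script's commands in order,
    without blank lines, comment lines, or any command listed in skip."""
    commands = []
    for line in script.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # nothing to type for blanks and comments
        if line.rstrip(";").strip() in skip:
            continue  # leave out the overlay merge
        commands.append(line)
    return commands


if __name__ == "__main__":
    # Print the commands one per line, ready to enter at the prompt.
    for cmd in manual_commands(BOOT_SCRIPT):
        print(cmd)
```

The order matters: the remaining commands are entered exactly as they appear in the script, ending with bootm go;.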

These tips worked! I am up and running again.

Notes:
Commenting a line in the boot.scr file triggers a CRC check failure at boot.

## Executing script at 8e000000
Bad data crc
SCRIPT FAILED: continuing...
... lines removed for clarity ...
RISC-V #

This turned out to be convenient because I could directly enter the scripted commands line by line at the RISC-V # prompt. The script commands aren’t valid in HSS and I wasn’t sure how to exit HSS and get back into u-boot.

After I executed the script commands as Robert suggested, the boot proceeded normally and I was able to flash shipping Gateware back onto the FPGA. Then I un-commented the line in boot.scr so it would pass the CRC check again.
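
A note on why the edit trips the check, as a minimal sketch. This assumes boot.scr is a legacy mkimage image: a 64-byte big-endian header that stores one CRC-32 over the header itself and another over the payload. Editing the script text in place changes the payload but not the stored value, which would explain "Bad data crc". The function names and the demo payload below are made up for illustration, and the demo image is built in memory rather than read from a real boot.scr.

```python
import struct
import zlib

# Legacy image header: 7 big-endian uint32 fields (magic, header CRC, time,
# data size, load address, entry point, data CRC), 4 single-byte fields
# (os, arch, type, compression), then a 32-byte name.
HEADER_FMT = ">7I4B32s"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 64 bytes
IH_MAGIC = 0x27051956


def build_demo_image(payload):
    """Hypothetical helper: wrap payload in a header with valid CRCs."""
    header = struct.pack(HEADER_FMT, IH_MAGIC, 0, 0, len(payload), 0, 0,
                         zlib.crc32(payload), 0, 0, 0, 0, b"demo")
    # The header CRC is computed with its own field set to zero.
    hcrc = zlib.crc32(header)
    return header[:4] + struct.pack(">I", hcrc) + header[8:] + payload


def check_legacy_image(blob):
    """Recompute both CRCs and compare them with the stored values."""
    fields = struct.unpack(HEADER_FMT, blob[:HEADER_SIZE])
    magic, hcrc, size, dcrc = fields[0], fields[1], fields[3], fields[6]
    if magic != IH_MAGIC:
        raise ValueError("not a legacy mkimage image")
    zeroed = blob[:4] + b"\x00" * 4 + blob[8:HEADER_SIZE]
    data = blob[HEADER_SIZE:HEADER_SIZE + size]
    return {
        "header_ok": zlib.crc32(zeroed) == hcrc,
        "data_ok": zlib.crc32(data) == dcrc,
    }


if __name__ == "__main__":
    image = build_demo_image(b"run design_overlays;\nbootm go;\n")
    print(check_legacy_image(image))
    # Simulate a same-length, in-place edit of the script text.
    edited = image.replace(b"run design", b"#un design")
    print(check_legacy_image(edited))
```

Rebuilding the image with mkimage from the edited script source, instead of editing boot.scr in place, should produce matching CRCs.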

Thank you both for your help!