On both B and C revisions of Beaglebone black, our application (running under either the shipped Angstrom or Debian distros) locks up after random amounts of time. The debug port reports the following problems:
[ 752.038977] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:15]
[ 752.063631] BUG: scheduling while atomic: kworker/0:1/15/0x40010100
[ 754.935632] Kernel panic - not syncing: softlockup: hung tasks
[ 754.946347] [] (unwind_backtrace+0x1/0x8a) from [] (panic
+0x51/0x148)
[ 754.960814] [] (panic+0x51/0x148) from [] (watchdog_timer
_fn+0xc3/0xf8)
[ 754.975585] [] (watchdog_timer_fn+0xc3/0xf8) from [] (__r
un_hrtimer+0xb3/0x140)
[ 754.991496] [] (__run_hrtimer+0xb3/0x140) from [] (hrtime
r_interrupt+0xd7/0x1c8)
[ 755.007625] [] (hrtimer_interrupt+0xd7/0x1c8) from [] (om
ap2_gp_timer_interrupt+0x11/0x1c)
[ 755.025164] [] (omap2_gp_timer_interrupt+0x11/0x1c) from [<c0067fa9
] (handle_irq_event_percpu+0x4d/0x174)
[ 755.043502] [] (handle_irq_event_percpu+0x4d/0x174) from [<c00680f9
] (handle_irq_event+0x29/0x3c)
[ 755.060694] [] (handle_irq_event+0x29/0x3c) from [] (hand
le_level_irq+0x6f/0x84)
[ 755.076699] [] (handle_level_irq+0x6f/0x84) from [] (gene
ric_handle_irq+0x13/0x1c)
[ 755.092998] [] (generic_handle_irq+0x13/0x1c) from [] (ha
ndle_IRQ+0x3b/0x5c)
[ 755.108422] [] (handle_IRQ+0x3b/0x5c) from [] (omap3_intc
_handle_irq+0x39/0x58)
[ 755.124243] [] (omap3_intc_handle_irq+0x39/0x58) from []
(__irq_svc+0x3b/0x5c)
[ 755.139692] Exception stack(0xdf117cf8 to 0xdf117d40)
[ 755.148646] 7ce0: 00000
000 c0849ac0
[ 755.162967] 7d00: 00000000 00000000 c07a2080 00000000 df116000 00000002 df117
db0 00000003
[ 755.177335] 7d20: c0812610 df116000 edb88320 df117d40 c0031241 c0030f4e 40000
133 ffffffff
[ 755.191758] [] (__irq_svc+0x3b/0x5c) from [] (__do_softir
q+0x46/0x174)
[ 755.206357] [] (__do_softirq+0x46/0x174) from [] (irq_exi
t+0x29/0x50)
[ 755.220773] [] (irq_exit+0x29/0x50) from [] (handle_IRQ+0
x3f/0x5c)
[ 755.234735] [] (handle_IRQ+0x3f/0x5c) from [] (omap3_intc
_handle_irq+0x39/0x58)
[ 755.250549] [] (omap3_intc_handle_irq+0x39/0x58) from []
(__irq_svc+0x3b/0x5c)
[ 755.265995] Exception stack(0xdf117db0 to 0xdf117df8)
[ 755.274974] 7da0: 00000002 00000000 00007
d00 00000000
[ 755.289328] 7dc0: c07c81d0 c07c81d0 c07c75dc 00007d02 0000007d 00000003 c0812
610 de5f4b40
[ 755.303615] 7de0: 00000100 df117df8 c0025b2d c0025bea 00000133 ffffffff
[ 755.315416] [] (__irq_svc+0x3b/0x5c) from [] (omap3_nonco
re_dpll_set_rate+0x1f2/0x330)
[ 755.332303] [] (omap3_noncore_dpll_set_rate+0x1f2/0x330) from [<c03
83273>] (clk_change_rate+0x1b/0x52)
[ 755.350016] [] (clk_change_rate+0x1b/0x52) from [] (clk_s
et_rate+0x51/0x72)
[ 755.365316] [] (clk_set_rate+0x51/0x72) from [] (cpu0_set
_target+0xf9/0x198)
[ 755.380787] [] (cpu0_set_target+0xf9/0x198) from [] (__cp
ufreq_driver_target+0x4d/0x70)
[ 755.397849] [] (_cpufreq_driver_target+0x4d/0x70) from [
] (dbs_check_cpu+0x123/0x134)
[ 755.414765] [] (dbs_check_cpu+0x123/0x134) from [] (od_db
s_timer+0x4d/0xb0)
[ 755.430064] [] (od_dbs_timer+0x4d/0xb0) from [] (process
one_work+0x1b5/0x2c0)
[ 755.445787] [] (process_one_work+0x1b5/0x2c0) from [] (wo
rker_thread+0x19b/0x258)
[ 755.461985] [] (worker_thread+0x19b/0x258) from [] (kthre
ad+0x67/0x74)
[ 755.476573] [] (kthread+0x67/0x74) from [] (ret_from_fork
+0x11/0x34)
[ 755.490617] drm_kms_helper: panic occurred, switching back to text console
[ 696.382258] CAUTION: musb: Babble Interrupt Occurred
[ 696.422949] CAUTION: musb: Babble Interrupt Occurred
[ 696.483459] gadget: high-speed config #1: Multifunction with RNDIS
[ 702.376300] musb_g_ep0_irq 710: SetupEnd came in a wrong ep0stage wait
[ 775.658309] CAUTION: musb: Babble Interrupt Occurred
The application performs a modest amount of IO using FTDI through a USB hub. The more USB devices we service, the faster the freeze appears, though it can happen soon after the code runs, or as much as 16 hours later.
This very much appears to be a kernel or driver issue. Can it be diagnosed, and is there a way of fixing it?
This is a watchdog non-maskable interrupt (NMI). The kernel needs to write a fixed pattern to a hardware register once in a while to prevent the interrupt. If it does not, the Watchdog Timer (WDT) will count down and an interrupt will automatically occur. It’s used as a last effort to recover from a cpu lockup. My guess from the stack trace is the CPU is handling so many interrupts from the USB drivers that the watchdog is not serviced. I’ve seen a similar issue on a IEEE80211/usb driver. Don’t expect high-performance computing from a $50 card.
Possible solutions (recognizing I know nothing about FTDI)
The best is to squelch the traffic flow, thus reducing the interrupts. Review the amount of traffic on the FTDI interface and see if some can be dropped or filtered before hitting the interface. Or program the FTDI chip to filter before sending through the USB interface. If still a problem, modify the driver to discard some types of FTDI packets.
Change to a more powerful CPU. The multicore TI offerings (e.g. Pandaboard has an dual-core Cortex-A9, the forthcoming BBB X15 has a dual-core A15 I think, TI Keystone line) should do the trick. But almost everything on that level is $$$$ compared to the BBB.
If you try to change the interrupt priorities or disable the WDT you’ll just be moving the problem to another logical area.
Wouldn't it make sense to raise the priority of the timer or whatever
code was hitting the WDT? It seems like a bug if usb or other
interrupts could cause the watchdog to fail to be serviced given that
the system is still working properly.
How is the wdt serviced by the kernel? Is it easy to raise its priority?
There’s a lot of information about the WDT under the kernel Documentation directory. I think it’s a software interrupt, probably difficult to raise it over hardware interrupts. If you don’t want the WDT, disable it via procfs or as a boot cmdline argument.
BUT, recognize if the WDT isn’t serviced then something bad is going to happen somewhere else. My guess is the kernel will panic with an ENOMEM or maybe it will silently stop. Certainly no user-space application is going to run if the CPU is saturated processing interrupts.
Its TP-Link TL-WN722N
Bus 001 Device 002: ID 0cf3:9271 Atheros Communications, Inc. AR9271 802.11n
This uses the ath9k_htc module and I had to install the firmware, firmware-atheros 0.43