BBB freezing

Julian_Gold · December 11, 2014, 12:35pm

On both B and C revisions of Beaglebone black, our application (running under either the shipped Angstrom or Debian distros) locks up after random amounts of time. The debug port reports the following problems:

[ 752.038977] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:15]
[ 752.063631] BUG: scheduling while atomic: kworker/0:1/15/0x40010100
[ 754.935632] Kernel panic - not syncing: softlockup: hung tasks
[ 754.946347] [] (unwind_backtrace+0x1/0x8a) from [] (panic
+0x51/0x148)
[ 754.960814] [] (panic+0x51/0x148) from [] (watchdog_timer
_fn+0xc3/0xf8)
[ 754.975585] [] (watchdog_timer_fn+0xc3/0xf8) from [] (__r
un_hrtimer+0xb3/0x140)
[ 754.991496] [] (__run_hrtimer+0xb3/0x140) from [] (hrtime
r_interrupt+0xd7/0x1c8)
[ 755.007625] [] (hrtimer_interrupt+0xd7/0x1c8) from [] (om
ap2_gp_timer_interrupt+0x11/0x1c)
[ 755.025164] [] (omap2_gp_timer_interrupt+0x11/0x1c) from [<c0067fa9

] (handle_irq_event_percpu+0x4d/0x174)
[ 755.043502] [] (handle_irq_event_percpu+0x4d/0x174) from [<c00680f9
] (handle_irq_event+0x29/0x3c)
[ 755.060694] [] (handle_irq_event+0x29/0x3c) from [] (hand
le_level_irq+0x6f/0x84)
[ 755.076699] [] (handle_level_irq+0x6f/0x84) from [] (gene
ric_handle_irq+0x13/0x1c)
[ 755.092998] [] (generic_handle_irq+0x13/0x1c) from [] (ha
ndle_IRQ+0x3b/0x5c)
[ 755.108422] [] (handle_IRQ+0x3b/0x5c) from [] (omap3_intc
_handle_irq+0x39/0x58)
[ 755.124243] [] (omap3_intc_handle_irq+0x39/0x58) from []
(__irq_svc+0x3b/0x5c)
[ 755.139692] Exception stack(0xdf117cf8 to 0xdf117d40)
[ 755.148646] 7ce0: 00000
000 c0849ac0
[ 755.162967] 7d00: 00000000 00000000 c07a2080 00000000 df116000 00000002 df117
db0 00000003
[ 755.177335] 7d20: c0812610 df116000 edb88320 df117d40 c0031241 c0030f4e 40000
133 ffffffff
[ 755.191758] [] (__irq_svc+0x3b/0x5c) from [] (__do_softir
q+0x46/0x174)
[ 755.206357] [] (__do_softirq+0x46/0x174) from [] (irq_exi
t+0x29/0x50)
[ 755.220773] [] (irq_exit+0x29/0x50) from [] (handle_IRQ+0
x3f/0x5c)
[ 755.234735] [] (handle_IRQ+0x3f/0x5c) from [] (omap3_intc
_handle_irq+0x39/0x58)
[ 755.250549] [] (omap3_intc_handle_irq+0x39/0x58) from []
(__irq_svc+0x3b/0x5c)
[ 755.265995] Exception stack(0xdf117db0 to 0xdf117df8)
[ 755.274974] 7da0: 00000002 00000000 00007
d00 00000000
[ 755.289328] 7dc0: c07c81d0 c07c81d0 c07c75dc 00007d02 0000007d 00000003 c0812
610 de5f4b40
[ 755.303615] 7de0: 00000100 df117df8 c0025b2d c0025bea 00000133 ffffffff
[ 755.315416] [] (__irq_svc+0x3b/0x5c) from [] (omap3_nonco
re_dpll_set_rate+0x1f2/0x330)
[ 755.332303] [] (omap3_noncore_dpll_set_rate+0x1f2/0x330) from [<c03
83273>] (clk_change_rate+0x1b/0x52)
[ 755.350016] [] (clk_change_rate+0x1b/0x52) from [] (clk_s
et_rate+0x51/0x72)
[ 755.365316] [] (clk_set_rate+0x51/0x72) from [] (cpu0_set
_target+0xf9/0x198)
[ 755.380787] [] (cpu0_set_target+0xf9/0x198) from [] (__cp
ufreq_driver_target+0x4d/0x70)
[ 755.397849] [] (_cpufreq_driver_target+0x4d/0x70) from [
] (dbs_check_cpu+0x123/0x134)
[ 755.414765] [] (dbs_check_cpu+0x123/0x134) from [] (od_db
s_timer+0x4d/0xb0)
[ 755.430064] [] (od_dbs_timer+0x4d/0xb0) from [] (process
one_work+0x1b5/0x2c0)
[ 755.445787] [] (process_one_work+0x1b5/0x2c0) from [] (wo
rker_thread+0x19b/0x258)
[ 755.461985] [] (worker_thread+0x19b/0x258) from [] (kthre
ad+0x67/0x74)
[ 755.476573] [] (kthread+0x67/0x74) from [] (ret_from_fork
+0x11/0x34)
[ 755.490617] drm_kms_helper: panic occurred, switching back to text console
[ 696.382258] CAUTION: musb: Babble Interrupt Occurred
[ 696.422949] CAUTION: musb: Babble Interrupt Occurred
[ 696.483459] gadget: high-speed config #1: Multifunction with RNDIS
[ 702.376300] musb_g_ep0_irq 710: SetupEnd came in a wrong ep0stage wait
[ 775.658309] CAUTION: musb: Babble Interrupt Occurred

The application performs a modest amount of IO using FTDI through a USB hub. The more USB devices we service, the faster the freeze appears, though it can happen soon after the code runs, or as much as 16 hours later.

This very much appears to be a kernel or driver issue. Can it be diagnosed, and is there a way of fixing it?

david_turvene · December 13, 2014, 3:29pm

This is a watchdog non-maskable interrupt (NMI). The kernel needs to write a fixed pattern to a hardware register once in a while to prevent the interrupt. If it does not, the Watchdog Timer (WDT) will count down and an interrupt will automatically occur. It’s used as a last effort to recover from a cpu lockup. My guess from the stack trace is the CPU is handling so many interrupts from the USB drivers that the watchdog is not serviced. I’ve seen a similar issue on a IEEE80211/usb driver. Don’t expect high-performance computing from a $50 card.

Possible solutions (recognizing I know nothing about FTDI)

The best is to squelch the traffic flow, thus reducing the interrupts. Review the amount of traffic on the FTDI interface and see if some can be dropped or filtered before hitting the interface. Or program the FTDI chip to filter before sending through the USB interface. If still a problem, modify the driver to discard some types of FTDI packets.
Change to a more powerful CPU. The multicore TI offerings (e.g. Pandaboard has an dual-core Cortex-A9, the forthcoming BBB X15 has a dual-core A15 I think, TI Keystone line) should do the trick. But almost everything on that level is $$$$ compared to the BBB.

If you try to change the interrupt priorities or disable the WDT you’ll just be moving the problem to another logical area.

Chris_Morgan · December 13, 2014, 3:46pm

Wouldn't it make sense to raise the priority of the timer or whatever
code was hitting the WDT? It seems like a bug if usb or other
interrupts could cause the watchdog to fail to be serviced given that
the system is still working properly.

How is the wdt serviced by the kernel? Is it easy to raise its priority?

Chris

david_turvene · December 13, 2014, 4:54pm

There’s a lot of information about the WDT under the kernel Documentation directory. I think it’s a software interrupt, probably difficult to raise it over hardware interrupts. If you don’t want the WDT, disable it via procfs or as a boot cmdline argument.

BUT, recognize if the WDT isn’t serviced then something bad is going to happen somewhere else. My guess is the kernel will panic with an ENOMEM or maybe it will silently stop. Certainly no user-space application is going to run if the CPU is saturated processing interrupts.

nivwiz · January 19, 2015, 7:29pm

I can confirm this on rcn-ee jessie debian 8.0 image with kernel 3.19.0-rc4-bone2 #1 Mon Jan 12 17:10:24 UTC 2015 armv7l GNU/Linux

my syslog shows this:

`
Jan 19 21:12:10 osbo2 kernel: Modules linked in: binfmt_misc ctr ccm usb_f_acm u_serial g_serial libcomposite arc4 ath9k_htc ath9k_common ath9k_hw ath mac80211 cfg80211 omap_sham omap_aes
Jan 19 21:12:10 osbo2 kernel: CPU: 0 PID: 9958 Comm: kworker/0:5 Tainted: G L 3.19.0-rc4-bone2 #1
Jan 19 21:12:10 osbo2 kernel: Hardware name: Generic AM33XX (Flattened Device Tree)
Jan 19 21:12:10 osbo2 kernel: Workqueue: events od_dbs_timer
Jan 19 21:12:10 osbo2 kernel: task: dd1ff980 ti: de2da000 task.ti: de2da000
Jan 19 21:12:10 osbo2 kernel: PC is at run_timer_softirq+0x126/0x19c
Jan 19 21:12:10 osbo2 kernel: LR is at internal_add_timer+0x25/0x50
Jan 19 21:12:10 osbo2 kernel: pc : [] lr : [] psr: 600f0133\x0asp : de2dbc78 ip : 00000000 fp : de5823a0
Jan 19 21:12:10 osbo2 kernel: r10: 00000000 r9 : 00000000 r8 : de2dbc90
Jan 19 21:12:10 osbo2 kernel: r7 : bf8cbf65 r6 : 00000002 r5 : dd52dc84 r4 : c0af8400
Jan 19 21:12:10 osbo2 kernel: r3 : 00000019 r2 : 00200200 r1 : dd52dc84 r0 : dd52dc84
Jan 19 21:12:10 osbo2 kernel: Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA Thumb Segment kernel
Jan 19 21:12:10 osbo2 kernel: Control: 50c5387d Table: 9d218019 DAC: 00000015
Jan 19 21:12:10 osbo2 kernel: CPU: 0 PID: 9958 Comm: kworker/0:5 Tainted: G L 3.19.0-rc4-bone2 #1
Jan 19 21:12:10 osbo2 kernel: Hardware name: Generic AM33XX (Flattened Device Tree)
Jan 19 21:12:10 osbo2 kernel: Workqueue: events od_dbs_timer
Jan 19 21:12:10 osbo2 kernel: [] (unwind_backtrace) from [] (show_stack+0x11/0x14)
Jan 19 21:12:10 osbo2 kernel: [] (show_stack) from [] (watchdog_timer_fn+0x105/0x12c)
Jan 19 21:12:10 osbo2 kernel: [] (watchdog_timer_fn) from [] (__run_hrtimer+0x45/0x11c)
Jan 19 21:12:10 osbo2 kernel: [] (__run_hrtimer) from [] (hrtimer_interrupt+0xc1/0x20c)
Jan 19 21:12:10 osbo2 kernel: [] (hrtimer_interrupt) from [] (omap2_gp_timer_interrupt+0x23/0x28)
Jan 19 21:12:10 osbo2 kernel: [] (omap2_gp_timer_interrupt) from [] (handle_irq_event_percpu+0x67/0x12c)
Jan 19 21:12:10 osbo2 kernel: [] (handle_irq_event_percpu) from [] (handle_irq_event+0x21/0x2c)
Jan 19 21:12:10 osbo2 kernel: [] (handle_irq_event) from [] (handle_level_irq+0x55/0x90)
Jan 19 21:12:10 osbo2 kernel: [] (handle_level_irq) from [] (generic_handle_irq+0x23/0x2c)
Jan 19 21:12:10 osbo2 kernel: [] (generic_handle_irq) from [] (__handle_domain_irq+0x3b/0x80)
Jan 19 21:12:10 osbo2 kernel: [] (__handle_domain_irq) from [] (omap_intc_handle_irq+0x85/0x8c)
Jan 19 21:12:10 osbo2 kernel: [] (omap_intc_handle_irq) from [] (__irq_svc+0x3b/0x5c)
Jan 19 21:12:10 osbo2 kernel: Exception stack(0xde2dbc30 to 0xde2dbc78)
Jan 19 21:12:10 osbo2 kernel: bc20: dd52dc84 dd52dc84 00200200 00000019
Jan 19 21:12:10 osbo2 kernel: bc40: c0af8400 dd52dc84 00000002 bf8cbf65 de2dbc90 00000000 00000000 de5823a0
Jan 19 21:12:10 osbo2 kernel: bc60: 00000000 de2dbc78 c0062359 c00633fe 600f0133 ffffffff
Jan 19 21:12:10 osbo2 kernel: [] (__irq_svc) from [] (run_timer_softirq+0x126/0x19c)
Jan 19 21:12:10 osbo2 kernel: [] (run_timer_softirq) from [] (__do_softirq+0xb1/0x1c0)
Jan 19 21:12:10 osbo2 kernel: [] (__do_softirq) from [] (irq_exit+0x79/0xb0)
Jan 19 21:12:10 osbo2 kernel: [] (irq_exit) from [] (__handle_domain_irq+0x3f/0x80)
Jan 19 21:12:10 osbo2 kernel: [] (__handle_domain_irq) from [] (omap_intc_handle_irq+0x85/0x8c)
Jan 19 21:12:10 osbo2 kernel: [] (omap_intc_handle_irq) from [] (__irq_svc+0x3b/0x5c)
Jan 19 21:12:10 osbo2 kernel: Exception stack(0xde2dbd40 to 0xde2dbd88)
Jan 19 21:12:10 osbo2 kernel: bd40: de0360c0 00000000 ffffffff 00000000 de016880 00000000 de034980 00000000
Jan 19 21:12:10 osbo2 kernel: bd60: de016880 c0a5da5c 000f4240 11e1a300 00000000 de2dbd88 c002a721 c002a740
Jan 19 21:12:10 osbo2 kernel: bd80: 400f0033 ffffffff
Jan 19 21:12:10 osbo2 kernel: [] (__irq_svc) from [] (_omap3_wait_dpll_status+0x38/0xe4)
Jan 19 21:12:10 osbo2 kernel: [] (_omap3_wait_dpll_status) from [] (_omap3_noncore_dpll_bypass+0x3b/0x70)
Jan 19 21:12:10 osbo2 kernel: [] (_omap3_noncore_dpll_bypass) from [] (omap3_noncore_dpll_set_rate+0x63/0x3b4)
Jan 19 21:12:10 osbo2 kernel: [] (omap3_noncore_dpll_set_rate) from [] (clk_change_rate+0xa9/0xe4)
Jan 19 21:12:10 osbo2 kernel: [] (clk_change_rate) from [] (clk_set_rate+0x91/0x98)
Jan 19 21:12:10 osbo2 kernel: [] (clk_set_rate) from [] (set_target+0xf5/0x2b8)
Jan 19 21:12:10 osbo2 kernel: [] (set_target) from [] (__cpufreq_driver_target+0x133/0x25c)
Jan 19 21:12:10 osbo2 kernel: [] (__cpufreq_driver_target) from [] (dbs_check_cpu+0xaf/0xfc)
Jan 19 21:12:10 osbo2 kernel: [] (dbs_check_cpu) from [] (od_dbs_timer+0x53/0xb4)
Jan 19 21:12:10 osbo2 kernel: [] (od_dbs_timer) from [] (process_one_work+0xb9/0x278)
Jan 19 21:12:10 osbo2 kernel: [] (process_one_work) from [] (worker_thread+0xdb/0x37c)
Jan 19 21:12:10 osbo2 kernel: [] (worker_thread) from [] (kthread+0x9b/0xb0)
Jan 19 21:12:10 osbo2 kernel: [] (kthread) from [] (ret_from_fork+0x11/0x38)

`

every 28 sec !

RobertCNelson · January 20, 2015, 5:50pm

It's either fixed in rc5:

debian@arm:~$ uname -r
3.19.0-rc5-bone2

Or i'm missing something... what hardware do you have plugged in?

Regards,

nivwiz · January 20, 2015, 6:29pm

Its TP-Link TL-WN722N
Bus 001 Device 002: ID 0cf3:9271 Atheros Communications, Inc. AR9271 802.11n
This uses the ath9k_htc module and I had to install the firmware, firmware-atheros 0.43

alejom.tv · February 13, 2015, 8:45pm

I had this same problem.

I had to disable the “cpu-ondemand” feature

rm /etc/systemd/system/multi-user.target.wants/cpu-ondemand.timer

Now it works