kernel panic - has anyone seen something similar?

Me_Nee · November 13, 2014, 5:41pm

Background: my company is using BeagleBone Blacks in an industrial CAN bus monitoring application. Our Linux guy quit in June, leaving me to support this thing. Prior to June I had pretty much zero Linux experience. Our kernel has two custom modifications. One was provided by Tower Tech (an Italian manufacturer of BeagleBone CAN capes) to add CAN support. The other was done by the guy who quit: a kernel change to support high-speed serial clocks, as there was a bug in the original kernel code when setting serial rates greater than 230,400 baud (if I remember correctly). The base kernel we’re working off of is 2013-09-04, which from what I can gather is the latest official Angstrom release.

We’re using the BeagleBone in two different external hardware configurations. One is a “straight to AM3359 CAN” interface, and the other goes through a CAN-to-serial interface that is external to the BeagleBone. The CAN-to-serial incarnation works flawlessly. The straight-to-CAN version, we discovered, throws the same kernel panic over and over when CAN traffic gets high.

Our suite of programs that runs on the BeagleBone includes two networking-related daemons. One catches and reacts to network requests to configure our equipment, and the other is basically a UDP blaster that broadcasts data. I’ve found that disabling both of these daemons prevents the panic. If I disable only one of them (doesn’t matter which one), the panic still happens.

Anyway, the panic references net/core/dev.c line 3988 every time. Has anyone seen something similar, or can someone point me in the direction of a fix? The issue seems likely to come from the CAN-specific features we got from the Tower Tech kernel, since the panic does not occur when we use the indirect CAN-to-serial hardware. I’m hesitant to ask them, though, because I know that my ex-colleague had trouble communicating with them in the past.

Also note that the panic happens all the time. The dump below references our “can_mon” program, but the panic also happens when nothing is running other than our (custom) daemons.

The panic dump is below.

[ 101.133958] ------------[ cut here ]------------
[ 101.138802] kernel BUG at net/core/dev.c:3988!
[ 101.143440] Internal error: Oops - BUG: 0 [#1] SMP THUMB2
[ 101.149074] Modules linked in: can_raw can c_can_platform c_can can_dev iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables g_multi libcomposite rfcomm ircomm_tty ircomm irda hidp bluetooth rfkill
[ 101.171411] CPU: 0 Tainted: G W (3.8.13-bone53 #1)
[ 101.177689] PC is at __napi_complete+0x36/0x3c
[ 101.182334] LR is at napi_complete+0x1d/0x28
[ 101.186785] pc : [] lr : [] psr: 400000b3
[ 101.186785] sp : d902dd50 ip : 54638479 fp : 0000012c
[ 101.198775] r10: c0d12600 r9 : d902c000 r8 : c0d12608
[ 101.204216] r7 : 00000001 r6 : 00000010 r5 : 40000013 r4 : dcb7eddc
[ 101.211027] r3 : 00000000 r2 : 00000000 r1 : dcb7eddc r0 : dcb7eddc
[ 101.217833] Flags: nZcv IRQs off FIQs on Mode SVC_32 ISA Thumb Segment user
[ 101.225552] Control: 50c5387d Table: 9cacc019 DAC: 00000015
[ 101.231544] Process can_mon (pid: 450, stack limit = 0xd902c240)
[ 101.237812] Stack: (0xd902dd50 to 0xd902e000)
[ 101.242361] dd40: dcb7eddc 40000013 00000010 dcb7e800
[ 101.250900] dd60: dcb7ed40 bf8f476d dcb7eddc 00000010 df0c8010 df258010 c0d1176c c0288173
[ 101.259434] dd80: 00000000 dcb7eddc 00000001 00000010 c08020c0 c0d12608 d902c000 c0d12600
[ 101.267967] dda0: 0000012c c03fa65d 00000000 fffe7240 dc8c7440 00000001 00000003 0000000c
[ 101.276500] ddc0: c0802090 c080208c d902c000 c08858e4 00000037 c003466b df00c740 df007bc0
[ 101.285045] dde0: 00800000 00000100 0000000c 00000000 0000000a 00404100 df007c10 d902c000
[ 101.293582] de00: 00000037 00000000 d902de60 c08884bc fa2000d8 fa2000f8 00000037 c003493d
[ 101.302105] de20: c07fd728 c000cfdf fa200098 fa200040 fa2000b8 c00085d9 c04a22c9 c04a22cc
[ 101.310630] de40: 60000033 ffffffff d902de94 dce76800 00000000 60000013 dcec8409 c04a259b
[ 101.319164] de60: df258010 60000013 00000000 8a108a10 60000013 df2b9800 00000002 df258010
[ 101.327695] de80: dce76800 00000000 60000013 dcec8409 00000039 d902dea8 c04a22c9 c04a22cc
[ 101.336235] dea0: 60000033 ffffffff 00000000 c028015b c02800d9 00000003 dcec8407 dce76400
[ 101.344776] dec0: 14000000 df4e8240 c04d9028 dce76680 dce76800 c0270161 d902df80 dce76994
[ 101.353304] dee0: dcec8400 d902c000 dcdd4740 00000000 df50ce00 c004f3dd dce769a4 dce769a4
[ 101.361842] df00: fffffffb b6fb0000 0000000a dce76800 00000000 d902c000 df4e8240 0000000a
[ 101.370375] df20: c0270065 c026e285 0000000a dcdd4740 00000001 df4e8240 b6fb0000 0000000a
[ 101.378909] df40: d902c000 d902df80 0000000a 00000000 00000000 c00b8887 00000000 00000000
[ 101.387431] df60: beefe5fc 00000000 00000000 df4e8240 00000000 b6fb0000 0000000a c00b8a8d
[ 101.395967] df80: 00000000 00000000 d902c000 0000000a b6fb0000 b6dbfa80 00000004 c000c8e4
[ 101.404500] dfa0: d902c000 c000c741 0000000a b6fb0000 00000001 b6fb0000 0000000a 00000000
[ 101.413038] dfc0: 0000000a b6fb0000 b6dbfa80 00000004 0000000a 0000000a b6fb0000 00000000
[ 101.421560] dfe0: 00000000 beefe14c b6cfbb2c b6d4e30c 600f0010 00000001 00000000 00000000
[ 101.430104] [] (__napi_complete+0x36/0x3c) from [<00000010>] (0x10)
[ 101.437555] Code: 3108 bc30 f636 b815 (de02) de02
[ 101.442565] ---[ end trace 59f323e922c98490 ]---
[ 101.447382] Kernel panic - not syncing: Fatal exception in interrupt
[ 101.454008] drm_kms_helper: panic occurred, switching back to text console
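
For anyone comparing dumps like this one, a small script can pull out the headline fields: the BUG location, the function the PC was in, the process, and the kernel version. This is only a minimal Python sketch. The function name summarize_oops and the regular expressions are assumptions made for illustration, based on the lines shown in the dump above; they are not part of any kernel tool, and other kernels may format these lines differently.

```python
import re

# Patterns are assumptions derived from the dump above, not a kernel-defined format.
OOPS_PATTERNS = {
    "bug_at": r"kernel BUG at (\S+?):(\d+)!",      # (source file, line number)
    "pc": r"PC is at (\S+)",                       # symbol+offset/size
    "lr": r"LR is at (\S+)",
    "process": r"Process (\S+) \(pid: (\d+)",      # (command name, pid)
    "kernel": r"\((\d+\.\d+\.\d+\S*) #\d+\)",      # e.g. 3.8.13-bone53
}


def summarize_oops(text):
    """Return a dict of the headline fields found in an ARM oops dump."""
    summary = {}
    for key, pattern in OOPS_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            groups = match.groups()
            # Single-group patterns yield a string, multi-group ones a tuple.
            summary[key] = groups[0] if len(groups) == 1 else groups
    return summary


# A few lines copied from the dump above, used as sample input.
SAMPLE = """\
[ 101.138802] kernel BUG at net/core/dev.c:3988!
[ 101.171411] CPU: 0 Tainted: G W (3.8.13-bone53 #1)
[ 101.177689] PC is at __napi_complete+0x36/0x3c
[ 101.182334] LR is at napi_complete+0x1d/0x28
[ 101.231544] Process can_mon (pid: 450, stack limit = 0xd902c240)
"""

if __name__ == "__main__":
    for key, value in summarize_oops(SAMPLE).items():
        print(key, "=", value)
```

Running the same function over two dumps makes it quick to check whether they hit the same BUG line and the same function, even when the process named in the dump differs.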

beagler001 · November 20, 2014, 10:04pm

Hello Me Nee,

Yes. I am having the same problem. How are you doing on this? Have you figured anything out?

I too am using the CAN interface on the BeagleBone Black device.

Let me know, and I can update you with my findings if you still need help.

Me_Nee · November 20, 2014, 10:36pm

Haven’t figured out a software way around this yet. For now we’re avoiding the “direct” CAN interface to the BeagleBone and instead using our external custom hardware to relay serial “CAN” messages to the BeagleBone. We don’t have issues with this arrangement.

That said, if you have a fix I’d love to hear about it.

beagler001 · November 21, 2014, 3:09pm

No solution here yet, but I have found some very relevant discussions out there. Something must have changed in the kernel scheduler that requires drivers (CAN in our case) to be updated. I’ve copied Robert Nelson, the BeagleBone kernel support guru, on this post. Perhaps he is already aware of this problem and knows of a work-around. I will post something if/when I get this figured out. Until then, here are some relevant links.

http://stackoverflow.com/questions/3537252/how-to-solve-bug-scheduling-while-atomic-swapper-0x00000103-0-cpu0-in-ts
http://e2e.ti.com/support/omap/f/849/t/250383.aspx
https://community.freescale.com/thread/330079

Me_Nee · November 21, 2014, 7:47pm

Thanks, very informative links, particularly the last one.

beagler001 · November 26, 2014, 3:07pm

My kernel is no longer crashing. Unfortunately, I do not have the exact work-around, as I was messing around with a lot of stuff to try to get this to work.

The one consistent thing in all of this is the backtrace (in /var/log/kern.log), which shows the routine c_can_get_berr_counter doing something that it should not do. But getting someone who knows this code to take a look seems to be a huge challenge. If you would like to see the backtrace, let me know.

I do know that the resolution was one of two things:

1. Our CAN transceiver’s enable line was being shared with MMC1_DAT0 - which was brought out to P8-25 on an expansion header. I tried (various methods) to reset the eMMC in hopes of driving its pins to an open-drain state. I could never get that to work, and therefore I could never drive the CAN transceiver’s enable line low. I got around this by wiring P8-25 directly to ground. That could have been causing problems with either the CAN transceiver or the processor itself. Or it could be that leaving the CAN transceiver enabled through the boot process caused issues as well. Nonetheless, I clipped the P8-25 pin on our cape and wired from another GPIO line that was routed to the P8 expansion header (P8-17 I think). This allowed me to enable the CAN transceiver cleanly, post-boot.
2. I disabled some things in my kernel config.

beagler001 · November 26, 2014, 3:11pm

Note that there are plenty of other things in my kernel config. I only showed the differences between the original (when I would get kernel panic) and the modified (no kernel panic).

Me_Nee · November 26, 2014, 4:24pm

Complete newbie - where can I find these config parameters?

beagler001 · December 1, 2014, 3:02pm

Those config parameters are used for the kernel build. They are part of a huge collection of build options (the CONFIG_* symbols) that control which features and drivers get compiled into the kernel.

In your initial post, I noticed that you mentioned you were using a custom kernel; therefore, I assumed that you understood how to modify kernel config parameters and build the kernel.

Does that help?

Me_Nee · December 1, 2014, 4:05pm

The only reason I know we’re using a custom kernel is because our former Linux guy told me so. Never recompiled a kernel before but I have a cursory grasp of what’s involved.

My former coworker’s Linux laptop has a folder named “Robert C Nelson” that contains what seems to be the custom kernel mod to fix the UART speed issue. I’ll start poking around in there to see if I can figure it all out.

And yes, that does help. I do appreciate it, thanks for your patience.

beagler001 · December 1, 2014, 4:13pm

RCN’s kernel is the kernel source that I am using as well. If you change into that directory, you can run a rebuild script by typing “tools/rebuild.sh”. Invoking that script automatically pops up a window showing all the kernel config parameters. Given the number of parameters, finding the exact ones that match what I listed above is rather daunting. What I recommend is to view the default kernel config file and check whether you are using the same config as me (probably not). The default config file should be named “defconfig” and should be stored within the patches directory.
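
Comparing two config files by eye is tedious, so here is a minimal Python sketch of one way to list only the options that differ. It assumes the usual kernel config text format ("CONFIG_NAME=value" lines and "# CONFIG_NAME is not set" comments) and treats an option that is missing from a file as not set. The function names parse_kconfig and diff_kconfig are hypothetical, and the option values in the sample are made up for illustration; they are not the settings that were changed in this thread.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map each CONFIG_* symbol in a kernel config file to its value ('n' if unset)."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_kconfig(old_text, new_text):
    """Return {symbol: (old_value, new_value)} for every symbol whose value differs."""
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "n")   # assumption: absent means not set
        after = new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes


# Made-up sample contents, only to show the output shape.
OLD = "CONFIG_CAN=m\nCONFIG_PREEMPT=y\n# CONFIG_DEBUG_ATOMIC_SLEEP is not set\n"
NEW = "CONFIG_CAN=m\n# CONFIG_PREEMPT is not set\nCONFIG_DEBUG_ATOMIC_SLEEP=y\n"

if __name__ == "__main__":
    for name, (before, after) in diff_kconfig(OLD, NEW).items():
        print(f"{name}: {before} -> {after}")
```

To use it on real files, read each one with open(path).read() and pass the two strings to diff_kconfig; the paths depend on where your kernel tree keeps its defconfig and generated config.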

beagler001 · December 2, 2014, 10:02pm

Hello again…

Never mind any of the stuff I previously mentioned about changing the kernel config parameters. The problem is rooted in my original comment about the c_can driver. A patch exists that solves this problem. Unfortunately, it went into the mainline kernel later than the 3.8+ branch we are using on the BeagleBone Black, so the fix is not included in our kernel source. Take a look at this:

http://lists.openwall.net/netdev/2013/11/27/64

If you have acquainted yourself with building the kernel for BBB, I would suggest manually editing that c_can.c file with the changes shown in the link above, rebuilding, and re-installing. That should fix your problem. It did for me.

Good luck.

Jean-Pierre_Aulas · December 12, 2014, 1:35pm

Hello, thanks for your reply. Is there another way to apply this fix (simpler than rebuilding)?
Below is a trace showing another problem, with mysql:
(Linux BBB4 3.8.13-bone50 #1 SMP Tue May 13 13:24:52 UTC 2014 armv7l GNU/Linux)

debian@larnau:~$ [ 543.774398] BUG: scheduling while atomic: rs:main Q:Reg/653/0x40000100
[ 551.739092] BUG: scheduling while atomic: mysqld/1766/0x40000100
[ 582.732825] BUG: scheduling while atomic: mysqld/1766/0x40000100
[ 582.759827] ------------[ cut here ]------------
[ 582.764775] kernel BUG at net/core/dev.c:3988!
[ 582.769500] Internal error: Oops - BUG: 0 [#1] SMP THUMB2
[ 582.775236] Modules linked in: can_raw can c_can_platform c_can can_dev mt7601Usta(O)
[ 582.783670] CPU: 0 Tainted: G W O (3.8.13-bone50 #1)
[ 582.790084] PC is at __napi_complete+0x36/0x3c
[ 582.794820] LR is at napi_complete+0x1d/0x28
[ 582.799368] pc : [] lr : [] psr: 400000b3
[ 582.799368] sp : de51deb0 ip : 00000000 fp : c0d36608
[ 582.811577] r10: c08260c0 r9 : c0d36600 r8 : 0000012c
[ 582.817137] r7 : 00000010 r6 : 00000001 r5 : 40000013 r4 : de68d5dc
[ 582.824078] r3 : 00000000 r2 : 00000000 r1 : de68d5dc r0 : de68d5dc
[ 582.831019] Flags: nZcv IRQs off FIQs on Mode SVC_32 ISA Thumb Segment user
[ 582.838884] Control: 50c5387d Table: 9e664019 DAC: 00000015
[ 582.844998] Process mysqld (pid: 1766, stack limit = 0xde51c240)
[ 582.851381] Stack: (0xde51deb0 to 0xde51e000)
[ 582.856029] dea0: de68d5dc 40000013 00000010 de68d000
[ 582.864737] dec0: de68d540 bf8c0609 de68d5dc 00000010 fa200098 fa2000b8 00000001 ebbdebbc
[ 582.873442] dee0: de51c000 de68d5dc 00000001 de51c000 00000010 0000012c c0d36600 c08260c0
[ 582.882146] df00: c0d36608 c0423cd1 00000002 0002356e de51df10 de51df10 de51dfb0 00000001
[ 582.890853] df20: c082608c de51c000 00000043 00000003 b709e068 b709e230 0000000c c0034f0f
[ 582.899566] df40: 00000008 00000043 00000100 00000000 00000009 00400040 df008650 de51c000
[ 582.908276] df60: 00000000 00000043 00000043 de51dfb0 b709e068 b709e230 a676a328 c0035205
[ 582.916989] df80: c0821728 c000d0e3 fa200098 fa2000b8 fa2000d8 c00085a9 b6cbefbe 80000030
[ 582.925700] dfa0: ffffffff 00000020 000000c8 c04ceca9 258fc214 b4d59e20 b4d59ebe a676a328
[ 582.934409] dfc0: 00000031 a676a32c 36b96c00 00000020 000000c8 b709e068 b709e230 a676a328
[ 582.943115] dfe0: b6fcdfe4 a676a2e8 b6ba124b b6cbefbe 80000030 ffffffff 00000000 00000000
[ 582.951839] [] (__napi_complete+0x36/0x3c) from [<00000010>] (0x10)
[ 582.959435] Code: 3108 bc30 f62c bfe3 (de02) de02
[ 582.964540] ---[ end trace a72764883bbe4627 ]---
[ 582.969459] Kernel panic - not syncing: Fatal exception in interrupt

regards
Jean-Pierre Aulas
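
The last field in each “scheduling while atomic” line above is the kernel's preempt count, printed in hex after the task name and pid. A rough Python sketch of decoding it follows. The bit layout used here (preempt depth in bits 0-7, softirq count in bits 8-15, hardirq count in bits 16-25, and PREEMPT_ACTIVE as 0x40000000 on ARM) is an assumption based on 3.8-era kernel headers and should be checked against the actual source tree. The function names are hypothetical.

```python
# Assumed 3.8-era layout of preempt_count on ARM; verify against your kernel source.
PREEMPT_MASK = 0x000000FF    # nested preempt_disable() depth
SOFTIRQ_MASK = 0x0000FF00    # softirq nesting / bottom-half disable count
HARDIRQ_MASK = 0x03FF0000    # nested hard interrupt count
PREEMPT_ACTIVE = 0x40000000  # flag value assumed for ARM

MARKER = "BUG: scheduling while atomic: "


def parse_atomic_bug(line):
    """Split a 'scheduling while atomic' line into (task name, pid, preempt count)."""
    tail = line.split(MARKER, 1)[1].strip()
    # The task name may itself contain '/' or spaces, so split from the right.
    comm, pid, count = tail.rsplit("/", 2)
    return comm, int(pid), int(count, 16)


def decode_preempt_count(count):
    """Break a preempt count into its assumed bit fields."""
    return {
        "preempt_depth": count & PREEMPT_MASK,
        "softirq_count": (count & SOFTIRQ_MASK) >> 8,
        "hardirq_count": (count & HARDIRQ_MASK) >> 16,
        "preempt_active": bool(count & PREEMPT_ACTIVE),
    }


if __name__ == "__main__":
    line = "[ 551.739092] BUG: scheduling while atomic: mysqld/1766/0x40000100"
    comm, pid, count = parse_atomic_bug(line)
    print(comm, pid, decode_preempt_count(count))
```

Under that assumed layout, 0x40000100 decodes to a softirq count of 1 with a preempt depth of 0, which would be consistent with the task having been interrupted by softirq processing (where NAPI polling runs) when the scheduler was entered.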

beagler001 · December 13, 2014, 10:45pm

If you are using the CAN device and the c_can driver, then implementing the kernel mod and re-building/re-installing would seem to be your only option.

If you need to use CAN and can use a USB-to-CAN adapter, or some other serial-to-CAN adapter, then maybe you could get around this problem.

Jean-Pierre_Aulas · December 16, 2014, 3:05pm

Thank you for your answer. So I have to rebuild … no luck this time.
regards
JPA