Background: my company is using BeagleBone Blacks in an industrial CAN bus monitoring application. Our Linux guy quit in June, leaving me to support this thing, and before June I had pretty much zero Linux experience. Our kernel has two custom modifications. One was provided by Tower Tech (an Italian manufacturer of BeagleBone CAN capes) to add CAN support. The other was made by the guy who quit: a kernel modification to support high-speed serial clocks, because the original kernel code had a bug when setting serial rates above 230,400 baud (if I remember correctly). The base kernel we’re working from is the 2013-09-04 release, which, from what I can gather, is the latest official Angstrom release.
We’re using the BeagleBone in two different external hardware configurations. One uses a “straight to AM3359 CAN” interface, and the other goes through a CAN-to-serial interface that is external to the BeagleBone. The CAN-to-serial version works flawlessly. The straight-to-CAN version, we discovered, throws the same kernel panic over and over when CAN traffic gets high.
Our suite of programs running on the BeagleBone includes two networking-related daemons. One catches and reacts to network requests to configure our equipment; the other is basically a UDP blaster that broadcasts data. I’ve found that disabling both daemons prevents the panic. If I disable only one of them (it doesn’t matter which), the panic still happens.
Anyway, the panic references /net/core/dev.c line 3988 every time. Has anyone seen something similar, or can someone point me in the direction of a fix? Because the panic does not occur when we use the indirect CAN-to-serial hardware, I suspect the CAN-specific changes we got in the Tower Tech kernel. I’m hesitant to ask them, though, because I know my ex-colleague had trouble communicating with them in the past.
Also note that the panic happens consistently. The dump below references our “can_mon” program, but the panic also happens when nothing is running other than our (custom) daemons.
I haven’t found a software workaround yet. For now we’re avoiding the “direct” CAN interface to the BeagleBone and instead using our external custom hardware to relay serial “CAN” messages to the BeagleBone. We don’t have any issues with this setup.
That said, if you have a fix I’d love to hear about it.
No solution here yet, but I have found some very relevant discussions out there. Something must have changed in the kernel scheduler that requires drivers (CAN, in our case) to be updated. I have copied Robert Nelson, the BeagleBone kernel support guru, on this post; perhaps he is already aware of this problem and knows of a workaround. I will post something if/when I get this figured out. Until then, here are some relevant links.
My kernel is no longer crashing. Unfortunately, I do not know the exact fix, because I was changing a lot of things at once while trying to get this to work.
The one consistent thing in all of this is the backtrace in /var/log/kern.log: it shows that the routine c_can_get_berr_counter is doing something it should not. Getting someone who knows this code to take a look, however, seems to be a huge challenge. If you would like to see the backtrace, let me know.
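If you want to pull those backtraces out of /var/log/kern.log yourself, a small filter like the following Python sketch can help. It looks for the start of a kernel crash report and keeps the report only if it mentions one of the symbols discussed in this thread (c_can_get_berr_counter, dev.c). The start-of-report patterns and the 40-line window are assumptions: the exact wording of oops and panic messages varies between kernel versions, so adjust them to match your log. The function names are hypothetical.

```python
import re
from typing import Iterable, List, Sequence

# Symbols named in this thread; adjust for your own backtrace.
DEFAULT_MARKERS = ("c_can_get_berr_counter", "dev.c")

# Assumed first lines of a kernel crash report (wording varies by version).
CRASH_START = re.compile(
    r"Kernel panic|Internal error|Unable to handle kernel|BUG:")


def find_crash_reports(lines: Iterable[str],
                       markers: Sequence[str] = DEFAULT_MARKERS,
                       context: int = 40) -> List[List[str]]:
    """Return crash reports (a start line plus up to `context` following
    lines) that mention at least one of `markers`."""
    lines = list(lines)
    reports = []
    i = 0
    while i < len(lines):
        if CRASH_START.search(lines[i]):
            block = lines[i:i + context + 1]
            if any(m in line for line in block for m in markers):
                reports.append(block)
            # Skip past this window so one crash is not reported twice.
            i += len(block)
        else:
            i += 1
    return reports


def scan_kern_log(path: str = "/var/log/kern.log") -> List[List[str]]:
    """Read a syslog-style kernel log and return the matching reports."""
    with open(path, errors="replace") as f:
        return find_crash_reports(f)
```

On the board, `for report in scan_kern_log(): print("".join(report))` would print each matching report, which is an easy way to gather the backtrace for a bug report.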
I do know that the resolution was one of two things:
Our CAN transceiver’s enable line was shared with MMC1_DAT0, which is brought out to P8-25 on an expansion header. I tried various methods to reset the eMMC in hopes of putting its pins into an open-drain state, but I could never get that to work, so I could never drive the CAN transceiver’s enable line low. At first I got around this by wiring P8-25 directly to ground. That could have caused problems with either the CAN transceiver or the processor itself; it could also be that leaving the CAN transceiver enabled through the boot process caused issues. In any case, I clipped the P8-25 pin on our cape and instead wired the enable line to another GPIO routed to the P8 expansion header (P8-17, I think). This let me enable the CAN transceiver cleanly, after boot.
Note that there are plenty of other things in my kernel config. I only showed the differences between the original (when I would get kernel panic) and the modified (no kernel panic).
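For the transceiver change described above (enabling the transceiver from a spare GPIO after boot instead of sharing MMC1_DAT0), the enable step could be done from a small script run once the system is up. The Python sketch below uses the legacy sysfs GPIO interface (/sys/class/gpio) that 3.8 kernels provide. Several things are assumptions: that P8-17 maps to GPIO0_27 (Linux GPIO number 27), that the pin is already muxed as a GPIO by your device tree, and that the transceiver is enabled when the line is high (pass active_high=False if yours is active-low). The function name set_can_transceiver is hypothetical.

```python
import os

# Assumption: P8-17 is GPIO0_27, i.e. Linux GPIO number 0 * 32 + 27 = 27.
CAN_EN_GPIO = 27
SYSFS_GPIO = "/sys/class/gpio"


def set_can_transceiver(enabled: bool, gpio: int = CAN_EN_GPIO,
                        active_high: bool = True,
                        root: str = SYSFS_GPIO) -> None:
    """Drive the (hypothetical) transceiver enable line through the legacy
    sysfs GPIO interface. Needs root and the pin muxed as a GPIO."""
    gpio_dir = os.path.join(root, f"gpio{gpio}")
    if not os.path.isdir(gpio_dir):
        # Ask the kernel to expose the pin as /sys/class/gpio/gpioN.
        with open(os.path.join(root, "export"), "w") as f:
            f.write(str(gpio))
    drive_high = (enabled == active_high)
    # Writing "high" or "low" to `direction` makes the pin an output with
    # that initial level in one step, avoiding a glitch on the line.
    with open(os.path.join(gpio_dir, "direction"), "w") as f:
        f.write("high" if drive_high else "low")
```

Run as root after boot, e.g. `set_can_transceiver(True)`. The `root` parameter exists only so the function can be exercised against a scratch directory instead of the real sysfs tree.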
Those config parameters are used for the kernel build. They are part of a huge collection of configuration options that control how the kernel is built (which drivers and features are compiled in, built as modules, or left out).
In your initial post, I noticed that you mentioned you were using a custom kernel; therefore, I assumed that you understood how to modify kernel config parameters and build the kernel.
The only reason I know we’re using a custom kernel is that our former Linux guy told me so. I’ve never recompiled a kernel before, but I have a cursory grasp of what’s involved.
My former coworker’s Linux laptop has a folder named “Robert C Nelson” that seems to contain the custom kernel modification that fixes the UART speed issue. I’ll start poking around in there to see if I can figure it all out.
And yes, that does help. I appreciate it; thanks for your patience.
RCN’s kernel is the kernel source that I am using as well. If you change into that directory, you can run the rebuild script by typing “tools/rebuild.sh”. That script automatically pops up a window showing all the kernel config parameters. There are so many parameters that finding the exact ones matching what I listed above is rather daunting. Instead, I recommend viewing the default kernel config file and checking whether you are using the same config as me (probably not). The default config file should be named “defconfig” and should be stored in the patches directory.
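Comparing your configuration against someone else’s defconfig by eye is tedious, so a short script can list only the options that differ, the same kind of difference list mentioned earlier in this thread. The Python sketch below parses the two line forms a kernel .config uses (CONFIG_FOO=value and “# CONFIG_FOO is not set”) and reports differences. Treating an option missing from one file as “n” is a simplifying assumption, and the function names are hypothetical. The kernel source tree should also include scripts/diffconfig, which does the same job and may be the simpler choice.

```python
import re
from typing import Dict, Iterable, List, Tuple

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(lines: Iterable[str]) -> Dict[str, str]:
    """Map each CONFIG_ symbol to its value; 'n' for '# ... is not set'.
    Other comments and blank lines are ignored."""
    opts = {}
    for raw in lines:
        line = raw.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def diff_config(old: Dict[str, str],
                new: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (symbol, old_value, new_value) for every option that differs.
    Approximation: a symbol absent from one file counts as 'n' there."""
    changes = []
    for sym in sorted(set(old) | set(new)):
        a, b = old.get(sym, "n"), new.get(sym, "n")
        if a != b:
            changes.append((sym, a, b))
    return changes
```

For example, `diff_config(parse_config(open("patches/defconfig")), parse_config(open("KERNEL/.config")))` would compare the stock defconfig with a working config; the KERNEL/.config path is an assumed location, so check where your checkout keeps its .config.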
Never mind anything I previously said about changing the kernel config parameters. The problem is rooted in my original comment about the c_can driver. A patch exists that solves this problem. Unfortunately, it landed in the mainline kernel after the 3.8-based branch we are using on the BeagleBone Black, so the fix is not included in our kernel source. Take a look at this:
Once you are familiar with building the kernel for the BBB, I suggest manually editing the c_can.c file with the changes shown in the link above, then rebuilding and re-installing. That should fix your problem; it did for me.
Hello, thanks for your reply. Is there another, simpler way to apply this fix without rebuilding the kernel?
Below is a trace from another problem, this one involving MySQL:
(Linux BBB4 3.8.13-bone50 #1 SMP Tue May 13 13:24:52 UTC 2014 armv7l GNU/Linux)
If you are using the CAN device and the c_can driver, then implementing the kernel mod and re-building/re-installing would seem to be your only option.
If you need to use CAN and can use a USB-to-CAN adapter, or some other serial-to-CAN adapter, then maybe you could get around this problem.