eth0 mysteriously stops working

William_Hermans · October 19, 2016, 3:41am

I had a board a few days ago that just stopped working as far as ssh went. So, I checked the LEDs on my GbE switch, and the lights were not lit. Checking further, I found that the LEDs on the beaglebone were also not lit.

So I disconnected the ethernet cable at both ends and reseated it. Nothing. The ethernet did not start working again until I rebooted the board by physically pressing the reset button on the board.

I was curious if anyone else had experienced this same thing. For me, this has actually happened only once. The board this happened to was a BeagleBone Green, running from the eMMC. Currently I’m running from sdcard but . . .

william@beaglebone:~$ sudo mount /dev/mmcblk1p1 /media/rootfs/
william@beaglebone:~$ cd /media/rootfs/
william@beaglebone:/media/rootfs$ cat etc/dogtag
BeagleBoard.org Debian Image 2016-06-19
william@beaglebone:/media/rootfs$ ls boot/ |grep init
initrd.img-4.4.12-ti-r31

Grepping through the various files in /var/log shows that everything was working fine as far as I can tell. No error messages stand out for ‘net’ or ‘eth0’. I’ve also talked with a person I know who has experienced this themselves. For them, though, it has happened more than once, including with a 3.8.x kernel.

As for when this happened to me personally, the board had just been idling for several days (5-6 days) when I needed to get some information from it and got no response via ssh.

Graham1 · October 19, 2016, 10:24am

I have two BBG units that I use as headless servers, with only access through Ethernet. Both have been running without reboot for multiple months without any issues. I think that I mentioned that I did have a BBB do exactly what you describe, while running as a headless server last year, but at the time there was a thunderstorm in the area, and lightning strikes in the neighborhood. It recovered on reboot, and has never repeated the symptom.

So, my conclusion is that it can happen, but it is rare, and in my case it was probably caused by an electrical transient coming in on the Ethernet connection, which is routed from a cable modem to the outside world.

For a high-reliability application, perhaps add some extra transient protection on the Ethernet connection, and some kind of “ping monitor” that can auto-reboot the BBG.
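
A minimal sketch of the kind of “ping monitor” mentioned above, assuming a POSIX shell, the iputils `ping` found on Debian, and root privileges for `reboot`. The target address, failure threshold, and interval are placeholder values, and the function names are made up for illustration; none of them come from this thread.

```shell
#!/bin/sh
# Hypothetical ping watchdog: reboot after several consecutive failed pings.
# TARGET, MAX_FAILS and INTERVAL are assumed values, not taken from the thread.
TARGET="${TARGET:-192.168.1.1}"   # assumed address of a host that should always answer
MAX_FAILS="${MAX_FAILS:-5}"       # consecutive failures before rebooting
INTERVAL="${INTERVAL:-60}"        # seconds between checks

# Return 0 if host $1 answers a single ping within 5 seconds.
link_ok() {
    ping -c 1 -W 5 "$1" >/dev/null 2>&1
}

# Print the new failure count.
# $1 = previous count, $2 = exit status of the last check.
next_fail_count() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 + 1 ))
    fi
}

# Main loop; needs root so that reboot is permitted.
watchdog_loop() {
    fails=0
    while :; do
        link_ok "$TARGET"
        status=$?
        fails=$(next_fail_count "$fails" "$status")
        if [ "$fails" -ge "$MAX_FAILS" ]; then
            logger "ping watchdog: $TARGET unreachable, rebooting"
            reboot
        fi
        sleep "$INTERVAL"
    done
}
```

Calling `watchdog_loop` from a root-owned init script or service would start the monitor. Requiring several consecutive failures avoids rebooting on a single dropped ping.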

— Graham

WulfMan · October 19, 2016, 2:48pm

What version of the OS and kernel are you using?

William_Hermans · October 19, 2016, 5:54pm

I hadn't had a BBG to play with until the last 2-3 months. Now, I've had
~30 over the course of the last 2 months to observe this behavior on, which
again has only happened once. So, I attributed what happened to me
accidentally knocking the board around a little. That is, until I talked
with another person I know who has experienced this issue with multiple
kernels, and multiple times over the last, I don't know . . . maybe 6 months.

So what I did was first install the same Debian image he was using, then
change kernels to the *bone* LTS kernel. I removed g_ether by removing
Robert's custom boot script for the 335x evm board. After that I got the
project files from this person I know and duplicated his software setup,
which is an mqtt application with a custom cape.

Anyway, I was running this software last night, and then I downloaded and
ran nload from an ssh session. But I keep getting ssh "Broken pipe" errors,
which is not necessarily a concern. I've seen that before. I intend to
hook up a serial debug cable and run nload from that, but just have not
gotten around to it.

One thing on my mind is that perhaps the software this person I know wrote
is somehow failing to deal with a "busy network" properly. Meaning, if the
internet connection is bandwidth saturated, and the application is for some
reason unable to deal with a "stale connection", how will it act? However,
I would not think this should cause the hardware to fail, because that's
what I'm seeing when the ethernet traffic indication LEDs stop functioning
and the ethernet connection is rendered non-functional. What I have been
able to observe so far, however, is that this application sends around
8-9kBit/s of data, and gets 2-3kBit/s back.

Another concern: mqtt by default is an inherently insecure protocol, and
this app currently runs as root . . . However, there are several caveats to
this concern. First, the application is a peer-to-peer design in that only
the mqtt broker can communicate with the board, whether it sends commands
or collects data back from the board. Second, mqtt is able to use
certificates, however I do not think that is currently the case with this
software *YET*. I have given this person I know the standard security
lecture on running as root, locking things down, etc. We just have not
acted on it yet.

With all of the above said, when I ran into this issue myself, I was
not running anything other than a stock image and the stock software that
comes with it, and the board was just idling for 5-6 days, maybe a
little longer. I ran uptime from an ssh session, where it reported back "5
days . . .", after which this happened. So I'm more inclined to think this
is most likely not a userspace application issue.

I'm not even sure where to go from here, as far as tracking this issue
down. All I can really do is throw everything I know / have at the board,
and hope I get an error trapped from the live kernel log through serial.
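
One way to gather more evidence while waiting for a failure, sketched under assumptions: a small poller that records changes in the link state that the kernel exposes under /sys/class/net. The function names, the `SYSFS_NET` override, and the polling interval are hypothetical; only the `carrier` and `operstate` sysfs files are standard.

```shell
#!/bin/sh
# Sketch: log changes in an interface's link state as reported by sysfs.
# SYSFS_NET is overridable only so the helper can be tried against a fake tree.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print "carrier=<value> operstate=<value>" for interface $1.
# A "?" is printed for any file that cannot be read.
link_state() {
    dir="$SYSFS_NET/$1"
    carrier=$(cat "$dir/carrier" 2>/dev/null || echo "?")
    operstate=$(cat "$dir/operstate" 2>/dev/null || echo "?")
    echo "carrier=$carrier operstate=$operstate"
}

# Poll interface $1 every $2 seconds and print a timestamped line
# whenever the reported state changes.
watch_link() {
    last=""
    while :; do
        now=$(link_state "$1")
        if [ "$now" != "$last" ]; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') $1 $now"
            last="$now"
        fi
        sleep "$2"
    done
}
```

Running something like `watch_link eth0 10 >> /root/eth0-link.log` from the serial console (the log path is just an example) keeps recording even after ssh stops responding.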

RobertCNelson · October 19, 2016, 6:04pm

I think it's related to suspend/cpuidle.. I know another user was
having issues, where they had to ping it twice, as the first would
never respond..

one thing that might help: remove the sleep pinmux's from: mac/davinci_mdio:

https://github.com/RobertCNelson/dtb-rebuilder/blob/4.4-ti/src/arm/am335x-bone-common.dtsi#L370-L383

Regards,

William_Hermans · October 19, 2016, 6:09pm

Thanks Robert,

I’ll check that out. So when you say “remove the sleeps”, do I just delete "sleep" from pinctrl-names = "default", "sleep"; or do I also need to remove pinctrl-1 = <&cpsw_sleep>; as well?

William_Hermans · October 19, 2016, 6:10pm

I would think both, but honestly don’t know . . .

RobertCNelson · October 19, 2016, 6:15pm

Yeah, from:

&mac {
    pinctrl-names = "default", "sleep";
    pinctrl-0 = <&cpsw_default>;
    pinctrl-1 = <&cpsw_sleep>;
    slaves = <1>;
    status = "okay";
};

&davinci_mdio {
    pinctrl-names = "default", "sleep";
    pinctrl-0 = <&davinci_mdio_default>;
    pinctrl-1 = <&davinci_mdio_sleep>;
    status = "okay";
};

to:

&mac {
    pinctrl-names = "default";
    pinctrl-0 = <&cpsw_default>;
    slaves = <1>;
    status = "okay";
};

&davinci_mdio {
    pinctrl-names = "default";
    pinctrl-0 = <&davinci_mdio_default>;
    status = "okay";
};

Regards,

William_Hermans · October 19, 2016, 6:27pm

Thanks again Robert,

So I’ll have to download the overlay board file repo, edit, and then install. But hmm, it’s been a while. I need:

##BeagleBone Black: HDMI (Audio/Video) disabled:
dtb=am335x-boneblack-emmc-overlay.dtb

Which probably loads the common overlay file, so . . . yeah, OK, I think I’ve got it. I’m going to be busy today with other things (unavoidable), so it might be tomorrow before I can write back about success in modifying the board file. After that I can hopefully get this modification out to be tested on multiple boards by this person I know.

I’ll post full instructions for others here when I get the chance, so others can test, and potentially fix the same issue if needed.

William_Hermans · October 20, 2016, 5:15am

Yeah I’m locked in a boot loop ending here:

Starting kernel …

[ 3.341689] CPUidle arm: CPU 0 failed to init idle CPU ops
[ 3.347892] omap_hsmmc 48060000.mmc: unable to obtain RX DMA engine channel 3706465728
[ 3.356275] omap_hsmmc 481d8000.mmc: unable to obtain RX DMA engine channel 3706465648
[ 3.366324] wkup_m3_rproc 44d00000.wkup_m3: Platform data missing!
[ 3.374426] omap_voltage_late_init: Voltage driver support not added
[ 3.381301] cpu cpu0: cpu0 clock notifier not ready, retry
[ 3.482097] bone_capemgr bone_capemgr: Invalid signature 'ffffffff' at slot 0
[ 3.489295] bone_capemgr bone_capemgr: slot #0: No cape found
[ 3.548114] bone_capemgr bone_capemgr: slot #1: No cape found
[ 3.608112] bone_capemgr bone_capemgr: slot #2: No cape found
[ 3.668112] bone_capemgr bone_capemgr: slot #3: No cape found
[ 3.675302] cpsw 4a100000.ethernet: Missing rx_descs property in the DT.
[ 3.682080] cpsw 4a100000.ethernet: cpsw: platform data missing
Loading, please wait…

A few points to note. The kernel is the LTS 4.1.x-bone-rt variant, updated yesterday. The board is a BeagleBone Green, but I rebuilt am335x-boneblack-emmc-overlay.dtb, which is the same overlay file I was loading before rebuilding.

William_Hermans · October 20, 2016, 5:16am

Overlay, meaning board file.

William_Hermans · October 21, 2016, 12:25am

So, at this point I think I’ll have to decompile both board files, and then run diff to see what’s different.

William_Hermans · October 21, 2016, 3:27am

So after decompiling the two files and comparing with diff, then piping to a file . . . the diff file is literally 2186 lines in length . . . wtf ?
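
For anyone repeating this comparison, a rough sketch: `dtc` has a `-s` option that sorts nodes and properties on output, which may cut down noise from ordering differences. Whether it would shrink a diff like the one above is not something this thread confirms. The helper names and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch: decompile board files with sorted output, then measure the diff.

# Decompile $1 (.dtb) into $2 (.dts). The -s flag sorts nodes and
# properties so that pure ordering differences do not appear in a diff.
decompile_sorted() {
    dtc -I dtb -O dts -s -o "$2" "$1"
}

# Print the number of differing lines (those diff marks with '<' or '>')
# between text files $1 and $2.
count_diff_lines() {
    diff "$1" "$2" | grep -c '^[<>]'
}
```

For example, after `decompile_sorted old.dtb old.dts` and `decompile_sorted new.dtb new.dts`, running `count_diff_lines old.dts new.dts` gives a rough size for the difference, and `diff -u old.dts new.dts` shows the details.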

William_Hermans · October 21, 2016, 4:53am

OK, I had to modify my workflow, but I do believe I got the changes put into place. Not sure why Robert’s way was not working, but I’m used to thinking outside the box, or looking at multiple ways to achieve the same results . . .

Your board file name and kernel version will depend on which board file you need to use, and which kernel you’re running . . .

william@beaglebone:~/dev$ cp /boot/dtbs/4.1.34-bone-rt-r24/am335x-boneblack-emmc-overlay.dtb .

Search for "sleep"

Line 810-814 for me, remove:

cpsw_sleep {
    pinctrl-single,pins = <0x108 0x27 0x10c 0x27 0x110 0x27 0x114 0x27 0x118 0x27 0x11c 0x27 0x120 0x27 0x124 0x27 0x128 0x27 0x12c 0x27 0x130 0x27 0x134 0x27 0x138 0x27 0x13c 0x27 0x140 0x27>;
    linux,phandle = <0x37>;
    phandle = <0x37>;
};

line 816-820 remove:

davinci_mdio_sleep {
    pinctrl-single,pins = <0x148 0x27 0x14c 0x27>;
    linux,phandle = <0x39>;
    phandle = <0x39>;
};

Line 1827 change:

pinctrl-names = "default", "sleep";

to:

pinctrl-names = "default";

Line 1841 change:

pinctrl-names = "default", "sleep";

to:

pinctrl-names = "default";

Line 2165 delete this whole line:

cpsw_sleep = "/ocp/l4_wkup@44c00000/scm@210000/pinmux@800/cpsw_sleep";

Line 2166 delete this whole line:

davinci_mdio_sleep = "/ocp/l4_wkup@44c00000/scm@210000/pinmux@800/davinci_mdio_sleep";

Then save, and exit the file. After that rename the old board file:

william@beaglebone:~/dev$ mv am335x-boneblack-emmc-overlay.dtb am335x-boneblack-emmc-overlay.dtb.old

Now compile the newly edited source file back into the original board file name / extension:

william@beaglebone:~/dev$ dtc -I dts -O dtb -o am335x-boneblack-emmc-overlay.dtb am335x-boneblack-emmc-overlay.dts

For convenience, since I use an NFS share to do most of my work on, I prefer to copy both the new dtb and the old dtb to the destination:

william@beaglebone:~/dev$ sudo cp am335x-boneblack-emmc-overlay.dtb* /boot/dtbs/4.1.34-bone-rt-r24/

Double check:

william@beaglebone:~/dev$ ls /boot/dtbs/4.1.34-bone-rt-r24/ |grep emmc
am335x-boneblack-emmc-overlay.dtb
am335x-boneblack-emmc-overlay.dtb.old

Reboot:

william@beaglebone:~/dev$ sudo reboot

Now do keep in mind: just because I’m calling out line numbers here does not mean they will be the same for you. But if you use a good text editor, you can search for “sleep”, and you should find only these 6 occurrences in your decompiled source file. With that said, always double check to make sure that what you’re deleting / changing is actually what needs to be changed.
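
A quick way to double check the edit from the shell, as a sketch: count the lines in the decompiled source that still mention “sleep”. The function names and the file name in the usage note are only illustrative. Note that a plain search for “sleep” will not show the `pinctrl-1` properties, since a decompiled file refers to other nodes by phandle number rather than by label, so the second helper lists those lines separately for inspection.

```shell
#!/bin/sh
# Sketch: sanity checks on a decompiled board file after editing.

# Print the number of lines in file $1 that contain "sleep".
count_sleep_lines() {
    grep -c 'sleep' "$1"
}

# List any pinctrl-1 properties in file $1, prefixed with line numbers.
list_pinctrl1() {
    grep -n 'pinctrl-1' "$1"
}
```

`count_sleep_lines am335x-boneblack-emmc-overlay.dts` should print 0 once all of the edits above have been made.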

William_Hermans · October 21, 2016, 5:16am

Sorry, I missed putting the decompile step in my workflow. This is the very next step after making a copy of the board file from the /boot/dtbs// directory:

dtc -I dtb -O dts -o am335x-boneblack-emmc-overlay.dts am335x-boneblack-emmc-overlay.dtb

William_Hermans · October 21, 2016, 4:53pm

So, I’m still getting “Write failed: broken pipe” using ssh from my Debian support system to the beaglebone. This is not a timeout issue at all, since in the ssh session I’m running nload, which constantly displays eth0 bandwidth usage.