Beaglebone black Ethernet instability: libphy: Link is Down

Hey Everyone,

I have been testing the beaglebone black for some time now, and I have noticed that the Ethernet tends to die after prolonged operation (1 day +).

Beaglebone is run off of a good quality 5V DC supply, with constant traffic being streamed on it. CPU usage is around 40%.

Below is the output of dmesg:

[ 9.145976] net eth0: initializing cpsw version 1.12 (0)
[ 9.150610] net eth0: phy found : id is : 0x7c0f1
[ 9.150644] libphy: PHY 4a101000.mdio:01 not found
[ 9.155748] net eth0: phy 4a101000.mdio:01 not found on slave 1
[ 12.227778] libphy: 4a101000.mdio:00 - Link is Up - 100/Full
[102008.098372] libphy: 4a101000.mdio:00 - Link is Down

Link suddenly dies after many hours of being up. Beaglebone is connected to an ethernet switch and was left completely alone at that time (no tinkering).
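
To put a number on "many hours", the two libphy timestamps in the dmesg output above can be subtracted. The following is a minimal Python sketch, not part of the original report; the function names are made up for illustration and the sample text is copied from the log lines above.

```python
import re

# Matches kernel log lines such as
# "[102008.098372] libphy: 4a101000.mdio:00 - Link is Down"
LINK_EVENT = re.compile(r"\[\s*(\d+\.\d+)\]\s+libphy: \S+ - Link is (Up|Down)")


def link_events(dmesg_text):
    """Return a list of (seconds_since_boot, state) tuples in log order."""
    return [(float(t), state) for t, state in LINK_EVENT.findall(dmesg_text)]


def uptime_before_drop(events):
    """Seconds between the first 'Up' and the first 'Down' after it, or None."""
    up_time = None
    for t, state in events:
        if state == "Up" and up_time is None:
            up_time = t
        elif state == "Down" and up_time is not None:
            return t - up_time
    return None


# Two lines copied from the dmesg output in this post.
SAMPLE = """\
[ 12.227778] libphy: 4a101000.mdio:00 - Link is Up - 100/Full
[102008.098372] libphy: 4a101000.mdio:00 - Link is Down
"""

if __name__ == "__main__":
    seconds = uptime_before_drop(link_events(SAMPLE))
    print(f"link stayed up for {seconds / 3600:.1f} hours")
```

For the log above this works out to roughly 28.3 hours between "Link is Up" and "Link is Down".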

Upon attempting to revive the ethernet link without reboot, “ip link set eth0 down” seems to succeed, however “ip link set eth0 up” fails with:

root@beaglebone:~# ip link set eth0 up
[170649.910190] net eth0: phy 4a101000.mdio:00 not found on slave 0
[170649.916577] libphy: PHY 4a101000.mdio:01 not found
[170649.921721] net eth0: phy 4a101000.mdio:01 not found on slave 1

Observations: after the ethernet dies, the orange light on the ethernet plug goes out, and the left LED blinks sporadically green.
Unplugging the ethernet makes the left LED go solid green, right LED still off.
Interestingly, the ethernet switch shows 1Gbps negotiation. Obviously link is not functional (but switch light is ON).

Connecting the bone in this state to a 10/100 switch causes both LEDs on the bone to go completely off, while the 10/100 switch shows 10mbps negotiation.

Rebooting the bone has no effect on ethernet (still dead):

[ 5.705411] net eth0: initializing cpsw version 1.12 (0)
[ 5.707104] libphy: PHY 4a101000.mdio:00 not found
[ 5.712133] net eth0: phy 4a101000.mdio:00 not found on slave 0
[ 5.718328] libphy: PHY 4a101000.mdio:01 not found
[ 5.723339] net eth0: phy 4a101000.mdio:01 not found on slave 1

Only a power cycle or pressing RESET on the board fixes this:

[ 8.242920] net eth0: initializing cpsw version 1.12 (0)
[ 8.256913] net eth0: phy found : id is : 0x7c0f1
[ 8.256950] libphy: PHY 4a101000.mdio:01 not found
[ 8.262017] net eth0: phy 4a101000.mdio:01 not found on slave 1
[ 11.333235] libphy: 4a101000.mdio:00 - Link is Up - 100/Full

Could this be the RESET line of the PHY picking up some glitch? I haven’t checked the schematics/layout, but I hope it turns out that a cap is needed on the PHY reset line. Otherwise, I can’t think of anything else.

Gerald: Help?

Regards
Hussein

Funny how the glitch only happens after an hour. Obviously there is no damage or it would never come back. The reset line of the PHY is the reset line of the processor. I would expect to see a full board reset if this were the issue.

Gerald

Hey Gerald

Actually I’ve seen it before that a glitch on the reset line somehow “locally” reset a chip, and the remaining chips on the same reset line do not get reset (capacitance of reset trace, minimum duration to hold reset can play a role).

I’ve seen this on a custom audio cape I did, where touching a via connected to the reset trace next to the CODEC caused it to reset incompletely with corrupted registers, while the system did not reboot. This was of course sporadic; sometimes touching the via did actually cause everything to reset.

The problem is that I cannot probe this on the scope and wait for it to happen, since the ground terminal of the scope will earth the system and cause all the noise pickup issues to go away. A floating scope or differential probe would be needed but I don’t have that at the moment.

Suggestions? How could one reset the PHY during system operation? Is it possible through sysfs somehow? A SW workaround would require one to be able to do that.
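
On the sysfs part of the question: the kernel does expose link state under /sys/class/net/<iface>/operstate, which is enough to detect the failure from user space, although reading it does not reset the PHY. A minimal Python sketch of such a check follows. The function names are made up for illustration, and eth0 is simply the interface name from the logs above.

```python
from pathlib import Path


def read_operstate(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's operstate string for iface (e.g. 'up' or 'down')."""
    return (Path(sysfs_root) / iface / "operstate").read_text().strip()


def link_is_up(iface, sysfs_root="/sys/class/net"):
    """True only when the kernel reports the interface as operationally up."""
    try:
        return read_operstate(iface, sysfs_root) == "up"
    except OSError:
        # Interface missing or file unreadable: treat as not up.
        return False


if __name__ == "__main__":
    # Single check; a real watchdog would call this periodically
    # (for example from cron) and log or act on a False result.
    print("eth0 up:", link_is_up("eth0"))
```

What to do once the check fails (log it, try a soft reset, or trigger a power cycle externally) is a separate decision that this sketch leaves open.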

Regards
Hussein

I understand your point. But why not after 5 min or 12 min? Why after an hour?

We have had these boards running for days. So I am more inclined to look for a SW issue here.

Have you looked at the schematic?

Gerald

It isn't possible to reset the PHY using the reset line without having
the am335x do a reset itself. There is a soft reset command that can be
issued over MDIO to the PHY, though.

Can you try issuing a 'reset' command from u-boot after you see this
issue come up? The 'reset' command from u-boot will do a warm software
reset on the am335x which should drive the reset line out of the am335x
and reset the PHY. Does this fix the issue if you let it boot fully
after the 'reset' command?

Do you see this on just one board or is it reproducible on more than
one board?

-Andrew

Hey Gerald

I have noticed that these issues are more prominent when the ground is floating, i.e. no USB connections to a PC and no metal ethernet connectors. Maybe you are testing with usb connected? I would leave a board with just power and ethernet with some ping flood running to try to reproduce it at your end. Hopefully we see (or don’t see? hmm) something.

Regards
Hussein

Hey Andrew

I will try that the next time it locks up. I understand you imply there is no way to reset the PHY while Linux is running?

Regards
Hussein

I don’t use the USB connection. I do use a grounded power supply which is the way it must be done. If grounds are different, then you are asking for trouble and creating such a scenario to create an issue does not really prove anything.

Gerald

I had issues with the network not coming up on boot, and it was traced down to problems with the SYS_RESETn line.

I had a level translator connected to SYS_RESETn, to drive a 5V chip. It was powered by a 5V rail. If the 5V rail powered up "differently" than the 3.3V rail (not sure of the exact relationship), I guess it pulled the SYS_RESETn line to weird levels that affected the network chip but not the main processor. I'm now using a GPIO to drive the external 5V chip instead of the SYS_RESETn line.

Anyway, the moral is be very, very careful with SYS_RESETn, because it can cause hard-to-trace problems with networking.

- Mike

The only way to reset the PHY while running is to issue a software
reset to the PHY over the management interface. See the SMSC PHY data
sheet for information on how this can be done.

The only way to issue a hard reset to the PHY is to reset the AM335x.
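
For illustration, a software reset over the management interface amounts to setting bit 15 of the Basic Mode Control Register (register 0 in the IEEE 802.3 clause 22 register set). The Python sketch below does this through the Linux MII ioctls. It is a sketch under assumptions, not a tested fix: it assumes the Ethernet driver forwards SIOCGMIIPHY/SIOCSMIIREG to the PHY, the helper names are hypothetical, and in the failure state described in this thread (PHY not found on MDIO) the call may simply fail. The kernel's PHY state is also not re-synchronised by this, so bringing the interface down and up afterwards is probably still needed.

```python
import fcntl
import socket
import struct

# ioctl numbers from <linux/sockios.h>
SIOCGMIIPHY = 0x8947  # get the address of the PHY in use
SIOCGMIIREG = 0x8948  # read a PHY register (listed for reference, unused here)
SIOCSMIIREG = 0x8949  # write a PHY register

MII_BMCR = 0x00       # Basic Mode Control Register (IEEE 802.3 clause 22)
BMCR_RESET = 0x8000   # bit 15: self-clearing software reset

IFREQ_SIZE = 40       # large enough for struct ifreq on 32- and 64-bit Linux


def pack_ifreq(iface, phy_id=0, reg=0, val_in=0):
    """Build a struct ifreq whose union holds a struct mii_ioctl_data."""
    body = struct.pack("16sHHHH", iface.encode(), phy_id, reg, val_in, 0)
    return body.ljust(IFREQ_SIZE, b"\0")


def unpack_mii(buf):
    """Return (phy_id, reg_num, val_in, val_out) from an ifreq buffer."""
    return struct.unpack_from("HHHH", buf, 16)


def soft_reset_phy(iface="eth0"):
    """Set BMCR.RESET on the PHY attached to iface. Needs root."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # Ask the driver which PHY address it is using.
        reply = fcntl.ioctl(s.fileno(), SIOCGMIIPHY, pack_ifreq(iface))
        phy_id = unpack_mii(reply)[0]
        # Write the reset bit to register 0 of that PHY.
        fcntl.ioctl(s.fileno(), SIOCSMIIREG,
                    pack_ifreq(iface, phy_id, MII_BMCR, BMCR_RESET))
        return phy_id
```

Calling soft_reset_phy("eth0") as root raises OSError if the driver rejects the request, which is itself a useful data point when the PHY has stopped answering.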

-Andrew

Ok it just happened again.

Rebooting into uboot (no power cycle) gives this:

musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Peripheral mode controller at 47401000 using PIO, IRQ 0
musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Host mode controller at 47401800 using PIO, IRQ 0
Net: not set. Validating first E-fuse MAC
Phy not found
PHY reset timed out
cpsw, usb_ether
Hit any key to stop autoboot: 0

U-Boot#
U-Boot#

issuing reset does not help…

U-Boot#
U-Boot#
U-Boot# reset
resetting …

U-Boot SPL 2013.04-rc1-14237-g90639fe-dirty (Apr 13 2013 - 13:57:11)
musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Peripheral mode controller at 47401000 using PIO, IRQ 0
musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Host mode controller at 47401800 using PIO, IRQ 0
OMAP SD/MMC: 0
mmc_send_cmd : timeout: No status update
reading u-boot.img
reading u-boot.img

U-Boot 2013.04-rc1-14237-g90639fe-dirty (Apr 13 2013 - 13:57:11)

I2C: ready
DRAM: 512 MiB
WARNING: Caches not enabled
NAND: No NAND device found!!!
0 MiB
MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
*** Warning - readenv() failed, using default environment

musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Peripheral mode controller at 47401000 using PIO, IRQ 0
musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Host mode controller at 47401800 using PIO, IRQ 0
Net: not set. Validating first E-fuse MAC
Phy not found
PHY reset timed out
cpsw, usb_ether
Hit any key to stop autoboot: 0
U-Boot#
U-Boot#

Power cycling does fix it:

USB Host mode controller at 47401800 using PIO, IRQ 0
Net: not set. Validating first E-fuse MAC
cpsw, usb_ether
Hit any key to stop autoboot: 0
U-Boot#
U-Boot#
U-Boot#

Mike: I am pretty certain that’s the problem. This problem was also affecting the codec on my cape, but adding a 100nF cap to ground next to its reset pin solved it. The PHY still seems to be affected, though. I have no idea what the cause might be. I will consider not routing the SYS_RESETn line to the cape at all. But then how do you reset the chips on the cape so that they are ready by the time the kernel boots? A custom power-on reset circuit on the cape? Interesting…

Disclaimer:
I do have a custom cape connected and the reset line runs to an audio codec on it. I have a 100nF on that line to ground (which removed glitches that used to reset the codec).

Try buffering the reset line before sending it all over the place. Make the connection to the buffer be as close as possible to the connector.

Gerald

Great idea! I think that should do it, since apparently it’s not a widespread issue.

Regards
Hussein

For white bones, our custom capes use a Linear LTC6993 [1] one-shot in
order to assert SYSRESET_n for 100 ms whenever a reset occurs.

[1]:http://www.linear.com/product/LTC6993

This was the result of the Ethernet strapping resistors not being
sampled properly resulting in the PHY LEDs inverting on warm software
resets on white bones. Your issue might be similar. Linear makes a
nice dev kit for about $25. See if that helps.

The reset out of AM335x rev 1.0 silicon (haven't checked my blacks yet
to see if this has changed in new silicon) asserts for up to 255 counts
of a 24 MHz clock, which is quite short (the default is 6). This
results in a roughly 1 usec reset pulse when the SMSC PHY wants at least
1 ms reset. The 100 ms we use is the "big hammer" approach :)
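
As a rough check of these figures, the pulse width is just the count divided by the clock frequency. The short Python sketch below uses only the numbers quoted above (24 MHz clock, default count of 6, maximum of 255, 1 ms minimum for the PHY); the names are made up for illustration.

```python
CLOCK_HZ = 24_000_000    # 24 MHz clock, as quoted in the post
PHY_MIN_RESET_S = 1e-3   # "at least 1 ms", as quoted in the post


def pulse_seconds(counts, clock_hz=CLOCK_HZ):
    """Width of a reset pulse lasting `counts` clock cycles, in seconds."""
    return counts / clock_hz


if __name__ == "__main__":
    for counts in (6, 255):
        width = pulse_seconds(counts)
        print(f"{counts:3d} counts -> {width * 1e6:6.3f} us, "
              f"{PHY_MIN_RESET_S / width:5.0f}x shorter than 1 ms")
```

That gives about 0.25 us for the default of 6 counts and about 10.6 us for the maximum of 255, which brackets the roughly 1 usec mentioned above. Even the maximum is about two orders of magnitude short of the 1 ms the PHY wants.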

For reference on my issue:
https://groups.google.com/d/msg/beagleboard/PFXV9Nrsf5U/6MXzPKrHWA0J

-Andrew

Dear All,
I see the same behaviour on my board.
After a few hours the Ethernet freezes, with the same error messages.
It happened twice in one day.
The BeagleBone Black is connected to an external device with this pinout:

1- GND
5- VDD 5V
11- UART4_RXD
13- UART4_TXD

It is also connected through the J1 “Serial Debug” header.

Ethernet did not come back after any kind of software reboot; same error:

[89339.290867] net eth0: phy 4a101000.mdio:00 not found on slave 0
[89339.297109] libphy: PHY 4a101000.mdio:01 not found
[89339.302132] net eth0: phy 4a101000.mdio:01 not found on slave 1

Does anyone have an idea about this?

Thank you,
Paolo

So, what are the voltage levels on UART4_RXD and UART4_TXD?

Gerald

Just a thought,

I assume you are running with a dynamically assigned IP-address.
The link might go down because of renew issues of the lease.

Try configuring a static address.
The configuration is then not subject to outside influences.

My Ethernet is flaky when I need to renew a lease after unplugging or bringing the interface down.

Worth a try?

LP

3.3V. Is that correct?

Thank you for the idea, I will try it tomorrow.
But it doesn’t explain why the physical interface fails to come up after a reboot.
Best,
PT

3.3V is fine!

Gerald