After investigation by our engineers, it was determined that nothing can be done as a software fix. The power management design of the BeagleBone Black/Green would need a rework for this to be fixed.
We ended up using another board that can cut the power to the BBG when the problem is detected. It was easier for us as a workaround.
We did not investigate whether the Octavo-based versions (like the Blue) have this problem.
Connect a GPIO pin to the SYS_RESETN pin (P9-10). Make sure the GPIO pin is set to be an input only, with pull-up, at boot time in the DT. After boot, have a script or service grep dmesg for the telltale ‘detected phy mask fffffffb’ message. If the message is found, set the GPIO pin to output and drive it low. The board will reboot and (eventually) win the race and have a functional Ethernet port.
Add an additional external micro-USB power supply. I’ve had good luck with the 2A ones sold for the Raspberry Pi. I haven’t had time to look at all the signals, but the extra power source seems to affect the rise time of the reset signal RC circuit enough to practically always win the race.
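For the GPIO reset approach described above (check dmesg, then drive the pin low), here is a rough sketch of what the script could look like. It assumes the legacy sysfs GPIO interface is available, that it runs as root, and that the jumpered pin is GPIO 60 (a number borrowed from elsewhere in this thread; substitute your own). The function names are made up for illustration.

```python
import re
import subprocess
from pathlib import Path

# Message text as quoted in this thread; wording may differ on other kernels.
BAD_MASK_RE = re.compile(r"detected phy mask fffffffb")

GPIO_NUMBER = 60  # assumption: the GPIO you jumpered to SYS_RESETN (P9-10)
SYSFS_GPIO = Path("/sys/class/gpio")  # assumption: legacy sysfs GPIO is enabled


def phy_mask_is_bad(kernel_log):
    """Return True if the telltale PHY mask message appears in the log text."""
    return BAD_MASK_RE.search(kernel_log) is not None


def drive_reset_low(gpio=GPIO_NUMBER, base=SYSFS_GPIO):
    """Switch the GPIO from input to output-low, pulling SYS_RESETN down."""
    gpio_dir = base / "gpio{}".format(gpio)
    if not gpio_dir.exists():
        # Ask the kernel to expose the pin through sysfs.
        (base / "export").write_text(str(gpio))
    # Writing "low" to direction makes the pin an output with initial value 0.
    (gpio_dir / "direction").write_text("low")


def main():
    """Check the kernel log once and reset the board if the PHY looks bad."""
    log = subprocess.check_output(["dmesg"], universal_newlines=True)
    if phy_mask_is_bad(log):
        drive_reset_low()  # the board resets here, so nothing runs after this
```

You would call `main()` from a boot-time service or an rc script. Because the reset pin also resets the ARM, the script simply runs again on the next boot until the message no longer shows up.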
If you have this problem and only care about solutions, jump to “workarounds” below.
RECAP
For unlucky souls who come fresh upon this problem and don’t want to read through the better part of a decade’s worth of conflicting reports…
Due to a design issue, the BeagleBone Black and its descendants intermittently come up with bad state in the physical network connection chip (PHY) that makes the wired Ethernet port inaccessible. There is no way to get it to recover using only software - a power cycle or hardware reset is required.
One of the ways that the PHY can have bad state is that its address can be assigned a different value than expected. The latest versions of the kernel will scan all possible addresses and find the PHY no matter what address it happens to get, so this failure mode is no longer part of the issue as long as you use one of these new kernels. (BTW, I have an elegant solution to reassign the PHY back to the expected address which will work with any kernel version if you need it. It also avoids the current kluge that hacks up the device tree to match the newly found PHY address.)
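To see which address the PHY actually landed on, you can decode the mask from the "detected phy mask" line in dmesg. This sketch assumes the printed mask has a 0 bit for every MDIO address where a PHY answered, which is my reading of the driver message and not something confirmed in this thread. The function name is just for illustration.

```python
def phy_addresses_from_mask(mask_hex):
    """Return the MDIO addresses (0-31) whose bit is 0 in the printed mask."""
    mask = int(mask_hex, 16)
    return [addr for addr in range(32) if not (mask >> addr) & 1]


print(phy_addresses_from_mask("fffffffe"))  # [0]: PHY at the expected address 0
print(phy_addresses_from_mask("fffffffb"))  # [2]: PHY came up at address 2
```

Under that assumption, a mask of fffffffb means the PHY answered at address 2 instead of the address 0 that the device tree expects.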
There are still some bad states that the PHY chip can come up in that are not addressed by the new kernel. As far as I know there is no software only workaround for these - a power cycle or hardware reset is required.
In my personal experience, the bad state seems to be significantly less likely when the board is powered through the barrel connector (or USB on the BeagleBone Green) than when it is powered via the pins on the P9 header. I’ve also noticed that most people in this thread are powering their boards via a cape or header-connected power supply, which makes sense since these people would tend to see the problem more often. Note that the non-recoverable bad state can still happen even on a board powered via the barrel - it is just less likely.
In my personal experience, the bad state seems to be more likely on certain individual boards than others. I have a board that comes up in the bad state about 50% of the time, while other boards only come up in the bad state 1 in 100 times.
In my experience, the bad state seems to be significantly less likely if nothing is connected to the Ethernet port at power up. I really mean not connected - even if there is an unpowered device connected to the other end of the network cable, the bad state occurs more often. The cable must be unplugged at one end or the other.
Bit 13 in register 18 seems to be a 100% indication that you are in the bad state. I have never seen a board with that bit set recover, and I have never seen a non-recoverable board without that bit set (except for a couple of seconds if you manually clear it before it sets itself on again). This bit is “reserved” in the datasheet and so far no hints from Microchip as to what it might mean that might lead to a better understanding of the issue.
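If you want to use the bit 13 observation in a script, the check itself is a one-liner. How you read PHY register 18 is up to you (an MDIO tool, a small ioctl program, etc.); this sketch only takes the 16-bit value as an integer. The function name and the sample values are made up for illustration.

```python
BAD_STATE_BIT = 13  # bit 13 of PHY register 18, per the observation above


def phy_in_bad_state(reg18):
    """Return True if bit 13 is set in the value read from register 18."""
    return bool(reg18 & (1 << BAD_STATE_BIT))


# Hypothetical register values, only to show the bit test:
print(phy_in_bad_state(0x2000))  # True  (bit 13 set)
print(phy_in_bad_state(0x00E1))  # False (bit 13 clear)
```

Since the bit is "reserved" in the datasheet, treat this as an empirical indicator only.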
In the bad state, it is possible to get the PHY to link by manually configuring it to 10 Mb/s half duplex (no auto-negotiation). While the link light comes on and the “link active” bit is set, it does not appear to be decoding incoming packets, so this is not a useful workaround.
WORKAROUNDS
In order of effectiveness/desirability.
1. Use a different board. All the commercially available BeagleBone Black and descendants share this design issue, so look at maybe the Raspberry Pi or one of the other ARM-based SBCs.
2. Spin your own version of the board. This problem could be completely resolved by adding a connection between the reset pin of the PHY and a GPIO on the ARM. This way the ARM would be able to carefully control the required timing sequence for bringing up the PHY chip - and also be able to hardware reset the chip in case there are any problems.
3. Use a USB Ethernet adapter rather than the on-board eth0 port. Compatible adapters can be found for less than $10.
4. Connect a GPIO pin to the reset pin on header P9. That reset pin is tied to the hardware reset pin of the PHY chip, so you can reset it under software control. GPIO 60 happens to be very close physically, making for a very easy jumper connection. Then you need a script to test for the bad state and activate the GPIO to reset if it is found. Note that the reset pin will also reset the ARM, so the BB will reboot every time you do this, but it should eventually come up (and stay up) with the PHY in the good state.
5. Unplug the Ethernet port during power up, check for the bad state after the board comes up, and keep power cycling it until it comes up in a good state, then reconnect the network cable.
6. Power the board through the barrel or USB rather than through the headers.
Through a combination of 5 & 6, I was able to get my bank of boards to come up with a better than 80% good state rate on the first try. Yona Applegate (of LEDscape fame) reports being able to get his large collection of BBBs to all come up with good networking 100% of the time using #4, although the amount of time it takes for all boards to get to the good state is indeterminate.
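To get a rough feel for how long "indeterminate" is with the reset-until-good approach, here is a back-of-envelope sketch. It assumes every boot is an independent coin flip with a fixed per-board probability of coming up bad, which is a simplification I have not verified (boards clearly differ from each other). The function names are made up.

```python
def expected_boots(p_bad):
    """Average number of boots until one board comes up good."""
    return 1.0 / (1.0 - p_bad)


def prob_all_good(p_bad, boards, boots):
    """Chance that every board has come up good within `boots` attempts each."""
    return (1.0 - p_bad ** boots) ** boards


print(expected_boots(0.5))         # worst board mentioned above: 50% bad
print(prob_all_good(0.5, 72, 13))  # a bank of 72 such boards, 13 boots each
```

Even if every board in a bank of 72 were as bad as my worst one, a dozen or so automatic reboots would be enough almost all of the time under this model. With boards that fail 1 in 100 boots it converges much faster.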
FUTURE DIRECTIONS
There are likely other workarounds possible if someone wants to invest more time working on this issue.
I am happy to try and help anyone who wants to dig in deeper. I personally would love to not have to unplug/replug 72 Ethernet cables every time I have to power cycle my bank of BBBs!
After removing C24 and C30 (next to the large unpopulated 20-pin header P2 on the bottom of the board) we ran 1000 power cycles and had a 100% success rate - i.e. the board booted and the PHY was detected every time.
We used a programmable power supply and some scripts processing the UART output to count observed instances of “libphy: PHY 4a101000.mdio:00 not found” and “net eth0: phy found : id is : 0x7c0f1”, and momentarily interrupted the power supply after seeing either.
We ran the same test on an unmodified board and had a failure rate of 54/1000.
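For anyone who wants to repeat this kind of test, the log-processing side could look roughly like the sketch below. This is not the script used for the numbers above. It only classifies a captured UART boot log by whichever of the two quoted messages shows up first, and it leaves the serial capture and power supply control out entirely. The function names are made up.

```python
NOT_FOUND = "libphy: PHY 4a101000.mdio:00 not found"
FOUND = "net eth0: phy found : id is : 0x7c0f1"


def classify_boot(uart_log):
    """Return "fail" or "ok" for the first marker seen, or None if neither."""
    for line in uart_log.splitlines():
        if NOT_FOUND in line:
            return "fail"
        if FOUND in line:
            return "ok"
    return None


def failure_rate(results):
    """Fraction of classified boots that failed."""
    fails = sum(1 for r in results if r == "fail")
    return fails / len(results)


# 54 failures out of 1000 cycles, as reported for the unmodified board:
print(failure_rate(["fail"] * 54 + ["ok"] * 946))
```

A test loop would capture the UART output for each boot, call `classify_boot()` as lines arrive, and cut the power as soon as it returns something other than None.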
I’ve come up with a software only workaround that can make sure a BBB will always come up with a working Ethernet port - although it can take a few minutes and require several automatic internal power cycles.
While neat, I should caution that going into RTC-only mode on an unmodified BBB is rather hazardous. While most power rails shut down in this mode, SYS_5V does not. This is a situation similar to powering the BBB via the battery terminals, and will cause VDD_3V3B to fail to shut down (see [1] for details). This creates a situation where the 3.3V supply of hardware connected to the AM3358 (including various other chips on the BBB itself) remains on, yet the 3.3V I/O supply (VDD_3V3A) of the AM3358 itself is shut off. In this situation, if anything powered from VDD_3V3B drives a signal high (for example the serial buffer if a serial console cable is attached), this will result in serious violations of the Absolute Maximum Ratings (see [2]).
My suggestion would be to try using an external reset circuit that keeps nRESET low for a while during power-up (maybe combined with a stronger pull-up to make nRESET rise more cleanly when it is deasserted, despite the obnoxiously large cap).
We have a few Industrial BeagleBone Blacks from Element 14, Rev C (PCB revision B6).
This version also has the problem of the Ethernet not coming up on every boot.
The OS used is QNX, so the software fix provided in the Linux kernel cannot be used.
Is there a solution?
-Geetha
QNX may still use U-Boot or an equivalent.
QNX can kill and restart a driver. Would require something to determine the Ethernet was not working and restart it.
It seems like there are/were two problems:
Random RX characters on console/debug UART interrupting normal boot
Ethernet chip not resetting, because the reset pulse is too short
There are several routes to boot or reboot, which clouds the issue.
UART Problem
The solutions and theory are…
Connecting an FTDI USB-to-UART cable, which holds the RX line at idle (3.3V), stopped the random characters that triggered the U-Boot command line.
The original hardware fix was to pull the RX line low through a resistor, but this did not fix all boards. Holding RX low keeps the line active, which may cause a “break” interrupt in the UART.
An alternate hardware fix was to pull RX up to 3.3V through a resistor (as an FTDI cable does). This may also allow the chip to be powered by its data line.
The software fix is to change U-Boot so that it only enters the command line on a specific sequence of characters, not just any random character.
Ethernet Chip Reset Problem
The hardware fixes were:
to increase the length of the reset pulse by increasing C24
to add gate to pull down SYS_RESETn when “power good” signal is not
“QNX can kill and restart a driver. Would require something to determine the Ethernet was not working and restart it” is not going to work.
However, you could replicate in QNX the Linux steps, assuming the source code is available for the Linux tools used.
I tried to implement the software solution given in this link:
It doesn’t seem to work. I am using BeagleBone Black with kernel version 4.14.71-ti-r80. The device reboots randomly instead of checking for bad state of Ethernet. Is there anything I am missing?