BBB and BBW Ethernet problems

I have noticed a very rare but extremely annoying problem with Ethernet connectivity whereby the PHY layer appears to suddenly become unresponsive. This has occurred on both BBW with 3.2 kernels and BBB with 3.8 kernels.

The only way I have found to get out of this problem is a hard reset by cycling power. Removing and reattaching the Ethernet cable, restarting the connman service, and soft rebooting all do not clear the issue.

I think this may be related to the previous thread "Beaglebone Black Ethernet transmits packets but does not receive them". I have started this thread as I believe the problem is also present on the BBW.

My guess is that the PHY is getting stuck in a state that only a hard reset will fix. A cursory glance through the kernel does not reveal an obvious way of resetting the PHY from software. Any suggestions welcome.

This problem is very annoying on the bench, and potentially disastrous for unattended systems.

I am pretty sure that this is not a power supply issue.

Best regards,

Dave.

Pushing the reset button does not clear the issue?

Gerald

I am not sure, Gerald. This is pretty rare, and I cannot swear one way or the other. This would still be disastrous on an unattended system. We are in the process of designing an intelligent battery-powered cape which contains its own watchdog and will cycle power after initiating an orderly shutdown. This should fix the problem, but I would prefer to find a less heavy-handed approach which does not involve restarting the entire system.

You are missing my point. There are two aspects here: prevention and recovery. I am looking first at recovery. Is the issue the PHY or the processor? If reset does not fix it, then it is not the PHY, as that is being reset. So it has to be in the processor, where a warm reset does not fix it because the SW does not know to fix it. A full reset resets everything. So knowing whether a warm reset fixes it is the first step in finding out where the issue may be. If a warm reset does fix it, then it is probably the PHY. We can make a change to add a separate reset for the PHY to recover under SW control. Then the next step is to find what is causing the PHY to lock up.

Make sense?

Gerald

Yes, that does make sense. Also I just talked to one of my other engineers and he HAS managed to recover from this by issuing a software halt (so as to be kind to the SD/eMMC file-system), then pressing the reset button. Dave.

Next step would be to modify the board to connect to an external switch that would only reset the PHY when the event occurred and see if that clears the issue.

Gerald

I will look into that. First I will have to find a board that fails “reliably”. Also I will look to see if a software reset is feasible. If so I can easily apply this to several boards. Dave.
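One software route worth trying is the standard MII soft reset: writing bit 15 of the basic mode control register (BMCR, register 0) through the generic `SIOCGMIIPHY`/`SIOCSMIIREG` ioctls. The sketch below is a minimal illustration, not something tested on a BBB or BBW. It assumes the interface is `eth0`, that the Ethernet driver in the kernel in use passes these ioctls through to the PHY (not verified for the 3.2 or 3.8 kernels), and that the PHY still answers on the MDIO bus, which may not hold in the hung state. The helper names are hypothetical.

```python
import fcntl
import socket
import struct

# ioctl request numbers from <linux/sockios.h>
SIOCGMIIPHY = 0x8947  # get the address of the PHY in use
SIOCGMIIREG = 0x8948  # read a PHY register
SIOCSMIIREG = 0x8949  # write a PHY register

MII_BMCR = 0x00       # basic mode control register
BMCR_RESET = 0x8000   # software reset bit, clears itself when done

IFREQ_SIZE = 40       # covers struct ifreq on 32-bit and 64-bit builds


def pack_ifreq(ifname, phy_id=0, reg=0, val_in=0):
    """Build a struct ifreq whose union carries struct mii_ioctl_data."""
    body = struct.pack("16sHHHH", ifname.encode(), phy_id, reg, val_in, 0)
    return body.ljust(IFREQ_SIZE, b"\0")


def unpack_mii(buf):
    """Return (phy_id, reg_num, val_in, val_out) from an ifreq buffer."""
    return struct.unpack_from("HHHH", buf, 16)


def soft_reset_phy(ifname="eth0"):
    """Set BMCR reset on the PHY behind ifname and return BMCR afterwards.

    Needs root. Assumes the driver forwards MII ioctls to the PHY.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        reply = fcntl.ioctl(s, SIOCGMIIPHY, pack_ifreq(ifname))
        phy_id = unpack_mii(reply)[0]
        fcntl.ioctl(s, SIOCSMIIREG,
                    pack_ifreq(ifname, phy_id, MII_BMCR, BMCR_RESET))
        # Read BMCR back; the reset bit should read as 0 once the PHY is ready.
        reply = fcntl.ioctl(s, SIOCGMIIREG,
                            pack_ifreq(ifname, phy_id, MII_BMCR))
        return unpack_mii(reply)[3]
```

Calling `soft_reset_phy("eth0")` as root on an affected board would show quickly whether this path is available: an `OSError` from the ioctl suggests the driver does not support it, and a returned value of `0xFFFF` suggests the PHY is not answering at all.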

Sounds good. Keep me posted.

Gerald

Update. I put some debug printouts in the smsc code and found that when booting from cold, on rare occasions smsc_phy_config_init() is not called, hence no PHY. Soft rebooting does not fix this, but reset or power cycling does. From this I am inferring that the PHY chip is getting into a state where it is not recognized. If this is the case, then there is no way I can think of to perform a software reset. If all else fails I will try sacrificing one of my boards to disconnect the nRST pin and take it to a switch as you suggested, Gerald. With my limited soldering skills, this will be a bit of a last resort! More to follow. Dave.
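A PHY driver's config_init is only called once the PHY has been matched to that driver by its ID, so a missing call is consistent with the ID read at probe time failing. The sketch below shows the check in isolation: the 32-bit ID is built from MII registers 2 and 3, an ID that is almost all ones is treated as "nothing answered", and a driver matches when the masked IDs agree. The ID and mask constants are what I believe the SMSC driver uses for the LAN8710/LAN8720 family; treat them, and the function names, as assumptions to check against the kernel source in use.

```python
MII_PHYSID1 = 0x02  # upper 16 bits of the PHY identifier
MII_PHYSID2 = 0x03  # lower 16 bits of the PHY identifier

# Assumed values for the SMSC LAN8710/LAN8720 family; verify in smsc.c.
SMSC_LAN8710_ID = 0x0007C0F0
SMSC_LAN8710_MASK = 0xFFFFFFF0


def combine_phy_id(physid1, physid2):
    """Join the two 16-bit ID registers into one 32-bit PHY ID."""
    return ((physid1 & 0xFFFF) << 16) | (physid2 & 0xFFFF)


def phy_answered(phy_id):
    """An idle MDIO line reads as all ones, so a mostly-F ID means no PHY."""
    return (phy_id & 0x1FFFFFFF) != 0x1FFFFFFF


def driver_matches(phy_id, driver_id, driver_mask):
    """Compare IDs the way a PHY driver table entry is matched."""
    return (phy_id & driver_mask) == (driver_id & driver_mask)


def describe(physid1, physid2):
    """Summarise what a pair of ID register reads implies."""
    phy_id = combine_phy_id(physid1, physid2)
    if not phy_answered(phy_id):
        return "no PHY answered on this address"
    if driver_matches(phy_id, SMSC_LAN8710_ID, SMSC_LAN8710_MASK):
        return "SMSC PHY detected"
    return "unknown PHY 0x%08X" % phy_id
```

Logging the two raw register values on a cold boot that fails would distinguish a PHY that is silent from one that answers with an unexpected ID.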

Thanks for the update! It may be that the detection code is not giving it enough time to initialize.

Gerald

I guess this would be in U-Boot then?

Not necessarily.

Gerald

Could you please explain further? My reasoning is that once the interface gets into this condition, then no amount of soft rebooting will fix it. Only a hard reboot (button or power cycle) appears to clear the problem. Even if I manually stop the boot process in U-Boot and assign the board a manual IP address, I cannot see packets when I ping another address. To my way of thinking there is only the processor's internal ROM code and MLO preceding this point. Is this where you think the problem may be? Dave.

It could be U-Boot. It could be kernel code. It could be a driver issue. I have no idea exactly where the issue could possibly exist. Once the condition has been created, unless SW knows about it and code is provided to correct it, the condition would, I think, continue to exist. The goal would be to try and prevent the condition if possible.

ROM code does not handle Ethernet. MLO loads U-Boot. I don't see the issue there. But, never say never.

Now, there is a bit in the SYSBOOT pins that sets the mode and, if interfered with, could cause the system to come up in RMII mode instead of MII mode.

Gerald

Gerald,
     Thanks for your excellent attentiveness to this matter. I will continue to research this issue according to your suggestions. In the meantime I will rely on the sledgehammer approach using our UPS/watchdog cape as I mentioned earlier.

Thanks for an excellent product :)

Dave.

I've had this happen once myself, I think. I take it from the cape you're working
on that a watchdog reboot isn't sufficient to get it going again?

Britton

You are correct. The internal watchdog does not release the hung PHY condition. Also, to the best of my knowledge the internal watchdog does not provide a method by which I can gracefully shut down the file systems. I have had several SD card corruptions that I have attributed to the internal watchdog firing during file-system writes.

Dave.
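The software half of the approach Dave describes (detect a dead link, shut down cleanly, then let external hardware cycle power) could look roughly like the sketch below. It is illustrative only: the probe is a plain `ping` to a host of your choosing, the thresholds are arbitrary, and the actual power cycle is assumed to be done by an external watchdog such as the cape mentioned earlier in the thread, since a soft reboot reportedly does not clear the condition. All names here are hypothetical.

```python
import subprocess
import time


class LinkMonitor:
    """Counts consecutive failed probes and flags when to give up."""

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, ok):
        """Record one probe result; return True once shutdown is due."""
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.max_failures


def ping_ok(host, timeout_s=2):
    """Return True if a single ping to host gets a reply."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0


def orderly_halt():
    """Flush file systems and halt; external hardware then cycles power."""
    subprocess.call(["sync"])
    subprocess.call(["shutdown", "-h", "now"])


def watch(host, interval_s=30, probe=ping_ok, sleep=time.sleep,
          halt=orderly_halt):
    """Probe host until the failure threshold is hit, then halt cleanly."""
    monitor = LinkMonitor()
    while not monitor.record(probe(host)):
        sleep(interval_s)
    halt()
```

Running `watch("192.168.1.1")` as root from an init script would halt the board after five consecutive failed pings, which avoids the abrupt reset during file-system writes that the internal watchdog can cause. The address is a placeholder.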

Yeah, they do that. For what it's worth I haven't had any more problems with
the SD card since I switched to an 'industrial' SD card (Apacer
AP-MSD04GCS4P-1TM).

Britton