BeagleBone Black: Ethernet transmits packets but does not receive them

Thomas_Laskowski · July 16, 2013, 9:24pm

It scares me how quiet this group has been about this topic. Isn’t anyone else experiencing this issue out there? We’re considering switching to the original Beaglebone, because we are running out of time on our project. Does anyone know if the original Beaglebone is stable? Thanks,

-Tom

Gerald_Coley1 · July 16, 2013, 9:29pm

I have had no issues in this area. I cannot make it fail; unplugging and plugging it back in won’t make it fail. I am not sure what needs to be fixed if it keeps working. The Ethernet HW is exactly the same as on the BeagleBone.

You have two options. Request an RMA and let us look at it. If the HW has a failure, we will fix it. If not, we can tell you and see what might be wrong on your end.

The second option is to do nothing and see what others may find.

I know we got one board in with this issue on an RMA. It worked fine.

Gerald

Thomas_Laskowski · July 16, 2013, 9:53pm

Thank you for your input, Gerald. I don’t understand how it works for you guys. I am using a gigabit switch and a gigabit router, but the board always boots with 10 base T. The 100 base T light is off. I don’t know what to do. Requesting an RMA doesn’t seem to make sense, because it will probably work for you.

-Tom

Gerald_Coley1 · July 16, 2013, 9:59pm

It may. It may not. We have had a couple of PHYs go bad. Go ahead and request the RMA.

Gerald

Charles_Steinkuehler · July 16, 2013, 10:10pm

I am seeing some issues here. I do get a 100 MBit link, but I see
issues sometimes on reboot and when trying to reestablish link.

I'm virtually certain it is not a hardware issue...it "feels" like a
reset issue with the device driver, but I haven't had time to try and
track down the issue. For instance, occasionally DHCP will fail to
acquire a lease and the ethernet will be 'wedged' and not work. Of
interest is the phy is reset when the startup scripts launch the dhcp
client and if things are working properly I get a "link status up"
message from the kernel after the first DHCP packet was transmitted.
When DHCP fails, I don't get the link up message and I get the "no Rx
packets" behavior you describe.

Of course, I'm running Debian with a Xenomai patched kernel, so
despite the fact that I think the issue is related to the port of the
AM335x Ethernet driver to the 3.8 kernel, the "official" BeagleBone
folks are going to ignore any issues I have because I'm so far "off
the reservation".

I suspect you had good results with your BBW (as I did) because it's
running the 3.2 kernel.

I fully support the decision to use the 3.8 Kernel on the BeagleBone
Black, but it is definitely still a bit rough around the edges. I
also understand there might be a bit of back-story that explains why
the 3.8 stuff was not in better shape when the 'Black shipped, but
it's just hearsay.

--
Charles Steinkuehler
charles@steinkuehler.net

Gerald_Coley1 · July 16, 2013, 10:29pm

We will be watching for the eskimobob board!

Gerald

Thomas_Laskowski · July 16, 2013, 11:03pm

Sorry Gerald, I think I accidentally replied directly to you…

Gerald_Coley1 · July 16, 2013, 11:14pm

Sounds good. Keep us updated!

Gerald

Charles_Steinkuehler · July 16, 2013, 11:33pm

I'm seeing identical results in the official Angstrom release and my
hacked MachineKit Debian-based images. Some specifics on the problem:

With both images, if I ping the gateway and remove then re-insert the
Ethernet cable, everything behaves as expected. Packets get dropped
while the cable is unplugged, and the phy comes back up properly when
the Ethernet cable is reconnected.

The problems come when trying to bring an interface down and back up
again. In Angstrom (I tested the 6/20 release):

* Click on the "funny icon to the left of the date"
* Select "Properties"
* Click on "Wired Networks"
* Click the "Disable" button
* Wait a moment
* Click on the "Enable" button

In Debian just:

ifdown eth0
ifup eth0

After this sequence of events (bring the interface down and back up),
with *BOTH* distributions I see exactly the same behavior. I'm
getting DHCP packets at the DHCP server, replies are going out to the
'Bone, but they are not getting received. Ditto for various other
traffic (e.g. ARP requests/replies).

I have played a bit with both distributions trying to get the broken
networking up from the command line. I am no newbie at this...I've
been using various network configuration commands at a _very_ low
level since the late 1990's with the Linux Router Project. I have so
far not been able to get any farther than the automated tools, with Tx
working and Rx dead. Tx counters in /proc/net/dev are incrementing,
but the Rx counters are stuck. I suspect unloading the Ethernet
driver and reloading it might fix the issue, but it's not compiled as
a loadable module. :-/
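
For anyone who wants to watch the same counters, here is a minimal Python sketch of the check described above: take two snapshots of /proc/net/dev and see whether Tx packets are climbing while Rx packets stay flat. It assumes the usual /proc/net/dev layout (two header lines, then sixteen counters per interface). The function names and the result labels are made up for this example and are not part of any existing tool.

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_packets, tx_packets)} from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        # Fields 0-7 are receive counters, 8-15 are transmit counters;
        # the packet count is the second field of each group.
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def classify(before, after):
    """Compare two (rx_packets, tx_packets) snapshots of one interface."""
    rx_delta = after[0] - before[0]
    tx_delta = after[1] - before[1]
    if rx_delta > 0:
        return "rx-ok"
    if tx_delta > 0:
        return "tx-only"  # the pattern described in this thread
    return "no-traffic"


def sample(iface="eth0", interval=5.0, path="/proc/net/dev"):
    """Take two snapshots `interval` seconds apart and classify them."""
    with open(path) as f:
        before = parse_net_dev(f.read())[iface]
    time.sleep(interval)
    with open(path) as f:
        after = parse_net_dev(f.read())[iface]
    return classify(before, after)
```

Calling `sample("eth0")` on the board while something is generating traffic (a DHCP client retrying, or a ping to the gateway) should be enough to tell the "Tx working, Rx dead" state apart from a healthy link.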

So this pretty much exactly matches the failure described in earlier
e-mails, the behavior is exactly the same between two completely
different OS installs and two different kernels (although they share
the same BeagleBone specific patch set).

Smells like a software driver bug to me...

Gerald: Can you test disabling and enabling the driver via the GUI on
a default install? I'd _really_ like to know if you see the same
problem or if it works for you.

--
Charles Steinkuehler
charles@steinkuehler.net

Charles_Steinkuehler · July 16, 2013, 11:35pm

A correction to the Angstrom steps in my previous message:

* Select "Properties"

Oops..."Properties" above should be "Preferences"

--
Charles Steinkuehler
charles@steinkuehler.net

Thomas_Laskowski · July 16, 2013, 11:39pm

I get the same result when using ifconfig eth0 down/up on a stock image. But unplugging then plugging back in, or plugging in a cable after a boot works fine.

-Tom

Thomas_Laskowski · July 17, 2013, 4:15pm

The “bad” board is indeed bad. I flashed the latest Angstrom image onto eMMC and it fails the cable unplug test. The other board works fine so far.

-Tom

eskimobob · July 25, 2013, 4:04pm

Just had an email to say that the Beagle Hospital has now received my RMA board and are going to run it through some tests.

I’m hoping they find a software issue that can be fixed because I have two more BBBs here that exhibit the same problem.

Gerald_Coley1 · July 25, 2013, 5:44pm

The Hospital only handles HW issues and repairs any damaged or failed parts. You won’t be seeing any SW solutions from the Hospital.

Gerald

Thomas_Laskowski · July 25, 2013, 10:50pm

My other board is still working. Running Debian Wheezy. I also sent a board for RMA and they said there was a short on the Ethernet chip. They fixed it and are sending it back.

-Tom

eskimobob · July 26, 2013, 10:29pm

Thanks Gerald, that makes sense. I had hoped that they would at least recreate the problem and say whether it was hardware or software. However, I have just heard:

“Since the software in your unit is 2013-06-20, we have run our production tests on it without reflashing its eMMC. Everything is working fine. We also performed manual ping and iperf tests on that unit and re-created your failure scenarios (hot plugging Ethernet cable on start up, removing and hot plugging). We haven’t seen anything unusual so far. It is now in Ethernet burn-in test. We will keep you updated as we find out anything new.”

Based on that feedback, it seems likely that they will not see the problem. That leads me to wonder again whether it has something to do with my setup - e.g. router.
Can anyone think of any other useful tests I can suggest to them that might reveal the problem? I intend to check that when removing the LAN cable, they are waiting at least 5 secs before hot plugging. I will also suggest they try ifdown eth0 followed by ifup eth0.
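
A rough Python sketch of that last test, in case it helps anyone repeat it consistently: bring the interface down, wait, bring it back up, then check whether the receive packet counter moves. It assumes a Debian-style system where `ifdown` and `ifup` are available and run as root, that the kernel exposes `/sys/class/net/<iface>/statistics/rx_packets`, and that something is sending traffic to the board during the listen window (for example a ping from another machine). The function names, the 5 second pause and the 10 second listen window are choices made for this example only.

```python
import subprocess
import time
from pathlib import Path


def read_rx_packets(iface, sysfs_root="/sys/class/net"):
    """Read the kernel's received-packet counter for one interface."""
    path = Path(sysfs_root, iface, "statistics", "rx_packets")
    return int(path.read_text())


def cycle_and_check(iface="eth0", pause=5.0, listen=10.0,
                    run=subprocess.call, sleep=time.sleep,
                    read_rx=read_rx_packets):
    """Cycle the interface with ifdown/ifup; return True if Rx still counts.

    `run`, `sleep` and `read_rx` are parameters so the logic can be tried
    out without root access or a real interface.
    """
    run(["ifdown", iface])
    sleep(pause)                 # give the link time to drop
    run(["ifup", iface])         # return code ignored: DHCP may time out
    before = read_rx(iface)
    sleep(listen)                # wait for incoming traffic
    return read_rx(iface) > before
```

A `False` result only means no packets were counted during the listen window, so it is worth confirming there really was traffic on the wire before blaming the board.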

Martin

Gerald_Coley1 · July 27, 2013, 1:20am

That is what I see as well. It could be a protocol issue or, as you say, something in your setup. Maybe try to simplify the scenario, like connecting it to a PC. No network, fix the IP address and see how it looks.

Gerald

eskimobob · July 29, 2013, 7:40pm

Only just getting time to play with this again…

In order to try out Gerald’s suggestions, I set it to static IP using Derek Molloy’s blog instructions (here). Note: I’m still testing for the moment on my router. I was surprised to find that unplugging the LAN, waiting more than 10 seconds then hot-plugging it resulted in automatic reconnection to the router, something I have never seen before.

Ok, I thought it must be something weird when configured to use dhcp instead of static address. I therefore used set-ipv4-method to change back to dhcp but found that no matter what, I could not connect to my network.

Although it had clearly set the mode to dhcp, I could not find out how to use connman to remove the nameservers that I had set. I therefore edited the settings file manually to remove the nameserver info then rebooted. Now, it comes up in dhcp mode and connects happily to my router time and again regardless of how many times I unplug and hot-plug the LAN cable :-/

Something is apparently different - but what?
Off to do some more investigation.

nemanja · July 29, 2013, 10:44pm

Exactly eskimobob, that’s what I see too. Switching between static and DHCP usually requires a restart of the networking interfaces and service. That’s what always kills it for me and explains why it worked when you rebooted. But doing it while BBB is running causes the networking portion to fail.

eskimobob · July 30, 2013, 8:01am

After pulling my hair out trying to get back to the “not-working” situation, I came to the conclusion that something else must have changed in my setup. It turns out that my router has 3 x 10/100 LAN ports and 1 x 10/100/1000 LAN port. Between my previous testing and my subsequent tests to try a static IP, I had changed the port that the BBB LAN cable was attached to.

I have now confirmed that when connecting to the 10/100 port, I see the problem but when connecting to the 10/100/1000 port, I see absolutely no problem whatsoever.
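
Since the difference now seems to be tied to the port type, it may be worth recording what the link actually negotiated on each port. Below is a minimal Python sketch that reads the standard Linux sysfs attributes for an interface. It assumes the kernel on the board exposes `operstate`, `carrier`, `speed` and `duplex` under `/sys/class/net/<iface>/`; some of these can be unreadable while the interface is down, and some drivers report `operstate` as "unknown", so missing values are returned as `None`. The function names are illustrative only.

```python
from pathlib import Path

ATTRIBUTES = ("operstate", "carrier", "speed", "duplex")


def read_attr(iface, name, sysfs_root="/sys/class/net"):
    """Return one sysfs attribute as a string, or None if it cannot be read."""
    try:
        return Path(sysfs_root, iface, name).read_text().strip()
    except OSError:
        return None  # missing file, or the driver refused the read


def link_report(iface="eth0", sysfs_root="/sys/class/net"):
    """Collect the link attributes for one interface into a dict."""
    return {name: read_attr(iface, name, sysfs_root) for name in ATTRIBUTES}


def summarize(report):
    """Turn a link_report() dict into a one-line description."""
    if report["operstate"] != "up":
        return "link not up"
    return "{} Mb/s, {} duplex".format(report["speed"], report["duplex"])
```

Running `summarize(link_report("eth0"))` once on the 10/100 port and once on the 10/100/1000 port would show whether the two ports end up with different speed or duplex settings.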

I am going to do some more investigation this morning but at least now I have something to focus on and I can reliably recreate the problem and reliably remove the problem.

My plan is to try fitting a 10/100 switch between the working port and the BBB to see if that affects things. If anyone has any further suggestions for things to try that might help resolve what is causing this, please shout…