USB EHCI problems

Michael_Evans · June 4, 2009, 9:17pm

I assume you’ve tried different peripherals (USB device and USB hub) to rule out it being the USB device itself and/or the hub…? What about the other ports on the hub…? Do they drop out too…?

David_Hagood · June 4, 2009, 9:47pm

The USB port on the Beagleboard dies as do all the ports on the hub -
after that, unplugging and plugging back in the hub does nothing: only
a reboot will restore the port.

I've already tried a couple of hubs, and I could get the failures on
both heavy access to a USB memory stick and to a USB to Ethernet
adapter.

Right now we are performing experiments with my Beagleboard in one of
our environmental chambers: we are currently running it at 0C to
reproduce my tests, but in moving the board to the chamber we
unavoidably had to change the configuration, so we are sequencing
through:

Different hubs.
Different power supplies.
Different devices on the bus in addition to the memory stick (mouse,
Ethernet device), etc.
Presence/absence of a device on the HDMI port.

Right now, things aren't failing - which is puzzling because I had
100% reproducibility before. However, that might be a GOOD thing if I
can work out what variable caused the change.

If I can't get it to start failing, I'll try to get to exactly the
configuration I had in my office, then start simplifying. Failing
that, I'll take the chamber back to 20C, and then up to 40C.

Hopefully we can characterize this enough to at least make it
reproducible.

Gerald_Coley1 · June 4, 2009, 10:18pm

Request an RMA immediately!

Gerald

Gerald_Coley1 · June 5, 2009, 3:12pm

The board will be replaced and your board will be evaluated along with the other two boards we have. On the other two boards, we could not get them to fail, but in both those cases, the replacement boards worked fine. So, this is not something that is in all boards.

Gerald

David_Hagood · June 5, 2009, 3:17pm

(NOTE: I pulled my previous message, as I had a couple of errors in it
that I wanted to correct...)

Well, before I do that, I'd like to characterize, as best I can, what
is going on.

I have some more data:

My board, held in the chamber at 0C, configured as I had it in my
office, did fail eventually, but it was MUCH more reliable than it was
at ambient. It took many hundreds of runs of my test (dd if=/dev/zero
of=/media/disk/zeros bs=1024 count=100000) before it failed with the
disabled message.
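
The repeated test above could be scripted roughly as follows. This is only a sketch: the function name stress_write and its three parameters are illustrative and do not come from the original post, and it assumes that dd exits with an error once the port has been disabled.

```shell
#!/bin/sh
# Repeat the dd write test until dd fails or the run limit is reached.
# stress_write and its arguments are hypothetical names for illustration.
stress_write() {
    target=$1      # file on the USB stick, e.g. /media/disk/zeros
    count=$2       # number of 1024-byte blocks per run (100000 in the post)
    max_runs=$3    # stop after this many successful runs
    run=0
    while [ "$run" -lt "$max_runs" ]; do
        if ! dd if=/dev/zero of="$target" bs=1024 count="$count" 2>/dev/null; then
            echo "failed on run $((run + 1))"
            return 1
        fi
        sync    # flush cached writes so the data actually goes out to the device
        run=$((run + 1))
    done
    echo "completed $run runs without error"
    return 0
}

# Example call matching the test in the post, limited to 1000 runs:
#   stress_write /media/disk/zeros 100000 1000
```

The run number printed on failure gives a rough figure for comparing temperatures or hub configurations.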

When we brought the chamber up to 25C it died on the second run of the
test.

We are currently running the same test with a different Beagleboard,
but otherwise the same configuration and at 25C.

For reference, my configuration is:

Beagleboard on a 5V, 2A supply purchased from DigiKey.

RS-232 on the board connected to an external serial terminal.

8G Class 6 SDHC with Ubuntu on it.

USB host port driving a 4 port, self-powered hub with the USB memory
stick, keyboard, mouse, and a second 4 port hub on it.
Second 4 port self powered hub driving a Wii USB to Ethernet device
(no network connected).

HDMI port connected to a flat-panel interface to a 12" flat panel.

I'm going to let the second board run in the chamber at 24C for
another hour, then I will put my board back on the bench and run my
tests with everything but the serial port, SDHC, and USB memory on it.

After I do that, I will post my results.

David_Hagood · June 5, 2009, 3:20pm

Yes, that checks with the boards we have - mine dies, a couple of
others don't. I'd like to try to get as much information as possible
to allow others to have a better shot at reproducing the problems.

Is there anything I should check on the two boards I have (e.g. lot
numbers on the boards and/or parts) that would be of use in
troubleshooting this?

Gerald_Coley1 · June 5, 2009, 3:29pm

Well, it sounds like you are adventurous. If you download the Allegro files, you can find a series of test points on the board that carry the signals from the OMAP3530 to the SMSC PHY. You can scrape off the solder mask and then probe these signals. You may need to probe these points to detect differences between the good and the bad boards.

Another idea is to focus on the PHY, cool it and heat it to see if there are any changes in behavior.

Gerald

David_Hagood · June 5, 2009, 4:07pm

I can do that pretty easily.

The second board ran fine at 25C, so there are some board-to-board
variations.

OK, what I propose to do is:
1) Set my board back up on the bench here in my office.
2) Try cooling the PHY down with freeze-spray and run my tests.
3) Try cooling the OMAP down, holding the PHY at ambient.
4) Try my tests at ambient after disconnecting various pieces of
hardware, to try to simplify the test case down.

I don't think I'll go so far as to probe the signal lines - while the
OMAP and the USB support is important to work, I have a lot of other
items that are also important, and that other folks AREN'T working on.

However, I will post my results, and can include them on paper when I
return the board under RMA.

Hopefully that will be enough to help you guys to reproduce the issue.

Gerald_Coley1 · June 5, 2009, 4:08pm

Thank you!

Gerald

Marcus_Bauer · June 5, 2009, 4:28pm

Similar/same problem here. The USB stops working after some time -
usually between half an hour and a day.

This also happens with no activity on the USB, i.e. only a keyboard and
a mouse connected; a reboot is then needed to bring USB back. However
I am using Debian and there is a hint on the elinux Wiki that USB is
"flaky" on the revC boards, so maybe it is a kernel problem?

FWIW, uname -a :

Linux beagle 2.6.28-oer17 #1 Wed Mar 25 06:26:12 UTC 2009 armv7l
GNU/Linux

I could still run a test with the Angstrom images.

Marcus

John_Beetem · June 5, 2009, 4:32pm

Just speculating here...

I wonder if it's possibly a bad solder joint? One of the nasty
problems with BGAs is that your connections between boards and ICs are
made with solder which fractures instead of flexing when there are
thermal mismatches. This is a serious issue in desert and space
applications where you may have equipment that goes through 50-100C
temperature swings at least once a day. You would normally not see
this with BGAs as small as the OMAP and the SMSC PHY, particularly
over only a 25C swing. However, if one of those tiny solder balls
wasn't soldered properly the first time, its conductivity could vary
over temperature due to thermal mismatch of the IC and the board.
This is very hard to diagnose, though JTAG can help if it's
implemented.

One failure mode is for a ball to switch from being a conductor to a
capacitor, so AC signals can get through provided the load is
extremely low impedance. When you try to diagnose it with a 'scope
probe, the probe's capacitance changes the behavior resulting in a
"Heisenbug".

As I said, just a speculation for readers' entertainment. I'm still
voting for a timing issue. If it were a solder problem, we'd probably
be seeing defects all over the place instead of just the EHCI USB
port.

John

Gerald_Coley1 · June 5, 2009, 4:36pm

You may have a point there. This is a nasty part in that it is small and has a 0.4mm pitch. It is a bear to work with. My problem is that I can’t get these boards to fail, so if I can find one that fails, I will have it reflowed to see if it solves the problem.

Gerald

David_Hagood · June 5, 2009, 5:39pm

OK, I have enough results to, I hope, enable you to work out the
issues:

Here are my tests:

all parts at ambient:
beagle -> hub (mem) -> hub (key, mouse, ethernet) : passed
In other words, the Beagle was driving a self-powered hub with the USB
memory on it, and that hub was in turn driving a second (also powered)
hub with keyboard, mouse, and Ethernet on it.

It passed my 100MB write test more than 10 times without error.

all at ambient:
beagle -> hub (mem,key,mouse) -> hub (ethernet) : failed immediately!
This makes me wonder if there is something about having multiple
devices with interrupt endpoints on the same device.

PHY cooled:
beagle -> hub (mem,key,mouse) -> hub (ethernet) : failed immediately!

OMAP cooled:
beagle -> hub (mem,key,mouse) -> hub (ethernet) : failed in 2 runs
This is likely the same result as with the PHY cooled - the difference
between one run or 2 runs is pretty marginal.

all at ambient:
beagle -> hub (mem,key,mouse,ethernet): failed immediately.
This would tend to remove the second hub as an issue.

all at ambient:
beagle -> mem : passed
I wonder if this is more like the test cases the folks at Beagleboard
are running - perhaps they aren't using a hub?

all at ambient:
beagle -> hub2 (mem,key,mouse,ethernet):
Test 1: ( BUG: soft lockup - CPU#0 stuck for 61s! [aplay:2366])
I don't have an explanation for this.
Test 2: failed before booting complete. By this I mean the system had
the "port disabled (EMI)" message before I even started my tests.

all at ambient:
beagle -> hub2 (mem) : Failed.

all at ambient:
beagle -> hub (mem) : Failed after a longer time.
These were 2 tests to check if there needed to be other devices on the
bus.

PHY cooled:
beagle -> hub (mem) : passed (more than 10 reps)

PHY cooled:
beagle -> hub (mem,key,mouse,ethernet): failed

My hypothesis is that there needs to be more than one device on the
bus, preferably many devices with interrupt endpoints.
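
One way to check this hypothesis against a given setup is to count the interrupt endpoints on the bus. The sketch below assumes the usbutils lsusb tool is installed on the board; the function name count_interrupt_endpoints is made up for illustration. It counts the "Transfer Type ... Interrupt" lines that lsusb -v prints for each endpoint descriptor. Hubs, keyboards and mice normally each have at least one such endpoint, while a plain memory stick normally has only bulk endpoints.

```shell
#!/bin/sh
# Count endpoint descriptors of type Interrupt in `lsusb -v` output on stdin.
# count_interrupt_endpoints is a hypothetical helper name.
# Note: grep -c prints 0 but exits with status 1 when nothing matches.
count_interrupt_endpoints() {
    grep -c 'Transfer Type[[:space:]]*Interrupt'
}

# Typical use on the board (may need root to read all descriptors):
#   lsusb -v 2>/dev/null | count_interrupt_endpoints
```

Recording this count next to each pass/fail result would show whether failures track the number of interrupt endpoints.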

OK, so: what do I do for the RMA?

Gerald_Coley1 · June 5, 2009, 6:34pm

You are correct. The test that is done at the factory is the memory thumbdrive only.

As to an RMA, go to http://beagleboard.org/support/rma

Gerald

David_Hagood · June 5, 2009, 9:09pm

You are correct. The test that is done at the factory is the memory
thumbdrive only.

You may want to add a powered hub, keyboard, and mouse. Also, do you
really try to beat upon the stick as I have been (copying a 100MB
file), or are you just doing a small write?

As to an RMA, go to http://beagleboard.org/support/rma

Sent, awaiting reply. I'll see if I can get it out this weekend. I'm
assuming putting it in the box and putting the box in a padded
shipping envelope should be enough?

Gerald_Coley1 · June 6, 2009, 12:16am

That will be fine!

Gerald

Frans_Meulenbroeks · June 6, 2009, 12:22pm

See also my earlier message on EHCI problems
http://groups.google.com/group/beagleboard/browse_thread/thread/fb3caeb7ffdcc02f/6a7a0b2a21317538?lnk=gst&q=strange+ehci#6a7a0b2a21317538

I also have problems with the keyboard and I get corrupted data with
my USB 1.1 pwc webcam (connected through a hub of course).

I was under the impression this was a SW issue.

Frans

Duckyduck · June 6, 2009, 6:13pm

Hi Gerald,

I have a board with exactly this problem: heavy I/O traffic through the
USB HOST will generate the "disabled by hub (EMI?), re-enabling"
error. Using dd to move chunks of 100MB to /dev/null will make the USB
crash after 200MB.
There's no difference between connections through a USB hub or
directly connected.
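
For anyone comparing boards, the number of times this error shows up in the kernel log can be counted with a small helper. This is a sketch only: the function name count_port_disabled is invented here, and it matches on the fixed text "disabled by hub (EMI?)" quoted above.

```shell
#!/bin/sh
# Count "disabled by hub (EMI?)" events in kernel log text read from stdin.
# count_port_disabled is a hypothetical helper name.
# -F treats the pattern as a fixed string, so "(" and "?" are matched literally.
count_port_disabled() {
    grep -cF 'disabled by hub (EMI?)'
}

# Typical use after a test run:
#   dmesg | count_port_disabled
```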

What do you suggest: send it for RMA, or wait a few weeks until more is
known? Because I live in the Netherlands, it will cost quite a bit to
send it back to the USA.

Wkr,
Joep

Rob_Walker1 · June 6, 2009, 11:17pm

My beagleboard has similar issues. I'm in the UK, so if I RMA it, will I have
to pay import duty + VAT again for the replacement?

Rob

Gerald_Coley1 · June 7, 2009, 12:28am

You should not have to. It is marked as an RMA repair. It is unclear at this time if this is a SW or HW issue. We only have two boards in house that have reported the issue, and we can’t get them to fail. So, there may be something a little off that shows up more often on some boards than others. We have replaced some boards and they seem to work.

It is your call as to whether or not you want to send it in for replacement or to wait and see what happens in the SW realm. You can move to the OTG port for host functions if you like in the meantime.

Gerald