USB EHCI problems

Hi,

Only the 1.8V. The 3.3V is generated by the PHY, so please do not hook
anything to that or you will blow up the device. You will not be able to
capture any events on the voltage rails and tie those to this issue. It is
something you cannot see using a scope.

I've tried supplying the 1.8V using an external LDO regulator and
cutting the trace. Same result.
If the voltage rails are stable, maybe it is due to some other events.

KP

Just to add a little bit of information to the issue. I've been
running the 2.6.30.2-goe1.2 kernel on my BeagleBoard. I have not been
using it much lately; it has been mostly idle and had been running for
several weeks until today. Yesterday I started the repeated
dd-from-usb-hd test, and this morning it was still running
fine. Today I was shuffling around my stuff on the desk and when I was
leaving work I noticed that the board had hung. Console log had:

hub 2-0:1.0: port 2 disabled by hub (EMI?), re-enabling...
usb 2-2: USB disconnect, address 2
usb 2-2.2: USB disconnect, address 3
ehci-omap ehci-omap.0: dev 2.2 ep1in scatterlist error -108/-108
ehci-omap ehci-omap.0: dev 2.2 ep1in scatterlist error -108/-108
ehci-omap ehci-omap.0: dev 2.2 ep1in scatterlist error -108/-108
ehci-omap ehci-omap.0: dev 2.2 ep1in scatterlist error -108/-108
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
end_request: I/O error, dev sda, sector 84122904
Buffer I/O error on device sda, logical block 10515363

etc.

Did the board hang as a result of me pushing it? The cables, hubs and
USB devices have been working fine in many setups. Can e.g. bending a
cable trigger the noise issue that Gerald has been describing? Or was
there really a loose cable? I kinda doubt that, as it is now running
fine again..

How about the power fed to the BeagleBoard - should one USB port be
able to feed enough? Is using a Y-cable safer? Or the other way
around, i.e. one port better than Y-cable? Are there differences in
stability if you feed from the USB port or the power connector?

It is possible that moving it around caused the issue if you had a loose cable. I would think if that were the case, you should be able to unplug the USB cable and plug it back in, which should cause it to reconnect. If not, then it may indeed be hung. If it ran all night, then you are most likely OK. We have found that if the test passes overnight you should be OK. Even if it fails after that much time, you most likely will not see the issue during normal usage. That test is very brutal and really stresses the board.

If you are powering the board via a USB cable, then lots of things can happen from the PC side. Most of our testing is being done with a DC supply powering the board. We have not seen any difference in operation between a USB and DC powered board.

Gerald

Just to make things more interesting:

I just had a failure when running the dd stress test through a hub,
but the USB chip did not die. The flash drive stopped working, but my
USB-Ethernet adapter was still working. I pulled out the offending USB
flash drive, plugged it back in and it was re-detected and started
working again normally. This was the second invocation of dd. The
first dd copied the full 1GB successfully.

Update:

I then ran the test again and it failed not far into the first dd,
this time taking down the USB chip.

Details:
1GB Sony USB Flash drive plugged into USB-powered 4-port hub.
Board is Rev C3
kernels, install, etc following http://elinux.org/BeagleBoardUbuntu

dd failed the first time (before the re-plug) on the second time
around with lots of errors like these:
[ 585.643524] end_request: I/O error, dev sda, sector 1463296
[ 585.649353] Buffer I/O error on device sda, logical block 182912
<snip>
[ 585.693084] Buffer I/O error on device sda, logical block 182919
[ 586.097076] sd 0:0:0:0: [sda] Assuming drive cache: write through
[ 586.157897] scsi 0:0:0:0: rejecting I/O to dead device
<snip>
[ 601.369995] usb 1-2.2: device descriptor read/64, error -110
[ 657.658447] usb 1-2.2: device not accepting address 7, error -110

Gerald Coley <gerald@beagleboard.org> writes:

It is possible that moving it around caused the issue if you had a loose
cable. I would think if that were the case, you should be able to unplug the
USB cable and plug it back in which should cause it to reconnect. If
not, then it may indeed be hung. If it ran all night, then you are
most likely OK. We have found that if the test passes overnight you
should be OK. Even if it fails after that much time, you most likely
will not see the issue during normal usage. That test is very brutal
and really stresses the board.

I started a new test today, and this time I did not poke around the
cables, and again got the same result:

hub 1-0:1.0: port 2 disabled by hub (EMI?), re-enabling...
usb 1-2: USB disconnect, address 2
usb 1-2.2: USB disconnect, address 7
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
end_request: I/O error, dev sda, sector 230104
Buffer I/O error on device sda, logical block 28763
Buffer I/O error on device sda, logical block 28764

etc.

The trouble with replugging is that my root is on the USB disk, so
even though the board is alive, it cannot recover.

If you are powering the board via a USB cable, then lots of things can happen
from the PC side. Most of our testing is being done with a DC supply powering
the board. We have not seen any difference in operation between a USB and DC
powered board.

I've been using a separate hub for powering the board. I guess I could
try lab power to see whether it makes any difference.

In order to isolate this issue from other SW issues, I suggest you run the test with the validation kernel and remove the hub. Just plug the USB flash drive directly into the board and repeat the test.

http://code.google.com/p/beagleboard/wiki/BeagleboardRevCValidation

Gerald

Try the validation kernel, because its root is on the SD card, so the USB drive is just a storage device and a failure won't affect kernel operation. This is a very brutal test. In fact, even if a board fails this test after a long period of time, it is not likely that you will have any issues in normal operation, even with very large file sizes. But I am not sure what effect this test will have when the drive is used as the root. We know what we have on the validation kernel, so if you can stay with this to verify the operation, that would be good. If you get the failure there, then you may be having this issue.

Gerald

Gerald Coley <gerald@beagleboard.org> writes:

Try the validation kernel because the root is on the SD card and just makes
the drive a storage device and it won't affect kernel operation

Similar message with the validation kernel:

hub 1-0:1.0: port 2 disabled by hub (EMI?), re-enabling...
usb 1-2: USB disconnect, address 2
sd 0:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
end_request: I/O error, dev sda, sector 25469696
Buffer I/O error on device sda, logical block 3183712
Buffer I/O error on device sda, logical block 3183713
etc..

And as expected, the kernel is definitely not hung. The board is
powered via USB, with a Y-cable.

Is there an SR-enabled "reference kernel" for testing purposes?

No there is not a kernel with SR enabled. It is not clear that it will solve all board issues. It does take care of some, but not all of them. I suggest you try it with a DC supply. If this does not have an impact, let me know.

Gerald

Gerald Coley <gerald@beagleboard.org> writes:

No there is not a kernel with SR enabled. It is not clear that it will solve
all board issues. It does take care of some, but not all of them. I suggest
you try it with a DC supply. If this does not have an impact, let me know.

Gerald

Seems to happen with lab power supply too :(

This happened with the validation kernel, USB disk directly connected
to the host port, BB powered from the DC connector.

hub 1-0:1.0: port 2 disabled by hub (EMI?), re-enabling...
usb 1-2: USB disconnect, address 2
sd 0:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
end_request: I/O error, dev sda, sector 285771760
__ratelimit: 99 callbacks suppressed
Buffer I/O error on device sda, logical block 35721470
Buffer I/O error on device sda, logical block 35721471
sd 0:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
end_request: I/O error, dev sda, sector 285771776
Buffer I/O error on device sda, logical block 35721472
Buffer I/O error on device sda, logical block 35721473
Buffer I/O error on device sda, logical block 35721474
etc..

Hi all,
in March, I started playing with BeagleBoard rev. B. I tried different
kernels and I found the only hub working with kernel 2.6.29 is this
one.
http://www.halfpricecables.com/images/Hub/15831.jpg
That hub does not seem to be available on the Trust webpage.
At the moment, I'm playing with Beagleboard rev C2, kernel 2.6.29. I
tried a lot of hubs, including this one
http://www.trust.com/products/product.aspx?artnr=15919
but, as you know, none of them works.

Hi, my USB Ethernet connection from a Beagle C2 to a Linux host is
running fine. But when I connect the Beagle to a WinXP SP3 host (which
should have a built-in RNDIS driver), I'm not able to set up a connection. I
modified the IDs in the linux.inf so that they match the IDs
presented in the Device Manager (VID_0525&PID_A4AA). The device appears
in the manager with the yellow exclamation mark, and after setup
error code 10 is shown ("Could not start device"). I googled for some hours
for this error but found no working solution, even though there is a lot
written about that topic.

Does someone have further experience with that combination?
Does someone have a working linux.inf for WinXP?

Thanks, Sascha

If you pass the dd test, then you do not have a board with the issue.

Are you really sure that dd is a really good USB reliability validation test?
After all, you already had some validation test before you started to
use 'dd'. And you have been also very confident in your old test, before it
proved to be useless...

Trying to copy data over USB ethernet to a USB HDD is not even a test, it's
just a basic use case for everyone having this hardware. It is simply expected
to work. I would like to get a confirmation that this copy over ethernet works
without problems on at least some of the other boards.

Depending on the test results, we could narrow down the possible sources of
the problem:

1. Let's suppose that the problem is reproducible on some boards and not on
the others. In this case it could be very likely HW related. If it's the same
known USB HW issue, then the 'dd' test is not good enough and needs to be
replaced with something better. If it's some other HW issue, then I guess
somebody may want to investigate it too, just in order to ensure that an extra
iteration of USB HW fixes would not be needed and the next HW revision of
beagleboard will have a rock solid USB EHCI port.

2. If the problem is reproducible everywhere, then it may really be purely SW
related. Or it just means that 100% of the boards are buggy, just to different
extents, with all of them unsuitable for some of the use cases (like mine).

3. If the problem is not reproducible for anyone else, then I'm just a lucky
guy who happened to get a really unique board :) I don't think that is a very
likely scenario though.

Sounds simple, right?

I would look at SW issues in some other areas.

Are there some known USB EHCI related SW issues in the validation kernel?

As I mentioned before, I already tried a very basic test: built the kernel
from the exactly same sources, used very similar kernel configuration, same
hub and peripherals, but ran the test on a different device (not beagleboard).
As you may have guessed - it does not have any problems.

Are you suggesting that I still waste time looking for (probably nonexistent)
SW problems? I myself would wait for the results of running this test on some
other boards first. And if nobody cares, I can just forget about this issue
too. Having a USB HDD working with the BeagleBoard is not that critical for me.

I will see if I can find someone to do what you ask. It is not something I am
set up to do.

Honestly I did not expect you to do anything. I just had a glimpse of hope
that some of the subscribers who have similar USB peripherals could test their
boards for USB stability and report back the results.

br, Siarhei

This is NOT a simple issue we are dealing with but it is HW related.

I don't really want to go into all the testing we have done over the last
6 months. The source of the issue is clear in my mind, as is the solution,
which cannot be implemented on a Rev C3 board.
It is a noise issue due to current consumption of the OMAP3 processor based
on either hot or cold material in use by the board. The USB PHY is VERY
sensitive to this noise. Remove the noise and you remove the issue.

60% of all boards have no problem. 32% of the boards have the problem that
can be fixed by increasing the capacitance on the 1.8V rail. 8% of the
boards can be fixed by running the PHY on a separate voltage rail from the
rest of the system.

OK, I have no doubt that you are really working hard on fixing this issue.
I'm just going to get one more board eventually. Would be nice if it happens
to have a really usable and reliable USB :)

The dd test is a very tough test. Just because the test fails, it is not an
indication that the issue will show up in the worst-case real-world usage
scenario one can come up with. These are the 32% boards. It is clear that
failure of this test the first time it is run is a clear indication that
issues will indeed show up in the simplest of operation in a real world use
case scenario. These are the 8% boards

Based on the experiments with my board, I just strongly suspect that 'dd' is
not a tough test and it detects reliability problems only on a (admittedly
large) fraction of boards. I hope that you also have something else to
additionally validate your future HW fixes for Rev C4 or whatever comes
next. Otherwise it will be kind of a lottery.

....

scenario one can come up with. These are the 32% boards. It is clear that

....

case scenario. These are the 8% boards

Hello, about these figures: I have 3 Rev C boards, all with an unstable
USB host port. I don't think this is a matter of just 8% or 30%; it
is simply a matter of the kind of usage you give the board.

If you aren't using the board with heavy processing loads, it will
appear mostly stable.
But please notice this isn't a problem of just the BeagleBoard. I have
other OMAP3 boards and all of them appear to have an unstable USB host
port...
It's really strange that TI has such problems with a "simple" USB
port; the OMAP3 processor is a very complex beast... Or maybe the
reference design has some problem...
One more thing for the owners of Rev C boards: I don't really think
this problem can be solved by a "clever" software patch. The
problem is in the hardware. Some say that the USB PHY is the culprit.
After some hacks on one of my boards, I can tell this is not a
power supply and/or noise problem; to me the problem is inside the
OMAP3...

Nuno

I appreciate your opinion. We will take your feedback and see if we can use it to help us solve the issue.

Thank you!

Gerald

So, I am not a kernel guy, I'm an app and occasional driver guy. I
build my kernel using an OpenEmbedded recipe. I need to build from
source because I have kernel modules that need to be built.
Knowing this, is there an easy way to explain what I need to do to the
kernel source to get this SmartReflex thing working?

Failing actually being able to apply the patch needed to correct the
problem, is there a way to programmatically detect it? At the least,
I could drop such a tool into a cron job and have it bounce the system
in order to recover automatically.

Yes a cron job could do this.

It is relatively easy to detect because all the USB devices on that
port disappear when it crashes (at least on all boards I've used).

As for how to bounce the system, my approach is to use the watchdog
timer functionality from the companion chip (TPS65950 or TWL4030).
Watchdogs are like deadman switches: a simple circuit with a countdown
timer attached to a board reset. Set the countdown to 60 seconds, and
if you then fail to ping (reset) the timer within that 60 seconds, it
resets the board.

NOTE: when I reset the board this way it sometimes turns off the OTG
USB port. Using the watchdog to reset it again will bring it back. So
if you're using the OTG, add a startup script to check that your OTG
devices are there and reset again if they're not (this reliably brought
them back for me). This happens to me when using a hub (externally
powered) plugged into the OTG, but not when it's a single device.

The kernel I'm using (following elinux.org/BeagleBoardUbuntu) has
support for it, so to reset the board, I set the watchdog countdown to
5 seconds, then start it and then don't ping it. 5 seconds later the
board will reset. Here's some C code. NOTE: you may need to be root to
run this so it can open /dev/watchdog. This will trigger a reset on
any computer that has a watchdog so *don't* test it on your PC unless
you've saved everything. ;)

/*
 * Watchdog Driver Reset Program
 */

/* seconds until reset */
#define DELAY 5
#define DEVICE "/dev/watchdog"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/watchdog.h>

/*
 * The main program. Opening the device enables the watchdog; we then set
 * the timeout and never ping it.
 */
int main(int argc, char *argv[])
{
        /* Opening the device starts the watchdog countdown. */
        int fd = open(DEVICE, O_WRONLY);

        if (fd == -1) {
                fprintf(stderr, "Couldn't open watchdog device '%s'.\n", DEVICE);
                fflush(stderr);
                return 1;
        }

        /* The driver writes back the timeout it actually uses. */
        int timeout = DELAY;
        if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) == -1) {
                perror("WDIOC_SETTIMEOUT");
                return 1;
        }
        printf("Watchdog will reset in %d seconds (requested %d)\n",
               timeout, DELAY);

        /* No keepalive is ever sent, so the countdown is left to expire. */
        return 0;
}

Thanks for that. I've been able to get it to work when testing, but
I've recently noticed that my devices don't always disappear outright
when the USB crashes under normal operation. Really the only things I
have on it are my hub and wireless NIC. Basically the NIC stops
working and all attempts to reconnect just fail with errors, but the
device is still there, and lsusb still reports it. If I manually
unplug my wireless NIC, then the whole thing crashes (a message comes
up indicating something, but I don't have it right now). When I also
plug in an external hard drive and run the dd test against it, yeah,
everything crashes, but not with just my NIC.
Is there not just some way of saying "hey, USB port, are you alive?"
I can come up with some kind of check for whether the network is up or
down and can be brought up and whatnot, but it seems like each of these
checks I put in for this problem just gets more and more hackish.