Random usb bus crash on BeagleBoard XM Rev C.

Antonio_Manuel_Abad · February 10, 2012, 6:34pm

Hi everyone,

I'm using a BeagleBoard XM Rev C running a Robert C. Nelson Debian
kernel (3.0.4-x3) for a robotics project. The only problem I've had
with the board thus far is a random crash of the usb hub. I can tell
when it happens because I have a bunch of devices hooked up to the usb
hub, including an xbee radio and a couple of GPS units. These
devices have little LEDs that turn on and stay on when power is being
supplied to them. When the crash happens, all of these LEDs turn
off. Because of the random nature of the crashes and how the ethernet
port is tied into the USB hub (it's unusable when the crash happens), I
was unable to get any additional info on the system in this state
until recently. Fortunately, it happened on Sunday night, and I was
able to log into the board via the serial port.

Here is some dmesg output:

[ 4.209533] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
[ 4.216522] Freeing init memory: 356K
[ 4.239715] ehci-omap ehci-omap.0: port 2 reset error -110
[ 4.308868] mmc0: host does not support reading read-only switch.
assuming write-enable.
[ 4.332641] mmc0: new SDHC card at address 1234
[ 4.339324] mmcblk0: mmc0:1234 SA04G 3.67 GiB
[ 4.352630] mmcblk0: p1 p2
[ 4.404296] udev[67]: starting version 167
[ 4.934967] ehci-omap ehci-omap.0: port 2 reset error -110
[ 5.575592] ehci-omap ehci-omap.0: port 2 reset error -110
[ 5.667083] EXT4-fs (mmcblk0p2): INFO: recovery required on
readonly filesystem
[ 5.677642] EXT4-fs (mmcblk0p2): write access will be enabled
during recovery
[ 6.216217] ehci-omap ehci-omap.0: port 2 reset error -110
[ 6.856872] ehci-omap ehci-omap.0: port 2 reset error -110
[ 7.285278] hub 1-0:1.0: Cannot enable port 2. Maybe the USB cable
is bad?
[ 7.356781] ehci-omap ehci-omap.0: port 2 reset error -110
[ 8.059997] ehci-omap ehci-omap.0: port 2 reset error -110
[ 8.700622] ehci-omap ehci-omap.0: port 2 reset error -110
[ 9.341064] ehci-omap ehci-omap.0: port 2 reset error -110
[ 9.981719] ehci-omap ehci-omap.0: port 2 reset error -110
[ 10.410308] hub 1-0:1.0: Cannot enable port 2. Maybe the USB cable
is bad?
[ 10.481811] ehci-omap ehci-omap.0: port 2 reset error -110
[ 11.184844] ehci-omap ehci-omap.0: port 2 reset error -110
[ 11.825592] ehci-omap ehci-omap.0: port 2 reset error -110
[ 12.466217] ehci-omap ehci-omap.0: port 2 reset error -110
[ 13.106811] ehci-omap ehci-omap.0: port 2 reset error -110
[ 13.535339] hub 1-0:1.0: Cannot enable port 2. Maybe the USB cable
is bad?
[ 13.606811] ehci-omap ehci-omap.0: port 2 reset error -110
[ 13.645141] EXT4-fs (mmcblk0p2): recovery complete
[ 14.301269] EXT4-fs (mmcblk0p2): mounted filesystem with ordered
data mode. Opts: (null)
[ 14.313476] ehci-omap ehci-omap.0: port 2 reset error -110
[ 14.952606] ehci-omap ehci-omap.0: port 2 reset error -110
[ 15.591217] ehci-omap ehci-omap.0: port 2 reset error -110
[ 16.231933] ehci-omap ehci-omap.0: port 2 reset error -110
[ 16.660339] hub 1-0:1.0: Cannot enable port 2. Maybe the USB cable
is bad?
[ 16.670013] hub 1-0:1.0: unable to enumerate USB device on port 2
[ 16.686035] EXT4-fs (mmcblk0p2): re-mounted. Opts: errors=remount-
ro
[ 16.809448] udev[242]: starting version 167
[ 17.245697] input: gpio-keys as /devices/platform/gpio-keys/input/
input1
[ 17.537811] twl_rtc twl_rtc: rtc core: registered twl_rtc as rtc0
[ 17.565887] twl_rtc twl_rtc: Power up reset detected.
[ 17.581848] rtc-ds1307: probe of 2-0068 failed with error -5

And here is some more dmesg output from the crashed state, followed by lsusb output:

[ 538.091125] ehci-omap ehci-omap.0: port 2 reset error -110
[ 538.100677] ehci-omap ehci-omap.0: port 2 reset error -110
[ 538.110137] ehci-omap ehci-omap.0: port 2 reset error -110
[ 538.119537] ehci-omap ehci-omap.0: port 2 reset error -110
[ 538.128936] ehci-omap ehci-omap.0: port 2 reset error -110
[ 538.137145] hub 1-0:1.0: hub_port_status failed (err = -32)
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
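
For anyone wanting to tally errors like these from a saved log, here is a minimal Python sketch. It is not from the original posts: the function name summarize_usb_errors and the sample text are made up for illustration, and the errno names assume Linux numbering (110 = ETIMEDOUT, 32 = EPIPE, 5 = EIO).

```python
import re
from collections import Counter

# Linux errno values that appear in the logs above (assumes Linux numbering).
ERRNO_NAMES = {110: "ETIMEDOUT", 32: "EPIPE", 5: "EIO"}

# Matches lines such as:
#   [ 538.091125] ehci-omap ehci-omap.0: port 2 reset error -110
RESET_RE = re.compile(r"ehci-omap\.\d+: port (\d+) reset error -(\d+)")


def summarize_usb_errors(dmesg_text):
    """Count EHCI port reset errors, keyed by (port, errno name)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            port = int(match.group(1))
            err = int(match.group(2))
            counts[(port, ERRNO_NAMES.get(err, str(err)))] += 1
    return counts


# Hypothetical sample, shaped like the output quoted above.
SAMPLE = """\
[ 538.119537] ehci-omap ehci-omap.0: port 2 reset error -110
[ 538.128936] ehci-omap ehci-omap.0: port 2 reset error -110
[ 538.137145] hub 1-0:1.0: hub_port_status failed (err = -32)
"""

if __name__ == "__main__":
    for (port, name), n in summarize_usb_errors(SAMPLE).items():
        print(f"port {port}: {n} x {name}")
```

Feeding it the text of a full dmesg capture taken over the serial console would give a per-port count of reset timeouts, which may help when comparing runs.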

I searched for these errors and found just a handful of references to
a potential issue in the ehci module. So, with all of that, I have
two questions:

1. Any ideas about what could be going on? Is it software or hardware
related?
2. Any ideas about how to reproduce the problem in a more reliable way
(i.e. instead of having it fail randomly)? One of the threads I found
suggested that perhaps a mass data transfer would prompt a failure.
However, I couldn't prompt the failure to happen when copying a 1.5 GB
file from a memory stick onto the board.

Thank you in advance for any help.

-Tony

Gerald_Coley1 · February 10, 2012, 6:45pm

What is the current rating on your 5VDC power supply? What is the total current consumption of the devices plugged into the HUB?

Gerald

Antonio_Manuel_Abad · February 10, 2012, 7:48pm

What is the current rating on your 5VDC power supply?

The power supply is rated for 5A.

What is the total current consumption of the devices plugged into the HUB?

I don't know the exact figure, but my guess is 200-300mA. Also, all
of the USB components are plugged into a powered usb hub, which is
then plugged into a usb port on the Beagle Board.

-Tony

Gerald_Coley1 · February 10, 2012, 7:50pm

So, why aren’t the components plugged into the four USB ports on the BeagleBoard-xM? Not sure why the HUB is needed. Which USB port are you using on the BeagleBoard-xM?

Gerald

Antonio_Manuel_Abad · February 10, 2012, 7:56pm

So, why aren't the components plugged into the four USB ports on the
BeagleBoard-xM? Not sure why the HUB is needed.

The hub is needed because there are a total of 6 USB devices plugged
into the system. Previously, I was plugging devices into 3 of the
usb ports on the board, and plugging in an unpowered hub, into which
the remaining 2 devices were plugged. When using the unpowered hub,
the crashes happened more frequently, but still randomly. We moved to
a powered hub to obviate the problem, but the problem persists, albeit
less frequently.

Which USB port are you using on the BeagleBoard-xM?

Right now the USB hub is plugged into the top, outermost port on the
board. I believe we've had it plugged in on a different port, but
still observed the same behavior.

-Tony

Gerald_Coley1 · February 10, 2012, 8:01pm

OK, I got it. I'm not sure it is a power issue. Is the Ethernet port active as well? It sounds like the EHCI handler may be getting overloaded in the SW somewhere. I'm not sure how putting a HUB in the middle would make that operate any better, unless it were a power-related issue of some kind.

Gerald

Antonio_Manuel_Abad · February 10, 2012, 8:07pm

Is the Ethernet port active as well?

Not when the crash occurs. We mainly communicate with the board
through an Xbee Pro Radio running at 56 kbaud (via USB). When the
crash occurs, the ethernet is useless; that's expected, I suppose.

It sounds like the EHCI handler may be getting overloaded in the SW somewhere.

One theory that I haven't been able to validate is that a ton of data
through the Xbee radio is causing the issue. Specifically, the radio
link is dropping a lot of packets (another issue, but not Beagle
related), and comms are happening primarily over XML-RPC. XML-RPC is
TCP/IP based, so dropped packets mean error correction and constant
re-sending. I need a more reliable way to make the ehci module fail
before I can conclude anything, though.

-Tony

Richard4 · February 10, 2012, 10:32pm

I've so far seen this problem with SMSC9514s connected to EHCI on:

  - The built-in one on Beagle xM Rev C whilst doing heavy network I/O
  - Overo Firestorm whilst doing heavy network and memory I/O
      and a trivial amount of USB ethernet I/O.
  - One of our own boards doing a lot of I/O but not using USB
     at all.

Using:

  - Beagle xM's built in SMSC9514
  - Two different boards of our design containing SMSC9514 wired
       to Overo firestorms and our own boards.
  - TUSB1210 as well as the SMSC devices on Firestorm and Beagle xM

Ironically, with the C97 fix, an ordinary Beagle C4 seems rock
solid.

I've no evidence that this is a power issue and adding decoupling
to my Firestorm has no effect.

I'm also becoming convinced that this is something to do
with the EHCI block rather than external hardware.
Has anyone raised this on E2E or with TI or shall I?

It's making at least one of my projects embarrassing (it
reboots every few minutes) and seems to have been a consistent
complaint on various mailing lists for OMAP3 hardware for
some time.

I note that my kernel is using DPLL5 M=0x78 N=0x0C
DIV_120M=1 , which is rather a long way from the recommended
443 / 5 , 8 in advisory 2.1 . I'll rig my kernel to use the
right values and see if that helps ..
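
To compare the two sets of values, the arithmetic can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: it uses the common DPLL relation output = ref * M / (N + 1) / M2 with N taken as the raw register value, it assumes a 13 MHz reference clock (the post does not say which clock is in use), and it reads "443 / 5 , 8" as M=443, N=5, divider=8. The function name is made up.

```python
def dpll_output_hz(ref_hz, m, n, m2):
    """Return ref * M / (N + 1) / M2, treating n as the raw register value."""
    return ref_hz * m / (n + 1) / m2


REF_HZ = 13_000_000  # assumption: 13 MHz reference clock

# Values found in the running kernel: M=0x78, N=0x0C, DIV_120M=1
current_hz = dpll_output_hz(REF_HZ, 0x78, 0x0C, 1)

# Values quoted from the advisory, read as M=443, N=5, divider=8
recommended_hz = dpll_output_hz(REF_HZ, 443, 5, 8)

if __name__ == "__main__":
    print(f"current:     {current_hz / 1e6:.3f} MHz")
    print(f"recommended: {recommended_hz / 1e6:.3f} MHz")
```

Under these assumptions both settings come out at roughly 120 MHz, so the difference would lie in the intermediate frequency before the final divider rather than in the output clock itself.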

In the meantime, if there is anything I can do to help,
I can currently reproduce this issue at will ..

Richard.

Antonio_Manuel_Abad · February 10, 2012, 10:35pm

In the meantime, if there is anything I can do to help,
I can currently reproduce this issue at will ..

A procedure to reliably reproduce the issue would be a big help to me.

Thanks,

Tony

Richard4 · February 10, 2012, 10:48pm

Sadly, my code which currently reproduces it can't be made public
(and the hardware is a bit specialised) - if I come across some software
which will do it that I can release, I'll put it up ..

Sorry,

Richard.

Antonio_Manuel_Abad · February 10, 2012, 10:59pm

Sadly, my code which currently reproduces it can't be made public
(and the hardware is a bit specialised) - if I come across some software
which will do it that I can release, I'll put it up ..

No apology necessary; I understand. In fact, thank you for the
detailed explanation you gave me in the previous post. Incidentally,
I think it confirms my theory that the send/resend requests caused by
packet loss over the comm link in my application (via USB) are
the root cause of the USB crash. And that's a huge step forward for
me and this project.

One more question: have you heard of any such issues on the
BeagleBone? Given the architecture of the BeagleBone, can you
formulate an opinion about whether this might also be an issue on the
BeagleBone?

Many thanks,

Tony

Richard4 · February 10, 2012, 11:39pm

You can tell I'm bored.

This patch will set your DPLL5 to the "right" settings, as
specified in SPRZ319e, table 36, provided your system clock is
13 or 26MHz. In all other ways, it is horrid, but here it is
if you want to try it.

It's against v3.2.1 + some other stuff I added to my local kernel
earlier, so no guarantees it applies cleanly. It's had a whole
120 seconds of testing, so YMMV :-).

Enjoy,

Richard.

erratum-21-clksel.diff (7.49 KB)

Richard4 · February 10, 2012, 11:54pm

Sadly, my code which currently reproduces it can't be made public
(and the hardware is a bit specialised) - if I come across some software
which will do it that I can release, I'll put it up ..

No apology necessary; I understand. In fact, thank you for the
detailed explanation you gave me in the previous post. Incidentally,
I think it confirms my theory that the send/resend requests caused by
packet loss over the comm link in my application (via USB) are
the root cause of the USB crash. And that's a huge step forward for
me and this project.

It could well be - I suspect that sending several multicast packets
in a fast burst makes my crash more frequent, but I haven't any hard
data to back that up and it's awfully hard to become superstitious in
these circumstances ..

Anyway, if you're really desperate, you could try my patch and see
if it helps? If nothing else, it gets you more spurious debugging for
your boot sequence. Because you needed that

One more question: have you heard of any such issues on the
BeagleBone? Given the architecture of the BeagleBone, can you
formulate an opinion about whether this might also be an issue on the
BeagleBone?

Having had all the experience of having just downloaded the
datasheet, I dunno. Certainly not if you use the built-in ethernet
on the AM3359.

The AM3359 uses the musb block rather than the EHCI block from
the DM3730. I haven't seen any reports of this problem with MUSB.
However, the underlying clocking issue is in the PLL rather than
the EHCI block itself.

AM3359 also uses a different DPLL arrangement.

So: if it is the erratum 2.1 bug, it depends on whether that
bug exists in the AM3359's DPLL as well - it seems from a quick
look that the clocking arrangements for AM3359 are different and
so you should be safe, but it can't hurt to check with TI.

If it isn't, I'd guess not, as I don't imagine the MUSB core
has the same bugs as the EHCI core and I would very much hope
that the integrated PHY doesn't have any issues with the
integrated USB core!

I'd say it's not likely - and will now be proved wrong by
a slew of bug reports, no doubt!

You could try asking directly on E2E and seeing if someone
from TI can give you the lowdown?

Good luck!

Richard.

Antonio_Manuel_Abad · February 11, 2012, 5:26pm

You can tell I'm bored.

And very generous with your time and knowledge. Thank you Richard.

It's against v3.2.1 + some other stuff I added to my local kernel
earlier, so no guarantees it applies cleanly.

I applied the patch to the 3.2.0-12-omap kernel and it seemed to take
just fine. I just cross-compiled the patched kernel and only had to
remove some staging drivers that had nothing to do with what was in
the patch.

So, now for a very, very newb question: Can anyone instruct me or
point me to instructions for the rest of the steps I need to get from
the compiled kernel to a working image flashed on an sd card? It's
been about a decade since I rolled my own kernel, and this will be my
first time doing so for an ARM board.

It's had a whole 120 seconds of testing, so YMMV :-).

I'll let you know how it works on my end; I'll try my best to
replicate the condition with the radios that seemed to cause the usb
crash.

-Tony

Richard4 · February 11, 2012, 6:39pm

You can tell I'm bored.

And very generous with your time and knowledge. Thank you Richard.

Nah, just bored, but you're very welcome

It's against v3.2.1 + some other stuff I added to my local kernel
earlier, so no guarantees it applies cleanly.

I applied the patch to the 3.2.0-12-omap kernel and it seemed to take
just fine. I just cross-compiled the patched kernel and only had to
remove some staging drivers that had nothing to do with what was in
the patch.

Excellent

If I really have been able to resolve the other oops by upgrading
to 3.2.5, tonight's overnight test working will make me pretty
confident that my patch is A Good Thing(tm).

I am now getting:

[ 5788.827178] ------------[ cut here ]------------
[ 5788.832061] Kernel BUG at c00e422c [verbose debug info unavailable]
[ 5788.838653] Internal error: Oops - undefined instruction: 0 [#1]
[ 5788.844970] Modules linked in: dsplinkk(O) sdmak(O) cmemk(O) bulk(P) kbus(O) m3d2(P)
[ 5788.853179] CPU: 0 Tainted: P O (3.2.5-00015-g6353637 #1)
[ 5788.860137] PC is at kmem_freepages+0xd8/0x198
[ 5788.864807] LR is at slab_destroy+0x28/0x60
[ 5788.869232] pc : [<c00e422c>] lr : [<c00e4938>] psr: 60000093
[ 5788.869232] sp : cf3f1cf0 ip : 0000b365 fp : cf807230
[ 5788.881286] r10: 00000000 r9 : 00000000 r8 : 00000327
[ 5788.886779] r7 : c0bb7de8 r6 : 00000001 r5 : ffffffff r4 : c0bb7db4
[ 5788.893646] r3 : cf800440 r2 : c0d36ca0 r1 : cb365000 r0 : 00000000
[ 5788.900512] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
[ 5788.908111] Control: 10c5387d Table: 8f068019 DAC: 00000015
[ 5788.914154] Process camerad (pid: 669, stack limit = 0xcf3f02f0)
[ 5788.920471] Stack: (0xcf3f1cf0 to 0xcf3f2000)
[ 5788.925048] 1ce0: cf3f7b80 cf800440 c062f3f0 c0bb7eb8
[ 5788.933654] 1d00: 00000000 00100100 cf800440 00000005 c062f3f0 c00e4ad8 00200

Which looks like nastiness in the network stack, but does at least
seem unrelated to USB - this was always a problem in my setup, so
I don't think I have caused it.

So, now for a very, very newb question: Can anyone instruct me or
point me to instructions for the rest of the steps I need to get from
the compiled kernel to a working image flashed on an sd card? It's
been about a decade since I rolled my own kernel, and this will be my
first time doing so for an ARM board.

If you are using one of the common distributions, with X-Loader and
u-boot, simply doing a make uImage to get a uImage for the kernel
and then putting it on the boot partition of the SD card (usually
partition 1), overwriting the uImage that is already there will do
it.

(remember to take a copy of your old kernel before obliterating it,
though, just in case.. )
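
The copy-with-backup step can be scripted. Below is a minimal Python sketch, assuming the boot partition is already mounted somewhere (for example under /media/boot); the function name install_uimage and the ".bak" suffix are made up for illustration.

```python
import os
import shutil


def install_uimage(new_image, boot_dir, backup_suffix=".bak"):
    """Copy new_image to boot_dir/uImage, keeping a copy of any existing uImage.

    Returns the backup path, or None if there was nothing to back up.
    """
    target = os.path.join(boot_dir, "uImage")
    backup = None
    if os.path.exists(target):
        backup = target + backup_suffix
        shutil.copy2(target, backup)  # keep the old kernel, just in case
    shutil.copy2(new_image, target)
    return backup


if __name__ == "__main__":
    # Example paths only; adjust to wherever the card is mounted.
    print(install_uimage("arch/arm/boot/uImage", "/media/boot"))
```

If the new kernel fails to boot, copying the backup file over uImage from another machine restores the previous kernel.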

It's had a whole 120 seconds of testing, so YMMV :-).

I'll let you know how it works on my end; I'll try my best to
replicate the condition with the radios that seemed to cause the usb
crash.

Thank you! That would be great - any data would be very welcome so
if you have the time to run it by your hardware, I'd very much
appreciate it.

If you have a quick peek utility, the registers to peek at to make
sure your settings really are right are:

[48004d4c] = 0001bc05 [ 113669 ]
[48004d50] = 00000008 [ 8 ]

(0x1bc = M, 0x05 = N, 0x08 = FREQSEL)
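
Splitting the first value into its fields by hand is easy to get wrong, so here is a small Python sketch of the decode. The bit positions (M in bits 18:8, N in bits 6:0) are an assumption based on the split shown above and should be checked against the TRM; the function name is hypothetical.

```python
def decode_dpll5_clksel(value):
    """Split the raw register value into (M, N) using the assumed bit layout."""
    m = (value >> 8) & 0x7FF  # assumed: multiplier M in bits 18:8
    n = value & 0x7F          # assumed: divider N in bits 6:0
    return m, n


if __name__ == "__main__":
    m, n = decode_dpll5_clksel(0x0001BC05)
    print(f"M = {m:#x} ({m}), N = {n:#x} ({n})")
```

Note that 0x1bc is 444 in decimal, one more than the 443 mentioned earlier in the thread; the posts do not say whether that difference is intentional.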

It seems to be working for me so far - I've not seen the USB PHY
reset since I applied it, though since I am getting the above oops
I haven't been able to test overnight.

Richard.

Richard4 · February 12, 2012, 4:20pm

Hello all,

Good news and bad news.

I've modified my USB drop-out test case not to include any
interaction with the LAN9221 I'm using and with my DPLL5
patch, I can happily run the LAN9514 overnight with no
problems. Note that the patch I posted on Friday has a
couple of bugs in it (that don't affect Beagles, but which
will hit people running non-26/13MHz 3630s).

I'd be interested in any other reports, good or bad, and
if good I'll try and get a cleaned-up patch to the kernel folk.

The bad news is that I now suspect the LAN911x driver of doing
something bad if placed under heavy load for hours at a time;
be warned ..

Richard.

Antonio_Manuel_Abad · February 12, 2012, 7:53pm

If you are using one of the common distributions, with X-Loader and
u-boot, simply doing a make uImage to get a uImage for the kernel
and then putting it on the boot partition of the SD card (usually
partition 1), overwriting the uImage that is already there will do
it.

Richard,

I copied over the uImage to the boot partition. I can bring the
system up, but there's some badness with the modules. e.g.:

modprobe: FATAL: Could not load /lib/modules/3.2.5/modules.dep: No
such file or directory

FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/3.2.5/modules.dep: No such file or directory

Do I have to do something with the modules for the newly compiled
kernel? From another thread I have going, Robert Nelson had one more
step for a cross-compile-related question:

"mount your sd card's rootfs:

make CROSS_COMPILE=arm-linux-gnueabi- ARCH=arm modules_install
INSTALL_MOD_PATH=/media/rootfs/"

Thanks,

Tony

RobertCNelson · February 12, 2012, 8:05pm

If your "/lib/modules/3.2.5/" directory is real and does exist, give
"sudo depmod -a" a try to generate modules.dep.

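The check described above can also be scripted. This is a minimal Python sketch, assuming the standard /lib/modules/<release>/ layout; the function name modules_status and its root parameter are made up for illustration (root lets it be pointed at a mounted SD card such as /media/rootfs).

```python
import os
import platform


def modules_status(release=None, root="/"):
    """Report whether the module directory and modules.dep exist for a release."""
    release = release or platform.release()  # same string as `uname -r`
    mod_dir = os.path.join(root, "lib", "modules", release)
    if not os.path.isdir(mod_dir):
        return "missing-dir"  # modules were never installed for this release
    if not os.path.isfile(os.path.join(mod_dir, "modules.dep")):
        return "missing-dep"  # directory exists; depmod should generate the file
    return "ok"


if __name__ == "__main__":
    print(platform.release(), modules_status())
```

A "missing-dir" result suggests the modules were never installed for the running kernel, in which case depmod alone will not help.
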
Regards,

Antonio_Manuel_Abad · February 12, 2012, 8:52pm

if your "/lib/modules/3.2.5/" directory is real and does exist..

Unfortunately, it doesn't, either on the board or in the lib/modules
directory in the source tree. And uname -r returns 3.2.5. Ugh. This
is the 3.2.0-12-omap source I grabbed from the Ubuntu folks; I guess
there's a missing patch or something. Since I know you guys have
working 12.04 hard float images for the Beagle, I'll start over
tonight using one of them.

Thanks,

Tony

Antonio_Manuel_Abad · February 13, 2012, 7:41am

I can't get the patched kernel that I compiled from source to boot on
the board (a BeagleBoard xm Rev C). Right now, the board just hangs
after displaying: "Uncompressing Linux... done, booting kernel" over
the serial port.

Here is the exact series of steps that I took to compile and install
the kernel (all as root):
1. Downloaded kernel source 3.2. (i.e.
http://www.kernel.org/pub/linux/kernel/v3.x/linux-3.2.tar.gz)
2. Applied the required patch: patch -p1 < patch-3.2-psp1.diff
3. Applied Richard's patch: patch -p1 < erratum-21-clksel.diff
4. Config using the provided defconfig (renamed xm_config): make
xm_defconfig CROSS_COMPILE=arm-linux-gnueabi- ARCH=arm
5. Compiled kernel image: make uImage CROSS_COMPILE=arm-linux-gnueabi- ARCH=arm
6. Further config to remove the rts5139 module: make menuconfig
CROSS_COMPILE=arm-linux-gnueabi- ARCH=arm
7. Compiled kernel modules: make modules CROSS_COMPILE=arm-linux-gnueabi- ARCH=arm
8. Copied kernel image to boot partition: cp arch/arm/boot/uImage /media/boot/.
9. Installed kernel modules: make modules_install
INSTALL_MOD_PATH=/media/rootfs/ CROSS_COMPILE=arm-linux-gnueabi-
ARCH=arm
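
For reference, the make invocations above can be generated in one place so that every step gets the same ARCH and CROSS_COMPILE settings. The Python sketch below is only one possible ordering, with all configuration done before anything is compiled; it is not a claim that the ordering is what causes the hang. The function name build_commands is hypothetical, and the defconfig name and rootfs path are taken from the steps above.

```python
# Settings shared by every step, as used in the list above.
COMMON = ["ARCH=arm", "CROSS_COMPILE=arm-linux-gnueabi-"]


def build_commands(defconfig, rootfs):
    """Return the make invocations in order: configure, build, then install."""
    return [
        ["make", defconfig] + COMMON,
        ["make", "menuconfig"] + COMMON,  # finish all config changes first
        ["make", "uImage"] + COMMON,
        ["make", "modules"] + COMMON,
        ["make", "modules_install", "INSTALL_MOD_PATH=" + rootfs] + COMMON,
    ]


if __name__ == "__main__":
    for cmd in build_commands("xm_defconfig", "/media/rootfs/"):
        print(" ".join(cmd))
```

Copying arch/arm/boot/uImage to the boot partition would still be a separate step after the uImage build.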

So, I ran the above steps with and without Richard's patch (but
both times having applied the psp1 patch); the result is the same. No
joy.

Thanks,

Tony