USB EHCI problems

Duckyduck · August 17, 2009, 1:17pm

Just guess work… Probably only TI knows the real reason of these problems…
Koen’s patch enables the OMAP feature that dynamically adjusts the peripheral power rail to a lower voltage. With timing problems you typically see that a higher voltage solves this kind of problem, so my guess is that internal chip temperature is the cause. But again, I’m just guessing.

Gerald_Coley1 · August 17, 2009, 1:46pm

We have been chasing the issue for a while now and it appears to be noise that affects the PHY itself. It is not really an OMAP issue per se, but an interaction between the OMAP and the PHY voltage rails. It appears that reducing VDD2 lowers the current consumed by the OMAP and thereby reduces the noise on the PHY voltage rails. We have other methods for solving this issue, but for Beagle this is the most effective and easiest solution. The issue does not show up on all boards. It depends on certain OMAP devices that have a higher current consumption than other parts, something SmartReflex was created to handle.

Increasing the voltage does NOT solve this issue. We tried it and it makes it worse.

Gerald

Koen_Kooi · August 17, 2009, 2:50pm

This is the first time I've heard about timing and/or temperature problems when running at <= 600MHz.

regards,

Koen

Gerald_Coley1 · August 17, 2009, 3:26pm

This is NOT a temperature issue. This is a noise issue on the external VIO_1V8 rail caused by an increase in current consumption on certain OMAP devices. It is NOT a temperature issue.

Gerald

Kai_Blin · August 18, 2009, 9:43pm

For me that doesn't seem to change much, if anything at all.

I'm running the above kernel, booted into Angstrom. I connect to the board via
USB ethernet sitting on a hub.
I did an echo 1 > /sys/power/sr_vdd2_autocomp, and checked the value
"1" was actually set there by running cat /sys/power/sr_vdd2_autocomp
afterwards.
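
The enable-and-verify step described above can be wrapped in a small helper. This is only a sketch: the function name `set_and_verify` and the `SR_ROOT` variable are made up for illustration, and the `sr_vdd2_autocomp` file is assumed to exist only on kernels that carry the SmartReflex patch discussed in this thread.

```shell
#!/bin/sh
# Sketch: write a value to a sysfs-style attribute and confirm it reads back.
# SR_ROOT and set_and_verify are illustrative names, not part of any kernel API.
SR_ROOT="${SR_ROOT:-/sys/power}"

set_and_verify() {
    # $1 = attribute name (e.g. sr_vdd2_autocomp), $2 = value to write
    attr="$SR_ROOT/$1"
    if [ ! -w "$attr" ]; then
        echo "missing or not writable: $attr" >&2
        return 2
    fi
    echo "$2" > "$attr" || return 1
    # Read the value back; a mismatch means the setting did not stick.
    [ "$(cat "$attr")" = "$2" ]
}

# Usage on the board (as root):
#   set_and_verify sr_vdd2_autocomp 1 && echo "SmartReflex on VDD2 enabled"
```

A return code of 2 separates "this kernel has no such file" from "the write was rejected", which helps when comparing results across different kernel builds.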

So far, so good.
Then I connected a self-powered USB disk and started reading from it
via dd, and poof, my ethernet connection was gone. Angstrom doesn't
seem to log to /var/log/messages, but I was running a script that saves
dmesg output once a second, and it captured the following
output:

----8<----

[ 852.024475] WARNING: at arch/arm/mach-omap2/pm34xx.c:300
prcm_interrupt_handler+0xc4/0x100()
[ 852.042022] prcm: WARNING: PRCM interrupt received, but no code to
handle it (00340000)
[ 852.059143] Modules linked in: ircomm_tty ircomm irda ipv6
[ 852.069244] [<c0440acc>] (dump_stack+0x0/0x14) from [<c0066ea8>]
(warn_slowpath+0x68/0x9c)
[ 852.086517] [<c0066e40>] (warn_slowpath+0x0/0x9c) from [<c004910c>]
(prcm_interrupt_handler+0xc4/0x100)
[ 852.105468] r3:00340000 r2:c051414f
[ 852.113861] r7:34300034 r6:00340000 r5:00000001 r4:00000000
[ 852.124511] [<c0049048>] (prcm_interrupt_handler+0x0/0x100) from
[<c00919dc>] (handle_IRQ_event+0x3c/0x74)
[ 852.144134] r7:0000000b r6:00000000 r5:00000000 r4:cfadf9a0
[ 852.154876] [<c00919a0>] (handle_IRQ_event+0x0/0x74) from
[<c0092e60>] (handle_level_irq+0x94/0xec)
[ 852.174224] r7:00000102 r6:c0590000 r5:0000000b r4:c05a2398
[ 852.185363] [<c0092dcc>] (handle_level_irq+0x0/0xec) from
[<c003b058>] (__exception_text_start+0x58/0x70)
[ 852.206024] r5:c0591ee0 r4:0000000b
[ 852.215209] [<c003b000>] (__exception_text_start+0x0/0x70) from
[<c003ba30>] (__irq_svc+0x30/0x80)
[ 852.235717] Exception stack(0xc0591e38 to 0xc0591e80)
[ 852.246734]
1e20: 0000001f
c0590000
[ 852.267181] 1e40: c05d9da0 00000100 0000005f 00000000 c0590000
00000102 c05d339c 00000000
[ 852.288024] 1e60: 0000000a c0591eb4 c0591eb8 c0591e80 c006c2e4
c006c1e4 20000153 ffffffff
[ 852.309326] r5:d8200000 r4:ffffffff
[ 852.319213] [<c006c1a0>] (__do_softirq+0x0/0x100) from [<c006c2e4>]
(irq_exit+0x44/0x88)
[ 852.339904] [<c006c2a0>] (irq_exit+0x0/0x88) from [<c003b05c>]
(__exception_text_start+0x5c/0x70)
[ 852.361541] [<c003b000>] (__exception_t

----8<----

The hdd seems to be still spinning, however that seems to continue
even when I power cycle the BB. I don't currently have a serial cable
connector for the board, so I can't check without the ethernet
adapter.
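
A logging loop of the kind mentioned above (saving dmesg output once a second) could look roughly like this. It is a sketch and not the script that was actually used; `snapshot_log` and the file naming are made up for illustration.

```shell
#!/bin/sh
# Sketch: save the output of a log command once per interval.
# snapshot_log and its arguments are illustrative names.
snapshot_log() {
    # $1 = output directory, $2 = number of snapshots, $3 = seconds between
    # snapshots, remaining arguments = command to capture (e.g. dmesg)
    outdir=$1; count=$2; interval=$3
    shift 3
    mkdir -p "$outdir" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # Write to a temporary name first, then rename, so a sudden hang
        # cannot leave a half-written file under the final name.
        "$@" > "$outdir/snap.tmp" 2>&1
        mv "$outdir/snap.tmp" "$outdir/snap.$i"
        sync
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}

# Usage on the board: one snapshot per second for a minute
#   snapshot_log /path/on/sd-card/dmesg-snaps 60 1 dmesg
```

Writing to storage that does not hang off the USB port under test (the SD card, for example) keeps the last snapshot readable after the board is power cycled.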

All in all this seems to be more broken than the musb port on a 2.6.28
kernel, where you need to use some file transfer over the network more
sophisticated than "dd if=/dev/<usb disk> | nc server port" for
things to blow up.

Cheers,
Kai

Frantisek_Dufka · August 19, 2009, 8:16am

Kai Blin wrote:

For me that doesn't seem to change much, if anything at all.

Yes, in my later tests at 600 MHz I had failures too (maybe it depends on the phase of the moon as well?). However, when I changed the clock to 500MHz, USB was rock solid for me. Maybe I used 500MHz in my first test too.

something like
echo 500000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
did the trick
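
For repeated testing, the frequency cap can be put into a helper that also reports what the kernel accepted. This is a sketch under assumptions: `cap_cpu_khz` and `CPUFREQ_DIR` are illustrative names, and the cpufreq attribute is taken to be `scaling_max_freq`, the name used in the shell session posted later in this thread.

```shell
#!/bin/sh
# Sketch: cap the cpufreq maximum and report what the kernel accepted.
# CPUFREQ_DIR and cap_cpu_khz are illustrative names.
CPUFREQ_DIR="${CPUFREQ_DIR:-/sys/devices/system/cpu/cpu0/cpufreq}"

cap_cpu_khz() {
    # $1 = maximum frequency in kHz (500000 = 500 MHz)
    case "$1" in
        ''|*[!0-9]*) echo "frequency must be an integer in kHz" >&2; return 2 ;;
    esac
    echo "$1" > "$CPUFREQ_DIR/scaling_max_freq" || return 1
    # The kernel may adjust the request to a supported operating point, so
    # print the value that is actually in effect, not the one requested.
    cat "$CPUFREQ_DIR/scaling_max_freq"
}

# Usage on the board (as root):
#   cap_cpu_khz 500000
```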

----8<----

[ 852.024475] WARNING: at arch/arm/mach-omap2/pm34xx.c:300
prcm_interrupt_handler+0xc4/0x100()
[ 852.042022] prcm: WARNING: PRCM interrupt received, but no code to
handle it (00340000)
[ 852.059143] Modules linked in: ircomm_tty ircomm irda ipv6

Had the same thing with no module inserted too. Usually I see one such error and everything still appears to work fine, but a few times I saw an endless stream of these and had to power cycle the board.

Also there is perhaps some power management issue with the serial port. The first character typed after an idle timeout is always garbled, so one needs to type without long pauses and retype the first character.

All in all this seems to be more broken

It is definitely not a kernel for normal usage.

Frantisek

Frantisek_Dufka · August 19, 2009, 8:21am

Frantisek Dufka wrote:

however when changing clock to 500MHz the usb was rock solid for me. Maybe on my first test I used 500MHz too.

something like
echo 500000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
did the trick

However, when thinking about it, it makes little sense, since the issue is related to the vdd2 voltage, which should be the same for 500 and 600; only vdd1 should change for those OPPs (?).

And BTW just to be sure I enabled SmartReflex also for vdd1 when testing, I'm not sure if it makes any difference.

Kiam_Peng_Wee · August 19, 2009, 9:17am

Hi!

Siarhei_Siamashka · August 19, 2009, 12:03pm

Here are my results, collected over the last few days.

I actually happen to have one of the boards with semi-stable EHCI. It does not
show any problems with all the configurations except when hdd and ethernet
adapter are connected to the same hub connected to EHCI port and actively
used together.

My primary kernel is derived from angstrom demo 2.6.28 with some additional
patches. I used u-boot blob from angstrom in the recent tests. It was reported
here that u-boot may introduce some usb related problems:
http://groups.google.com/group/beagleboard/browse_thread/thread/84e47409fa17bab8/914914384a3a6b50
But I could not see any difference in usb stability vs. the u-boot that I
used before, so probably my u-boot was already ok, or still not broken, or
something.

If I use usbnet connection via cdc_ether on OTG port and HDD is connected to
EHCI port (via hub, along with mouse, keyboard and bluetooth dongle), then I
could not break this setup with any of the tests. The tests included the 'dd'
copy of hdd to /dev/null, backup of dvd from pc via 'vobcopy -m' to sshfs
mounted beagle hdd, recompilation of gentoo rootfs on usb hdd ('emerge -e
world').

But connecting usb ethernet adapter to the same ehci hub and using it for
networking, breaks at least 'vobcopy -m' test. The test with 'dd' still seems
to be fine.

The experimental kernel provided by Koen is less stable with regard to
EHCI. It even fails the 'dd' test, which never happened to me before. But I tried
it at 600MHz only, with 'echo -n 1 > /sys/power/sr_vdd2_autocomp'. I will try
500MHz later.

Disclaimer: I have run only a limited number of iterations for each test, so
I expect that anything that I reported as 'working fine', may actually still
fail if tested more extensively
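
Since the number of iterations matters here, a small wrapper that repeats a test and counts failures makes results easier to compare between boards. This is a minimal sketch; `run_iterations` is an illustrative name and the dd line in the usage comment is just one possible test command.

```shell
#!/bin/sh
# Sketch: repeat a test command and count how often it fails.
# run_iterations is an illustrative name.
run_iterations() {
    # $1 = number of iterations, remaining arguments = command to run
    total=$1
    shift
    failures=0
    n=0
    while [ "$n" -lt "$total" ]; do
        if ! "$@" > /dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        n=$((n + 1))
    done
    echo "$failures/$total failed"
    [ "$failures" -eq 0 ]
}

# Usage: ten reads of the first GiB of the disk
#   run_iterations 10 dd if=/dev/sda of=/dev/null bs=1M count=1024
```

Once the port has dropped off the bus, later iterations will probably fail as well until the board is reset, so the position of the first failure is likely the more useful number.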

So if the goal is to have multiple USB peripheral devices connected to beagle
and working reliably, then trying to make use of the OTG host is the best solution
for me in the short run.

Frantisek_Dufka · August 19, 2009, 12:32pm

Siarhei Siamashka wrote:

I actually happen to have one of the boards with semi-stable EHCI. It does not
show any problems with all the configurations except when hdd and ethernet
adapter are connected to the same hub connected to EHCI port and actively
used together.

Then maybe there are more EHCI issues involved? HW and also buggy
kernel? With my board I'm quite happy if harddisk alone works directly
on EHCI port or together with keyboard attached to hub.

If I use usbnet connection via cdc_ether on OTG port and HDD is connected to
EHCI port (via hub, along with mouse, keyboard and bluetooth dongle), then I
could not break this setup with any of the tests.

So maybe your board does not have the vdd2 related issue and the
instability is elsewhere (kernel, another hw issue)?

So far for me, with smartreflex enabled on vdd2 and at 500MHz, it at least
did not fail the dd test (dd of whole 60GB and 320GB disks, done on two
different nights).

Disclaimer: I have run only a limited number of iterations for each test, so
I expect that anything that I reported as 'working fine', may actually still
fail if tested more extensively

Same here of course

Kai_Blin · August 19, 2009, 2:05pm

Here are my results, collected over the last few days.

I actually happen to have one of the boards with semi-stable EHCI. It does not
show any problems with all the configurations except when hdd and ethernet
adapter are connected to the same hub connected to EHCI port and actively
used together.

I'm seeing those kind of issues on a revB board on the OTG port and on
a revC board on the OTG port or the EHCI port. This might be a
completely separate problem.

...snip...

But connecting usb ethernet adapter to the same ehci hub and using it for
networking, breaks at least 'vobcopy -m' test. The test with 'dd' still seems
to be fine.

That seems to match my experience with USB ethernet and disk used at
the same time. As I said before I can reproduce that kind of problem
on the OTG port as well.

Cheers,
Kai

Nuno_Felicio · August 22, 2009, 12:20am

Hello, what is the state of the EHCI port? Does the SmartReflex
option finally solve the problem?

Thanks in advance

Duckyduck · August 22, 2009, 1:02pm

Hi Guys,

I’ve just done some experiments with the SR-enabled kernel; unfortunately it does NOT solve the problem for my board.
Output:
root@beagleboard:~# echo 1 > /sys/power/sr_vdd2_autocomp
root@beagleboard:~# cat /sys/power/sr_vdd2_autocomp
1
root@beagleboard:~# dd if=/dev/sda1 of=/dev/null count=100 ibs=1M
100+0 records in
204800+0 records out
104857600 bytes (105 MB) copied, 6.16533 seconds, 17.0 MB/s
root@beagleboard:~# dd if=/dev/sda1 of=/dev/null count=100 ibs=1M
100+0 records in
204800+0 records out
104857600 bytes (105 MB) copied, 1.29066 seconds, 81.2 MB/s
root@beagleboard:~# dd if=/dev/sda1 of=/dev/null count=100 ibs=1M
100+0 records in
204800+0 records out
104857600 bytes (105 MB) copied, 1.2909 seconds, 81.2 MB/s
root@beagleboard:~# dd if=/dev/sda1 of=/dev/null count=1000 ibs=1M
[ 267.249420] hub 1-0:1.0: port 2 disabled by hub (EMI?), re-enabling…
[ 267.260040] usb 1-2: USB disconnect, address 2
[ 267.268341] usb 1-2.1: USB disconnect, address 7
[ 267.278594] sd 1:0:0:0: [sda] Unhandled error code
[ 267.287414] sd 1:0:0:0: [sda] Result: hostbyte=0x07 driverbyte=0x00
[ 267.297790] end_request: I/O error, dev sda, sector 332503
[ 267.307403] Buffer I/O error on device sda1, logical block 41555
[ 267.318420] usb 1-2: clear tt 4 (9081) error -19
[ 267.327331] usb 1-2: clear tt 4 (9081) error -19
[ 267.336059] usb 1-2: clear tt 4 (9081) error -19
[ 267.353179] sd 1:0:0:0: [sda] Unhandled error code
[ 267.362213] sd 1:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
[ 267.373016] end_request: I/O error, dev sda, sector 332511
[ 267.382812] Buffer I/O error on device sda1, logical block 41556
[ 267.393157] Buffer I/O error on device sda1, logical block 41557
[ 267.403411] Buffer I/O error on device sda1, logical block 41558
[ 267.413543] Buffer I/O error on device sda1, logical block 41559
[ 267.423461] Buffer I/O error on device sda1, logical block 41560
[ 267.433227] Buffer I/O error on device sda1, logical block 41561
[ 267.442749] Buffer I/O error on device sda1, logical block 41562
[ 267.452117] Buffer I/O error on device sda1, logical block 41563
[ 267.461303] Buffer I/O error on device sda1, logical block 41564
[ 267.470794] sd 1:0:0:0: [sda] Unhandled error code
[ 267.478546] sd 1:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
[ 267.487792] end_request: I/O error, dev sda, sector 332751
dd: reading `/dev/sda1’: Input/output error
162+1 records in
332440+0 records out
170209280 bytes (170 MB) copied, 6.16784 seconds, 27.6 MB/s
root@beagleboard:~# [ 267.690765] usb 1-2.4: USB disconnect, address 8

After this the EHCI power died till a reset…
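
A note on reading the figures above: the 81.2 MB/s of the second and third run is more than USB 2.0 high speed can carry (480 Mbit/s, i.e. 60 MB/s before protocol overhead), so those two runs were most likely served from the page cache and did not exercise the EHCI port at all. The "204800+0 records out" is simply 104857600 bytes divided by dd's default 512-byte output block size. Below is a sketch for checking a reported rate and for forcing a real read; the function names are made up, and `/proc/sys/vm/drop_caches` needs root.

```shell
#!/bin/sh
# USB 2.0 high speed signals at 480 Mbit/s = 60 MB/s before protocol overhead.
USB2_LIMIT_MBPS=60

faster_than_usb2() {
    # $1 = throughput in MB/s as printed by dd; succeeds if above the bus limit
    awk -v rate="$1" -v limit="$USB2_LIMIT_MBPS" 'BEGIN { exit !(rate > limit) }'
}

uncached_read() {
    # $1 = block device, $2 = number of 1 MiB blocks to read. Needs root.
    sync
    echo 3 > /proc/sys/vm/drop_caches   # discard page cache, dentries, inodes
    dd if="$1" of=/dev/null bs=1M count="$2"
}

# Usage on the board (as root):
#   faster_than_usb2 81.2 && echo "this run came from cache"
#   uncached_read /dev/sda1 100
```

Dropping the cache before each run, or reading a different region of the disk every time, keeps each dd run an actual USB transfer.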

Gerald/Koen → Is there still hope for my board? Is this really THE Final solution?

Wkr,
Joep

Gerald_Coley1 · August 22, 2009, 1:05pm

We believe so, but maybe not with all kernels, so we need to do some more work. It appears that while on some kernels the dd test passes when you make the change, something else is causing issues when you run EHCI. We consider the dd test the definitive test for this issue. It has been a tough one to track down because most boards do not have it. We have a feel for what it is, but it looks like there may be some other issues in the SW with EHCI that show up.
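
For anyone who wants to run the dd test in a repeatable way, here is a minimal sketch. It is not an official test procedure: the function names are made up, and the log pattern is simply the "disabled by hub (EMI?)" message seen in the logs posted in this thread.

```shell
#!/bin/sh
# Sketch: count hub-disable events in kernel log text read from stdin.
# The message pattern is taken from the logs posted in this thread.
count_hub_disables() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -F -c 'disabled by hub (EMI?)' || true
}

ehci_dd_test() {
    # $1 = block device, $2 = number of 1 MiB blocks to read
    before=$(dmesg | count_hub_disables)
    dd if="$1" of=/dev/null bs=1M count="$2"
    status=$?
    after=$(dmesg | count_hub_disables)
    echo "dd exit status: $status, new hub disables: $((after - before))"
    [ "$status" -eq 0 ] && [ "$after" -eq "$before" ]
}

# Usage on the board (as root):
#   ehci_dd_test /dev/sda1 1000
```

The before/after comparison is only approximate, because old messages can drop out of the kernel ring buffer during a long run.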

Gerald

Duckyduck · August 22, 2009, 1:16pm

Hi Gerald,

Some more information that may help:
EHCI → HUB (powered) → HDD
In the first test (posted above) a USB-to-Ethernet adapter was also connected to the hub.

U-boot:
Texas Instruments X-Loader 1.4.2 (Feb 19 2009 - 12:01:24)
Reading boot sector
Loading u-boot.bin from mmc

U-Boot 2009.01-00013-g52eddcd (Feb 03 2009 - 22:22:56)
OMAP3530-GP rev 2, CPU-OPP2 L3-165MHz
OMAP3 Beagle board + LPDDR/NAND
DRAM: 256 MB
NAND: 256 MiB
In: serial
Out: serial
Err: serial
Board revision C
Serial #491c00030000000004013f8a17014005
Hit any key to stop autoboot: 0
reading uImage
2934668 bytes read

Booting kernel from Legacy Image at 80300000 …

Image Name: Angstrom/2.6.29/beagleboard
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 2934604 Bytes = 2.8 MB
Load Address: 80008000
Entry Point: 80008000
Verifying Checksum …

Second test (w/o USB->Ethernet connected):

root@beagleboard:/sys/devices/system/cpu/cpu0/cpufreq# echo 500000 > scaling_max_freq
[ 129.536132] SR1: VDD autocomp is not active
root@beagleboard:/sys/devices/system/cpu/cpu0/cpufreq# echo 1 > /sys/power/sr_vdd2_autocomp
root@beagleboard:/sys/devices/system/cpu/cpu0/cpufreq# echo 1 > /sys/power/sr_vdd1_autocomp
root@beagleboard:/sys/devices/system/cpu/cpu0/cpufreq# echo 500000 > scaling_max_freq
root@beagleboard:/sys/devices/system/cpu/cpu0/cpufreq# cd /

root@beagleboard:/# dd if=/dev/sda1 of=/dev/null count=1000 ibs=1M
1000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 52.9087 seconds, 19.8 MB/s

root@beagleboard:/# dd if=/dev/sda1 of=/dev/null count=1000 ibs=10M
[ 355.735931] hub 1-0:1.0: port 2 disabled by hub (EMI?), re-enabling…
[ 355.747070] usb 1-2: USB disconnect, address 2
[ 355.755859] usb 1-2.1: USB disconnect, address 3
[ 355.767730] sd 0:0:0:0: [sda] Unhandled error code
[ 355.776855] sd 0:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
[ 355.787261] end_request: I/O error, dev sda, sector 4568031
[ 355.797027] Buffer I/O error on device sda1, logical block 570996
[ 355.807281] Buffer I/O error on device sda1, logical block 570997
[ 355.817382] Buffer I/O error on device sda1, logical block 570998
[ 355.827392] Buffer I/O error on device sda1, logical block 570999
[ 355.837341] Buffer I/O error on device sda1, logical block 571000
[ 355.847229] Buffer I/O error on device sda1, logical block 571001
[ 355.857086] Buffer I/O error on device sda1, logical block 571002
[ 355.866973] Buffer I/O error on device sda1, logical block 571003
[ 355.876678] Buffer I/O error on device sda1, logical block 571004
[ 355.886230] Buffer I/O error on device sda1, logical block 571005
[ 355.896331] sd 0:0:0:0: [sda] Unhandled error code
[ 355.904571] sd 0:0:0:0: [sda] Result: hostbyte=0x01 driverbyte=0x00
[ 355.914337] end_request: I/O error, dev sda, sector 4568271
dd: reading `/dev/sda1’: Input/output error
223+1 records in
4567968+0 records out
2338799616 bytes (2.3 GB) copied, 117.905 seconds, 19.8 MB/s
root@beagleboard:/#
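
The numbers in the failed runs are consistent with each other. Assuming 4096-byte logical blocks and 512-byte sectors, the first failing logical block 570996 on sda1 is sector 570996 * 8 = 4567968 of the partition, which is exactly the "4567968+0 records out" that dd reports. The kernel's device sector 4568031 is 63 sectors further on, which would fit a partition starting at sector 63 (an assumption, not confirmed in the thread). The earlier run gives the same offset: 41555 * 8 = 332440 and 332503 - 332440 = 63. So dd read cleanly right up to the moment of the disconnect. The arithmetic as a sketch, with made-up function names:

```shell
#!/bin/sh
# Sketch: convert a "logical block" number from a Buffer I/O error line
# into 512-byte sectors. The 4096-byte block size is an assumption.
BLOCK_BYTES=4096
SECTOR_BYTES=512

block_to_sector() {
    # $1 = logical block number within the partition
    echo $(( $1 * BLOCK_BYTES / SECTOR_BYTES ))
}

partition_offset() {
    # $1 = device sector from "end_request", $2 = logical block on the partition
    echo $(( $1 - $(block_to_sector "$2") ))
}

# Usage:
#   block_to_sector 570996
#   partition_offset 4568031 570996
```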

So still no luck
I’ll go back into “hope mode” until you guys come up with new things to test

Wkr,
Joep

Soren_Steen_Christen · August 22, 2009, 2:23pm

Hello, what is the state of the EHCI port? Does the SmartReflex
option finally solve the problem?

Gerald Coley wrote:

We believe so, but maybe not with all kernels. So we need to do some more work. It appears that while on some kernels, the dd test passes when you make the change, something else is causing some issues when you run EHCI. We consider the dd test the definitive test for this issue. This issue has been a tough one to track down because most boards do not have this issue. We have a feel for what it is, but it looks like there may be some other issues out there in the SW with EHCI that show up.

Hi Gerald,

Can you elaborate a bit more on what's the root cause of this problem? To be
honest I find it a bit strange that enabling Smart Reflex should fix the
problem. It might very well be the case, but with all the trouble we have
had with EHCI until now, I would like to know why this "fine-tuning"
mechanism can solve the EHCI problems?

In my current opinion either the EHCI implementation (OMAP or PHY) must be
running just at the edge and thereby by concept be "unstable"(?), or
SmartReflex is doing a lot more than I currently understand...

Would a proper fix of the problem require another/new HW design compared to
what's on Beagle Rev C3 or is the SR SW change considered the final fix?

To be honest I'm a bit scared, that SmartReflex is just hiding the problem
even more, instead of fixing it. I hope you can tell/explain otherwise by
elaborating a bit more on the technicalities behind the problem (and why Smart
Reflex fixes it :-)?

I think that would be appreciated by many persons on the list - At least I
would highly appreciate to get this info...

Best regards and thanks in advance
Søren

Siarhei_Siamashka · August 22, 2009, 3:15pm

Siarhei Siamashka wrote:
> I actually happen to have one of the boards with semi-stable EHCI. It
> does not show any problems with all the configurations except when hdd
> and ethernet adapter are connected to the same hub connected to EHCI port
> and actively used together.

Then maybe there are more EHCI issues involved? HW and also buggy
kernel?

Yes, I'm pretty sure that there may be still lots of problems in SW
and HW. That's why doing more tests with different boards and kernels
and sharing all this information may help to get a better understanding
about what is happening.

As an additional experiment, I also compiled linux kernel for ps3
(another gadget running linux) from exactly the same 2.6.28 sources
as I'm using on beagle. I tried to use the identical kernel config
options where possible, especially those options which are usb related.
The result is that all tests pass fine on ps3.

This pretty much confirms that my usb hub, hdd, ethernet adapter
and all the other usb peripherals are most likely ok. Additionally,
it reduces the likelihood of having some kind of "crossplatform"
bug or regression in this particular kernel snapshot.

So most likely the problem is somewhere in beagle hardware or
OMAP-specific drivers.

With my board I'm quite happy if harddisk alone works directly
on EHCI port or together with keyboard attached to hub.

Just for additional verification, and to be sure that your kernel is not
obviously broken, can you try to run some tests on your board with
the following kernel?
http://siarhei.siamashka.name/files/20090820/beagle-kernel/
That's the kernel which I'm using at the moment.

Also have you tried OTG host on your board?

Siarhei_Siamashka · August 22, 2009, 3:28pm

Please have a look at
http://groups.google.com/group/beagleboard/msg/8cafc08c5cc04954
and
http://groups.google.com/group/beagleboard/browse_thread/thread/7adbfcd2162dae28

Koen can probably clarify what's the relationship between u-boot versions and
usb problems. That is, if he considers us worthy of any kind of reply, of
course.

Gerald_Coley1 · August 22, 2009, 4:52pm

It appears to be about noise reflected back onto the 1.8V rail that causes the PHY to lock up. Reducing the current consumed by the OMAP, either by running at a slower speed or by lowering VDD2, reduces the noise on the 1.8V rail and removes the lockup. It is all about asynchronous events lining up at once to give us the issue. It does not occur on all boards, only on some, depending on whether the OMAP used is a hot or cold die.

The SR is one fix of about 7 or 8. Some of these fixes work on some boards and not on others. I think we are dealing with 4 to 5 sources of the issue, depending on a lot of variables.

I appreciate your feedback, but we have been looking at this issue for 4 solid months now with the design teams and SMSC. I could type for four days and explain what we have tried and not found to be the issue. I am not inclined to go into all that now as it won’t really accomplish a thing. My goal is to find the issue and communicate the solution. There will be several things done in HW on the next revision, some related to this issue and some not. Some may work and some may not. This is a tough one that is outside the experience most people have had with this type of issue. There are too many ways to fix it, so right now we are looking for the best one that works on all bad boards.

Gerald

John_USP · August 22, 2009, 7:18pm

It appears to be about noise reflected back onto the 1.8V rail that causes the PHY to lock up. Reducing the current consumed by the OMAP, either by running at a slower speed or by lowering VDD2, reduces the noise on the 1.8V rail and removes the lockup. It is all about asynchronous events lining up at once to give us the issue. It does not occur on all boards, only on some, depending on whether the OMAP used is a hot or cold die.

OK, so what is a hot die and what is a cold die? Also, noise is normally related to PCB layout, power plane stacking, decoupling capacitor placement and decoupling capacitor ESR. I’m not sure I understand how the relationship between voltage/current and noise can make that much difference. Perhaps there is a race condition that is causing some sort of contention and drawing current during the transition. Maybe reducing the voltage changes the rise/fall times and helps reduce the contention. Just thinking out loud.
