USB EHCI problems

Nuno_Felicio · August 22, 2009, 11:50pm

Gerald,

Thanks for the info. I have been in contact with other users of the OMAP
on other dev boards, and from what I can see this is a hardware matter:
the OMAP and/or the companion chips are the problem. There's one
thing that I sincerely don't understand: TI has created a wonderful
piece of hardware, but it seems unable to deliver a working EHCI
port... Is that so complex in comparison with the rest?

I'm a bit frustrated at this moment.... Thank God the board I'm using
has an Ethernet port and WiFi on board..... I have run some tests and,
surprise, the EHCI on this board also fails ARGHHHHH....

Nuno

Gerald_Coley1 · August 23, 2009, 1:23am

Hot and cold die refer to the amount of leakage a particular wafer has. That is what SmartReflex does: it sets the voltage based on how much current the die needs. A hot die consumes more current. I agree with your typical noise analysis, except that in this case it looks as if the noise is a result of the need for instantaneous current from the OMAP when all of these different things switch at once. This puts a spike back out onto the 1.8V rail, which causes the SMSC PHY to lock up. Depending on the current consumption of the OMAP, based on hot or cold die, DRAM activity, processor speed, etc., there may or may not be an issue. We do not see this issue on all boards. If we have a bad board, say per your thoughts a crappy layout, and we replace the processor, it works. No more issue. So, how did this fix the layout?

We have checked all the timings over the last four months until we are blue in the face. No issue on the SMSC side or the OMAP side. This is not your typical noise issue. We already use low ESR capacitors, just like the datasheet says for the TPS65950. If I increase the capacitance to, say, 33uF on a very bad board, it fixes it on 60% of the boards we try this on, but not all of the boards. If we enable SmartReflex on VDD2, it fixes this on some boards, but not all boards. If we lower the frequency of the ARM to, say, 250MHz, it seems to fix it on most boards. Obviously this is not an option.

Also, you can’t think of race conditions as an issue. OMAP is nothing but a race. With the ARM, MMU, SGX, DRAM, LCD, and EHCI all running at different frequencies, there is always a point where everything can be switching at once. The only area where we see an issue is on the EHCI. We know that in some cases the port only fails if you are running the SGX (more current) at the same time. The OMAP EHCI is not what is locking up, because a reset does not clear the condition. It requires a power cycle. We do have a SW fix that will unlock the PHY and restore the port, but I am not in favor of that one. I would rather stop the condition altogether.

So, we are fighting to find the fix that fixes it on all boards. That is what is taking so long to get to the bottom of this one. Just for the record, the Beagle is not the board with this issue. In fact, Beagle is more stable than most of the other boards.

Gerald

Gerald_Coley1 · August 23, 2009, 1:54am

I have explained the situation many times here. It is not based on the complexity of EHCI. It is sysem level issue with many many facets.

Gerald

Howard_Harte · August 23, 2009, 3:56am

Did you try a 1uF ceramic cap in parallel with the low ESR 33uF cap? Or two 16uF caps in parallel to lower the ESR even more? That is definitely a difficult issue to solve. Good luck with it.

-Howard
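
For reference, the effect of paralleling identical capacitors mentioned above can be worked through numerically. This is only a rough sketch: the helper name `parallel_identical_caps` and the 0.1 ohm ESR figure are illustrative assumptions, not values from the thread or from any datasheet, and the model ignores lead and trace inductance.

```python
def parallel_identical_caps(capacitance_uf, esr_ohms, count):
    """Combine `count` identical capacitors wired in parallel.

    Capacitance adds up, while the equivalent series resistance (ESR)
    divides by the number of parts, like identical resistors in parallel.
    Returns (total_capacitance_uF, combined_esr_ohms).
    """
    total_capacitance = capacitance_uf * count
    combined_esr = esr_ohms / count
    return total_capacitance, combined_esr


# Two 16uF parts with an ASSUMED ESR of 0.1 ohm each.
total_c, esr = parallel_identical_caps(16.0, 0.1, 2)
print(f"{total_c:.0f} uF, ESR {esr:.3f} ohm")
```

With these assumed numbers the pair gives about the same bulk capacitance as a single 33uF part at half the ESR of one 16uF part. The sketch deliberately does not cover unequal parts such as a 1uF ceramic next to a 33uF capacitor, because that combination is frequency dependent.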

John_USP · August 23, 2009, 6:49am

> Also, you can’t think of race conditions as an issue. OMAP is nothing but a race. With the ARM, MMU, SGX, DRAM, LCD, and EHCI all running at different frequencies, there is always a point where everything can be switching at once. The only area where we see an issue is on the EHCI. We know that in some cases the port only fails if you are running the SGX (more current) at the same time. The OMAP EHCI is not what is locking up, because a reset does not clear the condition. It requires a power cycle. We do have a SW fix that will unlock the PHY and restore the port, but I am not in favor of that one. I would rather stop the condition altogether.

The race condition I’m referring to is one where one circuit is driving a logic 0 and then is supposed to tristate before another circuit drives that same circuit to a logic 1, or vice versa. In some cases, there is insufficient dead time because of stray capacitance, saturated drive circuits, etc. In this case, both circuits are in conflict and cause a large current spike, potentially causing ground bounce. This causes all the circuits sourcing current from that ground pin to act irregularly. What you would be looking for is not just noise in the 1.8V plane, but also noise in the ground plane itself. Decoupling capacitors cannot deal with this situation.
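
As a rough first-order illustration of the ground-bounce argument above, the voltage developed across a shared ground inductance is V = L * di/dt. The sketch below just evaluates that formula. The function name and all numbers in it (2 nH, 100 mA, 1 ns) are assumptions chosen for illustration, not measurements from any of the boards discussed.

```python
def bounce_voltage(inductance_nh, delta_current_ma, edge_time_ns):
    """Estimate ground bounce with V = L * di/dt.

    inductance_nh    -- shared ground path inductance in nanohenries
    delta_current_ma -- size of the current step in milliamps
    edge_time_ns     -- time over which the step happens in nanoseconds
    Returns the induced voltage in volts.
    """
    inductance_h = inductance_nh * 1e-9
    delta_current_a = delta_current_ma * 1e-3
    edge_time_s = edge_time_ns * 1e-9
    return inductance_h * delta_current_a / edge_time_s


# ASSUMED values: 2 nH of ground inductance, a 100 mA step in 1 ns.
print(f"{bounce_voltage(2.0, 100.0, 1.0):.2f} V")
```

Because the result scales with the current step and inversely with the edge time, a brief driver conflict can move the local ground reference noticeably even when the supply rail itself looks clean.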

Gerald_Coley1 · August 23, 2009, 12:23pm

Yes. We actually have a spreadsheet with a list of experiments on it. All sorts of different combinations of things. Keep in mind that this issue is on other boards as well, and it is funny how it only shows up on about 40% of the boards of each board type. It is very consistent. Beagle is currently one of the boards built in the largest volume and the closest to us, so we have unlimited visibility into it. As I said, a LOT of things work, but we can’t seem to find one thing that works on all of the boards. It is very frustrating!

Gerald

Gerald_Coley1 · August 23, 2009, 1:03pm

I agree with your thoughts. But we do know that changing capacitance does fix the issue on some boards, just as changing VDD2 settings, isolating the 1.8V on the PHY from the OMAP, and lowering the processor speed each work on some boards. In all these instances the circuitry inside the OMAP and the PHY is still running at the same speed.

Gerald

Frantisek_Dufka · August 24, 2009, 6:15am

Siarhei Siamashka wrote:
> Just for additional verification and to be sure that your kernel is not
> obviously broken. Can you try to run some tests on your board with
> the following kernel?
> http://siarhei.siamashka.name/files/20090820/beagle-kernel/
> That's the kernel which I'm using at the moment.

This works just like the kernel from Google Code Archive - Long-term storage for Google Code Project Hosting.

EHCI dies in first seconds of dd test and I can't enable smartreflex or set different cpu clock so there is nothing more to test.

> Also have you tried OTG host on your board?

Not yet, I don't have a cable with a grounded 5th pin, and it looks like beagleboard kernels cannot switch USB mode to host in software via
echo host >/sys/devices/platform/musb_hdrc/mode like on the Nokia N810.

Frantisek

Siarhei_Siamashka · August 25, 2009, 4:41pm

Frantisek Dufka wrote:
> Siarhei Siamashka wrote:
> > Just for additional verification and to be sure that your kernel is not
> > obviously broken. Can you try to run some tests on your board with
> > the following kernel?
> > http://siarhei.siamashka.name/files/20090820/beagle-kernel/
> > That's the kernel which I'm using at the moment.
>
> This works just like the kernel from
> Google Code Archive - Long-term storage for Google Code Project Hosting.
>
> EHCI dies in first seconds of dd test and I can't enable smartreflex or
> set different cpu clock so there is nothing more to test.

OK, thanks. I just wanted to confirm that the only difference between our
boards is HW stability of the EHCI port. The fact that the SmartReflex
workaround helps you (a bit) is also useful information.

Now I wonder if my board is actually considered 'stable' (with a software bug
somewhere in the kernel) or 'slightly-broken'.

To get more statistics, it would be interesting to know if anybody is actually
able to use a USB ethernet adapter without issues together with a USB HDD or
even a USB flash stick. It can be checked by just copying lots of data in either
direction over the network using ssh. If the majority of boards actually have a
stable EHCI port, it should not be too hard to confirm. And that will save a bit
of my time spent trying to debug the software part.

I tried two different USB hubs and two different ethernet adapters: D-link
DUB-100E rev.A3 and A-link NA1GU (both are 'asix' based, but use different
chips). In all cases it fails (usb hub disconnects and requires device reboot
to recover).

Any success stories with USB ethernet on beagle EHCI port under high load?

Just to make things clear. My board passes the following test:
http://groups.google.com/group/beagleboard/msg/60e488906d42a3de
And if I have an unstable board after all, it means that 'dd' test
is not "aggressive" enough.

It would be nice to ensure that EHCI is bug free on the next revision of the
board. EHCI on the current boards may be even a lost cause, but honestly I
don't care that much. It was a free addition, and it is still somewhat useful
(at least for me).

> > Also have you tried OTG host on your board?
>
> Not yet, I don't have cable with grounded 5th pin

Same here.

They have this cable at Digikey:
http://search.digikey.com/scripts/DkSearch/dksus.dll?Detail&name=WM17135-ND
But ordering it alone to Europe seems to be a bit of an overkill.

> and it looks like beagleboard kernels cannot switch usb mode to host in
> software via echo host >/sys/devices/platform/musb_hdrc/mode like on Nokia
> N810.

Thanks for the hint. Indeed, my "USB A-female <-> mini-B" adapter works fine
with N800 using the following instructions:
http://muru.com/linux/n800-usb-host/

So at least it confirms that my adapter does not have any mechanical
problems. Additionally it should be possible to try tracing the execution
path in musb on both beagleboard and N800 to figure out what may be different.
Trying different kernels and patches on both beagleboard and N800 may be
useful too.

Somehow it feels like it is perceived here that finding the right connector or
soldering pins is the only option for enabling OTG host. A purely SW solution
must exist too. I'm almost sure that the ID pin is not hardwired in any way, but
serves only an informational purpose. The kernel just needs to be kicked in some
way to power up VBUS, probably using something like this:
http://patchwork.kernel.org/patch/6349/

But this is kind of offtopic here, there are also threads about OTG host in
the list

Gerald_Coley1 · August 25, 2009, 4:53pm

I have lots of Beagles that work great with a USB Thumbdrive. I also have Beagles that do not. If you want to determine whether this might be a SW or HW issue, I suggest you use the validation kernel

http://code.google.com/p/beagleboard/wiki/BeagleboardRevCValidation

and run the following test repeatedly to see if you can get it to fail.

dd if=/dev/sda of=/dev/null bs=1M count=1M

Gerald
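
One way to run the dd test repeatedly and record how many passes complete before the first failure is sketched below. This is a minimal sketch, not the validation procedure itself: the helper name `run_until_failure` and its `timeout_s` parameter are hypothetical, `/dev/sda` is assumed to be the drive on the EHCI port, and it assumes a port failure shows up either as a non-zero dd exit status or as a hang.

```python
import subprocess

# The dd invocation quoted in the thread; /dev/sda is ASSUMED to be the
# USB drive attached to the EHCI port.
DD_COMMAND = ["dd", "if=/dev/sda", "of=/dev/null", "bs=1M", "count=1M"]


def run_until_failure(command, max_iterations, timeout_s=None):
    """Run `command` up to `max_iterations` times.

    Returns the number of runs that completed with exit status 0 before
    the first failure. A run that exceeds `timeout_s` seconds (if given)
    is treated as a failure, since a locked-up port may hang, not error.
    """
    passes = 0
    for _ in range(max_iterations):
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return passes
        if result.returncode != 0:
            return passes
        passes += 1
    return passes
```

On the board, something like `run_until_failure(DD_COMMAND, 1000)` would report how many full reads succeeded. A board that reaches the iteration limit has only passed this particular test, which says nothing about heavier mixed loads.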

Duckyduck · August 25, 2009, 5:52pm

Try dd if=/dev/sda of=/dev/null bs=1M count=1000

My board fails after about 170MB.
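
To report a figure like "fails after about 170MB" consistently across boards, the read can also be done in a small script that counts bytes until the first I/O error. This is a hedged sketch: the function name `bytes_read_before_error` is hypothetical, the defaults mirror `bs=1M count=1000`, and it assumes a port failure surfaces as an `OSError` on read.

```python
def bytes_read_before_error(path, block_size=1024 * 1024, max_blocks=1000):
    """Read up to `max_blocks` blocks of `block_size` bytes from `path`.

    Returns (bytes_read, error), where `error` is None if the read
    finished cleanly or the OSError that interrupted it.
    """
    total = 0
    try:
        # buffering=0 keeps Python from adding its own read-ahead buffer.
        with open(path, "rb", buffering=0) as device:
            for _ in range(max_blocks):
                chunk = device.read(block_size)
                if not chunk:
                    break  # end of device or file
                total += len(chunk)
    except OSError as exc:
        return total, exc
    return total, None
```

Calling `bytes_read_before_error("/dev/sda")` on the board would return the byte count reached and the error, if any. Unlike dd, this does not print throughput, so it complements the dd test and does not replace it.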

Siarhei_Siamashka · August 26, 2009, 10:05am

I think I already made it clear that 'dd' test passes on my board with
various combinations of kernel configurations (including your validation
kernel) and various types of hardware connected to EHCI port.

Now it would be nice if you (or anybody else) could do the following:
1. Get a good board with 'stable' EHCI port
2. Prepare a beagleboard setup, which uses validation kernel
3. Run 'dd' test and make sure that it passes fine
4. Get some USB hub, connect USB HDD (or USB flash) and USB ethernet adapter
to it, connect the hub to EHCI port
5. Install ssh daemon on the device and make sure that network is up and
running
6. Create a new empty ext3 partition on /dev/sda1 and mount it
7. Try to copy lots of data from PC over network to this freshly created
partition using scp (my board fails after copying ~4-5GB; transferring
20-30GB should probably be sufficient)
8. Report the results here

I'm tempted to add more comments about the possible outcomes and what could be
causing them, but I'll stop here. The ball is in your court now.

Nuno2 · August 26, 2009, 11:14am

Hello, one quick question: is it possible on a Rev C beagleboard to cut
the track that provides 1.8V VDD to the PHY and use another source to
provide a stable VCC to the PHY?

If the problem really is noise on the VDD line, that would solve all
the problems....

Is the VDD line accessible?

Gerald_Coley1 · August 26, 2009, 11:26am

Yes, you can. But the only 1.8V rail available is the 1.8V rail that it is currently on. The LDO outputs from the TPS65950 are not accessible. We have tried doing this and connecting it back to C97. It has improved a few boards, but not enough to get the board to pass the dd test over a lot of iterations (>1000).

This has been done on the OMAP3530 EVM and it has seemed to solve the issue. But, it is not something we can do on a Rev C3 board.

Gerald

Steve_Sakoman · August 26, 2009, 1:40pm

Have you tried a ferrite bead on the 1.8V rail between the OMAP and
PHY? I've used this technique in similar situations in the past.
Caps required on both sides of the bead of course.

Steve

Gerald_Coley1 · August 26, 2009, 3:07pm

Yes, we have, but it did not help.

Gerald

Nuno_Felicio · August 26, 2009, 10:07pm

Gerald,

And have you tried a completely separate power supply just to supply
the PHY? If that works, for me it's OK =).
I will adapt my boards for that, and be very happy =). It's just a
matter of adding a converter from 5V to 1.8V.

Nuno

Gerald_Coley1 · August 26, 2009, 10:31pm

Yes, and it seems to solve the issue. But it is not something that we can do on the Rev C3 Beagle. At some point I plan to move to an LDO on the TPS65950, but it is not accessible on the Rev C3 Beagle. Before I say it is the solution, I will need to see it work on 100s of boards first.

Please let us know if this solves the issue on your board!

Gerald

Kiam_Peng_Wee · August 27, 2009, 10:00am

Hi,

Gerald_Coley1 · August 27, 2009, 12:06pm

Only the 1.8V. The 3.3V is generated by the PHY, so please do not hook anything to that or you will blow up the device. You will not be able to capture any events on the voltage rails and tie those to this issue. It is something you cannot see using a scope.

Gerald