Boot failure "external abort on non-linefetch" in cpsw_probe with any image after Wi-Fi install

More information:

@Victor:

Thanks for your comments @Loren.

It’s quite odd that the board boots fine with old kernels and not with new ones, isn’t it? I tried the images available at http://www.armhf.com/index.php/boards/beaglebone-black/:

[ 2.477446] registered taskstats version 1
[ 2.538597] davinci_mdio 4a101000.mdio: davinci mdio revision 1.6
[ 2.545147] davinci_mdio 4a101000.mdio: no live phy, scanning all
[ 2.552441] davinci_mdio: probe of 4a101000.mdio failed with error -5
[ 2.559885] Detected MACID = bc:6a:29:84:8d:3a
[ 2.565340] Unhandled fault: external abort on non-linefetch (0x1008) at 0xd0894000
[ 2.573651] Internal error: : 1008 [#1] SMP ARM
[ 2.578448] Modules linked in:
[ 2.581705] CPU: 0 Not tainted (3.8.13-bone30 #1)
[ 2.587076] PC is at cpsw_probe+0x528/0xbc8
[ 2.591518] LR is at ioremap_page_range+0xd8/0x16c
[ 2.596595] pc : [] lr : [] psr: a0000113
[ 2.596595] sp : cf05de38 ip : cf04d250 fp : cf42f298
[ 2.608723] r10: 00000001 r9 : cf42f540 r8 : d0894000
[ 2.614252] r7 : cf113800 r6 : 00000000 r5 : cf113810 r4 : cf42f000
[ 2.621154] r3 : 00000000 r2 : 00000000 r1 : 4a100e13 r0 : d0894000
[ 2.628061] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 2.635788] Control: 10c5387d Table: 80004019 DAC: 00000015
[ 2.641865] Process swapper/0 (pid: 1, stack limit = 0xcf05c240)
[ 2.648216] Stack: (0xcf05de38 to 0xcf05e000)
[ 2.652831] de20: 00000000 00000000
[ 2.661486] de40: cf447c08 cf42f540 00000000 c014c2f0 22222222 00000020 00000000 cf447c88
[ 2.670140] de60: cf447c08 cf447c08 00000008 c014c1e0 00000000 cf447c08 cf112488 cf446d40
[ 2.678794] de80: 00000000 c014cba0 cf0474b8 c005e608 00000000 00000003 cf112488 00000000
[ 2.687446] dea0: c0a171ec cf113810 cf113818 cf113810 cf113844 c0a171ec c098c5b4 c09a9dc0
[ 2.696099] dec0: 00000000 cf05c008 00000000 c037c480 00000000 cf113810 cf113844 c098c5b4
[ 2.704751] dee0: 00000000 c037c66c 00000000 c098c5b4 c037c604 c037acbc cf047478 cf111c80
[ 2.713405] df00: c098c5b4 cf446d40 c0981ff0 c037bc44 c07ee6aa c07ee6aa 00000000 c098c5b4
[ 2.722060] df20: c090ebe8 c09218d4 c0901ed4 c037cbb8 00000007 c090ebe8 c09218d4 c0901ed4
[ 2.730716] df40: c09a9dc0 c0008894 c0901ed4 0000f442 c0921900 00000008 00000007 c090ebe8
[ 2.739369] df60: c09218d4 c09a9dc0 c09a9dc0 000000f1 c090ebf0 c08dc918 00000007 00000007
[ 2.748020] df80: c08dc270 00000000 00000000 c05fd740 00000000 00000000 00000000 00000000
[ 2.756672] dfa0: 00000000 c05fd748 00000000 c000d478 00000000 00000000 00000000 00000000
[ 2.765323] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 2.773974] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffeffae6 5fbfaaaa
[ 2.782667] [] (cpsw_probe+0x528/0xbc8) from [] (driver_probe_device+0xa4/0x1e4)
[ 2.792340] [] (driver_probe_device+0xa4/0x1e4) from [] (__driver_attach+0x68/0x8c)
[ 2.802304] [] (__driver_attach+0x68/0x8c) from [] (bus_for_each_dev+0x70/0x84)
[ 2.811885] [] (bus_for_each_dev+0x70/0x84) from [] (bus_add_driver+0xdc/0x218)
[ 2.821463] [] (bus_add_driver+0xdc/0x218) from [] (driver_register+0x9c/0x124)
[ 2.831044] [] (driver_register+0x9c/0x124) from [] (do_one_initcall+0x8c/0x150)
[ 2.840729] [] (do_one_initcall+0x8c/0x150) from [] (kernel_init_freeable+0x108/0x1cc)
[ 2.850979] [] (kernel_init_freeable+0x108/0x1cc) from [] (kernel_init+0x8/0xe4)
[ 2.860665] [] (kernel_init+0x8/0xe4) from [] (ret_from_fork+0x14/0x3c)
[ 2.869505] Code: e59f1650 ebfe1d58 ea0000d1 e58485c0 (e5982000)
[ 2.875956] ---[ end trace 85aa0dcf7be9c2ab ]---
[ 2.881634] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
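
A side note on reading the oops above: the word in parentheses on the "Code:" line is the instruction at the faulting PC. Here is a minimal sketch of decoding it by hand, assuming it is an ARM (A32) single-data-transfer instruction in the plain immediate-offset form. The function name decode_ldr_str is made up for illustration and handles nothing beyond that one form.

```python
def decode_ldr_str(word):
    """Decode an A32 LDR/STR with immediate offset and no writeback.

    Returns (mnemonic, rd, rn, offset). Any other addressing form raises
    ValueError, since this is only a sketch for the simple [Rn, #imm] case.
    """
    if (word >> 26) & 0b11 != 0b01:
        raise ValueError("not a single data transfer instruction")
    if (word >> 25) & 1:
        raise ValueError("register-offset form is not handled here")

    pre_indexed = (word >> 24) & 1   # P bit
    add_offset = (word >> 23) & 1    # U bit: add (1) or subtract (0) offset
    byte_access = (word >> 22) & 1   # B bit
    writeback = (word >> 21) & 1     # W bit
    is_load = (word >> 20) & 1       # L bit
    if not pre_indexed or writeback:
        raise ValueError("post-indexed/writeback forms are not handled here")

    rn = (word >> 16) & 0xF          # base register
    rd = (word >> 12) & 0xF          # destination/source register
    imm = word & 0xFFF
    offset = imm if add_offset else -imm
    mnemonic = ("ldr" if is_load else "str") + ("b" if byte_access else "")
    return mnemonic, rd, rn, offset


if __name__ == "__main__":
    # (e5982000) is the faulting instruction word from the Code: line.
    mnemonic, rd, rn, offset = decode_ldr_str(0xE5982000)
    print(f"{mnemonic} r{rd}, [r{rn}, #{offset}]")
```

In the register dump r8 is d0894000, the same value as the reported fault address, so the abort appears to come from a plain 32-bit read through r8.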

I’ll try to compile the kernel myself and play a bit with the modules to see if there’s something that can be done.

I ended up compiling the kernel again and deactivating the TI MDIO driver from menuconfig. After this tweak the 3.8 kernel boots fine.
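
For anyone repeating this, it may help to confirm that the option really ended up disabled in the generated .config before rebuilding. Below is a rough sketch that reads a .config and reports the state of one symbol. The post does not name the symbol; CONFIG_TI_DAVINCI_MDIO is my assumption for "the TI MDIO driver", and the sample text is made up for the example.

```python
import re


def kconfig_state(config_text, symbol):
    """Return the value of a Kconfig symbol from .config text.

    Returns the string after '=' (e.g. 'y' or 'm'), 'n' if the file says
    '# SYMBOL is not set', or None if the symbol is not mentioned at all.
    """
    set_re = re.compile(rf"^{re.escape(symbol)}=(.*)$")
    unset_line = f"# {symbol} is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line == unset_line:
            return "n"
        match = set_re.match(line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Made-up fragment; in practice read the .config from the kernel tree.
    sample = "CONFIG_TI_CPSW=y\n# CONFIG_TI_DAVINCI_MDIO is not set\n"
    # CONFIG_TI_DAVINCI_MDIO is an assumed symbol name, not from the post.
    print(kconfig_state(sample, "CONFIG_TI_DAVINCI_MDIO"))
```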

I briefly described what I did here. Hope somebody can benefit from it.

My board is back from RMA, with what looks like a new ethernet chip. It is
ever so slightly raised up along one edge, though the pin alignment and
soldering job are perfect. I guess it could have been that way before, but
usually that's a sign of manual replacement. The only info I was able to
get from the RMA Team was:
---
After running the diagnostic tests, we found that there was a Ethernet
malfunction. We have fixed the issue and everything is properly working.
---

The board was carefully solvent cleaned after the repair; a little glob of
glue or rosin I had noticed before is now gone. But I noticed lots of tiny
solder splashes on the bottom of the board, mostly along the expansion
connector pins. A couple of them could have been a real problem if the
board coating hadn't protected the traces. All popped off easily with a
fingernail or blunt plastic tool.

So far, the board boots fine and works as expected.

The differences between booting and panic:

< cpsw, usb_ether
---
> Phy not found <-- with bad ethernet, just before reading uEnv.txt
> PHY reset timed out
> cpsw, usb_ether

< [ time ] pinctrl-single 44e10800.pinmux: could not request pin 21 on
device pinctrl-single
< systemd-fsck[85]: Angstrom: clean, 49509/112672 files, 354728/449820
blocks
< [ time ] libphy: PHY 4a101000.mdio:01 not found <-- with good ethernet!
< [ time ] net eth0: phy 4a101000.mdio:01 not found on slave 1 <-- last
line before logo
<
< .---O---.
< | | .-. o o
< | | |-----.-----.-----.| | .----..-----.-----.
< | | | __ | ---'| '--.| .-'| | |
< | | | | | |--- || --'| | | ' | | | |
< '---'---'--'--'--. |-----''----''--' '-----'-'-'-'
< -' |
< '---'
---
> [ time ] pinctrl-single 44e10800.pinmux: could not request pin 21 on
device pinctrl-single
> [ time ] Unhandled fault: external abort on non-linefetch (0x1008) at
0xe09fe000

So in both conditions it complains about "phy not found"! With a bad chip,
it complains near the beginning of U-Boot. With working ethernet, it
complains at the very end of kernel boot. It seems like someone who knows
the details of cpsw_probe needs to figure out how to make it report a
failed ethernet chip gracefully. And why libphy still reports an error when
the ethernet is good and boot is successful.
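
In case anyone wants to repeat the comparison above on their own captures, here is a minimal sketch of the idea: replace the kernel timestamps with a fixed "[ time ]" marker so they don't make every line differ, then keep only the lines that differ. The function names and the two short sample logs are made up for illustration; they are not my full captures.

```python
import difflib
import re

# Matches a leading printk timestamp such as "[    2.565340]" or "[ 2.565340]".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]")


def normalize(lines):
    """Replace the leading timestamp on each line with a fixed marker."""
    return [TIMESTAMP.sub("[ time ]", line.rstrip("\n")) for line in lines]


def diff_boot_logs(good, bad):
    """Return only the lines that differ between two boot logs.

    Lines starting with '- ' appear only in the good log, lines starting
    with '+ ' appear only in the bad log.
    """
    delta = difflib.ndiff(normalize(good), normalize(bad))
    return [line for line in delta if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    good = [
        "[    2.559885] Detected MACID = bc:6a:29:84:8d:3a",
        "[    9.100000] libphy: PHY 4a101000.mdio:01 not found",
    ]
    bad = [
        "[    2.561234] Detected MACID = bc:6a:29:84:8d:3a",
        "[    2.565340] Unhandled fault: external abort on non-linefetch (0x1008) at 0xd0894000",
    ]
    for line in diff_boot_logs(good, bad):
        print(line)
```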

I'm finally able to login and view files. I'm wondering if these are
standard, or are they leftover from the RMA testing:
---
root@beaglebone:/# cat /media/BEAGLEBONE/uEnv.txt
optargs=quiet drm.debug=7
root@beaglebone:/# cat /media/BEAGLEBONE/uEnv.txtboot
optargs=run_hardware_tests quiet
---

After receiving the board back, I couldn't use VNC or SSH, though I could
ping the ethernet ports. In both cases Wireshark showed my external request
followed by an immediate RST from the BBB. I tried re-installing the
previous VNC package, but it said "Package x11vnc (0.9.13-r0.8) installed
in root is up to date." Still, the trick to make it load itself didn't seem
to work. I found

http://feeds.angstrom-distribution.org/feeds/v2012.12/ipk/eglibc/all/angstrom-x11vnc-xinit_1.0-r2.0_all.ipk
and that installed and worked immediately after a restart. The "netstat
-lntu" command did not see it until after it was active, even though it did
seem to see all the other open ports immediately after booting.
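
The "ping works but the connection gets an immediate RST" symptom can also be checked from the client side without Wireshark. This is only a sketch: the function name probe_tcp is made up, and the address and port numbers in the usage part are assumptions (192.168.7.2 for the BBB over USB, 22 for SSH, 5900 for VNC), so adjust them to your setup.

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Classify a TCP port as 'open', 'refused', or 'unreachable'.

    'refused' means the host answered the connection attempt with a reset,
    i.e. the machine is up but nothing is listening on that port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Timeouts, no route to host, name resolution failures, and so on.
        return "unreachable"


if __name__ == "__main__":
    # Assumed address and ports; not taken from the thread.
    for name, port in (("ssh", 22), ("vnc", 5900)):
        print(name, probe_tcp("192.168.7.2", port))
```

A result of "refused" matches what Wireshark showed me: the board is reachable, but no server is accepting connections on that port.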

SSH was trickier. I finally found this in a Google Groups thread:
-----
"ssh_exchange_identification: Connection closed by remote host"
From looking at the script above (/etc/init.d/dropbear) it seems like the
identity file in /etc/dropbear/dropbear_rsa_host_key might be causing the
problem and the script recreates them if they don't exist. So I removed it
and started dropbear (/etc/init.d/dropbear start) again and it generated
new keys and then I could ssh in. It now works! (The side effect of doing
this is you also have to remove a line in the client's ~/.ssh/known_hosts
because the identity of the beaglebone has changed.)
-----
My /etc/dropbear/dropbear_rsa_host_key file was zero-length, so I removed
it. The "dropbear start" command didn't work for me, a BBB restart was
required after I manually deleted the key file. I also unchecked the
"History" box in TeraTerm - and it saved a new RSA fingerprint. Now works
with default password choice and blank password field, and also works with
Tunnelier.
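
If someone wants to script the check I did by hand, here is a small sketch that deletes the host key only when it exists and is zero bytes long, so a good key is never thrown away. The helper name remove_if_empty is made up; the path is the one from my board. Run it on the BBB as root and restart afterwards, since that is what it took for the key to be regenerated in my case.

```python
import os


def remove_if_empty(path):
    """Delete path only if it exists and is zero bytes long.

    Returns True if the file was removed, False otherwise.
    """
    try:
        if os.path.getsize(path) == 0:
            os.remove(path)
            return True
    except FileNotFoundError:
        pass
    return False


if __name__ == "__main__":
    key = "/etc/dropbear/dropbear_rsa_host_key"
    if remove_if_empty(key):
        print("removed empty host key:", key)
    else:
        print("host key missing or not empty, left alone:", key)
```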

Thanks for updating the thread. Good to know it's working now.

Other random things I just learned...

At least on Windows, when the USB cable is connected, there is a "Gadget
Serial" device USBSER000 from "Linux Developer Community" available as a
COM port (ttyGS0 in the BBB), alongside the "USB Serial Port" VCP0 from
FTDI which is my debug console adapter COM port (ttyO0 in the BBB). The
"gadget" port is only active after boot is complete, so I didn't have much
opportunity to see it before! But it claimed a lower COM port number, so I
assume it installed along with the ethernet gadget when I first connected
via USB.

That leaves the question, could I have somehow fried my ethernet chip? I
checked my incoming cable and it is fully DC isolated. The connector on the
BBB is fully DC isolated. It is not a POE-capable connector, there is no
diode array that could feed power into the grounded pin 8. So if I did
something to cause my failure, it was not through the ethernet cable.

Hmm, this could be a one-off case. I guess if there are more instances like
this then someone needs to dig deeper. For now just hack away :wink:

My BBB is still working great, wireless and all, after the RMA repair. But one silly detail has been bothering me…

The ifconfig-reported MAC address of the usb0 port changed after the repair.
Before RMA repair:
usb0 Link encap:Ethernet HWaddr 6E:5A:F6:F0:F3:45
After RMA repair:
usb0 Link encap:Ethernet HWaddr 06:57:23:9E:EA:C7

Both of those are locally administered addresses, so they probably aren’t read from any hardware, and thus probably don’t suggest anything about what was done to my board during the RMA repair. But why the change?
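
For reference, "locally administered" can be read straight from the first octet of the address: bit 0x02 is the locally-administered flag and bit 0x01 is the multicast flag. A minimal sketch (the function name mac_flags is made up for illustration):

```python
def mac_flags(mac):
    """Report the two flag bits carried in the first octet of a MAC address."""
    first_octet = int(mac.split(":")[0], 16)
    return {
        "locally_administered": bool(first_octet & 0x02),
        "multicast": bool(first_octet & 0x01),
    }


if __name__ == "__main__":
    # The two usb0 addresses, plus the MACID from the boot log for comparison.
    for mac in ("6E:5A:F6:F0:F3:45", "06:57:23:9E:EA:C7", "bc:6a:29:84:8d:3a"):
        print(mac, mac_flags(mac))
```

Both usb0 addresses have the locally-administered bit set, while the MACID printed in the boot log (bc:6a:29:84:8d:3a) does not.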

http://processors.wiki.ti.com/index.php/AMSDK_u-boot_User's_Guide