Image 2015-7-26, kernel 4.1.3-ti-r6

Downloaded and installed “bone-debian-8.1-console-armhf-2015-07-26-2gb.img”
No modifications or updates.
“Trusted” BBB Rev.C powered with Adafruit 2A wallwart via +5V barrel connector.
No capes. Just power and Ethernet connected.
Times and dates are UTC.

BeagleBoard.org Debian Image 2015-07-26
Linux beaglebone 4.1.3-ti-r6 #1 SMP PREEMPT Fri Jul 24 23:16:27 UTC 2015 armv7l GNU/Linux

Jul 28 04:38:27 Initial Boot after download and install

Jul 28 23:15:41 Autonomous Reboot

Wed Jul 29 01:00:53 UTC 2015 – Current time of report

Thanks for testing Graham!

debian@beaglebone:~$ uptime
19:48:12 up 14 days, 5:49, 1 user, load average: 0.00, 0.01, 0.05

runs and hides

Robert:

I know you are not looking for a hardware modification as a solution, but if hardware modifications would help diagnose and debug the problem, I have the capability to do “blue wires” or change parts, for just about anything other than a BGA.

The trick is to get the BBB to tell you what is triggering the reboot.

The USB power feed does not do it, the +5 V barrel power input does.

Kernel 3.xx does not do it, kernel 4.x.x does.

I suspect some code changes from the USB mainline code have changed the way the power source/direction sniffing works.
For instance, the USB connectors on tablets, support for which is probably in kernel 4, can accept power for charging tablet batteries, or supply power for running thumb drives or other USB accessories.
How they do this would change code in this exact area.

If the PMIC is allowed to autonomously make power source switch-overs, without permission from the Sitara, and it is making bad decisions, as others have suggested, then I am not sure how to approach it.

— Graham

I can also probe the board with a logic analyzer, wait for the event to occur, and figure out where in the hardware it started. I just don’t want to be doing something that has already been ruled out.

Robert: could you please put us in the loop about your investigation so we can team efforts?

Thanks,
Nuno

Has anyone done a control register dump of the PMIC under kernel 4 and compared it to a register dump under kernel 3 to see if anything has changed?
— Graham
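
One way the comparison asked about above could be done, as a rough sketch only. It assumes the i2c-tools package is installed, that the commands run as root, and that the PMIC (a TPS65217 on the BBB) answers at address 0x24 on I2C bus 0. The function names and file names are made up for illustration.

```shell
# Sketch only: bus 0 and address 0x24 are assumptions about where the PMIC sits.
# -y skips the confirmation prompt; -f forces access while a kernel driver owns the chip.
dump_pmic() {
    # $1 = output file, for example pmic-$(uname -r).txt
    i2cdump -y -f 0 0x24 b > "$1"
}

# Print the lines that differ between two saved dumps.
# Exit status 0 means the dumps are identical.
compare_dumps() {
    diff "$1" "$2"
}
```

Run `dump_pmic pmic-3.x.txt` under the known-good kernel and `dump_pmic pmic-4.x.txt` under the failing one, then `compare_dumps pmic-3.x.txt pmic-4.x.txt`. Status registers can change between reads, so a difference is only a hint, not proof.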

Well right now it's..

git bisect start
git bisect good v3.14
git bisect bad v4.12-rc4

rebuild...

run for 24 hours (or if it resets earlier)... retest..

Wish we could trigger it faster.. but it's going to take a while..

Regards,

OK. Understand.
If you need help testing a version, send me a note.
— Graham

I have seen some releases take more than 24 hours to fail.
If it fails, then it does, but if it doesn’t, you may have to go two full days before you can start thinking that you found it.
— Graham
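
The good/bad decision for one bisect step could be sketched roughly as below. This is not the script used in this thread. The function name, the 60-second slack, and the idea of comparing uptime against elapsed test time are all assumptions made for illustration.

```shell
# verdict START_EPOCH NOW_EPOCH UPTIME_SECONDS MIN_HOURS
# Prints "bad" if the board rebooted during the test, "good" if it has
# survived at least MIN_HOURS, and "waiting" otherwise.
verdict() {
    elapsed=$(( $2 - $1 ))
    # Uptime clearly shorter than the test duration means a reboot happened.
    # The 60 s slack (an arbitrary choice) allows for boot time.
    if [ "$3" -lt $(( elapsed - 60 )) ]; then
        echo bad
    elif [ "$elapsed" -ge $(( $4 * 3600 )) ]; then
        echo good
    else
        echo waiting
    fi
}
```

On a board this might be called as `verdict "$start" "$(date +%s)" "$(cut -d. -f1 /proc/uptime)" 48`, where `start` was saved with `date +%s` when the test began. The 48 follows the two-day caution above. Taking the timestamps on a separate machine is probably safer, since the board's clock may not survive a reset.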

I can easily set up 24 boards, and with some additional work even 48, if that helps.

---- Günter (dl4mea)

Thanks Günter, right now i'm trying to get it between two pure
mainline good/bad commits..

Then we can start a distributed git bisect. I'll set up a git repo to
help automate it..

With a 24-36 hour test, each further git bisect step run in parallel doubles the number of kernels to test (1, 2, 4, ...):

      C
  B       G
B   G   B   G

So with 7 boards you can quickly do the first 3 steps.. At least the
fails are quick. :wink:

Regards,
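
As a quick check of the board count: a full tree of n bisect levels has 1 + 2 + 4 + ... = 2^n - 1 kernels, one board each. The function name below is made up for illustration.

```shell
# Number of kernels (one board each) needed to cover the first N
# bisect levels in parallel: 2^N - 1.
boards_for_levels() {
    echo $(( (1 << $1) - 1 ))
}
```

`boards_for_levels 2` gives 3 and `boards_for_levels 3` gives 7. Running several boards per kernel multiplies the total accordingly.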

I believe it is better to run several boards with the same kernel to reduce the mean time to a BAD result.

We have already seen times of up to 3 days for a single board to reboot.

So I would suggest instead trying only 2 levels at once, i.e. 3 kernel revisions, with 8 boards running each revision at the same time for a total of 24 boards, hoping that we can advance 2 bisect levels / day.

Your git repo will be much appreciated. Thanks.

Nuno

Just a quick update..

verified a non v4.1.x-ti kernel failed last night..

Right now i'm looking between:

linux-image-4.0.4-bone4 <> linux-image-4.1.2-bone12

usb: apt-get install linux-image-4.0.4-bone4 (up 1 day, 9 hours, 56 minutes)
dc: apt-get install linux-image-4.0.4-bone4 (up 1 day, 9 hours, 56 minutes)

3: apt-get install linux-image-4.1.1-bone10 (up 1 hour, 15 minutes)

Ran out of power supplies, so i'm going to get another 3 boards
running tonight.. So i'll have 9 running in total..

Regards,

Well, that is progress.
I don’t think I have ever done a git bisect where it takes one or two days to tell good from bad.
I know this is painful for you.
Thanks for grinding on it.
— Graham

When we get down to a kernel git bisect i have a script ready that'll
take all the pain out of it.. :wink: and right now it's looking like v4.0.0
<-> v4.1.0...

I'm currently keeping track of things here:

https://gist.github.com/RobertCNelson/b52f8318e9798625b655

I'm helping out grandma this weekend, so i'll be updating that gist
when i can. (As long as the boards don't need a "hard power reset")

Regards,

mid-day update, 9 boards now running.. linux-image-4.1.0-rc4-bone2
just rebooted on me..

so, current kernel bisect:
v4.0.0 <-> v4.1.0-rc4

just waiting on rc1/rc2/rc3 to fail next.. :wink:

Regards,

rc3 just rebooted

v4.0.0 <-> v4.1.0-rc3

Regards,

and after 14 hours, rc1 just rebooted...

I'm going to quickly rebuild rc1 without my bone0 patchset..

v4.0.0 <-> v4.1.0-rc1

Regards,

okay first one up:

http://rcn-ee.homeip.net:81/farm/testing/v4.1.x/just-rc1/linux-image-4.1.0-rc1-x0-dirty_1cross_armhf.deb

NOTE... any boards without a phy mask of fffffffe should not be used
for the testing going forward..

root@test-bbb-3:~# dmesg | grep mdio
[ 3.432508] davinci_mdio 4a101000.mdio: davinci mdio revision 1.6
[ 3.432526] davinci_mdio 4a101000.mdio: detected phy mask fffffffe

as ethernet will be broken..

Regards,
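
The phy mask check above could be wrapped in a small helper when sorting a pile of boards. A minimal sketch, with a made-up function name:

```shell
# Reads dmesg-style text on stdin.
# Succeeds only if a "detected phy mask" line reports fffffffe.
phy_mask_ok() {
    grep -q 'detected phy mask fffffffe$'
}
```

For example: `dmesg | phy_mask_ok && echo "ok for testing" || echo "skip this board"`.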

This is what I can add for now. Please note the uptime of the first board, which has been up for 8 days.
These are 3 BB-Blacks from the very first Embest production run; perhaps they had a different PMIC?
I should get 16 out of the April Embest production (which were stored at a different location) today and I will include them afterwards.

dmesg | grep "phy mask"

[ 3.687140] davinci_mdio 4a101000.mdio: detected phy mask fffffffe

cat /proc/device-tree/model

TI AM335x BeagleBone Black

uname -a

Linux bb1cf1 4.1.2-ti-r4.6 #1 SMP PREEMPT Tue Jul 21 08:24:37 CDT 2015 armv7l GNU/Linux

uptime

07:15:08 up 8 days, 11:13, 1 user, load average: 0.00, 0.01, 0.05

dmesg | grep "phy mask"

[ 2.722583] davinci_mdio 4a101000.mdio: detected phy mask fffffffe

cat /proc/device-tree/model

TI AM335x BeagleBone Black

uname -a

Linux bbc1a1 4.0.4-bone4 #1 Mon May 18 05:59:35 UTC 2015 armv7l GNU/Linux

uptime

07:17:40 up 9:26, 1 user, load average: 0.00, 0.01, 0.04

dmesg | grep "phy mask"

[ 2.722541] davinci_mdio 4a101000.mdio: detected phy mask fffffffe

cat /proc/device-tree/model

TI AM335x BeagleBone Black

uname -a

Linux bbde39 4.0.4-bone4 #1 Mon May 18 05:59:35 UTC 2015 armv7l GNU/Linux

uptime

07:18:55 up 9:28, 1 user, load average: 0.00, 0.01, 0.05

for you all to enjoy, one of the BB-White:

dmesg | grep "phy mask"

[ 3.676818] davinci_mdio 4a101000.mdio: detected phy mask fffffffe

cat /proc/device-tree/model

TI AM335x BeagleBone

uname -a

Linux rc22 4.1.0-rc8-bone9 #1 Wed Jun 17 00:05:43 UTC 2015 armv7l GNU/Linux

uptime

07:20:25 up 22 days, 2:36, 2 users, load average: 0.24, 0.20, 0.18