Debian 8.1 / kernel 4.1.x test releases are unstable

I have several Rev C BBB units, that have been in use enough to be considered “trusted” hardware.

All of them, running Debian 8.1 with kernel 3.14, are rock solid. By that, I mean that they will run for months without problems. Maybe longer; I have not left them undisturbed for longer than that.

I have tried to run the Debian 8.1 / kernel 4.1.x test releases, and they all autonomously reboot several times per day.
No regular or reproducible pattern, nothing in the syslog, other than the reboot process itself.

This includes the 2015-07-05 kernel 4.1.1 release. (bone-debian-8.1-lxqt-4gb-armhf-2015-07-05-4gb.img)

All I do is put it on a 16 GB card, expand the root partition from 4 GB to the full 16 GB, and turn off the 4 blinky lights. No other changes. The unit will reboot at random, several times per day.
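Turning off the heartbeat/activity LEDs is normally done through the sysfs LED triggers; a minimal sketch (the beaglebone:green:usr* names are what the stock images expose, and may differ between kernel versions):

```
# Disable the triggers so the four user LEDs stay off
# (LED names are the standard BBB ones; adjust if your kernel differs)
for led in /sys/class/leds/beaglebone:green:usr*; do
    echo none | sudo tee "$led/trigger" > /dev/null
    echo 0 | sudo tee "$led/brightness" > /dev/null
done
```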

This has been common to all the kernel 4.1 releases. Although the reboots follow no regular pattern, the problem is easy enough to reproduce. If you turn off the blinky lights, it is easy to recognize, since they come back on when the board reboots. Or examine the syslog.

— Graham

I absolutely agree with Graham’s report. I also see plenty of unexplainable resets of the BeagleBone, just as Graham describes: boards sitting on the table, naked, no cape, just flashed with a fresh image. My power supplies are 5 V / 2 A units from a German quality vendor and I use them by the hundreds; the power supply is not the reason. There is no information in journalctl -f, no information on the RS232 console; the board just resets without any indication.

I ran a test with a larger number of BeagleBone (White) and BeagleBone Black boards; these are the results within 24 h. All but two of them are running this release:

uname -a
Linux bb151f 4.1.0-rc8-bone9 #1 Wed Jun 17 00:05:43 UTC 2015 armv7l GNU/Linux

These are my test results:

| MAC | BB | Comment | Number of resets / uptime |
| --- | --- | --- | --- |
| 00:18:31:e0:54:35 | White | stable since power on (1 day) | - |
| bc:6a:29:cc:a5:ae | White | stable since power on (7 days) | - |
| 00:18:31:8b:59:4e | White | stable | - |
| d4:94:a1:85:c2:3d | White | stable, running 4.1.1-bone9 | - |
| 78:a5:04:cd:cf:b3 | Black | stable | - |
| 78:a5:04:ce:13:21 | Black | stable since power on (12 days) | - |
| d0:5f:b8:d7:53:ec | Black | unstable | reboots every 4-6 h |
| 6c:ec:eb:5d:26:09 | Black | unstable, even with 4.2.0-rc1-bone1 | 3 |
| d0:39:72:45:1c:f1 | Black | unstable, got stuck in U-Boot once | 6 |
| 78:a5:04:ca:a9:4e | Black | | 3 |
| 78:a5:04:fe:f6:11 | Black | | 3 |
| 78:a5:04:cf:4f:8e | Black | | 6 |
| 78:a5:04:db:5d:63 | Black | | 3 |
| 54:4a:16:c5:ea:75 | Black | | 2 |
| 78:a5:04:cf:84:5a | Black | | 3 |
| 78:a5:04:fd:93:dc | Black | | 2 |
| 78:a5:04:fe:de:13 | Black | | 5 |
| 78:a5:04:cf:5a:40 | Black | | 4 |
| 78:a5:04:cf:6c:1f | Black | | 6 |
| 6c:ec:eb:a5:15:1f | Black | | 2 |
| 78:a5:04:cf:65:48 | Black | | 4 |
| 78:a5:04:ca:8f:34 | Black | | 3 |

I’m sampling uptime of all boards every 30 min, and sometimes the simple script that collects that data gets stuck, so the true number is definitely higher.
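The collector itself is not shown in the thread; a minimal sketch of that kind of poller, with placeholder host names and log path, and a connect timeout so a wedged board cannot hang the loop:

```
#!/bin/sh
# Poll uptime from every board under test; a hung board should only
# cost the ConnectTimeout, not block the whole run.
BOARDS="bb151f bb1cf1 rc6c1f"        # placeholder host names
LOG=/var/tmp/bbb-uptime.log          # placeholder log path

for host in $BOARDS; do
    printf '%s %s: ' "$(date -u +%FT%TZ)" "$host" >> "$LOG"
    ssh -o ConnectTimeout=5 -o BatchMode=yes "debian@$host" uptime \
        >> "$LOG" 2>&1 || echo "unreachable" >> "$LOG"
done
```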

All my BeagleBone (White) boards under test are rock solid, and so are two of the BB-Blacks. Those two BB-Blacks are older devices from the first BB-Black production runs on the market, while the unstable ones are mostly recent production from Embest.

As Graham reports, all the boards are stable when running something before 4.x; in my case this is the very old 3.8 Angstrom:

```
uname -a
Linux beaglebone 3.8.13 #1 SMP Tue Jul 30 11:56:13 CEST 2013 armv7l GNU/Linux

lsb_release -a
Distributor ID: Angstrom
Description:    Angstrom GNU/Linux v2012.12 (Core edition)
Release:        v2012.12
Codename:       Core edition
```

I am now changing my worst candidates back to:

```
uname -a
Linux bb1cf1 3.19.3-bone4 #1 Fri Mar 27 16:05:22 UTC 2015 armv7l GNU/Linux
```

Any comment or help would be greatly appreciated. If I can add some testing, let me know.

— Guenter (dl4mea)

The common debugging method for problems like this is to bisect.
However, if the start and end points are 3.14 and 4.1.x, respectively, that would be prohibitive.
Best to find a closer start point than 3.14.

Also, is 4.1.x stable if you don't mess with the image?

Regards,
Peter Hurley

The instabilities have been found by just flashing elinux.org images from, for example, http://elinux.org/Beagleboard:BeagleBoneBlack_Debian#Jessie_Snapshot_console, specifically the “Flasher: (console) (BeagleBone Black eMMC)” image, and letting the board idle with network + serial console connected.

Also, is 4.1.x stable if you don’t mess with the image?

I am willing to try several images, as I have 15 boards suffering from this under supervision. But I don’t understand which sequence to go for; there are so many if I look for them with

apt-cache search linux-image

— Guenter (dl4mea)

apt-cache search linux-image | grep ti | grep 4.1
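Once a candidate shows up in that filtered list, installing it is an apt-get install plus a reboot. A sketch, using the 4.1.1-ti-r2 version string mentioned later in the thread purely as an example (the exact package name depends on what the repo carries at the time):

```
# Install one of the kernels from the filtered list (the version string
# below is only an example; use whatever the repo currently offers)
sudo apt-get update
sudo apt-get install linux-image-4.1.1-ti-r2
sudo reboot

# After the reboot, confirm which kernel is running
uname -r
```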

BTW, we had similar issues when we started testing 3.14… I saw it happen on a board Thursday, but I won’t be able to dig into it again until Monday.

watchdog was the first thing that popped into my mind heh.

I will try it by reloading a totally untouched “bone-debian-8.1-lxqt-4gb-armhf-2015-07-05-4gb.img” and report back. No cape, trusted Rev. C hardware and power supply. All communications via Ethernet.

By my saying that 3.14 is rock solid, this includes up to “bone-debian-8.1-lxqt-4gb-armhf-2015-06-15-4gb.img”, which was the last non-kernel-4 test release.

Same hardware, same power supplies, same Ethernet connection. No other hardware or connections.

— Graham

I’ve had this, or something similar, happen to me a few times. When I did apt-get update again right after, it succeeded. But I’m still not sure of the cause.

Anyway, guys, give me an idea of what you’re doing on these boards when you get the random system resets, and I’ll test here too. I have a couple of free BeagleBones I can run arbitrary tests on at the moment.

By the way, currently on the SD card I am running Wheezy 7.8, I believe.
debian@beaglebone:~$ cat /etc/dogtag
BeagleBoard.org Debian Image 2015-03-01
debian@beaglebone:~$ uname -a
Linux beaglebone 3.8.13-bone70 #1 SMP Fri Jan 23 02:15:42 UTC 2015 armv7l GNU/Linux

So I could apt-get install linux-image-4.1 and see if this could be related to the rootfs, or what.

W: Failed to fetch http://repos.rcn-ee.com/debian/dists/jessie/main/binary-armhf/Packages Hash Sum mismatch

Solution: http://askubuntu.com/questions/41605/trouble-downloading-packages-list-due-to-a-hash-sum-mismatch-error

Simply: rm -rf /var/lib/apt/lists/*
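In other words, the usual recovery is to throw away the cached package lists and fetch them again; a short sketch of the full sequence:

```
# Remove the stale/corrupt package lists, clear the cache, re-download
sudo rm -rf /var/lib/apt/lists/*
sudo apt-get clean
sudo apt-get update
```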

Hi William:

Doing nothing with the board. It is just sitting on the side, connected to +5 V power and Ethernet.

So, for example, late last night (Central US time) I loaded “bone-debian-8.1-lxqt-4gb-armhf-2015-07-05-4gb.img” onto a trusted uSD card, expanded the partition to the full 16 GB using gparted, and turned off the four blue blinky lights. No other changes. Then I went to bed.

Reading syslog (times are GMT; boot completion is defined as systemd updating the time to network time):

The initial boot (completion) was at Jul 12, 05:09:27. The lab was quiet, lights off, nothing running. The BBB autonomously rebooted at 08:25:33, 13:13:22, and 14:32:27.

I am now rerunning with an untouched reload of “bone-debian-8.1-lxqt-4gb-armhf-2015-07-05-4gb.img”. Just load, install and boot. Talk to the command line by SSH.
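For anyone repeating this, the reboot times can be pulled out afterwards without watching the LEDs; a sketch (journalctl --list-boots only shows earlier boots when the journal is persistent, so the wtmp-based last command is the fallback):

```
# Show the reboot records kept in /var/log/wtmp
last -x reboot | head

# With a persistent journal, list every boot systemd knows about
journalctl --list-boots
```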

— Graham

OK. Well, my own personal feeling is that this could be related to systemd, somehow. I have no proof to substantiate that.

So, I’ll work on the problem from the bottom up. What I mean by this is that I’ll start with a rootfs I know works, in my case Wheezy 7.8. I’ve got that running now with

debian@beaglebone:~$ uname -a
Linux beaglebone 4.1.0-rc8-bone9 #1 Tue Jun 16 23:45:22 UTC 2015 armv7l GNU/Linux.

I’ll let it sit and idle for a day or so. After that, I’ll download and flash the Jessie image, install sysv, disable systemd. Then start the “test” over again.
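For reference, swapping Jessie over to sysvinit is normally just a package install plus a reboot; a sketch, assuming the stock Debian Jessie package names:

```
# Replace systemd as PID 1 with sysvinit on Debian Jessie
sudo apt-get update
sudo apt-get install sysvinit-core sysvinit-utils
sudo reboot

# After the reboot, verify which init is running (expect "init", not "systemd")
ps -p 1 -o comm=
```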

Oh, and yeah, if one of you could do me a favor: run one of your boards as-is with sudo cpufreq-set -g performance and see if the problem clears up?
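For anyone running that test, a minimal sketch (assumes the cpufrequtils package provides cpufreq-set, and uses the standard sysfs cpufreq path to verify):

```
# Install the userspace tool if it is not already present
sudo apt-get install cpufrequtils

# Check the current governor (often "ondemand" on these images)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Pin the CPU to the performance governor for the duration of the test
sudo cpufreq-set -g performance

# Confirm the change took effect
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```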

I now have one of the worst case rebooters running on 3.19.3-bone4 (already installed 8h ago)
root@bb1cf1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 8.1 (jessie)
Release: 8.1
Codename: jessie
root@bb1cf1:~# uname -a
Linux rc1cf1 3.19.3-bone4 #1 Fri Mar 27 16:05:22 UTC 2015 armv7l GNU/Linux
root@bb1cf1:~# uptime
19:47:17 up 7:49, 1 user, load average: 0.47, 0.17, 0.09

and one on Robert’s suggestion 4.1.1-ti-r2
root@rc6c1f:~# uname -a
Linux rc6c1f 4.1.1-ti-r2 #1 SMP PREEMPT Wed Jul 8 17:03:29 UTC 2015 armv7l GNU/Linux
root@rc6c1f:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 8.1 (jessie)
Release: 8.1
Codename: jessie
root@rc6c1f:~# uname -a
Linux rc6c1f 4.1.1-ti-r2 #1 SMP PREEMPT Wed Jul 8 17:03:29 UTC 2015 armv7l GNU/Linux
root@rc6c1f:~# uptime
19:49:39 up 22 min, 1 user, load average: 1.09, 0.69, 0.31

Something else that might help troubleshoot this issue would be to get a “snapshot” of each system by way of ps aux and store them somewhere for later examination, maybe pastebin.
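A minimal sketch of such a snapshot collector, run locally on each board (the output directory is a placeholder):

```
#!/bin/sh
# Collect a timestamped system snapshot for later comparison.
OUT=/var/tmp/snapshots        # placeholder output directory
mkdir -p "$OUT"
STAMP=$(date +%Y%m%d-%H%M%S)
{
    echo "== uptime =="
    uptime
    echo "== uname =="
    uname -a
    echo "== ps aux =="
    ps aux
} > "$OUT/snapshot-$STAMP.txt"
```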

http://pastebin.com/ydneAtne

Yeah, I think the transition to linear irq domain (added at 3.18) made cpsw
a little extra flaky. Plus the new omap_8250 serial driver is not bug-free;
just found a flow control bug in the h/w last week.

I've had ssh shells go sideways on occasion, but not with that kind of
regularity or effect.

Like I said, the right diagnostic method is bisecting the kernel.
It's going to take a while (multiple days) if several hours are required to
distinguish good from bad kernel.

Regards,
Peter Hurley

Hi Peter. “Bisecting the kernel” is unknown to me, as in what it means. But I was wondering if some sort of remote, very verbose logging might not help? Currently I’m in the process of reading and learning advanced Linux programming, and I have all these crazy ideas of what we could do. I’m just not sure what to “trap”, and exactly how to trap it.

Ah, I see. Following my own advice comes in handy sometimes… as in git bisect. A bit outside my abilities.

Linux mainline kernel source is really just a massive linear series of patches,
one after the other, all tracked by git. Bisecting is a method of reducing the
number of patches under test by 1/2 at each iteration to arrive at a problem commit.

So, for example, let's say that I have a problem that cropped up on 4.2-rc1,
but the problem wasn't happening on 4.1-rc7.

I start a bisect with git:
$ git bisect start v4.2-rc1 v4.1-rc7
Bisecting: 6261 revisions left to test after this (roughly 13 steps)
[4570a37169d4b44d316f40b2ccc681dc93fedc7b] Merge tag 'sound-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

I build this kernel, test it, and mark it good or bad. Let's say the problem
doesn't exhibit in this kernel.

$ git bisect good
Bisecting: 3371 revisions left to test after this (roughly 12 steps)
[8d7804a2f03dbd34940fcb426450c730adf29dae] Merge tag 'driver-core-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Now I build this kernel, test it, and mark it good or bad. Let's say bad this time.

$ git bisect bad
[3d9f96d850e4bbfae24dc9aee03033dd77c81596] Merge tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Each time, the number of commits under test is reduced by half.
For a small kernel like the BeagleBone's this is no big deal, a couple of minutes of
build time per iteration. For a 64-bit distro kernel, this can take several days for 14 iterations.

Having a bunch of BBBs, all testing the same kernel at the same time
significantly improves the confidence at each iteration that the kernel
is "good" or "bad" (since obviously a problem that takes time to manifest may
be mistakenly identified as "good" and then the bisect will narrow on the wrong
commits).
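For completeness, a sketch of how a bisect session is wrapped up once git names the first bad commit (standard git commands, nothing specific to this thread):

```
# Record the good/bad marks made so far (handy to share with the list)
git bisect log > bisect.log

# When git reports "first bad commit", inspect it
git show <first-bad-sha>

# Return the tree to its original state
git bisect reset
```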

Instrumenting a problem like this is basically impossible.

Regards,
Peter Hurley

Thanks, Peter, for the in-depth explanation. I was actually just reading a very detailed blog post by a person bug hunting in Fedora 20… the blog post could be considered a book in and of itself, and wow, yes, lots of learning to do before I can achieve the same myself.