BeagleBone Black sometimes doesn't start. Only Power LED is on

ivan.wunderlin · August 12, 2015, 11:35pm

Jumping Rx to 3V3 pretty much solved the problem for me.

Why only “pretty much”? Well - I actually installed my first project on a customer site. This project consists of 12 boards communicating back to a management system. Because of the issue with the eth0 occasionally not reconnecting (e.g. after the switch is restarted) I added safety code that restarts the board if there is no communication back to the management system for 120s.
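
The safety logic described above can be sketched roughly as follows. This is only an illustration, not the code running on the 12 boards: the class name CommsWatchdog, its methods and the injectable clock are hypothetical, and the only value taken from the post is the 120 s timeout.

```python
import time


class CommsWatchdog:
    """Track the last successful contact and decide when a restart is due."""

    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        # timeout_s: seconds without contact before a restart is requested
        # clock: injectable time source (hypothetical, makes the logic testable)
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_contact = clock()

    def contact_ok(self):
        """Record a successful exchange with the management system."""
        self.last_contact = self.clock()

    def should_reboot(self):
        """Return True once the silence has lasted at least timeout_s."""
        return self.clock() - self.last_contact >= self.timeout_s
```

In a real deployment the main loop would call contact_ok() after every successful exchange with the management system and trigger the restart (for example via the system's reboot command) once should_reboot() returns True.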

The system has been running perfectly for 2 months - but last week the customer accidentally switched the management system off (power button pressed; I’ve since disabled that button). The problem was only detected after 4 days - so all 12 boards kept restarting for 4 days. This resulted in roughly 30,000 restarts. After the management system was finally switched back on 2 boards did not establish communication. I had to go on site (luckily it is in the same city) and found that the 2 boards did not start up. So 2 failed restarts out of 30,000 - even with Rx connected straight to 3V3.
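
As a rough cross-check of the numbers above (a back-of-the-envelope sketch; the restart count is only "roughly 30,000", so the results are approximate):

```python
BOARDS = 12
OUTAGE_DAYS = 4
RESTARTS = 30_000   # "roughly 30,000" from the post, across all boards
FAILED = 2

outage_s = OUTAGE_DAYS * 24 * 3600

# Average time one board spends per restart cycle (timeout + boot).
seconds_per_restart = BOARDS * outage_s / RESTARTS

# Observed fraction of restarts that did not come back up.
failure_rate = FAILED / RESTARTS

print(f"average cycle per board: {seconds_per_restart:.0f} s")
print(f"observed failure rate: {failure_rate:.4%}")
```

That works out to an average of about 138 s per restart cycle per board, which is consistent with a 120 s timeout plus boot time, and an observed failure rate of about 0.0067% (1 in 15,000).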

I have to admit that I am now looking into other options (other than the BBB) because, as much as I like it, I cannot use a HW platform that a.) does not start up 100.00% reliably (even after manual modifications that should not be necessary in the 1st place) and b.) does not always connect back to the switch. I am scratching my head a little bit as to why the BBB team does not fix those issues, as they are not exactly disputed. It is clear enough that the BBB does have a serious reboot issue (as this thread certainly proves). Even worse is the issue with eth0 not always re-connecting if static IP configuration is used. Just in case somebody wonders if I really implemented all fixes suggested in the various forums - this is what I’ve done for all 12 boards for my 1st project:

Soldered Rx to 3V3 (much improved the restart issue but still 2 out of 30,000 restarts fail)
Installed latest Debian image
Updated Kernel to latest version
Removed apache service
Removed DHCP service
Removed wicd service
Disabled lightdm (my application doesn’t need it)
Disabled HDMI (my application doesn’t need it)
Adjusted /etc/network/interfaces (set static IP for eth0)

Note that ifconfig shows the correct IP address when the issue with eth0 not connecting happens.

So yeah - the BBB seems to be a great board but unfortunately not reliable enough for 24*7 unattended operation. And I really cannot spend more time on troubleshooting - I have enough work on my hands with the actual application so I need a platform that works out of the box.

RobertCNelson · August 13, 2015, 2:06pm

Please confirm which kernel you're running:

uname -r

There's a big thread on this list, where a bunch of people spent about 2 weeks
bisecting the v4.1.x kernel to find the cause of the "random" reboot..

sudo apt-get update
sudo apt-get install linux-image-4.1.4-ti-r9
sudo reboot

Regards,

ivan.wunderlin · August 13, 2015, 10:14pm

Hi Robert,

Please confirm which kernel you're running

3.8.13-bone71 (updated beginning of last June)

There’s a big thread on this list, where a bunch of people spent about 2 weeks bisecting the v4.1.x kernel to find the cause of the “random” reboot…

I don’t have an issue though with random reboots - the reboots are initiated on purpose by my application (because of the eth0 problem) - but as described 2 in 30,000 reboots failed.

Cheers,
Ivan

Colin_Bester · August 13, 2015, 11:04pm

In my case I am running 3.8.13-bone68 and the system is pretty darn solid if it does start up. I have not seen Ethernet fail nor have I had any random reboots, but occasionally I do have a device not power up when power is applied. We have not been able to determine a consistent cause and I am not convinced it’s due to the mentioned RX pin as I am pretty sure I saw that this pin is pulled low (which I still think is wrong polarity) on rev C boards.

In addition, we have physically blocked off the PWR button and only expose the reset button via a small pin hole in our enclosure.

AQG_Chris · August 13, 2015, 11:25pm

I’ll chime in again too - we originally tested our 3v3-rx jumper with a direct connection, but then decided it might be nice to keep the ability to use the serial debug port. Right now we’ve got a 470 ohm resistor pulling rx up, which seems to allow communication over serial still. We also tried values of 220 through 1kohm and were still able to send characters to uart0. 100 ohm was too low to send anything over serial. I’ve not done extensive reboot testing yet with each of these, but we will likely settle on either 470 or 1kohm.

If you probe the RX pin of the BB with nothing attached, we find 0v. The TX pin (both on the BB and on the FTDI board we attach to the BB) is at 3.3v by default, and we’ve also noticed that the problem never occurs when we’re hooked up to uart0 with our FTDI chip adapter. That made pulling the line up rather than down an attractive option. We did try pulling RX down to ground at some point, but the very first test I did after that resulted in power LED on but no boot.

We still haven’t done a mass deployment, so for now I’m taking our smaller testing runs and experiences like Ivan’s to guide us. Sidenote - yes, we have noticed eth0 not showing up as well, although it’s not critical for our application.

Andrew_Glen · August 13, 2015, 11:28pm

For what it’s worth, I run hundreds of 24/7 unattended systems with the BBB. We have tested reboots into the tens of thousands, and with some work we are able to achieve zero failures.

Off the top of my head here are the key platform specific things I do that you might want to look at:

Mod the h/w to avoid the Ethernet issue (from another post) (this may be fixed in the latest kernel, but we locked the s/w down a while back).
Use a custom u-boot to avoid the uart boot issue.
Configure the file system as read only (perform all app-level read/writes on a separate partition).
Disable all journaling/logging, etc, except to temporary ramfs.
Force fsck to run and auto-recover on each boot (this was prior to running read-only, may not be necessary now).
Remove all unneeded processes.
Enable the watchdog.
Use ‘allow-hotplug’ on the ethernet connection.
There was an issue with USB mounting causing the CAN bus to fail, but this was resolved with a patch to the kernel I submitted a year or so ago. Prior to the fix I had a script checking for a failure and re-init’ing as necessary.

Have you been able to ascertain how the boot sequence fails, e.g. run 30k reboots and record serial output?
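
If the serial output of each boot were captured to a file, a first-pass classification could look something like the sketch below. The marker strings and category names are assumptions: "Hit any key to stop autoboot" appears in U-Boot logs like the one later in this thread, "Starting kernel" is what U-Boot normally prints just before handing over to Linux, and classify_boot is a hypothetical helper.

```python
AUTOBOOT_MARKER = "Hit any key to stop autoboot"
KERNEL_MARKER = "Starting kernel"  # assumed U-Boot hand-over message


def classify_boot(log_text):
    """Roughly classify one captured serial log of a single boot attempt."""
    if not log_text.strip():
        # Nothing at all on the UART: failure before U-Boot printed anything.
        return "no_output"
    if KERNEL_MARKER in log_text:
        # U-Boot got as far as handing over to the kernel.
        return "kernel_started"
    if AUTOBOOT_MARKER in log_text:
        # Autoboot countdown was shown, but no kernel hand-over followed.
        return "no_kernel_after_autoboot_prompt"
    # Some output, but U-Boot never reached the autoboot prompt.
    return "stopped_early"
```

Counting the categories over a long reboot run would at least show whether the failures cluster at the autoboot prompt or somewhere else.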

Cheers,
Andrew.

Colin_Bester · August 13, 2015, 11:38pm

I just took a look at Rev C schematics and there is a 100K pulldown on the RX pin so I wouldn’t have expected pulling directly to ground to make it any worse. All in all 100K is not much of a pull down, but I do agree that pull up is what you want - that at least is idle state on a serial line (from what I recall). My gut feel would be to use around a 3K resistor, which should allow plenty of headroom for hooking up a serial monitor if you ever wanted to - it’s a real pity that there is no convenient 3V3 on the monitor header.
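
To put numbers on that, here is a small sketch of the idle voltage on RX for the pull-up values mentioned in this thread. It assumes the pull-up and the 100K pulldown sit on the same node and form a plain resistor divider from 3.3 V, ignoring the buffer input and anything driving the line; idle_voltage is a hypothetical helper.

```python
V_SUPPLY = 3.3          # volts, the 3V3 rail the pull-up is tied to
R_PULLDOWN = 100_000.0  # ohms, the Rev C pulldown mentioned above


def idle_voltage(r_pullup, v_supply=V_SUPPLY, r_pulldown=R_PULLDOWN):
    """Voltage on the RX node with nothing driving it (simple divider)."""
    return v_supply * r_pulldown / (r_pullup + r_pulldown)


# Pull-up values discussed in the thread: 220, 470, 1k and 3k ohm.
for r in (220, 470, 1_000, 3_000):
    print(f"{r:>5} ohm pull-up -> {idle_voltage(r):.3f} V idle")
```

Under these assumptions every value from 220 ohm to 3K leaves the line above 3.2 V at idle, so the choice between them is more about how easily a serial adapter can still pull the line low than about the idle level.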

~C

Gerald_Coley1 · August 14, 2015, 12:22am

3.3V and 5V shorted together would most certainly have been something undesirable. Pity that the FTDI only puts out 5V on that header and not a voltage level that is the same as the signal level.

The purpose of the buffer was to prevent current coming from the FTDI signals and powering the processor when power was off and the FTDI cable was still plugged in. It actually made it worse when we tried a pullup. So the weak pulldown was a compromise.

If we ever do another revision, which is doubtful with all those clones out there, I would consider adding a pullup on the other side of the buffer again and see if we can get it to work.

You could create a pullup by using a voltage divider between the 5V rail and ground and connecting your pullup to that.
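
A quick sketch of what such a divider might look like numerically. The 10K/20K values are purely illustrative (not from the schematic or from this thread), and divider is a hypothetical helper.

```python
def divider(v_in, r_top, r_bottom):
    """Return (open-circuit voltage, Thevenin resistance) of a 2-resistor divider."""
    v_out = v_in * r_bottom / (r_top + r_bottom)
    r_th = r_top * r_bottom / (r_top + r_bottom)
    return v_out, r_th


# Illustrative values only: 10K from the 5V rail to the tap, 20K from tap to ground.
v_out, r_th = divider(5.0, 10_000.0, 20_000.0)
print(f"{v_out:.2f} V behind {r_th:.0f} ohm")
```

Note that the divider's own source resistance (about 6.7K with these values) already acts as the pull-up resistance, so the resistor values would need to be chosen with the 100K pulldown and the attached serial adapter in mind.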

Or you can look at adding a pullup via SW on the RX pin.

Gerald

Dr_Michael_J_Chudobi · August 24, 2015, 6:33pm

Andrew,

Could you elaborate on what hardware changes you made to fix the ethernet issue?

- Mike

WulfMan · August 24, 2015, 7:06pm

If not using USB, ground Vusb. That has been working for me.

Brian_Adams · October 2, 2015, 10:23pm

We are building our images using the omap-image-builder scripts. Looks like everything is current (u_boot_release=“v2015.10-rc3”) yet we are still experiencing a 4.5% boot failure rate where we have a single power light and no other leds lit. Do I understand correctly that the uboot software fix that checks for any key press essentially reduces the failure rate because the random noise on the uart would have to be in the range of a real key-code to stop the boot sequence? And if that is right, that while reduced, there is still a range of noise that can abort the boot sequence?

RobertCNelson · October 2, 2015, 10:29pm

Random noise on tx/rx will stop the board...

For a board that "stops" what happens when you plug in a usb-usart,
are you really in u-boot?

(u-boot prompt should echo back on the first enter)

One reason i can't lock down u-boot by default, CircuitCo's testers
expect to take over u-boot to program the eeprom as part of their
tester machines..

Regards,

Mikkel_Kirkgaard_Nie · October 3, 2015, 10:54am

we are still experiencing a 4.5% boot failure rate
For a board that "stops" what happens when you plug in a usb-usart,
are you really in u-boot?
(u-boot prompt should echo back on the first enter)

Characters on uart0 were clearly, without any doubt, what interrupted the boot when I did my testing last winter (that was the unmodified uboot); Mikini Services » boot issue

I checked numerous times with a serial connection that uboot was indeed waiting for commands on uart0 in this condition.

As for the cause, I still have a suspicion towards the circuitry design around the buffer U15 and its OE and _OE (discussed in Mikini Services» Blog Archive » Beaglebone Black periodic boot failure; establishing failure rate and possible cause). But I guess nobody wants to waste time analyzing this issue on a stale design such as the BB.

One reason i can't lock down u-boot by default, CircuitCo's testers
expect to take over u-boot to program the eeprom as part of their
tester machines..

And that is indeed a valid argument for keeping the feature. Changing stuff in factory is non-trivial.
But still you would probably be able to identify the one character that their testers actually do use to stop boot and only react to that one. That would lower the probability by a factor of 2^8 (if all chars are still valid in current uboot).
Or future CircuitCo production image uboots could be built from a branch/fork/patchset that stops on one char as now. This would allow the proper "wait for string" solution to be implemented in the default uboot. People doing their own builds like Brian would then actually build a uboot that doesn't intentionally ruin boot stability.
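
The argument can be made concrete with a toy model. It assumes each noise byte seen on uart0 during the autoboot window is uniformly random over all 256 values, which real line noise almost certainly is not, so the numbers only illustrate the trend; all function names are hypothetical.

```python
from fractions import Fraction

BYTE_VALUES = 256  # assumption: noise bytes are uniform over all byte values


def p_abort_any_key(n_noise_bytes):
    """'Hit any key': a single received byte is enough to stop the boot."""
    return Fraction(1) if n_noise_bytes > 0 else Fraction(0)


def p_abort_single_char(n_noise_bytes):
    """Stop only on one specific byte: chance that at least one noise byte matches."""
    miss = Fraction(BYTE_VALUES - 1, BYTE_VALUES)
    return 1 - miss ** n_noise_bytes


def p_match_string(length):
    """Chance that `length` consecutive noise bytes equal a fixed stop string."""
    return Fraction(1, BYTE_VALUES ** length)
```

In this model one stray byte always aborts an "any key" boot, aborts a single-character boot with probability 1/256, and a 4-byte stop string would be matched by 4 given noise bytes with probability 1/2^32.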

It is really disheartening to know that people are still fighting this problem even though a solution is well-known and has been for such a long time.

Mikkel

Christopher_Stack · October 6, 2015, 10:24pm

I also am experiencing this issue. I’m trying to avoid a hardware modification so I was trying to follow Mikkel’s post: http://www.mikini.dk/index.php/category/beaglebone-black/boot-issue. However I can’t find a FAT partition on my sd-card… Is this something that changed when switching from Angstrom to Debian? I searched for u-boot.img and MLO and only found them in the u-boot backup folder so I assume that’s not what I need to change.

I’m running the Debian image from 03-01-2015. Using the newer images did not increase reliability by much.

Thanks,

Chris

William_Hermans · October 7, 2015, 12:31am

However I can’t find a FAT partition on my sd-card… Is this something that changed when switching from Angstrom to Debian? Searching for u-boot.img and MLO and only found them in the u-boot backup folder so I assume that’s not what I need to change.

This has been in effect since around kernel 3.8.13-bone47 on Wheezy. Since then the images use a single partition, with MLO and u-boot.img located in the first 1M of the disk, ahead of that partition.

William_Hermans · October 7, 2015, 12:34am

For more information you can read my blog post here: http://www.embeddedhobbyist.com/2015/09/beaglebone-black-working-with-debianlinux-images/

It’s briefly covered in the “Advanced usage of tar” section.

Christopher_Stack · October 8, 2015, 8:06pm

Let me know if I need to put this in a new thread, or if someone wants to take this off the group entirely.

I followed the directions here: https://eewiki.net/display/linuxonarm/BeagleBone+Black#BeagleBoneBlack-Bootloader:U-Boot. Specifically this section: Bootloader: U-Boot and the bootloader portion of Setup microSD card.

After following those directions the Beagle Bone fails to boot, leaving 2 LEDS on and messages on UART0 saying that there is no partition table.

I see the next step in Setup microSD card is creating the partition table, but that doesn’t seem to be working; it complains that the device is currently busy. I’d prefer not to blow away the operating system if that’s possible? Just to see what happened I forced the sfdisk command and it seemed to create 4 partitions, only one of which had a size which seems right to me. When I rebooted though I got an error that says bad device mmc 1.

Any ideas?

Thanks,

Chris

RobertCNelson · October 8, 2015, 8:25pm

busy? odd as you just blew out the partition table in the step before...

I'm guessing... Virtual Machine???

or an evil auto-mounter...

Regards,

William_Hermans · October 8, 2015, 8:29pm

I’ve found that 2 LEDs, or any number of LEDs on, and stuck on after a boot is a uEnv.txt configuration error. Any way you can show us your uEnv.txt file contents?

Also it would be good to get the output from the serial debug port.

Christopher_Stack · October 8, 2015, 8:37pm

Sorry that this is probably a dumb question, but is it possible to not blow away the partition table since I just want to swap the MLO and u-boot.img?

I’m just running the standard image on a Beagle Bone Black that does not have an eMMC or hdmi chip. I assumed it was busy because I was trying to reprogram the disk while on an OS that was using the disk?

This was the error:

Checking that no-one is using this disk right now …
BLKRRPART: Device or resource busy

After adding the force option it said:

Checking that no-one is using this disk right now …
BLKRRPART: Device or resource busy

This disk is currently in use - repartitioning is probably a bad idea.
Umount all file systems, and swapoff all swap partitions on this disk.
Use the --no-reread flag to suppress this check.

Disk /dev/mmcblk0: 121008 cylinders, 4 heads, 16 sectors/track

sfdisk: ERROR: sector 3069576189 does not have an msdos signature
/dev/mmcblk0: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = mebibytes of 1048576 bytes, blocks of 1024 bytes, counting from 0

Device Boot Start End MiB #blocks Id System
/dev/mmcblk0p1 * 1 3780 3780 3870720 83 Linux
/dev/mmcblk0p2 0 - 0 0 0 Empty
/dev/mmcblk0p3 0 - 0 0 0 Empty
/dev/mmcblk0p4 0 - 0 0 0 Empty
Successfully wrote the new partition table

Re-reading the partition table …
BLKRRPART: Device or resource busy
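
As a sanity check on the sfdisk numbers above, the geometry and partition size can be recomputed as in the sketch below. It assumes standard 512-byte sectors; every other figure is copied from the output.

```python
SECTOR_BYTES = 512  # assumption: standard 512-byte sectors
MIB = 1048576

# "Disk /dev/mmcblk0: 121008 cylinders, 4 heads, 16 sectors/track"
cylinders, heads, sectors_per_track = 121008, 4, 16
total_sectors = cylinders * heads * sectors_per_track
total_mib = total_sectors * SECTOR_BYTES / MIB

# /dev/mmcblk0p1: 3870720 blocks of 1024 bytes
part_mib = 3870720 * 1024 / MIB

# "sector 3069576189 does not have an msdos signature"
bad_sector = 3069576189

print(f"card: {total_sectors} sectors, {total_mib} MiB")
print(f"partition 1: {part_mib} MiB")
print(f"reported sector is past the end of the card: {bad_sector > total_sectors}")
```

The card comes out at 7,744,512 sectors (3781.5 MiB) and the new partition at 3780 MiB, so the sector number in the error message lies far beyond the end of the card. That would fit sfdisk having followed leftover garbage where the old partition table used to be, although that is only a guess from the numbers.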

I have the serial output after the last test, but not from the test where I did not recreate the partition table.

U-Boot 2015.10-rc2-00001-g5879130-dirty (Oct 09 2015 - 15:52:49 +0000)

Watchdog enabled
I2C: ready
DRAM: 512 MiB
Reset Source: Global external warm reset has occurred.
Reset Source: Power-on reset has occurred.
MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
Using default environment

Net: not set. Validating first E-fuse MAC
cpsw
Hit any key to stop autoboot: 0
gpio: pin 53 (gpio 53) value is 1
switch to partitions #0, OK
mmc0 is current device
gpio: pin 54 (gpio 54) value is 1
Checking for: /uEnv.txt …
Checking for: /boot.scr …
Checking for: /boot/boot.scr …
Checking for: /boot/uEnv.txt …
** Invalid partition 2 **
** Invalid partition 3 **
** Invalid partition 4 **
** Invalid partition 5 **
** Invalid partition 6 **
** Invalid partition 7 **
gpio: pin 56 (gpio 56) value is 0
gpio: pin 55 (gpio 55) value is 0
gpio: pin 54 (gpio 54) value is 0
Card did not respond to voltage select!
gpio: pin 54 (gpio 54) value is 1
Card did not respond to voltage select!
** Bad device mmc 1 **
Checking for: /uEnv.txt …
Card did not respond to voltage select!
** Bad device mmc 1 **
Checking for: /boot.scr …
Card did not respond to voltage select!
** Bad device mmc 1 **
Checking for: /boot/boot.scr …
Card did not respond to voltage select!
** Bad device mmc 1 **
Checking for: /boot/uEnv.txt …
Card did not respond to voltage select!
** Bad device mmc 1 **
Card did not respond to voltage select!
** Bad device mmc 1 **
Card did not respond to voltage select!
** Bad device mmc 1 **
Card did not respond to voltage select!
** Bad device mmc 1 **
Card did not respond to voltage select!
** Bad device mmc 1 **
Card did not respond to voltage select!
** Bad device mmc 1 **
Card did not respond to voltage select!
** Bad device mmc 1 **

FAILSAFE: U-Boot UMS (USB Mass Storage) enabled, media now available over the usb slave port …
Card did not respond to voltage select!
** Bad device mmc 1 **

Which uEnv.txt would you like? The only modifications I made from stock were to disable the eMMC and hdmi.

Thanks,

Chris