Hardware watchdog for BBB

John_Syn · May 16, 2016, 10:35pm

Batteries have a limited number of charge cycles. Now you are going to say the battery remains charged, so you don’t have an issue with the number of charge cycles, but if you keep the battery charged to greater than 90%, the battery won’t last more than 2 to 3 years. You can extend the life of a LiPo by limiting the charge to 60% of capacity, which is the charge percentage when you purchase a new battery.

Regards,
John

Dave_Loomis · May 16, 2016, 11:52pm

You can sum it all up like this: the problem is completely solved by using a battery and having acpid installed. Except you need a way to completely disconnect power from the BBB's input for one, or perhaps two, corner cases that would otherwise require a hard reset.

I love the no-nonsense mentality, and quality design behind this approach for most use cases. But, for some high-reliability use cases like mine -- a device permanently installed in a remote, client wall — batteries aren’t a great fit.

  For long-term accessibility: Battery maintenance, even after years of initial functionality, is extremely inconvenient or impossible.
  For insurance reasons: The potential liability of installing LiPo, which is known to have potential fire issues, into a client’s wall.
  For shipping reasons: The added hassle of international shipping of LiPo-based systems.

All these fancy high cost solutions are honestly ridiculous, and if you can just use an OTS UPS . . .

Hardware cost is relative. The “high” cost (<$100) of a system design is nothing, if it will potentially save me things like panicked client calls, last-minute international plane tickets and high-pressure field repairs. Those just aren’t fun. Obviously every project out there isn’t heading to a NASA rover, but in some lines of work this kind of service is expected when a high-end, mission-critical system goes down. In the end, if I do my job right, the price is just passed on to the client who is willing to pay a premium for a high reliability, maintenance-free product.

I’d like to be able to deploy those systems based on the BBB, because I know it, find the platform highly versatile, and a good match for the variety of projects I take on. I think the PRUs especially make this a very unique little SoC.

Best,
ST

Gerald_Coley1 · May 17, 2016, 12:12am

Sounds good to me!

Gerald

Super_Twang · May 17, 2016, 12:56am

@Gerald
I’m a bit new on this forum; forgive me if you’re not the right person to ask… (please, anyone else, respond if you are). It’s just that I’ve heard other people request and defer to your opinion on things, and, well, you have a pretty official-looking email address!

I understand the goal of keeping the BBB cost low (part of what attracted me to it over the RPi in the first place). I agree, it is good to keep it competitive and within the reach of hobbyists. I realize it may not be as much of an issue for the use cases of others, but it’d seem that an inability to keep a BBB reliably running without physically pressing its reset button is a bit of an Achilles’ heel for the platform. (I’m speaking here not generally of UPS-style mains protection, but specifically of the issue that sometimes prevents its restart without physically pressing the button, i.e., you can’t simply use an OTS UPS without some additional logic.) It’d seem that anyone who wants to leave their device running for an extended period of time would be impacted by this.

Do you know folks at TI who are as invested as you are in the success of the BBB platform? In my own research, I’ve come to understand that TI makes many of the components that might form the basis of a rock solid “Reliability System” (as discussed in this thread). (Things like supercap chargers, buck-boost chips, etc) It’d seem that this problem would be a natural fit for someone at TI to solve in the form of a TI Reference Design, or Application Notes for their product line on their end. Such an effort would be a win-win. TI would be able to sell more TI components, support the BBB user community and open the device to new use cases, and their resulting markets. The BBB community would be able to implement (or purchase) the TI/BBB reliability circuit, and focus on their primary designs, without having to solve the same basic reliability issue over and over.

Is there someone at TI I could present this idea to? Is this the kind of thing that’d even get considered and resourced? Is TI nimble enough to care, and responsive to the BBB user community?

Thanks for your thoughts,
ST

Graham1 · May 17, 2016, 12:56am

It all depends on what you are worried about.

I have several BBBs that I use as servers, and I want them to be robust.

So while I was working through power backup and an external hardware watchdog per all the previous discussions, we had a thunderstorm roll through the area.

No close strikes, but the Ethernet network interface went catatonic, would not send or receive, but didn’t throw any errors.

I could not SSH into the command line.

But the local serial port/command line worked fine. The kernel seemed to be happily running, and not worried about anything in particular.

The system logs looked like someone had disconnected the Ethernet cable during the storm, but the network was still physically connected, with the RJ-45 socket lights blinking.

A power cycle reestablished everything.

So, probably some kind of transient flipped a few configuration register bits and stopped the Ethernet interface.
No physical damage.

This kind of thing cannot be unique, because I note that there are Ethernet-controlled power strips with “Auto-Ping.”

The stated feature is an “Auto-Ping” feature to intelligently reboot a locked-up AP, router, VoIP phone, server, camera, or other device automatically.

Web Power Switch 7. http://www.digital-loggers.com/lpc.html

So either I can go buy a $115 smart AC power switch, or use an Ethernet-PIC instead of the MSP430.

— Graham

William_Hermans · May 17, 2016, 1:03am

This kind of thing cannot be unique, because I note that there are Ethernet-controlled power strips with “Auto-Ping.”

The stated feature is an “Auto-Ping” feature to intelligently reboot a locked-up AP, router, VoIP phone, server, camera, or other device automatically.

This isn’t unique to the Beaglebones. We get close strikes here all the time during the summer, and in fact one decapped a Realtek Ethernet bridge chip in a PoE switch last year . . . But my personal equipment gets reset by static discharges from close strikes several times a year. Mostly it’s just a GbE switch I have in my bedroom, but other devices get messed up from time to time as well. Once or twice our PoE WDS routers have too . . .

Gerald_Coley1 · May 17, 2016, 1:07am

I am no longer with TI. You would need to ask Jason as to the level of commitment that TI has toward doing something like this.

If there is a decently robust need for a platform based on the BBB, but with some more robust and desired features, I would be open to making that happen.

Gerald

William_Hermans · May 17, 2016, 1:07am

@Dave Loomis

Fine, if something does not work for you then by all means don’t use it. But the high cost ridiculous stuff I was speaking to all uses batteries . . . so you’re SoL there too.

Also, if LiPo is not good for you, think about switching to a different battery chemistry. Get crazy, think outside the box.

Super_Twang · May 17, 2016, 1:07am

@Graham
Wow! I hadn’t yet thought of Ethernet as a point of failure. Apart from the “It doesn’t always soft-reset” issue (see outline I.B.1.b), I’d guess you could solve this with the onboard watchdog timer. Run some kind of daemon that periodically “checks for good Ethernet” (a bit vague, I know); if found, it tickles the watchdog, if not, it provokes a reboot. But yes, the problem remains that the reboot doesn’t always complete.

Of course if your Ethernet got fried, that’d turn into a reboot cycle without some logic to notify you of the problem, and stop after a number of cycles.
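
A minimal sketch of that daemon idea, in Python, assuming a Linux board that exposes its watchdog at /dev/watchdog. The counter file path, the MAX_REBOOTS limit and the function names are made up for illustration, and the carrier check is only a crude stand-in for “good Ethernet”; a ping-based check can be passed in instead. The reboot is provoked simply by no longer writing to the watchdog.

```python
import os
import time

WATCHDOG_DEV = "/dev/watchdog"        # standard Linux watchdog device node
COUNT_FILE = "/var/lib/netdog/count"  # hypothetical path; directory must exist
MAX_REBOOTS = 3                       # assumed limit before we stop rebooting


def read_reboot_count(path):
    """Return the stored number of consecutive forced reboots (0 if unreadable)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def write_reboot_count(path, count):
    """Persist the counter and push it to storage before a reset can hit."""
    with open(path, "w") as f:
        f.write(str(count))
        f.flush()
        os.fsync(f.fileno())


def carrier_up(iface="eth0"):
    """Crude link check: True if the kernel reports carrier on iface."""
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        return False


def decide(link_ok, reboot_count, max_reboots):
    """Return 'kick', 'reboot', or 'give_up' for one check cycle."""
    if link_ok:
        return "kick"
    if reboot_count < max_reboots:
        return "reboot"
    return "give_up"


def run(check=carrier_up, interval_s=10):
    """Main loop; interval_s must be shorter than the watchdog timeout."""
    count = read_reboot_count(COUNT_FILE)
    with open(WATCHDOG_DEV, "wb", buffering=0) as wd:
        while True:
            action = decide(check(), count, MAX_REBOOTS)
            if action == "reboot":
                write_reboot_count(COUNT_FILE, count + 1)
                while True:           # stop kicking; the hardware timer expires
                    time.sleep(1)
            if action == "kick" and count:
                count = 0             # link is healthy again, clear the counter
                write_reboot_count(COUNT_FILE, 0)
            # 'give_up' also keeps kicking so the board stays reachable over
            # serial; a notification hook would go here.
            wd.write(b"\0")
            time.sleep(interval_s)
```

Calling run() from a service started at boot would be the intended use. This does nothing about the case where the watchdog reset itself fails to complete.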

William_Hermans · May 17, 2016, 1:14am

Ethernet has always been a point of failure, as much as telephone lines have been in the past. Except with Ethernet it is usually less drastic, as many networks are not physically exposed to the elements. Like ours, where we have two external WDS routers, point to point, half a mile apart, with a power transformer connected to a house on one end that loves to be struck by lightning . . .

Graham1 · May 17, 2016, 2:05am

Twang:

Well, that is what the “Auto-Ping” is all about.

If I don’t get a ping from you in the last two minutes, then you get power-cycled/rebooted.

There are IoT PICs that are ~$5 that can speak Ethernet and could be programmed to reset, or press the power button if 5V was present, and they had not heard from the BBB lately.

More appropriate monitoring for a server, than watching some GPIO wiggle.
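
The auto-ping rule above can be modelled in a few lines. This is only a host-side Python sketch of the timing logic, not firmware for any particular PIC or power switch; the class and method names are invented for illustration, and the two-minute default comes from the description above.

```python
import time


class AutoPingMonitor:
    """Tracks when the target was last heard from and decides on a power cycle."""

    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen = clock()

    def heartbeat(self):
        """Call whenever a ping or other sign of life arrives from the target."""
        self.last_seen = self.clock()

    def should_power_cycle(self):
        """True once the target has been silent for longer than timeout_s."""
        return self.clock() - self.last_seen > self.timeout_s

    def after_power_cycle(self):
        """Restart the timer so the target gets a full window to boot."""
        self.last_seen = self.clock()
```

A real monitor would call should_power_cycle() in its main loop, toggle the power or reset line when it returns True, and then call after_power_cycle() so a slow boot does not trigger a second cycle straight away.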

— Graham

zamek_z · May 17, 2016, 6:45am

Hi All,

We have a power supply like this. We use it for a Wandboard.

Technical parameters:

Input voltage: 9-32 VDC
Output voltage: 5 VDC / 2500 mA
There is a 25 F supercapacitor, which can hold the power up for up to 30 s after the input power is lost.
There is an input GPIO for the watchdog and an output GPIO for a power-OK signal.
We can make it for the BeagleBoard.
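
For anyone wanting to sanity-check a hold-up figure like that, here is a rough Python estimate based on the usual capacitor energy formula. Only the 25 F value comes from the post; the capacitor voltage range, the load power and the converter efficiency below are assumptions picked for illustration, so the result says nothing definite about this particular supply.

```python
def holdup_seconds(cap_f, v_start, v_min, load_w, efficiency=1.0):
    """Seconds a capacitor can feed a constant-power load.

    Usable energy is 1/2 * C * (v_start^2 - v_min^2), scaled by the
    converter efficiency and divided by the load power.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    if v_min > v_start:
        raise ValueError("v_min must not exceed v_start")
    usable_j = 0.5 * cap_f * (v_start ** 2 - v_min ** 2)
    return efficiency * usable_j / load_w


# Assumed inputs: a single cell charged to 2.7 V, usable down to 1.0 V,
# a 2 W load and an 85% efficient boost stage.
estimate = holdup_seconds(cap_f=25.0, v_start=2.7, v_min=1.0,
                          load_w=2.0, efficiency=0.85)
print(f"estimated hold-up: {estimate:.1f} s")
```

With these made-up inputs the estimate is about 33 s. A heavier load or a higher cut-off voltage shortens it quickly.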

Please write to me if you are interested.

Lachlan_Audas · May 17, 2016, 7:38am

Here is an example circuit.

It has a WD CPU with an RTC, backed by a supercap, battery, or low-leakage cap.

1: It will control the target system: BBB, Raspberry Pi, etc.

2: Power on the target system at a preset time.

3: WD processor is always power cycled, and saves its current state to RTC or FRAM or flash.

4: Allows an alternate boot device on your target system, if that is supported on the target. Can download an image over serial or SPI, etc. (in case of flash/SD failure on the target), RAM test, diagnostics, etc.

5: Can check target system for correct startup, can check target system for freeze etc.

There are many options for how you check your target system. For example, you can run shell commands on the target system to ping hosts, send check values, etc. It’s up to you how smart you want to be: anything you can do over a serial port on Linux, for example.

6: Super reliable.

7: Backup power is up to you; design it for the target’s requirements.

8: And cheap !!

Lachlan,

PS: sorry about the image size, as I can’t get Eagle to do a nice PDF!!

The design can be in KiCad… easy to convert.

Lachlan

(Attachment a_quick_direty_example.png is missing)

Super_Twang · May 17, 2016, 3:34pm

@Graham
I’ll have to experiment with this. Thanks for the suggestion! It is definitely a higher level approach that could be easier to piece together with low-cost OTS components.

Do you have a specific PIC in mind? If not, I can dig around for a good one. Last time I used a PIC it was all assembly language, with no USB ICSP and a PC-only dev environment. Has that changed? (I’m developing from a Mac)

Initially my thought was that it wouldn’t work for me because my device is designed to work while disconnected from a larger network (It is connected to a router broadcasting a private access point). But, there is nothing preventing me from connecting a switch to the router, and then the device and an auto-ping power control to the switch. My own little auto-ping network… Hmmm!

ST

Super_Twang · May 17, 2016, 3:36pm

@Lachlan
Thank you for sharing your design. I’m definitely learning from it and I’m sure others will too.
ST

Lachlan_Audas · May 17, 2016, 3:52pm

No problem, you can have the design files, in Eagle or KiCad.

It’s just an example, but if you think about it, it uses an old trick from aircraft, though in this case it’s a complete power cycle. They would save what the CPU was doing in NV memory of some sort (I assume core memory), reset the CPU, reload the data, check it, and continue. So they may have 5 resets a second, etc. Remember, airplanes can take many lightning strikes, and if your life depends on those chips getting it correct, you want to use every trick in the book!
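
The save, reset, reload, check, continue pattern can be sketched like this in Python. It is only an illustration of the idea: the function names and the JSON-plus-CRC32 layout are assumptions, not what any aircraft system or the circuit above actually uses, and the bytes would be written to whatever NV storage is available (FRAM, RTC RAM, flash).

```python
import json
import zlib


def pack_checkpoint(state):
    """Serialize state and prefix it with a CRC32 so corruption can be detected."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload


def unpack_checkpoint(blob):
    """Return the saved state, or None if the blob is missing or fails its check."""
    if len(blob) < 4:
        return None
    crc = int.from_bytes(blob[:4], "big")
    payload = blob[4:]
    if zlib.crc32(payload) != crc:
        return None
    return json.loads(payload.decode("utf-8"))
```

After a reset, the program would read the blob back, call unpack_checkpoint(), and fall back to a known-good default state whenever it returns None.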

I was on a flight (747-400), on a bad day, Sept 11, from Australia to the USA. The pilot came on the PA and said there were problems with some onboard systems, and resetting them did not fix the problem. So they had to pull the circuit breakers for that part of the plane, wait, and reactivate them, which fixed the problem. So resets don’t always fix the problem; even on a 747-400, a power cycle was the only thing that worked.

Lachlan.

Lachlan

Super_Twang · May 17, 2016, 6:30pm

Hi y’all,

Below is a snippet of an offline email exchange between myself and Gerald of Beagleboard.org… I’m reposting it here because I think it further clarifies some of the subtleties of what is being discussed here, specifically how brown-outs are what throw the wrench into a simple solution with existing hardware. It also illustrates some real-world constraints Gerald et al are facing around a potential solution at the board level. I hope this helps bring anyone new to the discussion up to speed, as it helped me.

Best,
ST

Graham1 · May 17, 2016, 7:12pm

Twang:

You could look at the PIC32MX5xx/6xx/7xx series or PIC32MZ series.
The low end starts below $5, quantity one. They will need an external Ethernet PHY chip.
32 bit MIPS core, program in C, full Ethernet stack available.

If you want to experiment, get a PIC32MX starter card.

TI may have something equivalent on an ARM core. I just happen to be more familiar with the PICs.

— Graham

William_Hermans · May 17, 2016, 7:30pm

Super twang,

I honestly think you’re overthinking the situation. It’s good to try and cover all possibilities, but you’re asking questions of people that have not answered specific questions that were answered by others already. There are several smart people on this group. I’d like to count myself among them, but in my own case I know I do not think of everything, which is why my buddy and I have talked at length on this subject, trying to work everything out.

. . .And you know what, we missed something that thanks to Graham I’m thinking of now. A stale Ethernet connection is every bit as bad as a hung system.

@Graham,

What I propose is that you do not need an Ethernet micro connected to the BBB. Instead, you have the BBB ping the outside world once every set time frame, and if pings keep coming back unreachable for, say, 5-10 minutes, you just stop “kicking the dog”. This does present a potential problem: your internet connection may just be down. But a remote system that reboots once every 5-10 minutes because the internet connection is down is not something I’d personally see as a bad thing. After all, you’re unable to connect to the system anyway.
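
A rough Python sketch of this proposal, assuming Linux with the iputils ping command and a watchdog at /dev/watchdog. The host address is a documentation placeholder, and the function names and default timings are assumptions for illustration.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """True if a single ICMP echo to host gets a reply (uses the ping command)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_kick(last_ok, now, grace_s):
    """Keep kicking only while the last good ping is within the grace window."""
    return (now - last_ok) < grace_s


def run(host="192.0.2.1", grace_s=300, interval_s=30, watchdog="/dev/watchdog"):
    """Ping host every interval_s; stop kicking after grace_s without a reply.

    interval_s must be shorter than the hardware watchdog's timeout.
    """
    last_ok = time.monotonic()
    with open(watchdog, "wb", buffering=0) as wd:
        while True:
            if ping_once(host):
                last_ok = time.monotonic()
            if should_kick(last_ok, time.monotonic(), grace_s):
                wd.write(b"\0")       # any write resets the watchdog timer
            time.sleep(interval_s)
```

With grace_s=300 the board resets roughly five minutes after the last good ping, plus the watchdog's own timeout, and keeps doing so for as long as the far end stays unreachable.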

Super_Twang · May 17, 2016, 9:20pm

@William
You’re right, I could be overthinking! I’m juggling a lot of factors, and looking for both a quick, low-hanging-fruit, short-term solution (a la off-the-shelf Ether-auto-ping, minimal hardware patch, or commercial product), as well as, over the longer term, a rock-solid, well-engineered solution (likely supercap-based). I’m hoping what I contribute here might be as useful to others as all of the conversations on this topic (yours included!) that preceded me have been to me.

It is admittedly hard to know when I’ve done “Enough,” esp. since I lack my own direct domain knowledge. Part of the problem is being able to discern which design proposals in this and other threads are actually relevant to my own use case, and constraints. You’ve had some great suggestions, and I like the way you approached your own design, but parts of it don’t work for me (LiPo-batteries, reasons stated prior). Also, present in most everyone’s proposal is something small/cheap MCU-based. Esp for the near term quick fix, the time investment required by learning a new MCU (or something like GreenPak), and its dev kit, (even though I’m quite interested), makes me feel admittedly a little desperate for alternatives.

In other news, you’re right, Graham’s dead-Ethernet scenario was something I hadn’t even considered! Fortunately for my own case, most of the time my systems will be isolated from the greater network, with a 6 inch Ethernet run to a dedicated private access point. So I won’t have as much potential exposure to lightning strikes on the Ethernet lines, etc., but I also can’t ping a control/watchdog from anywhere external. Your own deployment, the “lightning magnet,” sounds like it’s dealing with a pretty gnarly environment!

My research on this is winding down so this channel will probably quiet down. Again, thanks for the help.

Best,
ST