[beagleboard] Date/Time wildly jumping forward 2^17 seconds every few seconds/minutes

Andrew_Bradford · February 10, 2013, 2:02pm

Why not trust ntp?

Which RTC? How's it backed? Does it switch to bone power source
ever? If so, which one?

What's wrong with just not having ntp nor an external RTC?

If you can't reproduce it, it's a bit hard to debug...

-Andrew

Matthias_Larisch · February 10, 2013, 3:12pm

> Why not trust ntp?

→ Not applicable in our environment. Completely disabled so as not to disturb anything.

> Which RTC? How’s it backed? Does it switch to bone power source
> ever? If so, which one?

→ DS1337 running from a CR2032 cell or the bone supply (ultra low reverse current Schottky in series with each supply), will last for about 5 years.

> What’s wrong with just not having ntp nor an external RTC?

I do have the external RTC but use it only for setting system time on boot. That is totally acceptable for me.

> If you can’t reproduce it, it’s a bit hard to debug…

That is why I am asking. It would be a problem if I cannot explain the cause of this problem, because that means it may happen again in the future. That would be really bad… Since at least a soft reset is of no use, I see no easy way to avoid it. Even if a hard reset made the system work fine: why did this bug occur in the first place? I forgot to mention: we had about 0°C ambient temperature with relatively low humidity.

Maybe someone knows more about the internals of the date/time system? Is 2^17 (131072) seconds a number occurring anywhere related to it?

mickeyf · February 12, 2013, 2:42pm

> Why did this bug occur in first place? I forgot to mention: We had about 0°C environment temperature with relatively low humidity.

This is pure speculation on my part, but temperature changes do cause expansion/contraction, and things like questionable solder joints or even the internals of borderline ICs can be affected. More often you see it with overheating rather than cooling, but that’s just because most of us prefer to work somewhere more comfortable than in the freezer.

Back in the days of CP/M I once tracked down a bad memory chip by torturing each of a bunch of 4k chips by holding an ice cube in a baggie against them until one of them finally confessed. (No coolant spray available.)

Do you see this on every beagle, or only on one?
Can you do long-running tests with the board in question both at severely reduced temperature and at room temperature?

Andrew_Bradford · March 25, 2013, 6:21pm

Matthias,

>> Why not trust ntp?
>
> -> Not applicable in our environment. Completely disabled to not
> disturb anything.
>
>> Which RTC? How's it backed? Does it switch to bone power source
>> ever? If so, which one?
>
> -> DS1337 running from CR2032 Cell or bone supply (ultra low reverse
> current Schottky in series with each supply), will last for about 5
> years.

I'm running a DS1339C-33# on my bones.

>> What's wrong with just not having ntp nor an external RTC?
>
> I do have the external RTC but use it _only_ for setting system time
> on boot. That is totally acceptable for me.
>
>> If you can't reproduce it, it's a bit hard to debug...
>
> That is why I am asking. It would be a problem if I cannot explain
> the cause of this problem because that means it may happen in the
> future again. That would be really bad... As at least soft reset is
> of no use I see no easy way to avoid it. Even if a hard reset would
> make the system working fine: Why did this bug occur in first place?
> I forgot to mention: We had about 0°C environment temperature with
> relatively low humidity.
>
> Maybe someone knows more about the internals of the date/time system?
> Is 2^17 (131072) seconds a number occurring anywhere related to it?

I now bite my tongue, somewhat.

I have ntp running on all my bones with Debian Squeeze armel and the
DS1339 RTC. The RTC is powered by a linear regulator through SYS5_V
via a lithium battery that connects to the TPS65217 through the battery
pins in P6. This part is not my issue.

My issue, which mimics yours, is that sometimes people who I've given a
system (bone + custom cape) will return them to me after running for
weeks or more and the boot sequence will stop for fsck due to last
superblock mount being in the future.

I've not yet seen this happen on any system I run. It only happens on
systems I've given to others, but rarely. I do not know if it is a
slow progression of time going faster than it should or if it's step
jumps. My attempts to reproduce the problem on my desk have yielded
nothing, so far.

The two data points I have are that the time got about 36 hours ahead
and 4 days 12 hours (108 hours) ahead. This roughly corresponds to
your observed 2^17 seconds jumps.

If I find more information about this, I'll be sure to reply.

Regards,
Andrew

George_Lu · March 25, 2013, 9:45pm

Hi Andrew, Matthias,

For almost 10 months now my partner and I have had a few bones deployed in a remote monitoring application. We don’t have an external RTC on them. We have ntp configured to get time through GPS (on /dev/ttyO4), augmented by PPS through a GPIO pin. We found that the bones would occasionally jump forward by about 1.5 days into the future. In each case, ntpd died about 9 min after the “time travel” started. We were able to add a daemon to detect the absence of ntpd and restart it. We could see in the log file that there is a 9 min period when the timestamp is ahead by ~1.5 days. Eventually we modified the daemon to detect whether there has been an abnormal jump in system time, so it can correct the problem within a minute.
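A watchdog of the kind described here could work along the following lines. This is only a minimal sketch and not the actual daemon from this post: the function names `unexpected_step` and `watch`, the polling interval, and the 5 second tolerance are all assumptions made for illustration. It samples the wall clock, sleeps for a known interval, and flags any step that differs from that interval by more than the tolerance.

```python
import time


def unexpected_step(prev_wall, now_wall, expected_interval, tolerance=5.0):
    """Return how far the wall clock moved beyond the expected interval.

    Returns 0.0 when the step is within `tolerance` seconds of what the
    sleep should have produced. Names and thresholds are hypothetical.
    """
    excess = (now_wall - prev_wall) - expected_interval
    return excess if abs(excess) > tolerance else 0.0


def watch(interval=1.0, tolerance=5.0, samples=3):
    """Poll the wall clock `samples` times and collect any abnormal steps."""
    jumps = []
    prev = time.time()
    for _ in range(samples):
        time.sleep(interval)
        now = time.time()
        excess = unexpected_step(prev, now, interval, tolerance)
        if excess:
            jumps.append(excess)
        prev = now
    return jumps


if __name__ == "__main__":
    # A real daemon would loop forever and trigger a correction here
    # (for example restarting ntpd); this sketch only reports.
    print(watch(interval=0.1))
```

A forward jump of 2^17 seconds would show up as an excess of roughly 131072. If the sleep itself is driven by the misbehaving timer it may return early, but the excess over the expected interval would still be large.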

After reading your email, I went back and calculated the exact time jump from an event captured last December: it is 36 hours 25 min, which is 2^17 seconds!
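As a quick check of the numbers in this thread: 2^17 seconds works out to 36 h 24 min 32 s, which agrees with the measured 36 hours 25 min to within rounding, and the two offsets Andrew reported (about 36 and 108 hours) are close to one and three such jumps. The helper names below are made up for illustration.

```python
JUMP_SECONDS = 2 ** 17  # 131072


def to_hms(seconds):
    """Split a whole number of seconds into (hours, minutes, seconds)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs


def jumps_for(hours_ahead):
    """Express an offset in hours as a multiple of the 2^17 s jump."""
    return hours_ahead * 3600 / JUMP_SECONDS


print(to_hms(JUMP_SECONDS))      # (36, 24, 32)
print(round(jumps_for(36), 2))   # 0.99
print(round(jumps_for(108), 2))  # 2.97
```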

Some of these bones have been powered by a solar harvesting system (solar panel + LiFePO4 battery + 12-to-5V converter with 3A output). Nevertheless we have seen the same time travel in lab units, whether connected through wifi or ethernet. The bones are all running Ubuntu from Robert Nelson’s repository with 3.2 series kernels modified to add the PPS-GPIO driver and register gpio1_31 for PPS. The ntp version is 4.2.6p5, which is still the most current production release on http://www.ntp.org/downloads.html.

We have been thinking this is an ntp problem. It happens perhaps once every few weeks, so it is not a big deal for us, especially since we have a way to correct it within a minute. Do I understand you correctly that you have seen this 2^17 seconds time travel without ntp?

George

Andrew_Bradford · March 26, 2013, 12:02pm

Of note: on the systems I have had returned to me, the application that
runs on them makes a lot of calls to clock_gettime(). When I say a lot,
I mean possibly more often than once per millisecond under heavy load.
This probably isn't the ideal way of obtaining the current time (the
system the application interacts with wants time stamps, for now) but
it's how it works today.
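One way to try to provoke the problem on a desk is to read the clock back to back and look at the largest step between consecutive readings. The sketch below is an assumption-laden illustration, not the application described above: it uses Python's `time.clock_gettime()` (available on Unix), the helper names are hypothetical, and the choice of `CLOCK_MONOTONIC` is a guess, since the post does not say which clock ID the application reads. `time.CLOCK_REALTIME` can be passed instead.

```python
import time


def largest_step(readings):
    """Return the largest difference between consecutive clock readings."""
    return max(later - earlier
               for earlier, later in zip(readings, readings[1:]))


def sample_clock(count=100_000, clock_id=time.CLOCK_MONOTONIC):
    """Read the given clock `count` times back to back."""
    return [time.clock_gettime(clock_id) for _ in range(count)]


if __name__ == "__main__":
    step = largest_step(sample_clock())
    # Back-to-back reads are normally a tiny fraction of a second apart;
    # a step near 131072 s (2^17) would be the jump discussed here.
    print(f"largest step between consecutive reads: {step:.6f} s")
```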

I had been running a bone with the same Debian install, but with no cape
and the application not running, for 3 weeks without powerdown or reboot.
I did not observe any time jump. [1] seems to indicate a similar problem
in older kernels; I'm not sure whether this type of problem may still be
present. I'll try to take a look. It might be possible that this is
related (it could also be a dead end).

[1]:https://lkml.org/lkml/2007/8/23/96

-Andrew

Juanjo · March 30, 2013, 3:57am

Could it be related to this problem?

https://groups.google.com/forum/?fromgroups=#!topic/beagleboard/j4fdat3Bj04

Hiremath_Vaibhav · April 1, 2013, 4:53am

You are not using the internal RTC module, right?

If that is correct, then the linked issue is not related to you and your setup.

Just to clarify for your understanding: the AM335x has an internal RTC module which takes in a 32.768 kHz clock from an external crystal connected to the RTC module. In order to enable this 32.768 kHz clock you have to write the RTC MMR registers, and as per the spec it requires ~2 sec of stabilization time. So we are doing it in the MLO boot stage, so that by the time the kernel comes up we can use this 32.768 kHz clock for the kernel system timers.

Thanks,

Vaibhav
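One piece of arithmetic may be worth noting next to the 32.768 kHz figure. If the timer counter fed by that clock is 32 bits wide (an assumption here, nothing in this thread states the width), it rolls over every 2^32 / 32768 = 2^17 seconds, which is exactly the size of the reported jumps. This is only an observation about the numbers; it does not by itself show that a rollover is the mechanism behind the jumps.

```python
CLOCK_HZ = 32768    # 32.768 kHz, as stated in the post above
COUNTER_BITS = 32   # assumption: width of the timer counter register


def wrap_period_seconds(counter_bits=COUNTER_BITS, clock_hz=CLOCK_HZ):
    """Seconds until a free-running counter of this width rolls over."""
    return 2 ** counter_bits / clock_hz


period = wrap_period_seconds()
print(period, period == 2 ** 17)
```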

Hiremath_Vaibhav · April 2, 2013, 6:39am

Thanks to Vaibhav B, who just reminded me of the timer posted-mode erratum we have for AM335x. The issue you are seeing could be related to that, as you are using clock_gettime().

Can you try applying the patches below and see whether you still get the same issue?

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/arm/mach-omap2?id=971d0254480572bc6dc5574c28ef8fe014660a31

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/arm/mach-omap2?id=bfd6d021120d5994c4cc94d87ec03642be1540e7

NOTE: These patches are not present in the v3.2 kernel; they were merged into a later mainline kernel version.

Andrew_Bradford · April 2, 2013, 1:37pm

I am not using the internal RTC as an RTC for saving time when not
powered, no.

I do have the internal RTC code running in my kernel as I need to turn
the TPS65217 off using the ALARM at poweroff. Thus I'm not completely
clear on who exactly is responding to requests for the time within the
kernel.

I am running the PSP v3.2 kernel with some modifications.

The external i2c RTC on my boards comes up as rtc0. The internal
am335x RTC comes up as rtc1.

Thanks,
Andrew

CJNZ1 · April 5, 2013, 12:56am

Hi,

I am having a similar issue with the clock.
I can’t apply these patches as they are for 3.8 and I am using 3.2 - as per the other person in this thread.

It looks like it is a fairly major update to port these patches into 3.2 code?

Cheers,
CJ

Hiremath_Vaibhav · April 5, 2013, 5:10am

It’s not that difficult though :)

I have just backported both the patches to v3.2 kernel (merged into one) without any testing, see whether it works for you.

Thanks,

Vaibhav

0001-ARM-OMAP3-Implement-timer-workaround-for-errata-i103.patch (11.7 KB)

CJNZ1 · April 8, 2013, 1:16am

Hi - wow thanks for doing that!
I had to make a few changes to get it compiling but they were much simpler than when I was trying to do it.
Compiled and booted and initial tests look like the patch has solved the problem.
I will keep the forum posted if there are any issues that pop up in further testing.

CJNZ1 · April 8, 2013, 3:02am

Seems it is still dropping time.
I am seeing 1 second difference between the hardware clock and system clock after a couple of hours:

hwclock -u ; date

Mon Apr 8 14:59:45 2013 -0.388764 seconds
Mon Apr 8 14:59:44 2013

Hiremath_Vaibhav · April 15, 2013, 5:19am

Can you confirm that this is very consistent behavior? For example, does the difference always occur after a definite amount of time?

Also, can you try using the PLL 32K clock instead of the RTC 32K clock for the timer? You can do this by modifying the mach-omap2/timer.c file to change the parent clock.

Thanks,

Vaibhav

CJNZ1 · April 19, 2013, 2:05am

Hi Hiremath,

Yes, it starts to drift about 3 seconds a day getting progressively worse as time goes on.

Sorry to be a pain, but I have tried to find the place to change the system clock in timer.c and I can’t find where to make the change or what to change it to.
Is it something to do with this entry in timer.c?
/* Parent clocks, eventually these will come from the clock framework */
#define OMAP2_MPU_SOURCE "sys_ck"

Cheers,
CJ

Vaibhav_Bedia · April 21, 2013, 6:48am

I think we are mixing up two issues here. Time jumping wildly by 2^17 seconds is
something I would expect the backport of the timer patches to have fixed, and if
someone can confirm that part it would be great. If we still see the random jumps
we’ll need to dig deeper into the internal timer (not the RTC) details and look for other
bugs (s/w and/or Si).

Time drifting gradually could be due to the crystal that you are using for supplying
the said timer and not something related to the random time jumps that were reported.
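To compare the reported drift of about 3 seconds a day with a crystal's rated tolerance, it helps to express it in parts per million. The small helper below is a hypothetical illustration of that conversion; 3 s/day comes out at roughly 35 ppm.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_ppm(drift_seconds, elapsed_seconds=SECONDS_PER_DAY):
    """Convert an accumulated drift over an interval to parts per million."""
    return drift_seconds / elapsed_seconds * 1_000_000


print(round(drift_ppm(3), 1))  # 34.7
```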

That being said, in case you are using the internal RTC, there’s a mechanism to
compensate for drift. Have a look at the AM335x TRM for this. IIRC the BeagleBone
PMIC RTC also has the compensation feature. So, if that’s the RTC that you are using,
you need to check the PMIC datasheet.

Regards,
Vaibhav B.

Hiremath_Vaibhav · April 22, 2013, 5:06am

I believe, this is related to system timer and not the RTC.

Thanks,

Vaibhav

Vaibhav_Bedia · April 22, 2013, 11:17am

hwclock reads the RTC. And gradual drift between two timekeepers is different from random jumps that were initially reported.

Regards,
Vaibhav B.

Hiremath_Vaibhav · April 23, 2013, 5:59am

“# hwclock” indeed uses RTC for the time information but “# date” doesn’t, right?

The patch which I shared earlier fixed the system timer posted mode related Errata.

But in this particular scenario, I am not sure who is misbehaving:

hwclock -u ; date

Mon Apr 8 14:59:45 2013 -0.388764 seconds [RTC time info]

Mon Apr 8 14:59:44 2013 [system time info]

Thanks,

Vaibhav