dslreports logo
 
    All Forums Hot Topics Gallery
spc
Search similar:


uniqs
3793

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

Anyone experience NTP/clock issues today?

Hi folks,

Did anyone here witness any NTP-related problems today? Specifically, their clocks briefly rolling back to either sometime in the year 2000 (don't know day/month/time right now), or 1970 (presumably epoch time)?

Something happened today that's being discussed on outages@ which impacted *IX and Windows folks, but nobody at this point is totally sure what happened or how it managed to happen. I wouldn't say "tons of people", but definitely more than just a couple.

Relevant threads are ones that contain "NTP Issue" in their subject near the bottom:

»puck.nether.net/pipermai ··· ead.html
»puck.nether.net/pipermai ··· ead.html

I'm trying to get folks who have logs to actually provide some useful/granular dates/times to see if something lines up that might help developers or other engineers determine what the cause may have been.

Thanks everyone.
Bink
Villains... knock off all that evil
join:2006-05-14
Colorado

2 edits

Bink

Member

Nothing out of the ordinary for me today. From two boxes at home:
Nov 19 11:48:42 t42p ntpd[16039]: 2 out of 3 peers valid
Nov 19 11:48:42 t42p ntpd[16039]: bad peer from pool pool.ntp.org (66.178.0.75)
 
Nov 19 06:28:42 t41 ntpd[20835]: 2 out of 4 peers valid
Nov 19 06:28:42 t41 ntpd[20835]: bad peer from pool pool.ntp.org (208.75.88.4) 
Nov 19 06:28:42 t41 ntpd[20835]: bad peer from pool pool.ntp.org (178.18.16.124)
 
That said, if your ntpd decides to change your system’s clock to 12 years in the past, like in the URL you posted, I don’t care what NTP server you’re using, your ntpd is broken.

Steve
I know your IP address

join:2001-03-10
Tustin, CA

2 recommendations

Steve to koitsu

to koitsu
The Mayans ran a beta test.

chrisretusn
Retired
Premium Member
join:2007-08-13
Philippines

chrisretusn to koitsu

Premium Member

to koitsu
Nothing out of the ordinary in my logs.

leibold
MVM
join:2002-07-09
Sunnyvale, CA
Netgear CG3000DCR
ZyXEL P-663HN-51

1 edit

leibold to koitsu

MVM

to koitsu
No ntp related issues, however my primary clock source is a GPS receiver (it sounds like those people who were effected by the problem synced to the NISTUSNO servers ?).

Edit: corrected USNO vs. NIST. Both are official US timekeepers but the specific servers mentioned in the posts (tick and tock) are owned and operated by USNO.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

I'll expand on what transpired (since some list subscribers' mail clients are... well... don't get me started).

The very first post that brought this to my attention stated they were syncing exclusively to navy.mil NTP servers and saw their clocks go back to 2000: »puck.nether.net/pipermai ··· 834.html

I provided some semi-technical commentary here: »puck.nether.net/pipermai ··· 835.html

A follow-up from one person showed that their ntpd crashed entirely that same day and when re-syncing wanted to roll the clock back 12 years (which ntpd normally should not do; maximum deviation is 1000 seconds, unless you run ntpd with -g (please don't!): »puck.nether.net/pipermai ··· 838.html

Which lead me to wonder if maybe there is some internal high-resolution timer or frequency counter which rolled somewhere: »puck.nether.net/pipermai ··· 839.html

A follow-up from the original individual who forwarded outages@ some info indicated that someone exclusively using a hardware GPS (i.e. stratum 0) also saw systems on their network roll back to year 1970 (I have to assume this meant epoch time). Please be sure to also note the inline reply from Mark Andrews at ISC (he's talking about the same thing I did above, re: don't run ntpd with -g, and don't use ntpdate as your "way to sync clocks"): »puck.nether.net/pipermai ··· 837.html

Which prompted a response from someone using both Microsoft and NIST NTP servers stating their systems jumped back to year 2000. Reading between the lines, I have to assume this person's network was also Windows-based vs. *IX-based: »puck.nether.net/pipermai ··· 840.html

Which in turn prompted a response from someone recommending folks look at NIST's GPS archive -- however, their archive does not have data for 11/18, 11/19, or 11/20, which is damnably suspicious too: »puck.nether.net/pipermai ··· 841.html

So in summary (at this point anyway), it looked to me like if something did happen, it spanned multiple TCR types (IRIG/AFNOR and GPS), it affected people in different ways (some rolling back to epoch time, others rolling back to year 2000), and it was potentially spread across at least navy.mil, NIST, and Microsoft NTP servers (naturally I can't tell who was syncing with what without logs).

My own home server and my VPS box in southern California -- both which sync to very specific NTP servers across the United States (I don't use ntp.org's pool for example) showed no signs of anomalies in logs. Both run FreeBSD 9.1-PRERELEASE with ntpd 4.2.4p8 (part of the base system).

Hope this provides a bit more analysis of what transpired on the list and what prompted me to ask here. Thanks everyone for chiming in! I'll go with Steve See Profile's prognosis for now.

timcuth
Braves Fan
Premium Member
join:2000-09-18
Pelham, AL
Technicolor ET2251

timcuth to koitsu

Premium Member

to koitsu
Just for the record, my system seems to be fine, too:

tim@tim-pc:~$ ntpq -pn
remote refid st t when poll reach delay offset jitter
==============================================================================
+169.229.70.201 128.32.206.55 3 u 465 1024 377 95.929 -3.987 3.699
-108.61.73.243 209.51.161.238 2 u 456 1024 377 61.717 4.142 191.345
*2605:6f00:877:: 108.76.168.145 2 u 535 1024 377 87.684 -6.791 1.896
+69.64.72.238 192.12.19.20 2 u 1605 1024 356 67.798 -1.567 184.742
tim@tim-pc:~$ date
Tue Nov 20 07:50:27 CST 2012


I do use the ntp pool servers.

Tim

Exodus
Your Daddy
Premium Member
join:2001-11-26
Earth

Exodus to koitsu

Premium Member

to koitsu
None of our AIX, Solaris, SCO, Red Hat, SuSE environments had any NTP issues across the four major time zones in the US today.

koitsu
MVM
join:2002-07-16
Mountain View, CA

koitsu

MVM

Looking for details for November 19th specifically.

Exodus
Your Daddy
Premium Member
join:2001-11-26
Earth

Exodus

Premium Member

None at all since the time change. We had a problem with a bogus patch from IBM on the AIX servers but that was it.

leibold
MVM
join:2002-07-09
Sunnyvale, CA
Netgear CG3000DCR
ZyXEL P-663HN-51

leibold to koitsu

MVM

to koitsu
I checked a few NTP servers at work that sync to external clocks over the Internet (some to ISP provided servers or similar specific sources, others are on the NTP pool) but none of them logged anything between 11/18/2012 and now that would indicate a problem.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

Cool, thanks guys. I'm still hashing this out with folks on the mailing lists (at least one has been debunked as a false alarm, i.e. another example of an admin who doesn't understand the logs he's supposed to understand), but there are some examples of folks getting gigantic time rollbacks from NTP servers (trying to roll back 12 years) and ntpd, naturally, rejecting them.

leibold
MVM
join:2002-07-09
Sunnyvale, CA

leibold

MVM

Please keep us posted if you find out more information. As far as I know our servers will reject any update greater then 1200 seconds (20 minutes) but even at non-debug log levels I would expect such an event to appear in the log files.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

1 recommendation

koitsu

MVM

Lots of discussion going on on outages right now about this. The ISC has stated something official:

»isc.sans.edu/diary.html? ··· id=14548

And I have gotten one private Email from a UNIX SA who gave me logs for analysis, where his systems were doing ntpdate in a cronjob got impacted (while a couple others running ntpd did not). I scolded him big time for ntpdate-via-cron + explained a bit how ntpd worked and off he went to rectify that.
koitsu

koitsu

MVM

News sites picked this up today.

»news.slashdot.org/story/ ··· -to-2000

Most of the comments there aren't worth reading, but still, news is news.

And for Windows folks (yes I know this is an *IX forum, but we all know Windows folks), this goes over why this happened to you (please read it in full, do not skim it -- particularly the "How Did This Happen?" section)

»blogs.technet.com/b/askp ··· 000.aspx

All that said, guess what key piece of infrastructure crapped bricks over this and did the wrong thing? Apparently JunOS, or at least some releases. The mailing list hasn't picked up my response yet, but it looks like xntpd (which FreeBSD (what JunOS is based on) hasn't used since 2004)) restarted itself on JunOS when it saw the clock skew, restarting itself with effectively the -g flag, then allowed the massive clock skew through. I told the user to file a JTAC case because this is a flat out bug that should be high severity but not necessarily high priority.

Edit: sod it, I'll just include it here verbatim.

From: Jeremy Chadwick <jdc@koitsu.org>
To: "R. Benjamin Kessler" <Ben.Kessler@zenetra.com>
Date: Tue, 20 Nov 2012 13:41:02 -0800
Cc: Scott Voll <svoll.voip@gmail.com>, Sid Rao <srao@ctigroup.com>, outages <outages@outages.org>, "nanog@nanog.org"
        <nanog@nanog.org>
Subject: Re: [outages] NTP Issues Today
 
This is a Juniper bug of some kind.  It needs to be reported to them.
This is what transpired:
 
- xntpd detected the bogus NTP update from 172.20.167.251, wanting
  to roll back -378691200 seconds, and rejected it.
- xntpd restarted itself, or some related script/daemon restarted
  xntpd.  Ask Juniper why -- I don't know why it did this, and it
  shouldn't have
- Shortly after restarting, xntpd sync'd to 172.20.167.251 and
  ***did*** set the clock back -378691200.411331 seconds.
 
To me, this indicates the xntpd restart may have been restarted with the
"-g" flag (which would honour clock skew >1000s).  Why/how this happened
is something Juniper will need to explain.
 
Regardless, that is wrong behaviour, and is what caused your problem.
JunOS is based on FreeBSD (kernel and some base system bits -- and
it's very, VERY old FreeBSD), xntpd bring me back to the year of 1998,
no joke.  We haven't had xntpd in FreeBSD since December 2004.
 
Open a JTAC case and make this Juniper's problem.  I'm sorry you got
bit by this, but they need to address this.  I would classify this
as high severity, although priority level may be low since it's no
longer happening, but it's still a bug.
 
--
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
 
 

leibold
MVM
join:2002-07-09
Sunnyvale, CA

leibold

MVM

I agree that xntpd shouldn't have died / restarted but it is not unusual for an xntpd startup script to first issue an ntpdate command to set the local clock before starting xntpd itself to keep it synchronized.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

said by leibold:

I agree that xntpd shouldn't have died / restarted but it is not unusual for an xntpd startup script to first issue an ntpdate command to set the local clock before starting xntpd itself to keep it synchronized.

Except in this case, it wasn't ntpdate. For whatever reason the outages list isn't archiving a bunch of mails, so I'm just going to have to post them.

From: "R. Benjamin Kessler" <Ben.Kessler@zenetra.com>
To: Jeremy Chadwick <jdc@koitsu.org>, Scott Voll <svoll.voip@gmail.com>
Date: Tue, 20 Nov 2012 21:07:22 +0000
CC: Sid Rao <srao@ctigroup.com>, outages <outages@outages.org>, "nanog@nanog.org" <nanog@nanog.org>
Subject: RE: [outages] NTP Issues Today
 
Logs from a Juniper router in a customer network - we had hundreds of these affected.  They all synchronize to internal hosts
(172.20.167.251 and .252) which are configured to get time from  NIST and USNO
 
CORP-NTP-01#sh ntp as
 
      address         ref clock     st  when  poll reach  delay  offset    disp
*~192.5.41.41      .IRIG.            1   354   512  377    34.2    0.36     1.4
+~132.163.4.101    .ACTS.            1   336   512  377    35.0   -2.54    18.7
 ~127.127.7.1      127.127.7.1      10    59    64  377     0.0    0.00     0.0
 * master (synced), # master (unsynced), + selected, - candidate, ~ configured
 
CORP-NTP-02#sh ntp as
 
      address         ref clock     st  when  poll reach  delay  offset    disp
*~192.5.41.41      .IRIG.            1    65   512  377    36.5    0.91     0.6
+~132.163.4.101    .ACTS.            1    95   512  377    34.3   -1.31    22.8
 ~127.127.7.1      127.127.7.1      10    44    64  377     0.0    0.00     0.0
 * master (synced), # master (unsynced), + selected, - candidate, ~ configured
 
Here are the logs from one of the Junipers:
 
Nov 19 14:24:48  XXXX xntpd[912]: kernel time sync enabled 2001
Nov 19 15:50:11  XXXX xntpd[912]: synchronized to 172.20.167.252, stratum=2
Nov 19 16:41:23  XXXX xntpd[912]: no servers reachable
Nov 19 16:44:24  XXXX xntpd[912]: synchronized to 172.20.167.251, stratum=2
Nov 19 16:44:24  XXXX xntpd[912]: time correction of -378691200 seconds exceeds sanity limit (1000); set clock manually to the
correct UTC time.
Nov 19 16:44:24  XXXX init: ntp (PID 912) exited with status=255
Nov 19 16:44:24  XXXX init: ntp (PID 70200) started
Nov 19 16:44:24  XXXX xntpd[70200]: ntpd 4.2.0-a Sat Apr 10 00:32:46 UTC 2010 (1)
Nov 19 16:44:24  XXXX xntpd[70200]: mlockall(): Resource temporarily unavailable
Nov 19 16:44:24  XXXX xntpd[70200]: precision = 0.582 usec
Nov 19 16:44:24  XXXX xntpd[70200]: Listening on interface ggsn_vpn, 128.0.0.1#123
Nov 19 16:44:24  XXXX xntpd[70200]: kernel time sync status 2040
Nov 19 16:44:24  XXXX xntpd[70200]: frequency initialized -64.931 PPM from /var/db/ntp.drift
Nov 19 16:44:24  XXXX xntpd[70200]: Configuring iburst flag for server
Nov 19 16:44:24  XXXX xntpd[70200]: Configuring iburst flag for server
Nov 19 16:44:33  XXXX xntpd[70200]: synchronized to 172.20.167.251, stratum=2
Nov 19 16:44:32  XXXX xntpd[70200]: time reset -378691200.411331 s
Nov 19 16:44:32  XXXX xntpd[70200]: kernel time sync disabled 2041
Nov 19 16:45:44  XXXX xntpd[70200]: synchronized to 172.20.167.251, stratum=2
Nov 19 16:45:51  XXXX xntpd[70200]: kernel time sync enabled 2001
Nov 19 16:45:56  XXXX xntpd[70200]: NTP Server Unreachable
Nov 19 16:53:25  XXXX xntpd[70200]: no servers reachable
Nov 19 17:03:09  XXXX xntpd[70200]: NTP Server Unreachable
Nov 19 17:13:00  XXXX xntpd[70200]: NTP Server Unreachable
Nov 19 17:20:27  XXXX xntpd[70200]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:20:27  XXXX xntpd[70200]: time correction of 378691200 seconds exceeds sanity limit (1000); set clock manually to the
correct UTC time.
Nov 19 17:20:27  XXXX init: ntp (PID 70200) exited with status=255
Nov 19 17:20:27  XXXX init: ntp (PID 70766) started
Nov 19 17:20:27  XXXX xntpd[70766]: ntpd 4.2.0-a Sat Apr 10 00:32:46 UTC 2010 (1)
Nov 19 17:20:27  XXXX xntpd[70766]: mlockall(): Resource temporarily unavailable
Nov 19 17:20:27  XXXX xntpd[70766]: precision = 0.570 usec
Nov 19 17:20:27  XXXX xntpd[70766]: Listening on interface ggsn_vpn, 128.0.0.1#123
Nov 19 17:20:27  XXXX xntpd[70766]: kernel time sync status 2040
Nov 19 17:20:27  XXXX xntpd[70766]: frequency initialized -64.931 PPM from /var/db/ntp.drift
Nov 19 17:20:27  XXXX xntpd[70766]: Configuring iburst flag for server
Nov 19 17:20:27  XXXX xntpd[70766]: Configuring iburst flag for server
Nov 19 17:20:35  XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:20:36  XXXX xntpd[70766]: time reset +378691200.387434 s
Nov 19 17:20:36  XXXX xntpd[70766]: kernel time sync disabled 6041
Nov 19 17:21:48  XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:21:48  XXXX xntpd[70766]: kernel time sync disabled 2041
Nov 19 17:21:52  XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 00:02:29  XXXX xntpd[70766]: synchronized to 172.20.167.251, stratum=2
Nov 20 01:44:56  XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 02:19:03  XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 02:53:12  XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 03:44:26  XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 05:26:58  XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 05:44:02  XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 07:43:35  XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 08:00:39  XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 08:34:48  XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 08:51:54  XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 10:34:22  XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 20 11:25:16  XXXX xntpd[70766]: synchronized to 172.20.167.251, stratum=2
Nov 20 12:33:56  XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 20 14:16:05  XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 14:33:10  XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 15:07:19  XXXX xntpd[70766]: synchronized to 172.20.167.251, stratum=2
 

Look carefully what happened here:

Nov 19 17:20:27  XXXX xntpd[70200]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:20:27  XXXX xntpd[70200]: time correction of 378691200 seconds exceeds sanity limit (1000); set clock manually to the
correct UTC time.
Nov 19 17:20:27  XXXX init: ntp (PID 70200) exited with status=255
Nov 19 17:20:27  XXXX init: ntp (PID 70766) started
Nov 19 17:20:27  XXXX xntpd[70766]: ntpd 4.2.0-a Sat Apr 10 00:32:46 UTC 2010 (1)
...
Nov 19 17:20:35  XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:20:36  XXXX xntpd[70766]: time reset +378691200.387434 s
 

So it's as I said -- xntpd either restarted itself, or some JunOS daemon restarting bit did it (possibly a modified init(8)), and what it did was (effectively) restart xntpd with the -g flag, which permitted the >1000s skew, which is what hurt the above individual and their systems. (I won't get into the fact that they should probably have been syncing with much more than just 3 servers *ahem*)

leibold
MVM
join:2002-07-09
Sunnyvale, CA
Netgear CG3000DCR
ZyXEL P-663HN-51

leibold

MVM

said by koitsu:

Look carefully what happened here:

That was the second jump (12 years forward) when the ntp servers were providing the correct year again.

The log you posted also shows the initial jump (12 years back) when PID 912 terminated and PID 70200 starts up.

I very much doubt that configuring additional ntp servers would have prevented the problem. It is apparent from the log that in both jumps the local time was reset as soon as xntpd had synchronized with the first ntp server.

More important then quantity (in my opinion 3 sources is fine) is a smart selection of the time servers. Choosing multiple servers by the same provider in the same geographic location isn't really providing a reasonable level of redundancy. Ideally one clock source is internal to the organization so that time is maintained even when Internet connectivity is lost.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

You're right, I copy-pasted the wrong part. Doh!

Based on the logs, the following happened:

- xntpd detected an update from 172.20.167.251 that demanded the clock be updated by -378691200 seconds. This was ignored by ntpd (as it should be)
- xntpd either killed itself off and was restarted by init, or init killed xntpd and restarted it (possibly with the -g flag)
- xntpd starts up and syncs with 172.20.167.251, setting the clock back 12 years (due to use of -g flag, presumably)
- 40 minutes later...
- xntpd detected an update demanding the clock be updated by 378691200 seconds. This update was also ignored by ntpd (which is a little baffling because if xntpd was restarted from before with -g, it should have honoured this update)
- xntpd either killed itself off and was restarted by init, or init killed xntpd and restarted it
- xntpd starts up and syncs with 172.20.167.252, setting the clock forward 12 years

This behaviour still seems wrong. xntpd shouldn't have honoured either of those clock updates (rolling back 12 years, nor going forward 12 years), hence need for JTAC case.

leibold
MVM
join:2002-07-09
Sunnyvale, CA
Netgear CG3000DCR
ZyXEL P-663HN-51

1 recommendation

leibold

MVM

Agreed.

This particular version of xntpd appears to allow large jumps in time on the first synchronization after startup (probably intentionally eliminating the need for ntpdate to set local time on boot) but once synchronized it rejects updates that exceed +/- 1000 seconds.

The fact that it terminates after receiving an invalid timestamp however totally defeats the safeguard of rejecting them (since on restart the previously rejected timestamp is accepted).
Bink
Villains... knock off all that evil
join:2006-05-14
Colorado

1 recommendation

Bink

Member

Couldn’t agree with this more. Setting the clock upon SYSTEM startup is one thing—doing this upon daemon startup is another (and highly flawed).