12/04/2024 | Press release | Distributed by Public on 12/03/2024 19:48
In the complex software development environment we work in, there are countless layers of abstraction that we build upon. This is part of what makes development so productive today. Most of the time this is helpful: the average developer not needing to worry about CPU registers, page sizes, or TCP routes lets them focus on what makes their software unique and valuable. This is a good thing.
That said, when the abstractions we build upon don't meet our expectations, the impact can be significant. One such expectation is having an accurate clock on the machine. While we shouldn't expect all machines within a distributed system to have identical clocks (although some modern advancements are bringing this closer to reality), we do expect the clocks on our machines to be reasonably accurate. This is thanks to protocols such as the Network Time Protocol (NTP), one of the earliest protocols in computing.
How NTP works
NTP has gone through many revisions throughout its life, from its initial revision documented in 1985 to the revision current at the time of writing, Version 4, published in 2010. Work on NTP has not stopped either: discussions about a Version 5 have taken place, and other time synchronization protocols exist today as well. On top of that, there is the Simple Network Time Protocol (SNTP), a stateless simplification of the protocol that remains compatible with NTP servers. Whatever the version, the core concepts are the same, so let's cover them at a high level.
NTP (as well as SNTP, which is what this article focuses on) operates over UDP with the server listening on port 123. Because it uses UDP it doesn't handle retries or retransmissions automatically, nor does it need to. NTP is largely a stateless protocol on the client side (and completely stateless when using SNTP), and the servers need no state about the clients other than what is sent in the request. NTP timestamps are 64-bit fixed-point numbers counting seconds elapsed since 1/1/1900 0:00:00 UTC. The integer part is the first 32 bits and the fractional part is the latter 32 bits. The lowest-order fractional bit gives an increment of about 0.2 nanoseconds. When a timestamp is not available, such as right after startup, all the bits are set to 0 to indicate an invalid timestamp. In addition to the timestamp data bits, there are a couple of other data fields used in the protocol.
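As a rough sketch of the timestamp format described above (the constant and helper names here are my own, not from the protocol specification):

```python
import struct
import time

# Seconds between the NTP epoch (1 Jan 1900) and the Unix epoch (1 Jan 1970).
NTP_EPOCH_OFFSET = 2208988800

def unix_to_ntp(unix_seconds: float) -> bytes:
    """Pack a Unix timestamp into the 64-bit NTP fixed-point format."""
    ntp_seconds = unix_seconds + NTP_EPOCH_OFFSET
    integer_part = int(ntp_seconds) & 0xFFFFFFFF
    fraction_part = int((ntp_seconds - int(ntp_seconds)) * 2**32) & 0xFFFFFFFF
    return struct.pack("!II", integer_part, fraction_part)

def ntp_to_unix(raw: bytes) -> float:
    """Unpack a 64-bit NTP timestamp back into Unix seconds."""
    integer_part, fraction_part = struct.unpack("!II", raw)
    return integer_part - NTP_EPOCH_OFFSET + fraction_part / 2**32
```

For example, `unix_to_ntp(time.time())` would produce the eight bytes a client places in its Transmit Timestamp field.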
Leap Indicator
This is a two-bit field indicating whether an impending second will be added or removed to compensate for the mismatch between clocks and the Earth's rotation. The indicator can have the following values:
0: no warning
1: the last minute of the day has 61 seconds
2: the last minute of the day has 59 seconds
3: alarm condition (clock not synchronized)
Version Number
A 3-bit integer indicating the NTP version in use; the current standard is Version 4.
Mode
A 3-bit integer indicating the mode NTP is running in. The following modes are defined:
0: reserved
1: symmetric active
2: symmetric passive
3: client
4: server
5: broadcast
6: reserved for NTP control messages
7: reserved for private use
Stratum
An 8-bit integer indicating how many layers the responding server is removed from a primary time source. The specification defines the following strata:
0: kiss-o'-death message (discussed below)
1: primary reference (for example, synchronized to GPS or an atomic clock)
2-15: secondary reference (synchronized via NTP)
16-255: reserved
Poll
An 8-bit signed integer representing the maximum interval between successive messages in log2 seconds.
Precision
An 8-bit signed integer representing the precision of the clock in log2 seconds. For instance, a value of -18 corresponds to a precision of about one microsecond.
Reference ID
A 32-bit code identifying the particular server or reference clock, interpreted depending on the value of the Stratum field in the packet. For a stratum value of 0, this value is the kiss code for the packet - these will be discussed further below. For a stratum value of 1, this is a four-octet, left-justified, zero-padded ASCII string assigned to the reference clock. IANA maintains the official list of valid values here, but any value that starts with an "X" is reserved for unregistered experimentation. For stratum 2 and above (secondary servers and clients), this value is the reference identifier of the server from which it received its information.
Reference Timestamp
The time when the system clock was last set or corrected, in NTP timestamp format.
Origin Timestamp
Time at the client when the request departed for the server, in NTP timestamp format.
Receive Timestamp
Time at the server when the request arrived from the client, in NTP timestamp format.
Transmit Timestamp
Time at the server when the response was sent to the client, in NTP timestamp format.
Note
There is no Destination Timestamp field in the header; that value is captured and stored by the client upon receipt of the response at the earliest available moment.
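Putting the fields above together, the fixed 48-byte SNTP header can be unpacked along these lines (a minimal sketch; the function and dictionary key names are my own):

```python
import struct

def parse_sntp_response(packet: bytes) -> dict:
    """Unpack the fixed 48-byte SNTP header into its fields."""
    if len(packet) < 48:
        raise ValueError("SNTP packet must be at least 48 bytes")
    (first_byte, stratum, poll, precision,
     root_delay, root_dispersion, reference_id,
     ref_ts, origin_ts, receive_ts, transmit_ts) = struct.unpack(
        "!BBbb3I4Q", packet[:48])
    return {
        "leap_indicator": first_byte >> 6,     # top 2 bits
        "version": (first_byte >> 3) & 0b111,  # next 3 bits
        "mode": first_byte & 0b111,            # bottom 3 bits
        "stratum": stratum,
        "poll": poll,
        "precision": precision,
        "reference_id": reference_id,
        "receive_timestamp": receive_ts,
        "transmit_timestamp": transmit_ts,
    }
```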
Kiss-o'-Death packets
When the Stratum field is 0, that indicates an error condition, and the Reference ID field is used to convey the reason for the kiss-o'-death (KoD) packet; these values are called kiss codes. Kiss codes can provide useful information to an intelligent client so it can take the appropriate action. The codes are encoded as four-character, left-justified ASCII strings. There are various kiss codes and a full list of them can be found in the specification, but some particularly useful kiss codes are the following:
DENY, RSTR: the client must stop sending to this server; access is denied
RATE: the client has exceeded the server's rate limit and must reduce its polling rate
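A client that respects KoD packets might check for them along these lines (a sketch; the kiss codes are real codes from the specification, but the handling policy and function names are my own illustration):

```python
import struct

def decode_kiss_code(stratum: int, reference_id: int):
    """If the packet is a kiss-o'-death (stratum 0), return its 4-char kiss code."""
    if stratum != 0:
        return None
    return struct.pack("!I", reference_id).decode("ascii", errors="replace")

def handle_response(stratum: int, reference_id: int) -> str:
    kiss = decode_kiss_code(stratum, reference_id)
    if kiss is None:
        return "ok"          # normal response; safe to use the timestamps
    if kiss in ("DENY", "RSTR"):
        return "stop"        # access denied: stop querying this server
    if kiss == "RATE":
        return "back off"    # rate limited: increase the poll interval
    return "discard"         # unknown kiss code: discard the response
```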
Walkthrough
Figure 1 shows a simple example of the flow of data in the protocol. As you can see, not all fields are populated right off the bat; instead, each step fills in more information until, at the end, the client has all the data it needs. The four timestamps collected (t1, when the request left the client; t2, when it arrived at the server; t3, when the response left the server; and t4, when it arrived back at the client) are then used to compute the offset of the client's clock from the server's using the following formula:
offset = ((t2 - t1) + (t3 - t4)) / 2
This formula and the size of the data elements mean that the client must have an initial time set within 34 years of the time server for this algorithm to work.
The roundtrip delay can also be calculated using the following formula:
delay = (t4 - t1) - (t3 - t2)
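As a sketch in code, the standard SNTP offset and roundtrip-delay calculations over the four timestamps (t1 client transmit, t2 server receive, t3 server transmit, t4 client receive, all in seconds) are:

```python
def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """Offset of the client clock from the server's clock."""
    return ((t2 - t1) + (t3 - t4)) / 2

def roundtrip_delay(t1: float, t2: float, t3: float, t4: float) -> float:
    """Network round-trip delay, excluding server processing time."""
    return (t4 - t1) - (t3 - t2)
```

For example, if the client sends at t1 = 100, the server receives at t2 = 110 and replies at t3 = 111, and the reply arrives at t4 = 104, the offset is 8.5 seconds and the delay is 3 seconds.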
The incident
That was a lot of background to cover this incident, but even setting the incident aside, it is useful to know the basics of how NTP works. The software in question is a fork of a project created in the early 2000s that had a built-in NTP client implementation. The default time server used was a large, global virtual cluster of time servers that is open to the public. Having an NTP client built into an application like this in 2024 is odd, as we now have great time synchronization systems built into our operating systems. But since it had never been a problem, no one worried too much about it. It 'just worked', so why worry about it? That reasoning was not correct, and we should have applied our understanding that "every line of code is a liability" here instead.
The NTP client
The NTP client that was written was the simplest SNTP client you could write. It ignored much of the specification and implemented only the happy-path workflow. As is often the case, the happy path was the most common case, and we went years, if not over a decade, without anyone detecting issues with its non-happy-path processing or noticing if and when it hit one of those cases.
The non-happy path
One part of the specification not implemented was checking the value of the Stratum field when receiving a response. As noted above, a value of 0 indicates that the response should be discarded and the Reference ID field should be consulted for more information. Instead, the implementation would simply process the returned values as if they had been a valid response.
The incident begins
The specifics of what went wrong aren't important, but suffice it to say that many metrics within the application started reporting wild values. We quickly whittled the problem down to an invalid date being returned when the application was asked for the current time. Instead of reporting the correct time, it would report a time shortly after 1/1/1900. This was an issue, and we quickly grew suspicious of the custom NTP implementation, though the exact problem remained unclear. We reviewed the code carefully but couldn't see a problem with the implementation (we did not know much about the NTP protocol, so we were unaware of all the missing cases that should have been handled).
We initially thought the time service may have been hacked, as navigating to it in our web browsers would occasionally return a Rick Roll. We quickly determined this was not the case by consulting various developer communities and confirming it was not a common issue. Besides, NTP operates over port 123 using UDP, while browsers use ports 80/443 with TCP.
Still unsure what could be happening, and especially puzzled that we were being reset so neatly to the NTP epoch, we kept digging. We did determine that if the NTP server was inaccessible, the code would fall back to using the machine's own system time. So we took over responding to DNS requests for the default time server and answered with an unroutable IP address. This stopped the immediate bleeding.
While handling the immediate problem, one member of the team extracted the custom NTP client code from the project and modified it to continuously poll the different NTP servers routable behind the default pool (remember, this is a virtual cluster that anyone can host a server in, so there were over 4,500 different servers that could be responding). He would then output any server that returned confusing values with huge offsets.
Hours after starting this process of exhaustively testing each possible backend server, we began to get responses reporting huge offsets. But why? We then repeatedly queried that exact server with a command-line NTP client to gain more information and examined what came back.
Seeing this information, we now learned about kiss-o'-death packets. Sure enough, as detailed above, the Stratum was 0, the Reference ID (kiss code) was RATE, and t2 and t3 were not given values, which, upon further research, we learned is an additional way to communicate to clients that the packet should not be trusted.
Because the Stratum and Reference ID fields were never considered, the client simply used the Receive and Transmit Timestamps as though they were valid, which had the effect of taking the system time back to the beginning of the NTP timestamp space (as NTP timestamps start at 1/1/1900).
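Numerically, this is what the happy-path client was doing (a sketch with made-up example values; the offset formula is the standard one from earlier in the article):

```python
# With the KoD packet's Receive/Transmit timestamps (t2, t3) zeroed, the offset
# formula yields roughly minus the client's current NTP-era time, dragging the
# clock back to 1 Jan 1900 when applied.

def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    return ((t2 - t1) + (t3 - t4)) / 2

t1 = t4 = 3_900_000_000.0  # client's actual time, in seconds since 1900
t2 = t3 = 0.0              # zeroed timestamps from the KoD packet
offset = clock_offset(t1, t2, t3, t4)
# offset == -3_900_000_000.0: applying it lands the clock at the NTP epoch
```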
With this knowledge in hand, we understood what the problem was and how to fix it (and that our temporary fix would indeed prevent the issue from recurring). We removed the NTP client implementation from the project and made it always use the system time. With this change we simplified our code and made it more robust, so it was a win-win, even though it was painful to get there.
Why did this start happening all of a sudden?
We found limited evidence suggesting this issue had occurred in a few isolated cases in the past. However, when the incident arose this time, it was happening widely, prompting the reasonable question: why now? Unfortunately, we don't know the answer. Perhaps a new group of servers was brought online in the pool that couldn't handle the load they were given, so they started responding with rate-limit errors. Maybe some new or existing time servers decided to chaos test everyone using them, to make sure clients could handle legitimate non-happy-path responses (ours couldn't). We were still uncertain, but we discovered that our group wasn't the only one affected. Over time, more and more reports emerged of this issue occurring with the open source project.
I am proud of the team I worked with and that we were able to detect, diagnose, and recover from the issue before many in the community had even discovered it was a problem. And, we were able to offer a warning as well as a suggested way forward to the community at large.
Lessons learned
It is easy to take for granted the products and technologies you build your solutions on top of. That is OK most of the time - if we had to reimplement the whole stack from top to bottom each time we took on a project, we would never get anywhere. Even so, understanding how the system works and where it can break is always worth the time, in my opinion. We relearned that every line of code is a liability. If you believe some code isn't adding value, remove it. At best, it's harmless but useless; at worst, it's actively causing harm.
This incident was a valuable opportunity to apply various debugging techniques. As not all issues can be resolved with the same approach, having multiple methods at your disposal is incredibly useful.
Additional resources
To further test this issue I wrote an extremely simple NTP server that always responds with a rate limit response. The code for that NTP server is hosted here.
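For illustration only, a toy server along those same lines (my own sketch, not the author's hosted code) could build a RATE response like this:

```python
import socket
import struct

def build_rate_kod(request: bytes) -> bytes:
    """Build a kiss-o'-death response carrying the RATE kiss code."""
    # leap=3 (clock unsynchronized), version=4, mode=4 (server)
    first_byte = (3 << 6) | (4 << 3) | 4
    origin = struct.unpack("!Q", request[40:48])[0]  # echo client's transmit time
    return struct.pack(
        "!BBbb3I4Q",
        first_byte,
        0,                                # stratum 0 marks a KoD packet
        0, 0,                             # poll, precision
        0, 0,                             # root delay, root dispersion
        struct.unpack("!I", b"RATE")[0],  # Reference ID carries the kiss code
        0,                                # reference timestamp: unset
        origin,                           # origin timestamp
        0, 0,                             # receive/transmit left zeroed
    )

def serve(host: str = "127.0.0.1", port: int = 123) -> None:
    """Answer every SNTP request with the RATE KoD (toy server, sketch only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        request, addr = sock.recvfrom(512)
        if len(request) >= 48:
            sock.sendto(build_rate_kod(request), addr)
```

Pointing a client at such a server is a quick way to verify that it discards the response rather than swallowing the zeroed timestamps.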
Kyle Carter is a Principal Software Engineer who works on a quality management system and has a passion for software and architecture. He shares his experiences to help others in the industry while continuing to learn himself, valuing the constant learning inherent in software development.
Originally published at Scaled Code.
The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.