I’ve recommended using round-trip latency measurements wherever possible. The primary rationale is that this avoids any need for clock synchronization because both time stamps are taken on the same clock. But this does leave open the possibility that the one clock used may drift during the latency measurement interval.
The bottom line is that there’s no need to worry because the measurement intervals we’re talking about are so small that even a terrible clock wouldn’t accumulate much error. Below I’ll cover enough background to work out the error in a 100 µs measurement, then talk about gettimeofday() vs. TSC.
The scenario here is that you’re taking two readings on the same clock, separated by some interval which is the real round-trip time of the message. Your clock may not be a perfect clock, so if it tells you that 100 microseconds elapsed, how do you know that it wasn’t actually 110 µs or maybe 90 µs that went by?
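Here’s a minimal sketch of that scenario in Python. The names round_trip_us and send_and_wait are made up for illustration, and time.time_ns is only my stand-in for gettimeofday(). The point is just that both readings come from the same clock.

```python
import time


def round_trip_us(send_and_wait, clock_ns=time.time_ns):
    """Return the round-trip time of one message, in microseconds.

    Both time stamps are taken on the same clock, so no synchronization
    with any other machine is involved.
    """
    start_ns = clock_ns()   # first reading, just before the message goes out
    send_and_wait()         # hypothetical callable: send, then block until the reply arrives
    end_ns = clock_ns()     # second reading, on the same clock
    return (end_ns - start_ns) / 1000.0
```

You could try it with a stand-in for the message exchange, such as round_trip_us(lambda: time.sleep(0.0001)). The number that comes back is what the clock says elapsed, which is exactly the reading whose accuracy is in question here.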
This is a question of the frequency accuracy of your clock. It’s asking if your clock is running fast or running slow. It’s not asking if the clock is set to the proper time of day. That would be asking if your clock were ahead or behind. It’s not asking if your clock reads the same as some master clock (that’s a question of its time accuracy or its synchronization).
Consider an experiment where you read your clock now and then again one day from now. You should see that 86400 seconds have elapsed. If your clock is running fast, it might say that 86410 seconds had elapsed. If it’s running slow, it might say that only 86390 seconds had elapsed.
So how do you know the frequency accuracy of your clock? You have to compare it to a master clock. This is exactly what you’re doing in the experiment above. Somehow, you determined when one day was up, presumably by consulting some clock known to be more accurate than the one you’re trying to test.
This is exactly how NTP works. (I’m going to play a bit fast and loose with acronyms here. For the sake of readability, I’ll just say “NTP” even when it would be more correct to say “ntpd.” That’s the name of the daemon that implements the NTP protocol and does all the work.)
NTP’s job is to synchronize clocks. NTP could discover that your clock was running slow by 10 seconds per day and then just reset it once per day at midnight, but that’d be pretty heavy handed. It could also reset your clock by 1 second 10 times per day. But it’d make a lot more sense if NTP just sped up your clock so that it wasn’t always falling behind the master. Working with the kernel, this is exactly what NTP does. It corrects for any long-term frequency error in your clock. What we have is an idealized software clock that’s running on top of an imperfect hardware clock that’s being software corrected (or “disciplined” in the lingo of clock synchronization).
If you’re curious, you can ask NTP how much long-term correction it’s applying with the command
ntpq -c "rv 0 frequency"
Most Linux systems and Macs will have the ntpq command installed by default. You should see something that looks like this
status=06f4 leap_none, sync_ntp, 15 events, event_peer/strat_chg, frequency=14.748
The frequency error of a clock is commonly measured in parts per million or PPM. It could just be measured in percent as well, but clock frequency errors tend to be so small that the zeros get in the way. 1 PPM is the same as 0.0001% but a lot easier to say and write.
The 10 second per day error I’ve been talking about above works out to about 116 PPM. The output above is saying this machine’s clock is running 14.748 PPM fast over the long term. 12 PPM is about 1 second per day. So this machine would gain a little over one second per day if it weren’t running NTP. With NTP running, the kernel applies a correction for the long-term error that typically brings the frequency error down far below 1 PPM. More on that later.
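If you’d rather not do the PPM arithmetic by hand, a few lines of Python will do it. This is just a sketch of the conversions used above, and the function names are my own invention.

```python
SECONDS_PER_DAY = 86400


def ppm_from_daily_drift(drift_seconds_per_day):
    """Convert a drift in seconds per day into a frequency error in PPM."""
    return drift_seconds_per_day / SECONDS_PER_DAY * 1_000_000


def daily_drift_from_ppm(ppm):
    """Convert a frequency error in PPM into a drift in seconds per day."""
    return ppm / 1_000_000 * SECONDS_PER_DAY


print(round(ppm_from_daily_drift(10), 1))      # the 10 seconds per day example, about 116 PPM
print(round(ppm_from_daily_drift(1), 1))       # 1 second per day, about 12 PPM
print(round(daily_drift_from_ppm(14.748), 2))  # the machine above, a little over 1 second per day
```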
For arcane but sound technical reasons, NTP will not even attempt to synchronize clocks that run too fast or too slow. Imagine a clock that ran at 2x speed. If you only checked it once per day (and didn’t pay any attention to the date), the time would always be correct. So NTP will happily discipline the clock on the system shown above, but will refuse if the measured frequency error of the underlying hardware clock is more than 500 PPM.
So one thing we know right away is that if NTP is happily disciplining the clock on a machine, then the frequency error in its hardware time of day clock must be < 500 PPM. A frequency error of 500 PPM works out to 43 seconds per day.
NTP would be pretty cool if all it did was measure the long-term frequency error of your clock and conspire with the kernel to correct it. Even if you lost your connection to a master clock, your error would typically be below 1 PPM as long as the temperature stayed constant to within 1 degree Celsius. But as long as you have a connection to a master clock, NTP is also working to compensate for any short-term drift caused by temperature or other factors. NTP is constantly measuring the short-term frequency error of your clock and calls this the stability. You can check it with the command
ntpq -c "rv 0 stability"
You’ll see output that looks like this
status=0674 leap_none, sync_ntp, 7 events, event_peer/strat_chg, stability=0.001
That’s saying the short term frequency accuracy of this machine is 0.001 PPM or 1 part per billion. That’s damn accurate. And note where it says sync_ntp. That means this machine is only synchronized using the NTP protocol over the wire as most machines would be. Here’s the output from one of my GPS disciplined machines.
status=21f4 leap_none, sync_atomic/PPS, 15 events, event_peer/strat_chg, stability=0.000
The sync_atomic/PPS tells you that it’s getting a local GPS signal instead of just depending on NTP over the wire. The short-term frequency error is as close to zero as can reliably be measured.
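If you want to collect these numbers from a script, something like the following sketch could pull them out of the text. It assumes the output looks like the samples shown above, and the helper names read_ntpq_var and sync_source are hypothetical, not part of any NTP tooling.

```python
import re


def read_ntpq_var(output, name):
    """Return the numeric value of `name` (e.g. "frequency"), or None if absent."""
    pattern = r"\b" + re.escape(name) + r"=([-+]?\d+(?:\.\d+)?)"
    match = re.search(pattern, output)
    return float(match.group(1)) if match else None


def sync_source(output):
    """Return the sync_* token from the status line (e.g. "sync_ntp"), or None."""
    match = re.search(r"\bsync_[\w/]+", output)
    return match.group(0) if match else None


sample = ("status=06f4 leap_none, sync_ntp, 15 events, "
          "event_peer/strat_chg, frequency=14.748")
print(read_ntpq_var(sample, "frequency"), sync_source(sample))
```

To feed it live data, you could capture the text with subprocess.run(["ntpq", "-c", "rv 0 stability"], capture_output=True, text=True).stdout, assuming ntpq is installed and ntpd is running.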
There are some good comments on the quality of PC clocks over in the NTP FAQ.
With all this background, we can compute the error in a round-trip latency measurement. You can take the frequency error of your clock in PPM divided by one million to get a multiplier for the error over an interval. So a clock that’s off by 500 PPM would drift by 500/1000000 = 0.0005 seconds per second. That’s 500 µs per second. Over a 100 microsecond (0.0001 second) interval, it’d drift by 500 µs/s × 0.0001 s = 0.05 µs or 50 nanoseconds. And remember, that’s for a clock that’s losing or gaining 43 seconds per day. You’d notice it if it were that fast or slow. Even a cheap wrist watch will do far better than this.
If your clock’s stability is down to 1 PPM, which is far more common, then your error for a 100 µs test is 0.0001 µs or 0.1 ns or 100 picoseconds.
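The same calculation as a small Python sketch, in case you want to plug in your own numbers. The function name is hypothetical. It treats the PPM figure as a worst-case frequency error over the whole interval.

```python
def interval_error_ns(interval_us, frequency_error_ppm):
    """Timing error, in nanoseconds, accumulated over one measurement interval.

    interval_us * ppm / 1,000,000 gives the error in microseconds;
    multiplying by 1,000 to get nanoseconds leaves a net division by 1,000.
    """
    return interval_us * frequency_error_ppm / 1000.0


print(interval_error_ns(100, 500))  # the 500 PPM clock over 100 µs: about 50 ns
print(interval_error_ns(100, 1))    # a 1 PPM clock over 100 µs: about 0.1 ns
```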
I encourage you to run the commands above on your own machines running NTP to find out your stability. Then do the math and figure out the error. I suspect it’ll be solidly sub-microsecond, probably in the low nanoseconds. That’s plenty good enough for latency measurements in our work.
All of the above discussion assumes you’re looking at the results of gettimeofday() disciplined by NTP. It’s really a software clock driven by hardware, with any known errors removed by NTP. I’m sometimes asked about measuring time intervals based on the TSC (the CPU’s time stamp counter) rather than the real time clock. TSC is strictly a hardware clock with no opportunity for NTP to correct frequency errors. If a machine leaves the factory with an error of 499 PPM in the clock that drives TSC, it may move around a few PPM based on temperature, but it’s highly unlikely to get to zero. Still, it’s reasonable to assume the error won’t be more than 500 PPM, and that’s only 50 ns of error on a 100 µs measurement interval. So you’re probably OK using TSC even though it’s not corrected for frequency error against a master clock.
TSC is harder to use than software clocks like gettimeofday() because it counts CPU cycles, not seconds. Before you can measure time with it, you first have to measure how many cycles go by over a known span of time to develop a conversion factor between CPU cycles run and time passed. This amounts to measuring the actual CPU clock frequency. You also have to watch out for processors that don’t sync TSC across cores and other gory details. What TSC offers in return is sub-microsecond resolution, which becomes more important as we move to measuring latencies ever closer to 1 microsecond.
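To make the conversion-factor idea concrete, here’s a rough sketch. Python can’t read the TSC directly, so read_ticks is a hypothetical callable standing in for whatever returns the raw counter on your platform, and the other names are made up too. This shows the general shape of the calibration, not how any particular product does it.

```python
import time


def calibrate_ticks_per_second(read_ticks, read_ref_ns=time.monotonic_ns,
                               interval_s=0.1, sleep=time.sleep):
    """Estimate how many counter ticks elapse per second of reference time.

    read_ticks is a hypothetical function returning the raw counter value.
    """
    ticks_start, ref_start_ns = read_ticks(), read_ref_ns()
    sleep(interval_s)  # let a known amount of reference time pass
    ticks_end, ref_end_ns = read_ticks(), read_ref_ns()
    elapsed_s = (ref_end_ns - ref_start_ns) / 1e9
    return (ticks_end - ticks_start) / elapsed_s


def ticks_to_us(tick_delta, ticks_per_second):
    """Convert a difference between two counter readings into microseconds."""
    return tick_delta / ticks_per_second * 1e6
```

A longer calibration interval makes the start and end readings matter less, at the cost of a slower startup. Any frequency error in the reference clock used for calibration carries straight into the conversion factor.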
Any digital measurement has a quantization error of +/- 1 count. So gettimeofday() with its microsecond resolution has a quantization error of 2% over a 50 µs interval even if the frequency of the clock is perfect. You can see the trouble with trying to measure 10 µs intervals with gettimeofday() where quantization error alone accounts for a 10% error. So the pain of using TSC becomes more tolerable as the interval goes down if you want to keep the accuracy up.
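The quantization numbers above are easy to reproduce. This sketch (the function name is mine) takes one count of clock resolution as a percentage of the interval being measured.

```python
def quantization_error_percent(interval_us, resolution_us=1.0):
    """Error from one count of clock resolution, as a percentage of the interval."""
    return 100.0 * resolution_us / interval_us


print(quantization_error_percent(50))  # microsecond clock, 50 µs interval: 2%
print(quantization_error_percent(10))  # microsecond clock, 10 µs interval: 10%
```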
29West provides example code for latency measurement that just uses gettimeofday() to keep the examples simple. But we also use TSC internally in our products on platforms that support it.