timecounter.ms revision 116411
1139749Simp.EQ
2113584Ssimokawadelim ��
3113584Ssimokawa.EN
4113584Ssimokawa.\"
5113584Ssimokawa.\" ----------------------------------------------------------------------------
6113584Ssimokawa.\" "THE BEER-WARE LICENSE" (Revision 42):
7113584Ssimokawa.\" <phk@login.dknet.dk> wrote this file.  As long as you retain this notice you
8113584Ssimokawa.\" can do whatever you want with this stuff. If we meet some day, and you think
9113584Ssimokawa.\" this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
10113584Ssimokawa.\" ----------------------------------------------------------------------------
11113584Ssimokawa.\"
12113584Ssimokawa.\" $FreeBSD: head/share/doc/papers/timecounter/timecounter.ms 116411 2003-06-15 18:49:46Z phk $
13113584Ssimokawa.\"
14113584Ssimokawa.if n .ND
15113584Ssimokawa.TI
16113584SsimokawaTimecounters: Efficient and precise timekeeping in SMP kernels.
17113584Ssimokawa.AA
18113584Ssimokawa.A "Poul-Henning Kamp" "The FreeBSD Project"
19113584Ssimokawa.AB
20113584Ssimokawa.PP
21113584SsimokawaThe FreeBSD timecounters are an architecture-independent implementation
22113584Ssimokawaof a binary timescale using whatever hardware support is at hand  
23113584Ssimokawafor tracking time.  The binary timescale converts using simple
24113584Ssimokawamultiplication to canonical timescales based on micro- or nano-seconds
25113584Ssimokawaand can interface seamlessly to the NTP PLL/FLL facilities for clock
26113584Ssimokawasynchronisation.  Timecounters are implemented using lock-less
27113584Ssimokawastable-storage based primitives which scale efficiently in SMP   
28113584Ssimokawasystems.  The math and implementation behind timecounters will 
29113584Ssimokawabe detailed as well as the mechanisms used for synchronisation. \**
30113584Ssimokawa.AE
31113584Ssimokawa.FS
32113584SsimokawaThis paper was presented at the EuroBSDcon 2002 conference in Amsterdam.
33113584Ssimokawa.FE
34113584Ssimokawa.SH
35119418SobrienIntroduction
36127468Ssimokawa.PP
37119418SobrienDespite digging around for it, I have not been able to positively
38119418Sobrienidentify the first computer which knew the time of day.
39127468SsimokawaThe feature probably arrived either from the commercial side
40127468Ssimokawaso service centres could bill computer cycles to customers or from
41113584Ssimokawathe technical side so computers could timestamp external events,
42113584Ssimokawabut I have not been able to conclusively nail the first implementation down.
43113584Ssimokawa.LP
44113584SsimokawaBut there is no doubt that it happened very early in the development
45113584Ssimokawaof computers
46127468Ssimokawaand if systems like the ``SAGE'' [SAGE] did not know what time
47117126Sscottlit was I would be amazed.
48117126Sscottl.LP
49117732SsimokawaOn the other hand, it took a long time for a real time clock to
50113584Ssimokawabecome a standard feature:
51113584Ssimokawa.LP
52113584SsimokawaThe ``Apple ]['' computer
53113584Ssimokawahad neither in hardware or software any notion what time it was.
54127468Ssimokawa.LP
55127468SsimokawaThe original ``IBM PC'' did know what time it was, provided you typed
56127468Ssimokawait in when you booted it, but it forgot when you turned it off.
57127468Ssimokawa.LP
58127468SsimokawaOne of the ``advanced technologies'' in the ``IBM PC/AT'' was a battery
59113584Ssimokawabacked CMOS chip which kept track of time even when the computer
60113584Ssimokawawas powered off.
61113584Ssimokawa.LP
62127468SsimokawaToday we expect our computers to know the time, and with network
63113584Ssimokawaprotocols like NTP we will usually find that they do, give and
64113584Ssimokawatake some milliseconds.
65113584Ssimokawa.LP
66113584SsimokawaThis article is about the code in the FreeBSD kernel which keeps
67113584Ssimokawatrack of time.
68113584Ssimokawa.SH
69113584SsimokawaTime and timescale basics
70113584Ssimokawa.PP
71113584SsimokawaDespite the fact that time is the physical quantity (or maybe entity
72113584Ssimokawa?) about which we know the least, it is at the same time [sic!] what we
73113584Ssimokawacan measure with the highest precision of all physical quantities.
74113584Ssimokawa.LP
75113584SsimokawaThe current crop of atomic clocks will neither gain nor loose a
76113584Ssimokawasecond in the next couple hundred million years, provided we
77113584Ssimokawastick to the preventative maintenance schedules.  This is a feat
78113584Ssimokawaroughly in line with to knowing the circumference of the Earth
79113584Ssimokawawith one micrometer precision, in real time.
80113584Ssimokawa.LP
81113584SsimokawaWhile it is possible to measure time by means other than oscillations,
82113584Ssimokawafor instance transport or consumption of a substance at a well-known
83113584Ssimokawarate, such designs play no practical role in time measurement because
84113584Ssimokawatheir performance is significantly inferior to oscillation based
85113584Ssimokawadesigns.
86113584Ssimokawa.LP
87113584SsimokawaIn other words, it is pretty fair to say that all relevant
88113584Ssimokawatimekeeping is based on oscillating phenomena:
89113584Ssimokawa.TS
90113584Ssimokawacenter;
91113584Ssimokawal l.
92117126Sscottlsun-dial	Earths rotation about its axis.
93127468Ssimokawacalendar	Ditto + Earths orbit around the sun.
94117228Ssimokawaclockwork	Mechanical oscillation of pendulum.
95170374Ssimokawacrystals	Mechanical resonance in quartz.
96117228Ssimokawaatomic	Quantum-state transitions in atoms.
97117228Ssimokawa.TE
98113584Ssimokawa.LP
99113584SsimokawaWe can therefore with good fidelity define ``a clock'' to be the
100113584Ssimokawacombination of an oscillator and a counting mechanism:
101113584Ssimokawa.LP
102113584Ssimokawa.PSPIC fig3.eps
103113584Ssimokawa.LP
104113584SsimokawaThe standard second is currently defined as
105113584Ssimokawa.QP
106113584SsimokawaThe duration of  9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium 133 atom.
107168776Spjd.LP
108113584Ssimokawaand we have frequency standards which are able to mark a sequence of 
109113584Ssimokawasuch seconds
110113584Ssimokawawith an error less than �2 cdot 10 sup{-15}� [DMK2001] with commercially 
111113584Ssimokawaavailable products doing better than �1 cdot 10 sup{-14}� [AG2002].
112113584Ssimokawa.LP
113113584SsimokawaUnlike other physical units with a conventionally defined origo,
114113584Ssimokawalongitude for instance, the ephemeral nature of time prevents us
115113584Ssimokawafrom putting a stake in the ground, so to speak, and measure from
116113584Ssimokawathere.  For measuring time we have to rely on ``dead reckoning'',
117113584Ssimokawajust like the navigators before Harrison built his clocks [RGO2002]:
118113584SsimokawaWe have to tally how far we went from our reference point, keeping a
119113584Ssimokawarunning total at all times, and use that as our estimated position.
120113584Ssimokawa.LP
121113584SsimokawaThe upshot of this is, that we cannot define a timescale by any
122113584Ssimokawaother means than some other timescale(s).
123113584Ssimokawa.LP
124113584Ssimokawa``Relative time'' is a time interval between two events, and for this
125113584Ssimokawawe only need to agree on the rate of the oscillator.
126113584Ssimokawa.LP
127113584Ssimokawa``Absolute time'' consists of a well defined point in time and the
128113584Ssimokawatime interval since then, this is a bit more tricky.
129113584Ssimokawa.LP
130113584SsimokawaThe Internationally agreed upon TAI and the UTC timescales 
131113584Ssimokawastarts at (from a physics point of view) arbitrary points in time
132113584Ssimokawaand progresses in integral intervals of the standard second, with the
133113584Ssimokawadifference being that UTC does tricks to the counting to stay roughly
134113584Ssimokawain sync with Earths rotation \**.
135113584Ssimokawa.FS
136113584SsimokawaThe first atomic based definition actually operated in a different way:
137113584Ssimokawaeach year would have its own value determined for the frequency of the
138113584Ssimokawacaesium resonance, selected so as to match the revolution rate of the
139113584SsimokawaEarth.  This resulted in time-intervals being very unwieldy business,
140113584Ssimokawaand more and more scientists realized that that the caesium resonance
141113584Ssimokawawas many times more stable than the angular momentum of the Earth.
142113584SsimokawaEventually the new leap-second method were introduced in 1972.
143113584SsimokawaIt is interesting to note that the autumn leaves falling on the
144113584Ssimokawanorthern hemisphere affects the angular momentum enough to change 
145113584Ssimokawathe Earths rotational rate measurably.
146113584Ssimokawa.FE
147113584Ssimokawa.LP
148113584SsimokawaTAI is defined as a sequence of standard seconds (the first timescale),
149113584Ssimokawacounted from January 1st 1958 (the second timescale).
150113584Ssimokawa.LP
151113584SsimokawaUTC is defined basically the same way, but every so often a leap-second
152113584Ssimokawais inserted (or theoretically deleted) to keep UTC synchronised
153113584Ssimokawawith Earths rotation.
154113584Ssimokawa.LP
155113584SsimokawaBoth the implementation of these two, and a few others speciality 
156113584Ssimokawatimescales are the result of the
157113584Ssimokawacombined efforts of several hundred atomic frequency standards in
158113584Ssimokawavarious laboratories and institutions throughout the world, all
159113584Ssimokawareporting to the BIPM in Paris who calculate the ``paper clock'' which
160113584SsimokawaTAI and UTC really are using a carefully designed weighting algorithm \**. 
161113584Ssimokawa.FS
162113584SsimokawaThe majority of these clocks are model 5071A from Agilent (the test
163113584Ssimokawaand measurement company formerly known as ``Hewlett-Packard'') which
164113584Ssimokawacount for as much as 85% of the combined weight.
165113584SsimokawaA fact the company deservedly is proud of.
166113584SsimokawaThe majority of the remaining weight is assigned to a handful of big
167113584Ssimokawacustom-design units like the PTB2 and NIST7.
168113584Ssimokawa.FE
169113584Ssimokawa.LP
170113584SsimokawaLeap seconds are typically announced six to nine months in advance,
171113584Ssimokawabased on precise observations of median transit times of stars and VLBI
172113584Ssimokawaradio astronomy of very distant quasars.
173113584Ssimokawa.LP
174113584SsimokawaThe perceived wisdom of leap-seconds have been gradually decreasing
175113584Ssimokawain recent years, as devices and products with built-in calendar
176113584Ssimokawafunctionality becomes more and more common and people realize that
177113584Ssimokawauser input or software upgrades are necessary to instruct the
178113584Ssimokawacalendar functionality about upcoming leap seconds.
179113584Ssimokawa.SH
180113584SsimokawaUNIX timescales
181113584Ssimokawa.PP
182113584SsimokawaUNIX systems use a timescale which pretends to be UTC, but defined
183113584Ssimokawaas the count of standard seconds since 00:00:00 01-01-1970 UTC,
184113584Ssimokawaignoring the leap-seconds.  This definition has never been perceived
185113584Ssimokawaas wise.
186113584Ssimokawa.LP
187113584SsimokawaIgnoring leap seconds means that unless some trickery is performed
188113584Ssimokawawhen a leap second happens on the UTC scale, UNIX clocks would be
189113584Ssimokawaone second off.  Another implication is that the length of a
190117126Sscottltime interval calculated on UNIX time_t variables, can be up to 22
191127468Ssimokawa(and counting) seconds wrong relative to the same time interval
192117126Sscottlmeasured on the UTC timescale.
193170374Ssimokawa.LP
194117228SsimokawaRecent efforts have tried to make the NTP protocol make up for this
195117228Ssimokawadeficiency by transmitting the UTC-TAI offset as part of the protocol.
196113584Ssimokawa[MILLS2000A]
197113584Ssimokawa.LP
198113584SsimokawaFractional seconds are represented two ways in UNIX, ``timeval'' and
199113584Ssimokawa``timespec''.  Both of these formats are two-component structures
200113584Ssimokawawhich record the number of seconds, and the number of microseconds
201113584Ssimokawaor nanoseconds respectively.
202127468Ssimokawa.LP
203113584SsimokawaThis unfortunate definition makes arithmetic on these two formats
204113584Ssimokawaquite expensive to perform in terms of computer instructions:
205113584Ssimokawa.DS
206113584Ssimokawa.ps -1
207113584Ssimokawa/* Subtract timeval from timespec */
208113584Ssimokawat3.tv_sec = t1.tv_sec - t2.tv_sec;
209113584Ssimokawat3.tv_nsec = t1.tv_nsec -
210113584Ssimokawa             t2.tv_usec * 1000;
211113584Ssimokawaif (t3.tv_nsec >= 1000000000) {
212113584Ssimokawa    t3.tv_sec++;
213113584Ssimokawa    t3.tv_nsec -= 1000000000;
214113584Ssimokawa} else if (t3.tv_nsec < 0) {
215113584Ssimokawa    t3.tv_sec--;
216113584Ssimokawa    t3.tv_nsec += 1000000000;
217113584Ssimokawa}
218113584Ssimokawa.ps +1
219113584Ssimokawa.DE
220113584Ssimokawa.LP
221113584SsimokawaWhile nanoseconds will probably be enough for most timestamping
222113584Ssimokawatasks faced by UNIX computers for a number of years, it is an
223113584Ssimokawaincreasingly uncomfortable situation that CPU clock periods and
224113584Ssimokawainstruction timings are already not representable in the standard
225113584Ssimokawatime formats available on UNIX for consumer grade hardware,
226113584Ssimokawaand the first POSIX mandated API, \fCclock_getres(3)\fP has 
227113584Ssimokawaalready effectively reached end of life as a result of this.
228113584Ssimokawa.LP
229113584SsimokawaHopefully the various standards bodies will address this issue
230113584Ssimokawabetter in the future.
231113584Ssimokawa.SH
232113584SsimokawaPrecision, Stability and Resolution
233113584Ssimokawa.PP
234Three very important terms in timekeeping are ``precision'', 
235``stability'' and ``resolution''.
236While the three words may seem to describe somewhat the
237same property in most uses, their use in timekeeping covers three
238very distinct and well defined properties of a clock.
239.LP
240Resolution in clocks is simply a matter of the step-size of the
241counter or in other words: the rate at which it steps.
242A counter running on a 1 MHz frequency will have a resolution
243of 1 microsecond.
244.LP
245Precision talks about how close to the intended rate the clock runs,
246stability about how much the rate varies and resolution about the
247size of the smallest timeinterval we can measure.
248.LP
249From a quality point of view, Stability is a much more
250valuable property than precision, this is probably best explained
251using a graphic illustration of the difference between the two
252concepts:
253.LP
254.PSPIC fig1.eps
255.LP
256In the top row we have instability, the bullet holes are spread over
257a large fraction of the target area.
258In the bottom row, the bullets all hit in a very small area.
259.LP
260On the left side, we have lack of precision, the holes obviously are
261not centred on the target, a systematic offset exists.
262In the right side we have precision, the bullets are centred on
263the target \**.
264.FS
265We cannot easily get resolution into this analogy, the obvious
266representation as the diameter of the bullet-hole is not correct,
267it would have to be the grid or other pattern of locations where
268the bullet could possibly penetrate the target material, but this
269gets too quantum-mechanical-oid to serve the instructional purpose.
270.FE
271.LP
272Transposing these four targets to actual clocks, the situation
273could look like the following plots:
274.LP
275.PSPIC fig2.eps
276.LP
277On the x-axis we have time and on the y-axis how wrong the clock
278was at a given point in time.
279.LP
280The reason atomic standards are such a big deal in timekeeping is
281that they are incredibly stable: they are able to generate an oscillation
282where the period varies by roughly a millionth of a billonth of a
283second in long term measurements.
284.LP
285They are in fact not nearly as precise as they are stable, but as
286one can see from the graphic above, a stable clock which is not
287precise can be easily corrected for the offset and thus calibrated
288is as good as any clock.
289.LP
290This lack of precision is not necessarily a flaw in these kinds of
291devices, once you get into the �10 cdot 10 sup{-15}� territory
292things like the blackbody spectrum at the particular absolute 
293temperature of the clocks hardware and general relativistic
294effects mostly dependent on the altitude above earths center
295has to be corrected for \**. 
296.FS
297This particularly becomes an issue with space-based atomic standards
298as those found on the ``Navstar'' GPS satellites.
299.FE
300.SH
301Design goals of timecounters
302.PP
303After this brief description of the major features of the local
304landscape, we can look at the design goals of timecounters in detail:
305.LP
306.I "Provide timestamps in timeval and timespec formats,"
307.IP
308This is obviously the basic task we have to solve, but as was noted
309earlier, this is in no way the performance requirement.
310.LP
311.I "on both the ``uptime'' and the POSIX timescales,"
312.IP
313The ``uptime'' timescale is convenient for time intervals which are
314not anchored in UTC time: the run time of processes, the access
315time of disks and similar.
316.IP
317The uptime timescale counts seconds starting from when the system
318is booted.  The POSIX/UTC timescale is implemented by adding an
319estimate of the POSIX time when the system booted to the uptime
320timescale.
321.LP
322.I "using whatever hardware we have available at the time,"
323.IP
324Which in a subtle way also implies ``be able to switch from one
325piece of hardware to another on the fly'' since we may not know
326right up front what hardware we have access to and which is
327preferable to use.
328.LP
329.I "while supporting time the NTP PLL/FLL discipline code,"
330.IP
331The NTP kernel PLL/FLL code allows the local clock and timescale
332to be synchronised or syntonised to an external timescale either
333via network packets or hardware connection.  This also implies
334that the rate and phase of the timescale must be manoeuvrable
335with sufficient resolution.
336.LP
337.I "and providing support for the RFC 2783 PPS API,"
338.IP
339This is mainly for the benefit of the NTPD daemons communication
340with external clock or frequency hardware, but it has many other
341interesting uses as well [PHK2001].
342.LP
343.I "in a SMP efficient way."
344.IP
345Timestamps are used many places in the kernel and often at pretty
346high rate so it is important that the timekeeping facility
347does not become a point of CPU or lock contention.
348.SH
349Timecounter timestamp format.
350.PP
351Choosing the fundamental timestamp format for the timecounters is
352mostly a question of the resolution and steer-ability requirements.
353.LP
354There are two basic options on contemporary hardware: use a 32 bit
355integer for the fractional part of seconds, or use a 64 bit which
356is computationally more expensive.
357.LP
358The question therefore reduced to the somewhat simpler: can we get
359away with using only 32 bit ?
360.LP
361Since 32 bits fractional seconds have a resolution of slightly
362better than quarter of a nanosecond (.2328 nsec) it can obviously
363be converted to nanosecond resolution struct timespec timestamps
364with no loss of precision, but unfortunately not with pure 32 bit
365arithmetic as that would result in unacceptable rounding errors.
366.LP
367But timecounters also need to represent the clock period of the
368chosen hardware and this hardware might be the GHz range CPU-clock.
369The list of clock frequencies we could support with 32 bits are:
370.TS
371center;
372l l n l.
373�2 sup{32} / 1�	�=�	4.294	GHz
374�2 sup{32} / 2�	�=�	2.147	GHz
375�2 sup{32} / 3�	�=�	1.432	GHz
376\&...
377�2 sup{32} / (2 sup{32}-1)�	�=�	1.000	Hz
378.TE
379We can immediately see that 32 bit is insufficient to faithfully
380represent clock frequencies even in the low GHz area, much less in
381the range of frequencies which have already been vapourwared by
382both IBM, Intel and AMD.
383QED: 32 bit fractions are not enough.
384.LP
385With 64 bit fractions the same table looks like:
386.TS
387center;
388l l r l.
389�2 sup{64} / 1�	�=�	� 18.45 cdot 10 sup{9}�	GHz
390�2 sup{64} / 2�	�=�	� 9.223 cdot 10 sup{9}�	GHz
391\&...
392�2 sup{64} / 2 sup{32}�	�=�	4.294	GHz
393\&...
394�2 sup{64} / (2 sup{64}-1)�	�=�	1.000	Hz
395.TE
396And the resolution in the 4 GHz frequency range is approximately one Hz.
397.LP
398The following format have therefore been chosen as the basic format
399for timecounters operations:
400.DS
401.ps -1
402struct bintime {
403    time_t  sec;
404    uint64_t frac;
405};
406.ps +1
407.DE
408Notice that the format will adapt to any size of time_t variable,
409keeping timecounters safely out of the ``We SHALL prepare for the
410Y2.038K problem'' war zone.
411.LP
412One beauty of the bintime format, compared to the timeval and
413timespec formats is that it is a binary number, not a pseudo-decimal
414number.  If compilers and standards allowed, the representation
415would have been ``int128_t'' or at least ``int96_t'', but since this
416is currently not possible, we have to express the simple concept
417of multiword addition in the C language which has no concept of a
418``carry bit''.
419.LP
420To add two bintime values, the code therefore looks like this \**:
421.FS
422If the reader suspects the '>' is a typo, further study is suggested.
423.FE
424.LP
425.DS
426.ps -1
427uint64_t u;
428
429u = bt1->frac;
430bt3->frac = bt1->frac + bt2->frac;
431bt3->sec  = bt1->sec  + bt2->sec;
432if (u > bt3->frac)
433    bt3->sec += 1;
434.ps +1
435.DE
436.LP
437An important property of the bintime format is that it can be
438converted to and from timeval and timespec formats with simple
439multiplication and shift operations as shown in these two
440actual code fragments:
441.DS
442.ps -1
443void
444bintime2timespec(struct bintime *bt,
445                 struct timespec *ts)
446{
447
448    ts->tv_sec = bt->sec;
449    ts->tv_nsec = 
450      ((uint64_t)1000000000 * 
451      (uint32_t)(bt->frac >> 32)) >> 32;
452}
453.ps +1
454.DE
455.DS
456.ps -1
457void
458timespec2bintime(struct timespec *ts,
459                 struct bintime *bt)
460{
461
462    bt->sec = ts->tv_sec;
463    /* 18446744073 = 
464      int(2^64 / 1000000000) */
465    bt->frac = ts->tv_nsec * 
466      (uint64_t)18446744073LL; 
467}
468.ps +1
469.DE
470.LP
471.SH
472How timecounters work
473.PP
474To produce a current timestamp the timecounter code
475reads the hardware counter, subtracts a reference
476count to find the number of steps the counter has
477progressed since the reference timestamp.
478This number of steps is multiplied with a factor
479derived from the counters frequency, taking into account
480any corrections from the NTP PLL/FLL and this product
481is added to the reference timestamp to get a timestamp.
482.LP
483This timestamp is on the ``uptime'' time scale, so if
484UNIX/UTC time is requested, the estimated time of boot is
485added to the timestamp and finally it is scaled to the
486timeval or timespec if that is the desired format.
487.LP
488A fairly large number of functions are provided to produce
489timestamps, depending on the desired timescale and output
490format:
491.TS
492center;
493l r r.
494Desired	uptime	UTC/POSIX
495Format	timescale	timescale
496_
497bintime	binuptime()	bintime()
498timespec	nanouptime()	nanotime()
499timeval	microuptime()	microtime()
500.TE
501.LP
502Some applications need to timestamp events, but are not
503particular picky about the precision.
504In many cases a precision of tenths or hundreds of 
505seconds is sufficient.
506.LP
507A very typical case is UNIX file timestamps:
508There is little point in spending computational resources getting an
509exact nanosecond timestamp, when the data is written to
510a mechanical device which has several milliseconds of unpredictable
511delay before the operation is completed.
512.LP
513Therefore a complementary shadow family of timestamping functions 
514with the prefix ``get'' have been added.
515.LP
516These functions return the reference
517timestamp from the current timehands structure without going to the
518hardware to determine how much time has elapsed since then.
519These timestamps are known to be correct to within rate at which
520the periodic update runs, which in practice means 1 to 10 milliseconds.
521.SH
522Timecounter math
523.LP
524The delta-count operation is straightforward subtraction, but we
525need to logically AND the result with a bit-mask with the same number
526(or less) bits as the counter implements,
527to prevent higher order bits from getting set when the counter rolls over:
528.DS 
529.ce 
530.EQ
531Delta Count = (Count sub{now} - Count sub{ref}) ~ BITAND ~ mask
532.EN
533.DE
534The scaling step is straightforward.
535.DS
536.ce 
537.EQ
538T sub{now} = Delta Count cdot R sub{counter} + T sub{ref}
539.EN
540.DE
541The scaling factor �R sub{counter}� will be described below.
542.LP
543At regular intervals, scheduled by \fChardclock()\fP, a housekeeping
544routine is run which does the following:
545.LP
546A timestamp with associated hardware counter reading is elevated
547to be the new reference timecount:
548.DS
549
550.ce
551.EQ
552Delta Count = (Count sub{now} - Count sub{ref}) ~ BITAND ~ mask
553.EN
554
555.ce
556.EQ
557T sub{now} = Delta Count cdot R sub{counter}
558.EN
559
560.ce
561.EQ
562Count sub{ref} = Count sub{now}
563.EN
564
565.ce
566.EQ
567T sub{ref} = T sub{now}
568.EN
569.DE
570.LP
571If a new second has started, the NTP processing routines are called
572and the correction they return and the counters frequency is used
573to calculate the new scaling factor �R sub{counter}�:
574.DS
575.ce
576.EQ
577R sub{counter} = {2 sup{64} over Freq sub{counter}} cdot ( 1 + R sub{NTP} )
578.EN
579.DE
580Since we only have access to 64 bit arithmetic, dividing something
581into �2 sup{64}� is a problem, so in the name of code clarity
582and efficiency, we sacrifice the low order bit and instead calculate:
583.DS
584.ce
585.EQ
586R sub{counter} = 2 cdot {2 sup{63} over Freq sub{counter}} cdot ( 1 + R sub{NTP} )
587.EN
588.DE
589The �R sub{NTP}� correct factor arrives as the signed number of
590nanoseconds (with 32 bit binary fractions) to adjust per second.
591This quasi-decimal number is a bit of a square peg in our round binary
592hole, and a conversion factor is needed.
593Ideally we want to multiply this factor by:
594.DS
595.ce
596.EQ
5972 sup {64} over {10 sup{9} cdot 2 sup{32}} = 4.294967296
598.EN
599.DE
600This is not a nice number to work with.
601Fortunately, the precision of this correction is not critical, we are
602within an factor of a million of the �10 sup{-15}� performance level
603of state of the art atomic clocks, so we can use an approximation
604on this term without anybody noticing.
605.LP
606Deciding which fraction to use as approximation needs to carefully
607consider any possible overflows that could happen.
608In this case the correction may be as large as \(+- 5000 PPM which
609leaves us room to multiply with about 850 in a multiply-before-divide
610setting.
611Unfortunately, there are no good fractions which multiply with less
612than 850 and at the same time divide by a power of two, which is
613desirable since it can be implemented as a binary shift instead of
614an expensive full division.
615.LP
616A divide-before-multiply approximation necessarily results in a loss
617of lower order bits, but in this case dividing by 512 and multiplying
618by 2199 gives a good approximation where the lower order bit loss is
619not a concern:
620.DE
621.EQ
6222199 over 512 = 4.294921875
623.EN
624.DE
625The resulting error is an systematic under compensation of 10.6PPM 
626of the requested change, or �1.06 cdot 10 sup -14� per nanosecond
627of correction.
628This is perfectly acceptable.
629.LP
630Putting it all together, including the one bit we put on the alter for the
631Goddess of code clarity, the formula looks like this:
632.DS
633.ce
634.EQ
635R sub{counter} = 2 cdot {{2 sup{63} + 2199 cdot {R sub{NTP}} over 1024} over Freq sub{counter}}
636.EN
637.DE
638Presented here in slightly unorthodox format to show the component arithmetic
639operations as they are carried out in the code.
640.SH
641Frequency of the periodic update
642.PP
643The hardware counter should have a long enough
644period, ie, number of distinct counter values divided by
645frequency, to not roll over before our periodic update function
646has had a chance to update the reference timestamp data.
647.LP
648The periodic update function is called from \fChardclock()\fP which
649runs at a rate which is controlled by the kernel parameter
650.I HZ .
651.LP
652By default HZ is 100 which means that only hardware with a period
653longer than 10 msec is usable.
654If HZ is configured higher than 1000, an internal divider is
655activated to keep the timecounter periodic update running 
656no more often than 2000 times per second.
657.LP
658Let us take an example:
659At HZ=100 a 16 bit counter can run no faster than:
660.DS
661.ce
662.EQ
6632 sup{16} cdot {100 Hz} = 6.5536 MHz
664.EN
665.DE
666Similarly, if the counter runs at 10MHz, the minimum HZ is
667.DS
668.ce
669.EQ
670{10 MHz} over {2 sup{16}} = 152.6 Hz
671.EN
672.DE
673.LP
674Some amount of margin is of course always advisable,
675and a factor two is considered prudent.
676.LP
677.SH
678Locking, lack of ...
679.PP
680Provided our hardware can be read atomically, that our arithmetic
681has enough bits to not roll over and that our clock frequency is
682perfectly, or at least sufficiently, stable, we could avoid the
683periodic update function, and consequently disregard the entire
684issue of locking.
685We are seldom that lucky in practice.
686.LP
687The straightforward way of dealing with meta data updates is to
688put a lock of some kind on the data and grab hold of that before
689doing anything.
690This would however be a very heavy-handed approach.  First of
691all, the updates are infrequent compared to simple references,
692second it is not important which particular state of meta data
693a consumer gets hold of, as long as it is consistent: as long
694as the �Count sub{ref}� and �T sub{ref}� are a matching pair,
695and not old enough to cause an ambiguity with hardware counter
696rollover, a valid timestamp can be derived from them.
697.LP
698A pseudo-stable-storage with generation count method has been
699chosen instead.
700A ring of ten ``timehands'' data structures are used to hold the
701state of the timecounter system, the periodic update function
702updates the next structure with the new reference data and
703scaling factor and makes it the current timehands.
704.LP
705The beauty of this arrangement lies in the fact that even though
706a particular ``timehands'' data structure has been bumped from being
707the ``currents state'' by its successor, it still contains valid data
708for some amount of time into the future.
709.LP
710Therefore, a process which has started the timestamping process but
711suffered an interrupt which resulted in the above periodic processing
712can continue unaware of this afterwards and not suffer corruption
713or miscalculation even though it holds no locks on the shared
714meta-data.
715.PSPIC fig4.eps
716.LP
717This scheme has an inherent risk that a process may be de-scheduled for
718so long time that it will not manage to complete the timestamping
719process before the entire ring of timehands have been recycled.
720This case is covered by each timehand having a private generation number
721which is temporarily set to zero during the periodic processing, to
722mark inconsistent data, and incremented to one more than the
723previous value when the update has finished and the timehands
724is again consistent.
725.LP
726The timestamping code will grab a copy of this generation number and
727compare this copy to the generation in the timehands after completion
728and if they differ it will restart the timestamping calculation.
729.DS
730.ps -1
731do {
732    th = timehands;
733    gen = th->th_generation;
734    /* calculate timestamp */
735} while (gen == 0 ||
736   gen != th->th_generation);
737.ps +1
738.DE
739.LP
740Each hardware device supporting timecounting is represented by a
741small data structure called a timecounter, which documents the
742frequency, the number of bits implemented by the counter and a method
743function to read the counter.
744.LP
745Part of the state in the timehands structure is a pointer to the
746relevant timecounter structure, this makes it possible to change
747to a one piece of hardware to another ``on the fly'' by updating
748the current timehands pointer in a manner similar to the periodic
749update function. 
750.LP
751In practice this can be done with sysctl(8):
752.DS
753.ps -1
754sysctl kern.timecounter.hardware=TSC
755.ps +1
756.DE
757.LP
758at any time while the system is running.
759.SH
760Suitable hardware
761.PP
762A closer look on ``suitable hardware'' is warranted
763at this point.
764It is obvious from the above description that the ideal hardware
765for timecounting is a wide binary counter running at a constant
766high frequency
767and atomically readable by all CPUs in the system with a fast
768instruction(-sequence).
769.LP
770When looking at the hardware support on the PC platform, one
771is somewhat tempted to sigh deeply and mutter ``so much for theory'',
772because none of the above parameters seems to have been on the
773drawing board together yet.
774.LP
775All IBM PC derivatives contain a device more or less compatible
776with the venerable Intel i8254 chip.
777This device contains 3 counters of 16 bits each,
778one of which is wired so it can interrupt the CPU when the 
779programmable terminal count is reached.
780.LP
781The problem with this device is that it only has 8bit bus-width,
782so reading a 16 bit timestamp takes 3 I/O operations: one to latch
783the count in an internal register, and two to read the high and
784low parts of that register respectively.
785.LP
786Obviously, on multi-CPU systems this cannot be done without some
787kind of locking mechanism preventing the other CPUs from trying
788to do the same thing at the same time.
789.LP
790Less obviously we find it is even worse than that:
791Since a low priority kernel thread
792might be reading a timestamp when an interrupt comes in, and since
793the interrupt thread might also attempt to generate a timestamp,
794we need to totally block interrupts out while doing those three
795I/O instructions.
796.LP
797And just to make life even more complicated, FreeBSD uses the same
798counter to provide the periodic interrupts which schedule the
799\fChardclock()\fP routine, so in addition the code has to deal with the
800fact that the counter does not count down from a power of two and
801that an interrupt is generated right after the reloading of the
802counter when it reaches zero.
803.LP
804Ohh, and did I mention that the interrupt rate for hardclock() will
805be set to a higher frequency if profiling is active ? \**
806.FS
807I will not even mention the fact that it can be set also to ridiculous
808high frequencies in order to be able to use the binary driven ``beep''
809speaker in the PC in a PCM fashion to output ``real sounds''.
810.FE
811.LP
812It hopefully doesn't ever get more complicated than that, but it
813shows, in its own bizarre and twisted way, just how little help the
814timecounter code needs from the actual hardware.
815.LP
816The next kind of hardware support to materialise was the ``CPU clock
817counter'' called ``TSC'' in official data-sheets.
818This is basically a on-CPU counter, which counts at the rate 
819of the CPU clock.
820.LP
821Unfortunately, the electrical power needed to run a CPU is pretty
822precisely proportional with the clock frequency for the
823prevailing CMOS chip technology, so
824the advent of computers powered by batteries prompted technologies
825like APM, ACPI, SpeedStep and others which varies or throttles the
826CPU clock to match computing demand in order to minimise the power
827consumption \**.
828.FS
829This technology also found ways into stationary computers from
830two different vectors.
831The first vector was technical: Cheaper cooling solutions can be used
832for the CPU if they are employed resulting in cheaper commodity
833hardware.
834The second vector was political: For reasons beyond reason, energy
835conservation became an issue with personal computers, despite the fact
836that practically all north American households contains 4 to 5 household
837items which through inefficient designs waste more power than a 
838personal computer use.
839.FE
840.LP
841Another wiggle for the TSC is that it is not usable on multi-CPU
842systems because the counter is implemented inside the CPU and
843not readable from other CPUs in the system.
844.LP
845The counters on different CPUs are not guaranteed
846to run syntonously (ie: show the same count at the same time).
847For some architectures like the DEC/alpha architecture they do not even
848run synchronously (ie: at the same rate) because the CPU clock frequency
849is generated by a small SAW device on the chip which is very sensitive
850to temperature changes.
851.LP
852The ACPI specification finally brings some light:
853it postulates the existence of a 24 or 32 bit
854counter running at a standardised constant frequency and
855specifically notes that this is intended to be used for timekeeping.
856.LP
857The frequency chosen, 3.5795454... MHz\**
858.FS
859The reason for this odd-ball frequency has to be sought in the ghastly
860colours offered by the original IBM PC Color Graphics Adapter:  It
861delivered NTSC format output and therefore introduced the NTSC colour
862sync frequency into personal computers.
863.FE
864 is not quite as high as one
865could have wished for, but it is certainly a big improvement over
866the i8254 hardware in terms of access path.
867.LP
868But trust it to Murphys Law: The majority of implementations so far
869have failed to provide latching suitable to avoid meta-stability
870problems, and several readings from the counter is necessary to
871get a reliable timestamp.
872In difference from the i8254 mentioned above, we do not need to
873any locking while doing so, since each individual read is atomic.
874.LP
875An initialization routine tries to test if the ACPI counter is properly
876latched by examining the width of a histogram over read delta-values.
877.LP
878Other architectures are similarly equipped with means for timekeeping,
879but generally more carefully thought out compared to the haphazard
880developments of the IBM PC architecture.
881.LP
882One final important wiggle of all this, is that it may not be possible
883to determine which piece of hardware is best suited for clock
884use until well into or even after the bootstrap process.
885.LP
886One example of this is the Loran-C receiver designed by Prof. Dave Mills
887[MILLS1992]
888which is unsuitable as timecounter until the daemon program which
889implements the software-half of the receiver has properly initialised
890and locked onto a Loran-C signal.
891.SH
892Ideal timecounter hardware
893.LP
894As proof of concept, a sort of an existentialist protest against
895the sorry state describe above, the author undertook a project to
896prove that it is possible to do better than that, since none of
897the standard hardware offered a way to fully validate the timecounter
898design.
899.LP
900Using a COTS product, ``HOT1'', from Virtual Computers Corporation
901[VCC2002] containing a FPGA chip on a PCI form factor card, a 26
902bit timecounter running at 100MHz was successfully implemented.
903.LP
904.PSPIC fig5.eps
905.LP
906.LP
907In order to show that timestamping does not necessarily have to
908be done using unpredictable and uncalibratable interrupts, an
909array of latches were implemented as well, which allow up to 10
910external signals to latch the reading of the counter when
911an external PPS signal transitions from logic high to logic
912low or vice versa.
913.LP
914Using this setup, a standard 133 MHz Pentium based PC is able to
915timestamp the PPS output of the Motorola UT+ GPS receiver with
916a precision of \(+- 10 nanoseconds \(+- one count which in practice
917averages out to roughly \(+- 15 nanoseconds\**:
918.FS
919The reason the plot does not show a very distinct 10 nanosecond
920quantization is that the GPS receiver produces the PPS signal from
921a clock with a roughly 55 nanosecond period and then predicts in
922the serial data stream how many nanoseconds this will be offset
923from the ideal time.
924This plot shows the timestamps corrected for this ``negative
925sawtooth correction''.
926.FE
927.LP
928.PSPIC gps.ps
929.LP
930It shold be noted that the author is no hardware wizard and
931a number of issues in the implementation results in less than
932ideal noise performance.
933.LP
934Now compare this to ``ideal'' timecounter to the normal setup
935where the PPS signal is used
936to trigger an interrupt via the DCD pin on a serial port, and
937the interrupt handler calls \fCnanotime()\fP to timestamp
938the external event \**:
939.FS
940In both cases, the computers clock frequency controlled
941with a Rubidium Frequency standard.
942The average quality of crystals used for computers would
943totally obscure the curves due to their temperature coefficient.
944.FE
945.LP
946.PSPIC intr.ps
947.LP
948It is painfully obvious that the interrupt latency is the
949dominant noise factor in PPS timestamping in the second case.
950The asymetric distribution of the noise in the second plot
951also more or less entirely invalidates the design assumption
952in the NTP PLL/FLL kernel code that timestamps are dominated
953by gaussian noise with few spikes.
954.SH
955Status and availability
956.PP
957The timecounter code has been developed and used in FreeBSD
958for a number of years and has now reached maturity.
959The source-code is located almost entirely in the kernel source file
960kern_tc.c, with a few necessary adaptations in code which
961interfaces to it, primarily the NTP PLL/FLL code.
962.LP
963The code runs on all FreeBSD platforms including i386, alpha,
964PC98, sparc64, ia64 and s/390 and contains no wordsize or
965endianess issues not specifically handled in the sourcecode.
966.LP
967The timecounter implementation is distributed under the ``BSD''
968open source license or the even more free ``Beer-ware'' license.
969.LP
970While the ability to accurately model and compensate for
971inaccuracies typical of atomic frequency standards are not
972catering to the larger userbase, but this ability and precision
973of the code guarntees solid support for the widespread deployment
974of NTP as a time synchronization protocol, without rounding
975or accumulative errors.
976.LP
977Adding support for new hardware and platforms have been
978done several times by other developers without any input from the
979author, so this particular aspect of timecounters design
980seems to work very well.
981.SH
982Future work
983.PP
984At this point in time, no specific plans exist for further
985development of the timecounters code.
986.LP
987Various micro-optimizations, mostly to compensate for inadequate
988compiler optimization could be contemplated, but the author
989resists these on the basis that they significantly decrease
990the readability of the source code.
991.SH
992Acknowledgements
993.PP
994.EQ
995delim ��
996.EN
997The author would like to thank:
998.LP
999Bruce Evans
1000for his invaluable assistance
1001in taming the evil i8254 timecounter, as well as the enthusiastic
1002resistance he has provided throughout.
1003.PP
1004Professor Dave Mills of University of Delaware for his work on
1005NTP, for lending out the neglected twin Loran-C receiver and for
1006picking up the glove when timecounters made it clear
1007that the old ``microkernel'' NTP timekeeping code were not up to snuff
1008[MILLS2000B].
1009.PP
1010Tom Van Baak for helping out, despite the best efforts of the National
1011Danish Posts center for Customs and Dues to prevent it.
1012.PP
1013Corby Dawson for helping with the care and feeding for caesium standards.
1014.PP
1015The staff at the NELS Loran-C control station in B�, Norway for providing
1016information about step-changes.
1017.PP
1018The staff at NELS Loran-C station Ei�e, Faeroe
1019Islands for permission to tour their installation.
1020.PP
1021The FreeBSD users for putting up with ``micro uptime went backwards''.
1022.SH
1023References
1024.LP
1025[AG2002]
1026Published specifications for Agilent model 5071A Primary Frequency
1027Standard on 
1028.br
1029http://www.agilent.com
1030.LP
1031[DMK2001]
1032"Accuracy Evaluation of a Cesium Fountain Primary Frequency Standard at NIST."
1033D. M. Meekhof, S. R. Jefferts, M. Stephanovic, and T. E. Parker
1034IEEE Transactions on instrumentation and measurement, VOL. 50, NO. 2,
1035APRIL 2001.
1036.LP
1037[PHK2001]
1038"Monitoring Natural Gas Usage"
1039Poul-Henning Kamp
1040http://phk.freebsd.dk/Gasdims/
1041.LP
1042[MILLS1992]
1043"A computer-controlled LORAN-C receiver for precision timekeeping."
1044Mills, D.L. 
1045Electrical Engineering Department Report 92-3-1, University of Delaware, March 1992, 63 pp.
1046.LP
1047[MILLS2000A]
1048Levine, J., and D. Mills. "Using the Network Time Protocol to transmit International Atomic Time (TAI)". Proc. Precision Time and Time Interval (PTTI) Applications and Planning Meeting (Reston VA, November 2000), 431-439.
1049.LP
1050[MILLS2000B]
1051"The nanokernel."
1052Mills, D.L., and P.-H. Kamp.
1053Proc. Precision Time and Time Interval (PTTI) Applications and Planning Meeting (Reston VA, November 2000), 423-430.
1054.LP
1055[RGO2002]
1056For an introduction to Harrison and his clocks, see for
1057instance 
1058.br
1059http://www.rog.nmm.ac.uk/museum/harrison/
1060.br
1061or for
1062a more detailed and possibly better researched account: Dava
1063Sobels excellent book, "Longitude: The True Story of a Lone
1064Genius Who Solved the Greatest Scientific Problem of His
1065Time" Penguin USA (Paper); ISBN: 0140258795.
1066.LP
1067[SAGE]
1068This ``gee-wiz'' kind of article in Dr. Jobbs Journal is a goot place to
1069start:
1070.br
1071http://www.ddj.com/documents/s=1493/ddj0001hc/0085a.htm
1072.LP
1073[VCC2002]
1074Please consult Virtual Computer Corporations homepage:
1075.br
1076http://www.vcc.com
1077