1<!DOCTYPE doctype PUBLIC "-//w3c//dtd html 4.0 transitional//en">
2<html>
3<head>
4  <meta http-equiv="Content-Type"
5 content="text/html; charset=iso-8859-1">
6  <meta name="GENERATOR"
7 content="Mozilla/4.76 [en] (X11; U; FreeBSD 4.3-RELEASE i386) [Netscape]">
8  <title>Master Lease</title>
9</head>
10<body>
11<center>
12<h1>Master Leases for Berkeley DB</h1>
13</center>
14<center><i>Susan LoVerso</i> <br>
15<i>sue@sleepycat.com</i> <br>
16<i>Rev 1.1</i><br>
17<i>2007 Feb 2</i><br>
18</center>
19<p><br>
20</p>
21<h2>What are Master Leases?</h2>
A master lease is a mechanism whereby clients grant master-ship rights
to a site, and that master, by holding lease rights, can provide a
guarantee of durability to a replication group for a given period of
time.&nbsp; A client that has granted a lease to a master
will not participate in an election of a new
master until that granted lease has expired.&nbsp; By holding a
collection of granted leases, a master can serve
authoritative read requests from applications.&nbsp; While the master
holds leases, a read operation on that master can guarantee several
things to the application:<br>
32<ol>
33  <li>Authoritative reads: a guarantee that the data being read by the
34application is durable and can never be rolled back.</li>
35  <li>Freshness: a guarantee that the data being read by the
36application <b>at the master</b> is
37not stale.</li>
38  <li>Master viability: a guarantee that a current master with valid
39leases will not encounter a duplicate master situation.<br>
40  </li>
41</ol>
42<h2>Requirements</h2>
43The requirements of DB to support this include:<br>
44<ul>
45  <li>After turning them on, users can choose to ignore them in reads
46or not.</li>
47  <li>We are providing read authority on the master only.&nbsp; A
48read on a client is equivalent to a read while ignoring leases.</li>
49  <li>We guarantee that data committed on a master <b>that has been
50read by an application on the
51master</b> will not be rolled back.&nbsp; Data read on a client or
52while ignoring leases <i>or data
53successfully updated/committed but not read,</i>
54may be rolled back.<br>
55  </li>
  <li>Unless leases are ignored, a master will not return successfully
from a read operation unless it holds a
majority of leases.</li>
59  <li>Master leases will remove the possibility of a current/correct
60master being "shot down" by DUPMASTER.&nbsp; <b>NOTE: Old/Expired
61masters may discover a
62later master and return DUPMASTER to the application however.</b><br>
63  </li>
64  <li>Any send callback failure must result in premature lease
65expiration on the master.<br>
66  </li>
67  <li>Users who change the system clock during master leases void the
68guarantee and may get undefined behavior.&nbsp; We assume time always
69runs forward. <b>[document this.]</b><br>
70  </li>
71  <li>Clients are forbidden from participating in elections while they
72have an outstanding lease granted to another site.</li>
73  <li>Clients are forbidden from accepting a new master while they have
74an outstanding lease granted to another site.</li>
75  <li>Clients are forbidden from upgrading themselves to master while
76they have an outstanding lease granted to another site.</li>
77  <li>When asked for a lease grant explicitly by the master, the client
78cannot grant the lease to the master unless the LSN in the master's
79request has been processed by this client.<br>
80  </li>
81</ul>
82The requirements of the
83application using leases include:<br>
84<ul>
85  <li>Users must implement (Base API users on their own, RepMgr users
86via configuration) a majority (or larger) ACK policy. <br>
87  </li>
88  <li>The application must use the election mechanism to decide a master.
89It may not simply declare a site master.</li>
90  <li>The send callback must return an error if the majority ACK policy
91is not met for PERM records.</li>
92  <li>Users must set the number of sites in the group.</li>
93  <li>Using leases in a replication group is all-or-none.&nbsp;
94Therefore, if a site knows it is using leases, it can assume other
95sites are also.<br>
96  </li>
97  <li>All applications that care about read guarantees must forward or
98perform all reads on the master.&nbsp; Reading on the client means a
99read ignoring leases. </li>
100</ul>
101<p>There are some open questions
102remaining.</p>
103<ul>
104  <li>There is one major showstopper issue, see Crashing - Potential
105problem near the end of the document.&nbsp; We need a better solution
106than the one shown there (writing to disk every time a lease is
107granted). Perhaps just documenting that durability means it must be
108flushed to disk before success to avoid that situation?<br>
109  </li>
110  <li>What about db-&gt;join?&nbsp; Users can call join, but the calls
111on the join cursor to get the data would be subject to leases and
112therefore protected.&nbsp; Ok, this is not an open question.</li>
113  <li>What about other read-like operations?&nbsp; Clearly <i>
114DB-&gt;get, DB-&gt;pget, DBC-&gt;get,
115DBC-&gt;pget</i> need lease checks.&nbsp; However, other APIs use
116keys.&nbsp; <i>DB-&gt;key_range</i>
117provides an estimate only so it shouldn't need lease checks. <i>
DB-&gt;stat</i> provides exact counts
in the <i>bt_nkeys</i> and <i>bt_ndata</i> fields.&nbsp; Are those
fields considered authoritative, such that providing those values implies a
durability guarantee and therefore <i>DB-&gt;stat</i>
122should be subject to lease verification?&nbsp; <i>DBC-&gt;count</i>
123provides a count for
124the number of data items associated with a key.&nbsp; Is this
125authoritative information? This is similar to stat - should it be
126subject to lease verification?<br>
127  </li>
128  <li>Do we require master lease checks on write operations?&nbsp; I
129think lease checks are not needed on write operations.&nbsp; It doesn't
130add correctness and adds a lot of complexity (checking leases in put,
131del, and cursors, then what about rename, remove, etc).<br>
132  </li>
133  <li>Do master leases give an iron-clad guarantee of never rolling
134back a transaction? No, but it should mean that a committed transaction
135can never be <b>read</b> on a master
136unless the lease is valid.&nbsp; A committed transaction on a master
137that has never been presented to the application may get rolled back.<br>
138  </li>
139  <li>Do we need to quarantine or prevent reads on an ex-master until
140sync-up is done?&nbsp; No.&nbsp; A master that is simply downgraded to
141client or crashes and reboots is now a client.&nbsp; Reading from that
142client is the same as saying Ignore Leases.</li>
143  <li>What about adding and removing sites while leases are
144active?&nbsp; This is SR 14778.&nbsp; A consistent <i>nsites</i> value
145is required by master
146leases.&nbsp; <b>The resolution of 14778
147is a prerequisite - currently owned by Alan</b>.&nbsp; It isn't
148clear to me what a master is
149supposed to do if the value of nsites gets smaller while leases are
150active.&nbsp; Perhaps it leaves its larger table intact and simply
151checks for a smaller number of granted leases?<br>
152  </li>
153  <li>Can users turn leases off?&nbsp; No.&nbsp; There is no planned <i>turn
154leases off</i> API.</li>
155  <li>Clock skew will be a percentage.&nbsp; However, the smallest, 1%,
156is probably rather large for clock skew.&nbsp; Percentage was chosen
157for simplicity and similarity to other APIs.&nbsp; What granularity is
158appropriate here?</li>
159</ul>
160<h2>API Changes</h2>
161The API changes that are visible
162to the user are fairly minimal.&nbsp;
163There are a few API calls they need to make to configure master leases
164and then there is the API call to turn them on.&nbsp; There is also a
new flag for existing APIs that allows read operations to ignore leases and
return data that
may not be durable.<br>
168<h3>Lease Timeout<br>
169</h3>
There is a new timeout the user
must configure for leases called <b>DB_REP_LEASE_TIMEOUT</b>.&nbsp;
This timeout will be a new option to
the <i>dbenv-&gt;rep_set_timeout</i> method. The <b>DB_REP_LEASE_TIMEOUT</b>
has no default, and the user must configure it
before turning on leases (this timeout need not be set if
leases will not be used).&nbsp; That timeout is the amount of time
177the lease is valid on the master and how long it is granted
178on the client.&nbsp; This timeout must be the same
179value on all sites (like log file size).&nbsp; <b>[Document this
180requirement.&nbsp; We cannot
181enforce it across the group easily.]</b> The timeout used when
182refreshing leases is the <b>DB_REP_ACK_TIMEOUT</b>
for RepMgr applications.&nbsp; For Base API applications, lease
refreshes will use the same mechanism as <b>PERM</b> messages and
should
impose no additional burden.&nbsp; This timeout is used for lease
187refreshment and is the amount of time a reader will wait to refresh
188leases before returning failure to the application from a read
189operation.<br>
190<br>
191This timeout will be both stored
192with its original value, and also
193converted to a <i>db_timespec</i>
194using the <b>DB_TIMEOUT_TO_TIMESPEC</b>
195macro and have the clock skew accounted for and stored in the shared
196rep structure:<br>
197<pre>db_timeout_t lease_timeout;<br>db_timespec lease_duration;<br></pre>
198NOTE:&nbsp; By sending the lease refresh during DB operations, we are
199forcing/assuming that the operation's process has a replication
200transport function set.&nbsp; That is obviously the case for write
201operations, but would it be a burden for read processes (on a
202master)?&nbsp; I think mostly not, but if we need leases for <i>
203DB-&gt;stat</i> then we need to
204document it as it is certainly possible for an application to have a
205separate or dedicated <i>stat</i>
206application or attempt to use <i>db_stat</i>
207(which will not work if leases must be checked).<br>
208<br>
Leases should be checked after the local operation.&nbsp; If we were to
check leases first, we could get descheduled, then lose our lease, and
then perform the operation, leaving a window in which the guarantee
does not hold.&nbsp;
Do the operation, then check leases before returning to the user.<br>
213<h3>Using Leases</h3>
214There is a new API that the user must call to tell the system to use
215the lease mechanism.&nbsp; The method must be called before the
216application calls <i>dbenv-&gt;rep_start</i>
217or <i>dbenv-&gt;repmgr_start</i>.
218This new
219method is:<br>
220<br>
221<pre>&nbsp;&nbsp;&nbsp; dbenv-&gt;rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)<br>
222</pre>
223The <i>clock_scale_factor</i>
parameter is interpreted as a percentage, greater than 100 (to transmit
a floating point number as an integer to the API), that represents the
maximum skew between any two sites' clocks.&nbsp; That is, a <span
227 style="font-style: italic;">clock_scale_factor</span> of 150 suggests
228that the greatest discrepancy between clocks is that one runs 50%
229faster than the others.&nbsp; Both the
230master and client sides
231compensate for possible clock skew.&nbsp; The master uses the value to
232compensate in case the replica has a slow clock and replicas compensate
233in case they have a fast clock.&nbsp; This scaling factor will need to
234be divided by 100 on all sites to truly represent the percentage for
235adjustments made to time values.<br>
236<br>
237Assume the slowest replica's clock is a factor of <i>clock_scale_factor</i>
238slower than the
239fastest clock.&nbsp; Using that assumption, if the fastest clock goes
240from time t1 to t2 in X
241seconds, the slowest clock does it in (<i>clock_scale_factor</i> / 100)
242* X seconds.<br>
243<br>
244The <i>flags</i> parameter is not
245currently used.<br>
246<br>
247When the <i>dbenv-&gt;rep_set_lease</i>
248method is called, we will set a configuration flag indicating that
249leases are turned on:<br>
250<b>#define REP_C_LEASE &lt;value&gt;</b>.&nbsp;
251We will also record the <b>u_int32_t
252clock_skew</b> value passed in.&nbsp; The <i>rep_set_lease</i> method
253will not allow
254calls after <i>rep_start.&nbsp; </i>If
255multiple calls are made prior to calling <i>rep_start</i> then later
256calls will
257overwrite the earlier clock skew value.&nbsp; <br>
258<br>
259We need a new flag to prevent calling <i>rep_set_lease</i>
260after <i>rep_start</i>.&nbsp; The
261simplest solution would be to reject the call to
262<i>rep_set_lease&nbsp;
263</i>if<b>
264REP_F_CLIENT</b>
265or <b>REP_F_MASTER</b> is set.&nbsp;
266However that does not work in the cases where a site cleanly closes its
267environment and then opens without running recovery.&nbsp; The
268replication state will still be set.&nbsp; The prevention will be
269implemented as:<br>
270<pre>#define REP_F_START_CALLED &lt;some bit value&gt;<br></pre>
271In __rep_start, at the end:<br>
<pre>if (ret == 0) {<br>	REP_SYSTEM_LOCK(dbenv);<br>	F_SET(rep, REP_F_START_CALLED);<br>	REP_SYSTEM_UNLOCK(dbenv);<br>}</pre>
273In <i>__rep_env_refresh</i>, if we
274are the last reference closing the env (we already check for that):<br>
275<pre>F_CLR(rep, REP_F_START_CALLED);</pre>
276<b>[Please review the logic here
277carefully.]</b> In order to avoid run-time floating point operations
278on <i>db_timespec</i> structures,
279when a site is declared as a client or master in <i>rep_start</i> we
280will pre-compute the
281lease duration based on the integer-based clock skew and the
282integer-based lease timeout.&nbsp; A master should set a replica's
lease expiration to the <b>start time of
the sent message +
(lease_timeout / (clock_scale_factor / 100))</b> in case the replica has a
slow clock.&nbsp; Replicas extend their leases to <b>received message
time + (lease_timeout *
(clock_scale_factor / 100))</b> in case this replica has a fast clock.&nbsp;
289Therefore, the computation will be as follows if the site is becoming a
290master:<br>
291<pre>db_timeout_t tmp;<br>tmp = (db_timeout_t)((double)rep-&gt;lease_timeout / ((double)rep-&gt;clock_skew / (double)100));<br>rep-&gt;lease_duration = DB_TIMEOUT_TO_TIMESPEC(&amp;tmp);<br></pre>
292Similarly, on a client the computation is:<br>
293<pre>tmp = (db_timeout_t)((double)rep-&gt;lease_timeout * ((double)rep-&gt;clock_skew / (double)100));<br></pre>
294When a site changes state, its lease duration will change based on
295whether it is becoming a master or client and it will be recomputed
296from the original values.&nbsp; Note that these computations, coupled
297with the fact that the lease on the master is computed based on the
298master's time that it sent the message means that leases on the master
299are more conservatively computed than on the clients.<br>
300<br>
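<p>As a rough illustration of the two computations above, here is a
minimal sketch in C.&nbsp; The function names <i>master_lease_usec</i> and
<i>client_lease_usec</i> are hypothetical and not part of DB, and the
timeout is assumed to be expressed in microseconds.&nbsp; The conversion
to a <i>db_timespec</i> is left out.</p>

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not DB functions.  Both take the configured
 * lease timeout (assumed to be in microseconds) and the
 * clock_scale_factor percentage (expected to be >= 100).
 */

/* Master side: shorten the lease in case the replica has a slow clock. */
uint32_t
master_lease_usec(uint32_t lease_timeout, uint32_t clock_scale_factor)
{
	return ((uint32_t)((double)lease_timeout /
	    ((double)clock_scale_factor / 100.0)));
}

/* Client side: lengthen the grant in case this replica has a fast clock. */
uint32_t
client_lease_usec(uint32_t lease_timeout, uint32_t clock_scale_factor)
{
	return ((uint32_t)((double)lease_timeout *
	    ((double)clock_scale_factor / 100.0)));
}
```

<p>Under these assumptions, a lease timeout of 1,000,000 microseconds
with a <i>clock_scale_factor</i> of 150 yields 666,666 microseconds on the
master and 1,500,000 microseconds on a client, so the master always
gives up its lease before any client stops honoring it.</p>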
301The <i>dbenv-&gt;rep_set_lease</i>
302method must be called after <i>dbenv-&gt;open</i>,
303similar to <i>dbenv-&gt;rep_set_config</i>.&nbsp;
304The reason is so that we can check that this is a replication
305environment and we have access to the replication shared memory region.<br>
306<h3>Read Operations<br>
307</h3>
308Authoritative read operations on the master with leases enabled will
309abide by leases by default.&nbsp; We will provide a flag that allows an
310operation on a master to ignore leases.&nbsp; <b>All read operations
311on a client imply
312ignoring leases.</b> If an application wants authoritative reads
313they must forward the read requests to the master and it is the
314application's responsibility to provide the forwarding.
315The consensus was that forcing <span style="font-weight: bold;">DB_IGNORE_LEASE</span>
316on client read operations (with leases enabled, obviously) was too
317heavy handed.&nbsp; Read operations on the client will ignore leases,
318but do no special flag checking.<br>
319<br>
320The flag will be called <b>DB_IGNORE_LEASE</b>
321and it will be a flag that can be OR'd into the DB access method and
322cursor operation values.&nbsp; It will be similar to the <b>DB_READ_UNCOMMITTED</b>
323flag. <b>[Keith, I will need your help here for
324finding a bit in the DB flags that isn't in use for my new flag.&nbsp;
325That
326looks like a very full and confusing area...]<br>
327<br>
328</b>The methods that will
329adhere to leases are:<br>
330<ul>
331  <li><i>Db-&gt;get</i></li>
332  <li><i>Db-&gt;pget</i></li>
333  <li><i>Dbc-&gt;get</i></li>
334  <li><i>Dbc-&gt;pget</i></li>
335  <li><i>Db-&gt;stat </i><b>[maybe?]</b></li>
336  <li><i>Dbc-&gt;count</i><b>[maybe?]</b></li>
337</ul>
338The code that will check leases for a client reading would look
339something
340like this, if we decide to become heavy-handed:<br>
341<pre>if (IS_REP_CLIENT(dbenv)) {<br>	[get to rep structure]<br>	if (FLD_ISSET(rep-&gt;config, REP_C_LEASE) &amp;&amp; !LF_ISSET(DB_IGNORE_LEASE)) {<br>		db_err("Read operations must ignore leases or go to master");<br>		ret = EINVAL;<br>		goto err;<br>	}<br>}<br></pre>
342On the master, the new code to abide by leases is more complex.&nbsp;
343After the call to perform the operation we will check the lease.&nbsp;
344In that checking code, the master will see if it has a valid
345lease.&nbsp; If so, then all is well.&nbsp; If not, it will try to
346refresh the leases.&nbsp; If that refresh attempt results in leases,
347all is well.&nbsp; If the refresh attempt does not get leases, then the
348master cannot respond to the read as an authority and we return an
349error.&nbsp; The new error is called <b>DB_REP_LEASE_EXPIRED</b>.&nbsp;
350The location of the master lease check is down after the internal call
351to read the data is successful:<br>
352<pre>if (IS_REP_MASTER(dbenv) &amp;&amp; !LF_ISSET(DB_IGNORE_LEASE)) {<br>	[get to rep structure]<br>	if (FLD_ISSET(rep-&gt;config, REP_C_LEASE) &amp;&amp;<br>	    (ret = __rep_lease_check(dbenv)) != 0) {<br>		/*<br>		 * We don't hold the lease.<br>		 */<br>		goto err;<br>	}<br>}<br></pre>
353See below for the details of <i>__rep_lease_check</i>.<br>
354<br>
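<p>The ordering described above (perform the local read first, check
the lease afterwards, and only then hand data back) can be sketched
as follows.&nbsp; Everything prefixed with <i>sketch_</i> or
<i>SKETCH_</i> is a hypothetical stand-in for illustration; in particular
the <i>lease_valid</i> field replaces the real work done by
<i>__rep_lease_check</i>.</p>

```c
#include <stddef.h>

#define	SKETCH_IGNORE_LEASE	0x1u	/* Stand-in for DB_IGNORE_LEASE. */
#define	SKETCH_LEASE_EXPIRED	(-1)	/* Stand-in for DB_REP_LEASE_EXPIRED. */
#define	SKETCH_NOTFOUND		(-2)	/* Hypothetical "no record" error. */

struct sketch_env {
	int is_master;		/* Nonzero if this site is the master. */
	int leases_on;		/* Nonzero if leases are configured. */
	int lease_valid;	/* Stand-in for a real lease check result. */
};

/*
 * Read *record into *out.  The local read happens first; the lease is
 * checked afterwards so there is no window between check and read.
 * Nothing is copied to the caller if the lease check fails.
 */
int
sketch_get(const struct sketch_env *env, const int *record,
    unsigned int flags, int *out)
{
	int value;

	if (record == NULL)
		return (SKETCH_NOTFOUND);
	value = *record;		/* The local operation. */

	/* Clients, and readers passing the ignore flag, skip the check. */
	if (env->is_master && env->leases_on &&
	    !(flags & SKETCH_IGNORE_LEASE) && !env->lease_valid)
		return (SKETCH_LEASE_EXPIRED);

	*out = value;
	return (0);
}
```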
355Also note that if leases (or replication) are not configured, then <span
356 style="font-weight: bold;">DB_IGNORE_LEASE</span> is a no-op.&nbsp; It
357is ignored (and won't error) if used when leases are not in
358effect.&nbsp; The reason is so that we can generically set that flag in
359utility programs like <span style="font-style: italic;">db_dump</span>
360that walk the database with a cursor.&nbsp; Note that <span
361 style="font-style: italic;">db_dump</span> is the only utility that
reads with a cursor.<br>
364<h3><b>Nsites
365and Elections</b></h3>
366The call to <i>dbenv-&gt;rep_set_nsites</i>
367must be performed before the call to <i>dbenv-&gt;rep_start</i>
368or <i>dbenv-&gt;repmgr_start</i>.&nbsp;
369This document assumes either that <b>SR
37014778</b> gets resolved, or assumes that the value of <i>nsites</i> is
371immutable.&nbsp; The
372master and all clients need to know how many sites and leases are in
373the group.&nbsp; Clients need to know for elections.&nbsp; The master
374needs to know for the size of the lease table and to know what value a
375majority of the group is. <b>[Until
37614778 is resolved, the master lease work must assume <i>nsites</i> is
377immutable and will
378therefore enforce that this is called before <i>rep_start</i> using
379the same mechanism
380as <i>rep_set_lease</i>.]</b><br>
381<br>
382Elections and leases need to agree on the number of sites in the
383group.&nbsp; Therefore, when leases are in effect on clients, all calls
384to <i>dbenv-&gt;rep_elect</i> must
385set the <i>nsites</i> parameter to
3860.&nbsp; The <i>rep_elect</i> code
387path will return <b>EINVAL</b> if <b>REP_C_LEASE</b> is set and <i>nsites</i>
388is non-0.
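<p>A minimal sketch of that argument check, assuming a hypothetical
helper <i>sketch_elect_nsites</i> that stands in for the relevant part of
the <i>rep_elect</i> code path:</p>

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a DB function.  With leases configured the
 * caller must pass 0 and the value set through rep_set_nsites is used.
 * Without leases, a non-zero argument overrides the configured value.
 */
int
sketch_elect_nsites(int leases_on, uint32_t nsites_arg,
    uint32_t configured_nsites, uint32_t *nsites_out)
{
	if (leases_on && nsites_arg != 0)
		return (EINVAL);
	*nsites_out = (nsites_arg != 0) ? nsites_arg : configured_nsites;
	return (0);
}
```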
389<h2>Lease Management</h2>
390<h3>Message Changes</h3>
391In order for clients to grant leases to the master a new message type
392must be added for that purpose.&nbsp; This will be the <b>REP_LEASE_GRANT</b>
393message.&nbsp;
394Granting leases will be a result of applying a <b>DB_REP_PERMANENT</b>
395record and therefore we
396do not need any additional message in order for a master to request a
397lease grant.&nbsp; The <b>REP_LEASE_GRANT</b>
398message will pass a structure as its message DBT:<br>
<pre>typedef struct __rep_lease_grant {<br>	db_timespec msg_time;<br>#ifdef DIAGNOSTIC<br>	db_timespec expire_time;<br>#endif<br>} REP_GRANT_INFO;<br></pre>
400In the <b>REP_LEASE_GRANT</b>
401message, the client is actually giving the master several pieces of
402information.&nbsp; We only need the echoed <i>msg_time</i> in this
403structure because
404everything else is already sent.&nbsp; The client is really sending the
405master:<br>
406<ul>
407  <li>Its EID (parameter to <span style="font-style: italic;">rep_send_message</span>
408and <span style="font-style: italic;">rep_process_message</span>)<br>
409  </li>
410  <li>The PERM LSN this message acknowledged (sent in the control
411message)</li>
412  <li>Unique identifier echoed back to master (<i>msg_time</i> sent in
413message as above)</li>
414</ul>
415On the client, we always maintain the maximum PERM LSN already in <i>lp-&gt;max_perm_lsn</i>.&nbsp;
416<h3>Local State Management</h3>
417Each client must maintain a <i>db_timespec</i>
418timestamp containing the expiration of its granted lease.&nbsp; This
419field will be in the replication shared memory structure:<br>
420<pre>db_timespec grant_expire;<br></pre>
421This timestamp already takes into account the clock skew.&nbsp; All
422new fields must be initialized when the region is created. Whenever we
423grant our master lease and want to send the <b>REP_LEASE_GRANT</b>
424message, this value
425will be updated.&nbsp; It will be used in the following way:
<pre>db_timespec mytime;<br>DB_LSN perm_lsn;<br>DBT lease_dbt;<br>REP_GRANT_INFO gi;<br><br>timespecclear(&amp;mytime);<br>memset(&amp;lease_dbt, 0, sizeof(lease_dbt));<br>memset(&amp;gi, 0, sizeof(gi));<br>/* Candidate expiration: now plus the skew-adjusted lease duration. */<br>__os_gettime(dbenv, &amp;mytime);<br>timespecadd(&amp;mytime, &amp;rep-&gt;lease_duration);<br>MUTEX_LOCK(rep-&gt;clientdb_mutex);<br>perm_lsn = lp-&gt;max_perm_lsn;<br>MUTEX_UNLOCK(rep-&gt;clientdb_mutex);<br>REP_SYSTEM_LOCK(dbenv);<br>/* Only ever extend the grant; never shorten it. */<br>if (timespeccmp(&amp;mytime, &amp;rep-&gt;grant_expire, &gt;))<br>	rep-&gt;grant_expire = mytime;<br>/* Echo the master's message time back as the unique ID. */<br>gi.msg_time = msg-&gt;msg_time;<br>#ifdef DIAGNOSTIC<br>gi.expire_time = rep-&gt;grant_expire;<br>#endif<br>lease_dbt.data = &amp;gi;<br>lease_dbt.size = sizeof(gi);<br>REP_SYSTEM_UNLOCK(dbenv);<br>__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &amp;perm_lsn, &amp;lease_dbt, 0, 0);<br></pre>
427This updating of the lease grant will occur in the <b>PERM</b> code
428path when we have
429successfully applied the permanent record.<br>
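<p>The key property of the update above is that a client's grant
expiration only ever moves forward.&nbsp; A small self-contained sketch
of that rule follows, using the standard <i>struct timespec</i> in place
of <i>db_timespec</i>; the name <i>sketch_extend_grant</i> is
hypothetical.</p>

```c
#include <time.h>

/* Return nonzero if a is strictly later than b. */
static int
ts_after(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return (a->tv_sec > b->tv_sec);
	return (a->tv_nsec > b->tv_nsec);
}

/*
 * Hypothetical helper: compute now + lease_duration and store it in
 * *grant_expire only if it is later than the current expiration.
 */
void
sketch_extend_grant(struct timespec *grant_expire,
    const struct timespec *now, const struct timespec *lease_duration)
{
	struct timespec candidate;

	candidate = *now;
	candidate.tv_sec += lease_duration->tv_sec;
	candidate.tv_nsec += lease_duration->tv_nsec;
	if (candidate.tv_nsec >= 1000000000L) {	/* Carry into seconds. */
		candidate.tv_sec++;
		candidate.tv_nsec -= 1000000000L;
	}
	if (ts_after(&candidate, grant_expire))
		*grant_expire = candidate;
}
```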
430<h3>Maintaining Leases on the
431Master/Rep_start</h3>
432The master maintains a lease table that it checks when fulfilling a
433read request that is subject to leases.&nbsp; This table is initialized
434when a site calls<i>
435dbenv-&gt;rep_start(DB_MASTER)</i> and the site is undergoing a role
436change (i.e. a master making additional calls to <i>dbenv-&gt;rep_start(DB_MASTER)</i>
437does
438not affect an already existing table).<br>
439<br>
440When a non-master site becomes master, it must do two things related to
441leases on a role change.&nbsp; First, a client cannot upgrade to master
442while it has an outstanding lease granted to another site.&nbsp; If a
443client attempts to do so, an error, <b>EINVAL</b>,
444will be returned.&nbsp; The only way this should happen is if the
445application simply declares a site master, instead of using
446elections.&nbsp; Elections will already wait for leases to expire
447before proceeding. (See below.) <b>[I
448believe an error is sufficient and we do not need, for version 1 at
449least, any other complex waiting mechanism.&nbsp; Applications that
450don't use elections and declare masters are quite rare.]</b><br>
451<br>
452Second, once we are proceeding with becoming a master, the site must
453allocate the table it will use to maintain lease information.&nbsp;
454This table will be sized based on <i>nsites</i>
455and it will be an array of the following structure:<br>
<pre>typedef struct {<br>	int eid;			/* EID of client site. */<br>	db_timespec start_time;	/* Unique time ID client echoes back on grants. */<br>	db_timespec end_time;	/* Master's lease expiration time. */<br>	DB_LSN lease_lsn;	/* Durable LSN this lease applies to. */<br>	u_int32_t flags;	/* Unused for now?? */<br>} REP_LEASE_ENTRY;<br></pre>
457<h3>Granting Leases</h3>
458It is the burden of the application to make sure that all sites in the
459group
460are using leases, or none are.&nbsp; Therefore, when a client processes
461a <b>PERM</b>
462log record that arrived from the master, it will grant its lease
463automatically if that record is permanent (i.e. <b>DB_REP_ISPERM</b>
464is being returned),
465and leases are configured.&nbsp; A client will not send a
466lease grant when it is processing log records (even <b>PERM</b>
467ones) it receives from other clients that use client-to-client
468synchronization.&nbsp; The reason is that the master requires a unique
469time-of-msg ID (see below) that the client echoes back in its lease
470grant and it will not have such an ID from another client.<br>
471<br>
472The master stores a time-of-msg ID in each message and the client
simply echoes it back to the master.&nbsp; In its lease table, the
master keeps the base
time-of-msg for each valid lease.&nbsp; When a <b>REP_LEASE_GRANT</b>
message comes in,
the master does a number of things:<br>
478<ol>
479  <li>Pulls the echoed timespec from the client message, into <i>msg_time</i>.<br>
480  </li>
481  <li>Finds the entry in its lease table for the client's EID.&nbsp; It
482walks the table searching for the ID.&nbsp; EIDs of <span
483 style="font-weight: bold;">DB_EID_INVALID</span> are
484illegal.&nbsp; Either the master will find the entry, or it will find
485an empty slot in the table (i.e. it is still populating the table with
486leases).</li>
487  <li>If this is a previously unknown site lease, the master
488initializes the entry by copying to the <i>eid</i>, <i>start_time, </i>and
489    <i>lease_lsn</i> fields.&nbsp; The master
490also computes the <i>end_time</i>
491based on the adjusted <i>rep-&gt;lease_duration</i>.</li>
492  <li>If this is a lease from a previously known site, the master must
493perform <i>timespeccmp(&amp;msg_time,
494&amp;table[i].start_time, &gt;)</i> and only update the <i>end_time</i>
495of the lease when this is
496a more recent message.&nbsp; If it is a more recent message, then we
497should update
498the <i>lease_lsn</i> to the LSN in
499the message.</li>
  <li>Lease durations are computed taking the clock skew into
account: clients compute them based on the current time and the master
computes them based on the original sending time.&nbsp; For diagnostic
purposes only, I also plan to send the client's expiration time.&nbsp; The
client errs on the side of computing a larger lease expiration time and
the master errs on the side of computing a smaller duration.&nbsp;
Since both are taking the clock skew
into account, the client's expiration time should never be
earlier than
the master's computed expiration time; if it is, their value for clock
skew may not be correct.<br>
511  </li>
512</ol>
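<p>Steps 2 through 4 above can be sketched as a single table update.&nbsp;
This is an illustration only: the <i>sketch_</i> names are hypothetical,
<i>struct timespec</i> stands in for <i>db_timespec</i>, the LSN is
simplified to one integer, and updating <i>start_time</i> along with
<i>end_time</i> on a newer grant is an assumption.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define	SKETCH_EID_INVALID	(-2)	/* Stand-in for DB_EID_INVALID. */

struct sketch_lease_entry {
	int eid;			/* Client EID, or invalid if empty. */
	struct timespec start_time;	/* Message time echoed by client. */
	struct timespec end_time;	/* Master's lease expiration. */
	uint64_t lease_lsn;		/* Simplified stand-in for DB_LSN. */
};

static int
ts_after(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return (a->tv_sec > b->tv_sec);
	return (a->tv_nsec > b->tv_nsec);
}

static struct timespec
ts_add(struct timespec a, const struct timespec *b)
{
	a.tv_sec += b->tv_sec;
	a.tv_nsec += b->tv_nsec;
	if (a.tv_nsec >= 1000000000L) {
		a.tv_sec++;
		a.tv_nsec -= 1000000000L;
	}
	return (a);
}

/*
 * Returns 1 if the entry was created or refreshed, 0 if the grant was
 * stale and ignored, -1 for an invalid EID or a full table.
 */
int
sketch_lease_grant(struct sketch_lease_entry *table, size_t nentries,
    int eid, const struct timespec *msg_time, uint64_t lsn,
    const struct timespec *lease_duration)
{
	struct sketch_lease_entry *le;
	size_t i;

	if (eid == SKETCH_EID_INVALID)
		return (-1);
	/* Find this client's entry or the first empty slot. */
	for (le = NULL, i = 0; i < nentries; i++)
		if (table[i].eid == eid ||
		    table[i].eid == SKETCH_EID_INVALID) {
			le = &table[i];
			break;
		}
	if (le == NULL)
		return (-1);
	/* Known site: only a more recent message may update the lease. */
	if (le->eid == eid && !ts_after(msg_time, &le->start_time))
		return (0);
	le->eid = eid;
	le->start_time = *msg_time;
	le->end_time = ts_add(*msg_time, lease_duration);
	le->lease_lsn = lsn;
	return (1);
}
```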
513Any log records (new or resent) that originate from the master and
514result in <b>DB_REP_ISPERM</b> get an
515ack.<br>
516<br>
517<h3>Refreshing Leases</h3>
518Leases get refreshed when a master receives a <b>REP_LEASE_GRANT</b>
519message from a client. There are three pieces to lease
520refreshment.&nbsp; <br>
521<h4>Lazy Lease Refreshing on Read<br>
522</h4>
523If the master discovers that leases are
524expired during the read operation, it attempts to refresh its
525collection of lease grants.&nbsp; It does this by calling a new
526function <i>__rep_lease_refresh</i>.&nbsp;
527This function is very similar to the already-existing function <i>__rep_flush</i>.&nbsp;
528Basically, to
529refresh the lease, the master simply needs to resend the last PERM
530record to the clients.&nbsp; The requirements state that when the
531application send function returns successfully from sending a PERM
532record, the majority of clients have that PERM LSN durable.&nbsp; We
533will have a new public DB error return called <b>DB_REP_LEASE_EXPIRED</b>
534that will be
535returned back to the caller if the master cannot assert its
536authority.&nbsp; The code will look something like this:<br>
537<pre>/*<br> * Use lp-&gt;max_perm_lsn on the master (currently not used on the master)<br> * to keep track of the last PERM record written through the logging system.<br> * need to initialize lp-&gt;max_perm_lsn in rep_start on role_chg.<br> */<br>call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT<br>if failure<br>	expire leases<br>	return lease expired error to caller<br>else /* success */<br>	recheck lease table<br>	/*<br>	 * We need to recheck the lease table because the client<br>	 * lease grant messages may not be processed yet, or got<br>	 * lost, or racing with the application's ACK messages or<br>	 * whatever. <br>	 */<br>	if we have a majority of valid leases<br>		return success<br>	else<br>		return lease expired error to caller <br></pre>
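<p>The "majority of valid leases" test in the pseudocode above might
look something like the following sketch.&nbsp; The names are
hypothetical, and it assumes the master counts itself toward the
majority, so that <i>nsites / 2</i> unexpired client grants are enough;
that threshold is an assumption of this sketch, not something this
document specifies.</p>

```c
#include <stddef.h>
#include <time.h>

struct sketch_lease {
	int granted;			/* Nonzero once a client has granted. */
	struct timespec end_time;	/* Master's view of the expiration. */
};

static int
ts_after(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return (a->tv_sec > b->tv_sec);
	return (a->tv_nsec > b->tv_nsec);
}

/*
 * Return nonzero if enough leases are still valid at time *now.
 * Assumption: the master is one of the nsites sites and counts itself,
 * so it needs nsites / 2 unexpired grants from clients.
 */
int
sketch_lease_majority(const struct sketch_lease *table, size_t nentries,
    size_t nsites, const struct timespec *now)
{
	size_t i, valid;

	for (valid = 0, i = 0; i < nentries; i++)
		if (table[i].granted && ts_after(&table[i].end_time, now))
			valid++;
	return (valid >= nsites / 2);
}
```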
538<h4>Ongoing Update Refreshment<br>
539</h4>
540Second is having the master indicate to
541the client it needs to send a lease grant in response to the current
542PERM log message.&nbsp; The problem is
543that acknowledgements must contain a master-supplied message timestamp
544that the client sends back to the master.&nbsp; We need to modify the
545structure of the&nbsp; log record messages when leases are configured
546so
547that when a PERM message is sent, the master sends, and the client
548expects, the message timestamp.&nbsp; There are three fairly
549straightforward and different implementations to consider.<br>
550<ol>
551  <li>Adding the timestamp to the <b>REP_CONTROL</b>
552structure.&nbsp; If this option is chosen, then the code trivially
553sends back the timestamp in the client's reply.&nbsp; There is no
554special processing done by either side with the message contents.&nbsp;
555So, on a PERM log record, the master will send a non-zero
556timestamp.&nbsp; On a normal log record the timestamp will be zero or
557some known invalid value.&nbsp; If the client sees a non-zero
558timestamp, it sends a <b>REP_LEASE_GRANT</b>
559with the <i>lp-&gt;max_perm_lsn</i>
560after applying that log record.&nbsp; If it is zero, then the client
561does nothing different.&nbsp; The advantage is ease of code.&nbsp; The
562disadvantage is that for mixed version systems, the client is now
563dealing with different sized control structures.&nbsp; We would have to
564retain the old control structure so that during a mixed version group
565the (upgraded) clients can use, expect and send old control structures
566to the master.&nbsp; This is unfortunate, so let's consider additional
567implementations that don't require modifying the control structure.<br>
568  </li>
569  <li>Adding a new <b>REPCTL_LEASE</b>
570flag to the list of flags for the control structure, but do not change
571the control structure fields.&nbsp; When a master wants to send a
572message that needs a lease ack, it sets the flag.&nbsp; Additionally,
573instead of simply sending a log record DBT as the <i>rec</i> parameter
574for replication, we
575would send a new structure that had the timestamp first and then the
576record (similar to the bulk transfer buffer).&nbsp; The advantage of
577this is that the control structure does not change.&nbsp; Disadvantages
578include more special-cased code in the normal code path where we have
579to check the flag.&nbsp; If the flag is set we have to extract the
580timestamp value and massage the incoming data to pass on the real log
581record to <i>rep_apply</i>.&nbsp; On
582bulk transfer, we would just add the timestamp into the buffer.&nbsp;
583On normal transfers, it would incur an additional data copy on the
584master side.&nbsp; That is unfortunate.&nbsp; Additionally, if this
585record needs to be stored in the temp db, we need some way to get it
586back again later or <span style="font-style: italic;">rep_apply</span>
587would have to extract the timestamp out when it processed the record
588(either live or from the temp db).<br>
589  </li>
  <li>Adding a different message type, such as <b>REP_LOG_ACK</b>.&nbsp;
Similar to <b>REP_LOG_MORE</b>, this message would be a
special-case version of a log record.&nbsp; We would extract the
timestamp and then handle it as a normal log record.&nbsp; This
594implementation is rejected because it actually would require three new
595message types: <b>REP_LOG_ACK,
596REP_LOG_ACK_MORE, REP_BULK_LOG_ACK</b>.&nbsp; That is just too ugly
597to contemplate.</li>
598</ol>
599<b>[Slight digression:</b> it occurs
600to me while writing about #2 and #3 above, that our implementation of
601all of the *_MORE messages could really be implemented with a <b>REPCTL_MORE</b>
602flag instead of a
separate message type.&nbsp; We should clean that up and simplify the
messages, but not as part of master leases.&nbsp; Hmm, taking that thought
605process further, we really could get rid of the <b>REP_BULK_*</b>
606messages as well if we
607added a <b>REPCTL_BULK</b>
608flag.&nbsp; I think we should definitely do it for the *_MORE
609messages.&nbsp; I am not sure we should do it for bulk because the
610structure of the incoming data record is vastly different.]<br>
611<br>
612Of these options, I believe that modifying the control structure is the
613best alternative.&nbsp; The handling of the old structure will be very
614isolated to code dealing with old versions and is far less complicated
615than injecting the timestamp into the log record DBT and doing a data
616copy.&nbsp; Actually, I will likely combine #1 and the flag from #2
617above.&nbsp; I will have the <b>REPCTL_LEASE</b>
618flag that indicates a lease grant reply is expected and have the
timestamp in the control structure.&nbsp; <b>[Is that necessary?&nbsp; It
feels cleaner, but
we could also just treat a non-zero timestamp as "send a
reply" without having it directed by a flag from the master.&nbsp; That
means we would not need the flag, but it builds an assumption into the
code instead of having the client simply send a grant when the flag
says to do so.&nbsp; See Upgrades/Mixed versions below too.]</b>
626Also I will probably add in a spare field or two for future use in the <b>REP_CONTROL</b>
627structure.<br>
628<h4>Gap processing</h4>
629No matter which implementation we choose for ongoing lease refreshment,
630gap processing must be considered.&nbsp; The code above assumes the
631timestamps will be placed on PERM records only.&nbsp; Normal log
632records will not have a timestamp, nor a flag or anything else like
633that.&nbsp; However, any log message can fill a gap on a client and
634result in the processing of that normal log record to return <b>DB_REP_ISPERM</b>
635because later records
636were also processed.<br>
637<br>
638The current implementation should work fine in that case because when
639we store the message in the client temp db we store both the control
640DBT and the record DBT.&nbsp; Therefore, when a normal record fills a
641gap, the later PERM record, when retrieved will look just like it did
642when it arrived.&nbsp; The client will have access to the LSN, and the
643timestamp, etc.&nbsp; However, it does mean that sending the <b>REP_LEASE_GRANT</b>
644message must take
645place down in <i>__rep_apply</i>
646because that is the only place we have access to the contents of those
647stored records with the timestamps.<br>
648<br>
There are two logical choices to consider for granting the lease when
processing an update.&nbsp; First, as we process a PERM message (either a
live record or one read from the temp db after filling a gap), we send a <b>REP_LEASE_GRANT</b>
message for each
PERM record we successfully apply.&nbsp; Or, second, we keep track of
654the largest timestamp of all PERM records we've processed and at the
655end of the function after we've applied all records, we send back a
656single lease grant with the <i>max_perm_lsn</i>
657and a new <i>max_lease_timestamp</i>
658value to the master.&nbsp; The first is easier to implement, the second
659results in possibly slightly fewer messages at the expense of more
660bookkeeping on the client.<br>
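<br>
A minimal sketch of the second choice, assuming the applied records are available as an array and that LSNs and timestamps can be compared as plain integers.&nbsp; The <i>sk_</i> types and function are hypothetical and only show the bookkeeping, not the real <i>__rep_apply</i> code.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one applied log record. */
struct sk_rec {
	uint64_t lsn;     /* LSN flattened to one integer */
	int      is_perm; /* non-zero for PERM records */
	int64_t  ts;      /* master timestamp, 0 on non-PERM records */
};

/* What a single, batched lease grant would carry. */
struct sk_grant {
	int      send;                /* non-zero if a grant should be sent */
	uint64_t max_perm_lsn;
	int64_t  max_lease_timestamp;
};

/* Walk the records applied in one pass and build at most one grant. */
static struct sk_grant
sk_collect_grant(const struct sk_rec *recs, size_t n)
{
	struct sk_grant g = { 0, 0, 0 };
	size_t i;

	assert(recs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!recs[i].is_perm)
			continue;
		g.send = 1;
		if (recs[i].lsn > g.max_perm_lsn)
			g.max_perm_lsn = recs[i].lsn;
		if (recs[i].ts > g.max_lease_timestamp)
			g.max_lease_timestamp = recs[i].ts;
	}
	return (g);
}
```

If no PERM record was applied, <i>send</i> stays zero and no grant goes out, which matches the rule that normal log records do not trigger a grant.<br>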
661<br>
662A third, more complicated option would be to have the message timestamp
663on all records, but grants are only sent on the PERM messages.&nbsp; A
664reason to do this is that the later timestamp of a normal log record
665would be used as the timestamp sent in the reply and the master would
666get a more up to date timestamp value and a longer lease.&nbsp; <br>
667<br>
668<span style="font-weight: bold;">[Concern about gap processing here.]</span>&nbsp;
669If we change the <span style="font-weight: bold;">REP_CONTROL</span>
670structure to include the timestamp, we potentially break or at least
671need to revisit the gap processing algorithm.&nbsp; That code assumes
672that the control and record elements for the same LSN look the same
673each and every time.&nbsp; The code stores the <span
674 style="font-style: italic;">control</span> DBT as the key and the <span
675 style="font-style: italic;">rec</span> DBT as the data.&nbsp; We use a
676specialized compare function to sort based on the LSN in the control
DBT.&nbsp; With master leases, the same record for the same LSN,
transmitted multiple times by a master or sent by a client, will be
different because the
timestamp field will not be the same.&nbsp; Therefore, the client will
680end up with duplicate entries in the temp database for the same
681LSN.&nbsp; Both solutions (adding the timestamp to <span
682 style="font-weight: bold;">REP_CONTROL</span> and adding a <span
683 style="font-weight: bold;">REPCTL_LEASE</span> flag) can yield
684duplicate entries.&nbsp; The flag would cause the same record from the
685master and client to be different as well.<br>
686<h4>Handling Incoming Lease Grants<br>
687</h4>
688The third piece of lease management is handling the incoming <b>REP_LEASE_GRANT</b>
689message on the
690master.&nbsp; When this message is received, the master must do the
691following:<br>
<pre>REP_SYSTEM_LOCK<br>msg_timestamp = cntrl-&gt;timestamp;<br>client_lease = __rep_lease_entry(dbenv, client eid)<br>if (client_lease == NULL) {<br>	initial lease for this site, DB_ASSERT there is space in the table<br>	add this to the table if there is space<br>} else {<br>	compare msg_timestamp with client_lease-&gt;start_time<br>	if (msg_timestamp is more recent &amp;&amp; msg_lsn &gt;= lease LSN)<br>		update entry in table<br>}<br>REP_SYSTEM_UNLOCK<br></pre>
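The table update above can be sketched in compilable form as follows.&nbsp; This is an illustration under assumptions: the <i>sk_</i> names are hypothetical, times and LSNs are plain integers, the lease end time is taken to be the message timestamp plus a lease length, and the locking shown in the pseudocode is omitted.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NO_EID (-1) /* marks a free table slot (assumption) */

/* Hypothetical lease table entry kept on the master. */
struct sk_lease {
	int      eid;        /* client id, SK_NO_EID if the slot is free */
	uint64_t lsn;        /* LSN the grant covers */
	int64_t  start_time; /* master timestamp echoed by the client */
	int64_t  end_time;   /* start_time + lease length */
};

/* Find the entry for eid, or NULL if there is none. */
static struct sk_lease *
sk_lease_entry(struct sk_lease *tab, size_t n, int eid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].eid == eid)
			return (&tab[i]);
	return (NULL);
}

/* Apply an incoming grant; returns 1 if the table changed, 0 if ignored. */
static int
sk_lease_grant(struct sk_lease *tab, size_t n, int eid,
    int64_t msg_ts, uint64_t msg_lsn, int64_t lease_len)
{
	struct sk_lease *e;

	assert(tab != NULL && eid != SK_NO_EID);
	e = sk_lease_entry(tab, n, eid);
	if (e == NULL) {
		/* Initial lease for this site: claim a free slot. */
		e = sk_lease_entry(tab, n, SK_NO_EID);
		if (e == NULL)
			return (0); /* table full; the design would assert here */
		e->eid = eid;
	} else if (msg_ts <= e->start_time || msg_lsn < e->lsn)
		return (0); /* stale or reordered grant */
	e->lsn = msg_lsn;
	e->start_time = msg_ts;
	e->end_time = msg_ts + lease_len;
	return (1);
}
```

Ignoring a grant whose timestamp or LSN is older than the stored entry keeps a delayed message from moving a lease backwards.<br>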
693<h3>Expiring Leases</h3>
694Leases can expire in two ways.&nbsp; First they can expire naturally
695due to the passage of time.&nbsp; When checking leases, if the current
696time is later than the lease entry's <i>end_time</i>
697then the lease is expired.&nbsp; Second, they can be forced with a
698premature expiration when the application's transport function returns
an error.&nbsp; In the first case there is nothing to do; in the
second case we need to manipulate the <i>end_time</i>
701so that all future lease checks fail.&nbsp; Since the lease <i>start_time</i>
702is guaranteed to not be in the future we will have a function <i>__rep_lease_expire</i>
703that will:<br>
704<pre>REP_SYSTEM_LOCK<br>for each entry in the lease table<br>	entry-&gt;end_time = entry-&gt;start_time;<br>REP_SYSTEM_UNLOCK<br></pre>
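As a small sketch of forced expiration, assuming integer clock ticks and hypothetical <i>sk_</i> names (locking again omitted), the loop above and its effect on a later validity test look like this:<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lease entry; times are plain integer ticks. */
struct sk_lease {
	int64_t start_time;
	int64_t end_time;
};

/* Validity test in the style of the lease check: end_time not before now. */
static int
sk_lease_live(const struct sk_lease *e, int64_t now)
{
	assert(e != NULL);
	return (e->end_time >= now);
}

/* Force premature expiration of every entry in the table. */
static void
sk_lease_expire(struct sk_lease *tab, size_t n)
{
	size_t i;

	assert(tab != NULL || n == 0);
	for (i = 0; i < n; i++)
		tab[i].end_time = tab[i].start_time;
}
```

With a "greater than or equal" comparison, a check made in the very same tick as <i>start_time</i> would still pass, so this sketch relies on the clock having advanced past <i>start_time</i> by the time of the next check.<br>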
705Is there a potential race or problem with prematurely expiring
706leases?&nbsp; Consider an application that enforces an ALL
707acknowledgement policy for PERM records in its transport
708callback.&nbsp; There are four clients and three send the PERM ack to
709the application.&nbsp; The callback returns an error to the master DB
710code.&nbsp; The DB code will now prematurely expire its leases.&nbsp;
711However, at approximately the same time the three clients are also
712sending their <span style="font-weight: bold;">REP_LEASE_GRANT</span>
713messages to the master.&nbsp; There is a race between the master
714processing those messages and the thread handling the callback failure
715expiring the table.&nbsp; This is only an issue if the messages arrive
716after the table has been expired.<br>
717<br>
718Let's assume all three clients send their grants after the master
719expires the table.&nbsp; If we accept those grants and then a read
720occurs the read will succeed since the master has a majority of leases
721even though the callback failed earlier.&nbsp; Is that a problem?&nbsp;
The lease code is using a majority and the application policy is using
some other value.&nbsp; It feels like this should be okay since
724the data is held by leases on a majority.&nbsp; Should we consider
725having the lease checking threshold be the same as the permanent ack
726policy?&nbsp; That is difficult because Base API users implement
727whatever they want and DB does not know what it is.<br>
728<h3>Checking Leases</h3>
729When a read operation on the master completes, the last thing we need
730to do is verify the master leases.&nbsp; We've already discussed
731refreshing them when they are expired above.&nbsp; We need two things
732for a lease to be valid.&nbsp; It must be within the timeframe of the
733lease grant and the lease must be valid for the last PERM record
734LSN.&nbsp; Here is the logic
735for checking the validity of leases in <i>__rep_lease_check</i>:<br>
<pre>#define MAX_REFRESH_TRIES	3<br>DB_LSN lease_lsn;<br>REP_LEASE_ENTRY *entry;<br>u_int32_t min_leases, valid_leases;<br>db_timespec cur_time;<br>int ret, tries;<br><br>	tries = 0;<br>retry:<br>	ret = 0;<br>	LOG_SYSTEM_LOCK<br>	lease_lsn = lp-&gt;lsn;<br>	LOG_SYSTEM_UNLOCK<br>	REP_SYSTEM_LOCK<br>	min_leases = rep-&gt;nsites / 2;<br>	__os_gettime(dbenv, &amp;cur_time);<br>	for (entry = head of table, valid_leases = 0; entry != NULL &amp;&amp; valid_leases &lt; min_leases; entry++)<br>		if (timespec_cmp(&amp;entry-&gt;end_time, &amp;cur_time) &gt;= 0 &amp;&amp; log_compare(&amp;entry-&gt;lsn, &amp;lease_lsn) == 0)<br>			valid_leases++;<br>	REP_SYSTEM_UNLOCK<br>	if (valid_leases &lt; min_leases) {<br>		ret = __rep_lease_refresh(dbenv, ...);<br>		/*<br>		 * If we are successful, we need to recheck the leases because <br>		 * the lease grant messages may have raced with the PERM<br>		 * acknowledgement.  Give those messages a chance to arrive.<br>		 */<br>		if (ret == 0) {<br>			if (tries &lt;= MAX_REFRESH_TRIES) {<br>				/*<br>				 * If we were successful sending, but not successful in racing the<br>				 * message thread, yield the processor so that message<br>				 * threads may have a chance to run.<br>				 */<br>				if (tries &gt; 0)<br>					/* __os_sleep instead?? */<br>					__os_yield();<br>				tries++;<br>				goto retry;<br>			} else<br>				ret = DB_RET_LEASE_EXPIRED;<br>		}<br>	}<br>	return (ret);</pre>
737If the master has enough valid leases it returns success.&nbsp; If it
738does not have enough, it attempts to refresh them.&nbsp; This attempt
739may fail if sending the PERM record does not receive sufficient
740acks.&nbsp; If we do receive sufficient acknowledgements we may still
find that scheduling of message threads means the master hasn't
processed the incoming <b>REP_LEASE_GRANT</b>
messages yet.&nbsp; We will retry a couple of times (possibly
parameterized) if the master discovers that situation.&nbsp; <br>
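The counting step at the heart of the check can be isolated as a small sketch.&nbsp; It assumes integer times and LSNs and hypothetical <i>sk_</i> names, and it leaves out the locking and the refresh-and-retry loop; the threshold mirrors the <i>min_leases = rep-&gt;nsites / 2</i> line in the pseudocode.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lease entry: the LSN it covers and when it runs out. */
struct sk_lease {
	uint64_t lsn;
	int64_t  end_time;
};

/*
 * Returns 1 if at least nsites / 2 entries are unexpired at "now" and
 * cover exactly lease_lsn, otherwise 0.
 */
static int
sk_lease_check(const struct sk_lease *tab, size_t n, uint32_t nsites,
    uint64_t lease_lsn, int64_t now)
{
	uint32_t min_leases, valid;
	size_t i;

	assert(tab != NULL || n == 0);
	min_leases = nsites / 2;
	/* Stop early once enough valid leases have been seen. */
	for (i = 0, valid = 0; i < n && valid < min_leases; i++)
		if (tab[i].end_time >= now && tab[i].lsn == lease_lsn)
			valid++;
	return (valid >= min_leases);
}
```

A result of 0 corresponds to the branch where the master would try to refresh its leases and then recheck.<br>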
745<h2>Elections</h2>
746When a client grants a lease to a master, it gives up the right to
747participate in an election until that grant expires.&nbsp; If we are
748the master and <i>dbenv-&gt;rep_elect</i>
749is called, it should return, no matter what, like it does today.&nbsp;
If we are a client and <i>rep_elect</i>
is called, special processing takes place when leases are in
effect.&nbsp; First, the easy case: if the lease granted by this
client has already expired, the client goes directly into the
election as normal.&nbsp; If a valid lease grant is outstanding to a
755master, this site cannot participate in an election until that grant
756expires.&nbsp; We have at least two options when a site calls the <i>dbenv-&gt;rep_elect</i>
757API while
758leases are in effect.<br>
759<ol>
760  <li>The simplest coding solution for DB would be simply to refuse to
761participate in the election if this site has a current lease granted to
762a master.&nbsp; We would detect this situation and return EINVAL.&nbsp;
763This is correct behavior and trivial to implement.&nbsp; The
764disadvantage of this solution is that the application would then be
765responsible for repeatedly attempting an election until the lease grant
766expired.<br>
767  </li>
768  <li>The more satisfying solution is for DB to wait the remaining time
769for the grant.&nbsp; If this client hears from the master during that
770time the election does not take place and the call to <i>rep_elect</i>
771returns with the
772information for the current/old master.</li>
773</ol>
774<h3>Election Code Changes</h3>
775The code changes to support leases in the election code are fairly
776isolated.&nbsp; First if leases are configured, we must verify the <i>nsites</i>
777parameter is set to 0.&nbsp;
778Second, in <i>__rep_elect_init</i>
779we must not overwrite the value of <i>rep-&gt;nsites</i>
780for leases because it is controlled by the <i>dbenv-&gt;rep_set_nsites</i>
781API.&nbsp;
782These changes are small and easy to understand.<br>
783<br>
784The more complicated code will be the client code when it has an
785outstanding lease granted.&nbsp; The client will wait for the current
786lease grant to expire before proceeding with the election.&nbsp; The
787client will only do so if it does not hear from the master for the
788remainder of the lease grant time.&nbsp; If the client hears from the
789master, it returns and does not begin participating in the
790election.&nbsp; A new election phase, <b>REP_EPHASE0</b>
791will exist so that the call to <i>__rep_wait</i>
792can detect if a master responds.&nbsp; The client, while waiting for
793the lease grant to expire, will send a <b>REP_MASTER_REQ</b>
794message so that the master will respond with a <b>REP_NEWMASTER</b>
795message and thus,
allow the client to know the master exists.&nbsp; However, it is also
desirable that when the master
replies to the client, it also gets the client to update its lease
grant.&nbsp; <br>
800<br>
801Recall that the <b>REP_NEWMASTER</b>
802message does not result in a lease grant from the client.&nbsp; The
803client responds when it processes a PERM record that has the <b>REPCTL_LEASE</b>
804flag set in the message
805with its lease grant up to the given LSN.&nbsp; Therefore, we want the
806client's <b>REP_MASTER_REQ</b> to
807yield both the discovery of the existing master and have the master
808refresh its leases.&nbsp; The client will also use the <b>REPCTL_LEASE</b>
809flag in its <b>REP_MASTER_REQ</b> message to the
810master.&nbsp; This flag will serve as the indicator to the master that
811it needs to deal with leases and both send the <b>REP_NEWMASTER</b>
812message and refresh
813the lease.<br>
814The code will work as follows:<br>
<pre>if (leases_configured &amp;&amp; (my_grant_still_valid || lease_never_granted)) {<br>	if (lease_never_granted)<br>		wait_time = lease_timeout;<br>	else<br>		wait_time = grant_expiration - current_time;<br>	F_SET(REP_F_EPHASE0);<br>	__rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);<br>	ret = __rep_wait(..., REP_F_EPHASE0);<br>	if (we found a master)<br>		return<br>} /* if we don't return, fall out and proceed with election */<br></pre>
816On the master side, the code handling the <b>REP_MASTER_REQ</b> will
817do:<br>
818<pre>if (I am master) {<br>	...<br>	__rep_send_message(REP_NEWMASTER...)<br>	if (F_ISSET(rp, REPCTL_LEASE))<br>		__rep_lease_refresh(...)<br>}<br></pre>
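The client-side wait computation can be written out as a small function.&nbsp; This is a sketch under assumptions: times are integer ticks, the <i>sk_</i> name is hypothetical, and a return value of 0 means the client proceeds straight to the election without the <b>REP_EPHASE0</b> wait.<br>

```c
#include <assert.h>
#include <stdint.h>

/*
 * How long a client should wait in the new pre-election phase before it
 * may join an election.  Returns 0 when no wait is needed.
 */
static int64_t
sk_election_wait(int leases_configured, int lease_never_granted,
    int64_t grant_expiration, int64_t now, int64_t lease_timeout)
{
	assert(lease_timeout >= 0);
	if (!leases_configured)
		return (0);
	/* A site that has never granted a lease waits one full timeout. */
	if (lease_never_granted)
		return (lease_timeout);
	/* An outstanding grant: wait only for the time that remains. */
	if (grant_expiration > now)
		return (grant_expiration - now);
	/* The grant has already expired. */
	return (0);
}
```

For example, with a grant expiring at tick 130 and the current time at tick 100, the client would wait 30 ticks while listening for a <b>REP_NEWMASTER</b> reply.<br>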
Other minor implementation details: <i>__rep_elect_done</i>
820must also clear
821the <b>REP_F_EPHASE0</b> flag.&nbsp;
822We also, obviously, need to define <b>REP_F_EPHASE0</b>
823in the list of replication flags.&nbsp; Note that the client's call to <i>__rep_wait</i>
824will return upon
825receiving the <b>REP_NEWMASTER</b>
826message.&nbsp; The client will independently refresh its lease when it
827receives the log record from the master's call to refresh the lease.<br>
828<br>
829Again, similar to what I suggested above, the code could simply assume
830global leases are configured, and instead of having the <b>REPCTL_LEASE</b>
831flag at all, the master
832assumes that it needs to refresh leases because it has them configured,
833not because it is specified in the <b>REP_MASTER_REQ</b>
834message it is processing. Right now I don't think every possible
835<b>REP_MASTER_REQ</b> message should result in a lease grant request.<br>
<h4>Elections and Quiescent Systems</h4>
It is possible that a master is slow or the client is close to its
expiration time, or that the master is quiescent and all leases are
839currently expired, but nothing much is going on anyway, yet some client
840calls <i>__rep_elect</i> at that
841time.&nbsp; In the code above, we will not send the <b>REP_MASTER_REQ</b>
842because the lease is
843not valid.&nbsp; The client will simply proceed directly to sending the
844<b>REP_VOTE1</b> message, throwing all
845other clients into an election.&nbsp; The master is still master and
846should stay that way.&nbsp; Currently in response to a vote message, a
847master will broadcast out a <b>REP_NEWMASTER</b>
848to assert its mastership.&nbsp; That causes the election to
849complete.&nbsp; However, if desired the master may want to proactively
850refresh its leases.&nbsp; This situation indicates to me that the
851master should choose to refresh leases based on configuration, not a
852flag sent from the client.&nbsp; I believe anytime the master asserts
853its mastership via sending a <b>REP_NEWMASTER</b>
854message that I need to add code to proactively refresh leases at that
855time.<br>
856<h2>Other Implementation Details</h2>
857<h3>Role Changes<br>
858</h3>
859When a site changes its role via a call to <i>rep_start</i> in either
860direction, we
861must take action when leases are configured.&nbsp; There are three
862types of role changes that all need changes to deal with leases:<br>
863<ol>
864  <li><i>A master downgrading to a
865client.</i> When a master downgrades to a client, it can do so
866immediately after it has proactively expired all existing leases it
867holds.&nbsp; This situation is similar to an error from the send
868callback, and it effectively cancels all outstanding leases held on
869this site.&nbsp; Note that if this master expires its leases, it does
870not have any effect on when the clients' lease grants expire on the
871client side.&nbsp; The clients must still wait their full expected
872grant time.<br>
873  </li>
874  <li><i>A client upgrading to master.</i>
875If a client is upgrading to a master but it has an outstanding lease
876granted to another site, the code will return an <b>EINVAL</b>
877error.&nbsp; This situation
878only arises if the application simply declares this site master.&nbsp;
879If a site wins an election then the election itself should have waited
880long enough for the granted lease to expire and this state should not
881arise then.</li>
  <li><i>A client finding a new master.</i>
When a client discovers a new and different master via a <b>REP_NEWMASTER</b>
message, the
client cannot accept that new master until its current lease grant
886expires.&nbsp; This situation should only occur when a site declares
887itself master without an election and that site's lease grant expires
888before this client's grant expires.&nbsp; However, it is <b>possible</b>
889for this situation to arise
890with elections also.&nbsp; If we have 5 sites holding an election and 4
891of those sites have leases expire at about the same time T, and this
892site's lease expires at time T+N and the election timeout is &lt; N,
893then those 4 sites may hold an election and elect a master without this
894site's participation.&nbsp; A client in this situation must call <i>__rep_wait</i>
895with the time remaining
896on its lease.&nbsp; If the lease is expired after waiting the remaining
897time, then the client can accept this new master.&nbsp; If the lease
898was refreshed during the waiting period then the client does not accept
899this new master and returns.<br>
900  </li>
901</ol>
902<h3>DUPMASTER</h3>
903A duplicate master situation can occur if an old master becomes
904disconnected from the rest of the group, that group elects a new master
905and then the partition is resolved.&nbsp; The requirement for master
906leases is that this situation will not cause the newly elected,
907rightful master to receive the <b>DB_REP_DUPMASTER</b>
908return.&nbsp; It is okay for the old master to get that return
909value.&nbsp; When a dual master situation exists, the following will
910happen:<br>
911<ul>
912  <li><i>On the current master and all
913current clients</i> - If the current master receives an update
914message or other conflicting message from the old master then that
915message will be ignored because the generation number is out of date.</li>
916  <li><i>On the old master</i> - If
917the old master receives an update message from the current master, or
918any other message with a later generation from any site, the new
919generation number will trigger this site to return <b>DB_REP_DUPMASTER</b>.&nbsp;
920However,
921instead of broadcasting out the <b>REP_DUPMASTER</b>
922message to shoot down others as well, this site, if leases are
923configured, will call <i>__rep_lease_check</i>
924and if they are expired, return the error.&nbsp; It should be
925impossible for us to receive a later generation message and still hold
926a majority of master leases.&nbsp; Something is seriously wrong and we
927will <b>DB_ASSERT</b> this situation
928cannot happen.<br>
929  </li>
930</ul>
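The generation-number rules above can be summarized in a short sketch.&nbsp; The enum, the <i>sk_</i> function and its flat arguments are hypothetical; <i>leases_valid</i> stands for the result of a lease check such as <i>__rep_lease_check</i>.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcomes for an incoming message. */
enum sk_action {
	SK_IGNORE,    /* sender's generation is out of date */
	SK_PROCESS,   /* handle the message normally */
	SK_DUPMASTER  /* old master must report a duplicate master */
};

static enum sk_action
sk_on_message(int i_am_master, uint32_t my_gen, uint32_t msg_gen,
    int leases_valid)
{
	/* Messages from an old generation are ignored. */
	if (msg_gen < my_gen)
		return (SK_IGNORE);
	if (i_am_master && msg_gen > my_gen) {
		/*
		 * A later generation exists, so this site is the old
		 * master.  It should be impossible to still hold a
		 * majority of valid leases at this point.
		 */
		assert(!leases_valid);
		return (SK_DUPMASTER);
	}
	return (SK_PROCESS);
}
```

Only the old master ever reaches the <i>SK_DUPMASTER</i> branch; the current master sees the older generation number and ignores the message.<br>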
931<h3>Client to Client Synchronization</h3>
932One question to ask is how lease grants interact with client-to-client
933synchronization. The only answer is that they do not.&nbsp; A client
934that is sending log records to another client cannot request the
935receiving client refresh its lease with the master.&nbsp; That client
936does not have a timestamp it can use for the master and clock skew
937makes it meaningless between machines.&nbsp; Therefore, sites that use
938client-to-client synchronization will likely see more lease refreshment
939during the read path and leases will be refreshed during live updates
940only.&nbsp; Of course, if a client supplies log records that fill a
941gap, and the later log records stored came from the master in a live
942update then the client will respond as per the discussion on Gap
943Processing above.<br>
944<h2>Interaction Matrix</h2>
945If leases are granted (by a client) or held (by a master) what should
946the following APIs and messages do?<br>
947<br>
948Other:<br>
949log_archive: Leases do not affect log_archive.&nbsp; OK.<br>
950dbenv-&gt;close: OK.<br>
951crash during lease grant and restart: <b>Potential
952problem here.&nbsp; See discussion below</b>.<br>
953<br>
954Rep Base API method:<br>
955rep_elect: Already discussed above.&nbsp; Must wait for lease to expire.<br>
956rep_flush: Master only, OK - this will be the basis for refreshing
957leases.<br>
958rep_get_*: Not affected by leases.<br>
959rep_process_message: Generally OK.&nbsp; We'll discuss each message
960below.<br>
961rep_set_config: OK.<br>
962rep_set_limit: OK<br>
963rep_set_nsites: Must be called before <i>rep_start</i>
964and <i>nsites</i> is immutable until
96514778 is resolved.<br>
966rep_set_priority: OK<br>
967rep_set_timeout: OK.&nbsp; Used to set lease timeout.<br>
968rep_set_transport: OK.<br>
969rep_start(MASTER): Role changes are discussed above.&nbsp; Make sure
970duplicate rep_start calls are no-ops for leases.<br>
971rep_start(CLIENT): Role changes are discussed above.&nbsp; Make sure
972duplicate calls are no-ops for leases.<br>
973rep_stat: OK. <b>[Do we have any stats
974we want to add?&nbsp; Currently none are planned, but may come up
975during implementation and testing as useful to have.&nbsp; Suggestions?]</b><br>
976rep_sync: Should not be able to happen.&nbsp; Client cannot accept new
977master with outstanding lease grant.&nbsp; Add DB_ASSERT here.<br>
978<br>
979REP_ALIVE: OK.<br>
980REP_ALIVE_REQ: OK.<br>
981REP_ALL_REQ: OK.<br>
982REP_BULK_LOG: OK.&nbsp; Clients check to send ACK.<br>
983REP_BULK_PAGE: Should never process one with lease granted.&nbsp; Add
984DB_ASSERT.<br>
985REP_DUPMASTER: Should never happen, this is what leases are supposed to
986prevent.&nbsp; See above.<br>
987REP_LOG: OK.&nbsp; Clients check to send ACK.<br>
988REP_LOG_MORE: OK <b>[maybe remove and
989use flag]</b> Clients check to send ACK.<br>
990REP_LOG_REQ: OK.<br>
991REP_MASTER_REQ: OK.<br>
992REP_NEWCLIENT: OK.<br>
993REP_NEWFILE: OK.&nbsp; Clients check to send ACK.<br>
994REP_NEWMASTER: See above.<br>
995REP_NEWSITE: OK.<br>
996REP_PAGE: OK.&nbsp; Should never process one with lease granted.&nbsp;
997Add DB_ASSERT.<br>
998REP_PAGE_FAIL:&nbsp; OK.&nbsp; Should never process one with lease
999granted.&nbsp; Add DB_ASSERT.<br>
1000REP_PAGE_MORE:&nbsp; OK.&nbsp; Should never process one with lease
1001granted.&nbsp; Add DB_ASSERT.<br>
1002REP_PAGE_REQ: OK.<br>
1003REP_REREQUEST: OK.<br>
1004REP_UPDATE: OK.&nbsp; Should never process one with lease
1005granted.&nbsp; Add DB_ASSERT.<br>
1006REP_UPDATE_REQ: OK.&nbsp; This is a master-only message.<br>
1007REP_VERIFY: OK.&nbsp; Should never process one with lease
1008granted.&nbsp; Add DB_ASSERT.<br>
1009REP_VERIFY_FAIL: OK.&nbsp; Should never process one with lease
1010granted.&nbsp; Add DB_ASSERT.<br>
1011REP_VERIFY_REQ: OK.<br>
1012REP_VOTE1: OK.&nbsp; See Election discussion above.&nbsp; It is
1013possible to receive one with a lease granted.&nbsp; Client cannot send
1014one with an outstanding lease however.<br>
1015REP_VOTE2: OK.&nbsp; See Election discussion above.&nbsp; It is
1016possible to receive one with a lease granted.<br>
1017<br>
1018If the following method or message processing is in progress and a
1019client wants to grant a lease, what should it do?&nbsp; Let's examine
1020what this means.&nbsp; The client wanting to grant a lease simply means
1021it is responding to the receipt of a <b>REP_LOG</b>
1022(or its variants) message and applying a log record.&nbsp; Therefore,
1023we need to consider a thread processing a log message racing with these
1024other actions.<br>
1025<br>
1026Other:<br>
1027log_archive: OK.&nbsp; <br>
1028dbenv-&gt;close: User error.&nbsp; User should not be closing the env
1029while other threads are using that handle.&nbsp; Should have no effect
1030if a 2nd dbenv handle to same env is closed.<br>
1031<br>
1032Rep Base API method:<br>
1033rep_elect: See Election discussion above.&nbsp; <i>rep_elect</i>
1034should wait and may grant
1035lease while election is in progress.<br>
1036rep_flush: Should not be called on client.<br>
1037rep_get_*: OK.<br>
1038rep_process_message: Generally OK.&nbsp; See handling each message
1039below.<br>
1040rep_set_config: OK.<br>
1041rep_set_limit: OK.<br>
1042rep_set_nsites: Must be called before <i>rep_start</i>
1043until 14778 is resolved.<br>
1044rep_set_priority: OK.<br>
1045rep_set_timeout: OK.<br>
1046rep_set_transport: OK.<br>
1047rep_start(MASTER): OK, can't happen - already protect racing <i>rep_start</i>
1048and <i>rep_process_message</i>.<br>
1049rep_start(CLIENT): OK, can't happen - already protect racing <i>rep_start</i>
1050and <i>rep_process_message</i>.<br>
1051rep_stat: OK.<br>
1052rep_sync: Shouldn't happen because client cannot grant leases during
1053sync-up.&nbsp; Incoming log message ignored.<br>
1054<br>
1055REP_ALIVE: OK.<br>
1056REP_ALIVE_REQ: OK.<br>
1057REP_ALL_REQ: OK.<br>
1058REP_BULK_LOG: OK.<br>
1059REP_BULK_PAGE: OK.&nbsp; Incoming log message ignored during internal
1060init.<br>
1061REP_DUPMASTER: Shouldn't happen.&nbsp; See DUPMASTER discussion above.<br>
1062REP_LOG: OK.<br>
1063REP_LOG_MORE: OK.<br>
1064REP_LOG_REQ: OK.<br>
1065REP_MASTER_REQ: OK.<br>
1066REP_NEWCLIENT: OK.<br>
1067REP_NEWFILE: OK.<br>
1068REP_NEWMASTER: See above.&nbsp; If a client accepts a new master
1069because its lease grant expired, then that master sends a message
1070requesting the lease grant, this client will not process the log record
1071if it is in sync-up recovery, or it may after the master switch is
1072complete and the client doesn't need sync-up recovery.&nbsp; Basically,
1073just uses existing log record processing/newmaster infrastructure.<br>
1074REP_NEWSITE: OK.<br>
1075REP_PAGE: OK.&nbsp; Receiving a log record during internal init PAGE
1076phase should ignore log record.<br>
1077REP_PAGE_FAIL: OK.<br>
1078REP_PAGE_MORE: OK.<br>
1079REP_PAGE_REQ: OK.<br>
1080REP_REREQUEST: OK.<br>
1081REP_UPDATE: OK.&nbsp; Receiving a log record during internal init
1082should ignore log record.<br>
1083REP_UPDATE_REQ: OK - master-only message.<br>
1084REP_VERIFY: OK.&nbsp; Receiving a log record during verify phase
1085ignores log record.<br>
1086REP_VERIFY_FAIL: OK.<br>
1087REP_VERIFY_REQ: OK.<br>
1088REP_VOTE1: OK.&nbsp; This client is processing someone else's vote when
1089the lease request comes in.&nbsp; That is fine.&nbsp; We protect our
1090own election and lease interaction in <i>__rep_elect</i>.<br>
1091REP_VOTE2: OK.<br>
1092<h4>Crashing - Potential Problem<br>
1093</h4>
1094It appears there is one area where we could have a problem.&nbsp; I
1095believe that crashes can cause us to break our guarantee on durability,
1096authoritative reads and inability to elect duplicate masters.&nbsp;
1097Consider this scenario:<br>
1098<ol>
1099  <li>A master and 4 clients are all up and running.</li>
1100  <li>The master commits a txn and all 4 clients refresh their lease
1101grants at time T.</li>
1102  <li>All 4 clients have the txn and log records in the cache.&nbsp;
1103None are flushing to disk.</li>
1104  <li>All 4 clients have responded to the PERM messages as well as
1105refreshed their lease with the master.</li>
1106  <li>All 4 clients hit the same application coding error and crash
1107(machine/OS stays up).</li>
1108  <li>Master authoritatively reads data in txn from step 2.</li>
  <li>All 4 clients restart the application and run recovery, thus the
txn from step 2 is lost on all clients because it isn't in any logs.<br>
1112  </li>
1113  <li>A network partition happens and the master is alone on its side.</li>
1114  <li>All 4 clients are on the other side and elect a new master.</li>
  <li>Partition resolves itself and we have duplicate masters, where
the former master still holds all valid lease grants.<br>
1118  </li>
1119</ol>
1120Therefore, we have broken both guarantees.&nbsp; In step 6 the data is
1121really not durable and we've given it to the user.&nbsp; One can argue
1122that if this is an issue the application better be syncing somewhere if
1123they really want durability.&nbsp; However, worse than that is that we
1124have a legitimate DUPMASTER situation in step 10 where both masters
1125hold valid leases.&nbsp; The reason is that all lease knowledge is in
1126the shared memory and that is lost when the app restarts and runs
1127recovery.<br>
1128<br>
1129How can we solve this?&nbsp; The obvious solution is (ugh, yet another)
1130durable BDB-owned file with some information in it, such as the current
1131lease expiration time so that rebooting after a crash leaves the
1132knowledge that the lease was granted.&nbsp; However, writing and
1133syncing every lease grant on every client out to disk is far too
1134expensive.<br>
1135<br>
A second possible solution is to have clients wait a full lease timeout
before entering their first election. This solves the DUPMASTER issue,
but not the non-authoritative read.&nbsp; It also falls naturally out of
how elections and leases interact: a client that has never granted a
lease should be treated as if it must wait a full lease timeout before
entering an election.&nbsp; Applications already know that leases affect
elections, and the cost does not seem so bad since it applies only to
the first election.<br>
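<br>
A minimal sketch of the first-election wait, assuming a hypothetical
per-client structure <i>lease_state</i> and times kept as plain
seconds.&nbsp; None of these names come from the Berkeley DB source;
they only illustrate treating a client with no recorded grant as if it
had granted a lease at startup.<br>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-client lease state; not the real Berkeley DB structure. */
struct lease_state {
	long start_time;     /* When this environment was (re)opened. */
	long grant_expire;   /* Expiration of the last grant, 0 if none recorded. */
	long lease_timeout;  /* Configured lease timeout. */
};

/*
 * Earliest time at which this client may enter an election.  A client
 * with no recorded grant may have lost one in a crash, so it is treated
 * as if it granted a lease at startup.
 */
static long
election_ok_time(const struct lease_state *ls)
{
	assert(ls->lease_timeout > 0);
	if (ls->grant_expire == 0)
		return (ls->start_time + ls->lease_timeout);
	return (ls->grant_expire);
}

/* True if the client may enter an election at time now. */
static bool
can_enter_election(const struct lease_state *ls, long now)
{
	return (now >= election_ok_time(ls));
}
```

With a startup time of 100 and a timeout of 10, a freshly recovered
client would refuse to enter an election at time 105 and allow it at
time 110, by which point any grant lost in the crash has expired.<br>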
1144<br>
Is it sufficient to document that an authoritative read is only as
authoritative as the durability guarantees the application makes on the
sites that indicate a transaction is permanent? Yes, I believe this is
sufficient.&nbsp; If the application says a transaction is permanent and
it really isn't, then the application is at fault.&nbsp; Believing the
application when it indicates with the PERM response that the
transaction is permanent avoids the authoritative read problem <span
 style="font-weight: bold;">[document this
application requirement]</span>.&nbsp; <br>
1153<h2>Upgrade/Mixed Versions</h2>
1154Clearly leases cannot be used with mixed version sites since masters
1155running older releases will not have any knowledge of lease
1156support.&nbsp; What considerations are needed in the lease code for
1157mixed versions?<br>
1158<br>
First, if the <b>REP_CONTROL</b>
structure changes, we need to maintain and use an old version of the
structure for talking to older clients and masters.&nbsp; The
implementation of this would be similar to the way we manage old <b>REP_VOTE_INFO</b>
structures.&nbsp;
Second, any new messages need translation table entries added.&nbsp;
Third, if we are assuming global leases, then clearly leases cannot be
configured or used in mixed version groups.&nbsp; Maintaining two
versions of the control structure is not necessary if we choose a
different style of implementation and don't change the control
structure.<br>
1170<br>
However, how could an old application then run continuously, upgrade
to the new release and take advantage of leases without taking down the
entire application?&nbsp; I believe it is possible for clients to be
configured for leases but be subject to the master regarding leases,
while the master code can assume that if it has leases configured, all
client sites do as well.&nbsp; In several places above I suggested that
a client could make a choice based on either a new <b>REPCTL_LEASE</b>
flag or simply having leases turned on locally.&nbsp; If we choose to
use the flag, then we can support leases with mixed versions.&nbsp; The
upgraded clients can configure leases, but no lease will be granted
until the old master is upgraded and sends a PERM message with the flag
indicating it wants a lease grant.&nbsp; Until then the clients, while
having leases configured, simply have an expired lease.&nbsp; Then, when
the old master finally upgrades, it too can configure leases and
suddenly all sites are using them.&nbsp; I believe this should work just
fine, and I will need to make sure a client grants a lease only in
response to the master asking for a grant.&nbsp; If the master never
asks, then the client has leases configured but doesn't grant them.<br>
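<br>
The flag-driven grant can be sketched as follows.&nbsp; This is only an
illustration under stated assumptions: the bit value given to <b>REPCTL_LEASE</b>,
the <i>client_lease</i> structure and the <i>client_process_perm</i>
function are all hypothetical and are not taken from the Berkeley DB
source.<br>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value; the real value, if any, is not given here. */
#define REPCTL_LEASE 0x01

/* Hypothetical client-side lease state. */
struct client_lease {
	bool configured;   /* Leases turned on locally. */
	long expire;       /* Expiration of the current grant; 0 means expired. */
};

/*
 * Handle a PERM message on a client.  Grant (or refresh) a lease only
 * when leases are configured locally and the master asked for a grant
 * by setting REPCTL_LEASE.  Returns true if a grant was made.
 */
static bool
client_process_perm(struct client_lease *cl, uint32_t ctl_flags,
    long now, long timeout)
{
	assert(timeout > 0);
	if (!cl->configured || (ctl_flags & REPCTL_LEASE) == 0)
		return (false);   /* Old master, or leases off: no grant. */
	cl->expire = now + timeout;
	return (true);
}
```

An upgraded client talking to an old master never sees the flag, so its
lease stays expired; once the master upgrades and sets the flag, the
same client starts granting without any reconfiguration.<br>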
1192<h2>Testing</h2>
1193Clearly any user-facing API changes will need the equivalent reflection
1194in the Tcl API for testing, under CONFIG_TEST.<br>
1195<br>
I am sure the list of tests will grow but off the top of my head:<br>
<ul>
  <li>Basic test: Have N sites all configure leases, run some
operations, read on master, etc.</li>
  <li>Refresh test: Perform update on master, sleep until past
expiration, read on master and make sure leases are refreshed and the
read is successful.</li>
  <li>Error test: Test error conditions (reading on client with leases
but no ignore flag, calling after rep_start, etc).</li>
  <li>Read test: Test reading on both client and master both with and
without the IGNORE flag.&nbsp; Test that data read with the ignore flag
can be rolled back.</li>
  <li>Dupmaster test: Force a DUPMASTER situation and verify that the
newer master cannot get a DUPMASTER error.</li>
  <li>Election test:
    <ul>
      <li>Call election while grant is outstanding and master exists.</li>
      <li>Call election while grant is outstanding and master does not
exist.</li>
      <li>Call election after expiration on quiescent system with
master existing.</li>
    </ul>
  </li>
  <li>Run with a group where some members have leases configured and
others do not to make sure we get errors instead of dumping core.</li>
</ul>
1215<br>
1218</body>
1219</html>
1220