1<!DOCTYPE doctype PUBLIC "-//w3c//dtd html 4.0 transitional//en">
2<html>
3<head>
4  <meta http-equiv="Content-Type"
5 content="text/html; charset=iso-8859-1">
6  <meta name="GENERATOR"
7 content="Mozilla/4.76 [en] (X11; U; FreeBSD 4.3-RELEASE i386) [Netscape]">
8  <title>Master Lease</title>
9</head>
10<body>
11<center>
12<h1>Master Leases for Berkeley DB</h1>
13</center>
14<center><i>Susan LoVerso</i> <br>
15<i>sue@sleepycat.com</i> <br>
16<i>Rev 1.1</i><br>
17<i>2007 Feb 2</i><br>
18</center>
19<p><br>
20</p>
21<h2>What are Master Leases?</h2>
A master lease is a mechanism whereby clients grant mastership rights
to a site, and that master, by holding those lease rights, can provide a
guarantee of durability to a replication group for a given period of
time.&nbsp; A client that has granted a lease to a master
will not participate in an election of a new
master until that granted lease has expired.&nbsp; By holding a
collection of granted leases, a master is able to service
authoritative read requests from applications.&nbsp; While it holds
leases, a read operation on a master can guarantee several things to the
application:<br>
32<ol>
33  <li>Authoritative reads: a guarantee that the data being read by the
34application is durable and can never be rolled back.</li>
35  <li>Freshness: a guarantee that the data being read by the
36application <b>at the master</b> is
37not stale.</li>
38  <li>Master viability: a guarantee that a current master with valid
39leases will not encounter a duplicate master situation.<br>
40  </li>
41</ol>
42<h2>Requirements</h2>
The requirements on DB to support leases include:<br>
<ul>
  <li>After turning leases on, users can choose whether or not to
ignore them in reads.</li>
47  <li>We are providing read authority on the master only.&nbsp; A
48read on a client is equivalent to a read while ignoring leases.</li>
49  <li>We guarantee that data committed on a master <b>that has been
50read by an application on the
51master</b> will not be rolled back.&nbsp; Data read on a client or
52while ignoring leases <i>or data
53successfully updated/committed but not read,</i>
54may be rolled back.<br>
55  </li>
  <li>A master will not return successfully from a read operation
unless it holds a
majority of leases or leases are ignored.</li>
59  <li>Master leases will remove the possibility of a current/correct
master being "shot down" by DUPMASTER.&nbsp; <b>NOTE: Old or expired
masters may, however, discover a
later master and return DUPMASTER to the application.</b><br>
63  </li>
64  <li>Any send callback failure must result in premature lease
65expiration on the master.<br>
66  </li>
67  <li>Users who change the system clock during master leases void the
68guarantee and may get undefined behavior.&nbsp; We assume time always
69runs forward. <br>
70  </li>
71  <li>Clients are forbidden from participating in elections while they
72have an outstanding lease granted to another site.</li>
73  <li>Clients are forbidden from accepting a new master while they have
74an outstanding lease granted to another site.</li>
75  <li>Clients are forbidden from upgrading themselves to master while
76they have an outstanding lease granted to another site.</li>
77  <li>When asked for a lease grant explicitly by the master, the client
78cannot grant the lease to the master unless the LSN in the master's
79request has been processed by this client.<br>
80  </li>
81</ul>
82The requirements of the
83application using leases include:<br>
84<ul>
85  <li>Users must implement (Base API users on their own, RepMgr users
86via configuration) a majority (or larger) ACK policy. <br>
87  </li>
88  <li>The application must use the election mechanism to decide a master.
89It may not simply declare a site master.</li>
90  <li>The send callback must return an error if the majority ACK policy
91is not met for PERM records.</li>
92  <li>Users must set the number of sites in the group.</li>
93  <li>Using leases in a replication group is all-or-none.&nbsp;
94Therefore, if a site knows it is using leases, it can assume other
95sites are also.<br>
96  </li>
97  <li>All applications that care about read guarantees must forward or
98perform all reads on the master.&nbsp; Reading on the client means a
99read ignoring leases. </li>
100</ul>
101<p>There are some open questions
102remaining.</p>
103<ul>
  <li>There is one major showstopper issue; see "Crashing - Potential
problem" near the end of the document.&nbsp; We need a better solution
than the one shown there (writing to disk every time a lease is
granted).&nbsp; Perhaps it is enough to document that durability means
the data must be flushed to disk before success is returned, to avoid
that situation?<br>
109  </li>
110  <li>What about db-&gt;join?&nbsp; Users can call join, but the calls
111on the join cursor to get the data would be subject to leases and
112therefore protected.&nbsp; Ok, this is not an open question.</li>
113  <li>What about other read-like operations?&nbsp; Clearly <i>
114DB-&gt;get, DB-&gt;pget, DBC-&gt;get,
115DBC-&gt;pget</i> need lease checks.&nbsp; However, other APIs use
116keys.&nbsp; <i>DB-&gt;key_range</i>
117provides an estimate only so it shouldn't need lease checks. <i>
DB-&gt;stat</i> provides exact counts
in the <i>bt_nkeys</i> and <i>bt_ndata</i> fields.&nbsp; Are those
fields authoritative enough that providing their values implies a
durability guarantee, so that <i>DB-&gt;stat</i>
should be subject to lease verification?&nbsp; <i>DBC-&gt;count</i>
returns
the number of data items associated with a key.&nbsp; Is this
authoritative information?&nbsp; This is similar to stat: should it be
subject to lease verification?<br>
127  </li>
128  <li>Do we require master lease checks on write operations?&nbsp; I
129think lease checks are not needed on write operations.&nbsp; It doesn't
130add correctness and adds a lot of complexity (checking leases in put,
131del, and cursors, then what about rename, remove, etc).<br>
132  </li>
133  <li>Do master leases give an iron-clad guarantee of never rolling
134back a transaction? No, but it should mean that a committed transaction
135can never be <b>read</b> on a master
136unless the lease is valid.&nbsp; A committed transaction on a master
137that has never been presented to the application may get rolled back.<br>
138  </li>
139  <li>Do we need to quarantine or prevent reads on an ex-master until
140sync-up is done?&nbsp; No.&nbsp; A master that is simply downgraded to
141client or crashes and reboots is now a client.&nbsp; Reading from that
142client is the same as saying Ignore Leases.</li>
143  <li>What about adding and removing sites while leases are
144active?&nbsp; This is SR 14778.&nbsp; A consistent <i>nsites</i> value
145is required by master
146leases.&nbsp; &nbsp; It isn't
147clear to me what a master is
148supposed to do if the value of nsites gets smaller while leases are
149active.&nbsp; Perhaps it leaves its larger table intact and simply
150checks for a smaller number of granted leases?<br>
151  </li>
152  <li>Can users turn leases off?&nbsp; No.&nbsp; There is no planned <i>turn
153leases off</i> API.</li>
154  <li>Clock skew will be a percentage.&nbsp; However, the smallest, 1%,
155is probably rather large for clock skew.&nbsp; Percentage was chosen
156for simplicity and similarity to other APIs.&nbsp; What granularity is
157appropriate here?</li>
158</ul>
159<h2>API Changes</h2>
The API changes visible
to the user are fairly minimal.&nbsp;
There are a few API calls the user needs to make to configure master
leases, and then there is the API call to turn them on.&nbsp; There is
also a new flag for existing APIs that allows read operations to ignore
leases and return data that
may not be durable.<br>
167<h3>Lease Timeout<br>
168</h3>
There is a new timeout the user
must configure for leases called <b>DB_REP_LEASE_TIMEOUT</b>.&nbsp;
This timeout is a new option to
the <i>dbenv-&gt;rep_set_timeout</i> method. <b>DB_REP_LEASE_TIMEOUT</b>
has no default, and the user must configure it
before turning on leases (obviously, this timeout need not be set if
leases will not be used).&nbsp; The timeout is the amount of time
the lease is valid on the master and how long it is granted
on the client.&nbsp; It must have the same
value on all sites (like log file size).&nbsp; The timeout used when
refreshing leases is the <b>DB_REP_ACK_TIMEOUT</b>
for RepMgr applications.&nbsp; For Base API applications, lease
refreshes will use the same mechanism as <b>PERM</b> messages and they
should
have no additional burden.&nbsp; This refresh timeout is the amount of
time a reader will wait to refresh
leases before returning failure to the application from a read
operation.<br>
187<br>
The lease timeout will be stored
both with its original value and as a <i>db_timespec</i>,
converted
using the <b>DB_TIMEOUT_TO_TIMESPEC</b>
macro with the clock skew accounted for.&nbsp; Both are stored in the
shared rep structure:<br>
194<pre>db_timeout_t lease_timeout;<br>db_timespec lease_duration;<br></pre>
195NOTE:&nbsp; By sending the lease refresh during DB operations, we are
196forcing/assuming that the operation's process has a replication
197transport function set.&nbsp; That is obviously the case for write
198operations, but would it be a burden for read processes (on a
199master)?&nbsp; I think mostly not, but if we need leases for <i>
200DB-&gt;stat</i> then we need to
201document it as it is certainly possible for an application to have a
202separate or dedicated <i>stat</i>
203application or attempt to use <i>db_stat</i>
204(which will not work if leases must be checked).<br>
205<br>
Leases should be checked after the local operation.&nbsp; If we were to
check leases first, we could get descheduled, then lose our lease, and
then perform the operation, leaving a window in which the read is
unprotected.&nbsp;
Do the operation, then check leases before returning to the user.<br>
210<h3>Using Leases</h3>
211There is a new API that the user must call to tell the system to use
212the lease mechanism.&nbsp; The method must be called before the
213application calls <i>dbenv-&gt;rep_start</i>
214or <i>dbenv-&gt;repmgr_start</i>.
215This new
216method is:<br>
217<br>
218<pre>&nbsp;&nbsp;&nbsp; dbenv-&gt;rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)<br>
219</pre>
220The <i>clock_scale_factor</i>
parameter is interpreted as a percentage, greater than 100 (to transmit
a floating point number as an integer to the API), that represents the
maximum skew between any two sites' clocks.&nbsp; That is, a <span
 style="font-style: italic;">clock_scale_factor</span> of 150 indicates
that the greatest discrepancy between clocks is that one runs 50%
faster than the others.&nbsp; Both the
227master and client sides
228compensate for possible clock skew.&nbsp; The master uses the value to
229compensate in case the replica has a slow clock and replicas compensate
230in case they have a fast clock.&nbsp; This scaling factor will need to
231be divided by 100 on all sites to truly represent the percentage for
232adjustments made to time values.<br>
233<br>
234Assume the slowest replica's clock is a factor of <i>clock_scale_factor</i>
235slower than the
236fastest clock.&nbsp; Using that assumption, if the fastest clock goes
237from time t1 to t2 in X
238seconds, the slowest clock does it in (<i>clock_scale_factor</i> / 100)
239* X seconds.<br>
240<br>
241The <i>flags</i> parameter is not
242currently used.<br>
243<br>
244When the <i>dbenv-&gt;rep_set_lease</i>
245method is called, we will set a configuration flag indicating that
246leases are turned on:<br>
247<b>#define REP_C_LEASE &lt;value&gt;</b>.&nbsp;
248We will also record the <b>u_int32_t
249clock_skew</b> value passed in.&nbsp; The <i>rep_set_lease</i> method
250will not allow
251calls after <i>rep_start.&nbsp; </i>If
252multiple calls are made prior to calling <i>rep_start</i> then later
253calls will
254overwrite the earlier clock skew value.&nbsp; <br>
255<br>
256We need a new flag to prevent calling <i>rep_set_lease</i>
257after <i>rep_start</i>.&nbsp; The
258simplest solution would be to reject the call to
259<i>rep_set_lease&nbsp;
260</i>if<b>
261REP_F_CLIENT</b>
262or <b>REP_F_MASTER</b> is set.&nbsp;
263However that does not work in the cases where a site cleanly closes its
264environment and then opens without running recovery.&nbsp; The
265replication state will still be set.&nbsp; The prevention will be
266implemented as:<br>
267<pre>#define REP_F_START_CALLED &lt;some bit value&gt;<br></pre>
268In __rep_start, at the end:<br>
<pre>if (ret == 0) {<br>	REP_SYSTEM_LOCK(dbenv);<br>	F_SET(rep, REP_F_START_CALLED);<br>	REP_SYSTEM_UNLOCK(dbenv);<br>}</pre>
270In <i>__rep_env_refresh</i>, if we
271are the last reference closing the env (we already check for that):<br>
272<pre>F_CLR(rep, REP_F_START_CALLED);</pre>
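<p>The guard described above can be sketched in isolation as follows.
This is a minimal illustration, not the DB implementation: the
<i>sketch_*</i> names, the structure layout, and the bit value are all
hypothetical, and the region locking shown in the fragments above is
omitted.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_F_START_CALLED	0x01u	/* hypothetical bit value */

/* Hypothetical stand-in for the shared replication structure. */
struct sketch_rep {
	uint32_t flags;
	uint32_t clock_skew;
	int lease_enabled;
};

/* Configure leases; rejected once sketch_start() has been called. */
int
sketch_set_lease(struct sketch_rep *rep, uint32_t clock_scale_factor)
{
	assert(rep != NULL);
	if (rep->flags & SKETCH_F_START_CALLED)
		return (EINVAL);
	/* Calls made before start simply overwrite the earlier value. */
	rep->clock_skew = clock_scale_factor;
	rep->lease_enabled = 1;
	return (0);
}

/* Record that start was called (locking omitted in this sketch). */
int
sketch_start(struct sketch_rep *rep)
{
	assert(rep != NULL);
	rep->flags |= SKETCH_F_START_CALLED;
	return (0);
}

/* The last reference closing the environment clears the flag. */
void
sketch_env_refresh(struct sketch_rep *rep)
{
	assert(rep != NULL);
	rep->flags &= ~SKETCH_F_START_CALLED;
}
```

<p>Because the flag is cleared when the last reference closes the
environment, a site that closes cleanly and reopens without recovery can
configure leases again, even though its replication state is still set.</p>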
273In order to avoid run-time floating point operations
274on <i>db_timespec</i> structures,
275when a site is declared as a client or master in <i>rep_start</i> we
276will pre-compute the
277lease duration based on the integer-based clock skew and the
278integer-based lease timeout.&nbsp; A master should set a replica's
279lease expiration to the <b>start time of
280the sent message +
281(lease_timeout / clock_scale_factor)</b> in case the replica has a
282slow clock.&nbsp; Replicas extend their leases to <b>received message
283time + (lease_timeout *
284clock_scale_factor)</b> in case this replica has a fast clock.&nbsp;
285Therefore, the computation will be as follows if the site is becoming a
286master:<br>
287<pre>db_timeout_t tmp;<br>tmp = (db_timeout_t)((double)rep-&gt;lease_timeout / ((double)rep-&gt;clock_skew / (double)100));<br>rep-&gt;lease_duration = DB_TIMEOUT_TO_TIMESPEC(&amp;tmp);<br></pre>
288Similarly, on a client the computation is:<br>
289<pre>tmp = (db_timeout_t)((double)rep-&gt;lease_timeout * ((double)rep-&gt;clock_skew / (double)100));<br></pre>
290When a site changes state, its lease duration will change based on
291whether it is becoming a master or client and it will be recomputed
292from the original values.&nbsp; Note that these computations, coupled
293with the fact that the lease on the master is computed based on the
294master's time that it sent the message means that leases on the master
295are more conservatively computed than on the clients.<br>
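<p>As a worked illustration of the two formulas, the sketch below does
the same scaling with integer arithmetic instead of the <i>double</i>
casts shown above.  The function names and the choice of microseconds as
the unit are assumptions made for the example only.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for db_timeout_t; assumed to be microseconds. */
typedef uint64_t sketch_timeout_t;

/*
 * Master side: shrink the configured timeout by the skew factor, in
 * case the replica's clock runs slow.  clock_scale_factor is a
 * percentage of at least 100.
 */
sketch_timeout_t
sketch_master_duration(sketch_timeout_t lease_timeout,
    uint32_t clock_scale_factor)
{
	assert(clock_scale_factor >= 100);
	return ((lease_timeout * 100) / clock_scale_factor);
}

/*
 * Client side: stretch the configured timeout by the skew factor, in
 * case this replica's clock runs fast.
 */
sketch_timeout_t
sketch_client_duration(sketch_timeout_t lease_timeout,
    uint32_t clock_scale_factor)
{
	assert(clock_scale_factor >= 100);
	return ((lease_timeout * clock_scale_factor) / 100);
}
```

<p>With a lease timeout of 3,000,000 microseconds and a
<i>clock_scale_factor</i> of 150, the master treats a lease as valid for
2,000,000 microseconds while the client honors its grant for 4,500,000,
so the master always gives up its authority before the client considers
the grant expired.</p>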
296<br>
297The <i>dbenv-&gt;rep_set_lease</i>
298method must be called after <i>dbenv-&gt;open</i>,
299similar to <i>dbenv-&gt;rep_set_config</i>.&nbsp;
300The reason is so that we can check that this is a replication
301environment and we have access to the replication shared memory region.<br>
302<h3>Read Operations<br>
303</h3>
304Authoritative read operations on the master with leases enabled will
305abide by leases by default.&nbsp; We will provide a flag that allows an
306operation on a master to ignore leases.&nbsp; <b>All read operations
307on a client imply
308ignoring leases.</b> If an application wants authoritative reads
309they must forward the read requests to the master and it is the
310application's responsibility to provide the forwarding.
311The consensus was that forcing <span style="font-weight: bold;">DB_IGNORE_LEASE</span>
312on client read operations (with leases enabled, obviously) was too
313heavy handed.&nbsp; Read operations on the client will ignore leases,
314but do no special flag checking.<br>
315<br>
316The flag will be called <b>DB_IGNORE_LEASE</b>
317and it will be a flag that can be OR'd into the DB access method and
318cursor operation values.&nbsp; It will be similar to the <b>DB_READ_UNCOMMITTED</b>
319flag.
320<br>
The methods that will
322adhere to leases are:<br>
323<ul>
  <li><i>DB-&gt;get</i></li>
  <li><i>DB-&gt;pget</i></li>
  <li><i>DBC-&gt;get</i></li>
  <li><i>DBC-&gt;pget</i></li>
328</ul>
329The code that will check leases for a client reading would look
330something
331like this, if we decide to become heavy-handed:<br>
332<pre>if (IS_REP_CLIENT(dbenv)) {<br>	[get to rep structure]<br>	if (FLD_ISSET(rep-&gt;config, REP_C_LEASE) &amp;&amp; !LF_ISSET(DB_IGNORE_LEASE)) {<br>		db_err("Read operations must ignore leases or go to master");<br>		ret = EINVAL;<br>		goto err;<br>	}<br>}<br></pre>
333On the master, the new code to abide by leases is more complex.&nbsp;
334After the call to perform the operation we will check the lease.&nbsp;
335In that checking code, the master will see if it has a valid
336lease.&nbsp; If so, then all is well.&nbsp; If not, it will try to
337refresh the leases.&nbsp; If that refresh attempt results in leases,
338all is well.&nbsp; If the refresh attempt does not get leases, then the
339master cannot respond to the read as an authority and we return an
340error.&nbsp; The new error is called <b>DB_REP_LEASE_EXPIRED</b>.&nbsp;
The master lease check is placed after the internal call
to read the data has succeeded:<br>
343<pre>if (IS_REP_MASTER(dbenv) &amp;&amp; !LF_ISSET(DB_IGNORE_LEASE)) {<br>	[get to rep structure]<br>	if (FLD_ISSET(rep-&gt;config, REP_C_LEASE) &amp;&amp;<br>	    (ret = __rep_lease_check(dbenv)) != 0) {<br>		/*<br>		 * We don't hold the lease.<br>		 */<br>		goto err;<br>	}<br>}<br></pre>
344See below for the details of <i>__rep_lease_check</i>.<br>
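<p>The rules in this section about when a read is subject to a lease
check can be summarized in one small decision function.  This is only a
sketch: the structure, the flag value, and the error value are
hypothetical, and the <i>lease_held</i> field stands in for whatever the
real lease check would report at that moment.</p>

```c
#include <assert.h>

#define SKETCH_IGNORE_LEASE	0x01u	/* hypothetical flag value */
#define SKETCH_LEASE_EXPIRED	(-1)	/* hypothetical error value */

/* Hypothetical view of the local site's state. */
struct sketch_site {
	int is_master;
	int leases_configured;
	int lease_held;		/* What a lease check would report now. */
};

/*
 * Decide the final return value of a read.  local_ret is the result of
 * the local read, which has already been performed: the lease is only
 * consulted afterward, so there is no window between check and read.
 */
int
sketch_read_result(const struct sketch_site *site, unsigned int flags,
    int local_ret)
{
	if (local_ret != 0)
		return (local_ret);	/* The read itself failed. */
	if (!site->is_master)
		return (0);		/* Client reads ignore leases. */
	if (flags & SKETCH_IGNORE_LEASE)
		return (0);		/* Caller opted out. */
	if (!site->leases_configured)
		return (0);		/* The flag is a no-op here. */
	return (site->lease_held ? 0 : SKETCH_LEASE_EXPIRED);
}
```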
345<br>
346Also note that if leases (or replication) are not configured, then <span
347 style="font-weight: bold;">DB_IGNORE_LEASE</span> is a no-op.&nbsp; It
348is ignored (and won't error) if used when leases are not in
349effect.&nbsp; The reason is so that we can generically set that flag in
350utility programs like <span style="font-style: italic;">db_dump</span>
351that walk the database with a cursor.&nbsp; Note that <span
352 style="font-style: italic;">db_dump</span> is the only utility that
reads with a cursor.<br>
355<h3><b>Nsites
356and Elections</b></h3>
357The call to <i>dbenv-&gt;rep_set_nsites</i>
358must be performed before the call to <i>dbenv-&gt;rep_start</i>
359or <i>dbenv-&gt;repmgr_start</i>.&nbsp;
360This document assumes either that <b>SR
36114778</b> gets resolved, or assumes that the value of <i>nsites</i> is
362immutable.&nbsp; The
363master and all clients need to know how many sites and leases are in
364the group.&nbsp; Clients need to know for elections.&nbsp; The master
365needs to know for the size of the lease table and to know what value a
366majority of the group is. <b>[Until
36714778 is resolved, the master lease work must assume <i>nsites</i> is
368immutable and will
369therefore enforce that this is called before <i>rep_start</i> using
370the same mechanism
371as <i>rep_set_lease</i>.]</b><br>
372<br>
373Elections and leases need to agree on the number of sites in the
374group.&nbsp; Therefore, when leases are in effect on clients, all calls
375to <i>dbenv-&gt;rep_elect</i> must
376set the <i>nsites</i> parameter to
3770.&nbsp; The <i>rep_elect</i> code
378path will return <b>EINVAL</b> if <b>REP_C_LEASE</b> is set and <i>nsites</i>
379is non-0.
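<p>A minimal sketch of that argument check follows.  The function name
and the bit value are hypothetical; only the rule itself (leases
configured and a non-zero <i>nsites</i> yields <b>EINVAL</b>) comes from
the text above.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_C_LEASE	0x01u	/* hypothetical bit value */

/*
 * Validate the nsites argument of an election call.  With leases
 * configured, the caller must pass 0 so the election uses the group
 * size that was configured before start.
 */
int
sketch_elect_nsites_check(uint32_t config, uint32_t nsites)
{
	if ((config & SKETCH_C_LEASE) && nsites != 0)
		return (EINVAL);
	return (0);
}
```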
380<h2>Lease Management</h2>
381<h3>Message Changes</h3>
382In order for clients to grant leases to the master a new message type
383must be added for that purpose.&nbsp; This will be the <b>REP_LEASE_GRANT</b>
384message.&nbsp;
385Granting leases will be a result of applying a <b>DB_REP_PERMANENT</b>
386record and therefore we
387do not need any additional message in order for a master to request a
388lease grant.&nbsp; The <b>REP_LEASE_GRANT</b>
389message will pass a structure as its message DBT:<br>
<pre>typedef struct __rep_lease_grant {<br>	db_timespec msg_time;<br>#ifdef DIAGNOSTIC<br>	db_timespec expire_time;<br>#endif<br>} REP_GRANT_INFO;<br></pre>
391In the <b>REP_LEASE_GRANT</b>
392message, the client is actually giving the master several pieces of
393information.&nbsp; We only need the echoed <i>msg_time</i> in this
394structure because
395everything else is already sent.&nbsp; The client is really sending the
396master:<br>
397<ul>
398  <li>Its EID (parameter to <span style="font-style: italic;">rep_send_message</span>
399and <span style="font-style: italic;">rep_process_message</span>)<br>
400  </li>
401  <li>The PERM LSN this message acknowledged (sent in the control
402message)</li>
403  <li>Unique identifier echoed back to master (<i>msg_time</i> sent in
404message as above)</li>
405</ul>
406On the client, we always maintain the maximum PERM LSN already in <i>lp-&gt;max_perm_lsn</i>.&nbsp;
407<h3>Local State Management</h3>
408Each client must maintain a <i>db_timespec</i>
409timestamp containing the expiration of its granted lease.&nbsp; This
410field will be in the replication shared memory structure:<br>
411<pre>db_timespec grant_expire;<br></pre>
412This timestamp already takes into account the clock skew.&nbsp; All
413new fields must be initialized when the region is created. Whenever we
414grant our master lease and want to send the <b>REP_LEASE_GRANT</b>
415message, this value
416will be updated.&nbsp; It will be used in the following way:
<pre>db_timespec mytime;<br>DB_LSN perm_lsn;<br>DBT lease_dbt;<br>REP_GRANT_INFO gi;<br><br>timespecclear(&amp;mytime);<br>memset(&amp;lease_dbt, 0, sizeof(lease_dbt));<br>memset(&amp;gi, 0, sizeof(gi));<br>/* Candidate expiration: now plus the skew-adjusted lease duration. */<br>__os_gettime(dbenv, &amp;mytime);<br>timespecadd(&amp;mytime, &amp;rep-&gt;lease_duration);<br>MUTEX_LOCK(rep-&gt;clientdb_mutex);<br>perm_lsn = lp-&gt;max_perm_lsn;<br>MUTEX_UNLOCK(rep-&gt;clientdb_mutex);<br>REP_SYSTEM_LOCK(dbenv);<br>/* Only ever extend the grant; never move the expiration backward. */<br>if (timespeccmp(&amp;mytime, &amp;rep-&gt;grant_expire, &gt;))<br>	rep-&gt;grant_expire = mytime;<br>/* Echo the master's message time back as the unique ID. */<br>gi.msg_time = msg-&gt;msg_time;<br>#ifdef DIAGNOSTIC<br>gi.expire_time = rep-&gt;grant_expire;<br>#endif<br>lease_dbt.data = &amp;gi;<br>lease_dbt.size = sizeof(gi);<br>REP_SYSTEM_UNLOCK(dbenv);<br>__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &amp;perm_lsn, &amp;lease_dbt, 0, 0);<br></pre>
418This updating of the lease grant will occur in the <b>PERM</b> code
419path when we have
420successfully applied the permanent record.<br>
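<p>The essential property of the client-side update is that the grant
expiration only moves forward.  The sketch below isolates that step
using a hypothetical <i>sketch_ts</i> type in place of
<i>db_timespec</i>; the helper names are illustrative and the locking is
omitted.</p>

```c
#include <assert.h>

/* Hypothetical stand-in for db_timespec. */
struct sketch_ts {
	long sec;
	long nsec;
};

/* Return nonzero if a is strictly later than b. */
static int
sketch_ts_after(const struct sketch_ts *a, const struct sketch_ts *b)
{
	if (a->sec != b->sec)
		return (a->sec > b->sec);
	return (a->nsec > b->nsec);
}

/* Return t + d, normalizing the nanosecond field. */
static struct sketch_ts
sketch_ts_add(struct sketch_ts t, struct sketch_ts d)
{
	t.sec += d.sec;
	t.nsec += d.nsec;
	if (t.nsec >= 1000000000L) {
		t.sec++;
		t.nsec -= 1000000000L;
	}
	return (t);
}

/*
 * Extend the client's grant to now + lease_duration, but never move an
 * existing expiration backward.
 */
void
sketch_extend_grant(struct sketch_ts *grant_expire, struct sketch_ts now,
    struct sketch_ts lease_duration)
{
	struct sketch_ts candidate;

	candidate = sketch_ts_add(now, lease_duration);
	if (sketch_ts_after(&candidate, grant_expire))
		*grant_expire = candidate;
}
```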
421<h3>Maintaining Leases on the
422Master/Rep_start</h3>
423The master maintains a lease table that it checks when fulfilling a
424read request that is subject to leases.&nbsp; This table is initialized
425when a site calls<i>
426dbenv-&gt;rep_start(DB_MASTER)</i> and the site is undergoing a role
427change (i.e. a master making additional calls to <i>dbenv-&gt;rep_start(DB_MASTER)</i>
428does
429not affect an already existing table).<br>
430<br>
431When a non-master site becomes master, it must do two things related to
432leases on a role change.&nbsp; First, a client cannot upgrade to master
433while it has an outstanding lease granted to another site.&nbsp; If a
434client attempts to do so, an error, <b>EINVAL</b>,
435will be returned.&nbsp; The only way this should happen is if the
436application simply declares a site master, instead of using
437elections.&nbsp; Elections will already wait for leases to expire
438before proceeding. (See below.) 
439<br>
440<br>
441Second, once we are proceeding with becoming a master, the site must
442allocate the table it will use to maintain lease information.&nbsp;
443This table will be sized based on <i>nsites</i>
444and it will be an array of the following structure:<br>
<pre>typedef struct {<br>	int eid;			/* EID of client site. */<br>	db_timespec start_time;	/* Unique time ID client echoes back on grants. */<br>	db_timespec end_time;	/* Master's lease expiration time. */<br>	DB_LSN lease_lsn;	/* Durable LSN this lease applies to. */<br>	u_int32_t flags;	/* Unused for now?? */<br>} REP_LEASE_ENTRY;<br></pre>
446<h3>Granting Leases</h3>
447It is the burden of the application to make sure that all sites in the
448group
449are using leases, or none are.&nbsp; Therefore, when a client processes
450a <b>PERM</b>
451log record that arrived from the master, it will grant its lease
452automatically if that record is permanent (i.e. <b>DB_REP_ISPERM</b>
453is being returned),
454and leases are configured.&nbsp; A client will not send a
455lease grant when it is processing log records (even <b>PERM</b>
456ones) it receives from other clients that use client-to-client
457synchronization.&nbsp; The reason is that the master requires a unique
458time-of-msg ID (see below) that the client echoes back in its lease
459grant and it will not have such an ID from another client.<br>
460<br>
The master stores a time-of-msg ID in each message and the client
simply echoes it back to the master.&nbsp; In its lease table, the
master keeps the base
time-of-msg for a valid lease.&nbsp; When a <b>REP_LEASE_GRANT</b>
message comes in,
the master does a number of things:<br>
467<ol>
468  <li>Pulls the echoed timespec from the client message, into <i>msg_time</i>.<br>
469  </li>
470  <li>Finds the entry in its lease table for the client's EID.&nbsp; It
471walks the table searching for the ID.&nbsp; EIDs of <span
472 style="font-weight: bold;">DB_EID_INVALID</span> are
473illegal.&nbsp; Either the master will find the entry, or it will find
474an empty slot in the table (i.e. it is still populating the table with
475leases).</li>
476  <li>If this is a previously unknown site lease, the master
477initializes the entry by copying to the <i>eid</i>, <i>start_time, </i>and
478    <i>lease_lsn</i> fields.&nbsp; The master
479also computes the <i>end_time</i>
480based on the adjusted <i>rep-&gt;lease_duration</i>.</li>
481  <li>If this is a lease from a previously known site, the master must
482perform <i>timespeccmp(&amp;msg_time,
483&amp;table[i].start_time, &gt;)</i> and only update the <i>end_time</i>
484of the lease when this is
485a more recent message.&nbsp; If it is a more recent message, then we
486should update
487the <i>lease_lsn</i> to the LSN in
488the message.</li>
  <li>For diagnostic purposes only, I also plan to send the client's
expiration time.&nbsp; Lease durations are computed taking the clock
skew into account: clients compute them based on the current time, and
the master computes them based on the original sending time.&nbsp; The
client errs on the side of computing a larger lease expiration time and
the master errs on the side of computing a smaller duration.&nbsp;
Since both take the clock skew
into account, the client's expiration time should never be
smaller than
the master's computed expiration time; if it is, their value for clock
skew may not be correct.<br>
500  </li>
501</ol>
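<p>Steps 1 through 4 above can be sketched as a single table update.
This is an illustration under stated assumptions rather than the
planned code: the <i>sketch_*</i> types stand in for <i>db_timespec</i>,
<i>DB_LSN</i> and <i>REP_LEASE_ENTRY</i>, the invalid-EID value is a
placeholder, and advancing <i>start_time</i> on a newer message is
implied by the comparison in step 4 rather than stated in it.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_EID_INVALID	(-2)	/* placeholder for DB_EID_INVALID */

struct sketch_ts { long sec; long nsec; };
struct sketch_lsn { uint32_t file; uint32_t offset; };

struct sketch_lease_entry {
	int eid;			/* EID of client site. */
	struct sketch_ts start_time;	/* Time ID echoed by the client. */
	struct sketch_ts end_time;	/* Master's lease expiration. */
	struct sketch_lsn lease_lsn;	/* LSN this lease applies to. */
};

/* Return nonzero if a is strictly later than b. */
static int
sketch_ts_after(const struct sketch_ts *a, const struct sketch_ts *b)
{
	if (a->sec != b->sec)
		return (a->sec > b->sec);
	return (a->nsec > b->nsec);
}

/* Fill an entry; end_time is based on the master's own send time. */
static void
sketch_set_entry(struct sketch_lease_entry *le, int eid,
    struct sketch_ts msg_time, struct sketch_lsn lsn, struct sketch_ts dur)
{
	le->eid = eid;
	le->start_time = msg_time;
	le->end_time.sec = msg_time.sec + dur.sec;
	le->end_time.nsec = msg_time.nsec + dur.nsec;
	if (le->end_time.nsec >= 1000000000L) {
		le->end_time.sec++;
		le->end_time.nsec -= 1000000000L;
	}
	le->lease_lsn = lsn;
}

/* Mark every slot empty; done when a site becomes master. */
void
sketch_table_init(struct sketch_lease_entry *table, size_t n)
{
	size_t i;

	memset(table, 0, n * sizeof(table[0]));
	for (i = 0; i < n; i++)
		table[i].eid = SKETCH_EID_INVALID;
}

/* Process one grant; returns 0, or -1 for a bad EID or a full table. */
int
sketch_lease_grant(struct sketch_lease_entry *table, size_t n, int eid,
    struct sketch_ts msg_time, struct sketch_lsn lsn, struct sketch_ts dur)
{
	size_t i;

	if (eid == SKETCH_EID_INVALID)
		return (-1);
	for (i = 0; i < n; i++) {
		if (table[i].eid == SKETCH_EID_INVALID) {
			/* Previously unknown site: take the empty slot. */
			sketch_set_entry(&table[i], eid, msg_time, lsn, dur);
			return (0);
		}
		if (table[i].eid == eid) {
			/* Known site: only a newer message updates it. */
			if (sketch_ts_after(&msg_time, &table[i].start_time))
				sketch_set_entry(&table[i],
				    eid, msg_time, lsn, dur);
			return (0);
		}
	}
	return (-1);
}
```

<p>The walk relies on the table being filled from the front and on
entries never being removed, so an empty slot is only reached after
every known site has been examined.</p>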
502Any log records (new or resent) that originate from the master and
503result in <b>DB_REP_ISPERM</b> get an
504ack.<br>
505<br>
506<h3>Refreshing Leases</h3>
507Leases get refreshed when a master receives a <b>REP_LEASE_GRANT</b>
508message from a client. There are three pieces to lease
509refreshment.&nbsp; <br>
510<h4>Lazy Lease Refreshing on Read<br>
511</h4>
512If the master discovers that leases are
513expired during the read operation, it attempts to refresh its
514collection of lease grants.&nbsp; It does this by calling a new
515function <i>__rep_lease_refresh</i>.&nbsp;
516This function is very similar to the already-existing function <i>__rep_flush</i>.&nbsp;
517Basically, to
518refresh the lease, the master simply needs to resend the last PERM
519record to the clients.&nbsp; The requirements state that when the
520application send function returns successfully from sending a PERM
521record, the majority of clients have that PERM LSN durable.&nbsp; We
522will have a new public DB error return called <b>DB_REP_LEASE_EXPIRED</b>
523that will be
524returned back to the caller if the master cannot assert its
525authority.&nbsp; The code will look something like this:<br>
526<pre>/*<br> * Use lp-&gt;max_perm_lsn on the master (currently not used on the master)<br> * to keep track of the last PERM record written through the logging system.<br> * need to initialize lp-&gt;max_perm_lsn in rep_start on role_chg.<br> */<br>call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT<br>if failure<br>	expire leases<br>	return lease expired error to caller<br>else /* success */<br>	recheck lease table<br>	/*<br>	 * We need to recheck the lease table because the client<br>	 * lease grant messages may not be processed yet, or got<br>	 * lost, or racing with the application's ACK messages or<br>	 * whatever. <br>	 */<br>	if we have a majority of valid leases<br>		return success<br>	else<br>		return lease expired error to caller <br></pre>
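<p>The decision made after the resend can also be written as compilable
C.  The sketch below assumes a simplified lease table and hypothetical
names throughout; the number of grants that constitutes a majority is
passed in as <i>needed</i> rather than derived from <i>nsites</i>, and
"expire leases" is modeled by zeroing each expiration time, which is an
assumption about how expiration would be represented.</p>

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EID_INVALID	(-2)	/* placeholder for DB_EID_INVALID */
#define SKETCH_LEASE_EXPIRED	(-1)	/* hypothetical error value */

struct sketch_ts { long sec; long nsec; };

struct sketch_lease {
	int eid;
	struct sketch_ts end_time;
};

/* Return nonzero if a is strictly later than b. */
static int
sketch_ts_after(const struct sketch_ts *a, const struct sketch_ts *b)
{
	if (a->sec != b->sec)
		return (a->sec > b->sec);
	return (a->nsec > b->nsec);
}

/* Count the entries whose lease has not yet expired at time now. */
size_t
sketch_valid_leases(const struct sketch_lease *table, size_t n,
    struct sketch_ts now)
{
	size_t count, i;

	count = 0;
	for (i = 0; i < n; i++)
		if (table[i].eid != SKETCH_EID_INVALID &&
		    sketch_ts_after(&table[i].end_time, &now))
			count++;
	return (count);
}

/*
 * Decide the outcome of a refresh attempt.  send_ret is the result of
 * resending the last PERM record.  A send failure expires every lease;
 * a success still requires rechecking the table, because the grant
 * messages may be lost or not yet processed.
 */
int
sketch_refresh_result(int send_ret, struct sketch_lease *table, size_t n,
    struct sketch_ts now, size_t needed)
{
	size_t i;

	if (send_ret != 0) {
		for (i = 0; i < n; i++) {
			table[i].end_time.sec = 0;
			table[i].end_time.nsec = 0;
		}
		return (SKETCH_LEASE_EXPIRED);
	}
	if (sketch_valid_leases(table, n, now) >= needed)
		return (0);
	return (SKETCH_LEASE_EXPIRED);
}
```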
527<h4>Ongoing Update Refreshment<br>
528</h4>
The second piece is having the master indicate to
the client that it needs to send a lease grant in response to the
current PERM log message.&nbsp; The problem is
that acknowledgements must contain a master-supplied message timestamp
that the client sends back to the master.&nbsp; We need to modify the
structure of the log record messages when leases are configured
so
that when a PERM message is sent, the master sends, and the client
expects, the message timestamp.&nbsp; There are three fairly
straightforward and different implementations to consider.<br>
539<ol>
540  <li>Adding the timestamp to the <b>REP_CONTROL</b>
541structure.&nbsp; If this option is chosen, then the code trivially
542sends back the timestamp in the client's reply.&nbsp; There is no
543special processing done by either side with the message contents.&nbsp;
544So, on a PERM log record, the master will send a non-zero
545timestamp.&nbsp; On a normal log record the timestamp will be zero or
546some known invalid value.&nbsp; If the client sees a non-zero
547timestamp, it sends a <b>REP_LEASE_GRANT</b>
548with the <i>lp-&gt;max_perm_lsn</i>
549after applying that log record.&nbsp; If it is zero, then the client
550does nothing different.&nbsp; The advantage is ease of code.&nbsp; The
551disadvantage is that for mixed version systems, the client is now
552dealing with different sized control structures.&nbsp; We would have to
553retain the old control structure so that during a mixed version group
554the (upgraded) clients can use, expect and send old control structures
555to the master.&nbsp; This is unfortunate, so let's consider additional
556implementations that don't require modifying the control structure.<br>
557  </li>
558  <li>Adding a new <b>REPCTL_LEASE</b>
559flag to the list of flags for the control structure, but do not change
560the control structure fields.&nbsp; When a master wants to send a
561message that needs a lease ack, it sets the flag.&nbsp; Additionally,
562instead of simply sending a log record DBT as the <i>rec</i> parameter
563for replication, we
564would send a new structure that had the timestamp first and then the
565record (similar to the bulk transfer buffer).&nbsp; The advantage of
566this is that the control structure does not change.&nbsp; Disadvantages
567include more special-cased code in the normal code path where we have
568to check the flag.&nbsp; If the flag is set we have to extract the
569timestamp value and massage the incoming data to pass on the real log
570record to <i>rep_apply</i>.&nbsp; On
571bulk transfer, we would just add the timestamp into the buffer.&nbsp;
572On normal transfers, it would incur an additional data copy on the
573master side.&nbsp; That is unfortunate.&nbsp; Additionally, if this
574record needs to be stored in the temp db, we need some way to get it
575back again later or <span style="font-style: italic;">rep_apply</span>
576would have to extract the timestamp out when it processed the record
577(either live or from the temp db).<br>
578  </li>
579  <li>Adding a different message type, such as <b>REP_LOG_ACK</b>.&nbsp;
580Similarly to <b>REP_LOG_MORE</b> this message would be a
581special-case version of a log record.&nbsp; We would extract out the
582timestamp and then handle as a normal log record.&nbsp; This
583implementation is rejected because it actually would require three new
584message types: <b>REP_LOG_ACK,
585REP_LOG_ACK_MORE, REP_BULK_LOG_ACK</b>.&nbsp; That is just too ugly
586to contemplate.</li>
587</ol>
588<b>[Slight digression:</b> it occurs
589to me while writing about #2 and #3 above, that our implementation of
590all of the *_MORE messages could really be implemented with a <b>REPCTL_MORE</b>
591flag instead of a
592separate message type.&nbsp; We should clean that up and simplify the
messages, but not as part of master leases. Hmm, taking that thought
594process further, we really could get rid of the <b>REP_BULK_*</b>
595messages as well if we
596added a <b>REPCTL_BULK</b>
597flag.&nbsp; I think we should definitely do it for the *_MORE
598messages.&nbsp; I am not sure we should do it for bulk because the
599structure of the incoming data record is vastly different.]<br>
600<br>
601Of these options, I believe that modifying the control structure is the
602best alternative.&nbsp; The handling of the old structure will be very
603isolated to code dealing with old versions and is far less complicated
604than injecting the timestamp into the log record DBT and doing a data
605copy.&nbsp; Actually, I will likely combine #1 and the flag from #2
606above.&nbsp; I will have the <b>REPCTL_LEASE</b>
607flag that indicates a lease grant reply is expected and have the
608timestamp in the control structure.&nbsp;
609Also I will probably add in a spare field or two for future use in the <b>REP_CONTROL</b>
610structure.<br>
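A minimal sketch of the combined approach (flag plus timestamp in the control structure) follows. The layout, the field names and the flag value are all assumptions made for illustration; this is not the actual <b>REP_CONTROL</b> definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the real REPCTL_LEASE value is not given here. */
#define SKETCH_REPCTL_LEASE 0x01u

/* Illustrative control header: LSN and flags plus the new timestamp. */
struct sketch_control {
	uint32_t rectype;	/* message type */
	uint32_t flags;		/* REPCTL_* style flags */
	uint32_t lsn_file;	/* LSN of the record: file number */
	uint32_t lsn_offset;	/* LSN of the record: offset in file */
	int64_t msg_sec;	/* master send time: seconds */
	int64_t msg_nsec;	/* master send time: nanoseconds */
};

/*
 * Return 1 and copy out the master's timestamp if this message asks for
 * a lease grant; return 0 and leave the outputs untouched otherwise.
 */
int
sketch_lease_requested(const struct sketch_control *cp,
    int64_t *secp, int64_t *nsecp)
{
	assert(cp != NULL && secp != NULL && nsecp != NULL);
	if ((cp->flags & SKETCH_REPCTL_LEASE) == 0)
		return (0);
	*secp = cp->msg_sec;
	*nsecp = cp->msg_nsec;
	return (1);
}
```

With this shape the client's normal code path only tests one flag bit, and the timestamp travels with the control DBT rather than inside the log record.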
611<h4>Gap processing</h4>
612No matter which implementation we choose for ongoing lease refreshment,
613gap processing must be considered.&nbsp; The code above assumes the
614timestamps will be placed on PERM records only.&nbsp; Normal log
615records will not have a timestamp, nor a flag or anything else like
616that.&nbsp; However, any log message can fill a gap on a client and
617result in the processing of that normal log record to return <b>DB_REP_ISPERM</b>
618because later records
619were also processed.<br>
620<br>
621The current implementation should work fine in that case because when
622we store the message in the client temp db we store both the control
623DBT and the record DBT.&nbsp; Therefore, when a normal record fills a
gap, the later PERM record, when retrieved, will look just like it did
625when it arrived.&nbsp; The client will have access to the LSN, and the
626timestamp, etc.&nbsp; However, it does mean that sending the <b>REP_LEASE_GRANT</b>
627message must take
628place down in <i>__rep_apply</i>
629because that is the only place we have access to the contents of those
630stored records with the timestamps.<br>
631<br>
632There are two logical choices to consider for granting the lease when
processing an update.&nbsp; First, as we process a PERM message (either a live record or one
read from the temp db after filling a gap), we can send a <b>REP_LEASE_GRANT</b>
635message for each
636PERM record we successfully apply.&nbsp; Or, second, we keep track of
637the largest timestamp of all PERM records we've processed and at the
638end of the function after we've applied all records, we send back a
639single lease grant with the <i>max_perm_lsn</i>
640and a new <i>max_lease_timestamp</i>
value to the master.&nbsp; The first is easier to implement; the second
642results in possibly slightly fewer messages at the expense of more
643bookkeeping on the client.<br>
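The second choice (one grant per batch) could be sketched as below. This is only an illustration under assumed types: <i>sketch_rec</i>, <i>sketch_grant</i> and <i>sketch_collect_grant</i> are hypothetical names, timestamps are reduced to a single integer, and the records are assumed to be handed over in the LSN order in which they were applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_lsn { uint32_t file; uint32_t offset; };

/* One applied record: its LSN, whether it is PERM, and the master's stamp. */
struct sketch_rec {
	struct sketch_lsn lsn;
	int is_perm;
	int64_t msg_time;
};

/* What the client would send back after the whole batch is applied. */
struct sketch_grant {
	int send;			/* 0 if no PERM record was applied */
	struct sketch_lsn max_perm_lsn;
	int64_t max_lease_timestamp;
};

struct sketch_grant
sketch_collect_grant(const struct sketch_rec *recs, size_t nrecs)
{
	struct sketch_grant g = { 0, { 0, 0 }, 0 };
	size_t i;

	assert(recs != NULL || nrecs == 0);
	for (i = 0; i < nrecs; i++) {
		if (!recs[i].is_perm)
			continue;
		/* Records arrive here in LSN order, so the last PERM wins. */
		g.max_perm_lsn = recs[i].lsn;
		/* A stored record may carry an older stamp than a live one. */
		if (!g.send || recs[i].msg_time > g.max_lease_timestamp)
			g.max_lease_timestamp = recs[i].msg_time;
		g.send = 1;
	}
	return (g);
}
```

The extra bookkeeping mentioned above is just the two running maxima; the saving is that a gap fill that releases several stored PERM records yields one grant message instead of several.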
644<br>
645A third, more complicated option would be to have the message timestamp
646on all records, but grants are only sent on the PERM messages.&nbsp; A
647reason to do this is that the later timestamp of a normal log record
648would be used as the timestamp sent in the reply and the master would
649get a more up to date timestamp value and a longer lease.&nbsp; <br>
650<br>
651If we change the <span style="font-weight: bold;">REP_CONTROL</span>
652structure to include the timestamp, we potentially break or at least
653need to revisit the gap processing algorithm.&nbsp; That code assumes
654that the control and record elements for the same LSN look the same
655each and every time.&nbsp; The code stores the <span
656 style="font-style: italic;">control</span> DBT as the key and the <span
657 style="font-style: italic;">rec</span> DBT as the data.&nbsp; We use a
658specialized compare function to sort based on the LSN in the control
DBT.&nbsp; With master leases, the same record transmitted multiple times by a master
or client for the same LSN will be different because the
661timestamp field will not be the same.&nbsp; Therefore, the client will
662end up with duplicate entries in the temp database for the same
663LSN.&nbsp; Both solutions (adding the timestamp to <span
664 style="font-weight: bold;">REP_CONTROL</span> and adding a <span
665 style="font-weight: bold;">REPCTL_LEASE</span> flag) can yield
666duplicate entries.&nbsp; The flag would cause the same record from the
667master and client to be different as well.<br>
668<h4>Handling Incoming Lease Grants<br>
669</h4>
670The third piece of lease management is handling the incoming <b>REP_LEASE_GRANT</b>
671message on the
672master.&nbsp; When this message is received, the master must do the
673following:<br>
<pre>REP_SYSTEM_LOCK<br>msg_timestamp = cntrl-&gt;timestamp;<br>client_lease = __rep_lease_entry(dbenv, client eid)<br>if (client_lease == NULL) {<br>	initial lease for this site, DB_ASSERT there is space in the table<br>	add this to the table if there is space<br>} else {<br>	compare msg_timestamp with client_lease-&gt;start_time<br>	if (msg_timestamp is more recent &amp;&amp; msg_lsn &gt;= lease LSN)<br>		update entry in table<br>}<br>REP_SYSTEM_UNLOCK<br></pre>
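The steps above might look roughly like the following in C. This is a sketch under stated assumptions: the table layout, the <i>sketch_</i> names and the fixed capacity are hypothetical, timestamps are plain integers, locking (<b>REP_SYSTEM_LOCK</b>) is omitted, and the end time is assumed to be the grant timestamp plus the configured lease timeout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_SITES 8	/* hypothetical table capacity */

struct sketch_lsn { uint32_t file; uint32_t offset; };

struct sketch_lease {
	int in_use;
	int eid;			/* client environment id */
	int64_t start_time;		/* master timestamp echoed in the grant */
	int64_t end_time;		/* start_time + lease timeout (assumed) */
	struct sketch_lsn lsn;		/* LSN the grant covers */
};

struct sketch_table {
	struct sketch_lease entry[SKETCH_MAX_SITES];
	int64_t lease_timeout;
};

static int
sketch_lsn_cmp(const struct sketch_lsn *a, const struct sketch_lsn *b)
{
	if (a->file != b->file)
		return (a->file < b->file ? -1 : 1);
	if (a->offset != b->offset)
		return (a->offset < b->offset ? -1 : 1);
	return (0);
}

/* Returns 1 if the table was updated, 0 if the grant was stale. */
int
sketch_lease_grant(struct sketch_table *t, int eid, int64_t msg_time,
    struct sketch_lsn msg_lsn)
{
	struct sketch_lease *e, *free_slot;
	size_t i;

	e = free_slot = NULL;
	for (i = 0; i < SKETCH_MAX_SITES; i++) {
		if (t->entry[i].in_use && t->entry[i].eid == eid) {
			e = &t->entry[i];
			break;
		}
		if (!t->entry[i].in_use && free_slot == NULL)
			free_slot = &t->entry[i];
	}
	if (e == NULL) {
		/* Initial lease for this site: there must be room. */
		assert(free_slot != NULL);
		e = free_slot;
		e->in_use = 1;
		e->eid = eid;
	} else if (msg_time <= e->start_time ||
	    sketch_lsn_cmp(&msg_lsn, &e->lsn) < 0)
		return (0);	/* Older than what we already hold. */

	e->start_time = msg_time;
	e->end_time = msg_time + t->lease_timeout;
	e->lsn = msg_lsn;
	return (1);
}
```

Rejecting a grant whose timestamp or LSN goes backwards keeps a delayed or reordered message from shortening a lease the master already holds.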
675<h3>Expiring Leases</h3>
676Leases can expire in two ways.&nbsp; First they can expire naturally
677due to the passage of time.&nbsp; When checking leases, if the current
678time is later than the lease entry's <i>end_time</i>
679then the lease is expired.&nbsp; Second, they can be forced with a
680premature expiration when the application's transport function returns
an error.&nbsp; In the first case, there is nothing to do; in the
682second case we need to manipulate the <i>end_time</i>
683so that all future lease checks fail.&nbsp; Since the lease <i>start_time</i>
684is guaranteed to not be in the future we will have a function <i>__rep_lease_expire</i>
685that will:<br>
686<pre>REP_SYSTEM_LOCK<br>for each entry in the lease table<br>	entry-&gt;end_time = entry-&gt;start_time;<br>REP_SYSTEM_UNLOCK<br></pre>
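As a small illustration of the forced expiration, the sketch below collapses each lease window to its start time and shows the effect on a validity check. The types and names are hypothetical, timestamps are plain integers, and the validity test mirrors the <i>end_time</i> comparison used when checking leases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_lease {
	int64_t start_time;
	int64_t end_time;
};

/* Force every lease to look expired by collapsing its window. */
void
sketch_lease_expire(struct sketch_lease *table, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(table[i].start_time <= table[i].end_time);
		table[i].end_time = table[i].start_time;
	}
}

/* A lease is usable only while its end time has not passed. */
int
sketch_lease_current(const struct sketch_lease *e, int64_t now)
{
	return (e->end_time >= now);
}
```

Because the start time is never in the future, any check made at a later clock reading sees the collapsed lease as expired, with no separate "invalid" flag to maintain.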
687Is there a potential race or problem with prematurely expiring
688leases?&nbsp; Consider an application that enforces an ALL
689acknowledgement policy for PERM records in its transport
690callback.&nbsp; There are four clients and three send the PERM ack to
691the application.&nbsp; The callback returns an error to the master DB
692code.&nbsp; The DB code will now prematurely expire its leases.&nbsp;
693However, at approximately the same time the three clients are also
694sending their <span style="font-weight: bold;">REP_LEASE_GRANT</span>
695messages to the master.&nbsp; There is a race between the master
696processing those messages and the thread handling the callback failure
697expiring the table.&nbsp; This is only an issue if the messages arrive
698after the table has been expired.<br>
699<br>
700Let's assume all three clients send their grants after the master
701expires the table.&nbsp; If we accept those grants and then a read
702occurs the read will succeed since the master has a majority of leases
703even though the callback failed earlier.&nbsp; Is that a problem?&nbsp;
704The lease code is using a majority and the application policy is using
some other value.&nbsp; It feels like this should be okay since
706the data is held by leases on a majority.&nbsp; Should we consider
707having the lease checking threshold be the same as the permanent ack
708policy?&nbsp; That is difficult because Base API users implement
709whatever they want and DB does not know what it is.<br>
710<h3>Checking Leases</h3>
711When a read operation on the master completes, the last thing we need
712to do is verify the master leases.&nbsp; We've already discussed
713refreshing them when they are expired above.&nbsp; We need two things
714for a lease to be valid.&nbsp; It must be within the timeframe of the
715lease grant and the lease must be valid for the last PERM record
716LSN.&nbsp; Here is the logic
717for checking the validity of leases in <i>__rep_lease_check</i>:<br>
<pre>#define MAX_REFRESH_TRIES	3<br>DB_LSN lease_lsn;<br>REP_LEASE_ENTRY *entry;<br>u_int32_t min_leases, valid_leases;<br>db_timespec cur_time;<br>int ret, tries;<br><br>	tries = 0;<br>retry:<br>	ret = 0;<br>	LOG_SYSTEM_LOCK<br>	lease_lsn = lp-&gt;lsn;<br>	LOG_SYSTEM_UNLOCK<br>	REP_SYSTEM_LOCK<br>	min_leases = rep-&gt;nsites / 2;<br>	__os_gettime(dbenv, &amp;cur_time);<br>	for (entry = head of table, valid_leases = 0; entry != NULL &amp;&amp; valid_leases &lt; min_leases; entry++)<br>		if (timespec_cmp(&amp;entry-&gt;end_time, &amp;cur_time) &gt;= 0 &amp;&amp; log_compare(&amp;entry-&gt;lsn, &amp;lease_lsn) == 0)<br>			valid_leases++;<br>	REP_SYSTEM_UNLOCK<br>	if (valid_leases &lt; min_leases) {<br>		ret = __rep_lease_refresh(dbenv, ...);<br>		/*<br>		 * If we are successful, we need to recheck the leases because<br>		 * the lease grant messages may have raced with the PERM<br>		 * acknowledgement.  Give those messages a chance to arrive.<br>		 */<br>		if (ret == 0) {<br>			if (tries &lt;= MAX_REFRESH_TRIES) {<br>				/*<br>				 * If we were successful sending, but not successful in racing the<br>				 * message thread, yield the processor so that message<br>				 * threads may have a chance to run.<br>				 */<br>				if (tries &gt; 0)<br>					/* __os_sleep instead?? */<br>					__os_yield();<br>				tries++;<br>				goto retry;<br>			} else<br>				ret = DB_RET_LEASE_EXPIRED;<br>		}<br>	}<br>	return (ret);</pre>
719If the master has enough valid leases it returns success.&nbsp; If it
720does not have enough, it attempts to refresh them.&nbsp; This attempt
721may fail if sending the PERM record does not receive sufficient
722acks.&nbsp; If we do receive sufficient acknowledgements we may still
find that scheduling of message threads means the master hasn't
processed the incoming <b>REP_LEASE_GRANT</b>
messages yet.&nbsp; We will retry a couple of times (possibly
726parameterized) if the master discovers that situation.&nbsp; <br>
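The counting step at the heart of the check can be isolated as a pure function, sketched below. The names and types are hypothetical, timestamps are plain integers, and the refresh and retry logic is left out. The threshold of <i>nsites / 2</i> client leases presumably works as a majority because the master counts itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_lsn { uint32_t file; uint32_t offset; };

struct sketch_lease {
	int64_t end_time;		/* when this client's grant runs out */
	struct sketch_lsn lsn;		/* LSN the grant covers */
};

/*
 * Return 1 if at least nsites / 2 entries are unexpired and cover the
 * last PERM LSN, 0 otherwise.
 */
int
sketch_leases_valid(const struct sketch_lease *table, size_t n,
    uint32_t nsites, int64_t now, struct sketch_lsn lease_lsn)
{
	uint32_t min_leases, valid;
	size_t i;

	assert(nsites > 0);
	min_leases = nsites / 2;
	valid = 0;
	/* Stop scanning as soon as enough valid leases are found. */
	for (i = 0; i < n && valid < min_leases; i++)
		if (table[i].end_time >= now &&
		    table[i].lsn.file == lease_lsn.file &&
		    table[i].lsn.offset == lease_lsn.offset)
			valid++;
	return (valid >= min_leases);
}
```

For a five-site group this requires two current client leases at the last PERM LSN; a lease that is still within its time window but was granted for an earlier LSN does not count.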
727<h2>Elections</h2>
728When a client grants a lease to a master, it gives up the right to
729participate in an election until that grant expires.&nbsp; If we are
730the master and <i>dbenv-&gt;rep_elect</i>
731is called, it should return, no matter what, like it does today.&nbsp;
732If we are a client and <i>rep_elect</i>
is called, special processing takes place when leases are in
734effect.&nbsp; First, the easy case is if the lease granted by this
735client has already expired, then the client goes directly into the
736election as normal.&nbsp; If a valid lease grant is outstanding to a
737master, this site cannot participate in an election until that grant
738expires.&nbsp; We have at least two options when a site calls the <i>dbenv-&gt;rep_elect</i>
739API while
740leases are in effect.<br>
741<ol>
742  <li>The simplest coding solution for DB would be simply to refuse to
743participate in the election if this site has a current lease granted to
744a master.&nbsp; We would detect this situation and return EINVAL.&nbsp;
745This is correct behavior and trivial to implement.&nbsp; The
746disadvantage of this solution is that the application would then be
747responsible for repeatedly attempting an election until the lease grant
748expired.<br>
749  </li>
750  <li>The more satisfying solution is for DB to wait the remaining time
751for the grant.&nbsp; If this client hears from the master during that
752time the election does not take place and the call to <i>rep_elect</i>
753returns with the
754information for the current/old master.</li>
755</ol>
756<h3>Election Code Changes</h3>
757The code changes to support leases in the election code are fairly
isolated.&nbsp; First, if leases are configured, we must verify the <i>nsites</i>
759parameter is set to 0.&nbsp;
760Second, in <i>__rep_elect_init</i>
761we must not overwrite the value of <i>rep-&gt;nsites</i>
762for leases because it is controlled by the <i>dbenv-&gt;rep_set_nsites</i>
763API.&nbsp;
764These changes are small and easy to understand.<br>
765<br>
766The more complicated code will be the client code when it has an
767outstanding lease granted.&nbsp; The client will wait for the current
768lease grant to expire before proceeding with the election.&nbsp; The
769client will only do so if it does not hear from the master for the
770remainder of the lease grant time.&nbsp; If the client hears from the
771master, it returns and does not begin participating in the
772election.&nbsp; A new election phase, <b>REP_EPHASE0</b>
773will exist so that the call to <i>__rep_wait</i>
774can detect if a master responds.&nbsp; The client, while waiting for
775the lease grant to expire, will send a <b>REP_MASTER_REQ</b>
776message so that the master will respond with a <b>REP_NEWMASTER</b>
777message and thus,
allow the client to know the master exists.&nbsp; However, it is also
desirable that when the master
replies to the client, the client also updates its lease
grant.&nbsp; <br>
782<br>
783Recall that the <b>REP_NEWMASTER</b>
784message does not result in a lease grant from the client.&nbsp; The
785client responds when it processes a PERM record that has the <b>REPCTL_LEASE</b>
786flag set in the message
787with its lease grant up to the given LSN.&nbsp; Therefore, we want the
788client's <b>REP_MASTER_REQ</b> to
789yield both the discovery of the existing master and have the master
790refresh its leases.&nbsp; The client will also use the <b>REPCTL_LEASE</b>
791flag in its <b>REP_MASTER_REQ</b> message to the
792master.&nbsp; This flag will serve as the indicator to the master that
793it needs to deal with leases and both send the <b>REP_NEWMASTER</b>
794message and refresh
795the lease.<br>
796The code will work as follows:<br>
<pre>if (leases_configured &amp;&amp; (my_grant_still_valid || lease_never_granted)) {<br>	if (lease_never_granted)<br>		wait_time = lease_timeout;<br>	else<br>		wait_time = grant_expiration - current_time;<br>	F_SET(REP_F_EPHASE0);<br>	__rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);<br>	ret = __rep_wait(..., REP_F_EPHASE0);<br>	if (we found a master)<br>		return;<br>} /* if we don't return, fall out and proceed with election */<br></pre>
798On the master side, the code handling the <b>REP_MASTER_REQ</b> will
799do:<br>
800<pre>if (I am master) {<br>	...<br>	__rep_send_message(REP_NEWMASTER...)<br>	if (F_ISSET(rp, REPCTL_LEASE))<br>		__rep_lease_refresh(...)<br>}<br></pre>
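The client-side wait computation can be pulled out into a small helper, sketched here with hypothetical names and integer timestamps. A return value of zero means the client proceeds directly to the election; any other value is how long it should wait in the new phase for the master to respond.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How long a client must wait for the master before it may start an
 * election.  Returns 0 when no wait is needed.
 */
int64_t
sketch_election_wait(int leases_configured, int lease_never_granted,
    int64_t grant_expiration, int64_t now, int64_t lease_timeout)
{
	assert(lease_timeout >= 0);
	if (!leases_configured)
		return (0);
	/* A site that has never granted a lease waits a full timeout. */
	if (lease_never_granted)
		return (lease_timeout);
	/* Otherwise wait only for what is left of the current grant. */
	if (grant_expiration > now)
		return (grant_expiration - now);
	return (0);	/* Grant already expired. */
}
```

Treating "never granted" as a full timeout is what makes a freshly restarted client, which has lost its in-memory grant, behave conservatively.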
Other minor implementation details are that <i>__rep_elect_done</i>
802must also clear
803the <b>REP_F_EPHASE0</b> flag.&nbsp;
804We also, obviously, need to define <b>REP_F_EPHASE0</b>
805in the list of replication flags.&nbsp; Note that the client's call to <i>__rep_wait</i>
806will return upon
807receiving the <b>REP_NEWMASTER</b>
808message.&nbsp; The client will independently refresh its lease when it
809receives the log record from the master's call to refresh the lease.<br>
810<br>
811Again, similar to what I suggested above, the code could simply assume
812global leases are configured, and instead of having the <b>REPCTL_LEASE</b>
813flag at all, the master
814assumes that it needs to refresh leases because it has them configured,
815not because it is specified in the <b>REP_MASTER_REQ</b>
816message it is processing. Right now I don't think every possible
817<b>REP_MASTER_REQ</b> message should result in a lease grant request.<br>
<h4>Elections and Quiescent Systems</h4>
It is possible that a master is slow or the client is close to its
expiration time, or that the master is quiescent and all leases are
821currently expired, but nothing much is going on anyway, yet some client
822calls <i>__rep_elect</i> at that
823time.&nbsp; In the code above, we will not send the <b>REP_MASTER_REQ</b>
824because the lease is
825not valid.&nbsp; The client will simply proceed directly to sending the
826<b>REP_VOTE1</b> message, throwing all
827other clients into an election.&nbsp; The master is still master and
828should stay that way.&nbsp; Currently in response to a vote message, a
829master will broadcast out a <b>REP_NEWMASTER</b>
830to assert its mastership.&nbsp; That causes the election to
831complete.&nbsp; However, if desired the master may want to proactively
832refresh its leases.&nbsp; This situation indicates to me that the
833master should choose to refresh leases based on configuration, not a
flag sent from the client.&nbsp; I believe that anytime the master asserts
its mastership by sending a <b>REP_NEWMASTER</b>
message, I need to add code to proactively refresh leases at that
time.<br>
838<h2>Other Implementation Details</h2>
839<h3>Role Changes<br>
840</h3>
841When a site changes its role via a call to <i>rep_start</i> in either
842direction, we
843must take action when leases are configured.&nbsp; There are three
844types of role changes that all need changes to deal with leases:<br>
845<ol>
846  <li><i>A master downgrading to a
847client.</i> When a master downgrades to a client, it can do so
848immediately after it has proactively expired all existing leases it
849holds.&nbsp; This situation is similar to an error from the send
850callback, and it effectively cancels all outstanding leases held on
851this site.&nbsp; Note that if this master expires its leases, it does
852not have any effect on when the clients' lease grants expire on the
853client side.&nbsp; The clients must still wait their full expected
854grant time.<br>
855  </li>
856  <li><i>A client upgrading to master.</i>
857If a client is upgrading to a master but it has an outstanding lease
858granted to another site, the code will return an <b>EINVAL</b>
859error.&nbsp; This situation
860only arises if the application simply declares this site master.&nbsp;
861If a site wins an election then the election itself should have waited
862long enough for the granted lease to expire and this state should not
863arise then.</li>
864  <li><i>A client finding a new master.</i>
865When a client discovers a new and different master, via a <b>REP_NEWMASTER</b>
866message then the
867client cannot accept that new master until its current lease grant
868expires.&nbsp; This situation should only occur when a site declares
869itself master without an election and that site's lease grant expires
870before this client's grant expires.&nbsp; However, it is <b>possible</b>
871for this situation to arise
872with elections also.&nbsp; If we have 5 sites holding an election and 4
873of those sites have leases expire at about the same time T, and this
874site's lease expires at time T+N and the election timeout is &lt; N,
875then those 4 sites may hold an election and elect a master without this
876site's participation.&nbsp; A client in this situation must call <i>__rep_wait</i>
877with the time remaining
878on its lease.&nbsp; If the lease is expired after waiting the remaining
879time, then the client can accept this new master.&nbsp; If the lease
880was refreshed during the waiting period then the client does not accept
881this new master and returns.<br>
882  </li>
883</ol>
884<h3>DUPMASTER</h3>
885A duplicate master situation can occur if an old master becomes
886disconnected from the rest of the group, that group elects a new master
887and then the partition is resolved.&nbsp; The requirement for master
888leases is that this situation will not cause the newly elected,
889rightful master to receive the <b>DB_REP_DUPMASTER</b>
890return.&nbsp; It is okay for the old master to get that return
891value.&nbsp; When a dual master situation exists, the following will
892happen:<br>
893<ul>
894  <li><i>On the current master and all
895current clients</i> - If the current master receives an update
896message or other conflicting message from the old master then that
897message will be ignored because the generation number is out of date.</li>
898  <li><i>On the old master</i> - If
899the old master receives an update message from the current master, or
900any other message with a later generation from any site, the new
901generation number will trigger this site to return <b>DB_REP_DUPMASTER</b>.&nbsp;
902However,
903instead of broadcasting out the <b>REP_DUPMASTER</b>
904message to shoot down others as well, this site, if leases are
905configured, will call <i>__rep_lease_check</i>
906and if they are expired, return the error.&nbsp; It should be
907impossible for us to receive a later generation message and still hold
908a majority of master leases.&nbsp; Something is seriously wrong and we
909will <b>DB_ASSERT</b> this situation
910cannot happen.<br>
911  </li>
912</ul>
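The generation-number rule described above can be sketched as a small decision function. The names are hypothetical, and treating an equal generation as "process normally" is an assumption of this sketch; the assertion stands in for the <b>DB_ASSERT</b> that a site seeing a later generation cannot still hold a majority of leases.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_action { SKETCH_IGNORE, SKETCH_PROCESS, SKETCH_DUPMASTER };

/*
 * Decide what a site that believes it is master does with an incoming
 * message, based only on generation numbers.
 */
enum sketch_action
sketch_master_on_message(uint32_t my_gen, uint32_t msg_gen,
    int holds_majority_leases)
{
	/* Message from an old master: out of date, drop it. */
	if (msg_gen < my_gen)
		return (SKETCH_IGNORE);
	if (msg_gen > my_gen) {
		/* We are the old master; our leases must have expired. */
		assert(!holds_majority_leases);
		return (SKETCH_DUPMASTER);
	}
	return (SKETCH_PROCESS);
}
```

Only the site with the stale generation ever reaches the DUPMASTER branch, which is how the rightful master avoids seeing that return value.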
913<h3>Client to Client Synchronization</h3>
914One question to ask is how lease grants interact with client-to-client
915synchronization. The only answer is that they do not.&nbsp; A client
916that is sending log records to another client cannot request the
917receiving client refresh its lease with the master.&nbsp; That client
918does not have a timestamp it can use for the master and clock skew
919makes it meaningless between machines.&nbsp; Therefore, sites that use
920client-to-client synchronization will likely see more lease refreshment
921during the read path and leases will be refreshed during live updates
922only.&nbsp; Of course, if a client supplies log records that fill a
923gap, and the later log records stored came from the master in a live
924update then the client will respond as per the discussion on Gap
925Processing above.<br>
926<h2>Interaction Matrix</h2>
927If leases are granted (by a client) or held (by a master) what should
928the following APIs and messages do?<br>
929<br>
930Other:<br>
931log_archive: Leases do not affect log_archive.&nbsp; OK.<br>
932dbenv-&gt;close: OK.<br>
933crash during lease grant and restart: <b>Potential
934problem here.&nbsp; See discussion below</b>.<br>
935<br>
936Rep Base API method:<br>
937rep_elect: Already discussed above.&nbsp; Must wait for lease to expire.<br>
938rep_flush: Master only, OK - this will be the basis for refreshing
939leases.<br>
940rep_get_*: Not affected by leases.<br>
941rep_process_message: Generally OK.&nbsp; We'll discuss each message
942below.<br>
943rep_set_config: OK.<br>
944rep_set_limit: OK<br>
945rep_set_nsites: Must be called before <i>rep_start</i>
946and <i>nsites</i> is immutable until
94714778 is resolved.<br>
948rep_set_priority: OK<br>
949rep_set_timeout: OK.&nbsp; Used to set lease timeout.<br>
950rep_set_transport: OK.<br>
951rep_start(MASTER): Role changes are discussed above.&nbsp; Make sure
952duplicate rep_start calls are no-ops for leases.<br>
953rep_start(CLIENT): Role changes are discussed above.&nbsp; Make sure
954duplicate calls are no-ops for leases.<br>
955rep_stat: OK.<br>
956rep_sync: Should not be able to happen.&nbsp; Client cannot accept new
957master with outstanding lease grant.&nbsp; Add DB_ASSERT here.<br>
958<br>
959REP_ALIVE: OK.<br>
960REP_ALIVE_REQ: OK.<br>
961REP_ALL_REQ: OK.<br>
962REP_BULK_LOG: OK.&nbsp; Clients check to send ACK.<br>
963REP_BULK_PAGE: Should never process one with lease granted.&nbsp; Add
964DB_ASSERT.<br>
965REP_DUPMASTER: Should never happen, this is what leases are supposed to
966prevent.&nbsp; See above.<br>
967REP_LOG: OK.&nbsp; Clients check to send ACK.<br>
968REP_LOG_MORE: OK.&nbsp; Clients check to send ACK.<br>
969REP_LOG_REQ: OK.<br>
970REP_MASTER_REQ: OK.<br>
971REP_NEWCLIENT: OK.<br>
972REP_NEWFILE: OK.&nbsp; Clients check to send ACK.<br>
973REP_NEWMASTER: See above.<br>
974REP_NEWSITE: OK.<br>
975REP_PAGE: OK.&nbsp; Should never process one with lease granted.&nbsp;
976Add DB_ASSERT.<br>
977REP_PAGE_FAIL:&nbsp; OK.&nbsp; Should never process one with lease
978granted.&nbsp; Add DB_ASSERT.<br>
979REP_PAGE_MORE:&nbsp; OK.&nbsp; Should never process one with lease
980granted.&nbsp; Add DB_ASSERT.<br>
981REP_PAGE_REQ: OK.<br>
982REP_REREQUEST: OK.<br>
983REP_UPDATE: OK.&nbsp; Should never process one with lease
984granted.&nbsp; Add DB_ASSERT.<br>
985REP_UPDATE_REQ: OK.&nbsp; This is a master-only message.<br>
986REP_VERIFY: OK.&nbsp; Should never process one with lease
987granted.&nbsp; Add DB_ASSERT.<br>
988REP_VERIFY_FAIL: OK.&nbsp; Should never process one with lease
989granted.&nbsp; Add DB_ASSERT.<br>
990REP_VERIFY_REQ: OK.<br>
991REP_VOTE1: OK.&nbsp; See Election discussion above.&nbsp; It is
992possible to receive one with a lease granted.&nbsp; Client cannot send
993one with an outstanding lease however.<br>
994REP_VOTE2: OK.&nbsp; See Election discussion above.&nbsp; It is
995possible to receive one with a lease granted.<br>
996<br>
997If the following method or message processing is in progress and a
998client wants to grant a lease, what should it do?&nbsp; Let's examine
999what this means.&nbsp; The client wanting to grant a lease simply means
1000it is responding to the receipt of a <b>REP_LOG</b>
1001(or its variants) message and applying a log record.&nbsp; Therefore,
1002we need to consider a thread processing a log message racing with these
1003other actions.<br>
1004<br>
1005Other:<br>
1006log_archive: OK.&nbsp; <br>
1007dbenv-&gt;close: User error.&nbsp; User should not be closing the env
1008while other threads are using that handle.&nbsp; Should have no effect
1009if a 2nd dbenv handle to same env is closed.<br>
1010<br>
1011Rep Base API method:<br>
1012rep_elect: See Election discussion above.&nbsp; <i>rep_elect</i>
1013should wait and may grant
1014lease while election is in progress.<br>
1015rep_flush: Should not be called on client.<br>
1016rep_get_*: OK.<br>
1017rep_process_message: Generally OK.&nbsp; See handling each message
1018below.<br>
1019rep_set_config: OK.<br>
1020rep_set_limit: OK.<br>
1021rep_set_nsites: Must be called before <i>rep_start</i>
1022until 14778 is resolved.<br>
1023rep_set_priority: OK.<br>
1024rep_set_timeout: OK.<br>
1025rep_set_transport: OK.<br>
1026rep_start(MASTER): OK, can't happen - already protect racing <i>rep_start</i>
1027and <i>rep_process_message</i>.<br>
1028rep_start(CLIENT): OK, can't happen - already protect racing <i>rep_start</i>
1029and <i>rep_process_message</i>.<br>
1030rep_stat: OK.<br>
1031rep_sync: Shouldn't happen because client cannot grant leases during
1032sync-up.&nbsp; Incoming log message ignored.<br>
1033<br>
1034REP_ALIVE: OK.<br>
1035REP_ALIVE_REQ: OK.<br>
1036REP_ALL_REQ: OK.<br>
1037REP_BULK_LOG: OK.<br>
1038REP_BULK_PAGE: OK.&nbsp; Incoming log message ignored during internal
1039init.<br>
1040REP_DUPMASTER: Shouldn't happen.&nbsp; See DUPMASTER discussion above.<br>
1041REP_LOG: OK.<br>
1042REP_LOG_MORE: OK.<br>
1043REP_LOG_REQ: OK.<br>
1044REP_MASTER_REQ: OK.<br>
1045REP_NEWCLIENT: OK.<br>
1046REP_NEWFILE: OK.<br>
1047REP_NEWMASTER: See above.&nbsp; If a client accepts a new master
1048because its lease grant expired, then that master sends a message
1049requesting the lease grant, this client will not process the log record
1050if it is in sync-up recovery, or it may after the master switch is
1051complete and the client doesn't need sync-up recovery.&nbsp; Basically,
1052just uses existing log record processing/newmaster infrastructure.<br>
1053REP_NEWSITE: OK.<br>
1054REP_PAGE: OK.&nbsp; Receiving a log record during internal init PAGE
1055phase should ignore log record.<br>
1056REP_PAGE_FAIL: OK.<br>
1057REP_PAGE_MORE: OK.<br>
1058REP_PAGE_REQ: OK.<br>
1059REP_REREQUEST: OK.<br>
1060REP_UPDATE: OK.&nbsp; Receiving a log record during internal init
1061should ignore log record.<br>
1062REP_UPDATE_REQ: OK - master-only message.<br>
1063REP_VERIFY: OK.&nbsp; Receiving a log record during verify phase
1064ignores log record.<br>
1065REP_VERIFY_FAIL: OK.<br>
1066REP_VERIFY_REQ: OK.<br>
1067REP_VOTE1: OK.&nbsp; This client is processing someone else's vote when
1068the lease request comes in.&nbsp; That is fine.&nbsp; We protect our
1069own election and lease interaction in <i>__rep_elect</i>.<br>
1070REP_VOTE2: OK.<br>
1071<h4>Crashing - Potential Problem<br>
1072</h4>
1073It appears there is one area where we could have a problem.&nbsp; I
1074believe that crashes can cause us to break our guarantee on durability,
1075authoritative reads and inability to elect duplicate masters.&nbsp;
1076Consider this scenario:<br>
1077<ol>
1078  <li>A master and 4 clients are all up and running.</li>
1079  <li>The master commits a txn and all 4 clients refresh their lease
1080grants at time T.</li>
1081  <li>All 4 clients have the txn and log records in the cache.&nbsp;
1082None are flushing to disk.</li>
1083  <li>All 4 clients have responded to the PERM messages as well as
1084refreshed their lease with the master.</li>
1085  <li>All 4 clients hit the same application coding error and crash
1086(machine/OS stays up).</li>
1087  <li>Master authoritatively reads data in txn from step 2.</li>
1088  <li>All 4 clients restart the application and run recovery, thus the
txn from step 2 is lost on all clients because it isn't in any logs.<br>
1091  </li>
1092  <li>A network partition happens and the master is alone on its side.</li>
1093  <li>All 4 clients are on the other side and elect a new master.</li>
1094  <li>Partition resolves itself and we have duplicate masters, where
the former master still holds all valid lease grants.<br>
1097  </li>
1098</ol>
1099Therefore, we have broken both guarantees.&nbsp; In step 6 the data is
1100really not durable and we've given it to the user.&nbsp; One can argue
1101that if this is an issue the application better be syncing somewhere if
1102they really want durability.&nbsp; However, worse than that is that we
1103have a legitimate DUPMASTER situation in step 10 where both masters
1104hold valid leases.&nbsp; The reason is that all lease knowledge is in
1105the shared memory and that is lost when the app restarts and runs
1106recovery.<br>
1107<br>
1108How can we solve this?&nbsp; The obvious solution is (ugh, yet another)
1109durable BDB-owned file with some information in it, such as the current
1110lease expiration time so that rebooting after a crash leaves the
1111knowledge that the lease was granted.&nbsp; However, writing and
1112syncing every lease grant on every client out to disk is far too
1113expensive.<br>
1114<br>
1115A second possible solution is to have clients wait a full lease timeout
1116before entering an election the first time. This solution solves the
1117DUPMASTER issue, but not the non-authoritative read.&nbsp; This
1118solution naturally falls out of elections and leases really.&nbsp; If a
1119client has never granted a lease, it should be considered as having to
1120wait a full lease timeout before entering an election.&nbsp;
1121Applications already know that leases impact elections and this does
1122not seem so bad as it is only on the first election.<br>
1123<br>
1124Is it sufficient to document that the authoritative read is only as
authoritative as the durability guarantees the application makes on the sites that
1126indicate it is permanent? Yes, I believe this is sufficient.&nbsp; If
1127the application says it is permanent and it really isn't, then the
1128application is at fault.&nbsp; Believing the application when it
1129indicates with the PERM response that it is permanent avoids the
1130authoritative problem.&nbsp; <br>
<h2>Upgrade/Mixed Versions</h2>
Clearly leases cannot be used with mixed version sites since masters
running older releases will not have any knowledge of lease
support.&nbsp; What considerations are needed in the lease code for
mixed versions?<br>
<br>
First, if the <b>REP_CONTROL</b>
structure changes, we need to maintain and use an old version of the
structure for talking to older clients and masters.&nbsp; The
implementation of this would be similar to the way we manage old <b>REP_VOTE_INFO</b>
structures.&nbsp;
Second, any new messages need translation table entries added.&nbsp;
Third, if we are assuming global leases then clearly any mixed versions
cannot have leases configured, and leases cannot be used in mixed
version groups.&nbsp; Maintaining two versions of the control structure
is not necessary if we choose a different style of implementation and
don't change the control structure.<br>
<br>
However, then how could an old application run continuously,
upgrade to the new release, and take advantage of leases without taking
down the entire application?&nbsp; I believe it is possible for clients
to be configured for leases but be subject to the master regarding
leases, yet the master code can assume that if it has leases
configured, all client sites do as well.&nbsp; In several places above
I suggested that a client could make a choice based on either a new <b>REPCTL_LEASE</b>
flag or simply having
leases turned on locally.&nbsp; If we choose to use the flag, then we
can support leases with mixed versions.&nbsp; The upgraded clients can
configure leases, and they simply will not be granted until the old
master is upgraded and sends a PERM message with the flag indicating it
wants a lease grant.&nbsp; Until then the clients, while having leases
configured, will not grant a lease and will simply have an expired
lease.&nbsp; Then, when the old master finally upgrades, it too can
configure leases and suddenly all sites are using them.&nbsp; I believe
this should work just fine, and I will need to make sure a client's
granting of leases is only in response to the master asking for a
grant.&nbsp; If the master never asks, then the client has them
configured, but doesn't grant them.<br>
<h2>Testing</h2>
Clearly any user-facing API changes will need the equivalent reflection
in the Tcl API for testing, under CONFIG_TEST.<br>
<br>
I am sure the list of tests will grow but off the top of my head:<br>
Basic test: have N sites all configure leases, run some,&nbsp; read on
master, etc.<br>
Refresh test: Perform update on master, sleep until past expiration,
read on master and make sure leases are refreshed and the read is successful.<br>
Error test: Test error conditions (reading on client with leases but no
ignore flag, calling after rep_start, etc.)<br>
Read test: Test reading on both client and master both with and without
the IGNORE flag.&nbsp; Test that data read with the ignore flag can be
rolled back.<br>
Dupmaster test: Force a DUPMASTER situation and verify that the newer
master cannot get a DUPMASTER error.<br>
Election test: Call election while grant is outstanding and master
exists.<br>
Call election while grant is outstanding and master does not exist.<br>
Call election after expiration on quiescent system with master
existing.<br>
Run with a group where some members have leases configured and others do
not to make sure we get errors instead of dumping core.<br>
<br>
<small><br>
</small>
</body>
</html>