<!DOCTYPE doctype PUBLIC "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
  <meta http-equiv="Content-Type"
 content="text/html; charset=iso-8859-1">
  <meta name="GENERATOR"
 content="Mozilla/4.76 [en] (X11; U; FreeBSD 4.3-RELEASE i386) [Netscape]">
  <title>Master Lease</title>
</head>
<body>
<center>
<h1>Master Leases for Berkeley DB</h1>
</center>
<center><i>Susan LoVerso</i> <br>
<i>sue@sleepycat.com</i> <br>
<i>Rev 1.1</i><br>
<i>2007 Feb 2</i><br>
</center>
<p><br>
</p>
<h2>What are Master Leases?</h2>
A master lease is a mechanism whereby clients grant mastership rights
to a site, and that master, by holding lease rights, can provide a
guarantee of durability to a replication group for a given period of
time. Once it has granted a lease to a master,
a client will not participate in an election of a new
master until that granted master lease has expired. By holding a
collection of granted leases, a master is able to service
authoritative read requests from applications. While the master holds
leases, a read operation on it can guarantee several things to the
application:<br>
<ol>
  <li>Authoritative reads: a guarantee that the data being read by the
application is durable and can never be rolled back.</li>
  <li>Freshness: a guarantee that the data being read by the
application <b>at the master</b> is
not stale.</li>
  <li>Master viability: a guarantee that a current master with valid
leases will not encounter a duplicate master situation.<br>
  </li>
</ol>
<h2>Requirements</h2>
The requirements of DB to support this include:<br>
<ul>
  <li>After turning leases on, users can choose whether or not reads
ignore them.</li>
  <li>We are providing read authority on the master only. A
read on a client is equivalent to a read while ignoring leases.</li>
  <li>We guarantee that data committed on a master <b>that has been
read by an application on the
master</b> will not be rolled back.
Data read on a client, or
while ignoring leases, <i>or data
successfully updated/committed but not read,</i>
may be rolled back.<br>
  </li>
  <li>A master will not return successfully from a read operation
unless it holds a
majority of leases or leases are ignored.</li>
  <li>Master leases will remove the possibility of a current/correct
master being "shot down" by DUPMASTER. <b>NOTE: Old/expired
masters may, however, discover a
later master and return DUPMASTER to the application.</b><br>
  </li>
  <li>Any send callback failure must result in premature lease
expiration on the master.<br>
  </li>
  <li>Users who change the system clock while master leases are in use void the
guarantee and may get undefined behavior. We assume time always
runs forward. <br>
  </li>
  <li>Clients are forbidden from participating in elections while they
have an outstanding lease granted to another site.</li>
  <li>Clients are forbidden from accepting a new master while they have
an outstanding lease granted to another site.</li>
  <li>Clients are forbidden from upgrading themselves to master while
they have an outstanding lease granted to another site.</li>
  <li>When asked for a lease grant explicitly by the master, the client
cannot grant the lease to the master unless the LSN in the master's
request has been processed by this client.<br>
  </li>
</ul>
The requirements of the
application using leases include:<br>
<ul>
  <li>Users must implement (Base API users on their own, RepMgr users
via configuration) a majority (or larger) ACK policy. <br>
  </li>
  <li>The application must use the election mechanism to decide a master.
It may not simply declare a site master.</li>
  <li>The send callback must return an error if the majority ACK policy
is not met for PERM records.</li>
  <li>Users must set the number of sites in the group.</li>
  <li>Using leases in a replication group is all-or-none.
Therefore, if a site knows it is using leases, it can assume other
sites are also.<br>
  </li>
  <li>All applications that care about read guarantees must forward or
perform all reads on the master. Reading on the client means a
read ignoring leases. </li>
</ul>
<p>There are some open questions
remaining.</p>
<ul>
  <li>There is one major showstopper issue; see Crashing - Potential
problem near the end of the document. We need a better solution
than the one shown there (writing to disk every time a lease is
granted). Perhaps just documenting that durability means it must be
flushed to disk before success to avoid that situation?<br>
  </li>
  <li>What about db->join? Users can call join, but the calls
on the join cursor to get the data would be subject to leases and
therefore protected. Ok, this is not an open question.</li>
  <li>What about other read-like operations? Clearly <i>
DB->get, DB->pget, DBC->get,
DBC->pget</i> need lease checks. However, other APIs use
keys. <i>DB->key_range</i>
provides an estimate only so it shouldn't need lease checks. <i>
DB->stat</i> provides exact counts
in the <i>bt_nkeys</i> and <i>bt_ndata</i> fields. Are those
fields authoritative enough that providing their values implies a
durability guarantee, and therefore <i>DB->stat</i>
should be subject to lease verification? <i>DBC->count</i>
provides a count of
the number of data items associated with a key. Is this
authoritative information? This is similar to stat - should it be
subject to lease verification?<br>
  </li>
  <li>Do we require master lease checks on write operations? I
think lease checks are not needed on write operations. They don't
add correctness and they add a lot of complexity (checking leases in put,
del, and cursors, then what about rename, remove, etc).<br>
  </li>
  <li>Do master leases give an iron-clad guarantee of never rolling
back a transaction?
No, but it should mean that a committed transaction
can never be <b>read</b> on a master
unless the lease is valid. A committed transaction on a master
that has never been presented to the application may get rolled back.<br>
  </li>
  <li>Do we need to quarantine or prevent reads on an ex-master until
sync-up is done? No. A master that is simply downgraded to
client, or crashes and reboots, is now a client. Reading from that
client is the same as saying Ignore Leases.</li>
  <li>What about adding and removing sites while leases are
active? This is SR 14778. A consistent <i>nsites</i> value
is required by master
leases. It isn't
clear to me what a master is
supposed to do if the value of nsites gets smaller while leases are
active. Perhaps it leaves its larger table intact and simply
checks for a smaller number of granted leases?<br>
  </li>
  <li>Can users turn leases off? No. There is no planned <i>turn
leases off</i> API.</li>
  <li>Clock skew will be a percentage. However, the smallest, 1%,
is probably rather large for clock skew. Percentage was chosen
for simplicity and similarity to other APIs. What granularity is
appropriate here?</li>
</ul>
<h2>API Changes</h2>
The API changes that are visible
to the user are fairly minimal.
There are a few API calls they need to make to configure master leases
and then there is the API call to turn them on. There is also a
new flag to existing APIs to allow read operations to ignore leases and
return data that
may potentially be non-durable.<br>
<h3>Lease Timeout<br>
</h3>
There is a new timeout the user
must configure for leases called <b>DB_REP_LEASE_TIMEOUT</b>.
This timeout will be a new option to
the <i>dbenv->rep_set_timeout</i> method.
The <b>DB_REP_LEASE_TIMEOUT</b>
has no default and it is required that the user configure a timeout
before they turn on leases (obviously, this timeout need not be set if
leases will not be used). That timeout is the amount of time
the lease is valid on the master and how long it is granted
on the client. This timeout must be the same
value on all sites (like log file size). The timeout used when
refreshing leases is the <b>DB_REP_ACK_TIMEOUT</b>
for RepMgr applications. For Base API applications, lease
refreshes will use the same mechanism as <b>PERM</b> messages and they
should
have no additional burden. The refresh timeout
is the amount of time a reader will wait to refresh
leases before returning failure to the application from a read
operation.<br>
<br>
The lease timeout will be stored both
with its original value and also
converted to a <i>db_timespec</i>
using the <b>DB_TIMEOUT_TO_TIMESPEC</b>
macro, with the clock skew accounted for, in the shared
rep structure:<br>
<pre>db_timeout_t lease_timeout;<br>db_timespec lease_duration;<br></pre>
NOTE: By sending the lease refresh during DB operations, we are
forcing/assuming that the operation's process has a replication
transport function set. That is obviously the case for write
operations, but would it be a burden for read processes (on a
master)? I think mostly not, but if we need leases for <i>
DB->stat</i> then we need to
document it, as it is certainly possible for an application to have a
separate or dedicated <i>stat</i>
application or attempt to use <i>db_stat</i>
(which will not work if leases must be checked).<br>
<br>
Leases should be checked after the local operation so that we don't
have the window/boundary that would exist if we were to check leases first, get
descheduled, then lose our lease and then perform the operation.
Do the operation, then check leases before returning to the user.<br>
<h3>Using Leases</h3>
There is a new API that the user must call to tell the system to use
the lease mechanism. The method must be called before the
application calls <i>dbenv->rep_start</i>
or <i>dbenv->repmgr_start</i>.
This new
method is:<br>
<br>
<pre> dbenv->rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)<br></pre>
The <i>clock_scale_factor</i>
parameter is interpreted as a percentage, greater than 100 (to transmit
a floating point number as an integer to the API), that represents the
maximum skew between any two sites' clocks. That is, a <span
 style="font-style: italic;">clock_scale_factor</span> of 150 suggests
that the greatest discrepancy between clocks is that one runs 50%
faster than the others. Both the
master and client sides
compensate for possible clock skew. The master uses the value to
compensate in case the replica has a slow clock and replicas compensate
in case they have a fast clock. This scaling factor will need to
be divided by 100 on all sites to truly represent the percentage for
adjustments made to time values.<br>
<br>
Assume the slowest replica's clock is a factor of <i>clock_scale_factor</i>
slower than the
fastest clock. Using that assumption, if the fastest clock goes
from time t1 to t2 in X
seconds, the slowest clock does it in (<i>clock_scale_factor</i> / 100)
* X seconds.<br>
<br>
The <i>flags</i> parameter is not
currently used.<br>
<br>
When the <i>dbenv->rep_set_lease</i>
method is called, we will set a configuration flag indicating that
leases are turned on:<br>
<b>#define REP_C_LEASE &lt;value&gt;</b>.
We will also record the <b>u_int32_t
clock_skew</b> value passed in. The <i>rep_set_lease</i> method
will not allow
calls after <i>rep_start.
</i>If
multiple calls are made prior to calling <i>rep_start</i> then later
calls will
overwrite the earlier clock skew value. <br>
<br>
We need a new flag to prevent calling <i>rep_set_lease</i>
after <i>rep_start</i>. The
simplest solution would be to reject the call to
<i>rep_set_lease
</i>if <b>REP_F_CLIENT</b>
or <b>REP_F_MASTER</b> is set.
However, that does not work in the cases where a site cleanly closes its
environment and then opens without running recovery. The
replication state will still be set. The prevention will be
implemented as:<br>
<pre>#define REP_F_START_CALLED &lt;some bit value&gt;<br></pre>
In __rep_start, at the end:<br>
<pre>if (ret == 0) {<br>    REP_SYSTEM_LOCK(dbenv);<br>    F_SET(rep, REP_F_START_CALLED);<br>    REP_SYSTEM_UNLOCK(dbenv);<br>}</pre>
In <i>__rep_env_refresh</i>, if we
are the last reference closing the env (we already check for that):<br>
<pre>F_CLR(rep, REP_F_START_CALLED);</pre>
In order to avoid run-time floating point operations
on <i>db_timespec</i> structures,
when a site is declared as a client or master in <i>rep_start</i> we
will pre-compute the
lease duration based on the integer-based clock skew and the
integer-based lease timeout. A master should set a replica's
lease expiration to the <b>start time of
the sent message +
(lease_timeout / clock_scale_factor)</b> in case the replica has a
slow clock. Replicas extend their leases to <b>received message
time + (lease_timeout *
clock_scale_factor)</b> in case this replica has a fast clock.
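<p>As a worked illustration of the two formulas above, the following C sketch computes both durations using integer arithmetic only. The helper names and the choice of microseconds as the unit are assumptions made for this example; they are not part of Berkeley DB.</p>

```c
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.  Timeouts are in
 * microseconds and clock_scale_factor is a percentage (100 or greater),
 * so it is divided by 100 as the text describes.
 */

/* Master side: shorten the lease in case the replica's clock runs slow. */
static uint64_t
master_lease_duration(uint32_t lease_timeout, uint32_t clock_scale_factor)
{
	return (((uint64_t)lease_timeout * 100) / clock_scale_factor);
}

/* Client side: lengthen the grant in case this replica's clock runs fast. */
static uint64_t
client_lease_duration(uint32_t lease_timeout, uint32_t clock_scale_factor)
{
	return (((uint64_t)lease_timeout * clock_scale_factor) / 100);
}
```

<p>With a 3-second lease timeout and a <i>clock_scale_factor</i> of 150, the master would treat a lease as valid for 2 seconds after it sent the message, while a client would honor its grant for 4.5 seconds after receiving it.</p>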
Therefore, the computation will be as follows if the site is becoming a
master:<br>
<pre>db_timeout_t tmp;<br>tmp = (db_timeout_t)((double)rep->lease_timeout / ((double)rep->clock_skew / (double)100));<br>rep->lease_duration = DB_TIMEOUT_TO_TIMESPEC(&tmp);<br></pre>
Similarly, on a client the computation is:<br>
<pre>tmp = (db_timeout_t)((double)rep->lease_timeout * ((double)rep->clock_skew / (double)100));<br></pre>
When a site changes state, its lease duration will change based on
whether it is becoming a master or client, and it will be recomputed
from the original values. Note that these computations, coupled
with the fact that the lease on the master is computed based on the
time at which the master sent the message, mean that leases on the master
are more conservatively computed than on the clients.<br>
<br>
The <i>dbenv->rep_set_lease</i>
method must be called after <i>dbenv->open</i>,
similar to <i>dbenv->rep_set_config</i>.
The reason is so that we can check that this is a replication
environment and we have access to the replication shared memory region.<br>
<h3>Read Operations<br>
</h3>
Authoritative read operations on the master with leases enabled will
abide by leases by default. We will provide a flag that allows an
operation on a master to ignore leases. <b>All read operations
on a client imply
ignoring leases.</b> If an application wants authoritative reads
it must forward the read requests to the master, and it is the
application's responsibility to provide the forwarding.
The consensus was that forcing <span style="font-weight: bold;">DB_IGNORE_LEASE</span>
on client read operations (with leases enabled, obviously) was too
heavy-handed.
Read operations on the client will ignore leases,
but do no special flag checking.<br>
<br>
The flag will be called <b>DB_IGNORE_LEASE</b>
and it will be a flag that can be OR'd into the DB access method and
cursor operation values. It will be similar to the <b>DB_READ_UNCOMMITTED</b>
flag.
<br>
The methods that will
adhere to leases are:<br>
<ul>
  <li><i>DB->get</i></li>
  <li><i>DB->pget</i></li>
  <li><i>DBC->get</i></li>
  <li><i>DBC->pget</i></li>
</ul>
The code that will check leases for a client reading would look
something
like this, if we decide to become heavy-handed:<br>
<pre>if (IS_REP_CLIENT(dbenv)) {<br>    [get to rep structure]<br>    if (FLD_ISSET(rep->config, REP_C_LEASE) && !LF_ISSET(DB_IGNORE_LEASE)) {<br>        db_err("Read operations must ignore leases or go to master");<br>        ret = EINVAL;<br>        goto err;<br>    }<br>}<br></pre>
On the master, the new code to abide by leases is more complex.
After the call to perform the operation we will check the lease.
In that checking code, the master will see if it has a valid
lease. If so, then all is well. If not, it will try to
refresh the leases. If that refresh attempt results in leases,
all is well. If the refresh attempt does not get leases, then the
master cannot respond to the read as an authority and we return an
error. The new error is called <b>DB_REP_LEASE_EXPIRED</b>.
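<p>The check-then-refresh flow just described could be sketched roughly as follows. This is a simplified, self-contained model and not the Berkeley DB implementation: the <i>lease_slot</i> structure, the <i>refresh</i> callback, <i>demo_refresh</i> and the <i>LEASE_EXPIRED</i> value are hypothetical stand-ins, and the sketch assumes the master counts itself toward the majority, so that valid grants from <i>nsites</i>/2 clients suffice.</p>

```c
#include <stddef.h>
#include <time.h>

#define LEASE_EXPIRED	(-1)	/* Stand-in for DB_REP_LEASE_EXPIRED. */
#define SLOT_EMPTY	(-1)	/* Hypothetical marker for an unused slot. */

struct lease_slot {		/* Hypothetical, simplified lease table slot. */
	int eid;		/* Client EID, or SLOT_EMPTY. */
	time_t end_time;	/* Master's view of when this lease expires. */
};

/* Count the slots whose lease has not yet expired at time 'now'. */
static size_t
valid_leases(const struct lease_slot *tab, size_t n, time_t now)
{
	size_t i, valid = 0;

	for (i = 0; i < n; i++)
		if (tab[i].eid != SLOT_EMPTY && tab[i].end_time > now)
			valid++;
	return (valid);
}

/*
 * Test stand-in for a refresh: pretends that every known client
 * re-granted its lease for another 10 seconds.
 */
static int
demo_refresh(struct lease_slot *tab, size_t n, time_t now)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].eid != SLOT_EMPTY)
			tab[i].end_time = now + 10;
	return (0);
}

/*
 * Check the leases; if they are not valid, refresh once and recheck.
 * 'refresh' models resending the last PERM record and returns 0 if
 * the send succeeded.
 */
static int
lease_check(struct lease_slot *tab, size_t nsites, time_t now,
    int (*refresh)(struct lease_slot *, size_t, time_t))
{
	size_t needed = nsites / 2;	/* Assumption: master counts itself. */

	if (valid_leases(tab, nsites, now) >= needed)
		return (0);
	if (refresh == NULL || refresh(tab, nsites, now) != 0)
		return (LEASE_EXPIRED);
	/* Grants may still be in flight, so look at the table again. */
	return (valid_leases(tab, nsites, now) >= needed ? 0 : LEASE_EXPIRED);
}
```

<p>A read path would call something like <i>lease_check</i> only after the local read has succeeded, and only when <b>DB_IGNORE_LEASE</b> was not specified.</p>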
The master lease check is located after the internal call
to read the data has succeeded:<br>
<pre>if (IS_REP_MASTER(dbenv) && !LF_ISSET(DB_IGNORE_LEASE)) {<br>    [get to rep structure]<br>    if (FLD_ISSET(rep->config, REP_C_LEASE) &&<br>        (ret = __rep_lease_check(dbenv)) != 0) {<br>        /*<br>         * We don't hold the lease.<br>         */<br>        goto err;<br>    }<br>}<br></pre>
See below for the details of <i>__rep_lease_check</i>.<br>
<br>
Also note that if leases (or replication) are not configured, then <span
 style="font-weight: bold;">DB_IGNORE_LEASE</span> is a no-op. It
is ignored (and won't error) if used when leases are not in
effect. The reason is so that we can generically set that flag in
utility programs like <span style="font-style: italic;">db_dump</span>
that walk the database with a cursor. Note that <span
 style="font-style: italic;">db_dump</span> is the only utility that
reads with a cursor.<br>
<h3><b>Nsites
and Elections</b></h3>
The call to <i>dbenv->rep_set_nsites</i>
must be performed before the call to <i>dbenv->rep_start</i>
or <i>dbenv->repmgr_start</i>.
This document assumes either that <b>SR
14778</b> gets resolved, or that the value of <i>nsites</i> is
immutable. The
master and all clients need to know how many sites and leases are in
the group. Clients need to know for elections. The master
needs to know for the size of the lease table and to know what value a
majority of the group is. <b>[Until
14778 is resolved, the master lease work must assume <i>nsites</i> is
immutable and will
therefore enforce that this is called before <i>rep_start</i> using
the same mechanism
as <i>rep_set_lease</i>.]</b><br>
<br>
Elections and leases need to agree on the number of sites in the
group.
Therefore, when leases are in effect on clients, all calls
to <i>dbenv->rep_elect</i> must
set the <i>nsites</i> parameter to
0. The <i>rep_elect</i> code
path will return <b>EINVAL</b> if <b>REP_C_LEASE</b> is set and <i>nsites</i>
is non-0.
<h2>Lease Management</h2>
<h3>Message Changes</h3>
In order for clients to grant leases to the master, a new message type
must be added for that purpose. This will be the <b>REP_LEASE_GRANT</b>
message.
Granting leases will be a result of applying a <b>DB_REP_PERMANENT</b>
record and therefore we
do not need any additional message in order for a master to request a
lease grant. The <b>REP_LEASE_GRANT</b>
message will pass a structure as its message DBT:<br>
<pre>typedef struct __rep_lease_grant {<br>    db_timespec msg_time;<br>#ifdef DIAGNOSTIC<br>    db_timespec expire_time;<br>#endif<br>} REP_GRANT_INFO;<br></pre>
In the <b>REP_LEASE_GRANT</b>
message, the client is actually giving the master several pieces of
information. We only need the echoed <i>msg_time</i> in this
structure because
everything else is already sent. The client is really sending the
master:<br>
<ul>
  <li>Its EID (parameter to <span style="font-style: italic;">rep_send_message</span>
and <span style="font-style: italic;">rep_process_message</span>)<br>
  </li>
  <li>The PERM LSN this message acknowledged (sent in the control
message)</li>
  <li>Unique identifier echoed back to master (<i>msg_time</i> sent in
message as above)</li>
</ul>
The client already maintains the maximum PERM LSN in <i>lp->max_perm_lsn</i>.
<h3>Local State Management</h3>
Each client must maintain a <i>db_timespec</i>
timestamp containing the expiration of its granted lease. This
field will be in the replication shared memory structure:<br>
<pre>db_timespec grant_expire;<br></pre>
This timestamp already takes into account the clock skew.
All
new fields must be initialized when the region is created. Whenever we
grant our master lease and want to send the <b>REP_LEASE_GRANT</b>
message, this value
will be updated. It will be used in the following way:
<pre>db_timespec mytime;<br>DB_LSN perm_lsn;<br>DBT lease_dbt;<br>REP_GRANT_INFO gi;<br><br>timespecclear(&mytime);<br>memset(&lease_dbt, 0, sizeof(lease_dbt));<br>memset(&gi, 0, sizeof(gi));<br>/* New expiration is now plus the skew-adjusted lease duration. */<br>__os_gettime(dbenv, &mytime);<br>timespecadd(&mytime, &rep->lease_duration);<br>MUTEX_LOCK(rep->clientdb_mutex);<br>perm_lsn = lp->max_perm_lsn;<br>MUTEX_UNLOCK(rep->clientdb_mutex);<br>REP_SYSTEM_LOCK(dbenv);<br>/* Never move an existing grant expiration backward. */<br>if (timespeccmp(&mytime, &rep->grant_expire, >))<br>    rep->grant_expire = mytime;<br>/* Echo the master's time-of-msg ID back unchanged. */<br>gi.msg_time = msg->msg_time;<br>#ifdef DIAGNOSTIC<br>gi.expire_time = rep->grant_expire;<br>#endif<br>lease_dbt.data = &gi;<br>lease_dbt.size = sizeof(gi);<br>REP_SYSTEM_UNLOCK(dbenv);<br>__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &perm_lsn, &lease_dbt, 0, 0);<br></pre>
This updating of the lease grant will occur in the <b>PERM</b> code
path when we have
successfully applied the permanent record.<br>
<h3>Maintaining Leases on the
Master/Rep_start</h3>
The master maintains a lease table that it checks when fulfilling a
read request that is subject to leases. This table is initialized
when a site calls<i>
dbenv->rep_start(DB_MASTER)</i> and the site is undergoing a role
change (i.e. a master making additional calls to <i>dbenv->rep_start(DB_MASTER)</i>
does
not affect an already existing table).<br>
<br>
When a non-master site becomes master, it must do two things related to
leases on a role change. First, a client cannot upgrade to master
while it has an outstanding lease granted to another site. If a
client attempts to do so, an error, <b>EINVAL</b>,
will be returned.
The only way this should happen is if the
application simply declares a site master, instead of using
elections. Elections will already wait for leases to expire
before proceeding. (See below.)
<br>
<br>
Second, once we are proceeding with becoming a master, the site must
allocate the table it will use to maintain lease information.
This table will be sized based on <i>nsites</i>
and it will be an array of the following structure:<br>
<pre>typedef struct {<br>    int eid;                /* EID of client site. */<br>    db_timespec start_time; /* Unique time ID client echoes back on grants. */<br>    db_timespec end_time;   /* Master's lease expiration time. */<br>    DB_LSN lease_lsn;       /* Durable LSN this lease applies to. */<br>    u_int32_t flags;        /* Unused for now?? */<br>} REP_LEASE_ENTRY;<br></pre>
<h3>Granting Leases</h3>
It is the burden of the application to make sure that all sites in the
group
are using leases, or none are. Therefore, when a client processes
a <b>PERM</b>
log record that arrived from the master, it will grant its lease
automatically if that record is permanent (i.e. <b>DB_REP_ISPERM</b>
is being returned),
and leases are configured. A client will not send a
lease grant when it is processing log records (even <b>PERM</b>
ones) it receives from other clients that use client-to-client
synchronization. The reason is that the master requires a unique
time-of-msg ID (see below) that the client echoes back in its lease
grant and it will not have such an ID from another client.<br>
<br>
The master stores a time-of-msg ID in each message and the client
simply echoes it back to the master. In its lease table, the master
keeps the base
time-of-msg for each valid lease.
When a <b>REP_LEASE_GRANT</b>
message comes in,
the master does a number of things:<br>
<ol>
  <li>Pulls the echoed timespec from the client message into <i>msg_time</i>.<br>
  </li>
  <li>Finds the entry in its lease table for the client's EID. It
walks the table searching for the ID. EIDs of <span
 style="font-weight: bold;">DB_EID_INVALID</span> are
illegal. Either the master will find the entry, or it will find
an empty slot in the table (i.e. it is still populating the table with
leases).</li>
  <li>If this is a previously unknown site lease, the master
initializes the entry by copying to the <i>eid</i>, <i>start_time, </i>and
 <i>lease_lsn</i> fields. The master
also computes the <i>end_time</i>
based on the adjusted <i>rep->lease_duration</i>.</li>
  <li>If this is a lease from a previously known site, the master must
perform <i>timespeccmp(&msg_time,
&table[i].start_time, >)</i> and only update the <i>end_time</i>
of the lease when this is
a more recent message. If it is a more recent message, then we
should update
the <i>lease_lsn</i> to the LSN in
the message.</li>
  <li>Lease durations are computed taking the clock skew into
account: clients compute them based on the current time and the master
computes them based on the original sending time. For diagnostic purposes
only, I also plan to send the client's expiration time. The
client errs on the side of computing a larger lease expiration time and
the master errs on the side of computing a smaller duration.
Since both are taking the clock skew
into account, the client's ending expiration time should never be
smaller than
the master's computed expiration time; if it is, their value for clock skew may
not be correct.<br>
  </li>
</ol>
Any log records (new or resent) that originate from the master and
result in <b>DB_REP_ISPERM</b> get an
ack.<br>
<br>
<h3>Refreshing Leases</h3>
Leases get refreshed when a master receives a <b>REP_LEASE_GRANT</b>
message from a client. There are three pieces to lease
refreshment. <br>
<h4>Lazy Lease Refreshing on Read<br>
</h4>
If the master discovers that leases are
expired during the read operation, it attempts to refresh its
collection of lease grants. It does this by calling a new
function <i>__rep_lease_refresh</i>.
This function is very similar to the already-existing function <i>__rep_flush</i>.
Basically, to
refresh the lease, the master simply needs to resend the last PERM
record to the clients. The requirements state that when the
application send function returns successfully from sending a PERM
record, the majority of clients have that PERM LSN durable. We
will have a new public DB error return called <b>DB_REP_LEASE_EXPIRED</b>
that will be
returned to the caller if the master cannot assert its
authority.
The code will look something like this:<br>
<pre>/*<br> * Use lp->max_perm_lsn on the master (currently not used on the master)<br> * to keep track of the last PERM record written through the logging system.<br> * Need to initialize lp->max_perm_lsn in rep_start on role_chg.<br> */<br>call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT<br>if failure<br>    expire leases<br>    return lease expired error to caller<br>else /* success */<br>    recheck lease table<br>    /*<br>     * We need to recheck the lease table because the client<br>     * lease grant messages may not be processed yet, or got<br>     * lost, or are racing with the application's ACK messages or<br>     * whatever.<br>     */<br>    if we have a majority of valid leases<br>        return success<br>    else<br>        return lease expired error to caller<br></pre>
<h4>Ongoing Update Refreshment<br>
</h4>
Second is having the master indicate to
the client that it needs to send a lease grant in response to the current
PERM log message. The problem is
that acknowledgements must contain a master-supplied message timestamp
that the client sends back to the master. We need to modify the
structure of the log record messages when leases are configured
so
that when a PERM message is sent, the master sends, and the client
expects, the message timestamp. There are three fairly
straightforward and different implementations to consider.<br>
<ol>
  <li>Adding the timestamp to the <b>REP_CONTROL</b>
structure. If this option is chosen, then the code trivially
sends back the timestamp in the client's reply. There is no
special processing done by either side with the message contents.
So, on a PERM log record, the master will send a non-zero
timestamp. On a normal log record the timestamp will be zero or
some known invalid value.
If the client sees a non-zero
timestamp, it sends a <b>REP_LEASE_GRANT</b>
with the <i>lp->max_perm_lsn</i>
after applying that log record. If it is zero, then the client
does nothing different. The advantage is ease of coding. The
disadvantage is that for mixed version systems, the client is now
dealing with different sized control structures. We would have to
retain the old control structure so that in a mixed version group
the (upgraded) clients can use, expect and send old control structures
to the master. This is unfortunate, so let's consider additional
implementations that don't require modifying the control structure.<br>
  </li>
  <li>Adding a new <b>REPCTL_LEASE</b>
flag to the list of flags for the control structure, without changing
the control structure fields. When a master wants to send a
message that needs a lease ack, it sets the flag. Additionally,
instead of simply sending a log record DBT as the <i>rec</i> parameter
for replication, we
would send a new structure that had the timestamp first and then the
record (similar to the bulk transfer buffer). The advantage of
this is that the control structure does not change. Disadvantages
include more special-cased code in the normal code path where we have
to check the flag. If the flag is set we have to extract the
timestamp value and massage the incoming data to pass on the real log
record to <i>rep_apply</i>. On
bulk transfer, we would just add the timestamp into the buffer.
On normal transfers, it would incur an additional data copy on the
master side. That is unfortunate.
Additionally, if this
record needs to be stored in the temp db, we need some way to get it
back again later or <span style="font-style: italic;">rep_apply</span>
would have to extract the timestamp when it processed the record
(either live or from the temp db).<br>
  </li>
  <li>Adding a different message type, such as <b>REP_LOG_ACK</b>.
Similarly to <b>REP_LOG_MORE</b>, this message would be a
special-case version of a log record. We would extract the
timestamp and then handle it as a normal log record. This
implementation is rejected because it actually would require three new
message types: <b>REP_LOG_ACK,
REP_LOG_ACK_MORE, REP_BULK_LOG_ACK</b>. That is just too ugly
to contemplate.</li>
</ol>
<b>[Slight digression:</b> it occurs
to me while writing about #2 and #3 above, that our implementation of
all of the *_MORE messages could really be implemented with a <b>REPCTL_MORE</b>
flag instead of a
separate message type. We should clean that up and simplify the
messages, but not as part of the master lease work. Hmm, taking that thought
process further, we really could get rid of the <b>REP_BULK_*</b>
messages as well if we
added a <b>REPCTL_BULK</b>
flag. I think we should definitely do it for the *_MORE
messages. I am not sure we should do it for bulk because the
structure of the incoming data record is vastly different.]<br>
<br>
Of these options, I believe that modifying the control structure is the
best alternative. The handling of the old structure will be very
isolated to code dealing with old versions and is far less complicated
than injecting the timestamp into the log record DBT and doing a data
copy. Actually, I will likely combine #1 and the flag from #2
above. I will have the <b>REPCTL_LEASE</b>
flag that indicates a lease grant reply is expected and have the
timestamp in the control structure.
Also I will probably add in a spare field or two for future use in the <b>REP_CONTROL</b>
structure.<br>
<h4>Gap processing</h4>
No matter which implementation we choose for ongoing lease refreshment,
gap processing must be considered. The code above assumes the
timestamps will be placed on PERM records only. Normal log
records will not have a timestamp, nor a flag or anything else like
that. However, any log message can fill a gap on a client and
result in the processing of that normal log record returning <b>DB_REP_ISPERM</b>
because later records
were also processed.<br>
<br>
The current implementation should work fine in that case because when
we store the message in the client temp db we store both the control
DBT and the record DBT. Therefore, when a normal record fills a
gap, the later PERM record, when retrieved, will look just like it did
when it arrived. The client will have access to the LSN, the
timestamp, etc. However, it does mean that sending the <b>REP_LEASE_GRANT</b>
message must take
place down in <i>__rep_apply</i>
because that is the only place we have access to the contents of those
stored records with the timestamps.<br>
<br>
There are two logical choices to consider for granting the lease when
processing an update. First, as we process a PERM message (either a live record or one
read from the temp db after filling a gap), we send the <b>REP_LEASE_GRANT</b>
message for each
PERM record we successfully apply. Second, we keep track of
the largest timestamp of all PERM records we've processed and at the
end of the function, after we've applied all records, we send back a
single lease grant with the <i>max_perm_lsn</i>
and a new <i>max_lease_timestamp</i>
value to the master.
The first is easier to implement; the second
results in possibly slightly fewer messages at the expense of more
bookkeeping on the client.<br>
<br>
A third, more complicated option would be to have the message timestamp
on all records, but send grants only on the PERM messages. A
reason to do this is that the later timestamp of a normal log record
would be used as the timestamp sent in the reply and the master would
get a more up-to-date timestamp value and a longer lease.<br>
<br>
If we change the <span style="font-weight: bold;">REP_CONTROL</span>
structure to include the timestamp, we potentially break or at least
need to revisit the gap processing algorithm. That code assumes
that the control and record elements for the same LSN look the same
each and every time. The code stores the <span
 style="font-style: italic;">control</span> DBT as the key and the <span
 style="font-style: italic;">rec</span> DBT as the data. We use a
specialized compare function to sort based on the LSN in the control
DBT. With master leases, the same record for a given LSN, transmitted
multiple times by a master or by a client, will be different because the
timestamp field will not be the same. Therefore, the client will
end up with duplicate entries in the temp database for the same
LSN. Both solutions (adding the timestamp to <span
 style="font-weight: bold;">REP_CONTROL</span> and adding a <span
 style="font-weight: bold;">REPCTL_LEASE</span> flag) can yield
duplicate entries. The flag would cause the same record from the
master and client to be different as well.<br>
<h4>Handling Incoming Lease Grants<br>
</h4>
The third piece of lease management is handling the incoming <b>REP_LEASE_GRANT</b>
message on the
master.
When this message is received, the master must do the
following:<br>
<pre>REP_SYSTEM_LOCK<br>msg_timestamp = cntrl->timestamp;<br>client_lease = __rep_lease_entry(dbenv, client eid)<br>if (client_lease == NULL) {<br>    initial lease for this site, DB_ASSERT there is space in the table<br>    add this to the table if there is space<br>} else {<br>    compare msg_timestamp with client_lease->start_time<br>    if (msg_timestamp is more recent && msg_lsn >= lease LSN)<br>        update entry in table<br>}<br>REP_SYSTEM_UNLOCK<br></pre>
<h3>Expiring Leases</h3>
Leases can expire in two ways. First, they can expire naturally
due to the passage of time. When checking leases, if the current
time is later than the lease entry's <i>end_time</i>
then the lease is expired. Second, they can be forced into
premature expiration when the application's transport function returns
an error. In the first case, there is nothing to do; in the
second case we need to manipulate the <i>end_time</i>
so that all future lease checks fail. Since the lease <i>start_time</i>
is guaranteed to not be in the future we will have a function <i>__rep_lease_expire</i>
that will:<br>
<pre>REP_SYSTEM_LOCK<br>for each entry in the lease table<br>    entry->end_time = entry->start_time;<br>REP_SYSTEM_UNLOCK<br></pre>
Is there a potential race or problem with prematurely expiring
leases? Consider an application that enforces an ALL
acknowledgement policy for PERM records in its transport
callback. There are four clients and three send the PERM ack to
the application. The callback returns an error to the master DB
code. The DB code will now prematurely expire its leases.
However, at approximately the same time the three clients are also
sending their <span style="font-weight: bold;">REP_LEASE_GRANT</span>
messages to the master. There is a race between the master
processing those messages and the thread handling the callback failure
expiring the table.
This is only an issue if the messages arrive
after the table has been expired.<br>
<br>
Let's assume all three clients send their grants after the master
expires the table. If we accept those grants and then a read
occurs, the read will succeed since the master has a majority of leases
even though the callback failed earlier. Is that a problem?
The lease code is using a majority and the application policy is using
some other value. It feels like this should be okay since
the data is held by leases on a majority. Should we consider
having the lease checking threshold be the same as the permanent ack
policy? That is difficult because Base API users implement
whatever they want and DB does not know what it is.<br>
<h3>Checking Leases</h3>
When a read operation on the master completes, the last thing we need
to do is verify the master leases. We've already discussed above
refreshing them when they are expired. We need two things
for a lease to be valid. It must be within the timeframe of the
lease grant and the lease must be valid for the last PERM record
LSN.
Here is the logic
for checking the validity of leases in <i>__rep_lease_check</i>:<br>
<pre>#define MAX_REFRESH_TRIES 3<br>DB_LSN lease_lsn;<br>REP_LEASE_ENTRY *entry;<br>u_int32_t min_leases, valid_leases;<br>db_timespec cur_time;<br>int ret, tries;<br><br>    tries = 0;<br>retry:<br>    ret = 0;<br>    LOG_SYSTEM_LOCK<br>    lease_lsn = lp->lsn;<br>    LOG_SYSTEM_UNLOCK<br>    REP_SYSTEM_LOCK<br>    min_leases = rep->nsites / 2;<br>    __os_gettime(dbenv, &cur_time);<br>    for (entry = head of table, valid_leases = 0; entry != NULL && valid_leases < min_leases; entry++)<br>        if (timespec_cmp(&entry->end_time, &cur_time) >= 0 && log_compare(&entry->lsn, &lease_lsn) == 0)<br>            valid_leases++;<br>    REP_SYSTEM_UNLOCK<br>    if (valid_leases < min_leases) {<br>        ret = __rep_lease_refresh(dbenv, ...);<br>        /*<br>         * If we are successful, we need to recheck the leases because<br>         * the lease grant messages may have raced with the PERM<br>         * acknowledgement. Give those messages a chance to arrive.<br>         */<br>        if (ret == 0) {<br>            if (tries <= MAX_REFRESH_TRIES) {<br>                /*<br>                 * If we were successful sending, but not successful in racing the<br>                 * message thread, yield the processor so that message<br>                 * threads may have a chance to run.<br>                 */<br>                if (tries > 0)<br>                    /* __os_sleep instead?? */<br>                    __os_yield();<br>                tries++;<br>                goto retry;<br>            } else<br>                ret = DB_RET_LEASE_EXPIRED;<br>        }<br>    }<br>    return (ret);</pre>
If the master has enough valid leases it returns success. If it
does not have enough, it attempts to refresh them. This attempt
may fail if sending the PERM record does not receive sufficient
acks. If we do receive sufficient acknowledgements we may still
find that scheduling of message threads means the master hasn't yet
processed the incoming <b>REP_LEASE_GRANT</b>
messages. We will retry a couple of times (possibly
parameterized) if the master discovers that situation.
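<br>
<br>
The counting step of <i>__rep_lease_check</i> and the premature expiration done by <i>__rep_lease_expire</i> can be illustrated with a small standalone sketch. This is not the DB implementation: <i>lsn_t</i>, <i>lease_entry_t</i>, <i>leases_valid</i> and <i>leases_expire</i> are hypothetical names, times are plain integer ticks rather than <i>db_timespec</i>, and locking and the refresh/retry loop are left out. It assumes, as in the pseudocode above, that the master needs <i>nsites / 2</i> valid client leases and that a lease counts only if it is unexpired and matches the master's current LSN.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for DB_LSN. */
typedef struct {
    uint32_t file;
    uint32_t offset;
} lsn_t;

/* Hypothetical lease table entry; times are integer ticks. */
typedef struct {
    int     eid;         /* client that granted the lease */
    int64_t start_time;  /* master timestamp echoed in the grant */
    int64_t end_time;    /* start_time plus the lease timeout */
    lsn_t   lsn;         /* LSN the grant covers */
} lease_entry_t;

static int
lsn_equal(const lsn_t *a, const lsn_t *b)
{
    return (a->file == b->file && a->offset == b->offset);
}

/*
 * Return 1 if at least nsites / 2 entries are unexpired at time
 * `now` and cover `lease_lsn`, otherwise 0.
 */
int
leases_valid(const lease_entry_t *tab, size_t nentries, uint32_t nsites,
    const lsn_t *lease_lsn, int64_t now)
{
    uint32_t min_leases, valid_leases;
    size_t i;

    assert(tab != NULL || nentries == 0);
    min_leases = nsites / 2;
    valid_leases = 0;
    /* Stop scanning as soon as enough valid leases are found. */
    for (i = 0; i < nentries && valid_leases < min_leases; i++)
        if (tab[i].end_time >= now && lsn_equal(&tab[i].lsn, lease_lsn))
            valid_leases++;
    return (valid_leases >= min_leases);
}

/*
 * Premature expiration: end each lease at its own start time, so a
 * check at any later time treats it as expired.
 */
void
leases_expire(lease_entry_t *tab, size_t nentries)
{
    size_t i;

    for (i = 0; i < nentries; i++)
        tab[i].end_time = tab[i].start_time;
}
```

With five sites, the master in this sketch needs two valid client leases. Calling <i>leases_expire</i> after a send callback failure makes <i>leases_valid</i> return 0 until new grants are recorded.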
<br>
<h2>Elections</h2>
When a client grants a lease to a master, it gives up the right to
participate in an election until that grant expires. If we are
the master and <i>dbenv->rep_elect</i>
is called, it should return, no matter what, like it does today.
If we are a client and <i>rep_elect</i>
is called, special processing takes place when leases are in
effect. First, the easy case: if the lease granted by this
client has already expired, then the client goes directly into the
election as normal. If a valid lease grant is outstanding to a
master, this site cannot participate in an election until that grant
expires. We have at least two options when a site calls the <i>dbenv->rep_elect</i>
API while
leases are in effect.<br>
<ol>
 <li>The simplest coding solution for DB would be simply to refuse to
participate in the election if this site has a current lease granted to
a master. We would detect this situation and return EINVAL.
This is correct behavior and trivial to implement. The
disadvantage of this solution is that the application would then be
responsible for repeatedly attempting an election until the lease grant
expired.<br>
 </li>
 <li>The more satisfying solution is for DB to wait the remaining time
for the grant. If this client hears from the master during that
time the election does not take place and the call to <i>rep_elect</i>
returns with the
information for the current/old master.</li>
</ol>
<h3>Election Code Changes</h3>
The code changes to support leases in the election code are fairly
isolated. First, if leases are configured, we must verify the <i>nsites</i>
parameter is set to 0.
Second, in <i>__rep_elect_init</i>
we must not overwrite the value of <i>rep->nsites</i>
for leases because it is controlled by the <i>dbenv->rep_set_nsites</i>
API.
These changes are small and easy to understand.<br>
<br>
The more complicated code will be the client code when it has an
outstanding lease granted. The client will wait for the current
lease grant to expire before proceeding with the election. The
client will only do so if it does not hear from the master for the
remainder of the lease grant time. If the client hears from the
master, it returns and does not begin participating in the
election. A new election phase, <b>REP_EPHASE0</b>,
will exist so that the call to <i>__rep_wait</i>
can detect if a master responds. The client, while waiting for
the lease grant to expire, will send a <b>REP_MASTER_REQ</b>
message so that the master will respond with a <b>REP_NEWMASTER</b>
message and thus
allow the client to know the master exists. However, if the master
replies to the client, it is also desirable
that the client update its lease
grant.<br>
<br>
Recall that the <b>REP_NEWMASTER</b>
message does not result in a lease grant from the client. The
client responds when it processes a PERM record that has the <b>REPCTL_LEASE</b>
flag set in the message
with its lease grant up to the given LSN. Therefore, we want the
client's <b>REP_MASTER_REQ</b> to
yield both the discovery of the existing master and have the master
refresh its leases. The client will also use the <b>REPCTL_LEASE</b>
flag in its <b>REP_MASTER_REQ</b> message to the
master. This flag will serve as the indicator to the master that
it needs to deal with leases and both send the <b>REP_NEWMASTER</b>
message and refresh
the lease.<br>
The code will work as follows:<br>
<pre>if (leases_configured && (my_grant_still_valid || lease_never_granted)) {<br>    if (lease_never_granted)<br>        wait_time = lease_timeout<br>    else<br>        wait_time = grant_expiration - current_time<br>    F_SET(REP_F_EPHASE0);<br>    __rep_send_message(..., REP_MASTER_REQ, ...
REPCTL_LEASE);<br>    ret = __rep_wait(..., REP_F_EPHASE0);<br>    if (we found a master)<br>        return<br>} /* if we don't return, fall out and proceed with election */<br></pre>
On the master side, the code handling the <b>REP_MASTER_REQ</b> will
do:<br>
<pre>if (I am master) {<br>    ...<br>    __rep_send_message(REP_NEWMASTER...)<br>    if (F_ISSET(rp, REPCTL_LEASE))<br>        __rep_lease_refresh(...)<br>}<br></pre>
Other minor implementation details are that <i>__rep_elect_done</i>
must also clear
the <b>REP_F_EPHASE0</b> flag.
We also, obviously, need to define <b>REP_F_EPHASE0</b>
in the list of replication flags. Note that the client's call to <i>__rep_wait</i>
will return upon
receiving the <b>REP_NEWMASTER</b>
message. The client will independently refresh its lease when it
receives the log record from the master's call to refresh the lease.<br>
<br>
Again, similar to what I suggested above, the code could simply assume
global leases are configured, and instead of having the <b>REPCTL_LEASE</b>
flag at all, the master
assumes that it needs to refresh leases because it has them configured,
not because it is specified in the <b>REP_MASTER_REQ</b>
message it is processing. Right now I don't think every possible
<b>REP_MASTER_REQ</b> message should result in a lease grant request.<br>
<h4>Elections and Quiescent Systems</h4>
It is possible that a master is slow or the client is close to its
expiration time, or that the master is quiescent and all leases are
currently expired, but nothing much is going on anyway, yet some client
calls <i>__rep_elect</i> at that
time. In the code above, we will not send the <b>REP_MASTER_REQ</b>
because the lease is
not valid. The client will simply proceed directly to sending the
<b>REP_VOTE1</b> message, throwing all
other clients into an election. The master is still master and
should stay that way.
Currently, in response to a vote message, a
master will broadcast out a <b>REP_NEWMASTER</b>
to assert its mastership. That causes the election to
complete. However, if desired the master may want to proactively
refresh its leases. This situation indicates to me that the
master should choose to refresh leases based on configuration, not a
flag sent from the client. I believe that anytime the master asserts
its mastership via sending a <b>REP_NEWMASTER</b>
message, I need to add code to proactively refresh leases at that
time.<br>
<h2>Other Implementation Details</h2>
<h3>Role Changes<br>
</h3>
When a site changes its role via a call to <i>rep_start</i> in either
direction, we
must take action when leases are configured. There are three
types of role changes that all need changes to deal with leases:<br>
<ol>
 <li><i>A master downgrading to a
client.</i> When a master downgrades to a client, it can do so
immediately after it has proactively expired all existing leases it
holds. This situation is similar to an error from the send
callback, and it effectively cancels all outstanding leases held on
this site. Note that if this master expires its leases, it does
not have any effect on when the clients' lease grants expire on the
client side. The clients must still wait their full expected
grant time.<br>
 </li>
 <li><i>A client upgrading to master.</i>
If a client is upgrading to a master but it has an outstanding lease
granted to another site, the code will return an <b>EINVAL</b>
error. This situation
only arises if the application simply declares this site master.
If a site wins an election then the election itself should have waited
long enough for the granted lease to expire and this state should not
arise then.</li>
 <li><i>A client finding a new master.</i>
When a client discovers a new and different master via a <b>REP_NEWMASTER</b>
message, then the
client cannot accept that new master until its current lease grant
expires. This situation should only occur when a site declares
itself master without an election and that site's lease grant expires
before this client's grant expires. However, it is <b>possible</b>
for this situation to arise
with elections also. If we have 5 sites holding an election and 4
of those sites have leases expire at about the same time T, and this
site's lease expires at time T+N and the election timeout is < N,
then those 4 sites may hold an election and elect a master without this
site's participation. A client in this situation must call <i>__rep_wait</i>
with the time remaining
on its lease. If the lease is expired after waiting the remaining
time, then the client can accept this new master. If the lease
was refreshed during the waiting period then the client does not accept
this new master and returns.<br>
 </li>
</ol>
<h3>DUPMASTER</h3>
A duplicate master situation can occur if an old master becomes
disconnected from the rest of the group, that group elects a new master
and then the partition is resolved. The requirement for master
leases is that this situation will not cause the newly elected,
rightful master to receive the <b>DB_REP_DUPMASTER</b>
return. It is okay for the old master to get that return
value.
When a dual master situation exists, the following will
happen:<br>
<ul>
 <li><i>On the current master and all
current clients</i> - If the current master receives an update
message or other conflicting message from the old master then that
message will be ignored because the generation number is out of date.</li>
 <li><i>On the old master</i> - If
the old master receives an update message from the current master, or
any other message with a later generation from any site, the new
generation number will trigger this site to return <b>DB_REP_DUPMASTER</b>.
However,
instead of broadcasting out the <b>REP_DUPMASTER</b>
message to shoot down others as well, this site, if leases are
configured, will call <i>__rep_lease_check</i>
and if they are expired, return the error. It should be
impossible for us to receive a later generation message and still hold
a majority of master leases. Something is seriously wrong and we
will <b>DB_ASSERT</b> this situation
cannot happen.<br>
 </li>
</ul>
<h3>Client to Client Synchronization</h3>
One question to ask is how lease grants interact with client-to-client
synchronization. The only answer is that they do not. A client
that is sending log records to another client cannot request that the
receiving client refresh its lease with the master. That client
does not have a timestamp it can use for the master and clock skew
makes it meaningless between machines. Therefore, sites that use
client-to-client synchronization will likely see more lease refreshment
during the read path and leases will be refreshed during live updates
only.
Of course, if a client supplies log records that fill a
gap, and the later log records stored came from the master in a live
update then the client will respond as per the discussion on Gap
Processing above.<br>
<h2>Interaction Matrix</h2>
If leases are granted (by a client) or held (by a master) what should
the following APIs and messages do?<br>
<br>
Other:<br>
log_archive: Leases do not affect log_archive. OK.<br>
dbenv->close: OK.<br>
crash during lease grant and restart: <b>Potential
problem here. See discussion below</b>.<br>
<br>
Rep Base API method:<br>
rep_elect: Already discussed above. Must wait for lease to expire.<br>
rep_flush: Master only, OK - this will be the basis for refreshing
leases.<br>
rep_get_*: Not affected by leases.<br>
rep_process_message: Generally OK. We'll discuss each message
below.<br>
rep_set_config: OK.<br>
rep_set_limit: OK.<br>
rep_set_nsites: Must be called before <i>rep_start</i>
and <i>nsites</i> is immutable until
14778 is resolved.<br>
rep_set_priority: OK.<br>
rep_set_timeout: OK. Used to set lease timeout.<br>
rep_set_transport: OK.<br>
rep_start(MASTER): Role changes are discussed above. Make sure
duplicate rep_start calls are no-ops for leases.<br>
rep_start(CLIENT): Role changes are discussed above. Make sure
duplicate calls are no-ops for leases.<br>
rep_stat: OK.<br>
rep_sync: Should not be able to happen. Client cannot accept new
master with outstanding lease grant. Add DB_ASSERT here.<br>
<br>
REP_ALIVE: OK.<br>
REP_ALIVE_REQ: OK.<br>
REP_ALL_REQ: OK.<br>
REP_BULK_LOG: OK. Clients check to send ACK.<br>
REP_BULK_PAGE: Should never process one with lease granted. Add
DB_ASSERT.<br>
REP_DUPMASTER: Should never happen, this is what leases are supposed to
prevent. See above.<br>
REP_LOG: OK. Clients check to send ACK.<br>
REP_LOG_MORE: OK.
Clients check to send ACK.<br>
REP_LOG_REQ: OK.<br>
REP_MASTER_REQ: OK.<br>
REP_NEWCLIENT: OK.<br>
REP_NEWFILE: OK. Clients check to send ACK.<br>
REP_NEWMASTER: See above.<br>
REP_NEWSITE: OK.<br>
REP_PAGE: OK. Should never process one with lease granted.
Add DB_ASSERT.<br>
REP_PAGE_FAIL: OK. Should never process one with lease
granted. Add DB_ASSERT.<br>
REP_PAGE_MORE: OK. Should never process one with lease
granted. Add DB_ASSERT.<br>
REP_PAGE_REQ: OK.<br>
REP_REREQUEST: OK.<br>
REP_UPDATE: OK. Should never process one with lease
granted. Add DB_ASSERT.<br>
REP_UPDATE_REQ: OK. This is a master-only message.<br>
REP_VERIFY: OK. Should never process one with lease
granted. Add DB_ASSERT.<br>
REP_VERIFY_FAIL: OK. Should never process one with lease
granted. Add DB_ASSERT.<br>
REP_VERIFY_REQ: OK.<br>
REP_VOTE1: OK. See Election discussion above. It is
possible to receive one with a lease granted. Client cannot send
one with an outstanding lease however.<br>
REP_VOTE2: OK. See Election discussion above. It is
possible to receive one with a lease granted.<br>
<br>
If the following method or message processing is in progress and a
client wants to grant a lease, what should it do? Let's examine
what this means. The client wanting to grant a lease simply means
it is responding to the receipt of a <b>REP_LOG</b>
(or its variants) message and applying a log record. Therefore,
we need to consider a thread processing a log message racing with these
other actions.<br>
<br>
Other:<br>
log_archive: OK.<br>
dbenv->close: User error. User should not be closing the env
while other threads are using that handle. Should have no effect
if a 2nd dbenv handle to same env is closed.<br>
<br>
Rep Base API method:<br>
rep_elect: See Election discussion above.
<i>rep_elect</i>
should wait and may grant
lease while election is in progress.<br>
rep_flush: Should not be called on client.<br>
rep_get_*: OK.<br>
rep_process_message: Generally OK. See handling each message
below.<br>
rep_set_config: OK.<br>
rep_set_limit: OK.<br>
rep_set_nsites: Must be called before <i>rep_start</i>
until 14778 is resolved.<br>
rep_set_priority: OK.<br>
rep_set_timeout: OK.<br>
rep_set_transport: OK.<br>
rep_start(MASTER): OK, can't happen - already protect racing <i>rep_start</i>
and <i>rep_process_message</i>.<br>
rep_start(CLIENT): OK, can't happen - already protect racing <i>rep_start</i>
and <i>rep_process_message</i>.<br>
rep_stat: OK.<br>
rep_sync: Shouldn't happen because client cannot grant leases during
sync-up. Incoming log message ignored.<br>
<br>
REP_ALIVE: OK.<br>
REP_ALIVE_REQ: OK.<br>
REP_ALL_REQ: OK.<br>
REP_BULK_LOG: OK.<br>
REP_BULK_PAGE: OK. Incoming log message ignored during internal
init.<br>
REP_DUPMASTER: Shouldn't happen. See DUPMASTER discussion above.<br>
REP_LOG: OK.<br>
REP_LOG_MORE: OK.<br>
REP_LOG_REQ: OK.<br>
REP_MASTER_REQ: OK.<br>
REP_NEWCLIENT: OK.<br>
REP_NEWFILE: OK.<br>
REP_NEWMASTER: See above. If a client accepts a new master
because its lease grant expired, then that master sends a message
requesting the lease grant, this client will not process the log record
if it is in sync-up recovery, or it may after the master switch is
complete and the client doesn't need sync-up recovery. Basically,
just uses existing log record processing/newmaster infrastructure.<br>
REP_NEWSITE: OK.<br>
REP_PAGE: OK. Receiving a log record during internal init PAGE
phase should ignore log record.<br>
REP_PAGE_FAIL: OK.<br>
REP_PAGE_MORE: OK.<br>
REP_PAGE_REQ: OK.<br>
REP_REREQUEST: OK.<br>
REP_UPDATE: OK.
Receiving a log record during internal init
should ignore log record.<br>
REP_UPDATE_REQ: OK - master-only message.<br>
REP_VERIFY: OK. Receiving a log record during verify phase
ignores log record.<br>
REP_VERIFY_FAIL: OK.<br>
REP_VERIFY_REQ: OK.<br>
REP_VOTE1: OK. This client is processing someone else's vote when
the lease request comes in. That is fine. We protect our
own election and lease interaction in <i>__rep_elect</i>.<br>
REP_VOTE2: OK.<br>
<h4>Crashing - Potential Problem<br>
</h4>
It appears there is one area where we could have a problem. I
believe that crashes can cause us to break our guarantees on durability,
authoritative reads and inability to elect duplicate masters.
Consider this scenario:<br>
<ol>
 <li>A master and 4 clients are all up and running.</li>
 <li>The master commits a txn and all 4 clients refresh their lease
grants at time T.</li>
 <li>All 4 clients have the txn and log records in the cache.
None are flushing to disk.</li>
 <li>All 4 clients have responded to the PERM messages as well as
refreshed their lease with the master.</li>
 <li>All 4 clients hit the same application coding error and crash
(machine/OS stays up).</li>
 <li>Master authoritatively reads data in txn from step 2.</li>
 <li>All 4 clients restart the application and run recovery, thus the
txn from step 2 is lost on all clients because it isn't in any logs.<br>
 </li>
 <li>A network partition happens and the master is alone on its side.</li>
 <li>All 4 clients are on the other side and elect a new master.</li>
 <li>Partition resolves itself and we have duplicate masters, where
the former master still holds all valid lease grants.<br>
 </li>
</ol>
Therefore, we have broken both guarantees.
In step 6 the data is
really not durable and we've given it to the user. One can argue
that if this is an issue the application had better be syncing somewhere if
it really wants durability. However, worse than that is that we
have a legitimate DUPMASTER situation in step 10 where both masters
hold valid leases. The reason is that all lease knowledge is in
the shared memory and that is lost when the app restarts and runs
recovery.<br>
<br>
How can we solve this? The obvious solution is (ugh, yet another)
durable BDB-owned file with some information in it, such as the current
lease expiration time so that rebooting after a crash leaves the
knowledge that the lease was granted. However, writing and
syncing every lease grant on every client out to disk is far too
expensive.<br>
<br>
A second possible solution is to have clients wait a full lease timeout
before entering an election the first time. This solution solves the
DUPMASTER issue, but not the non-authoritative read. This
solution naturally falls out of elections and leases really. If a
client has never granted a lease, it should be considered as having to
wait a full lease timeout before entering an election.
Applications already know that leases impact elections and this does
not seem so bad as it is only on the first election.<br>
<br>
Is it sufficient to document that the authoritative read is only as
authoritative as the durability guarantees the application makes on the sites that
indicate it is permanent? Yes, I believe this is sufficient. If
the application says it is permanent and it really isn't, then the
application is at fault. Believing the application when it
indicates with the PERM response that it is permanent avoids the
authoritative problem.
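<br>
<br>
The second solution fits the wait computed before the <b>REP_EPHASE0</b> phase in the election code: a client that has never granted a lease waits a full lease timeout, and a client with an outstanding grant waits only the time remaining. A minimal sketch of that calculation follows, assuming integer tick times. <i>election_wait_time</i> and its parameters are hypothetical names for illustration and are not DB functions.<br>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how long a client must wait before it may
 * join an election. Returns 0 if it may proceed immediately.
 */
int64_t
election_wait_time(int leases_configured, int lease_never_granted,
    int64_t grant_expiration, int64_t now, int64_t lease_timeout)
{
    assert(lease_timeout >= 0);

    /* Without leases, elections behave as they do today. */
    if (!leases_configured)
        return (0);
    /* After a restart we cannot know what was granted: wait in full. */
    if (lease_never_granted)
        return (lease_timeout);
    /* An outstanding grant: wait only for the remainder. */
    if (grant_expiration > now)
        return (grant_expiration - now);
    /* The grant has already expired. */
    return (0);
}
```

Treating "never granted" as a full wait is what covers the crash scenario: a restarted client has lost its lease state, so it behaves as if it had granted a lease at the moment it restarted.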
<br>
<h2>Upgrade/Mixed Versions</h2>
Clearly leases cannot be used with mixed version sites since masters
running older releases will not have any knowledge of lease
support. What considerations are needed in the lease code for
mixed versions?<br>
<br>
First, if the <b>REP_CONTROL</b>
structure changes, we need to maintain and use an old version of the
structure for talking to older clients and masters. The
implementation of this would be similar to the way we manage old <b>REP_VOTE_INFO</b>
structures.
Second, any new messages need translation table entries added.
Third, if we are assuming global leases then clearly any mixed versions
cannot have leases configured, and leases cannot be used in mixed
version groups. Maintaining two versions of the control structure
is not necessary if we choose a different style of implementation and
don't change the control structure.<br>
<br>
However, then how could an old application both run continuously,
upgrade to the new release and take advantage of leases without taking
down the entire application? I believe it is possible for clients
to be configured for leases but be subject to the master regarding
leases, yet the master code can assume that if it has leases
configured, all client sites do as well. In several places above
I suggested that a client could make a choice based on either a new <b>REPCTL_LEASE</b>
flag or simply having
leases turned on locally. If we choose to use the flag, then we
can support leases with mixed versions. The upgraded clients can
configure leases and they simply will not be granted until the old
master is upgraded and sends a PERM message with the flag indicating it
wants a lease grant. The client will not grant a lease until such
time.
The clients, while having the leases configured, will not
grant a lease until told to do so and will simply have an expired
lease. Then, when the old master finally upgrades, it too can
configure leases and suddenly all sites are using them. I believe
this should work just fine and I will need to make sure a client's
granting of leases is only in response to the master asking for a
grant. If the master never asks, then the client has them
configured, but doesn't grant them.<br>
<h2>Testing</h2>
Clearly any user-facing API changes will need the equivalent reflection
in the Tcl API for testing, under CONFIG_TEST.<br>
<br>
I am sure the list of tests will grow but off the top of my head:<br>
Basic test: have N sites all configure leases, run some, read on
master, etc.<br>
Refresh test: Perform update on master, sleep until past expiration,
read on master and make sure leases are refreshed/read successful.<br>
Error test: Test error conditions (reading on client with leases but no
ignore flag, calling after rep_start, etc).<br>
Read test: Test reading on both client and master both with and without
the IGNORE flag. Test that data read with the ignore flag can be
rolled back.<br>
Dupmaster test: Force a DUPMASTER situation and verify that the newer
master cannot get DUPMASTER error.<br>
Election test: Call election while grant is outstanding and master
exists.<br>
Call election while grant is outstanding and master does not exist.<br>
Call election after expiration on quiescent system with master
existing.<br>
Run with a group where some members have leases configured and others do
not to make sure we get errors instead of dumping core.<br>
<br>
<small><br>
</small>
</body>
</html>