<!DOCTYPE html PUBLIC "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
  <meta http-equiv="Content-Type"
 content="text/html; charset=iso-8859-1">
  <meta name="GENERATOR"
 content="Mozilla/4.76 [en] (X11; U; FreeBSD 4.3-RELEASE i386) [Netscape]">
  <title>Master Lease</title>
</head>
<body>
<center>
<h1>Master Leases for Berkeley DB</h1>
</center>
<center><i>Susan LoVerso</i> <br>
<i>sue@sleepycat.com</i> <br>
<i>Rev 1.1</i><br>
<i>2007 Feb 2</i><br>
</center>
<p><br>
</p>
<h2>What are Master Leases?</h2>
A master lease is a mechanism whereby clients grant master-ship rights
to a site and that master, by holding lease rights, can provide a
guarantee of durability to a replication group for a given period of
time. A client that has granted a lease to a master
will not participate in an election to elect a new
master until that granted master lease has expired. By holding a
collection of granted leases, a master will be able to serve
authoritative read requests from applications. By holding leases, a
read operation on a master can guarantee several things to the
application:<br>
<ol>
  <li>Authoritative reads: a guarantee that the data being read by the
application is durable and can never be rolled back.</li>
  <li>Freshness: a guarantee that the data being read by the
application <b>at the master</b> is
not stale.</li>
  <li>Master viability: a guarantee that a current master with valid
leases will not encounter a duplicate master situation.<br>
  </li>
</ol>
<h2>Requirements</h2>
The requirements of DB to support this include:<br>
<ul>
  <li>After turning leases on, users can choose whether or not to
ignore them in reads.</li>
  <li>We are providing read authority on the master only. A
read on a client is equivalent to a read while ignoring leases.</li>
  <li>We guarantee that data committed on a master <b>that has been
read by an application on the
master</b> will not be rolled back.
Data read on a client or
while ignoring leases, <i>or data
successfully updated/committed but not read,</i>
may be rolled back.<br>
  </li>
  <li>Unless leases are ignored, a master will not return successfully
from a read operation unless it holds a
majority of leases.</li>
  <li>Master leases will remove the possibility of a current/correct
master being "shot down" by DUPMASTER. <b>NOTE: Old/expired
masters may, however, discover a
later master and return DUPMASTER to the application.</b><br>
  </li>
  <li>Any send callback failure must result in premature lease
expiration on the master.<br>
  </li>
  <li>Users who change the system clock during master leases void the
guarantee and may get undefined behavior. We assume time always
runs forward. <b>[document this.]</b><br>
  </li>
  <li>Clients are forbidden from participating in elections while they
have an outstanding lease granted to another site.</li>
  <li>Clients are forbidden from accepting a new master while they have
an outstanding lease granted to another site.</li>
  <li>Clients are forbidden from upgrading themselves to master while
they have an outstanding lease granted to another site.</li>
  <li>When asked for a lease grant explicitly by the master, the client
cannot grant the lease to the master unless the LSN in the master's
request has been processed by this client.<br>
  </li>
</ul>
The requirements of the
application using leases include:<br>
<ul>
  <li>Users must implement (Base API users on their own, RepMgr users
via configuration) a majority (or larger) ACK policy. <br>
  </li>
  <li>The application must use the election mechanism to decide a master.
It may not simply declare a site master.</li>
  <li>The send callback must return an error if the majority ACK policy
is not met for PERM records.</li>
  <li>Users must set the number of sites in the group.</li>
  <li>Using leases in a replication group is all-or-none.
Therefore, if a site knows it is using leases, it can assume other
sites are also.<br>
  </li>
  <li>All applications that care about read guarantees must forward or
perform all reads on the master. Reading on the client means a
read ignoring leases. </li>
</ul>
<p>There are some open questions
remaining.</p>
<ul>
  <li>There is one major showstopper issue, see Crashing - Potential
problem near the end of the document. We need a better solution
than the one shown there (writing to disk every time a lease is
granted). Perhaps just documenting that durability means it must be
flushed to disk before success to avoid that situation?<br>
  </li>
  <li>What about <i>DB->join</i>? Users can call join, but the calls
on the join cursor to get the data would be subject to leases and
therefore protected. Ok, this is not an open question.</li>
  <li>What about other read-like operations? Clearly <i>
DB->get, DB->pget, DBC->get,
DBC->pget</i> need lease checks. However, other APIs use
keys. <i>DB->key_range</i>
provides an estimate only so it shouldn't need lease checks. <i>
DB->stat</i> provides exact counts
in the <i>bt_nkeys</i> and <i>bt_ndata</i> fields. Are those
fields considered authoritative, such that providing those values
implies a durability guarantee, and should <i>DB->stat</i>
therefore be subject to lease verification? <i>DBC->count</i>
provides a count of
the number of data items associated with a key. Is this
authoritative information? This is similar to stat - should it be
subject to lease verification?<br>
  </li>
  <li>Do we require master lease checks on write operations? I
think lease checks are not needed on write operations. They don't
add correctness and they add a lot of complexity (checking leases in put,
del, and cursors, then what about rename, remove, etc).<br>
  </li>
  <li>Do master leases give an iron-clad guarantee of never rolling
back a transaction?
No, but it should mean that a committed transaction
can never be <b>read</b> on a master
unless the lease is valid. A committed transaction on a master
that has never been presented to the application may get rolled back.<br>
  </li>
  <li>Do we need to quarantine or prevent reads on an ex-master until
sync-up is done? No. A master that is simply downgraded to
client or crashes and reboots is now a client. Reading from that
client is the same as saying Ignore Leases.</li>
  <li>What about adding and removing sites while leases are
active? This is SR 14778. A consistent <i>nsites</i> value
is required by master
leases. <b>The resolution of 14778
is a prerequisite - currently owned by Alan</b>. It isn't
clear to me what a master is
supposed to do if the value of nsites gets smaller while leases are
active. Perhaps it leaves its larger table intact and simply
checks for a smaller number of granted leases?<br>
  </li>
  <li>Can users turn leases off? No. There is no planned <i>turn
leases off</i> API.</li>
  <li>Clock skew will be a percentage. However, the smallest, 1%,
is probably rather large for clock skew. Percentage was chosen
for simplicity and similarity to other APIs. What granularity is
appropriate here?</li>
</ul>
<h2>API Changes</h2>
The API changes that are visible
to the user are fairly minimal.
There are a few API calls they need to make to configure master leases
and then there is the API call to turn them on. There is also a
new flag to existing APIs to allow read operations to ignore leases and
return data that
may be non-durable.<br>
<h3>Lease Timeout<br>
</h3>
There is a new timeout the user
must configure for leases called <b>DB_REP_LEASE_TIMEOUT</b>.
This is a new timeout type for
the <i>dbenv->rep_set_timeout</i> method.
The <b>DB_REP_LEASE_TIMEOUT</b>
has no default and it is required that the user configure a timeout
before they turn on leases (obviously, this timeout need not be set if
leases will not be used). That timeout is the amount of time
the lease is valid on the master and how long it is granted
on the client. This timeout must be the same
value on all sites (like log file size). <b>[Document this
requirement. We cannot
enforce it across the group easily.]</b> The timeout used when
refreshing leases is the <b>DB_REP_ACK_TIMEOUT</b>
for RepMgr applications. For Base API applications, lease
refreshes will use the same mechanism as <b>PERM</b> messages and they
should
have no additional burden. This timeout is used for lease
refreshment and is the amount of time a reader will wait to refresh
leases before returning failure to the application from a read
operation.<br>
<br>
The lease timeout will be stored both
with its original value and
converted to a <i>db_timespec</i>
using the <b>DB_TIMEOUT_TO_TIMESPEC</b>
macro, with the clock skew accounted for, in the shared
rep structure:<br>
<pre>db_timeout_t lease_timeout;<br>db_timespec lease_duration;<br></pre>
NOTE: By sending the lease refresh during DB operations, we are
forcing/assuming that the operation's process has a replication
transport function set. That is obviously the case for write
operations, but would it be a burden for read processes (on a
master)?
I think mostly not, but if we need leases for <i>
DB->stat</i> then we need to
document it as it is certainly possible for an application to have a
separate or dedicated <i>stat</i>
application or attempt to use <i>db_stat</i>
(which will not work if leases must be checked).<br>
<br>
Leases should be checked after the local operation. Checking leases
first would leave a window in which we check the lease, get
descheduled, then lose our lease and then perform the operation.
Do the operation, then check leases before returning to the user.<br>
<h3>Using Leases</h3>
There is a new API that the user must call to tell the system to use
the lease mechanism. The method must be called before the
application calls <i>dbenv->rep_start</i>
or <i>dbenv->repmgr_start</i>.
This new
method is:<br>
<br>
<pre> dbenv->rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)<br>
</pre>
The <i>clock_scale_factor</i>
parameter is interpreted as a percentage, greater than 100 (to transmit
a floating point number as an integer to the API), that represents the
maximum skew between any two sites' clocks. That is, a <span
 style="font-style: italic;">clock_scale_factor</span> of 150 suggests
that the greatest discrepancy between clocks is that one runs 50%
faster than the others. Both the
master and client sides
compensate for possible clock skew. The master uses the value to
compensate in case the replica has a slow clock and replicas compensate
in case they have a fast clock. This scaling factor will need to
be divided by 100 on all sites to truly represent the percentage for
adjustments made to time values.<br>
<br>
Assume the slowest replica's clock is a factor of <i>clock_scale_factor</i>
slower than the
fastest clock.
Using that assumption, if the fastest clock goes
from time t1 to t2 in X
seconds, the slowest clock does it in (<i>clock_scale_factor</i> / 100)
* X seconds.<br>
<br>
The <i>flags</i> parameter is not
currently used.<br>
<br>
When the <i>dbenv->rep_set_lease</i>
method is called, we will set a configuration flag indicating that
leases are turned on:<br>
<b>#define REP_C_LEASE &lt;value&gt;</b>.
We will also record the <b>u_int32_t
clock_skew</b> value passed in. The <i>rep_set_lease</i> method
will not allow
calls after <i>rep_start</i>. If
multiple calls are made prior to calling <i>rep_start</i> then later
calls will
overwrite the earlier clock skew value. <br>
<br>
We need a new flag to prevent calling <i>rep_set_lease</i>
after <i>rep_start</i>. The
simplest solution would be to reject the call to
<i>rep_set_lease
</i>if<b>
REP_F_CLIENT</b>
or <b>REP_F_MASTER</b> is set.
However, that does not work in the cases where a site cleanly closes its
environment and then opens without running recovery. The
replication state will still be set. The prevention will be
implemented as:<br>
<pre>#define REP_F_START_CALLED &lt;some bit value&gt;<br></pre>
In <i>__rep_start</i>, at the end:<br>
<pre>if (ret == 0) {<br>    REP_SYSTEM_LOCK(dbenv);<br>    F_SET(rep, REP_F_START_CALLED);<br>    REP_SYSTEM_UNLOCK(dbenv);<br>}</pre>
In <i>__rep_env_refresh</i>, if we
are the last reference closing the env (we already check for that):<br>
<pre>F_CLR(rep, REP_F_START_CALLED);</pre>
<b>[Please review the logic here
carefully.]</b> In order to avoid run-time floating point operations
on <i>db_timespec</i> structures,
when a site is declared as a client or master in <i>rep_start</i> we
will pre-compute the
lease duration based on the integer-based clock skew and the
integer-based lease timeout.
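<br>
<br>
As an illustration of that pre-computation, here is a minimal sketch that derives the two durations with integer arithmetic only. The type <i>lease_usec_t</i> and both function names are hypothetical and not part of Berkeley DB. The sketch assumes the timeout is in microseconds, that the skew is the percentage passed to <i>rep_set_lease</i> (100 or greater), and that a master shortens its view of the lease while a client lengthens its own.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a lease timeout in microseconds. */
typedef uint64_t lease_usec_t;

/*
 * Master side: shrink the configured timeout by the skew percentage,
 * in case a replica's clock runs slow.
 */
static lease_usec_t
sketch_master_duration(lease_usec_t lease_timeout, uint32_t clock_skew)
{
	assert(clock_skew >= 100);	/* Skew is a percentage, 100 or more. */
	return ((lease_timeout * 100) / clock_skew);
}

/*
 * Client side: stretch the configured timeout by the skew percentage,
 * in case this client's clock runs fast.
 */
static lease_usec_t
sketch_client_duration(lease_usec_t lease_timeout, uint32_t clock_skew)
{
	assert(clock_skew >= 100);
	return ((lease_timeout * clock_skew) / 100);
}
```

With a 3-second timeout and a skew of 150, the master would treat the lease as lasting 2 seconds and the client would honor it for 4.5 seconds, so the master's view is always the more conservative one.<br>
<br>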
A master should set a replica's
lease expiration to the <b>start time of
the sent message +
(lease_timeout / clock_scale_factor)</b> in case the replica has a
slow clock. Replicas extend their leases to <b>received message
time + (lease_timeout *
clock_scale_factor)</b> in case this replica has a fast clock.
In both expressions the scale factor is the configured percentage divided by 100.
Therefore, the computation will be as follows if the site is becoming a
master:<br>
<pre>db_timeout_t tmp;<br>tmp = (db_timeout_t)((double)rep->lease_timeout / ((double)rep->clock_skew / (double)100));<br>rep->lease_duration = DB_TIMEOUT_TO_TIMESPEC(&tmp);<br></pre>
Similarly, on a client the computation is:<br>
<pre>tmp = (db_timeout_t)((double)rep->lease_timeout * ((double)rep->clock_skew / (double)100));<br></pre>
When a site changes state, its lease duration will change based on
whether it is becoming a master or client and it will be recomputed
from the original values. Note that these computations, coupled
with the fact that the lease on the master is computed from the
time at which the master sent the message, mean that leases on the master
are more conservatively computed than on the clients.<br>
<br>
The <i>dbenv->rep_set_lease</i>
method must be called after <i>dbenv->open</i>,
similar to <i>dbenv->rep_set_config</i>.
The reason is so that we can check that this is a replication
environment and we have access to the replication shared memory region.<br>
<h3>Read Operations<br>
</h3>
Authoritative read operations on the master with leases enabled will
abide by leases by default. We will provide a flag that allows an
operation on a master to ignore leases. <b>All read operations
on a client imply
ignoring leases.</b> If an application wants authoritative reads
it must forward the read requests to the master and it is the
application's responsibility to provide the forwarding.
The consensus was that forcing <span style="font-weight: bold;">DB_IGNORE_LEASE</span>
on client read operations (with leases enabled, obviously) was too
heavy-handed. Read operations on the client will ignore leases,
but do no special flag checking.<br>
<br>
The flag will be called <b>DB_IGNORE_LEASE</b>
and it will be a flag that can be OR'd into the DB access method and
cursor operation values. It will be similar to the <b>DB_READ_UNCOMMITTED</b>
flag. <b>[Keith, I will need your help here for
finding a bit in the DB flags that isn't in use for my new flag.
That
looks like a very full and confusing area...]<br>
<br>
</b>The methods that will
adhere to leases are:<br>
<ul>
  <li><i>DB->get</i></li>
  <li><i>DB->pget</i></li>
  <li><i>DBC->get</i></li>
  <li><i>DBC->pget</i></li>
  <li><i>DB->stat </i><b>[maybe?]</b></li>
  <li><i>DBC->count </i><b>[maybe?]</b></li>
</ul>
The code that will check leases for a client reading would look
something
like this, if we decide to become heavy-handed:<br>
<pre>if (IS_REP_CLIENT(dbenv)) {<br>    [get to rep structure]<br>    if (FLD_ISSET(rep->config, REP_C_LEASE) && !LF_ISSET(DB_IGNORE_LEASE)) {<br>        db_err("Read operations must ignore leases or go to master");<br>        ret = EINVAL;<br>        goto err;<br>    }<br>}<br></pre>
On the master, the new code to abide by leases is more complex.
After the call to perform the operation we will check the lease.
In that checking code, the master will see if it has a valid
lease. If so, then all is well. If not, it will try to
refresh the leases. If that refresh attempt results in leases,
all is well. If the refresh attempt does not get leases, then the
master cannot respond to the read as an authority and we return an
error. The new error is called <b>DB_REP_LEASE_EXPIRED</b>.
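<br>
<br>
The check, refresh and recheck sequence just described can be sketched in isolation as follows. This is only an illustration under stated assumptions: the <i>sketch_master</i> structure, its fields, the function names and the error value are all hypothetical, and the refresh is simulated rather than sent over a transport.<br>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed DB_REP_LEASE_EXPIRED error. */
#define SKETCH_LEASE_EXPIRED (-1)

/* Simulated master state; none of these fields exist in Berkeley DB. */
struct sketch_master {
	int valid_leases;	/* Unexpired grants currently held. */
	int needed_leases;	/* Grants required to act as an authority. */
	int refresh_yield;	/* Grants held after a successful refresh. */
	int send_fails;		/* Nonzero simulates a send callback failure. */
};

/* Simulated refresh: resend the last PERM record and collect grants. */
static int
sketch_lease_refresh(struct sketch_master *m)
{
	if (m->send_fails) {
		m->valid_leases = 0;	/* A send failure expires the leases. */
		return (-1);
	}
	m->valid_leases = m->refresh_yield;
	return (0);
}

/* Called only after the local read itself has succeeded. */
static int
sketch_lease_check(struct sketch_master *m)
{
	assert(m != NULL);
	if (m->valid_leases >= m->needed_leases)
		return (0);			/* Lease is valid. */
	if (sketch_lease_refresh(m) != 0)
		return (SKETCH_LEASE_EXPIRED);	/* Refresh could not be sent. */
	/* Recheck: the refresh may not have produced enough grants. */
	return (m->valid_leases >= m->needed_leases ?
	    0 : SKETCH_LEASE_EXPIRED);
}
```

Keeping the refresh behind the validity test means the common case, a master with current leases, pays only for a comparison.<br>
<br>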
The master lease check is located after the internal call
to read the data has succeeded:<br>
<pre>if (IS_REP_MASTER(dbenv) && !LF_ISSET(DB_IGNORE_LEASE)) {<br>    [get to rep structure]<br>    if (FLD_ISSET(rep->config, REP_C_LEASE) &&<br>        (ret = __rep_lease_check(dbenv)) != 0) {<br>        /*<br>         * We don't hold the lease.<br>         */<br>        goto err;<br>    }<br>}<br></pre>
See below for the details of <i>__rep_lease_check</i>.<br>
<br>
Also note that if leases (or replication) are not configured, then <span
 style="font-weight: bold;">DB_IGNORE_LEASE</span> is a no-op. It
is ignored (and won't error) if used when leases are not in
effect. The reason is so that we can generically set that flag in
utility programs like <span style="font-style: italic;">db_dump</span>
that walk the database with a cursor. Note that <span
 style="font-style: italic;">db_dump</span> is the only utility that
reads with a cursor.<br>
<h3><b>Nsites
and Elections</b></h3>
The call to <i>dbenv->rep_set_nsites</i>
must be performed before the call to <i>dbenv->rep_start</i>
or <i>dbenv->repmgr_start</i>.
This document assumes either that <b>SR
14778</b> gets resolved, or that the value of <i>nsites</i> is
immutable. The
master and all clients need to know how many sites and leases are in
the group. Clients need to know for elections. The master
needs to know for the size of the lease table and to know what value a
majority of the group is. <b>[Until
14778 is resolved, the master lease work must assume <i>nsites</i> is
immutable and will
therefore enforce that this is called before <i>rep_start</i> using
the same mechanism
as <i>rep_set_lease</i>.]</b><br>
<br>
Elections and leases need to agree on the number of sites in the
group.
Therefore, when leases are in effect on clients, all calls
to <i>dbenv->rep_elect</i> must
set the <i>nsites</i> parameter to
0. The <i>rep_elect</i> code
path will return <b>EINVAL</b> if <b>REP_C_LEASE</b> is set and <i>nsites</i>
is non-0.
<h2>Lease Management</h2>
<h3>Message Changes</h3>
In order for clients to grant leases to the master a new message type
must be added for that purpose. This will be the <b>REP_LEASE_GRANT</b>
message.
Granting leases will be a result of applying a <b>DB_REP_PERMANENT</b>
record and therefore we
do not need any additional message in order for a master to request a
lease grant. The <b>REP_LEASE_GRANT</b>
message will pass a structure as its message DBT:<br>
<pre>typedef struct __rep_lease_grant {<br>    db_timespec msg_time;<br>#ifdef DIAGNOSTIC<br>    db_timespec expire_time;<br>#endif<br>} REP_GRANT_INFO;<br></pre>
In the <b>REP_LEASE_GRANT</b>
message, the client is actually giving the master several pieces of
information. We only need the echoed <i>msg_time</i> in this
structure because
everything else is already sent. The client is really sending the
master:<br>
<ul>
  <li>Its EID (parameter to <span style="font-style: italic;">rep_send_message</span>
and <span style="font-style: italic;">rep_process_message</span>)<br>
  </li>
  <li>The PERM LSN this message acknowledged (sent in the control
message)</li>
  <li>Unique identifier echoed back to master (<i>msg_time</i> sent in
message as above)</li>
</ul>
On the client, we always maintain the maximum PERM LSN already in <i>lp->max_perm_lsn</i>.
<h3>Local State Management</h3>
Each client must maintain a <i>db_timespec</i>
timestamp containing the expiration of its granted lease. This
field will be in the replication shared memory structure:<br>
<pre>db_timespec grant_expire;<br></pre>
This timestamp already takes into account the clock skew.
All
new fields must be initialized when the region is created. Whenever we
grant our master lease and want to send the <b>REP_LEASE_GRANT</b>
message, this value
will be updated. It will be used in the following way:
<pre>db_timespec mytime;<br>DB_LSN perm_lsn;<br>DBT lease_dbt;<br>REP_GRANT_INFO gi;<br><br><br>timespecclear(&mytime);<br>memset(&lease_dbt, 0, sizeof(lease_dbt));<br>memset(&gi, 0, sizeof(gi));<br>/* Candidate expiration: now plus the skew-adjusted lease duration. */<br>__os_gettime(dbenv, &mytime);<br>timespecadd(&mytime, &rep->lease_duration);<br>MUTEX_LOCK(rep->clientdb_mutex);<br>perm_lsn = lp->max_perm_lsn;<br>MUTEX_UNLOCK(rep->clientdb_mutex);<br>REP_SYSTEM_LOCK(dbenv);<br>/* Never move an existing grant expiration backwards. */<br>if (timespeccmp(&mytime, &rep->grant_expire, >))<br>    rep->grant_expire = mytime;<br>/* Echo the master's time-of-msg ID back unchanged. */<br>gi.msg_time = msg->msg_time;<br>#ifdef DIAGNOSTIC<br>gi.expire_time = rep->grant_expire;<br>#endif<br>lease_dbt.data = &gi;<br>lease_dbt.size = sizeof(gi);<br>REP_SYSTEM_UNLOCK(dbenv);<br>__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &perm_lsn, &lease_dbt, 0, 0);<br></pre>
This updating of the lease grant will occur in the <b>PERM</b> code
path when we have
successfully applied the permanent record.<br>
<h3>Maintaining Leases on the
Master/Rep_start</h3>
The master maintains a lease table that it checks when fulfilling a
read request that is subject to leases. This table is initialized
when a site calls<i>
dbenv->rep_start(DB_MASTER)</i> and the site is undergoing a role
change (i.e. a master making additional calls to <i>dbenv->rep_start(DB_MASTER)</i>
does
not affect an already existing table).<br>
<br>
When a non-master site becomes master, it must do two things related to
leases on a role change. First, a client cannot upgrade to master
while it has an outstanding lease granted to another site. If a
client attempts to do so, an error, <b>EINVAL</b>,
will be returned.
The only way this should happen is if the
application simply declares a site master, instead of using
elections. Elections will already wait for leases to expire
before proceeding. (See below.) <b>[I
believe an error is sufficient and we do not need, for version 1 at
least, any other complex waiting mechanism. Applications that
don't use elections and declare masters are quite rare.]</b><br>
<br>
Second, once we are proceeding with becoming a master, the site must
allocate the table it will use to maintain lease information.
This table will be sized based on <i>nsites</i>
and it will be an array of the following structure:<br>
<pre>typedef struct {<br>    int eid;                /* EID of client site. */<br>    db_timespec start_time; /* Unique time ID client echoes back on grants. */<br>    db_timespec end_time;   /* Master's lease expiration time. */<br>    DB_LSN lease_lsn;       /* Durable LSN this lease applies to. */<br>    u_int32_t flags;        /* Unused for now?? */<br>} REP_LEASE_ENTRY;<br></pre>
<h3>Granting Leases</h3>
It is the burden of the application to make sure that all sites in the
group
are using leases, or none are. Therefore, when a client processes
a <b>PERM</b>
log record that arrived from the master, it will grant its lease
automatically if that record is permanent (i.e. <b>DB_REP_ISPERM</b>
is being returned),
and leases are configured. A client will not send a
lease grant when it is processing log records (even <b>PERM</b>
ones) it receives from other clients that use client-to-client
synchronization. The reason is that the master requires a unique
time-of-msg ID (see below) that the client echoes back in its lease
grant and it will not have such an ID from another client.<br>
<br>
The master stores a time-of-msg ID in each message and the client
simply echoes it back to the master. In its lease table, the master
keeps the base
time-of-msg for each valid lease.
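<br>
<br>
To make the role of the table and the echoed time-of-msg ID concrete, here is a minimal sketch of a lease table modeled loosely on <i>REP_LEASE_ENTRY</i>. Everything in it is an assumption for illustration: times and LSNs are plain integers rather than <i>db_timespec</i> and <i>DB_LSN</i>, the names are hypothetical, and updating <i>start_time</i> on a newer grant is a choice made here so that later comparisons work.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EID_INVALID (-1)	/* Hypothetical stand-in for DB_EID_INVALID. */

struct sketch_lease {
	int eid;		/* EID of client site, or invalid if unused. */
	uint64_t start_time;	/* Echoed time-of-msg ID of the newest grant. */
	uint64_t end_time;	/* Master's view of the lease expiration. */
	uint64_t lease_lsn;	/* LSN the newest grant acknowledged. */
};

static void
sketch_table_init(struct sketch_lease *t, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		t[i].eid = SKETCH_EID_INVALID;
		t[i].start_time = t[i].end_time = t[i].lease_lsn = 0;
	}
}

/* Returns 0 if applied, 1 if the grant was stale, -1 if the table is full. */
static int
sketch_apply_grant(struct sketch_lease *t, size_t n, int eid,
    uint64_t msg_time, uint64_t lsn, uint64_t duration)
{
	size_t i;

	assert(eid != SKETCH_EID_INVALID);
	for (i = 0; i < n; i++) {
		if (t[i].eid == eid) {
			if (msg_time <= t[i].start_time)
				return (1);	/* Older message: ignore. */
			break;
		}
		if (t[i].eid == SKETCH_EID_INVALID) {
			t[i].eid = eid;		/* First grant from this site. */
			break;
		}
	}
	if (i == n)
		return (-1);
	t[i].start_time = msg_time;
	t[i].end_time = msg_time + duration;	/* Based on send time. */
	t[i].lease_lsn = lsn;
	return (0);
}

/* Count the grants that have not yet expired at time 'now'. */
static size_t
sketch_count_valid(const struct sketch_lease *t, size_t n, uint64_t now)
{
	size_t i, count;

	for (i = 0, count = 0; i < n; i++)
		if (t[i].eid != SKETCH_EID_INVALID && t[i].end_time > now)
			count++;
	return (count);
}
```

The count returned by the last function is what a master would compare against the number of grants it needs before answering a read as an authority.<br>
<br>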
When a <b>REP_LEASE_GRANT</b>
message comes in,
the master does a number of things:<br>
<ol>
  <li>Pulls the echoed timespec from the client message, into <i>msg_time</i>.<br>
  </li>
  <li>Finds the entry in its lease table for the client's EID. It
walks the table searching for the ID. EIDs of <span
 style="font-weight: bold;">DB_EID_INVALID</span> are
illegal. Either the master will find the entry, or it will find
an empty slot in the table (i.e. it is still populating the table with
leases).</li>
  <li>If this is a previously unknown site lease, the master
initializes the entry by copying to the <i>eid</i>, <i>start_time, </i>and
    <i>lease_lsn</i> fields. The master
also computes the <i>end_time</i>
based on the adjusted <i>rep->lease_duration</i>.</li>
  <li>If this is a lease from a previously known site, the master must
perform <i>timespeccmp(&msg_time,
&table[i].start_time, >)</i> and only update the <i>end_time</i>
of the lease when this is
a more recent message. If it is a more recent message, then we
should update
the <i>lease_lsn</i> to the LSN in
the message.</li>
  <li>Lease durations are computed taking the clock skew into
account: clients compute them based on the current time and the master
computes them based on the original sending time. For diagnostic purposes
only, I also plan to send the client's expiration time. The
client errs on the side of computing a larger lease expiration time and
the master errs on the side of computing a smaller duration.
Since both are taking the clock skew
into account, the client's ending expiration time should never be
smaller than
the master's computed expiration time. If it is, their value for clock
skew may not be correct.<br>
  </li>
</ol>
Any log records (new or resent) that originate from the master and
result in <b>DB_REP_ISPERM</b> get an
ack.<br>
<br>
<h3>Refreshing Leases</h3>
Leases get refreshed when a master receives a <b>REP_LEASE_GRANT</b>
message from a client. There are three pieces to lease
refreshment. <br>
<h4>Lazy Lease Refreshing on Read<br>
</h4>
If the master discovers that leases are
expired during the read operation, it attempts to refresh its
collection of lease grants. It does this by calling a new
function <i>__rep_lease_refresh</i>.
This function is very similar to the already-existing function <i>__rep_flush</i>.
Basically, to
refresh the lease, the master simply needs to resend the last PERM
record to the clients. The requirements state that when the
application send function returns successfully from sending a PERM
record, the majority of clients have that PERM LSN durable. We
will have a new public DB error return called <b>DB_REP_LEASE_EXPIRED</b>
that will be
returned to the caller if the master cannot assert its
authority.
The code will look something like this:<br>
<pre>/*<br> * Use lp->max_perm_lsn on the master (currently not used on the master)<br> * to keep track of the last PERM record written through the logging system.<br> * Need to initialize lp->max_perm_lsn in rep_start on role_chg.<br> */<br>call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT<br>if failure<br>    expire leases<br>    return lease expired error to caller<br>else /* success */<br>    recheck lease table<br>    /*<br>     * We need to recheck the lease table because the client<br>     * lease grant messages may not be processed yet, or got<br>     * lost, or are racing with the application's ACK messages,<br>     * or whatever.<br>     */<br>    if we have a majority of valid leases<br>        return success<br>    else<br>        return lease expired error to caller<br></pre>
<h4>Ongoing Update Refreshment<br>
</h4>
Second is having the master indicate to
the client it needs to send a lease grant in response to the current
PERM log message. The problem is
that acknowledgements must contain a master-supplied message timestamp
that the client sends back to the master. We need to modify the
structure of the log record messages when leases are configured
so
that when a PERM message is sent, the master sends, and the client
expects, the message timestamp. There are three fairly
straightforward and different implementations to consider.<br>
<ol>
  <li>Adding the timestamp to the <b>REP_CONTROL</b>
structure. If this option is chosen, then the code trivially
sends back the timestamp in the client's reply. There is no
special processing done by either side with the message contents.
So, on a PERM log record, the master will send a non-zero
timestamp. On a normal log record the timestamp will be zero or
some known invalid value.
If the client sees a non-zero
timestamp, it sends a <b>REP_LEASE_GRANT</b>
with the <i>lp->max_perm_lsn</i>
after applying that log record. If it is zero, then the client
does nothing different. The advantage is ease of coding. The
disadvantage is that for mixed version systems, the client is now
dealing with different sized control structures. We would have to
retain the old control structure so that during a mixed version group
the (upgraded) clients can use, expect and send old control structures
to the master. This is unfortunate, so let's consider additional
implementations that don't require modifying the control structure.<br>
  </li>
  <li>Adding a new <b>REPCTL_LEASE</b>
flag to the list of flags for the control structure, but not changing
the control structure fields. When a master wants to send a
message that needs a lease ack, it sets the flag. Additionally,
instead of simply sending a log record DBT as the <i>rec</i> parameter
for replication, we
would send a new structure that had the timestamp first and then the
record (similar to the bulk transfer buffer). The advantage of
this is that the control structure does not change. Disadvantages
include more special-cased code in the normal code path where we have
to check the flag. If the flag is set we have to extract the
timestamp value and massage the incoming data to pass on the real log
record to <i>rep_apply</i>. On
bulk transfer, we would just add the timestamp into the buffer.
On normal transfers, it would incur an additional data copy on the
master side. That is unfortunate.
Additionally, if this
record needs to be stored in the temp db, we need some way to get it
back again later or <span style="font-style: italic;">rep_apply</span>
would have to extract the timestamp when it processed the record
(either live or from the temp db).<br>
  </li>
  <li>Adding a different message type, such as <b>REP_LOG_ACK</b>.
Similarly to <b>REP_LOG_MORE</b> this message would be a
special-case version of a log record. We would extract the
timestamp and then handle it as a normal log record. This
implementation is rejected because it actually would require three new
message types: <b>REP_LOG_ACK,
REP_LOG_ACK_MORE, REP_BULK_LOG_ACK</b>. That is just too ugly
to contemplate.</li>
</ol>
<b>[Slight digression:</b> it occurs
to me while writing about #2 and #3 above, that our implementation of
all of the *_MORE messages could really be implemented with a <b>REPCTL_MORE</b>
flag instead of a
separate message type. We should clean that up and simplify the
messages, but not as part of master leases. Hmm, taking that thought
process further, we really could get rid of the <b>REP_BULK_*</b>
messages as well if we
added a <b>REPCTL_BULK</b>
flag. I think we should definitely do it for the *_MORE
messages. I am not sure we should do it for bulk because the
structure of the incoming data record is vastly different.]<br>
<br>
Of these options, I believe that modifying the control structure is the
best alternative. The handling of the old structure will be very
isolated to code dealing with old versions and is far less complicated
than injecting the timestamp into the log record DBT and doing a data
copy. Actually, I will likely combine #1 and the flag from #2
above. I will have the <b>REPCTL_LEASE</b>
flag that indicates a lease grant reply is expected and have the
timestamp in the control structure.
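<br>
<br>
A minimal sketch of that combined approach, assuming a hypothetical control layout: the flag value, the structure fields and the function name below are invented for illustration and do not reflect the real <b>REP_CONTROL</b> structure. The client builds a grant only when the flag asks for one, and it echoes the master's timestamp unchanged.<br>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPCTL_LEASE 0x02u	/* Hypothetical flag bit value. */

/* Hypothetical subset of a control structure carrying the timestamp. */
struct sketch_control {
	uint32_t flags;		/* Control flags set by the master. */
	uint64_t msg_time;	/* Master's send time for this message. */
};

/* Hypothetical grant payload: just the echoed timestamp. */
struct sketch_grant {
	uint64_t msg_time;
};

/*
 * Fill in a grant if the master asked for one. Returns true when a
 * grant should be sent back, false when nothing special is needed.
 */
static bool
sketch_build_grant(const struct sketch_control *ctl, struct sketch_grant *out)
{
	assert(ctl != NULL && out != NULL);
	if ((ctl->flags & SKETCH_REPCTL_LEASE) == 0)
		return (false);		/* Normal record: no grant. */
	out->msg_time = ctl->msg_time;	/* Echo, do not reinterpret. */
	return (true);
}
```

Driving the reply from the flag keeps the client from inferring intent from the timestamp value, at the cost of one extra bit in the control flags.<br>
<br>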
<b>[Is that necessary? It feels cleaner, but we could also just treat a
non-zero timestamp as meaning "send a reply" without having it directed by
a flag from the master. That means we would not need the flag, but it
builds an assumption into the code instead of having the client simply send
a grant when the flag says to do so. See Upgrades/Mixed versions below
too.]</b>
Also I will probably add in a spare field or two for future use in the
<b>REP_CONTROL</b> structure.<br>
<h4>Gap Processing</h4>
No matter which implementation we choose for ongoing lease refreshment,
gap processing must be considered. The code above assumes the timestamps
will be placed on PERM records only. Normal log records will not have a
timestamp, nor a flag or anything else like that. However, any log message
can fill a gap on a client and cause the processing of that normal log
record to return <b>DB_REP_ISPERM</b> because later records were also
processed.<br>
<br>
The current implementation should work fine in that case because when we
store the message in the client temp db we store both the control DBT and
the record DBT. Therefore, when a normal record fills a gap, the later PERM
record, when retrieved, will look just like it did when it arrived. The
client will have access to the LSN, the timestamp, etc. However, it does
mean that sending the <b>REP_LEASE_GRANT</b> message must take place down
in <i>__rep_apply</i> because that is the only place we have access to the
contents of those stored records with the timestamps.<br>
<br>
There are two logical choices to consider for granting the lease when
processing an update. First, as we process a PERM message (either a live
record or one read from the temp db after filling a gap), we send the
<b>REP_LEASE_GRANT</b> message for each PERM record we successfully apply.
Or, second, we keep track of the largest timestamp of all PERM records
we've processed and at the end of the function, after we've applied all
records, we send back a single lease grant with the <i>max_perm_lsn</i>
and a new <i>max_lease_timestamp</i> value to the master. The first is
easier to implement; the second results in possibly slightly fewer messages
at the expense of more bookkeeping on the client.<br>
<br>
A third, more complicated option would be to have the message timestamp on
all records, but send grants only on the PERM messages. A reason to do this
is that the later timestamp of a normal log record would be used as the
timestamp sent in the reply and the master would get a more up-to-date
timestamp value and a longer lease. <br>
<br>
<span style="font-weight: bold;">[Concern about gap processing here.]</span>
If we change the <span style="font-weight: bold;">REP_CONTROL</span>
structure to include the timestamp, we potentially break or at least need
to revisit the gap processing algorithm. That code assumes that the control
and record elements for the same LSN look the same each and every time. The
code stores the <span style="font-style: italic;">control</span> DBT as the
key and the <span style="font-style: italic;">rec</span> DBT as the data.
We use a specialized compare function to sort based on the LSN in the
control DBT. With master leases, the same record for the same LSN,
transmitted multiple times by a master or by a client, will be different
each time because the timestamp field will not be the same. Therefore, the
client will end up with duplicate entries in the temp database for the same
LSN. Both solutions (adding the timestamp to
<span style="font-weight: bold;">REP_CONTROL</span> and adding a
<span style="font-weight: bold;">REPCTL_LEASE</span> flag) can yield
duplicate entries.
The flag would cause the same record from the master and client to be
different as well.<br>
<h4>Handling Incoming Lease Grants<br>
</h4>
The third piece of lease management is handling the incoming
<b>REP_LEASE_GRANT</b> message on the master. When this message is
received, the master must do the following:<br>
<pre>REP_SYSTEM_LOCK<br>msg_timestamp = cntrl->timestamp;<br>client_lease = __rep_lease_entry(dbenv, client eid)<br>if (client_lease == NULL) {<br>    initial lease for this site, DB_ASSERT there is space in the table<br>    add this to the table if there is space<br>} else {<br>    compare msg_timestamp with client_lease->start_time<br>    if (msg_timestamp is more recent && msg_lsn >= lease LSN)<br>        update entry in table<br>}<br>REP_SYSTEM_UNLOCK<br></pre>
<h3>Expiring Leases</h3>
Leases can expire in two ways. First, they can expire naturally due to the
passage of time. When checking leases, if the current time is later than
the lease entry's <i>end_time</i> then the lease is expired. Second, they
can be forced to expire prematurely when the application's transport
function returns an error. In the first case, there is nothing to do; in
the second case we need to manipulate the <i>end_time</i> so that all
future lease checks fail. Since the lease <i>start_time</i> is guaranteed
to not be in the future, we will have a function <i>__rep_lease_expire</i>
that will:<br>
<pre>REP_SYSTEM_LOCK<br>for each entry in the lease table<br>    entry->end_time = entry->start_time;<br>REP_SYSTEM_UNLOCK<br></pre>
Is there a potential race or problem with prematurely expiring leases?
Consider an application that enforces an ALL acknowledgement policy for
PERM records in its transport callback. There are four clients and three
send the PERM ack to the application. The callback returns an error to the
master DB code. The DB code will now prematurely expire its leases.
However, at approximately the same time the three clients are also sending
their <span style="font-weight: bold;">REP_LEASE_GRANT</span> messages to
the master. There is a race between the master processing those messages
and the thread handling the callback failure expiring the table. This is
only an issue if the messages arrive after the table has been expired.<br>
<br>
Let's assume all three clients send their grants after the master expires
the table. If we accept those grants and then a read occurs, the read will
succeed since the master has a majority of leases even though the callback
failed earlier. Is that a problem? The lease code is using a majority and
the application policy is using some other value. It feels like this
should be okay since the data is held by leases on a majority. Should we
consider having the lease checking threshold be the same as the permanent
ack policy? That is difficult because Base API users implement whatever
they want and DB does not know what it is.<br>
<h3>Checking Leases</h3>
When a read operation on the master completes, the last thing we need to do
is verify the master leases. We've already discussed above how they are
refreshed when they have expired. We need two things for a lease to be
valid. It must be within the timeframe of the lease grant and the lease
must be valid for the last PERM record LSN.
Here is the logic for checking the validity of leases in
<i>__rep_lease_check</i>:<br>
<pre>#define MAX_REFRESH_TRIES 3<br>DB_LSN lease_lsn;<br>REP_LEASE_ENTRY *entry;<br>u_int32_t min_leases, valid_leases;<br>db_timespec cur_time;<br>int ret, tries;<br><br>    tries = 0;<br>retry:<br>    ret = 0;<br>    LOG_SYSTEM_LOCK<br>    lease_lsn = lp->lsn;<br>    LOG_SYSTEM_UNLOCK<br>    REP_SYSTEM_LOCK<br>    min_leases = rep->nsites / 2;<br>    __os_gettime(dbenv, &cur_time);<br>    for (entry = head of table, valid_leases = 0; entry != NULL && valid_leases < min_leases; entry++)<br>        if (timespec_cmp(&entry->end_time, &cur_time) >= 0 && log_compare(&entry->lsn, &lease_lsn) == 0)<br>            valid_leases++;<br>    REP_SYSTEM_UNLOCK<br>    if (valid_leases < min_leases) {<br>        ret = __rep_lease_refresh(dbenv, ...);<br>        /*<br>         * If we are successful, we need to recheck the leases because <br>         * the lease grant messages may have raced with the PERM<br>         * acknowledgement.  Give those messages a chance to arrive.<br>         */<br>        if (ret == 0) {<br>            if (tries < MAX_REFRESH_TRIES) {<br>                /*<br>                 * If we were successful sending, but not successful in racing the<br>                 * message thread, yield the processor so that message<br>                 * threads may have a chance to run.<br>                 */<br>                if (tries > 0)<br>                    /* __os_sleep instead?? */<br>                    __os_yield();<br>                tries++;<br>                goto retry;<br>            } else<br>                ret = DB_RET_LEASE_EXPIRED;<br>        }<br>    }<br>    return (ret);</pre>
If the master has enough valid leases it returns success. If it does not
have enough, it attempts to refresh them. This attempt may fail if sending
the PERM record does not receive sufficient acks. If we do receive
sufficient acknowledgements we may still find that scheduling of message
threads means the master hasn't yet processed the incoming
<b>REP_LEASE_GRANT</b> messages. We will retry a couple of times (possibly
parameterized) if the master discovers that situation.
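<br>
<br>
The grant handling, forced expiration and validity check described above
can be sketched together as a small, self-contained C model. This is only
an illustration under stated assumptions, not the DB implementation: the
types <i>lsn_t</i>, <i>usec_t</i>, <i>lease_entry_t</i> and
<i>lease_table_t</i>, the <i>MAX_SITES</i> limit and the function names
are all hypothetical, locking is omitted, the current time is passed in by
the caller, and <i>nsites</i> is assumed to count the master itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SITES 8 /* hypothetical fixed table size */

/* Hypothetical stand-ins for DB_LSN and db_timespec. */
typedef struct { uint32_t file, offset; } lsn_t;
typedef int64_t usec_t; /* microseconds on the master's clock */

typedef struct {
    int in_use;
    int eid;            /* client that granted the lease */
    usec_t start_time;  /* master timestamp echoed back in the grant */
    usec_t end_time;    /* start_time + lease timeout */
    lsn_t lsn;          /* PERM LSN the grant covers */
} lease_entry_t;

typedef struct {
    lease_entry_t entry[MAX_SITES];
    uint32_t nsites;    /* group size, master included (assumption) */
    usec_t timeout;     /* lease length */
} lease_table_t;

int lsn_cmp(lsn_t a, lsn_t b)
{
    if (a.file != b.file)
        return a.file < b.file ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/* Handle an incoming grant: add a first lease, or update with a newer one. */
void lease_grant(lease_table_t *t, int eid, usec_t ts, lsn_t lsn)
{
    lease_entry_t *e = NULL, *spare = NULL;
    for (size_t i = 0; i < MAX_SITES; i++) {
        if (t->entry[i].in_use && t->entry[i].eid == eid)
            e = &t->entry[i];
        else if (!t->entry[i].in_use && spare == NULL)
            spare = &t->entry[i];
    }
    if (e == NULL) {
        assert(spare != NULL); /* there must be space in the table */
        e = spare;
        e->in_use = 1;
        e->eid = eid;
    } else if (ts <= e->start_time || lsn_cmp(lsn, e->lsn) < 0) {
        return; /* stale or reordered grant: keep the existing entry */
    }
    e->start_time = ts;
    e->end_time = ts + t->timeout;
    e->lsn = lsn;
}

/* Forced expiration after a send failure: end every lease at its start. */
void lease_expire_all(lease_table_t *t)
{
    for (size_t i = 0; i < MAX_SITES; i++)
        if (t->entry[i].in_use)
            t->entry[i].end_time = t->entry[i].start_time;
}

/* Return 1 if at least nsites / 2 clients hold unexpired leases at perm_lsn. */
int lease_check(const lease_table_t *t, usec_t now, lsn_t perm_lsn)
{
    uint32_t valid = 0, need = t->nsites / 2;
    for (size_t i = 0; i < MAX_SITES && valid < need; i++) {
        const lease_entry_t *e = &t->entry[i];
        if (e->in_use && e->end_time >= now && lsn_cmp(e->lsn, perm_lsn) == 0)
            valid++;
    }
    return valid >= need;
}
```

In this model a five-site group needs two client grants at the current
PERM LSN before <i>lease_check</i> succeeds, and a call to
<i>lease_expire_all</i> makes every later check fail until fresh grants
arrive. The refresh-and-retry loop from the pseudocode above is left out on
purpose, since it depends on the messaging layer.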
<br>
<h2>Elections</h2>
When a client grants a lease to a master, it gives up the right to
participate in an election until that grant expires. If we are the master
and <i>dbenv->rep_elect</i> is called, it should return, no matter what,
like it does today. If we are a client and <i>rep_elect</i> is called,
special processing takes place when leases are in effect. In the easy case,
the lease granted by this client has already expired and the client goes
directly into the election as normal. If a valid lease grant is outstanding
to a master, this site cannot participate in an election until that grant
expires. We have at least two options when a site calls the
<i>dbenv->rep_elect</i> API while leases are in effect.<br>
<ol>
 <li>The simplest coding solution for DB would be simply to refuse to
participate in the election if this site has a current lease granted to a
master. We would detect this situation and return EINVAL. This is correct
behavior and trivial to implement. The disadvantage of this solution is
that the application would then be responsible for repeatedly attempting an
election until the lease grant expired.<br>
 </li>
 <li>The more satisfying solution is for DB to wait the remaining time for
the grant. If this client hears from the master during that time, the
election does not take place and the call to <i>rep_elect</i> returns with
the information for the current/old master.</li>
</ol>
<h3>Election Code Changes</h3>
The code changes to support leases in the election code are fairly
isolated. First, if leases are configured, we must verify the <i>nsites</i>
parameter is set to 0. Second, in <i>__rep_elect_init</i> we must not
overwrite the value of <i>rep->nsites</i> for leases because it is
controlled by the <i>dbenv->rep_set_nsites</i> API.
These changes are small and easy to understand.<br>
<br>
The more complicated code will be the client code when it has an
outstanding lease granted. The client will wait for the current lease grant
to expire before proceeding with the election. The client will only do so
if it does not hear from the master for the remainder of the lease grant
time. If the client hears from the master, it returns and does not begin
participating in the election. A new election phase, <b>REP_EPHASE0</b>,
will exist so that the call to <i>__rep_wait</i> can detect if a master
responds. The client, while waiting for the lease grant to expire, will
send a <b>REP_MASTER_REQ</b> message so that the master will respond with a
<b>REP_NEWMASTER</b> message and thus allow the client to know the master
exists. However, it is also desirable that, when the master replies to the
client, it also prompts the client to update its lease grant. <br>
<br>
Recall that the <b>REP_NEWMASTER</b> message does not result in a lease
grant from the client. The client responds with its lease grant, up to the
given LSN, only when it processes a PERM record that has the
<b>REPCTL_LEASE</b> flag set in the message. Therefore, we want the
client's <b>REP_MASTER_REQ</b> both to discover the existing master and to
have the master refresh its leases. The client will also use the
<b>REPCTL_LEASE</b> flag in its <b>REP_MASTER_REQ</b> message to the
master. This flag will serve as the indicator to the master that it needs
to deal with leases and both send the <b>REP_NEWMASTER</b> message and
refresh the lease.<br>
The code will work as follows:<br>
<pre>if (leases_configured && (my_grant_still_valid || lease_never_granted)) {<br>    if (lease_never_granted)<br>        wait_time = lease_timeout;<br>    else<br>        wait_time = grant_expiration - current_time;<br>    F_SET(REP_F_EPHASE0);<br>    __rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);<br>    ret = __rep_wait(..., REP_F_EPHASE0);<br>    if (we found a master)<br>        return<br>} /* if we don't return, fall out and proceed with election */<br></pre>
On the master side, the code handling the <b>REP_MASTER_REQ</b> will
do:<br>
<pre>if (I am master) {<br>    ...<br>    __rep_send_message(REP_NEWMASTER...)<br>    if (F_ISSET(rp, REPCTL_LEASE))<br>        __rep_lease_refresh(...)<br>}<br></pre>
One other minor implementation detail is that <i>__rep_elect_done</i> must
also clear the <b>REP_F_EPHASE0</b> flag. We also, obviously, need to
define <b>REP_F_EPHASE0</b> in the list of replication flags. Note that the
client's call to <i>__rep_wait</i> will return upon receiving the
<b>REP_NEWMASTER</b> message. The client will independently refresh its
lease when it receives the log record from the master's call to refresh the
lease.<br>
<br>
Again, similar to what I suggested above, the code could simply assume
global leases are configured, and instead of having the <b>REPCTL_LEASE</b>
flag at all, the master assumes that it needs to refresh leases because it
has them configured, not because it is specified in the
<b>REP_MASTER_REQ</b> message it is processing. Right now I don't think
every possible <b>REP_MASTER_REQ</b> message should result in a lease grant
request.<br>
<h4>Elections and Quiescent Systems</h4>
It is possible that a master is slow or the client is close to its
expiration time, or that the master is quiescent and all leases are
currently expired, but nothing much is going on anyway, yet some client
calls <i>__rep_elect</i> at that time. In the code above, we will not send
the <b>REP_MASTER_REQ</b> because the lease is not valid. The client will
simply proceed directly to sending the <b>REP_VOTE1</b> message, throwing
all other clients into an election. The master is still master and should
stay that way.
Currently, in response to a vote message, a master will broadcast out a
<b>REP_NEWMASTER</b> to assert its mastership. That causes the election to
complete. However, the master may also want to proactively refresh its
leases. This situation indicates to me that the master should choose to
refresh leases based on configuration, not a flag sent from the client. I
believe that any time the master asserts its mastership by sending a
<b>REP_NEWMASTER</b> message, I need to add code to proactively refresh
leases at that time.<br>
<h2>Other Implementation Details</h2>
<h3>Role Changes<br>
</h3>
When a site changes its role via a call to <i>rep_start</i> in either
direction, we must take action when leases are configured. There are three
types of role change, all of which need code changes to deal with
leases:<br>
<ol>
 <li><i>A master downgrading to a client.</i> When a master downgrades to
a client, it can do so immediately after it has proactively expired all
existing leases it holds. This situation is similar to an error from the
send callback, and it effectively cancels all outstanding leases held on
this site. Note that if this master expires its leases, it does not have
any effect on when the clients' lease grants expire on the client side. The
clients must still wait their full expected grant time.<br>
 </li>
 <li><i>A client upgrading to master.</i>
If a client is upgrading to a master but it has an outstanding lease
granted to another site, the code will return an <b>EINVAL</b> error. This
situation only arises if the application simply declares this site master.
If a site wins an election then the election itself should have waited long
enough for the granted lease to expire, so this state should not arise in
that case.</li>
 <li><i>A client finding a new master.</i>
When a client discovers a new and different master via a
<b>REP_NEWMASTER</b> message, the client cannot accept that new master
until its current lease grant expires. This situation should only occur
when a site declares itself master without an election and that site's
lease grant expires before this client's grant expires. However, it is
<b>possible</b> for this situation to arise with elections also. If we have
5 sites holding an election and 4 of those sites have leases expire at
about the same time T, and this site's lease expires at time T+N and the
election timeout is less than N, then those 4 sites may hold an election
and elect a master without this site's participation. A client in this
situation must call <i>__rep_wait</i> with the time remaining on its lease.
If the lease is expired after waiting the remaining time, then the client
can accept this new master. If the lease was refreshed during the waiting
period then the client does not accept this new master and returns.<br>
 </li>
</ol>
<h3>DUPMASTER</h3>
A duplicate master situation can occur if an old master becomes
disconnected from the rest of the group, that group elects a new master and
then the partition is resolved. The requirement for master leases is that
this situation will not cause the newly elected, rightful master to receive
the <b>DB_REP_DUPMASTER</b> return. It is okay for the old master to get
that return value.
When a dual master situation exists, the following will happen:<br>
<ul>
 <li><i>On the current master and all current clients</i> - If the current
master receives an update message or other conflicting message from the old
master then that message will be ignored because the generation number is
out of date.</li>
 <li><i>On the old master</i> - If the old master receives an update
message from the current master, or any other message with a later
generation from any site, the new generation number will trigger this site
to return <b>DB_REP_DUPMASTER</b>. However, instead of broadcasting out the
<b>REP_DUPMASTER</b> message to shoot down others as well, this site, if
leases are configured, will call <i>__rep_lease_check</i> and, if they are
expired, return the error. It should be impossible for us to receive a
later generation message and still hold a majority of master leases. If
that does occur, something is seriously wrong, so we will <b>DB_ASSERT</b>
that this situation cannot happen.<br>
 </li>
</ul>
<h3>Client to Client Synchronization</h3>
One question to ask is how lease grants interact with client-to-client
synchronization. The only answer is that they do not. A client that is
sending log records to another client cannot request that the receiving
client refresh its lease with the master. That client does not have a
timestamp it can use for the master, and clock skew makes its own timestamp
meaningless between machines. Therefore, sites that use client-to-client
synchronization will likely see more lease refreshment during the read
path, and leases will be refreshed during live updates only.
Of course, if a client supplies log records that fill a gap, and the later
log records stored came from the master in a live update, then the client
will respond as per the discussion on Gap Processing above.<br>
<h2>Interaction Matrix</h2>
If leases are granted (by a client) or held (by a master) what should the
following APIs and messages do?<br>
<br>
Other:<br>
log_archive: Leases do not affect log_archive. OK.<br>
dbenv->close: OK.<br>
crash during lease grant and restart: <b>Potential problem here. See
discussion below</b>.<br>
<br>
Rep Base API method:<br>
rep_elect: Already discussed above. Must wait for lease to expire.<br>
rep_flush: Master only, OK - this will be the basis for refreshing
leases.<br>
rep_get_*: Not affected by leases.<br>
rep_process_message: Generally OK. We'll discuss each message below.<br>
rep_set_config: OK.<br>
rep_set_limit: OK.<br>
rep_set_nsites: Must be called before <i>rep_start</i> and <i>nsites</i>
is immutable until 14778 is resolved.<br>
rep_set_priority: OK.<br>
rep_set_timeout: OK. Used to set lease timeout.<br>
rep_set_transport: OK.<br>
rep_start(MASTER): Role changes are discussed above. Make sure duplicate
rep_start calls are no-ops for leases.<br>
rep_start(CLIENT): Role changes are discussed above. Make sure duplicate
calls are no-ops for leases.<br>
rep_stat: OK. <b>[Do we have any stats we want to add? Currently none are
planned, but may come up during implementation and testing as useful to
have. Suggestions?]</b><br>
rep_sync: Should not be able to happen. Client cannot accept new master
with outstanding lease grant. Add DB_ASSERT here.<br>
<br>
REP_ALIVE: OK.<br>
REP_ALIVE_REQ: OK.<br>
REP_ALL_REQ: OK.<br>
REP_BULK_LOG: OK. Clients check to send ACK.<br>
REP_BULK_PAGE: Should never process one with lease granted.
Add DB_ASSERT.<br>
REP_DUPMASTER: Should never happen; this is what leases are supposed to
prevent. See above.<br>
REP_LOG: OK. Clients check to send ACK.<br>
REP_LOG_MORE: OK <b>[maybe remove and use flag]</b> Clients check to send
ACK.<br>
REP_LOG_REQ: OK.<br>
REP_MASTER_REQ: OK.<br>
REP_NEWCLIENT: OK.<br>
REP_NEWFILE: OK. Clients check to send ACK.<br>
REP_NEWMASTER: See above.<br>
REP_NEWSITE: OK.<br>
REP_PAGE: OK. Should never process one with lease granted. Add
DB_ASSERT.<br>
REP_PAGE_FAIL: OK. Should never process one with lease granted. Add
DB_ASSERT.<br>
REP_PAGE_MORE: OK. Should never process one with lease granted. Add
DB_ASSERT.<br>
REP_PAGE_REQ: OK.<br>
REP_REREQUEST: OK.<br>
REP_UPDATE: OK. Should never process one with lease granted. Add
DB_ASSERT.<br>
REP_UPDATE_REQ: OK. This is a master-only message.<br>
REP_VERIFY: OK. Should never process one with lease granted. Add
DB_ASSERT.<br>
REP_VERIFY_FAIL: OK. Should never process one with lease granted. Add
DB_ASSERT.<br>
REP_VERIFY_REQ: OK.<br>
REP_VOTE1: OK. See Election discussion above. It is possible to receive
one with a lease granted. Client cannot send one with an outstanding lease
however.<br>
REP_VOTE2: OK. See Election discussion above. It is possible to receive
one with a lease granted.<br>
<br>
If the following method or message processing is in progress and a client
wants to grant a lease, what should it do? Let's examine what this means.
The client wanting to grant a lease simply means it is responding to the
receipt of a <b>REP_LOG</b> (or its variants) message and applying a log
record. Therefore, we need to consider a thread processing a log message
racing with these other actions.<br>
<br>
Other:<br>
log_archive: OK. <br>
dbenv->close: User error.
User should not be closing the env while other threads are using that
handle. Should have no effect if a 2nd dbenv handle to same env is
closed.<br>
<br>
Rep Base API method:<br>
rep_elect: See Election discussion above. <i>rep_elect</i> should wait and
may grant lease while election is in progress.<br>
rep_flush: Should not be called on client.<br>
rep_get_*: OK.<br>
rep_process_message: Generally OK. See handling each message below.<br>
rep_set_config: OK.<br>
rep_set_limit: OK.<br>
rep_set_nsites: Must be called before <i>rep_start</i> until 14778 is
resolved.<br>
rep_set_priority: OK.<br>
rep_set_timeout: OK.<br>
rep_set_transport: OK.<br>
rep_start(MASTER): OK, can't happen - already protect racing
<i>rep_start</i> and <i>rep_process_message</i>.<br>
rep_start(CLIENT): OK, can't happen - already protect racing
<i>rep_start</i> and <i>rep_process_message</i>.<br>
rep_stat: OK.<br>
rep_sync: Shouldn't happen because client cannot grant leases during
sync-up. Incoming log message ignored.<br>
<br>
REP_ALIVE: OK.<br>
REP_ALIVE_REQ: OK.<br>
REP_ALL_REQ: OK.<br>
REP_BULK_LOG: OK.<br>
REP_BULK_PAGE: OK. Incoming log message ignored during internal init.<br>
REP_DUPMASTER: Shouldn't happen. See DUPMASTER discussion above.<br>
REP_LOG: OK.<br>
REP_LOG_MORE: OK.<br>
REP_LOG_REQ: OK.<br>
REP_MASTER_REQ: OK.<br>
REP_NEWCLIENT: OK.<br>
REP_NEWFILE: OK.<br>
REP_NEWMASTER: See above. If a client accepts a new master because its
lease grant expired, then that master sends a message requesting the lease
grant, this client will not process the log record if it is in sync-up
recovery, or it may after the master switch is complete and the client
doesn't need sync-up recovery. Basically, just uses existing log record
processing/newmaster infrastructure.<br>
REP_NEWSITE: OK.<br>
REP_PAGE: OK.
Receiving a log record during internal init PAGE phase should ignore log
record.<br>
REP_PAGE_FAIL: OK.<br>
REP_PAGE_MORE: OK.<br>
REP_PAGE_REQ: OK.<br>
REP_REREQUEST: OK.<br>
REP_UPDATE: OK. Receiving a log record during internal init should ignore
log record.<br>
REP_UPDATE_REQ: OK - master-only message.<br>
REP_VERIFY: OK. Receiving a log record during verify phase ignores log
record.<br>
REP_VERIFY_FAIL: OK.<br>
REP_VERIFY_REQ: OK.<br>
REP_VOTE1: OK. This client is processing someone else's vote when the
lease request comes in. That is fine. We protect our own election and
lease interaction in <i>__rep_elect</i>.<br>
REP_VOTE2: OK.<br>
<h4>Crashing - Potential Problem<br>
</h4>
It appears there is one area where we could have a problem. I believe that
crashes can cause us to break our guarantee on durability, authoritative
reads and inability to elect duplicate masters. Consider this scenario:<br>
<ol>
 <li>A master and 4 clients are all up and running.</li>
 <li>The master commits a txn and all 4 clients refresh their lease grants
at time T.</li>
 <li>All 4 clients have the txn and log records in the cache.
None are flushing to disk.</li>
 <li>All 4 clients have responded to the PERM messages as well as
refreshed their lease with the master.</li>
 <li>All 4 clients hit the same application coding error and crash
(machine/OS stays up).</li>
 <li>Master authoritatively reads data in txn from step 2.</li>
 <li>All 4 clients restart the application and run recovery, thus the txn
from step 2 is lost on all clients because it isn't in any
logs.<span style="font-weight: bold;"></span><br>
 </li>
 <li>A network partition happens and the master is alone on its side.</li>
 <li>All 4 clients are on the other side and elect a new master.</li>
 <li>Partition resolves itself and we have duplicate masters, where the
former master still holds all valid lease
grants.<span style="font-weight: bold;"></span><br>
 </li>
</ol>
Therefore, we have broken both guarantees. In step 6 the data is really not
durable and we've given it to the user. One can argue that if this is an
issue the application better be syncing somewhere if they really want
durability. However, worse than that is that we have a legitimate DUPMASTER
situation in step 10 where both masters hold valid leases. The reason is
that all lease knowledge is in the shared memory and that is lost when the
app restarts and runs recovery.<br>
<br>
How can we solve this? The obvious solution is (ugh, yet another) durable
BDB-owned file with some information in it, such as the current lease
expiration time, so that rebooting after a crash leaves the knowledge that
the lease was granted. However, writing and syncing every lease grant on
every client out to disk is far too expensive.<br>
<br>
A second possible solution is to have clients wait a full lease timeout
before entering an election the first time. This solution solves the
DUPMASTER issue, but not the non-authoritative read.
This solution naturally falls out of elections and leases really. If a
client has never granted a lease, it should be considered as having to wait
a full lease timeout before entering an election. Applications already know
that leases impact elections and this does not seem so bad as it is only on
the first election.<br>
<br>
Is it sufficient to document that the authoritative read is only as
authoritative as the durability guarantees the application makes on the
sites that indicate the record is permanent? Yes, I believe this is
sufficient. If the application says it is permanent and it really isn't,
then the application is at fault. Believing the application when it
indicates with the PERM response that it is permanent avoids the
authoritative problem <span style="font-weight: bold;">[document this
application requirement]</span>. <br>
<h2>Upgrade/Mixed Versions</h2>
Clearly leases cannot be used with mixed version sites since masters
running older releases will not have any knowledge of lease support. What
considerations are needed in the lease code for mixed versions?<br>
<br>
First, if the <b>REP_CONTROL</b> structure changes, we need to maintain and
use an old version of the structure for talking to older clients and
masters. The implementation of this would be similar to the way we manage
old <b>REP_VOTE_INFO</b> structures. Second, any new messages need
translation table entries added. Third, if we are assuming global leases
then clearly any mixed versions cannot have leases configured, and leases
cannot be used in mixed version groups.
Maintaining two versions of the control structure is not necessary if we
choose a different style of implementation and don't change the control
structure.<br>
<br>
However, then how could an old application both run continuously, upgrade
to the new release and take advantage of leases without taking down the
entire application? I believe it is possible for clients to be configured
for leases but be subject to the master regarding leases, yet the master
code can assume that if it has leases configured, all client sites do as
well. In several places above I suggested that a client could make a choice
based on either a new <b>REPCTL_LEASE</b> flag or simply having leases
turned on locally. If we choose to use the flag, then we can support leases
with mixed versions. The upgraded clients can configure leases and they
simply will not be granted until the old master is upgraded and sends a
PERM message with the flag indicating it wants a lease grant. The clients,
while having the leases configured, will not grant a lease until told to do
so and will simply have an expired lease. Then, when the old master finally
upgrades, it too can configure leases and suddenly all sites are using
them. I believe this should work just fine and I will need to make sure a
client's granting of leases is only in response to the master asking for a
grant.
If the master never asks, then the client has them configured, but doesn't
grant them.<br>
<h2>Testing</h2>
Clearly any user-facing API changes will need the equivalent reflection in
the Tcl API for testing, under CONFIG_TEST.<br>
<br>
I am sure the list of tests will grow but off the top of my head:<br>
Basic test: have N sites all configure leases, run some, read on master,
etc.<br>
Refresh test: Perform update on master, sleep until past expiration, read
on master and make sure leases are refreshed/read successful.<br>
Error test: Test error conditions (reading on client with leases but no
ignore flag, calling after rep_start, etc).<br>
Read test: Test reading on both client and master both with and without the
IGNORE flag. Test that data read with the ignore flag can be rolled
back.<br>
Dupmaster test: Force a DUPMASTER situation and verify that the newer
master cannot get DUPMASTER error.<br>
Election test: Call election while grant is outstanding and master
exists.<br>
Call election while grant is outstanding and master does not exist.<br>
Call election after expiration on quiescent system with master
existing.<br>
Run with a group where some members have leases configured and others do
not to make sure we get errors instead of dumping core.<br>
<br>
<small><br>
</small>
</body>
</html>