<!--$Id: elect.so,v 1.30 2008/01/19 14:12:58 bostic Exp $-->
<!--Copyright (c) 1997,2008 Oracle. All rights reserved.-->
<!--See the file LICENSE for redistribution information.-->
<html>
<head>
<title>Berkeley DB Reference Guide: Elections</title>
<meta name="description" content="Berkeley DB: An embedded database programmatic toolkit.">
<meta name="keywords" content="embedded,database,programmatic,toolkit,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++">
</head>
<body bgcolor=white>
<table width="100%"><tr valign=top>
<td><b><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></b></td>
<td align=right><a href="../rep/newsite.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/mastersync.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p align=center><b>Elections</b></p>
<p>When using the Base replication API, it is the responsibility of the
application to initiate elections if desired. It is never dangerous
to hold an election, as the Berkeley DB election process ensures there is
never more than a single master database environment. Clients should
initiate an election whenever they lose contact with the master
environment, whenever they see a return of <a href="../../api_c/rep_message.html#DB_REP_HOLDELECTION">DB_REP_HOLDELECTION</a>
from the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method, or when, for whatever reason, they do
not know who the master is. It is not necessary for applications to
immediately hold elections when they start, as any existing master
will be discovered after calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>.
If no master has
been found after a short wait period, then the application should call
for an election.</p>
<p>For a client to win an election, the replication group must currently
have no master, and the client must have the most recent log records.
If clients have equivalent log records, the priority of
the database environments participating in the election determines
the winner. The application specifies the minimum number of replication
group members that must participate in an election for a winner to be
declared. We recommend at least ((N/2) + 1) members, where N is the
number of sites in the replication group. If fewer than the
simple majority are specified, a warning will be given.</p>
<p>If an application's policy for which site should win an election can be
parameterized in terms of the database environment's information (that
is, the number of sites, available log records and a relative priority
are all that matter), then Berkeley DB can handle all elections transparently.
However, there are cases where the application has more complete
knowledge and needs to affect the outcome of elections. For example,
applications may choose to handle master selection, explicitly
designating master and client sites. Applications in these cases may
never need to call for an election. Alternatively, applications may
choose to use <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>'s arguments to force the correct outcome
to an election.
That is, if an application has three sites, A, B, and
C, and after a failure of C determines that A must become the winner,
the application can guarantee an election's outcome by specifying
priorities appropriately for the election:</p>
<blockquote><pre>on A: priority 100, nsites 2
on B: priority 0, nsites 2</pre></blockquote>
<p>It is dangerous to configure more than one master environment using the
<a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method, and applications should be careful not to do so.
Applications should only configure themselves as the master environment
if they are the only possible master, or if they have won an election.
An application knows it has won an election when it receives the
<a href="../../api_c/env_event_notify.html#DB_EVENT_REP_ELECTED">DB_EVENT_REP_ELECTED</a> event.</p>
<p>Normally, when a master failure is detected, the election should
finish quickly so that the application can continue to service
updates. In that case, the participating sites are already up and can participate.
However, when a whole group restarts after an administrative
shutdown, it is possible that a slower booting site has later logs than
any other site. To cover that case, an application would like to give
the election more time to ensure all sites have a chance to participate.
Since it is intractable for a starting site to determine which case
the whole group is in, the use of a long timeout gives all sites a
reasonable chance to participate. If an application wanting full
participation sets the <b>nvotes</b> argument to the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method to
the number of sites in the group and one site does not reboot, a master
can never be elected without manual intervention.</p>
<p>
In those cases, the desired action at a group level is to hold
a full election if all sites crashed and a majority election if
a subset of sites crashed or rebooted.
Since an individual site cannot know
how many votes to require, a mechanism is available to
accomplish this using timeouts. By setting a long timeout (perhaps
on the order of minutes) using the <b>DB_REP_FULL_ELECTION_TIMEOUT</b>
flag to the <a href="../../api_c/rep_timeout.html">DB_ENV->rep_set_timeout</a> method, an application can
allow Berkeley DB to elect a master even without full participation.
Sites may also want to set a normal election timeout for majority
based elections using the <b>DB_REP_ELECTION_TIMEOUT</b> flag
to the <a href="../../api_c/rep_timeout.html">DB_ENV->rep_set_timeout</a> method.</p>
<p>
Consider three sites, A, B, and C, where A is the master. In the
case where all three sites crash and all reboot, all sites
will set a timeout for a full election, say 10 minutes, but only
require a majority for <b>nvotes</b> to the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method.
Once all three sites are booted, the election will complete
immediately, provided they reboot within 10 minutes of each other. Now consider
the case where all three sites crash and only two reboot. The two sites will
enter the election, but after the 10 minute timeout they will
elect a master with the majority of two sites. Using the full election
timeout sets a threshold for allowing a site to reboot and rejoin
the group.</p>
<p>To add a database environment to the replication group with the intent
of it becoming the master, first add it as a client. Since it may be
out-of-date with respect to the current master, allow it to update
itself from the current master. Then, shut the current master down.
Presumably, the added client will win the subsequent election.
If the
client does not win the election, it is likely that it was not given
sufficient time to update itself with respect to the current master.</p>
<p>If a client is unable to find a master or win an election, it means that
the network has been partitioned and there are not enough environments
participating in the election for one of the participants to win.
In this case, the application should repeatedly call <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>
and <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a>, alternating between attempting to discover an
existing master, and holding an election to declare a new one. In
desperate circumstances, an application could simply declare itself the
master by calling <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a>, or by reducing the number of
participants required to win an election until the election is won.
Neither of these solutions is recommended: in the case of a network
partition, either of these choices can result in there being two masters
in one replication group, and the databases in the environment might
irretrievably diverge as they are modified in different ways by the
masters.</p>
<p>Note that this presents a special problem for a replication group
consisting of only two environments. If a master site fails, the
remaining client can never comprise a majority of sites in the group.
If the client application can reach a remote network site, or some other
external tie-breaker, it may be able to determine whether it is safe
to declare itself master. Otherwise it must choose between providing
availability of a writable master (at the risk of duplicate masters),
or strict protection against duplicate masters (but no master when a
failure occurs).
Users of the Base replication API can make
this choice by setting the nvotes and nsites parameters to the
<a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method appropriately. Replication Manager offers this choice via the
<a href="../../api_c/rep_config.html">DB_ENV->rep_set_config</a> method.</p>
<p>It is possible for a less-preferred database environment to win an
election if a number of systems crash at the same time. Because an
election winner is declared as soon as enough environments participate
in the election, the environment on a slow booting but well-connected
machine might lose to an environment on a badly connected but faster
booting machine. In the case of a number of environments crashing at
the same time (for example, a set of replicated servers in a single
machine room), applications should bring the database environments on
line as clients initially (which will allow them to process read queries
immediately), and then hold an election after sufficient time has passed
for the slower booting machines to catch up.</p>
<p>If, for any reason, a less-preferred database environment becomes the
master, it is possible to switch masters in a replicated environment.
For example, suppose the preferred master crashes and one of the replication
group clients becomes the group master. In order to restore the
preferred master to master status, take the following steps:</p>
<ol>
<p><li>The preferred master should reboot and re-join the replication group
as a client.
<li>Once the preferred master has caught up with the replication group, the
application on the current master should complete all active transactions
and reconfigure itself as a client using the <a href="../../api_c/rep_start.html">DB_ENV->rep_start</a> method.
<li>Then, the current or preferred master should call for an election using
the <a href="../../api_c/rep_elect.html">DB_ENV->rep_elect</a> method.
</ol>
<p>Replication Manager automatically conducts elections when necessary,
based on configuration information supplied to the
<a href="../../api_c/rep_priority.html">DB_ENV->rep_set_priority</a> method and the <a href="../../api_c/rep_nsites.html">DB_ENV->rep_set_nsites</a> method.</p>
<table width="100%"><tr><td><br></td><td align=right><a href="../rep/newsite.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/mastersync.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p><font size=1>Copyright (c) 1996,2008 Oracle. All rights reserved.</font>
</body>
</html>