<!--$Id: partition.so,v 1.8 2007/10/03 17:48:44 sue Exp $-->
<!--Copyright (c) 1997,2008 Oracle.  All rights reserved.-->
<!--See the file LICENSE for redistribution information.-->
<html>
<head>
<title>Berkeley DB Reference Guide: Network partitions</title>
<meta name="description" content="Berkeley DB: An embedded database programmatic toolkit.">
<meta name="keywords" content="embedded,database,programmatic,toolkit,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++">
</head>
<body bgcolor=white>
<table width="100%"><tr valign=top>
<td><b><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></b></td>
<td align=right><a href="/rep/clock_skew.html"><img src="/images/prev.gif" alt="Prev"></a><a href="/toc.html"><img src="/images/ref.gif" alt="Ref"></a><a href="/rep/faq.html"><img src="/images/next.gif" alt="Next"></a>
</td></tr></table>
<p align=center><b>Network partitions</b></p>
<p>The Berkeley DB replication implementation can be affected by network
partitioning problems.</p>
<p>For example, consider a replication group with N members.  The network
partitions with the master on one side and more than N/2 of the sites
on the other side.  The sites on the side with the master will continue
forward, and the master will continue to accept write queries for the
databases.  Unfortunately, the sites on the other side of the partition,
realizing they no longer have a master, will hold an election.  The
election will succeed as there are more than N/2 of the total sites
participating, and there will then be two masters for the replication
group.  Since both masters are potentially accepting write queries, the
databases could diverge in incompatible ways.</p>
<p>If multiple masters are ever found to exist in a replication group, a
master detecting the problem will return <a href="/api_c/rep_message.html#DB_REP_DUPMASTER">DB_REP_DUPMASTER</a>.  If
the application sees this return, it should reconfigure itself as a
client (by calling <a href="/api_c/rep_start.html">DB_ENV-&gt;rep_start</a>), and then call for an election
(by calling <a href="/api_c/rep_elect.html">DB_ENV-&gt;rep_elect</a>).  The site that wins the election may be
one of the two previous masters, or it may be another site entirely.
Regardless, the winning system will bring all of the other systems into
conformance.</p>
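<p>The following is a minimal sketch of how an application using the Base
replication API might react to this return value.  The <b>nsites</b> and
<b>nvotes</b> values are placeholders for the application's own
configuration, and error handling is abbreviated:</p>
<blockquote><pre>#include &lt;db.h&gt;

/*
 * Sketch only: step down to client status and call for an election
 * after seeing DB_REP_DUPMASTER.  The winner -- possibly one of the
 * two previous masters, possibly another site -- is reported through
 * the application's replication event or message handling.
 */
int
handle_dupmaster(DB_ENV *dbenv, u_int32_t nsites, u_int32_t nvotes)
{
	int ret;

	/* Reconfigure this environment as a client. */
	if ((ret = dbenv-&gt;rep_start(dbenv, NULL, DB_REP_CLIENT)) != 0)
		return (ret);

	/* Call for an election. */
	return (dbenv-&gt;rep_elect(dbenv, nsites, nvotes, 0));
}</pre></blockquote>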
<p>As another example, consider a replication group with a master
environment and two clients A and B, where client A may upgrade to
master status and client B cannot.  Then, assume client A is partitioned
from the other two database environments, and it becomes out-of-date
with respect to the master.  Next, assume the master crashes and does
not come back on-line.  Subsequently, the network partition is restored,
and clients A and B hold an election.  As client B cannot win the
election, client A will win by default, and in order for the two sites
to get back into sync, transactions committed on client B may be
unrolled until the two sites can once again move forward together.</p>
<p>In both of these examples, there is a phase where a newly elected master
brings the members of a replication group into conformance with itself
so that it can start sending new information to them.  This can result
in the loss of information as previously committed transactions are
unrolled.</p>
<p>In architectures where network partitions are an issue, applications
may want to implement a heartbeat protocol to minimize the consequences
of a bad network partition.  As long as a master is able to contact at
least half of the sites in the replication group, it is impossible for
there to be two masters.  If the master can no longer contact a
sufficient number of systems, it should reconfigure itself as a client
and hold an election.  Replication Manager does not currently
implement such a feature, so this technique is only available to
applications that use the Base replication API.</p>
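<p>The following sketch illustrates the idea for a Base API master.  It
assumes the application's own transport layer can report how many of the
group's sites it has heard from recently; the <b>reachable_sites</b>
function is hypothetical, as are the <b>nsites</b> and <b>nvotes</b>
values:</p>
<blockquote><pre>#include &lt;db.h&gt;

/* Hypothetical: how many sites answered a recent heartbeat. */
extern u_int32_t reachable_sites(void);

/*
 * Sketch only: if this master can no longer contact at least half of
 * the sites in the group, step down and call for an election.
 */
int
check_quorum(DB_ENV *dbenv, u_int32_t nsites, u_int32_t nvotes)
{
	int ret;

	if (2 * reachable_sites() &gt;= nsites)
		return (0);	/* Still in contact with at least half. */

	/* Lost contact: reconfigure as a client and hold an election. */
	if ((ret = dbenv-&gt;rep_start(dbenv, NULL, DB_REP_CLIENT)) != 0)
		return (ret);
	return (dbenv-&gt;rep_elect(dbenv, nsites, nvotes, 0));
}</pre></blockquote>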
<p>There is another tool applications can use to minimize the damage in
the case of a network partition.  By specifying an <b>nsites</b>
argument to <a href="/api_c/rep_elect.html">DB_ENV-&gt;rep_elect</a> that is larger than the actual number of
database environments in the replication group, applications can keep
systems from declaring themselves the master unless they can talk to
a large percentage of the sites in the system.  For example, if there
are 20 database environments in the replication group, and an argument
of 30 is specified to the <a href="/api_c/rep_elect.html">DB_ENV-&gt;rep_elect</a> method, then a system will have
to be able to talk to at least 16 of the sites to declare itself the
master.</p>
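<p>A call for such an election might look like the following sketch,
where the <b>nvotes</b> argument of 0 accepts the default of a simple
majority of <b>nsites</b> (16 of 30 in this example):</p>
<blockquote><pre>#include &lt;db.h&gt;

/*
 * Sketch only: a 20-site group deliberately calling for an election
 * with an inflated nsites of 30, so a site must reach at least 16 of
 * the 20 real environments before it can declare itself master.
 */
int
call_election(DB_ENV *dbenv)
{
	return (dbenv-&gt;rep_elect(dbenv, 30, 0, 0));
}</pre></blockquote>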
<p>Replication Manager uses the value of <b>nsites</b> (configured by
the <a href="/api_c/rep_nsites.html">DB_ENV-&gt;rep_set_nsites</a> method) for elections as well as in calculating how
many acknowledgements to wait for when sending a
<a href="/api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> message.  So this technique may be useful here
as well, unless the application uses the <a href="/api_c/repmgr_ack_policy.html#DB_REPMGR_ACKS_ALL">DB_REPMGR_ACKS_ALL</a> or
<a href="/api_c/repmgr_ack_policy.html#DB_REPMGR_ACKS_ALL_PEERS">DB_REPMGR_ACKS_ALL_PEERS</a> acknowledgement policies.</p>
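<p>For a Replication Manager application, the equivalent configuration is
made once, before starting replication, rather than per election.  The
values below mirror the 20-site example above and are only
illustrative:</p>
<blockquote><pre>#include &lt;db.h&gt;

/*
 * Sketch only: configure an inflated nsites for Replication Manager.
 * The value is used both for elections and for counting
 * DB_REP_PERMANENT acknowledgements, so an acknowledgement policy
 * other than DB_REPMGR_ACKS_ALL or DB_REPMGR_ACKS_ALL_PEERS is
 * assumed here.
 */
int
configure_repmgr_nsites(DB_ENV *dbenv)
{
	int ret;

	if ((ret = dbenv-&gt;rep_set_nsites(dbenv, 30)) != 0)
		return (ret);
	return (dbenv-&gt;repmgr_set_ack_policy(dbenv, DB_REPMGR_ACKS_QUORUM));
}</pre></blockquote>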
<p>Specifying an <b>nsites</b> argument to <a href="/api_c/rep_elect.html">DB_ENV-&gt;rep_elect</a> that is
smaller than the actual number of database environments in the
replication group has its uses as well.  For example, consider a
replication group with two environments.  If they are partitioned from
each other, neither of the sites could ever get enough votes to become
the master.  A reasonable alternative would be to specify an
<b>nsites</b> argument of 2 to one of the systems and an <b>nsites</b>
argument of 1 to the other.  That way, one of the systems could win
elections even when partitioned, while the other one could not.  This
would allow one of the systems to continue accepting write
queries after the partition.</p>
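<p>A sketch of such an asymmetric configuration, with the choice of which
site is favored left entirely to the application, might look like
this:</p>
<blockquote><pre>#include &lt;db.h&gt;

/*
 * Sketch only: asymmetric nsites values in a two-site group using
 * the Base API.  The favored site passes nsites of 1, so it is a
 * majority by itself and can win an election even when partitioned.
 */
int
call_election_favored(DB_ENV *dbenv)
{
	return (dbenv-&gt;rep_elect(dbenv, 1, 0, 0));
}

/* The other site passes nsites of 2 and cannot win alone. */
int
call_election_other(DB_ENV *dbenv)
{
	return (dbenv-&gt;rep_elect(dbenv, 2, 0, 0));
}</pre></blockquote>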
<p>In a 2-site group, Replication Manager reacts to the loss of
communication with the master by assuming the master has crashed: the
surviving client simply declares itself to be master.  Thus it avoids
the problem of the survivor never being able to get enough votes to
prevail.  But it does leave the group vulnerable to the risk of
multiple masters, if both sites are running but cannot communicate.</p>
<p>These scenarios stress the importance of good network infrastructure in
Berkeley DB replicated environments.  When replicating database environments
over sufficiently lossy networking, the best solution may well be to
pick a single master, and only hold elections when human intervention
has determined the selected master is unable to recover at all.</p>
<table width="100%"><tr><td><br></td><td align=right><a href="/rep/clock_skew.html"><img src="/images/prev.gif" alt="Prev"></a><a href="/toc.html"><img src="/images/ref.gif" alt="Ref"></a><a href="/rep/faq.html"><img src="/images/next.gif" alt="Next"></a>
</td></tr></table>
<p><font size=1>Copyright (c) 1996,2008 Oracle.  All rights reserved.</font>
</body>
</html>