<!--$Id: trans.so,v 1.19 2007/09/21 15:41:26 sue Exp $-->
<!--Copyright (c) 1997,2008 Oracle. All rights reserved.-->
<!--See the file LICENSE for redistribution information.-->
<html>
<head>
<title>Berkeley DB Reference Guide: Transactional guarantees</title>
<meta name="description" content="Berkeley DB: An embedded database programmatic toolkit.">
<meta name="keywords" content="embedded,database,programmatic,toolkit,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++">
</head>
<body bgcolor=white>
<table width="100%"><tr valign=top>
<td><b><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Replication</dl></b></td>
<td align=right><a href="../rep/bulk.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/lease.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p align=center><b>Transactional guarantees</b></p>
<p>It is important to consider replication in the context of the overall
database environment's transactional guarantees. To briefly review,
transactional guarantees in a non-replicated application are based on
the writing of log file records to "stable storage", usually a disk
drive. If the application or system then fails, the Berkeley DB logging
information is reviewed during recovery, and the databases are updated
so that all changes made as part of committed transactions appear, and
all changes made as part of uncommitted transactions do not appear.
In this case, no information will have been lost.</p>
<p>If a database environment does not require that the log be flushed to
stable storage on transaction commit (using the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a>
flag to increase performance at the cost of sacrificing transactional
durability), Berkeley DB recovery will only be able to restore the system to
the state of the last commit found on stable storage. In this case,
information may have been lost (for example, the changes made by some
committed transactions may not appear in the databases after recovery).</p>
<p>Further, if there is database or log file loss or corruption (for
example, if a disk drive fails), then catastrophic recovery is
necessary, and Berkeley DB recovery will only be able to restore the system
to the state of the last archived log file. In this case, information
may also have been lost.</p>
<p>Replicating the database environment extends this model by adding a
new component to "stable storage": the client's replicated information.
If a database environment is replicated, there is no lost information
in the case of database or log file loss, because the replicated system
can be configured to contain a complete set of databases and log records
up to the point of failure. A database environment that loses a disk
drive can have the drive replaced, and it can then rejoin the
replication group.</p>
<p>Because of this new component of stable storage, specifying
<a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> in a replicated environment no longer sacrifices
durability, as long as one or more clients have acknowledged receipt of
the messages sent by the master.
Since network connections are often
faster than local synchronous disk writes, replication becomes a way
for applications to significantly improve their performance as well as
their reliability.</p>
<p>The return status from the application's <b>send</b> function must be
set by the application to ensure the transactional guarantees the
application wants to provide. Whenever the <b>send</b> function
returns failure, the local database environment's log is flushed as
necessary to ensure that any information critical to database integrity
is not lost. Because this flush is an expensive operation in terms of
database performance, applications should avoid returning an error from
the <b>send</b> function, if at all possible.</p>
<p>The only messages that matter for replication transactional
guarantees are those for which the application's <b>send</b> function was called
with the <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag specified. There is no reason
for the <b>send</b> function to ever return failure unless the
<a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag was specified. Messages without the
<a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag do not make visible changes to databases,
and the <b>send</b> function can return success to Berkeley DB as soon as
the message has been sent to the client(s) or even just copied to local
application memory in preparation for being sent.</p>
<p>When a client receives a <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> message, the client
will flush its log to stable storage before returning (unless the client
environment has been configured with the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> option).
If the client is unable to flush a complete transactional record to disk
for any reason (for example, there is a missing log record before the
flagged message), the call to the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method on the client
will return <a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a> and return the LSN of this record
to the application in the <b>ret_lsnp</b> parameter.
The application's client or master
message handling loops should take proper action to ensure the correct
transactional guarantees in this case. When missing records arrive
and allow subsequent processing of previously stored permanent
records, the call to the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method on the client will
return <a href="../../api_c/rep_message.html#DB_REP_ISPERM">DB_REP_ISPERM</a> and return the largest LSN of the
permanent records that were flushed to disk. Client applications
can use these LSNs to determine definitively whether any particular LSN is
permanently stored.</p>
<p>An application relying on a client's ability to become a master and
guarantee that no data has been lost will need to write the <b>send</b>
function to return an error whenever it cannot guarantee the site that
will win the next election has the record.
Applications not requiring
this level of transactional guarantee need not have the <b>send</b>
function return failure (unless the master's database environment has
been configured with <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a>), as any information critical
to database integrity has already been flushed to the local log before
<b>send</b> was called.</p>
<p>To sum up, the only reason for the <b>send</b> function to return
failure is when the master database environment has been configured to
not synchronously flush the log on transaction commit (that is,
<a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> was configured on the master), the
<a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a> flag is specified for the message, and the
<b>send</b> function was unable to determine that some number of
clients have received the current message (and all messages preceding
the current message). How many clients need to receive the message
before the <b>send</b> function can return success is an application
choice (and may depend less on a specific number of clients
reporting success than on one or more geographically distributed clients
reporting success).</p>
<p>If, however, the application does require on-disk durability on the master,
the master should be configured to synchronously flush the log on commit.
If clients are not configured to synchronously flush the log,
that is, if a client is running with <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> configured,
then it is up to the application to reconfigure that client
appropriately when it becomes a master.
That is, the
application must explicitly call <a href="../../api_c/env_set_flags.html">DB_ENV->set_flags</a> to
disable asynchronous log flushing as part of reconfiguring
the client as the new master.</p>
<p>Of course, it is important to ensure that the replicated master and
client environments are truly independent of each other. For example,
it does not help matters that a client has acknowledged receipt of a
message if both master and clients are on the same power supply, as the
failure of the power supply will still potentially lose information.</p>
<p>Configuring your replication-based application to achieve the proper
mix of performance and transactional guarantees can be complex. In
brief, there are a few controls an application can set to configure the
guarantees it makes: specification of <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> for the
master environment, specification of <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> for the
client environment, the priorities of different sites participating in
an election, and the behavior of the application's <b>send</b>
function.</p>
<p>Applications using Replication Manager are free to use
<a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> at the master and/or clients as they see fit. The
behavior of the <b>send</b> function that Replication Manager provides
on the application's behalf is determined by an "acknowledgement
policy", which is configured by the <a href="../../api_c/repmgr_ack_policy.html">DB_ENV->repmgr_set_ack_policy</a> method.
Clients always send acknowledgements for <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a>
messages (unless the acknowledgement policy in effect indicates that the
master doesn't care about them).
For a <a href="../../api_c/rep_transport.html#DB_REP_PERMANENT">DB_REP_PERMANENT</a>
message, the master blocks the sending thread until either it receives
the proper number of acknowledgements, or the <a href="../../api_c/rep_timeout.html#DB_REP_ACK_TIMEOUT">DB_REP_ACK_TIMEOUT</a>
expires. In the case of timeout, Replication Manager returns an error
code from the <b>send</b> function, causing Berkeley DB to flush the
transaction log before returning to the application, as previously
described. The default acknowledgement policy is
<a href="../../api_c/repmgr_ack_policy.html#DB_REPMGR_ACKS_QUORUM">DB_REPMGR_ACKS_QUORUM</a>, which ensures that the effect of a
permanent record remains durable following an election.</p>
<p>It is rarely useful to write and synchronously flush the log when
a transaction commits on a replication client. It may be useful where
systems share resources and multiple systems commonly fail at the same
time. By default, all Berkeley DB database environments, whether master or
client, synchronously flush the log on transaction commit or prepare.
Generally, replication masters and clients turn log flush off for
transaction commit using the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> flag.</p>
<p>Consider two systems connected by a network interface. One acts as the
master, the other as a read-only client. The client takes over as
master if the master crashes, and the master rejoins the replication
group after such a failure. Both master and client are configured to
not synchronously flush the log on transaction commit (that is,
<a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> was configured on both systems). The
application's <b>send</b> function never returns failure to the Berkeley DB
library, simply forwarding messages to the client (perhaps over a
broadcast mechanism), and always returning success.
On the client, any
<a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a> returns from the client's <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method
are ignored as well. This system configuration has excellent
performance, but may lose data in some failure modes.</p>
<p>If both the master and the client crash at once, it is possible to lose
committed transactions; that is, transactional durability is not being
maintained. Reliability can be increased by providing separate power
supplies for the systems and placing them in separate physical locations.</p>
<p>If the connection between the two machines fails (or just some number
of messages are lost), and subsequently the master crashes, it is
possible to lose committed transactions, again because transactional
durability is not being maintained. Reliability can be improved in a
couple of ways:</p>
<ol>
<p><li>Use a reliable network protocol (for example, TCP/IP instead of UDP).
<p><li>Increase the number of clients and network paths to make it less likely
that a message will be lost. In this case, it is important to also make
sure a client that did receive the message wins any subsequent election.
If a client that did not receive the message wins a subsequent election,
data can still be lost.
</ol>
<p>Further, systems may want to guarantee message delivery to the client(s)
(for example, to prevent a network connection from simply discarding
messages). Some systems may want to ensure clients never return
out-of-date information, that is, once a transaction commit returns
success on the master, no client will return old information to a
read-only query. Some of the following changes may be used to address
these issues:</p>
<ol>
<p><li>Write the application's <b>send</b> function to not return to Berkeley DB
until one or more clients have acknowledged receipt of the message.
The number of clients chosen will be dependent on the application: you
will want to consider likely network partitions (ensure that a client
at each physical site receives the message) and geographical diversity
(ensure that a client on each coast receives the message).
<p><li>Write the client's message processing loop to not acknowledge receipt
of the message until a call to the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method has returned
success. Messages resulting in a return of <a href="../../api_c/rep_message.html#DB_REP_NOTPERM">DB_REP_NOTPERM</a> from
the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method mean the message could not be flushed to the
client's disk. If the client does not acknowledge receipt of such
messages to the master until a subsequent call to the
<a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method returns <a href="../../api_c/rep_message.html#DB_REP_ISPERM">DB_REP_ISPERM</a> and the LSN
returned is at least as large as this message's LSN, then the master's
<b>send</b> function will not return success to the Berkeley DB library.
This means the thread committing the transaction on the master will not
be allowed to proceed based on the transaction having committed until
the selected set of clients have received the message and consider it
complete.
<p>Alternatively, the client's message processing loop could acknowledge
the message to the master, but with an error code indicating that the
application's <b>send</b> function should not return to the Berkeley DB
library until a subsequent acknowledgement from the same client
indicates success.</p>
<p>The application send callback function invoked by Berkeley DB contains
an LSN of the record being sent (if appropriate for that record).
When the <a href="../../api_c/rep_message.html">DB_ENV->rep_process_message</a> method returns an indication that a permanent
record has been written, it also returns the maximum LSN of the
permanent records written.</p>
</ol>
<p>There is one final pair of failure scenarios to consider. First, it is
not possible to abort transactions after the application's <b>send</b>
function has been called, as the master may have already written the
commit log records to disk, and so abort is no longer an option.
Second, a related problem is that even though the master will attempt
to flush the local log if the <b>send</b> function returns failure,
that flush may fail (for example, when the local disk is full). Again,
the transaction cannot be aborted as one or more clients may have
committed the transaction even if <b>send</b> returns failure. Rare
applications may not be able to tolerate these unlikely failure modes.
In that case the application may want to:</p>
<ol>
<p><li>Configure the master to always do local synchronous commits (turning
off the <a href="../../api_c/env_set_flags.html#DB_TXN_NOSYNC">DB_TXN_NOSYNC</a> configuration). This will decrease
performance significantly, of course (one of the reasons to use
replication is to avoid local disk writes). In this configuration,
failure to write the local log will cause the transaction to abort in
all cases.
<p><li>Do not return from the application's <b>send</b> function under any
conditions, until the selected set of clients has acknowledged the
message. Until the <b>send</b> function returns to the Berkeley DB library,
the thread committing the transaction on the master will wait, and so
no application will be able to act on the knowledge that the transaction
has committed.
</ol>
<p>The final alternative for applications concerned about these types of
failure is to use distributed transactions as an alternative means of
replication, guaranteeing full consistency at the cost of implementing
a Global Transaction Manager and performing two-phase commit across
multiple Berkeley DB database environments. More information on this topic
can be found in the <a href="../../ref/xa/intro.html">Distributed
Transactions</a> chapter.</p>
<table width="100%"><tr><td><br></td><td align=right><a href="../rep/bulk.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../rep/lease.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p><font size=1>Copyright (c) 1996,2008 Oracle. All rights reserved.</font>
</body>
</html>