<!--$Id: reclimit.so,v 11.32 2005/06/16 17:13:55 bostic Exp $-->
<!--Copyright (c) 1997,2008 Oracle. All rights reserved.-->
<!--See the file LICENSE for redistribution information.-->
<html>
<head>
<title>Berkeley DB Reference Guide: Berkeley DB recoverability</title>
<meta name="description" content="Berkeley DB: An embedded database programmatic toolkit.">
<meta name="keywords" content="embedded,database,programmatic,toolkit,btree,hash,hashing,transaction,transactions,locking,logging,access method,access methods,Java,C,C++">
</head>
<body bgcolor=white>
<a name="2"><!--meow--></a>
<table width="100%"><tr valign=top>
<td><b><dl><dt>Berkeley DB Reference Guide:<dd>Berkeley DB Transactional Data Store Applications</dl></b></td>
<td align=right><a href="../transapp/filesys.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../transapp/tune.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p align=center><b>Berkeley DB recoverability</b></p>
<p>Berkeley DB recovery is based on write-ahead logging. This means that
when a change is made to a database page, a description of the change is
written into a log file. This description in the log file is guaranteed
to be written to stable storage before the database pages that were
changed are written to stable storage. This is the fundamental feature
of the logging system that makes durability and rollback work.</p>
<p>If the application or system crashes, the log is reviewed during
recovery. Any database changes described in the log that were part of
committed transactions and that were never written to the actual
database itself are written to the database as part of recovery. Any
database changes described in the log that were never committed and that
were written to the actual database itself are backed out of the
database as part of recovery.
This design allows the database to be
written lazily, and only blocks from the log file have to be forced to
disk as part of transaction commit.</p>
<p>There are two interfaces that are a concern when considering Berkeley DB
recoverability:</p>
<ol>
<li>The interface between Berkeley DB and the operating system/filesystem.
<li>The interface between the operating system/filesystem and the
underlying stable storage hardware.
</ol>
<p>Berkeley DB uses the operating system's interfaces and its underlying filesystem
when writing its files. This means that Berkeley DB can fail if the underlying
filesystem fails in some unrecoverable way. Otherwise, the interface
requirements here are simple: the system call that Berkeley DB uses to flush
data to disk (normally fsync or fdatasync) must guarantee that all the
information necessary for a file's recoverability has been written to
stable storage before it returns to Berkeley DB, and that no possible
application or system crash can cause that file to be unrecoverable.</p>
<p>In addition, Berkeley DB implicitly uses the interface between the operating
system and the underlying hardware. The interface requirements here are
not as simple.</p>
<p>First, it is necessary to consider the underlying page size of the Berkeley DB
databases. The Berkeley DB library performs all database writes using the
page size specified by the application, and Berkeley DB assumes pages are
written atomically. This means that if the operating system performs
filesystem I/O in blocks of a different size than the database page size,
the possibility of database corruption may increase. For example,
assume that Berkeley DB is writing 32KB pages for a database, and the
operating system does filesystem I/O in 16KB blocks.
If the operating
system writes the first 16KB of the database page successfully, but
crashes before being able to write the second 16KB of the page, the
database has been corrupted and this corruption may or may not be
detected during recovery. For this reason, it may be important to
select database page sizes that will be written as single block
transfers by the underlying operating system. If you do not select a
page size that the underlying operating system will write as a single
block, you may want to configure the database to use checksums (see the
<a href="../../api_c/db_set_flags.html#DB_CHKSUM">DB_CHKSUM</a> flag for more information). By configuring checksums,
you guarantee that this kind of corruption will be detected, at the expense
of the CPU time required to generate the checksums. When such an error is
detected, the only course of recovery is to perform catastrophic
recovery to restore the database.</p>
<p>Second, if you are copying database files (either as part of a
hot backup or to create a hot failover area), there is an additional
question related to the page size of the Berkeley DB databases. You must copy
databases atomically, in units of the database page size. In other
words, the reads made by the copy program must not be interleaved with
writes by other threads of control, and the copy program must read the
databases in multiples of the underlying database page size. Generally,
this is not a problem, as operating systems already make this guarantee
and system utilities normally read in power-of-2 sized chunks, which
are larger than the largest possible Berkeley DB database page size.</p>
<p>One problem we have seen in this area was in some releases of Solaris
where the cp utility was implemented using the mmap system call rather
than the read system call.
Because the Solaris mmap system call did
not make the same guarantee of read atomicity as the read system call,
using the cp utility could create corrupted copies of the databases.
Another problem we have seen is implementations of the tar utility doing
10KB block reads by default, and even when an output block size was
specified to that utility, not reading from the underlying databases in
multiples of the block size. Using the dd utility instead of the cp or
tar utilities (and specifying an appropriate block size) fixes these
problems. If you plan to use a system utility to copy database files,
you may want to use a system call trace utility (for example, ktrace or
truss) to check for an I/O size smaller than or not a multiple of the
database page size and system calls other than read.</p>
<p>Third, it is necessary to consider the behavior of the system's
underlying stable storage hardware. For example, consider a SCSI
controller that has been configured to cache data and report to the
operating system that the data has been written to stable storage, when,
in fact, it has only been written into the controller RAM cache. If
power is lost before the controller is able to flush its cache to disk,
and the controller cache is not stable (that is, the writes will not be
flushed to disk when power returns), the writes will be lost. If the
writes include database blocks, there is no loss because recovery will
correctly update the database. If the writes include log file blocks,
it is possible that transactions that were already committed may not
appear in the recovered database, although the recovered database will
be coherent after a crash.</p>
<p>If the underlying hardware can fail in any way so that only part of the
block was written, the failure conditions are the same as those
described previously for an operating system failure that writes only
part of a logical database block.
In such cases, configuring the
database for checksums will ensure the corruption is detected.</p>
<p>For these reasons, it may be important to select hardware that does not
do partial writes and does not cache data writes (or does not report
that the data has been written to stable storage until it has either
been written to stable storage or the actual writing of all of the data
is guaranteed, barring catastrophic hardware failure -- that is, your
disk drive exploding).</p>
<p>If the disk drive on which you are storing your databases explodes, you
can perform normal Berkeley DB catastrophic recovery, because it requires only
a snapshot of your databases plus the log files you have archived since
those snapshots were taken. In this case, you should lose no database
changes at all.</p>
<p>If the disk drive on which you are storing your log files explodes, you
can also perform catastrophic recovery, but you will lose any database
changes made as part of transactions committed since your last archival
of the log files. Alternatively, if your database environment and
databases are still available after you lose the log file disk, you
should be able to dump your databases. However, you may see an
inconsistent snapshot of your data after doing the dump, because
changes that were part of transactions that were not yet committed
may appear in the database dump.
Depending on the value of the data,
a reasonable alternative may be to perform both the database dump and
the catastrophic recovery and then compare the databases created by
the two methods.</p>
<p>Regardless, for these reasons, storing your databases and log files on
different disks should be considered a safety measure as well as a
performance enhancement.</p>
<p>Finally, you should be aware that Berkeley DB does not protect against all
cases of stable storage hardware failure, nor does it protect against
simple hardware misbehavior (for example, a disk controller writing
incorrect data to the disk). However, configuring the database for
checksums will ensure that any such corruption is detected.</p>
<table width="100%"><tr><td><br></td><td align=right><a href="../transapp/filesys.html"><img src="../../images/prev.gif" alt="Prev"></a><a href="../toc.html"><img src="../../images/ref.gif" alt="Ref"></a><a href="../transapp/tune.html"><img src="../../images/next.gif" alt="Next"></a>
</td></tr></table>
<p><font size=1>Copyright (c) 1996,2008 Oracle. All rights reserved.</font>
</body>
</html>