1# $Id: Design,v 12.0 2004/11/17 03:44:06 bostic Exp $ 2 3Synchronization in the Locking Subsystem 4 5This is a document that describes how we implemented fine-grain locking 6in the lock manager (that is, locking on a hash bucket level instead of 7locking the entire region). We found that the increase in concurrency 8was not sufficient to warrant the increase in complexity or the additional 9cost of performing each lock operation. Therefore, we don't use this 10any more. Should we have to do fine-grain locking in a future release, 11this would be a reasonable starting point. 12 13=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 141. Data structures 15=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 16 17The lock manager maintains 3 different structures: 18 19Objects (__db_lockobj): 20 Describes an object that is locked. When used with DB, this consists 21 of a __db_ilock (a file identifier and a page number). 22 23Lockers (__db_locker): 24 Identifies a specific locker ID and maintains the head of a list of 25 locks held by a locker (for using during transaction commit/abort). 26 27Locks (__db_lock): 28 Describes a particular object lock held on behalf of a particular 29 locker id. 30 31Objects and Lockers reference Locks. 32 33These structures are organized via two synchronized hash tables. Each 34hash table consists of two physical arrays: the array of actual hash 35buckets and an array of mutexes so we can lock individual buckets, rather 36than the whole table. 37 38One hash table contains Objects and the other hash table contains Lockers. 39Objects contain two lists of locks, waiters and holders: holders currently 40hold a lock on the Object, waiters are lock waiting to be granted. 41Lockers are a single linked list that connects the Locks held on behalf 42of the specific locker ID. 43 44In the diagram below: 45 46Locker ID #1 holds a lock on Object #1 (L1) and Object #2 (L5), and is 47waiting on a lock on Object #1 (L3). 48 49Locker ID #2 holds a lock on Object #1 (L2) and is waiting on a lock for 50Object #2 (L7). 51 52Locker ID #3 is waiting for a lock on Object #2 (L6). 53 54 OBJECT ----------------------- 55 HASH | | 56 ----|------------- | 57 ________ _______ | | ________ | | 58 | |-->| O1 |--|---|-->| O2 | | | 59 |_______| |_____| | | |______| V | 60 | | W H--->L1->L2 W H--->L5 | holders 61 |_______| | | | | V 62 | | ------->L3 \ ------->L6------>L7 waiters 63 |_______| / \ \ 64 . . / \ \ 65 . . | \ \ 66 . . | \ ----------- 67 |_______| | -------------- | 68 | | ____|____ ___|_____ _|______ 69 |_______| | | | | | | 70 | | | LID1 | | LID2 | | LID3 | 71 |_______| |_______| |_______| |______| 72 ^ ^ ^ 73 | | | 74 ___|________________________|________|___ 75 LOCKER | | | | | | | | | 76 HASH | | | | | | | | | 77 | | | | | | | | | 78 |____|____|____|____|____|____|____|____| 79 80=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 812. Synchronization 82=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 83 84There are four types of mutexes in the subsystem. 85 86Object mutexes; 87 These map one-to-one to each bucket in the Object hash table. 88 Holding a mutex on an Object bucket secures all the Objects in 89 that bucket as well as the Lock structures linked from those 90 Objects. All fields in the Locks EXCEPT the Locker links (the 91 links that attach Locks by Locker ID) are protected by these 92 mutexes. 93 94Locker mutexes: 95 These map one-to-one to each bucket in the Locker hash table. 96 Holding a mutex on a Locker bucket secures the Locker structures 97 and the Locker links in the Locks. 98 99Memory mutex: 100 This mutex allows calls to allocate/free memory, i.e. calls to 101 __db_shalloc and __db_shalloc_free, as well as manipulation of 102 the Object, Locker and Lock free lists. 103 104Region mutex: 105 This mutex is currently only used to protect the locker ids. 106 It may also be needed later to provide exclusive access to 107 the region for deadlock detection. 108 109Creating or removing a Lock requires locking both the Object lock and the 110Locker lock (and eventually the shalloc lock to return the item to the 111free list). 112 113The locking hierarchy is as follows: 114 115 The Region mutex may never be acquired after any other mutex. 116 117 The Object mutex may be acquired after the Region mutex. 118 119 The Locker mutex may be acquired after the Region and Object 120 mutexes. 121 122 The Memory mutex may be acquired after any mutex. 123 124So, if both and Object mutex and a Locker mutex are going to be acquired, 125the Object mutex must be acquired first. 126 127The Memory mutex may be acquired after any other mutex, but no other mutexes 128can be acquired once the Memory mutex is held. 129 130=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 1313. The algorithms: 132=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 133The locking subsystem supports four basic operations: 134 Get a Lock (lock_get) 135 136 Release a Lock (lock_put) 137 138 Release all the Locks on a specific Object (lock_vec) 139 140 Release all the Locks for a specific Locker (lock_vec) 141 142Get a lock: 143 Acquire Object bucket mutex. 144 Acquire Locker bucket mutex. 145 146 Acquire Memory mutex. 147 If the Object does not exist 148 Take an Object off the freelist. 149 If the Locker doesn't exist 150 Take a Locker off the freelist. 151 Take a Lock off the free list. 152 Release Memory mutex. 153 154 Add Lock to the Object list. 155 Add Lock to the Locker list. 156 Release Locker bucket mutex 157 158 If the lock cannot be granted 159 Release Object bucket mutex 160 Acquire lock mutex (blocks) 161 162 Acquire Object bucket mutex 163 If lock acquisition did not succeed (e.g, deadlock) 164 Acquire Locker bucket mutex 165 If locker should be destroyed 166 Remove locker from hash table 167 Acquire Memory mutex 168 Return locker to free list 169 Release Memory mutex 170 Release Locker bucket mutex 171 172 If object should be released 173 Acquire Memory mutex 174 Return object to free list 175 Release Memory mutex 176 177 Release Object bucket mutex 178 179Release a lock: 180 Acquire Object bucket mutex. 181 (Requires that we be able to find the Object hash bucket 182 without looking inside the Lock itself.) 183 184 If releasing a single lock and the user provided generation number 185 doesn't match the Lock's generation number, the Lock has been reused 186 and we return failure. 187 188 Enter lock_put_internal: 189 if the Lock is still on the Object's lists: 190 Increment Lock's generation number. 191 Remove Lock from the Object's list (NULL link fields). 192 Promote locks for the Object. 193 194 Enter locker_list_removal 195 Acquire Locker bucket mutex. 196 If Locker doesn't exist: 197 Release Locker bucket mutex 198 Release Object bucket mutex 199 Return error. 200 Else if Locker marked as deleted: 201 dont_release = TRUE 202 Else 203 Remove Lock from Locker list. 204 If Locker has no more locks 205 Remove Locker from table. 206 Acquire Memory mutex. 207 Return Locker to free list 208 Release Memory mutex 209 Release Locker bucket mutex. 210 Exit locker_list_removal 211 212 If (!dont_release) 213 Acquire Memory mutex 214 Return Lock to free list 215 Release Memory mutex 216 217 Exit lock_put_internal 218 219 Release Object bucket mutex 220 221Release all the Locks on a specific Object (lock_vec, DB_PUT_ALL_OBJ): 222 223 Acquire Object bucket mutex. 224 225 For each lock on the waiter list: 226 lock_put_internal 227 For each lock on the holder list: 228 lock_put_internal 229 230 Release Object bucket mutex. 231 232Release all the Locks for a specific Locker (lock_vec, DB_PUT_ALL): 233 234 Acquire Locker bucket mutex. 235 Mark Locker deleted. 236 Release Locker mutex. 237 238 For each lock on the Locker's list: 239 Remove from locker's list 240 (The lock could get put back on the free list in 241 lock_put and then could get reallocated and the 242 act of setting its locker links could clobber us.) 243 Perform "Release a Lock" above: skip locker_list_removal. 244 245 Acquire Locker bucket mutex. 246 Remove Locker 247 Release Locker mutex. 248 249 Acquire Memory mutex 250 Return Locker to free list 251 Release Memory mutex 252 253Deadlock detection (lock_detect): 254 255 For each bucket in Object table 256 Acquire the Object bucket mutex. 257 create waitsfor 258 259 For each bucket in Object table 260 Release the Object mutex. 261 262=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 263FAQ: 264=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 265Q: Why do you need generation numbers? 266A: If a lock has been released due to a transaction abort (potentially in a 267 different process), and then lock is released by a thread of control 268 unaware of the abort, the lock might have potentially been re-allocated 269 to a different object. The generation numbers detect this problem. 270 271 Note, we assume that reads/writes of lock generation numbers are atomic, 272 if they are not, it is theoretically possible that a re-allocated lock 273 could be mistaken for another lock. 274 275Q: Why is is safe to walk the Locker list without holding any mutexes at 276 all? 277A: Locks are created with both the Object and Locker bucket mutexes held. 278 Once created, they removed in two ways: 279 280 a) when a specific Lock is released, in which case, the Object and 281 Locker bucket mutexes are again held, and 282 283 b) when all Locks for a specific Locker Id is released. 284 285 In case b), the Locker bucket mutex is held while the Locker chain is 286 marked as "destroyed", which blocks any further access to the Locker 287 chain. Then, each individual Object bucket mutex is acquired when each 288 individual Lock is removed. 289 290Q: What are the implications of doing fine grain locking? 291 292A: Since we no longer globally lock the entire region, lock_vec will no 293 longer be atomic. We still execute the items in a lock_vec in order, 294 so things like lock-coupling still work, but you can't make any 295 guarantees about atomicity. 296 297Q: How do I configure for FINE_GRAIN locking? 298 299A: We currently do not support any automatic configuration for FINE_GRAIN 300 locking. When we do, will need to document that atomicity discussion 301 listed above (it is bug-report #553). 302