1# $Id: Design,v 12.0 2004/11/17 03:44:06 bostic Exp $
2
3Synchronization in the Locking Subsystem
4
5This is a document that describes how we implemented fine-grain locking
6in the lock manager (that is, locking on a hash bucket level instead of
7locking the entire region).  We found that the increase in concurrency
8was not sufficient to warrant the increase in complexity or the additional
9cost of performing each lock operation.  Therefore, we don't use this
10any more.  Should we have to do fine-grain locking in a future release,
11this would be a reasonable starting point.
12
13=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
141. Data structures
15=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
16
17The lock manager maintains 3 different structures:
18
19Objects (__db_lockobj):
20	Describes an object that is locked.  When used with DB, this consists
21	of a __db_ilock (a file identifier and a page number).
22
23Lockers (__db_locker):
24	Identifies a specific locker ID and maintains the head of a list of
25	locks held by a locker (for using during transaction commit/abort).
26
27Locks (__db_lock):
28	Describes a particular object lock held on behalf of a particular
29	locker id.
30
31Objects and Lockers reference Locks.
32
33These structures are organized via two synchronized hash tables.  Each
34hash table consists of two physical arrays: the array of actual hash
35buckets and an array of mutexes so we can lock individual buckets, rather
36than the whole table.
37
38One hash table contains Objects and the other hash table contains Lockers.
39Objects contain two lists of locks, waiters and holders: holders currently
40hold a lock on the Object, waiters are lock waiting to be granted.
41Lockers are a single linked list that connects the Locks held on behalf
42of the specific locker ID.
43
44In the diagram below:
45
46Locker ID #1 holds a lock on Object #1 (L1) and Object #2 (L5), and is
47waiting on a lock on Object #1 (L3).
48
49Locker ID #2 holds a lock on Object #1 (L2) and is waiting on a lock for
50Object #2 (L7).
51
52Locker ID #3 is waiting for a lock on Object #2 (L6).
53
54	OBJECT                   -----------------------
55	HASH                     |                     |
56                             ----|-------------        |
57	________    _______  |   |   ________ |        |
58	|	|-->| O1  |--|---|-->|  O2  | |        |
59	|_______|   |_____|  |   |   |______| V        |
60	|	|    W  H--->L1->L2   W  H--->L5       |	holders
61	|_______|    |       |   |    |                V
62	|	|    ------->L3  \    ------->L6------>L7	waiters
63	|_______|           /     \            \
64	.	.          /       \            \
65	.	.          |        \            \
66	.	.          |         \            -----------
67	|_______|          |          --------------        |
68	|	|      ____|____                ___|_____  _|______
69	|_______|      |       |                |       |  |      |
70	|	|      | LID1  |                |  LID2 |  | LID3 |
71	|_______|      |_______|                |_______|  |______|
72			   ^                        ^        ^
73			   |                        |        |
74			___|________________________|________|___
75	       LOCKER	|    |    |    |    |    |    |    |    |
76	       HASH	|    |    |    |    |    |    |    |    |
77			|    |    |    |    |    |    |    |    |
78			|____|____|____|____|____|____|____|____|
79
80=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
812. Synchronization
82=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
83
84There are four types of mutexes in the subsystem.
85
86Object mutexes;
87	These map one-to-one to each bucket in the Object hash table.
88	Holding a mutex on an Object bucket secures all the Objects in
89	that bucket as well as the Lock structures linked from those
90	Objects.  All fields in the Locks EXCEPT the Locker links (the
91	links that attach Locks by Locker ID) are protected by these
92	mutexes.
93
94Locker mutexes:
95	These map one-to-one to each bucket in the Locker hash table.
96	Holding a mutex on a Locker bucket secures the Locker structures
97	and the Locker links in the Locks.
98
99Memory mutex:
100	This mutex allows calls to allocate/free memory, i.e. calls to
101	__db_shalloc and __db_shalloc_free, as well as manipulation of
102	the Object, Locker and Lock free lists.
103
104Region mutex:
105	This mutex is currently only used to protect the locker ids.
106	It may also be needed later to provide exclusive access to
107	the region for deadlock detection.
108
109Creating or removing a Lock requires locking both the Object lock and the
110Locker lock (and eventually the shalloc lock to return the item to the
111free list).
112
113The locking hierarchy is as follows:
114
115	The Region mutex may never be acquired after any other mutex.
116
117	The Object mutex may be acquired after the Region mutex.
118
119	The Locker mutex may be acquired after the Region and Object
120	mutexes.
121
122	The Memory mutex may be acquired after any mutex.
123
124So, if both and Object mutex and a Locker mutex are going to be acquired,
125the Object mutex must be acquired first.
126
127The Memory mutex may be acquired after any other mutex, but no other mutexes
128can be acquired once the Memory mutex is held.
129
130=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
1313. The algorithms:
132=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
133The locking subsystem supports four basic operations:
134	Get a Lock (lock_get)
135
136	Release a Lock (lock_put)
137
138	Release all the Locks on a specific Object (lock_vec)
139
140	Release all the Locks for a specific Locker (lock_vec)
141
142Get a lock:
143	Acquire Object bucket mutex.
144	Acquire Locker bucket mutex.
145
146	Acquire Memory mutex.
147	If the Object does not exist
148		Take an Object off the freelist.
149	If the Locker doesn't exist
150		Take a Locker off the freelist.
151	Take a Lock off the free list.
152	Release Memory mutex.
153
154	Add Lock to the Object list.
155	Add Lock to the Locker list.
156	Release Locker bucket mutex
157
158	If the lock cannot be granted
159		Release Object bucket mutex
160		Acquire lock mutex (blocks)
161
162		Acquire Object bucket mutex
163		If lock acquisition did not succeed (e.g, deadlock)
164			Acquire Locker bucket mutex
165			If locker should be destroyed
166				Remove locker from hash table
167				Acquire Memory mutex
168				Return locker to free list
169				Release Memory mutex
170			Release Locker bucket mutex
171
172			If object should be released
173				Acquire Memory mutex
174				Return object to free list
175				Release Memory mutex
176
177	Release Object bucket mutex
178
179Release a lock:
180	Acquire Object bucket mutex.
181		(Requires that we be able to find the Object hash bucket
182		without looking inside the Lock itself.)
183
184	If releasing a single lock and the user provided generation number
185	doesn't match the Lock's generation number, the Lock has been reused
186	and we return failure.
187
188	Enter lock_put_internal:
189		if the Lock is still on the Object's lists:
190			Increment Lock's generation number.
191			Remove Lock from the Object's list (NULL link fields).
192			Promote locks for the Object.
193
194		Enter locker_list_removal
195			Acquire Locker bucket mutex.
196			If Locker doesn't exist:
197				Release Locker bucket mutex
198				Release Object bucket mutex
199				Return error.
200			Else if Locker marked as deleted:
201				dont_release = TRUE
202			Else
203				Remove Lock from Locker list.
204				If Locker has no more locks
205					Remove Locker from table.
206					Acquire Memory mutex.
207					Return Locker to free list
208					Release Memory mutex
209			Release Locker bucket mutex.
210		Exit locker_list_removal
211
212		If (!dont_release)
213			Acquire Memory mutex
214			Return Lock to free list
215			Release Memory mutex
216
217	Exit lock_put_internal
218
219	Release Object bucket mutex
220
221Release all the Locks on a specific Object (lock_vec, DB_PUT_ALL_OBJ):
222
223	Acquire Object bucket mutex.
224
225	For each lock on the waiter list:
226		lock_put_internal
227	For each lock on the holder list:
228		lock_put_internal
229
230	Release Object bucket mutex.
231
232Release all the Locks for a specific Locker (lock_vec, DB_PUT_ALL):
233
234	Acquire Locker bucket mutex.
235	Mark Locker deleted.
236	Release Locker mutex.
237
238	For each lock on the Locker's list:
239		Remove from locker's list
240			(The lock could get put back on the free list in
241			lock_put and then could get reallocated and the
242			act of setting its locker links could clobber us.)
243		Perform "Release a Lock" above: skip locker_list_removal.
244
245	Acquire Locker bucket mutex.
246	Remove Locker
247	Release Locker mutex.
248
249	Acquire Memory mutex
250	Return Locker to free list
251	Release Memory mutex
252
253Deadlock detection (lock_detect):
254
255	For each bucket in Object table
256		Acquire the Object bucket mutex.
257		create waitsfor
258
259	For each bucket in Object table
260		Release the Object mutex.
261
262=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
263FAQ:
264=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
265Q: Why do you need generation numbers?
266A: If a lock has been released due to a transaction abort (potentially in a
267   different process), and then lock is released by a thread of control
268   unaware of the abort, the lock might have potentially been re-allocated
269   to a different object.  The generation numbers detect this problem.
270
271   Note, we assume that reads/writes of lock generation numbers are atomic,
272   if they are not, it is theoretically possible that a re-allocated lock
273   could be mistaken for another lock.
274
275Q: Why is is safe to walk the Locker list without holding any mutexes at
276   all?
277A: Locks are created with both the Object and Locker bucket mutexes held.
278   Once created, they removed in two ways:
279
280   a) when a specific Lock is released, in which case, the Object and
281   Locker bucket mutexes are again held, and
282
283   b) when all Locks for a specific Locker Id is released.
284
285   In case b), the Locker bucket mutex is held while the Locker chain is
286   marked as "destroyed", which blocks any further access to the Locker
287   chain.  Then, each individual Object bucket mutex is acquired when each
288   individual Lock is removed.
289
290Q: What are the implications of doing fine grain locking?
291
292A: Since we no longer globally lock the entire region, lock_vec will no
293   longer be atomic.  We still execute the items in a lock_vec in order,
294   so things like lock-coupling still work, but you can't make any
295   guarantees about atomicity.
296
297Q: How do I configure for FINE_GRAIN locking?
298
299A: We currently do not support any automatic configuration for FINE_GRAIN
300   locking.  When we do, will need to document that atomicity discussion
301   listed above (it is bug-report #553).
302