.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)

===============
VM_BIND locking
===============

This document attempts to describe what's needed to get VM_BIND locking right,
including the userptr mmu_notifier locking. It also discusses some
optimizations that get rid of the looping through all userptr mappings and
external / shared object mappings that is needed in the simplest
implementation. In addition, there is a section describing the VM_BIND locking
required for implementing recoverable pagefaults.

The DRM GPUVM set of helpers
============================

There is a set of helpers for drivers implementing VM_BIND, and this
set of helpers implements much, but not all, of the locking described
in this document. In particular, it is currently lacking a userptr
implementation. This document does not intend to describe the DRM GPUVM
implementation in detail, but it is covered in :ref:`its own
documentation <drm_gpuvm>`. It is highly recommended for any driver
implementing VM_BIND to use the DRM GPUVM helpers and to extend them if
common functionality is missing.
Nomenclature
============

* ``gpu_vm``: Abstraction of a virtual GPU address space with
  meta-data. Typically one per client (DRM file-private), or one per
  execution context.
* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with
  associated meta-data. The backing storage of a gpu_vma can either be
  a GEM object or anonymous / page-cache pages that are also mapped into
  the CPU address space of the process.
* ``gpu_vm_bo``: Abstracts the association of a GEM object and
  a VM. The GEM object maintains a list of gpu_vm_bos, where each gpu_vm_bo
  maintains a list of gpu_vmas.
* ``userptr gpu_vma or just userptr``: A gpu_vma whose backing store
  is anonymous / page-cache pages as described above.
* ``revalidating``: Revalidating a gpu_vma means making the latest version
  of the backing store resident and making sure the gpu_vma's
  page-table entries point to that backing store.
* ``dma_fence``: A struct dma_fence that is similar to a struct completion
  and which tracks GPU activity. When the GPU activity is finished,
  the dma_fence signals. Please refer to the ``DMA Fences`` section of
  the :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``dma_resv``: A struct dma_resv (a.k.a. reservation object) that is used
  to track GPU activity in the form of multiple dma_fences on a
  gpu_vm or a GEM object. The dma_resv contains an array / list
  of dma_fences and a lock that needs to be held when adding
  additional dma_fences to the dma_resv. The lock is of a type that
  allows deadlock-safe locking of multiple dma_resvs in arbitrary
  order. Please refer to the ``Reservation Objects`` section of the
  :doc:`dma-buf doc </driver-api/dma-buf>`.
* ``exec function``: An exec function is a function that revalidates all
  affected gpu_vmas, submits a GPU command batch and registers the
  dma_fence representing the GPU command's activity with all affected
  dma_resvs. For completeness, although not covered by this document,
  it's worth mentioning that an exec function may also be the
  revalidation worker that is used by some drivers in compute /
  long-running mode.
* ``local object``: A GEM object which is only mapped within a
  single VM. Local GEM objects share the gpu_vm's dma_resv.
* ``external object``: a.k.a. shared object: A GEM object which may be shared
  by multiple gpu_vms and whose backing storage may be shared with
  other drivers.

Locks and locking order
=======================

One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's
dma_resv object and hence the dma_resv lock. So, even with a huge
number of local GEM objects, only one lock is needed to make the exec
sequence atomic.

The following locks and locking orders are used; a rough sketch of where
these locks might live in a driver's data structures follows the list:

* The ``gpu_vm->lock`` (optionally an rwsem). Protects the gpu_vm's
  data structure keeping track of gpu_vmas. It can also protect the
  gpu_vm's list of userptr gpu_vmas. With a CPU mm analogy this would
  correspond to the mmap_lock. An rwsem allows several readers to walk
  the VM tree concurrently, but the benefit of that concurrency most
  likely varies from driver to driver.
* The ``userptr_seqlock``. This lock is taken in read mode for each
  userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu
  notifier invalidation. This is not a real seqlock but is described in
  ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side
  'lock' a lot like a seqcount. However this allows multiple
  write-sides to hold it at once...". The read side critical section
  is enclosed by ``mmu_interval_read_begin() /
  mmu_interval_read_retry()`` with ``mmu_interval_read_begin()``
  sleeping if the write side is held.
  The write side is held by the core mm while calling mmu interval
  invalidation notifiers.
* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing
  rebinding, as well as the residency state of all the gpu_vm's local
  GEM objects.
  Furthermore, it typically protects the gpu_vm's list of evicted and
  external GEM objects.
* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is
  taken in read mode during exec and in write mode during an mmu notifier
  invalidation. The userptr notifier lock is per gpu_vm.
* The ``gem_object->gpuva_lock``. This lock protects the GEM object's
  list of gpu_vm_bos. This is usually the same lock as the GEM
  object's dma_resv, but some drivers protect this list differently;
  see below.
* The ``gpu_vm list spinlocks``. Some implementations need these spinlocks
  to be able to update the gpu_vm's evicted- and external-object
  lists. For those implementations, the spinlocks are grabbed when the
  lists are manipulated. However, to avoid locking order violations
  with the dma_resv locks, a special scheme is needed when iterating
  over the lists.
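
To make the above more concrete, the sketch below shows one possible way
the objects and locks described in the two lists above could be laid out.
All struct and member names here are made up for illustration; actual
drivers and the DRM GPUVM helpers use their own naming and layout.

.. code-block:: C

   /* Illustrative sketch only; names and members are hypothetical. */
   struct gpu_vm {
           struct rw_semaphore lock;               /* "gpu_vm->lock" */
           struct dma_resv *resv;                  /* "gpu_vm->resv" */
           struct rw_semaphore userptr_notifier_lock;
           spinlock_t list_lock;                   /* optional list spinlock */
           struct list_head evict_list;            /* evicted gpu_vm_bos */
           struct list_head extobj_list;           /* external gpu_vm_bos */
           struct list_head userptr_list;          /* userptr gpu_vmas */
           struct list_head rebind_list;           /* gpu_vmas needing rebind */
   };

   struct gpu_vm_bo {
           struct gpu_vm *vm;                      /* refcounted pointer */
           struct drm_gem_object *obj;             /* refcounted pointer */
           struct list_head obj_link;              /* on obj's gpu_vm_bo list */
           struct list_head evict_link;            /* on the vm's evict list */
           struct list_head extobj_link;           /* on the vm's extobj list */
           struct list_head vma_list;              /* this bo+vm's gpu_vmas */
           bool evicted;                           /* protected by obj->resv */
   };

   struct gpu_vma {
           struct gpu_vm_bo *vm_bo;                /* refcounted pointer */
           struct list_head vm_bo_link;            /* on vm_bo's vma_list */
           u64 gpu_va_start, gpu_va_range;
   };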

.. _gpu_vma lifetime:

Protection and lifetime of gpu_vm_bos and gpu_vmas
==================================================

The GEM object's list of gpu_vm_bos and the gpu_vm_bo's list of gpu_vmas
are protected by the ``gem_object->gpuva_lock``, which is typically the
same as the GEM object's dma_resv, but if the driver
needs to access these lists from within a dma_fence signalling
critical section, it can instead choose to protect them with a
separate lock, which can be locked from within the dma_fence signalling
critical section. Such drivers then need to pay additional attention
to what locks need to be taken from within the loop when iterating
over the gpu_vm_bo and gpu_vma lists to avoid locking-order violations.

The DRM GPUVM set of helpers provides lockdep asserts that this lock is
held in relevant situations and also provides a means of making itself
aware of which lock is actually used: :c:func:`drm_gem_gpuva_set_lock`.
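
As an example, a driver that protects these lists with a driver-specific
mutex instead of the GEM object's dma_resv might register that lock
roughly as in the sketch below. The wrapper struct and the
``gpuva_list_lock`` member are hypothetical names; only
:c:func:`drm_gem_gpuva_set_lock` is taken from the DRM GPUVM helpers.

.. code-block:: C

   /* Sketch: tell DRM GPUVM / lockdep which lock protects the GEM
    * object's list of gpu_vm_bos when it is not the object's dma_resv.
    */
   struct my_gem_object {
           struct drm_gem_object base;
           struct mutex gpuva_list_lock;    /* hypothetical driver lock */
   };

   static void my_gem_object_init_gpuva_lock(struct my_gem_object *obj)
   {
           mutex_init(&obj->gpuva_list_lock);
           drm_gem_gpuva_set_lock(&obj->base, &obj->gpuva_list_lock);
   }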

Each gpu_vm_bo holds a reference counted pointer to the underlying GEM
object, and each gpu_vma holds a reference counted pointer to the
gpu_vm_bo. When iterating over the GEM object's list of gpu_vm_bos and
over the gpu_vm_bo's list of gpu_vmas, the ``gem_object->gpuva_lock`` must
not be dropped, otherwise gpu_vmas attached to a gpu_vm_bo may
disappear without notice since those are not reference-counted. A
driver may implement its own scheme to allow this at the expense of
additional complexity, but this is outside the scope of this document.

In the DRM GPUVM implementation, each gpu_vm_bo and each gpu_vma
holds a reference count on the gpu_vm itself. Due to this, and to avoid circular
reference counting, cleanup of the gpu_vm's gpu_vmas must not be done from the
gpu_vm's destructor. Drivers typically implement a gpu_vm close
function for this cleanup. The gpu_vm close function will abort gpu
execution using this VM, unmap all gpu_vmas and release page-table memory.
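
In sketch form, such a close function could look like the following. All
helpers called here are hypothetical driver functions, and locking of
external objects' gpuva_locks is left out; the point is the ordering:
first stop GPU execution in the VM, then unmap and unlink the gpu_vmas
(dropping their references), and finally release the page-table memory.

.. code-block:: C

   /* Sketch of a gpu_vm close function; helper names are made up. */
   void gpu_vm_close(struct gpu_vm *gpu_vm)
   {
           /* Stop and wait for GPU jobs executing in this VM. */
           abort_gpu_execution(gpu_vm);

           down_write(&gpu_vm->lock);
           dma_resv_lock(gpu_vm->resv);
           /* Unmapping and unlinking drops the gpu_vmas' references on
            * the gpu_vm_bos and, indirectly, on the gpu_vm itself.
            */
           unmap_and_unlink_all_gpu_vmas(gpu_vm);
           release_page_table_memory(gpu_vm);
           dma_resv_unlock(gpu_vm->resv);
           up_write(&gpu_vm->lock);
   }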

Revalidation and eviction of local objects
==========================================

Note that in all the code examples given below we use simplified
pseudo-code. In particular, the dma_resv deadlock avoidance algorithm
as well as reserving memory for dma_resv fences is left out.

Revalidation
____________

With VM_BIND, all local objects need to be resident when the gpu is
executing using the gpu_vm, and the objects need to have valid
gpu_vmas set up pointing to them. Typically, each gpu command buffer
submission is therefore preceded by a re-validation section:

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // Validation section starts here.
   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           // The following list iteration needs the GEM object's
           // dma_resv to be held (it protects the gpu_vm_bo's list of
           // gpu_vmas, but since local GEM objects share the gpu_vm's
           // dma_resv, it is already held at this point).
           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }
   // Validation section ends here, and job submission starts.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   dma_resv_unlock(gpu_vm->resv);

The reason for having a separate gpu_vm rebind list is that there
might be userptr gpu_vmas that are not mapping a buffer object, but
that also need rebinding.

Eviction
________

Eviction of one of these local objects will then look similar to the
following:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);

Note that since the object is local to the gpu_vm, it will share the gpu_vm's
dma_resv lock such that ``obj->resv == gpu_vm->resv``.
The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list,
which is protected by ``gpu_vm->resv``. During eviction all local
objects have their dma_resv locked and, due to the above equality, the
gpu_vm's dma_resv protecting the gpu_vm's evict list is locked as well.
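
Setting up that sharing typically happens when the local object is
created. The sketch below shows one way this could be done, assuming the
hypothetical ``my_gem_object`` wrapper from earlier; the key point is that
``obj->resv`` is pointed at the gpu_vm's dma_resv before the GEM object is
initialized, which :c:func:`drm_gem_private_object_init` then leaves in
place.

.. code-block:: C

   /* Sketch: create a VM-local GEM object sharing the gpu_vm's dma_resv. */
   struct drm_gem_object *my_local_object_create(struct drm_device *drm,
                                                 struct gpu_vm *gpu_vm,
                                                 size_t size)
   {
           struct my_gem_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

           if (!obj)
                   return ERR_PTR(-ENOMEM);

           /* Share the gpu_vm's dma_resv; drm_gem_private_object_init()
            * only falls back to the object's embedded _resv when
            * obj->resv has not been set up front.
            */
           obj->base.resv = gpu_vm->resv;
           drm_gem_private_object_init(drm, &obj->base, size);

           return &obj->base;
   }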

With VM_BIND, gpu_vmas don't need to be unbound before eviction,
since the driver must ensure that the eviction blit or copy will wait
for GPU idle or depend on all previous GPU activity. Furthermore, any
subsequent attempt by the GPU to access freed memory through the
gpu_vma will be preceded by a new exec function, with a revalidation
section which will make sure all gpu_vmas are rebound. Since the
eviction code holds the object's dma_resv, and revalidation takes the
same lock, a new exec function cannot race with the eviction.

A driver can be implemented in such a way that, on each exec function,
only a subset of vmas are selected for rebind. In this case, all vmas that are
*not* selected for rebind must be unbound before the exec
function workload is submitted.

Locking with external buffer objects
====================================

Since external buffer objects may be shared by multiple gpu_vms they
can't share their reservation object with a single gpu_vm. Instead
they need to have a reservation object of their own. The external
objects bound to a gpu_vm using one or many gpu_vmas are therefore put on a
per-gpu_vm list which is protected by the gpu_vm's dma_resv lock or
one of the :ref:`gpu_vm list spinlocks <Spinlock iteration>`. Once
the gpu_vm's reservation object is locked, it is safe to traverse the
external object list and lock the dma_resvs of all external
objects. However, if a list spinlock is used instead, a more elaborate
iteration scheme is needed.

At eviction time, the gpu_vm_bos of *all* the gpu_vms an external
object is bound to need to be put on their gpu_vm's evict list.
However, when evicting an external object, the dma_resvs of the
gpu_vms the object is bound to are typically not held. Only
the object's private dma_resv can be guaranteed to be held. If there
is a ww_acquire context at hand at eviction time we could grab those
dma_resvs but that could cause expensive ww_mutex rollbacks. A simple
option is to just mark the gpu_vm_bos of the evicted gem object with
an ``evicted`` bool that is inspected before the next time the
corresponding gpu_vm evicted list needs to be traversed, for example when
traversing the list of external objects and locking them. At that time,
both the gpu_vm's dma_resv and the object's dma_resv are held, and a
gpu_vm_bo marked as evicted can then be added to the gpu_vm's list of
evicted gpu_vm_bos. The ``evicted`` bool is formally protected by the
object's dma_resv.

The exec function then becomes:

.. code-block:: C

   dma_resv_lock(gpu_vm->resv);

   // External object list is protected by the gpu_vm->resv lock.
   for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) {
           dma_resv_lock(gpu_vm_bo.gem_obj->resv);
           if (gpu_vm_bo_marked_evicted(&gpu_vm_bo))
                   add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
   }

   for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) {
           validate_gem_bo(&gpu_vm_bo->gem_bo);

           for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma)
                  move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list);
   }

   for_each_gpu_vma_on_rebind_list(&gpu_vm->rebind_list, &gpu_vma) {
           rebind_gpu_vma(&gpu_vma);
           remove_gpu_vma_from_rebind_list(&gpu_vma);
   }

   add_dependencies(&gpu_job, &gpu_vm->resv);
   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);
   for_each_external_obj(gpu_vm, &obj)
          add_dma_fence(job_dma_fence, &obj->resv);
   dma_resv_unlock_all_resv_locks();

And the corresponding shared-object aware eviction would look like:

.. code-block:: C

   obj = get_object_from_lru();

   dma_resv_lock(obj->resv);
   for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo)
           if (object_is_vm_local(obj))
                add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list);
           else
                mark_gpu_vm_bo_evicted(&gpu_vm_bo);

   add_dependencies(&eviction_job, &obj->resv);
   job_dma_fence = gpu_submit(&eviction_job);
   add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock(obj->resv);
   put_object(obj);
.. _Spinlock iteration:

Accessing the gpu_vm's lists without the dma_resv lock held
===========================================================

Some drivers will hold the gpu_vm's dma_resv lock when accessing the
gpu_vm's evict list and external objects lists. However, there are
drivers that need to access these lists without the dma_resv lock
held, for example due to asynchronous state updates from within the
dma_fence signalling critical path. In such cases, a spinlock can be
used to protect manipulation of the lists. However, since higher level
sleeping locks need to be taken for each list item while iterating
over the lists, the items already iterated over need to be
temporarily moved to a private list and the spinlock released
while processing each item:

.. code-block:: C

    struct list_head still_in_list;

    INIT_LIST_HEAD(&still_in_list);

    spin_lock(&gpu_vm->list_lock);
    do {
            struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head);

            if (!entry)
                    break;

            // Park the item on the private list and take a reference so
            // that it can't disappear while the spinlock is dropped.
            list_move_tail(&entry->head, &still_in_list);
            list_entry_get_unless_zero(entry);
            spin_unlock(&gpu_vm->list_lock);

            // Sleeping locks (for example the item's dma_resv) may be
            // taken here, since no spinlock is held.
            process(entry);

            spin_lock(&gpu_vm->list_lock);
            list_entry_put(entry);
    } while (true);

    // Splice the already-processed items back onto the gpu_vm's list.
    list_splice_tail(&still_in_list, &gpu_vm->list);
    spin_unlock(&gpu_vm->list_lock);

Due to the additional locking and atomic operations, drivers that *can*
avoid accessing the gpu_vm's lists outside of the dma_resv lock
might want to avoid this iteration scheme as well, particularly if the
driver anticipates a large number of list items. For lists where the
anticipated number of list items is small, where list iteration doesn't
happen very often or where there is a significant additional cost
associated with each iteration, the atomic operation overhead
associated with this type of iteration is, most likely, negligible. Note that
if this scheme is used, it is necessary to make sure this list
iteration is protected by an outer level lock or semaphore, since list
items are temporarily pulled off the list while iterating. It is
also worth mentioning that the local list ``still_in_list`` should
be considered protected by the ``gpu_vm->list_lock``, and it is
thus possible that items are removed from the local list
concurrently with list iteration.

Please refer to the :ref:`DRM GPUVM locking section
<drm_gpuvm_locking>` and its internal
:c:func:`get_next_vm_bo_from_list` function.


userptr gpu_vmas
================

A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a
GPU virtual address range, directly maps a CPU mm range of anonymous-
or file page-cache pages.
A very simple approach would be to just pin the pages using
pin_user_pages() at bind time and unpin them at unbind time, but this
creates a Denial-Of-Service vector since a single user-space process
would be able to pin down all of system memory, which is not
desirable. (For special use-cases and assuming proper accounting, pinning might
still be a desirable feature, though.) What we need to do in the
general case is to obtain a reference to the desired pages, make sure
we are notified using an MMU notifier just before the CPU mm unmaps the
pages, dirty them if they are not mapped read-only to the GPU, and
then drop the reference.
When we are notified by the MMU notifier that the CPU mm is about to drop the
pages, we need to stop GPU access to the pages by waiting for VM idle
in the MMU notifier and make sure that before the next time the GPU
tries to access whatever is now present in the CPU mm range, we unmap
the old pages from the GPU page tables and repeat the process of
obtaining new page references. (See the :ref:`notifier example
<Invalidation example>` below.) Note that when the core mm decides to
launder pages, we get such an unmap MMU notification and can mark the
pages dirty again before the next GPU access. We also get similar MMU
notifications for NUMA accounting which the GPU driver doesn't really
need to care about, but so far it has proven difficult to exclude
certain notifications.

Using an MMU notifier for device DMA (and other methods) is described in
:ref:`the pin_user_pages() documentation <mmu-notifier-registration-case>`.
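
As an illustration of the scheme described above, obtaining and dropping
the page references could look like the sketch below. The gpu_vma members
and the choice of GUP flavor are assumptions for the example; real
drivers differ in the flags used and in the error handling.

.. code-block:: C

   /* Sketch: grab references to the pages backing a userptr gpu_vma.
    * Must not be called with any dma_resv lock held, since GUP may
    * take the mmap_lock.
    */
   int obtain_new_page_pointers(struct gpu_vma *gpu_vma)
   {
           int ret;

           ret = get_user_pages_fast(gpu_vma->userptr_start,
                                     gpu_vma->num_pages,
                                     gpu_vma->gpu_writes ? FOLL_WRITE : 0,
                                     gpu_vma->pages);
           if (ret < 0)
                   return ret;
           if (ret != gpu_vma->num_pages) {
                   /* Partial success: drop what we got and fail. */
                   while (ret--)
                           put_page(gpu_vma->pages[ret]);
                   return -EFAULT;
           }
           return 0;
   }

   /* Sketch: called once GPU access has been stopped, from the mmu
    * notifier or at unbind time. Dirty the pages the GPU may have
    * written, then drop the references.
    */
   void release_page_pointers(struct gpu_vma *gpu_vma)
   {
           unsigned long i;

           for (i = 0; i < gpu_vma->num_pages; i++) {
                   if (gpu_vma->gpu_writes)
                           set_page_dirty_lock(gpu_vma->pages[i]);
                   put_page(gpu_vma->pages[i]);
           }
   }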

Now, the method of obtaining struct page references using
get_user_pages() unfortunately can't be used under a dma_resv lock
since that would violate the locking order of the dma_resv lock vs the
mmap_lock that is grabbed when resolving a CPU pagefault. This means
the gpu_vm's list of userptr gpu_vmas needs to be protected by an
outer lock, which in our example below is the ``gpu_vm->lock``.

The MMU interval seqlock for a userptr gpu_vma is used in the following
way:

.. code-block:: C

   // Exclusive locking mode here is strictly needed only if there are
   // invalidated userptr gpu_vmas present, to avoid concurrent userptr
   // revalidations of the same userptr gpu_vma.
   down_write(&gpu_vm->lock);
   retry:

   // Note: mmu_interval_read_begin() blocks until there is no
   // invalidation notifier running anymore.
   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
   if (seq != gpu_vma->saved_seq) {
           obtain_new_page_pointers(&gpu_vma);
           dma_resv_lock(&gpu_vm->resv);
           add_gpu_vma_to_revalidate_list(&gpu_vma, &gpu_vm);
           dma_resv_unlock(&gpu_vm->resv);
           gpu_vma->saved_seq = seq;
   }

   // The usual revalidation goes here.

   // Final userptr sequence validation may not happen before the
   // submission dma_fence is added to the gpu_vm's resv, from the POV
   // of the MMU invalidation notifier. Hence the
   // userptr_notifier_lock that will make them appear atomic.

   add_dependencies(&gpu_job, &gpu_vm->resv);
   down_read(&gpu_vm->userptr_notifier_lock);
   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) {
          up_read(&gpu_vm->userptr_notifier_lock);
          goto retry;
   }

   job_dma_fence = gpu_submit(&gpu_job);

   add_dma_fence(job_dma_fence, &gpu_vm->resv);

   for_each_external_obj(gpu_vm, &obj)
          add_dma_fence(job_dma_fence, &obj->resv);

   dma_resv_unlock_all_resv_locks();
   up_read(&gpu_vm->userptr_notifier_lock);
   up_write(&gpu_vm->lock);

The code between ``mmu_interval_read_begin()`` and the
``mmu_interval_read_retry()`` marks the read side critical section of
what we call the ``userptr_seqlock``. In reality, the gpu_vm's userptr
gpu_vma list is looped through, and the check is done for *all* of its
userptr gpu_vmas, although we only show a single one here.
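
In sketch form, and reusing the names from the example above, that loop
could look as follows; the ``for_each_userptr_gpu_vma()`` iterator is
hypothetical:

.. code-block:: C

   // Revalidation part, for all userptr gpu_vmas on the gpu_vm's
   // userptr list:
   for_each_userptr_gpu_vma(&gpu_vm->userptr_list, &gpu_vma) {
           seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
           if (seq != gpu_vma->saved_seq) {
                   obtain_new_page_pointers(&gpu_vma);
                   dma_resv_lock(&gpu_vm->resv);
                   add_gpu_vma_to_revalidate_list(&gpu_vma, &gpu_vm);
                   dma_resv_unlock(&gpu_vm->resv);
                   gpu_vma->saved_seq = seq;
           }
   }

   // Final check under the userptr_notifier_lock, again for all
   // userptr gpu_vmas:
   down_read(&gpu_vm->userptr_notifier_lock);
   for_each_userptr_gpu_vma(&gpu_vm->userptr_list, &gpu_vma) {
           if (mmu_interval_read_retry(&gpu_vma->userptr_interval,
                                       gpu_vma->saved_seq)) {
                   up_read(&gpu_vm->userptr_notifier_lock);
                   goto retry;
           }
   }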

The userptr gpu_vma MMU invalidation notifier might be called from
reclaim context and, again, to avoid locking order violations, we can't
take any dma_resv lock nor the gpu_vm->lock from within it.

.. _Invalidation example:
.. code-block:: C

  bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq)
  {
          // Make sure the exec function either sees the new sequence
          // and backs off or we wait for the dma-fence:

          down_write(&gpu_vm->userptr_notifier_lock);
          mmu_interval_set_seq(userptr_interval, cur_seq);
          up_write(&gpu_vm->userptr_notifier_lock);

          // At this point, the exec function can't succeed in
          // submitting a new job, because cur_seq is an invalid
          // sequence number and will always cause a retry. When all
          // invalidation callbacks have finished, the mmu notifier core
          // will flip the sequence number to a valid one. However we
          // need to stop gpu access to the old pages here.

          dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP,
                                false, MAX_SCHEDULE_TIMEOUT);
          return true;
  }

When this invalidation notifier returns, the GPU can no longer be
accessing the old pages of the userptr gpu_vma and needs to redo the
page-binding before a new GPU submission can succeed.
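
Hooking this notifier up to a userptr gpu_vma typically happens at bind
time using the core mm interval notifier API. The sketch below shows one
way to do that; the wrapper function is hypothetical, and note that the
real ``invalidate`` callback also receives a ``const struct
mmu_notifier_range *`` argument that the simplified example above leaves
out.

.. code-block:: C

   static const struct mmu_interval_notifier_ops gpu_vma_userptr_notifier_ops = {
           .invalidate = gpu_vma_userptr_invalidate,
   };

   /* Sketch: register the interval notifier covering the CPU virtual
    * address range backing the userptr gpu_vma, at bind time.
    */
   int gpu_vma_userptr_register(struct gpu_vma *gpu_vma,
                                unsigned long cpu_addr, unsigned long range)
   {
           return mmu_interval_notifier_insert(&gpu_vma->userptr_interval,
                                               current->mm, cpu_addr, range,
                                               &gpu_vma_userptr_notifier_ops);
   }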

Efficient userptr gpu_vma exec_function iteration
_________________________________________________

If the gpu_vm's list of userptr gpu_vmas becomes large, it's
inefficient to iterate through the complete lists of userptrs on each
exec function to check whether each userptr gpu_vma's saved
sequence number is stale. A solution to this is to put all
*invalidated* userptr gpu_vmas on a separate gpu_vm list and to
only check the gpu_vmas present on this list on each exec
function. This list will then lend itself very well to the spinlock
locking scheme that is
:ref:`described in the spinlock iteration section <Spinlock iteration>`, since
in the mmu notifier, where we add the invalidated gpu_vmas to the
list, it's not possible to take any outer locks like the
``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the
``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is
complete, as also mentioned in that section.

If using an invalidated userptr list like this, the retry check in the
exec function trivially becomes a check for the invalidated list being empty.
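
With such a list, and reusing the exec function sketch from earlier, the
check could simply become (the list name is hypothetical):

.. code-block:: C

   add_dependencies(&gpu_job, &gpu_vm->resv);
   down_read(&gpu_vm->userptr_notifier_lock);
   if (!list_empty(&gpu_vm->invalidated_userptr_list)) {
           up_read(&gpu_vm->userptr_notifier_lock);
           goto retry;
   }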

Locking at bind and unbind time
===============================

At bind time, assuming a GEM object backed gpu_vma, each
gpu_vma needs to be associated with a gpu_vm_bo and that
gpu_vm_bo in turn needs to be added to the GEM object's
gpu_vm_bo list, and possibly to the gpu_vm's external object
list. This is referred to as *linking* the gpu_vma, and typically
requires that the ``gpu_vm->lock`` and the ``gem_object->gpuva_lock``
are held. When unlinking a gpu_vma the same locks should be held,
and that ensures that when iterating over ``gpu_vmas``, either under
the ``gpu_vm->resv`` or the GEM object's dma_resv, the gpu_vmas
stay alive as long as the lock under which we iterate is not released. For
userptr gpu_vmas it's similarly required that during vma destroy, the
outer ``gpu_vm->lock`` is held, since otherwise when iterating over
the invalidated userptr list as described in the previous section,
there is nothing keeping those userptr gpu_vmas alive.
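
In sketch form, linking a GEM object backed gpu_vma at bind time could
look like the following; all helpers are hypothetical and error handling
is left out:

.. code-block:: C

   void gpu_vma_link(struct gpu_vma *gpu_vma, struct gpu_vm *gpu_vm,
                     struct drm_gem_object *obj)
   {
           struct gpu_vm_bo *vm_bo;

           down_write(&gpu_vm->lock);
           gem_gpuva_lock(obj);                   /* gem_object->gpuva_lock */

           vm_bo = gpu_vm_bo_obtain(gpu_vm, obj); /* finds or creates, refs obj */
           list_add(&gpu_vma->vm_bo_link, &vm_bo->vma_list);
           if (!object_is_vm_local(obj))
                   add_gpu_vm_bo_to_extobj_list(gpu_vm, vm_bo);

           gem_gpuva_unlock(obj);
           up_write(&gpu_vm->lock);
   }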

Locking for recoverable page-fault page-table updates
=====================================================

There are two important things we need to ensure with locking for
recoverable page-faults:

* At the time we return pages back to the system / allocator for
  reuse, there should be no remaining GPU mappings and any GPU TLB
  must have been flushed.
* The unmapping and mapping of a gpu_vma must not race.

Since the unmapping (or zapping) of GPU ptes typically takes place
where it is hard or even impossible to take any outer level locks, we
must either introduce a new lock that is held at both mapping and
unmapping time, or look at the locks we do hold at unmapping time and
make sure that they are also held at mapping time. For userptr
gpu_vmas, the ``userptr_seqlock`` is held in write mode in the mmu
invalidation notifier where zapping happens. Hence, if the
``userptr_seqlock`` as well as the ``gpu_vm->userptr_notifier_lock``
is held in read mode during mapping, it will not race with the
zapping. For GEM object backed gpu_vmas, zapping will take place under
the GEM object's dma_resv, and ensuring that the dma_resv is held also
when populating the page-tables for any gpu_vma pointing to the GEM
object will similarly ensure we are race-free.
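
In sketch form, the userptr case could look like the following; the
page-table helpers are hypothetical:

.. code-block:: C

   // Mapping side, for example in a page-fault handler.
   retry:
   seq = mmu_interval_read_begin(&gpu_vma->userptr_interval);
   obtain_new_page_pointers(&gpu_vma);

   down_read(&gpu_vm->userptr_notifier_lock);
   if (mmu_interval_read_retry(&gpu_vma->userptr_interval, seq)) {
           up_read(&gpu_vm->userptr_notifier_lock);
           goto retry;
   }
   map_gpu_vma_page_tables(&gpu_vma);
   up_read(&gpu_vm->userptr_notifier_lock);

   // Zapping side, called from the mmu invalidation notifier, which
   // already holds the userptr_seqlock in write mode.
   down_write(&gpu_vm->userptr_notifier_lock);
   zap_gpu_vma_page_tables(&gpu_vma);   // Zero ptes and flush TLB only.
   up_write(&gpu_vm->userptr_notifier_lock);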

If any part of the mapping is performed asynchronously
under a dma-fence with these locks released, the zapping will need to
wait for that dma-fence to signal under the relevant lock before
starting to modify the page-table.

Since modifying the
page-table structure in a way that frees up page-table memory
might also require outer level locks, the zapping of GPU ptes
typically focuses only on zeroing page-table or page-directory entries
and flushing TLB, whereas freeing of page-table memory is deferred to
unbind or rebind time.