.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers that
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

 u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

 u22:k10000:r3

and write down the mappings it will generate::

 u22 -> k10000
 u23 -> k10001
 u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids usable on a given system.

Briefly looking at this mathematically will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

 k10000 -> u22
 k10001 -> u23
 k10002 -> u24

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings. For example, assume we've been
given the three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000
 3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r200
 3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore (however, they are order isomorphic over
the full range of the second idmapping). Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. The kernel will report unmapped ids as the
overflowuid ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity. After that if we want to know what ``id`` maps to we can do
simple calculations:

- If we want to map from left to right::

   u:k:r
   id - u + k = n

- If we want to map from right to left::

   u:k:r
   id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

 1. u0:k20000:r10000
 2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

 id     - k      + u  = n
 k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

 id    - u    + k      = n
 u1100 - u500 + k30000 = k30600
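
The two formulas, together with the range check that was skipped for
simplicity, can be sketched in plain C. This is a minimal userspace model and
not kernel code: ``struct idmap``, ``idmap_down()``, ``idmap_up()`` and
``check_formal_examples()`` are hypothetical names chosen for illustration, and
``ID_UNMAPPED`` is an assumed stand-in for ``(uid_t)-1``.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    /* Assumption of this sketch: stands in for (uid_t)-1. */
    #define ID_UNMAPPED UINT32_MAX

    /* One idmapping in u:k:r notation. */
    struct idmap {
        uint32_t u; /* first id in the upper idmapset */
        uint32_t k; /* first id in the lower idmapset */
        uint32_t r; /* number of ids mapped */
    };

    /* Map "left to right" (down): id - u + k = n. */
    static uint32_t idmap_down(const struct idmap *m, uint32_t id)
    {
        if (id < m->u || id - m->u >= m->r)
            return ID_UNMAPPED; /* id is not in the upper idmapset */
        return id - m->u + m->k;
    }

    /* Map "right to left" (up): id - k + u = n. */
    static uint32_t idmap_up(const struct idmap *m, uint32_t id)
    {
        if (id < m->k || id - m->k >= m->r)
            return ID_UNMAPPED; /* id is not in the lower idmapset */
        return id - m->k + m->u;
    }

    /* Replay the worked examples from this section. */
    void check_formal_examples(void)
    {
        const struct idmap first = { 0, 20000, 10000 };    /* u0:k20000:r10000 */
        const struct idmap second = { 500, 30000, 10000 }; /* u500:k30000:r10000 */
        const struct idmap small = { 0, 20000, 200 };      /* u0:k20000:r200 */

        assert(idmap_up(&first, 21000) == 1000);         /* k21000 -> u1000 */
        assert(idmap_down(&second, 1100) == 30600);      /* u1100 -> k30600 */
        assert(idmap_down(&small, 1000) == ID_UNMAPPED); /* u1000 is unmapped */
    }

The range check is written as ``id - m->u >= m->r`` rather than
``id >= m->u + m->r`` so that the comparison cannot wrap around for idmappings
that cover the full 32-bit range.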

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

 userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written to or read from disk.

Note that we are only concerned with idmappings as the kernel stores them, not
how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, within this idmapping, the id ``u1000`` is an id in the upper
idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

 u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

 u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
  filesystems are) then the initial idmapping will be used. As we saw this is
  simply the identity idmapping. This would mean id ``u1000`` read from disk
  would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
  would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  ``i_uid`` and ``i_gid`` would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

 1. u0:k10000:r10000
 2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

 /* Map the kernel id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k10000:r10000, k11000) = u21000

Note how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

 /* Map the userspace id down into a kernel id in the second idmapping. */
 make_kuid(u20000:k10000:r10000, u21000) = k11000

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question of what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` fields.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using an idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the somewhat unconventional idmapping
``u3000:k20000:r10000``. Then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.
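
The ``stat()`` scenario can be replayed with a small userspace model. It is
only a sketch: ``struct idmap``, ``model_make_kuid()``, ``model_from_kuid()``,
``crossmap()`` and ``check_crossmapping()`` are hypothetical stand-ins, and
they do not reproduce the kernel's ``make_kuid()`` and ``from_kuid()``, which
operate on user namespaces and typed ids. ``ID_UNMAPPED`` is an assumed marker
for an id that has no mapping.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    #define ID_UNMAPPED UINT32_MAX /* assumed marker for "no mapping" */

    struct idmap {
        uint32_t u, k, r; /* u:k:r */
    };

    /* Map a userspace id down into a kernel id. */
    static uint32_t model_make_kuid(const struct idmap *m, uint32_t uid)
    {
        if (uid < m->u || uid - m->u >= m->r)
            return ID_UNMAPPED;
        return uid - m->u + m->k;
    }

    /* Map a kernel id up into a userspace id. */
    static uint32_t model_from_kuid(const struct idmap *m, uint32_t kuid)
    {
        if (kuid < m->k || kuid - m->k >= m->r)
            return ID_UNMAPPED;
        return kuid - m->k + m->u;
    }

    /*
     * Crossmapping: map @uid down in @from, then map the resulting kernel id
     * up in @to. This only works if both idmappings contain that kernel id.
     */
    static uint32_t crossmap(const struct idmap *from, const struct idmap *to,
                             uint32_t uid)
    {
        uint32_t kuid = model_make_kuid(from, uid);

        if (kuid == ID_UNMAPPED)
            return ID_UNMAPPED;
        return model_from_kuid(to, kuid);
    }

    void check_crossmapping(void)
    {
        const struct idmap fs = { 0, 20000, 10000 };        /* u0:k20000:r10000 */
        const struct idmap caller = { 3000, 20000, 10000 }; /* u3000:k20000:r10000 */

        /* u1000 on disk is k21000 in the kernel and u4000 for the caller. */
        assert(model_make_kuid(&fs, 1000) == 21000);
        assert(crossmap(&fs, &caller, 1000) == 4000);
    }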

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping. In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

    /* Map the kernel id up into a userspace id in the first idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

    /* Map the userspace id down into a kernel id in the second idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question of what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

    /* Map the kernel id up into a userspace id in the second idmapping. */
    from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

    /* Map the userspace id down into a kernel id in the first idmapping. */
    make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.
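
The "invert one idmapping, apply another" view translates directly into code.
The following is a userspace sketch under stated assumptions: ``struct idmap``,
``map_up()``, ``map_down()``, ``remap_kid()`` and ``check_remapping()`` are
hypothetical names, and ``ID_UNMAPPED`` is an assumed marker for an id that is
not mapped.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    #define ID_UNMAPPED UINT32_MAX /* assumed marker for "no mapping" */

    struct idmap {
        uint32_t u, k, r; /* u:k:r */
    };

    /* Map a kernel id up into a userspace id. */
    static uint32_t map_up(const struct idmap *m, uint32_t kid)
    {
        if (kid < m->k || kid - m->k >= m->r)
            return ID_UNMAPPED;
        return kid - m->k + m->u;
    }

    /* Map a userspace id down into a kernel id. */
    static uint32_t map_down(const struct idmap *m, uint32_t uid)
    {
        if (uid < m->u || uid - m->u >= m->r)
            return ID_UNMAPPED;
        return uid - m->u + m->k;
    }

    /*
     * Remapping: invert @from to obtain the userspace id, then apply @to.
     * Fails unless both idmappings have that userspace id mapped.
     */
    static uint32_t remap_kid(const struct idmap *from, const struct idmap *to,
                              uint32_t kid)
    {
        uint32_t uid = map_up(from, kid);

        if (uid == ID_UNMAPPED)
            return ID_UNMAPPED;
        return map_down(to, uid);
    }

    void check_remapping(void)
    {
        const struct idmap first = { 0, 10000, 10000 };  /* u0:k10000:r10000 */
        const struct idmap second = { 0, 20000, 10000 }; /* u0:k20000:r10000 */

        assert(remap_kid(&first, &second, 11000) == 21000); /* k11000 -> k21000 */
        assert(remap_kid(&second, &first, 21000) == 11000); /* and back again */
    }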

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping. While the kernel
idmapset always indicates an idmapset in the kernel id space the userspace
idmapset indicates a userspace id. So the following translations are forbidden::

 /* Map the userspace id down into a kernel id in the first idmapping. */
 make_kuid(u0:k10000:r10000, u1000) = k11000

 /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 make_kuid(u10000:k20000:r10000, k11000) = k21000
                                 ~~~~~~

and equally wrong::

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

 /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k0:r10000, u1000) = u21000
                             ~~~~~

Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
conflated. So the two examples above would cause a compilation failure.
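
A common way to get this kind of compile-time protection in C is to wrap the
raw integer in a single-member structure, so that the two kinds of id are no
longer implicitly convertible. The sketch below models the idea in userspace.
``uid_model_t``, ``kuid_model_t``, ``struct idmap``, ``model_make_kuid()``,
``model_from_kuid()`` and ``check_typed_ids()`` are hypothetical names, and
range checks are left out to keep the focus on the types.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    typedef uint32_t uid_model_t;                  /* plain userspace id */
    typedef struct { uint32_t val; } kuid_model_t; /* wrapped kernel id */

    struct idmap {
        uint32_t u, k, r; /* u:k:r */
    };

    /* Map a userspace id down; the result can only be used as a kernel id. */
    static kuid_model_t model_make_kuid(const struct idmap *m, uid_model_t uid)
    {
        kuid_model_t kuid = { uid - m->u + m->k };

        return kuid;
    }

    /* Map a kernel id up; only a wrapped kernel id is accepted as input. */
    static uid_model_t model_from_kuid(const struct idmap *m, kuid_model_t kuid)
    {
        return kuid.val - m->k + m->u;
    }

    void check_typed_ids(void)
    {
        const struct idmap first = { 0, 10000, 10000 }; /* u0:k10000:r10000 */
        kuid_model_t kuid = model_make_kuid(&first, 1000);

        assert(kuid.val == 11000);
        assert(model_from_kuid(&first, kuid) == 1000);

        /*
         * Neither of the following calls compiles, because a structure and
         * an integer cannot be converted into each other implicitly:
         *
         *   model_make_kuid(&first, kuid);
         *   model_from_kuid(&first, 1000);
         */
    }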

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

 /* Map the userspace id down into a kernel id. */
 make_kuid(idmapping, uid)

 /* Map the kernel id up into a userspace id. */
 from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.

When creating a filesystem object the kernel will look at the caller's
filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping.
   (To be precise, the kernel will simply look at the kernel ids stashed in the
   credentials of the current task but for our education we'll pretend this
   translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to map
the kernel id back up into a userspace id when writing to disk.
So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we introduced in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.

From an implementation point of view it's worth mentioning how idmappings are
represented. All idmappings are taken from the corresponding user namespace.

    - caller's idmapping (usually taken from ``current_user_ns()``)
    - filesystem's idmapping (``sb->s_user_ns``)
    - mount's idmapping (``mnt_idmap(vfsmnt)``)

Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
they can solve the problems we observed before.

Example 1
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping.

   For this second step the kernel will call the function
   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
   ``from_kuid()``::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.

Example 2
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping. So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.

Example 3
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has two main
consequences.

First, we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. That solution is limited to a few
filesystems and not very flexible, yet this is a use-case that is pretty
important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping:

1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map kernel ids up to userspace ids in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1
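
The creation check from Examples 1 to 3 can be condensed into a few lines of
C. This is a userspace sketch under stated assumptions: ``struct idmap``,
``map_down()``, ``map_up()``, ``uid_written_to_disk()`` and
``check_creation_examples()`` are hypothetical names, ``ID_UNMAPPED`` stands in
for ``u-1``, and the kernel data structures that ``fsuidgid_has_mapping()``
works on are not modelled.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    #define ID_UNMAPPED UINT32_MAX /* stands in for u-1 */

    struct idmap {
        uint32_t u, k, r; /* u:k:r */
    };

    /* Map a userspace id down into a kernel id. */
    static uint32_t map_down(const struct idmap *m, uint32_t uid)
    {
        if (uid < m->u || uid - m->u >= m->r)
            return ID_UNMAPPED;
        return uid - m->u + m->k;
    }

    /* Map a kernel id up into a userspace id. */
    static uint32_t map_up(const struct idmap *m, uint32_t kid)
    {
        if (kid < m->k || kid - m->k >= m->r)
            return ID_UNMAPPED;
        return kid - m->k + m->u;
    }

    /*
     * Step 1: map the caller's userspace id down in the caller's idmapping.
     * Step 2: map the kernel id up in the filesystem's idmapping.
     * Returns the id that would land on disk, or ID_UNMAPPED if the creation
     * request has to be denied.
     */
    static uint32_t uid_written_to_disk(const struct idmap *caller,
                                        const struct idmap *fs, uint32_t uid)
    {
        uint32_t kid = map_down(caller, uid);

        return kid == ID_UNMAPPED ? ID_UNMAPPED : map_up(fs, kid);
    }

    void check_creation_examples(void)
    {
        const struct idmap initial = { 0, 0, UINT32_MAX }; /* u0:k0:r4294967295 */
        const struct idmap caller = { 0, 10000, 10000 };   /* u0:k10000:r10000 */
        const struct idmap fs = { 0, 20000, 10000 };       /* u0:k20000:r10000 */

        /* Example 1: identity everywhere, u1000 lands on disk. */
        assert(uid_written_to_disk(&initial, &initial, 1000) == 1000);
        /* Example 2: k11000 is unmapped in the filesystem, request denied. */
        assert(uid_written_to_disk(&caller, &fs, 1000) == ID_UNMAPPED);
        /* Example 3: k11000 maps to u11000, which is what lands on disk. */
        assert(uid_written_to_disk(&caller, &initial, 1000) == 11000);
    }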

Example 4
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

The crossmapping algorithm fails in this case because the kernel id in the
filesystem idmapping cannot be mapped up to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Example 5
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k21000) = u-1

Again, the crossmapping algorithm fails in this case because the kernel id in
the filesystem idmapping cannot be mapped to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Note how in the last two examples things would be simple if the caller were
using the initial idmapping. For a filesystem mounted with the initial
idmapping it would be trivial. So we only consider a filesystem with an
idmapping of ``u0:k20000:r10000``:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k0:r4294967295, k21000) = u21000

Idmappings on idmapped mounts
-----------------------------

The examples we've seen in the previous section where the caller's idmapping
and the filesystem's idmapping are incompatible cause various issues for
workloads. For a more complex but common example, consider two containers
started on the host. To completely prevent the two containers from affecting
each other, an administrator may often use different non-overlapping idmappings
for the two containers::

 container1 idmapping:  u0:k10000:r10000
 container2 idmapping:  u0:k20000:r10000
 filesystem idmapping:  u0:k30000:r10000

An administrator wanting to provide easy read-write access to the following set
of files::

 dir id:       u0
 dir/file1 id: u1000
 dir/file2 id: u2000

to both containers currently can't.

Of course the administrator has the option to recursively change ownership via
``chown()``. For example, they could change ownership so that ``dir`` and all
files below it can be crossmapped from the filesystem's into the container's
idmapping. Let's assume they change ownership so it is compatible with the
first container's idmapping::

 dir id:       u10000
 dir/file1 id: u11000
 dir/file2 id: u12000

This would still leave ``dir`` rather useless to the second container. In fact,
``dir`` and all files below it would continue to appear owned by the overflowid
for the second container.

Or consider another increasingly popular example. Some service managers such as
systemd implement a concept called "portable home directories". A user may want
to use their home directories on different machines where they are assigned
different login userspace ids. Most users will have ``u1000`` as the login id
on their machine at home and all files in their home directory will usually be
owned by ``u1000``. At uni or at work they may have another login id such as
``u1125``. This makes it rather difficult to interact with their home directory
on their work machine.

In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently. In the home
directory case this change in ownership would even need to happen every time the
user switches from their home to their work machine. For really large sets of
files this becomes increasingly costly.

If the user is lucky, they are dealing with a filesystem that is mountable
inside user namespaces. But this would also change ownership globally and the
change in ownership is tied to the lifetime of the filesystem mount, i.e. the
superblock. The only way to change ownership is to completely unmount the
filesystem and mount it again in another user namespace. This is usually
impossible because it would mean that all users currently accessing the
filesystem can't anymore. And it means that ``dir`` still can't be shared
between two containers with different idmappings.
But usually the user doesn't even have this option since most filesystems
aren't mountable inside containers. And not having them mountable might be
desirable as it doesn't require the filesystem to deal with malicious
filesystem images.

But the use-cases mentioned above and more can be handled by idmapped mounts.
They allow exposing the same set of dentries with different ownership at
different mounts. This is achieved by marking the mounts with a user namespace
through the ``mount_setattr()`` system call. The idmapping associated with it
is then used to translate from the caller's idmapping to the filesystem's
idmapping and vice versa using the remapping algorithm we introduced above.

Idmapped mounts make it possible to change ownership in a temporary and
localized way. The ownership changes are restricted to a specific mount and the
ownership changes are tied to the lifetime of the mount. All other users and
locations where the filesystem is exposed are unaffected.

Filesystems that support idmapped mounts don't have any real reason to support
being mountable inside user namespaces. A filesystem could be exposed
completely under an idmapped mount to get the same effect. This has the
advantage that filesystems can leave the creation of the superblock to
privileged users in the initial user namespace.

However, it is perfectly possible to combine idmapped mounts with filesystems
mountable inside user namespaces. We will touch on this further below.

Filesystem types vs idmapped mount types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the introduction of idmapped mounts we need to distinguish between
filesystem ownership and mount ownership of a VFS object such as an inode. The
owner of an inode might be different when looked at from a filesystem
perspective than when looked at from an idmapped mount. Such fundamental
conceptual distinctions should almost always be clearly expressed in the code.
So, to distinguish idmapped mount ownership from filesystem ownership separate
types have been introduced.

If a uid or gid has been generated using the filesystem or caller's idmapping
then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
has been generated using a mount idmapping then we will be using the dedicated
``vfsuid_t`` and ``vfsgid_t`` types.

All VFS helpers that generate or take uids and gids as arguments use the
``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
to catch errors that originate from conflating filesystem and VFS uids and gids.

The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
from and to ``uid_t`` and ``gid_t`` types::

 uid_t <--> kuid_t <--> vfsuid_t
 gid_t <--> kgid_t <--> vfsgid_t

Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
e.g., during ``stat()``, or store ownership information in a shared VFS object
based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.

To illustrate why these helpers exist, consider what happens when we
change ownership of an inode from an idmapped mount. After we have generated
a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit
this ``vfsuid_t`` or ``vfsgid_t`` as the new filesystem-wide ownership.
Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
``vfsgid_into_kgid()``.

Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
``struct posix_acl``, stores ownership information a filesystem or "global"
``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
and ``vfsgid_t`` is specific to an idmapped mount.

We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
on filesystem idmappings. To prevent abusing filesystem idmappings to generate
``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t``
or ``kgid_t`` types filesystem idmappings and mount idmappings are different
types as well.

All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing
a filesystem or caller idmapping will cause a compilation error.

Similar to how we prefix all userspace ids in this document with ``u`` and all
kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
idmapping will be written as: ``u0:v10000:r10000``.
697
698Remapping helpers
699~~~~~~~~~~~~~~~~~
700
701Idmapping functions were added that translate between idmappings. They make use
702of the remapping algorithm we've introduced earlier. We're going to look at:
703
704- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``
705
706  The ``i_*id_into_vfs*id()`` functions translate filesystem's kernel ids into
707  VFS ids in the mount's idmapping::
708
709   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
710   from_kuid(filesystem, kid) = uid
711
712   /* Map the filesystem's userspace id down ito a VFS id in the mount's idmapping. */
713   make_kuid(mount, uid) = kuid
714
715- ``mapped_fsuid()`` and ``mapped_fsgid()``
716
717  The ``mapped_fs*id()`` functions translate the caller's kernel ids into
718  kernel ids in the filesystem's idmapping. This translation is achieved by
719  remapping the caller's VFS ids using the mount's idmapping::
720
721   /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
722   from_kuid(mount, kid) = uid
723
724   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
725   make_kuid(filesystem, uid) = kuid
726
727- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``
728
729   Whenever
730
Note that the ``i_*id_into_vfs*id()`` and ``mapped_fs*id()`` functions invert
each other. Consider the following idmappings::
733
734 caller idmapping:     u0:k10000:r10000
735 filesystem idmapping: u0:k20000:r10000
736 mount idmapping:      u0:v10000:r10000
737
738Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
739to ``k21000`` according to its idmapping. This is what is stored in the
740inode's ``i_uid`` and ``i_gid`` fields.
741
742When the caller queries the ownership of this file via ``stat()`` the kernel
743would usually simply use the crossmapping algorithm and map the filesystem's
744kernel id up to a userspace id in the caller's idmapping.
745
746But when the caller is accessing the file on an idmapped mount the kernel will
747first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
748id into a VFS id in the mount's idmapping::
749
750 i_uid_into_vfsuid(k21000):
751   /* Map the filesystem's kernel id up into a userspace id. */
752   from_kuid(u0:k20000:r10000, k21000) = u1000
753
754   /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
755   make_kuid(u0:v10000:r10000, u1000) = v11000
756
757Finally, when the kernel reports the owner to the caller it will turn the
758VFS id in the mount's idmapping into a userspace id in the caller's
759idmapping::
760
761  k11000 = vfsuid_into_kuid(v11000)
762  from_kuid(u0:k10000:r10000, k11000) = u1000
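
The reporting path just described can be replayed with a small Python model.
This is only a sketch: ``Idmap``, ``map_down()`` and ``map_up()`` are
hypothetical stand-ins for an idmapping, ``make_kuid()`` and ``from_kuid()``,
and ``vfsuid_into_kuid()`` is modelled as keeping the numeric value, as in the
example above.

.. code-block:: python

   from typing import NamedTuple, Optional


   class Idmap(NamedTuple):
       upper: int  # first id of the upper idmapset (u)
       lower: int  # first id of the lower idmapset (k or v)
       count: int  # number of ids mapped (r)


   def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
       """Model of make_kuid(): upper id -> lower id, None if unmapped."""
       if uid is not None and m.upper <= uid < m.upper + m.count:
           return m.lower + (uid - m.upper)
       return None


   def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
       """Model of from_kuid(): lower id -> upper id, None if unmapped."""
       if kid is not None and m.lower <= kid < m.lower + m.count:
           return m.upper + (kid - m.lower)
       return None


   def i_uid_into_vfsuid(mount: Idmap, fs: Idmap, kuid: int) -> Optional[int]:
       """Filesystem kernel id -> VFS id in the mount's idmapping."""
       return map_down(mount, map_up(fs, kuid))


   caller = Idmap(0, 10000, 10000)
   filesystem = Idmap(0, 20000, 10000)
   mount = Idmap(0, 10000, 10000)

   vfsuid = i_uid_into_vfsuid(mount, filesystem, 21000)
   kuid = vfsuid  # vfsuid_into_kuid(): same number, different type
   reported = map_up(caller, kuid)

Here ``vfsuid`` evaluates to ``11000`` and ``reported`` to ``1000``, matching
``v11000`` and ``u1000`` above.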
763
764We can test whether this algorithm really works by verifying what happens when
765we create a new file. Let's say the user is creating a file with ``u1000``.
766
767The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
768kernel would now apply the crossmapping, verifying that ``k11000`` can be
769mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
770be mapped up in the filesystem's idmapping directly this creation request
771fails.
772
But when the caller is accessing the file on an idmapped mount the kernel will
first call ``mapped_fs*id()``, thereby treating the caller's kernel id as a VFS
id and translating it into a kernel id in the filesystem's idmapping via the
mount's idmapping::

 mapped_fsuid(v11000):
    /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
    from_kuid(u0:v10000:r10000, v11000) = u1000

    /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

When finally writing to disk the kernel will then map ``k21000`` up into a
userspace id in the filesystem's idmapping::

   from_kuid(u0:k20000:r10000, k21000) = u1000
789
As we can see, we end up with an invertible and therefore
information-preserving algorithm. A file created from ``u1000`` on an idmapped
mount will also be reported as being owned by ``u1000`` and vice versa.
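
The invertibility claim can be checked exhaustively for the idmappings used in
this section. The following Python sketch assumes a simplified model in which
``Idmap``, ``map_down()`` and ``map_up()`` are hypothetical stand-ins for an
idmapping, ``make_kuid()`` and ``from_kuid()``, and in which the two helpers
are reduced to the two-step remapping described earlier.

.. code-block:: python

   from typing import NamedTuple, Optional


   class Idmap(NamedTuple):
       upper: int
       lower: int
       count: int


   def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
       """Model of make_kuid(); None means "no mapping"."""
       if uid is not None and m.upper <= uid < m.upper + m.count:
           return m.lower + (uid - m.upper)
       return None


   def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
       """Model of from_kuid(); None means "no mapping"."""
       if kid is not None and m.lower <= kid < m.lower + m.count:
           return m.upper + (kid - m.lower)
       return None


   def mapped_fsuid(mount, fs, vfsuid):
       """Caller's VFS id -> kernel id in the filesystem's idmapping."""
       return map_down(fs, map_up(mount, vfsuid))


   def i_uid_into_vfsuid(mount, fs, kuid):
       """Filesystem kernel id -> VFS id in the mount's idmapping."""
       return map_down(mount, map_up(fs, kuid))


   caller = Idmap(0, 10000, 10000)
   filesystem = Idmap(0, 20000, 10000)
   mount = Idmap(0, 10000, 10000)


   def create(uid):
       """Userspace id of the creator -> userspace id written to disk."""
       kuid = map_down(caller, uid)
       return map_up(filesystem, mapped_fsuid(mount, filesystem, kuid))


   def report(disk_uid):
       """Userspace id on disk -> userspace id reported to the caller."""
       kuid = map_down(filesystem, disk_uid)
       return map_up(caller, i_uid_into_vfsuid(mount, filesystem, kuid))


   round_trips = all(report(create(uid)) == uid for uid in range(10000))
   helpers_invert = all(
       i_uid_into_vfsuid(mount, filesystem,
                         mapped_fsuid(mount, filesystem, v)) == v
       for v in range(10000, 20000)
   )

Both ``round_trips`` and ``helpers_invert`` are ``True`` in this model: every
id in the mapped range survives creation followed by reporting.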
793
794Let's now briefly reconsider the failing examples from earlier in the context
795of idmapped mounts.
796
797Example 2 reconsidered
798~~~~~~~~~~~~~~~~~~~~~~
799
800::
801
802 caller id:            u1000
803 caller idmapping:     u0:k10000:r10000
804 filesystem idmapping: u0:k20000:r10000
805 mount idmapping:      u0:v10000:r10000
806
807When the caller is using a non-initial idmapping the common case is to attach
808the same idmapping to the mount. We now perform three steps:
809
8101. Map the caller's userspace ids into kernel ids in the caller's idmapping::
811
812    make_kuid(u0:k10000:r10000, u1000) = k11000
813
8142. Translate the caller's VFS id into a kernel id in the filesystem's
815   idmapping::
816
817    mapped_fsuid(v11000):
818      /* Map the VFS id up into a userspace id in the mount's idmapping. */
819      from_kuid(u0:v10000:r10000, v11000) = u1000
820
821      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
822      make_kuid(u0:k20000:r10000, u1000) = k21000
823
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::
826
827    from_kuid(u0:k20000:r10000, k21000) = u1000
828
829So the ownership that lands on disk will be ``u1000``.
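
The three steps can be traced in a small Python model. It is a sketch under
stated assumptions, not kernel code: ``Idmap``, ``map_down()``, ``map_up()``
and ``create_trace()`` are hypothetical names, with ``map_down()`` and
``map_up()`` standing in for ``make_kuid()`` and ``from_kuid()``.

.. code-block:: python

   from typing import NamedTuple, Optional


   class Idmap(NamedTuple):
       upper: int
       lower: int
       count: int


   def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
       if uid is not None and m.upper <= uid < m.upper + m.count:
           return m.lower + (uid - m.upper)
       return None


   def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
       if kid is not None and m.lower <= kid < m.lower + m.count:
           return m.upper + (kid - m.lower)
       return None


   def create_trace(caller, mount, fs, uid):
       """Return the ids produced by each of the three steps."""
       kuid = map_down(caller, uid)                 # step 1
       fs_kuid = map_down(fs, map_up(mount, kuid))  # step 2: mapped_fsuid()
       disk_uid = map_up(fs, fs_kuid)               # step 3
       return kuid, fs_kuid, disk_uid


   caller = Idmap(0, 10000, 10000)
   filesystem = Idmap(0, 20000, 10000)
   mount = Idmap(0, 10000, 10000)

   trace = create_trace(caller, mount, filesystem, 1000)

   # Without the idmapped mount, step 2 is skipped and step 3 has no mapping.
   plain = map_up(filesystem, map_down(caller, 1000))

``trace`` is ``(11000, 21000, 1000)``, i.e. ``k11000``, ``k21000`` and
``u1000``, whereas ``plain`` is ``None`` because ``k11000`` lies outside the
filesystem's idmapping.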
830
831Example 3 reconsidered
832~~~~~~~~~~~~~~~~~~~~~~
833
834::
835
836 caller id:            u1000
837 caller idmapping:     u0:k10000:r10000
838 filesystem idmapping: u0:k0:r4294967295
839 mount idmapping:      u0:v10000:r10000
840
841The same translation algorithm works with the third example.
842
8431. Map the caller's userspace ids into kernel ids in the caller's idmapping::
844
845    make_kuid(u0:k10000:r10000, u1000) = k11000
846
8472. Translate the caller's VFS id into a kernel id in the filesystem's
848   idmapping::
849
850    mapped_fsuid(v11000):
851       /* Map the VFS id up into a userspace id in the mount's idmapping. */
852       from_kuid(u0:v10000:r10000, v11000) = u1000
853
854       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
855       make_kuid(u0:k0:r4294967295, u1000) = k1000
856
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000
861
862So the ownership that lands on disk will be ``u1000``.
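
To experiment with other idmappings, the ``u:k:r`` notation itself can be
parsed. The Python sketch below is one possible way to do this: ``Idmap``,
``parse_idmap()``, ``map_down()`` and ``map_up()`` are hypothetical helpers
introduced only for illustration, the latter two modelling ``make_kuid()`` and
``from_kuid()``.

.. code-block:: python

   import re
   from typing import NamedTuple, Optional


   class Idmap(NamedTuple):
       upper: int
       lower: int
       count: int


   def parse_idmap(text: str) -> Idmap:
       """Parse the u:k:r (or u:v:r) notation used in this document."""
       m = re.fullmatch(r"u(\d+):[kv](\d+):r(\d+)", text)
       if m is None:
           raise ValueError(f"not an idmapping: {text!r}")
       return Idmap(*(int(g) for g in m.groups()))


   def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
       if uid is not None and m.upper <= uid < m.upper + m.count:
           return m.lower + (uid - m.upper)
       return None


   def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
       if kid is not None and m.lower <= kid < m.lower + m.count:
           return m.upper + (kid - m.lower)
       return None


   caller = parse_idmap("u0:k10000:r10000")
   filesystem = parse_idmap("u0:k0:r4294967295")
   mount = parse_idmap("u0:v10000:r10000")

   kuid = map_down(caller, 1000)                        # step 1
   fs_kuid = map_down(filesystem, map_up(mount, kuid))  # step 2
   on_disk = map_up(filesystem, fs_kuid)                # step 3

In this model ``kuid`` is ``11000``, ``fs_kuid`` is ``1000`` and ``on_disk`` is
``1000``.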
863
864Example 4 reconsidered
865~~~~~~~~~~~~~~~~~~~~~~
866
867::
868
869 file id:              u1000
870 caller idmapping:     u0:k10000:r10000
871 filesystem idmapping: u0:k0:r4294967295
872 mount idmapping:      u0:v10000:r10000
873
874In order to report ownership to userspace the kernel now does three steps using
875the translation algorithm we introduced earlier:
876
8771. Map the userspace id on disk down into a kernel id in the filesystem's
878   idmapping::
879
880    make_kuid(u0:k0:r4294967295, u1000) = k1000
881
8822. Translate the kernel id into a VFS id in the mount's idmapping::
883
884    i_uid_into_vfsuid(k1000):
885      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
886      from_kuid(u0:k0:r4294967295, k1000) = u1000
887
      /* Map the userspace id down into a VFS id in the mount's idmapping. */
889      make_kuid(u0:v10000:r10000, u1000) = v11000
890
8913. Map the VFS id up into a userspace id in the caller's idmapping::
892
893    k11000 = vfsuid_into_kuid(v11000)
894    from_kuid(u0:k10000:r10000, k11000) = u1000
895
Earlier, the file's kernel id couldn't be crossmapped into the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now reported as being
owned by ``u1000`` according to the mount's idmapping.
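
The difference the idmapped mount makes when reporting ownership can be seen by
running the same lookup with and without it. This Python sketch is a simplified
model and not kernel code: ``Idmap``, ``map_down()``, ``map_up()`` and
``report_owner()`` are hypothetical names, and an absent mount idmapping is
represented by ``None``.

.. code-block:: python

   from typing import NamedTuple, Optional


   class Idmap(NamedTuple):
       upper: int
       lower: int
       count: int


   def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
       if uid is not None and m.upper <= uid < m.upper + m.count:
           return m.lower + (uid - m.upper)
       return None


   def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
       if kid is not None and m.lower <= kid < m.lower + m.count:
           return m.upper + (kid - m.lower)
       return None


   def report_owner(caller, fs, disk_uid, mount=None):
       """Userspace id on disk -> userspace id seen by the caller."""
       kuid = map_down(fs, disk_uid)                 # step 1
       if mount is not None:
           kuid = map_down(mount, map_up(fs, kuid))  # step 2
       return map_up(caller, kuid)                   # step 3


   caller = Idmap(0, 10000, 10000)
   filesystem = Idmap(0, 0, 4294967295)
   mount = Idmap(0, 10000, 10000)

   with_mount = report_owner(caller, filesystem, 1000, mount)
   without_mount = report_owner(caller, filesystem, 1000)

``with_mount`` is ``1000``, while ``without_mount`` is ``None`` because
``k1000`` has no mapping in the caller's idmapping.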
900
901Example 5 reconsidered
902~~~~~~~~~~~~~~~~~~~~~~
903
904::
905
906 file id:              u1000
907 caller idmapping:     u0:k10000:r10000
908 filesystem idmapping: u0:k20000:r10000
909 mount idmapping:      u0:v10000:r10000
910
911Again, in order to report ownership to userspace the kernel now does three
912steps using the translation algorithm we introduced earlier:
913
9141. Map the userspace id on disk down into a kernel id in the filesystem's
915   idmapping::
916
917    make_kuid(u0:k20000:r10000, u1000) = k21000
918
9192. Translate the kernel id into a VFS id in the mount's idmapping::
920
921    i_uid_into_vfsuid(k21000):
922      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
923      from_kuid(u0:k20000:r10000, k21000) = u1000
924
      /* Map the userspace id down into a VFS id in the mount's idmapping. */
926      make_kuid(u0:v10000:r10000, u1000) = v11000
927
9283. Map the VFS id up into a userspace id in the caller's idmapping::
929
930    k11000 = vfsuid_into_kuid(v11000)
931    from_kuid(u0:k10000:r10000, k11000) = u1000
932
Earlier, the file's kernel id couldn't be crossmapped into the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now owned by
``u1000`` according to the mount's idmapping.
937
938Changing ownership on a home directory
939~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
940
We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem or both use a non-initial
idmapping. A wide range of use cases exists when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads. As we have seen, the consequence is that, both for filesystems
mounted with the initial idmapping and for filesystems mounted with non-initial
idmappings, access to the filesystem doesn't work because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.
949
950As we've seen above idmapped mounts provide a solution to this by remapping the
951caller's or filesystem's idmapping according to the mount's idmapping.
952
953Aside from containerized workloads, idmapped mounts have the advantage that
954they also work when both the caller and the filesystem use the initial
955idmapping which means users on the host can change the ownership of directories
956and files on a per-mount basis.
957
958Consider our previous example where a user has their home directory on portable
959storage. At home they have id ``u1000`` and all files in their home directory
960are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.
961
962Taking their home directory with them becomes problematic. They can't easily
963access their files, they might not be able to write to disk without applying
964lax permissions or ACLs and even if they can, they will end up with an annoying
965mix of files and directories owned by ``u1000`` and ``u1125``.
966
Idmapped mounts solve this problem. A user can create an idmapped
968mount for their home directory on their work computer or their computer at home
969depending on what ownership they would prefer to end up on the portable storage
970itself.
971
Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
when they create a file the kernel performs the following steps we already know
from above::
977
978 caller id:            u1125
979 caller idmapping:     u0:k0:r4294967295
980 filesystem idmapping: u0:k0:r4294967295
981 mount idmapping:      u1000:v1125:r1
982
9831. Map the caller's userspace ids into kernel ids in the caller's idmapping::
984
985    make_kuid(u0:k0:r4294967295, u1125) = k1125
986
9872. Translate the caller's VFS id into a kernel id in the filesystem's
988   idmapping::
989
990    mapped_fsuid(v1125):
991      /* Map the VFS id up into a userspace id in the mount's idmapping. */
992      from_kuid(u1000:v1125:r1, v1125) = u1000
993
994      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
995      make_kuid(u0:k0:r4294967295, u1000) = k1000
996
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::
999
1000    from_kuid(u0:k0:r4294967295, k1000) = u1000
1001
1002So ultimately the file will be created with ``u1000`` on disk.
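
Because the mount's idmapping covers a single id, only that id can be remapped.
The Python sketch below illustrates this with a simplified model: ``Idmap``,
``map_down()``, ``map_up()`` and ``create_owner()`` are hypothetical names, and
``None`` merely signals that the model found no mapping.

.. code-block:: python

   from typing import NamedTuple, Optional


   class Idmap(NamedTuple):
       upper: int
       lower: int
       count: int


   def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
       if uid is not None and m.upper <= uid < m.upper + m.count:
           return m.lower + (uid - m.upper)
       return None


   def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
       if kid is not None and m.lower <= kid < m.lower + m.count:
           return m.upper + (kid - m.lower)
       return None


   def create_owner(caller, mount, fs, uid):
       """Userspace id of the creator -> userspace id written to disk."""
       kuid = map_down(caller, uid)                 # step 1
       fs_kuid = map_down(fs, map_up(mount, kuid))  # step 2: mapped_fsuid()
       return map_up(fs, fs_kuid)                   # step 3


   initial = Idmap(0, 0, 4294967295)
   mount = Idmap(1000, 1125, 1)

   as_u1125 = create_owner(initial, mount, initial, 1125)
   as_u1126 = create_owner(initial, mount, initial, 1126)

``as_u1125`` is ``1000``, as in the steps above, while ``as_u1126`` is ``None``
because ``v1126`` is not part of the mount's idmapping.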
1003
1004Now let's briefly look at what ownership the caller with id ``u1125`` will see
1005on their work computer:
1006
1007::
1008
1009 file id:              u1000
1010 caller idmapping:     u0:k0:r4294967295
1011 filesystem idmapping: u0:k0:r4294967295
1012 mount idmapping:      u1000:v1125:r1
1013
10141. Map the userspace id on disk down into a kernel id in the filesystem's
1015   idmapping::
1016
1017    make_kuid(u0:k0:r4294967295, u1000) = k1000
1018
10192. Translate the kernel id into a VFS id in the mount's idmapping::
1020
1021    i_uid_into_vfsuid(k1000):
1022      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
1023      from_kuid(u0:k0:r4294967295, k1000) = u1000
1024
      /* Map the userspace id down into a VFS id in the mount's idmapping. */
1026      make_kuid(u1000:v1125:r1, u1000) = v1125
1027
10283. Map the VFS id up into a userspace id in the caller's idmapping::
1029
1030    k1125 = vfsuid_into_kuid(v1125)
1031    from_kuid(u0:k0:r4294967295, k1125) = u1125
1032
So ultimately the file will be reported to the caller as belonging to
``u1125``, which is the caller's userspace id on their workstation in our
example.
1035
The raw userspace id that is put on disk is ``u1000``, so when the user takes
their home directory back to their home computer, where they are assigned
``u1000`` using the initial idmapping, and mounts the filesystem with the
initial idmapping, they will see all those files owned by ``u1000``.
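
The two views of the same on-disk ownership can be compared side by side. The
Python sketch below is a simplified model rather than kernel code:
``Idmap``, ``map_down()``, ``map_up()`` and ``report_owner()`` are hypothetical
names, an absent mount idmapping is represented by ``None``, and the on-disk
ids other than ``u1000`` are made up for the comparison.

.. code-block:: python

   from typing import NamedTuple, Optional


   class Idmap(NamedTuple):
       upper: int
       lower: int
       count: int


   def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
       if uid is not None and m.upper <= uid < m.upper + m.count:
           return m.lower + (uid - m.upper)
       return None


   def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
       if kid is not None and m.lower <= kid < m.lower + m.count:
           return m.upper + (kid - m.lower)
       return None


   def report_owner(caller, fs, disk_uid, mount=None):
       """Userspace id on disk -> userspace id seen by the caller."""
       kuid = map_down(fs, disk_uid)
       if mount is not None:
           kuid = map_down(mount, map_up(fs, kuid))  # i_uid_into_vfsuid()
       return map_up(caller, kuid)


   initial = Idmap(0, 0, 4294967295)
   work_mount = Idmap(1000, 1125, 1)
   on_disk = [0, 1000, 1001]  # illustrative on-disk owners

   at_work = {u: report_owner(initial, initial, u, work_mount) for u in on_disk}
   at_home = {u: report_owner(initial, initial, u) for u in on_disk}

In this model ``at_work`` is ``{0: None, 1000: 1125, 1001: None}`` and
``at_home`` is ``{0: 0, 1000: 1000, 1001: 1001}``. How the kernel presents ids
that have no mapping is outside the scope of this sketch.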
1040