Searched +hist:4 +hist:be929be (Results 1 - 19 of 19) sorted by relevance

/linux-master/fs/ocfs2/
H A Dblockcheck.cdiff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/drivers/dma/
H A Dtimb_dma.cdiff 4f339d6e Tue Sep 19 07:32:01 MDT 2023 Uwe Kleine-König <u.kleine-koenig@pengutronix.de> dmaengine: timb_dma: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new() which already returns void. Eventually after all drivers
are converted, .remove_new() is renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Link: https://lore.kernel.org/r/20230919133207.1400430-54-u.kleine-koenig@pengutronix.de
Signed-off-by: Vinod Koul <vkoul@kernel.org>
diff 4bf27b8b Fri Dec 21 16:09:59 MST 2012 Greg Kroah-Hartman <gregkh@linuxfoundation.org> Drivers: dma: remove __dev* attributes.

CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.

This change removes the use of __devinit, __devexit_p, __devinitconst,
and __devexit from these drivers.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Viresh Kumar <viresh.linux@gmail.com>
Cc: Dan Williams <djbw@fb.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Barry Song <baohua.song@csr.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Jassi Brar <jassisinghbrar@gmail.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 4d4e58de Tue Mar 06 15:34:06 MST 2012 Russell King - ARM Linux <linux@arm.linux.org.uk> dmaengine: move last completed cookie into generic dma_chan structure

Every DMA engine implementation declares a last completed dma cookie
in their private dma channel structures. This is pointless, and
forces driver specific code. Move this out into the common dma_chan
structure.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jassi Brar <jassisinghbrar@gmail.com>
[imx-sdma.c & mxs-dma.c]
Tested-by: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: Vinod Koul <vinod.koul@linux.intel.com>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/net/9p/
H A Dprotocol.cdiff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size()

This new function calculates a buffer size suitable for holding the
intended 9p request or response. For rather small message types (which
applies to almost all 9p message types actually) simply use hard coded
values. For some variable-length and potentially large message types
calculate a more precise value according to what data is actually
transmitted to avoid unnecessarily huge buffers.

So p9_msg_buf_size() divides the individual 9p message types into 3
message size categories:

- dynamically calculated message size (i.e. potentially large)
- 8k hard coded message size
- 4k hard coded message size

As for the latter two hard coded message types: for most 9p message
types it is pretty obvious whether they would always fit into 4k or
8k. But for some of them it depends on the maximum directory entry
name length allowed by OS and filesystem for determining into which
of the two size categories they would fit into. Currently Linux
supports directory entry names up to NAME_MAX (255), however when
comparing the limitation of individual filesystems, ReiserFS
theoretically supports up to slightly below 4k long names. So in
order to make this code more future proof, and as revisiting it
later on is a bit tedious and has the potential to miss out details,
the decision [1] was made to take 4k as basis as for max. name length.

Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com
Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1]
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
diff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size()

This new function calculates a buffer size suitable for holding the
intended 9p request or response. For rather small message types (which
applies to almost all 9p message types actually) simply use hard coded
values. For some variable-length and potentially large message types
calculate a more precise value according to what data is actually
transmitted to avoid unnecessarily huge buffers.

So p9_msg_buf_size() divides the individual 9p message types into 3
message size categories:

- dynamically calculated message size (i.e. potentially large)
- 8k hard coded message size
- 4k hard coded message size

As for the latter two hard coded message types: for most 9p message
types it is pretty obvious whether they would always fit into 4k or
8k. But for some of them it depends on the maximum directory entry
name length allowed by OS and filesystem for determining into which
of the two size categories they would fit into. Currently Linux
supports directory entry names up to NAME_MAX (255), however when
comparing the limitation of individual filesystems, ReiserFS
theoretically supports up to slightly below 4k long names. So in
order to make this code more future proof, and as revisiting it
later on is a bit tedious and has the potential to miss out details,
the decision [1] was made to take 4k as basis as for max. name length.

Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com
Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1]
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
diff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size()

This new function calculates a buffer size suitable for holding the
intended 9p request or response. For rather small message types (which
applies to almost all 9p message types actually) simply use hard coded
values. For some variable-length and potentially large message types
calculate a more precise value according to what data is actually
transmitted to avoid unnecessarily huge buffers.

So p9_msg_buf_size() divides the individual 9p message types into 3
message size categories:

- dynamically calculated message size (i.e. potentially large)
- 8k hard coded message size
- 4k hard coded message size

As for the latter two hard coded message types: for most 9p message
types it is pretty obvious whether they would always fit into 4k or
8k. But for some of them it depends on the maximum directory entry
name length allowed by OS and filesystem for determining into which
of the two size categories they would fit into. Currently Linux
supports directory entry names up to NAME_MAX (255), however when
comparing the limitation of individual filesystems, ReiserFS
theoretically supports up to slightly below 4k long names. So in
order to make this code more future proof, and as revisiting it
later on is a bit tedious and has the potential to miss out details,
the decision [1] was made to take 4k as basis as for max. name length.

Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com
Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1]
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
diff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size()

This new function calculates a buffer size suitable for holding the
intended 9p request or response. For rather small message types (which
applies to almost all 9p message types actually) simply use hard coded
values. For some variable-length and potentially large message types
calculate a more precise value according to what data is actually
transmitted to avoid unnecessarily huge buffers.

So p9_msg_buf_size() divides the individual 9p message types into 3
message size categories:

- dynamically calculated message size (i.e. potentially large)
- 8k hard coded message size
- 4k hard coded message size

As for the latter two hard coded message types: for most 9p message
types it is pretty obvious whether they would always fit into 4k or
8k. But for some of them it depends on the maximum directory entry
name length allowed by OS and filesystem for determining into which
of the two size categories they would fit into. Currently Linux
supports directory entry names up to NAME_MAX (255), however when
comparing the limitation of individual filesystems, ReiserFS
theoretically supports up to slightly below 4k long names. So in
order to make this code more future proof, and as revisiting it
later on is a bit tedious and has the potential to miss out details,
the decision [1] was made to take 4k as basis as for max. name length.

Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com
Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1]
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
diff ef5305f1 Fri Sep 07 09:36:08 MDT 2018 Dominique Martinet <dominique.martinet@cea.fr> 9p: p9dirent_read: check network-provided name length

strcpy to dirent->d_name could overflow the buffer, use strscpy to check
the provided string length and error out if the size was too big.

While we are here, make the function return an error when the pdu
parsing failed, instead of returning the pdu offset as if it had been a
success...

Link: http://lkml.kernel.org/r/1536339057-21974-4-git-send-email-asmadeus@codewreck.org
Addresses-Coverity-ID: 139133 ("Copy into fixed size buffer")
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
diff 6da2ec56 Tue Jun 12 14:55:00 MDT 2018 Kees Cook <keescook@chromium.org> treewide: kmalloc() -> kmalloc_array()

The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

kmalloc(a * b, gfp)

with:
kmalloc_array(a * b, gfp)

as well as handling cases of:

kmalloc(a * b * c, gfp)

with:

kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
diff 4f3b35c1 Wed Apr 01 17:57:53 MDT 2015 Al Viro <viro@zeniv.linux.org.uk> net/9p: switch the guts of p9_client_{read,write}() to iov_iter

... and have get_user_pages_fast() mapping fewer pages than requested
to generate a short read/write.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 348b5901 Sat Aug 06 13:16:59 MDT 2011 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> net/9p: Convert net/9p protocol dumps to tracepoints

This helps in more control over debugging.
root@qemu-img-64:~# ls /pass/123
ls: cannot access /pass/123: No such file or directory
root@qemu-img-64:~# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
ls-1536 [001] 70.928584: 9p_protocol_dump: clnt 18446612132784021504 P9_TWALK(tag = 1)
000: 16 00 00 00 6e 01 00 01 00 00 00 02 00 00 00 01
010: 00 03 00 31 32 33 00 00 00 ff ff ff ff 00 00 00

ls-1536 [001] 70.928587: <stack trace>
=> trace_9p_protocol_dump
=> p9pdu_finalize
=> p9_client_rpc
=> p9_client_walk
=> v9fs_vfs_lookup
=> d_alloc_and_lookup
=> walk_component
=> path_lookupat
ls-1536 [000] 70.929696: 9p_protocol_dump: clnt 18446612132784021504 P9_RLERROR(tag = 1)
000: 0b 00 00 00 07 01 00 02 00 00 00 4e 03 00 02 00
010: 00 00 00 00 03 00 02 00 00 00 00 00 ff 43 00 00

ls-1536 [000] 70.929697: <stack trace>
=> trace_9p_protocol_dump
=> p9_client_rpc
=> p9_client_walk
=> v9fs_vfs_lookup
=> d_alloc_and_lookup
=> walk_component
=> path_lookupat
=> do_path_lookup

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol.

SYNOPSIS

size[4] Tsetattr tag[2] attr[n]

size[4] Rsetattr tag[2]

DESCRIPTION

The setattr command changes some of the file status information.
attr resembles the iattr structure used in Linux kernel. It
specifies which status parameter is to be changed and to what
value. It is laid out as follows:

valid[4]
specifies which status information is to be changed. Possible
values are:
ATTR_MODE (1 << 0)
ATTR_UID (1 << 1)
ATTR_GID (1 << 2)
ATTR_SIZE (1 << 3)
ATTR_ATIME (1 << 4)
ATTR_MTIME (1 << 5)
ATTR_ATIME_SET (1 << 7)
ATTR_MTIME_SET (1 << 8)

The last two bits represent whether the time information
is being sent by the client's user space. In the absense
of these bits the server always uses server's time.

mode[4]
File permission bits

uid[4]
Owner id of file

gid[4]
Group id of the file

size[8]
File size

atime_sec[8]
Time of last file access, seconds

atime_nsec[8]
Time of last file access, nanoseconds

mtime_sec[8]
Time of last file modification, seconds

mtime_nsec[8]
Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
p9_iattr_dotl structure and passes it down to the client. The
only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
parameters because I don't think these are needed in our case.
The client user space can request updating just ctime by calling
chown(fd, -1, -1). This is handled on server side without a need
for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
none of them does: if the request succeeds, all changes were made;
if it fails, none were."
I have not done anything to implement this specifically because I
don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol.

SYNOPSIS

size[4] Tsetattr tag[2] attr[n]

size[4] Rsetattr tag[2]

DESCRIPTION

The setattr command changes some of the file status information.
attr resembles the iattr structure used in Linux kernel. It
specifies which status parameter is to be changed and to what
value. It is laid out as follows:

valid[4]
specifies which status information is to be changed. Possible
values are:
ATTR_MODE (1 << 0)
ATTR_UID (1 << 1)
ATTR_GID (1 << 2)
ATTR_SIZE (1 << 3)
ATTR_ATIME (1 << 4)
ATTR_MTIME (1 << 5)
ATTR_ATIME_SET (1 << 7)
ATTR_MTIME_SET (1 << 8)

The last two bits represent whether the time information
is being sent by the client's user space. In the absense
of these bits the server always uses server's time.

mode[4]
File permission bits

uid[4]
Owner id of file

gid[4]
Group id of the file

size[8]
File size

atime_sec[8]
Time of last file access, seconds

atime_nsec[8]
Time of last file access, nanoseconds

mtime_sec[8]
Time of last file modification, seconds

mtime_nsec[8]
Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
p9_iattr_dotl structure and passes it down to the client. The
only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
parameters because I don't think these are needed in our case.
The client user space can request updating just ctime by calling
chown(fd, -1, -1). This is handled on server side without a need
for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
none of them does: if the request succeeds, all changes were made;
if it fails, none were."
I have not done anything to implement this specifically because I
don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol.

SYNOPSIS

size[4] Tsetattr tag[2] attr[n]

size[4] Rsetattr tag[2]

DESCRIPTION

The setattr command changes some of the file status information.
attr resembles the iattr structure used in Linux kernel. It
specifies which status parameter is to be changed and to what
value. It is laid out as follows:

valid[4]
specifies which status information is to be changed. Possible
values are:
ATTR_MODE (1 << 0)
ATTR_UID (1 << 1)
ATTR_GID (1 << 2)
ATTR_SIZE (1 << 3)
ATTR_ATIME (1 << 4)
ATTR_MTIME (1 << 5)
ATTR_ATIME_SET (1 << 7)
ATTR_MTIME_SET (1 << 8)

The last two bits represent whether the time information
is being sent by the client's user space. In the absense
of these bits the server always uses server's time.

mode[4]
File permission bits

uid[4]
Owner id of file

gid[4]
Group id of the file

size[8]
File size

atime_sec[8]
Time of last file access, seconds

atime_nsec[8]
Time of last file access, nanoseconds

mtime_sec[8]
Time of last file modification, seconds

mtime_nsec[8]
Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
p9_iattr_dotl structure and passes it down to the client. The
only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
parameters because I don't think these are needed in our case.
The client user space can request updating just ctime by calling
chown(fd, -1, -1). This is handled on server side without a need
for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
none of them does: if the request succeeds, all changes were made;
if it fails, none were."
I have not done anything to implement this specifically because I
don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol.

SYNOPSIS

size[4] Tsetattr tag[2] attr[n]

size[4] Rsetattr tag[2]

DESCRIPTION

The setattr command changes some of the file status information.
attr resembles the iattr structure used in Linux kernel. It
specifies which status parameter is to be changed and to what
value. It is laid out as follows:

valid[4]
specifies which status information is to be changed. Possible
values are:
ATTR_MODE (1 << 0)
ATTR_UID (1 << 1)
ATTR_GID (1 << 2)
ATTR_SIZE (1 << 3)
ATTR_ATIME (1 << 4)
ATTR_MTIME (1 << 5)
ATTR_ATIME_SET (1 << 7)
ATTR_MTIME_SET (1 << 8)

The last two bits represent whether the time information
is being sent by the client's user space. In the absense
of these bits the server always uses server's time.

mode[4]
File permission bits

uid[4]
Owner id of file

gid[4]
Group id of the file

size[8]
File size

atime_sec[8]
Time of last file access, seconds

atime_nsec[8]
Time of last file access, nanoseconds

mtime_sec[8]
Time of last file modification, seconds

mtime_nsec[8]
Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
p9_iattr_dotl structure and passes it down to the client. The
only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
parameters because I don't think these are needed in our case.
The client user space can request updating just ctime by calling
chown(fd, -1, -1). This is handled on server side without a need
for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
none of them does: if the request succeeds, all changes were made;
if it fails, none were."
I have not done anything to implement this specifically because I
don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol.

SYNOPSIS

size[4] Tsetattr tag[2] attr[n]

size[4] Rsetattr tag[2]

DESCRIPTION

The setattr command changes some of the file status information.
attr resembles the iattr structure used in Linux kernel. It
specifies which status parameter is to be changed and to what
value. It is laid out as follows:

valid[4]
specifies which status information is to be changed. Possible
values are:
ATTR_MODE (1 << 0)
ATTR_UID (1 << 1)
ATTR_GID (1 << 2)
ATTR_SIZE (1 << 3)
ATTR_ATIME (1 << 4)
ATTR_MTIME (1 << 5)
ATTR_ATIME_SET (1 << 7)
ATTR_MTIME_SET (1 << 8)

The last two bits represent whether the time information
is being sent by the client's user space. In the absense
of these bits the server always uses server's time.

mode[4]
File permission bits

uid[4]
Owner id of file

gid[4]
Group id of the file

size[8]
File size

atime_sec[8]
Time of last file access, seconds

atime_nsec[8]
Time of last file access, nanoseconds

mtime_sec[8]
Time of last file modification, seconds

mtime_nsec[8]
Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
p9_iattr_dotl structure and passes it down to the client. The
only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
parameters because I don't think these are needed in our case.
The client user space can request updating just ctime by calling
chown(fd, -1, -1). This is handled on server side without a need
for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
none of them does: if the request succeeds, all changes were made;
if it fails, none were."
I have not done anything to implement this specifically because I
don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol.

SYNOPSIS

size[4] Tsetattr tag[2] attr[n]

size[4] Rsetattr tag[2]

DESCRIPTION

The setattr command changes some of the file status information.
attr resembles the iattr structure used in Linux kernel. It
specifies which status parameter is to be changed and to what
value. It is laid out as follows:

valid[4]
specifies which status information is to be changed. Possible
values are:
ATTR_MODE (1 << 0)
ATTR_UID (1 << 1)
ATTR_GID (1 << 2)
ATTR_SIZE (1 << 3)
ATTR_ATIME (1 << 4)
ATTR_MTIME (1 << 5)
ATTR_ATIME_SET (1 << 7)
ATTR_MTIME_SET (1 << 8)

The last two bits represent whether the time information
is being sent by the client's user space. In the absense
of these bits the server always uses server's time.

mode[4]
File permission bits

uid[4]
Owner id of file

gid[4]
Group id of the file

size[8]
File size

atime_sec[8]
Time of last file access, seconds

atime_nsec[8]
Time of last file access, nanoseconds

mtime_sec[8]
Time of last file modification, seconds

mtime_nsec[8]
Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
p9_iattr_dotl structure and passes it down to the client. The
only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
parameters because I don't think these are needed in our case.
The client user space can request updating just ctime by calling
chown(fd, -1, -1). This is handled on server side without a need
for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
none of them does: if the request succeeds, all changes were made;
if it fails, none were."
I have not done anything to implement this specifically because I
don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol.

SYNOPSIS

size[4] Tsetattr tag[2] attr[n]

size[4] Rsetattr tag[2]

DESCRIPTION

The setattr command changes some of the file status information.
attr resembles the iattr structure used in Linux kernel. It
specifies which status parameter is to be changed and to what
value. It is laid out as follows:

valid[4]
specifies which status information is to be changed. Possible
values are:
ATTR_MODE (1 << 0)
ATTR_UID (1 << 1)
ATTR_GID (1 << 2)
ATTR_SIZE (1 << 3)
ATTR_ATIME (1 << 4)
ATTR_MTIME (1 << 5)
ATTR_ATIME_SET (1 << 7)
ATTR_MTIME_SET (1 << 8)

The last two bits represent whether the time information
is being sent by the client's user space. In the absense
of these bits the server always uses server's time.

mode[4]
File permission bits

uid[4]
Owner id of file

gid[4]
Group id of the file

size[8]
File size

atime_sec[8]
Time of last file access, seconds

atime_nsec[8]
Time of last file access, nanoseconds

mtime_sec[8]
Time of last file modification, seconds

mtime_nsec[8]
Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
p9_iattr_dotl structure and passes it down to the client. The
only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
parameters because I don't think these are needed in our case.
The client user space can request updating just ctime by calling
chown(fd, -1, -1). This is handled on server side without a need
for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
none of them does: if the request succeeds, all changes were made;
if it fails, none were."
I have not done anything to implement this specifically because I
don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
/linux-master/net/sunrpc/
H A Drpcb_clnt.cdiff 4f3ed837 Wed Oct 11 02:00:22 MDT 2023 Dan Carpenter <dan.carpenter@linaro.org> SUNRPC: Add an IS_ERR() check back to where it was

This IS_ERR() check was deleted during in a cleanup because, at the time,
the rpcb_call_async() function could not return an error pointer. That
changed in commit 25cf32ad5dba ("SUNRPC: Handle allocation failure in
rpc_new_task()") and now it can return an error pointer. Put the check
back.

A related revert was done in commit 13bd90141804 ("Revert "SUNRPC:
Remove unreachable error condition"").

Fixes: 037e910b52b0 ("SUNRPC: Remove unreachable error condition in rpcb_getport_async()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
diff 81c88b18 Thu Dec 20 08:35:11 MST 2018 J. Bruce Fields <bfields@redhat.com> sunrpc: handle ENOMEM in rpcb_getport_async

If we ignore the error we'll hit a null dereference a little later.

Reported-by: syzbot+4b98281f2401ab849f4b@syzkaller.appspotmail.com
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
diff 4e0038b6 Thu Mar 01 15:01:05 MST 2012 Trond Myklebust <Trond.Myklebust@netapp.com> SUNRPC: Move clnt->cl_server into struct rpc_xprt

When the cl_xprt field is updated, the cl_server field will also have
to change. Since the contents of cl_server follow the remote endpoint
of cl_xprt, just move that field to the rpc_xprt.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ cel: simplify check_gss_callback_principal(), whitespace changes ]
[ cel: forward ported to 3.4 ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
diff c526611d Thu Dec 03 13:58:56 MST 2009 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Use a cached RPC client and transport for rpcbind upcalls

The kernel's rpcbind client creates and deletes an rpc_clnt and its
underlying transport socket for every upcall to the local rpcbind
daemon.

When starting a typical NFS server on IPv4 and IPv6, the NFS service
itself does three upcalls (one per version) times two upcalls (one
per transport) times two upcalls (one per address family), making 12,
plus another one for the initial call to unregister previous NFS
services. Starting the NLM service adds an additional 13 upcalls,
for similar reasons.

(Currently the NFS service doesn't start IPv6 listeners, but it will
soon enough).

Instead, let's create an rpc_clnt for rpcbind upcalls during the
first local rpcbind query, and cache it. This saves the overhead of
creating and destroying an rpc_clnt and a socket for every upcall.

The new logic also prevents the kernel from attempting an RPCB_SET or
RPCB_UNSET if it knows from the start that the local portmapper does
not support rpcbind protocol version 4. This will cut down on the
number of rpcbind upcalls in legacy environments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon

Introduce a new API to register RPC services on IPv6 interfaces to allow
the NFS server and lockd to advertise on IPv6 networks.

Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind
protocol version 4 to contact the local rpcbind daemon. The version 4
SET/UNSET procedures allow services to register address families besides
AF_INET, register at specific network interfaces, and register transport
protocols besides UDP and TCP. All of this functionality is exposed via
the new rpcb_v4_register() kernel API.

A user-space rpcbind daemon implementation that supports version 4 of the
rpcbind protocol is required in order to make use of this new API.

Note that rpcbind version 3 is sufficient to support the new rpcbind
facilities listed above, but most extant implementations use version 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon

Introduce a new API to register RPC services on IPv6 interfaces to allow
the NFS server and lockd to advertise on IPv6 networks.

Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind
protocol version 4 to contact the local rpcbind daemon. The version 4
SET/UNSET procedures allow services to register address families besides
AF_INET, register at specific network interfaces, and register transport
protocols besides UDP and TCP. All of this functionality is exposed via
the new rpcb_v4_register() kernel API.

A user-space rpcbind daemon implementation that supports version 4 of the
rpcbind protocol is required in order to make use of this new API.

Note that rpcbind version 3 is sufficient to support the new rpcbind
facilities listed above, but most extant implementations use version 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon

Introduce a new API to register RPC services on IPv6 interfaces to allow
the NFS server and lockd to advertise on IPv6 networks.

Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind
protocol version 4 to contact the local rpcbind daemon. The version 4
SET/UNSET procedures allow services to register address families besides
AF_INET, register at specific network interfaces, and register transport
protocols besides UDP and TCP. All of this functionality is exposed via
the new rpcb_v4_register() kernel API.

A user-space rpcbind daemon implementation that supports version 4 of the
rpcbind protocol is required in order to make use of this new API.

Note that rpcbind version 3 is sufficient to support the new rpcbind
facilities listed above, but most extant implementations use version 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon

Introduce a new API to register RPC services on IPv6 interfaces to allow
the NFS server and lockd to advertise on IPv6 networks.

Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind
protocol version 4 to contact the local rpcbind daemon. The version 4
SET/UNSET procedures allow services to register address families besides
AF_INET, register at specific network interfaces, and register transport
protocols besides UDP and TCP. All of this functionality is exposed via
the new rpcb_v4_register() kernel API.

A user-space rpcbind daemon implementation that supports version 4 of the
rpcbind protocol is required in order to make use of this new API.

Note that rpcbind version 3 is sufficient to support the new rpcbind
facilities listed above, but most extant implementations use version 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
/linux-master/net/dccp/
H A Doptions.cdiff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
diff f17a37c9 Wed Nov 10 13:20:07 MST 2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Ack Vector interface clean-up

This patch brings the Ack Vector interface up to date. Its main purpose is
to lay the basis for the subsequent patches of this set, which will use the
new data structure fields and routines.

There are no real algorithmic changes, rather an adaptation:

(1) Replaced the static Ack Vector size (2) with a #define so that it can
be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems
to be sufficient for the moment) and added a solution so that computing
the ECN nonce will continue to work - even with larger Ack Vectors.

(2) Replaced the #defines for Ack Vector states with a complete enum.

(3) Replaced #defines to compute Ack Vector length and state with general
purpose routines (inlines), and updated code to use these.

(4) Added a `tail' field (conversion to circular buffer in subsequent patch).

(5) Updated the (outdated) documentation for Ack Vector struct.

(6) All sequence number containers now trimmed to 48 bits.

(7) Removal of unused bits:
* removed dccpav_ack_nonce from struct dccp_ackvec, since this is already
redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record);
* removed Elapsed Time for Ack Vectors (it was nowhere used);
* replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since
the code needs to be able to remember the old run length;
* reduced the de-/allocation routines (redundant / duplicate tests).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
diff 59b80802 Mon Jun 21 19:14:35 MDT 2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: make implementation of Syn-RTT symmetric

This patch is thanks to Andre Noll who reported the issue and helped testing.

The Syn-RTT sampled during the initial handshake currently only works for
the client sending the DCCP-Request. TFRC penalizes the absence of an RTT
sample with a very slow initial speed (1 packet per second), which delays
slow-start significantly, resulting in sluggish performance.

This patch mirrors the "Syn RTT" principle by adding a timestamp also onto
the DCCP-Response, producing an RTT sample when the (Data)Ack completing
the handshake arrives.

Also changed the documentation to 'TFRC' since Syn RTTs are also used by CCID-4.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 883ca833 Fri Jan 16 16:36:32 MST 2009 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Initialisation and type-checking of feature sysctls

This patch takes care of initialising and type-checking sysctls
related to feature negotiation. Type checking is important since some
of the sysctls now directly impact the feature-negotiation process.

The sysctls are initialised with the known default values for each
feature. For the type-checking the value constraints from RFC 4340
are used:

* Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes),
tested and confirmed that it works up to 4294967295 - for Gbps speed;
* Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer);
* CCIDs are between 0 .. 255;
* request_retries, retries1, retries2 also between 0..255 for good measure;
* tx_qlen is checked to be non-negative;
* sync_ratelimit remains as before.

Notes:
------
1. Die s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c.
2. As pointed out by Arnaldo, the pattern of type-checking repeats itself in
other places, sometimes with exactly the same kind of definitions (e.g.
"static int zero;"). It may be a good idea (kernel janitors?) to consolidate
type checking. For the sake of keeping the changeset small and in order not
to affect other subsystems, I have not strived to generalise here.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 6fdd34d4 Mon Dec 08 02:19:06 MST 2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Phase out the use of boolean Ack Vector sysctl

This removes the use of the sysctl and the minisock variable for the Send Ack
Vector feature, as it now is handled fully dynamically via feature negotiation
(i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
RFC 4341, 4.).

Using a sysctl in parallel to this implementation would open the door to
crashes, since much of the code relies on tests of the boolean minisock /
sysctl variable. Thus, this patch replaces all tests of type

if (dccp_msk(sk)->dccpms_send_ack_vector)
/* ... */
with
if (dp->dccps_hc_rx_ackvec != NULL)
/* ... */

The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
negotiation concluded that Ack Vectors are to be used on the half-connection.
Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
so that the test is a valid one.

The activation handler for Ack Vectors is called as soon as the feature
negotiation has concluded at the
* server when the Ack marking the transition RESPOND => OPEN arrives;
* client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

Adding the sequence number of the Response packet to the Ack Vector has been
removed, since
(a) connection establishment implies that the Response has been received;
(b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
this entry will always be ignored;
(c) it can not be used for anything useful - to detect loss for instance, only
packets received after the loss can serve as pseudo-dupacks.

There was a FIXME to change the error code when dccp_ackvec_add() fails.
I removed this after finding out that:
* the check whether ackno < ISN is already made earlier,
* this Response is likely the 1st packet with an Ackno that the client gets,
* so when dccp_ackvec_add() fails, the reason is likely not a packet error.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff dd9c0e36 Sun Nov 16 23:55:08 MST 2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Deprecate Ack Ratio sysctl

This patch deprecates the Ack Ratio sysctl, since
* Ack Ratio is entirely ignored by CCID-3 and CCID-4,
* Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
* even if it would work in CCID-2, there is no point for a user to change it:
- Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
- if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
(since waiting for Acks which will never arrive in this window),
- cwnd is not a user-configurable value.

The only reasonable place for Ack Ratio is to print it for debugging. It is
planned to do this later on, as part of e.g. dccp_probe.

With this patch Ack Ratio is now under full control of feature negotiation:
* Ack Ratio is resolved as a dependency of the selected CCID;
* if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
Ratio 2 for both endpoints";
* what happens then is part of another patch set, since it concerns the
dynamic update of Ack Ratio while the connection is in full flight.

Thanks to Tomasz Grobelny for discussion leading up to this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 68b1de15 Wed Sep 03 23:30:19 MDT 2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Algorithm to update buffer state

This provides a routine to consistently update the buffer state when the
peer acknowledges receipt of Ack Vectors; updating state in the list of Ack
Vectors as well as in the circular buffer.

While based on RFC 4340, several additional (and necessary) precautions were
added to protect the consistency of the buffer state. These additions are
essential, since analysis and experience showed that the basic algorithm was
insufficient for this task (which lead to problems that were hard to debug).

The algorithm now
* deals with HC-sender acknowledging to HC-receiver and vice versa,
* keeps track of the last unacknowledged but received seqno in tail_ackno,
* has special cases to reset the overflow condition when appropriate,
* is protected against receiving older information (would mess up buffer state).

Note: The older code performed an unnecessary step, where the sender cleared
Ack Vector state by parsing the Ack Vector received by the HC-receiver. Doing
this was entirely redundant, since
* the receiver always puts the full acknowledgment window (groups 2,3 in 11.4.2)
into the Ack Vectors it sends; hence the HC-receiver is only interested in the
highest state that the HC-sender received;
* this means that the acknowledgment number on the (Data)Ack from the HC-sender
is sufficient; and work done in parsing earlier state is not necessary, since
the later state subsumes the earlier one (see also RFC 4340, A.4).
This older interface (dccp_ackvec_parse()) is therefore removed.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
/linux-master/ipc/
H A Dmsg.cdiff 18319498 Thu Sep 02 15:55:31 MDT 2021 Vasily Averin <vvs@virtuozzo.com> memcg: enable accounting of ipc resources

When user creates IPC objects it forces kernel to allocate memory for
these long-living objects.

It makes sense to account them to restrict the host's memory consumption
from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, messages
semaphores and semaphore's undo lists.

Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry

do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call pipelined_send.

This leads to a very hard to trigger race where a do_mq_timedreceive
call might return and leave do_mq_timedsend to rely on an invalid
address, causing the following crash:

RIP: 0010:wake_q_add_safe+0x13/0x60
Call Trace:
__x64_sys_mq_timedsend+0x2a9/0x490
do_syscall_64+0x80/0x680
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5928e40343

The race occurs as:

1. do_mq_timedreceive calls wq_sleep with the address of `struct
ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
holds a valid `struct ext_wait_queue *` as long as the stack has not
been overwritten.

2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.

3. Sender calls __pipelined_op::smp_store_release(&this->state,
STATE_READY). Here is where the race window begins. (`this` is
`ewq_addr`.)

4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break.

5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)

6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass to
the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.

do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add_safe on the receiver's
task_struct returned by get_task_struct, instead of dereferencing `this`
which sits on the receiver's stack.

As Manfred pointed out, the race potentially also exists in
ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix
those in the same way.

Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers")
Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam@suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4b78e201 Sun Jun 07 22:40:07 MDT 2020 Jules Irenge <jbi.octave@gmail.com> ipc/msg: add missing annotation for freeque()

Sparse reports a warning at freeque()

warning: context imbalance in freeque() - unexpected unlock

The root cause is the missing annotation at freeque()

Add the missing __releases(RCU) annotation
Add the missing __releases(&msq->q_perm) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lu Shuaibing <shuaibinglu@126.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: http://lkml.kernel.org/r/20200403160505.2832-2-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions

A use of uninitialized memory in msgctl_down() because msqid64 in
ksys_msgctl hasn't been initialized. The local | msqid64 | is created in
ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
is never initialized before msgctl_down() checks msqid64->msg_qbytes.

KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
reports:

==================================================================
BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x75/0xae
__kumsan_report+0x17c/0x3e6
kumsan_report+0xe/0x20
msgctl_down+0x94/0x300
ksys_msgctl.constprop.14+0xef/0x260
do_syscall_64+0x7e/0x1f0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4400e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================

Syzkaller reproducer:
msgctl$IPC_RMID(0x0, 0x0)

C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
syscall(__NR_msgctl, 0, 0, 0);
return 0;
}

[natechancellor@gmail.com: adjust indentation in ksys_msgctl]
Link: https://github.com/ClangBuiltLinux/linux/issues/829
Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing <shuaibinglu@126.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: drivers/block/null_blk_main.c: fix layout

Each line here overflows 80 cols by exactly one character. Delete one tab
per line to fix.

Cc: Shaohua Li <shli@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions

A use of uninitialized memory in msgctl_down() because msqid64 in
ksys_msgctl hasn't been initialized. The local | msqid64 | is created in
ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
is never initialized before msgctl_down() checks msqid64->msg_qbytes.

KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
reports:

==================================================================
BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x75/0xae
__kumsan_report+0x17c/0x3e6
kumsan_report+0xe/0x20
msgctl_down+0x94/0x300
ksys_msgctl.constprop.14+0xef/0x260
do_syscall_64+0x7e/0x1f0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4400e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================

Syzkaller reproducer:
msgctl$IPC_RMID(0x0, 0x0)

C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
syscall(__NR_msgctl, 0, 0, 0);
return 0;
}

[natechancellor@gmail.com: adjust indentation in ksys_msgctl]
Link: https://github.com/ClangBuiltLinux/linux/issues/829
Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing <shuaibinglu@126.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: drivers/block/null_blk_main.c: fix layout

Each line here overflows 80 cols by exactly one character. Delete one tab
per line to fix.

Cc: Shaohua Li <shli@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions

A use of uninitialized memory in msgctl_down() because msqid64 in
ksys_msgctl hasn't been initialized. The local | msqid64 | is created in
ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
is never initialized before msgctl_down() checks msqid64->msg_qbytes.

KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
reports:

==================================================================
BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x75/0xae
__kumsan_report+0x17c/0x3e6
kumsan_report+0xe/0x20
msgctl_down+0x94/0x300
ksys_msgctl.constprop.14+0xef/0x260
do_syscall_64+0x7e/0x1f0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4400e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================

Syzkaller reproducer:
msgctl$IPC_RMID(0x0, 0x0)

C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
syscall(__NR_msgctl, 0, 0, 0);
return 0;
}

[natechancellor@gmail.com: adjust indentation in ksys_msgctl]
Link: https://github.com/ClangBuiltLinux/linux/issues/829
Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing <shuaibinglu@126.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: drivers/block/null_blk_main.c: fix layout

Each line here overflows 80 cols by exactly one character. Delete one tab
per line to fix.

Cc: Shaohua Li <shli@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions

A use of uninitialized memory in msgctl_down() because msqid64 in
ksys_msgctl hasn't been initialized. The local | msqid64 | is created in
ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64
is never initialized before msgctl_down() checks msqid64->msg_qbytes.

KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
reports:

==================================================================
BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x75/0xae
__kumsan_report+0x17c/0x3e6
kumsan_report+0xe/0x20
msgctl_down+0x94/0x300
ksys_msgctl.constprop.14+0xef/0x260
do_syscall_64+0x7e/0x1f0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4400e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================

Syzkaller reproducer:
msgctl$IPC_RMID(0x0, 0x0)

C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
syscall(__NR_msgctl, 0, 0, 0);
return 0;
}

[natechancellor@gmail.com: adjust indentation in ksys_msgctl]
Link: https://github.com/ClangBuiltLinux/linux/issues/829
Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing <shuaibinglu@126.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: drivers/block/null_blk_main.c: fix layout

Each line here overflows 80 cols by exactly one character. Delete one tab
per line to fix.

Cc: Shaohua Li <shli@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 39cfffd7 Tue Aug 21 23:01:29 MDT 2018 Manfred Spraul <manfred@colorfullife.com> ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()

ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early, and by
modifying all callers.

The issues is related to the finding of
syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
issue with reading kern_ipc_perm.seq, here both read and write to already
released memory could happen.

Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 23c8cec8 Tue Apr 10 17:35:30 MDT 2018 Davidlohr Bueso <dave@stgolabs.net> ipc/msg: introduce msgctl(MSG_STAT_ANY)

There is a permission discrepancy when consulting msq ipc object
metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl
command. The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the msq metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it. Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs). Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new MSG_STAT_ANY command such that the msq ipc
object permissions are ignored, and only audited instead. In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
H A Dutil.cdiff 20401d10 Tue Sep 07 21:00:53 MDT 2021 Rafael Aquini <aquini@redhat.com> ipc: replace costly bailout check in sysvipc_find_ipc()

sysvipc_find_ipc() was left with a costly way to check if the offset
position fed to it is bigger than the total number of IPC IDs in use. So
much so that the time it takes to iterate over /proc/sysvipc/* files grows
exponentially for a custom benchmark that creates "N" SYSV shm segments
and then times the read of /proc/sysvipc/shm (milliseconds):

12 msecs to read 1024 segs from /proc/sysvipc/shm
18 msecs to read 2048 segs from /proc/sysvipc/shm
65 msecs to read 4096 segs from /proc/sysvipc/shm
325 msecs to read 8192 segs from /proc/sysvipc/shm
1303 msecs to read 16384 segs from /proc/sysvipc/shm
5182 msecs to read 32768 segs from /proc/sysvipc/shm

The root problem lies with the loop that computes the total amount of ids
in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than
"ids->in_use". That is a quite inneficient way to get to the maximum
index in the id lookup table, specially when that value is already
provided by struct ipc_ids.max_idx.

This patch follows up on the optimization introduced via commit
15df03c879836 ("sysvipc: make get_maxid O(1) again") and gets rid of the
aforementioned costly loop replacing it by a simpler checkpoint based on
ipc_get_maxidx() returned value, which allows for a smooth linear increase
in time complexity for the same custom benchmark:

2 msecs to read 1024 segs from /proc/sysvipc/shm
2 msecs to read 2048 segs from /proc/sysvipc/shm
4 msecs to read 4096 segs from /proc/sysvipc/shm
9 msecs to read 8192 segs from /proc/sysvipc/shm
19 msecs to read 16384 segs from /proc/sysvipc/shm
39 msecs to read 32768 segs from /proc/sysvipc/shm

Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <llong@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff d919b33d Mon Apr 06 21:09:01 MDT 2020 Alexey Dobriyan <adobriyan@gmail.com> proc: faster open/read/close with "permanent" files

Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...

Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.

Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.

Enable "permanent" flag for

/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps

More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.

This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".

N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%

Benchmark source:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;

int xxx = 0;

int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}

void f(int n)
{
glue(n % NR_CPUS);

while (*(volatile int *)&xxx == 0) {
}

for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}

int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}

N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);

for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}

std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}

auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';

return 0;
}

P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.

[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 39cfffd7 Tue Aug 21 23:01:29 MDT 2018 Manfred Spraul <manfred@colorfullife.com> ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()

ipc_addid() is impossible to use:
- for certain failures, the caller must not use ipc_rcu_putref(),
because the reference counter is not yet initialized.
- for other failures, the caller must use ipc_rcu_putref(),
because parallel operations could be ongoing already.

The patch cleans that up, by initializing the refcount early, and by
modifying all callers.

The issues is related to the finding of
syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an
issue with reading kern_ipc_perm.seq, here both read and write to already
released memory could happen.

Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff ebf66799 Fri Nov 17 16:31:15 MST 2017 Davidlohr Bueso <dave@stgolabs.net> sysvipc: properly name ipc_addid() limit parameter

This is better understood as a limit, instead of size; exactly like the
function comment indicates. Rename it.

Link: http://lkml.kernel.org/r/20170831172049.14576-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff b8fd9983 Fri Nov 17 16:31:08 MST 2017 Davidlohr Bueso <dave@stgolabs.net> sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE

Patch series "sysvipc: ipc-key management improvements".

Here are a few improvements I spotted while eyeballing Guillaume's
rhashtable implementation for ipc keys. The first and fourth patches
are the interesting ones, the middle two are trivial.

This patch (of 4):

The next_id object-allocation functionality was introduced in commit
03f595668017 ("ipc: add sysctl to specify desired next object id").

Given that these new entries are _only_ exported under the
CONFIG_CHECKPOINT_RESTORE option, there is no point for the common case
to even know about ->next_id. As such rewrite ipc_buildid() such that
it can do away with the field as well as unnecessary branches when
adding a new identifier. The end result also better differentiates both
cases, so the code ends up being cleaner; albeit the small duplications
regarding the default case.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170831172049.14576-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.

For non */uapi/* files that summary was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139

and resulted in the first patch in this series.

If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930

and resulted in the second patch in this series.

- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:

SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1

and that resulted in the third patch in this series.

- when the two scanners agreed on the detected license(s), that became
the concluded license(s).

- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.

- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).

- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.

- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
diff 9405c03e Fri Sep 08 17:17:45 MDT 2017 Elena Reshetova <elena.reshetova@intel.com> ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t

refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This allows to avoid
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/1499417992-3238-4-git-send-email-elena.reshetova@intel.com
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: <arozansk@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4d2bff5e Tue Apr 30 20:15:19 MDT 2013 Davidlohr Bueso <davidlohr.bueso@hp.com> ipc: introduce obtaining a lockless ipc object

Through ipc_lock() and therefore ipc_lock_check() we currently return the
locked ipc object. This is not necessary for all situations and can,
therefore, cause unnecessary ipc lock contention.

Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
functions that only lookup and return the ipc object.

Both these functions must be called within the RCU read critical section.

[akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()]
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Emmanuel Benisty <benisty.e@gmail.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/security/keys/
H A Dkeyring.cdiff 2e12256b Thu Jun 27 16:03:07 MDT 2019 David Howells <dhowells@redhat.com> keys: Replace uid/gid/perm permissions checking with an ACL

Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split. This will also allow a
greater range of subjects to represented.

============
WHY DO THIS?
============

The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.

For SETATTR, this includes actions that are about controlling access to a
key:

(1) Changing a key's ownership.

(2) Changing a key's security information.

(3) Setting a keyring's restriction.

And actions that are about managing a key's lifetime:

(4) Setting an expiry time.

(5) Revoking a key.

and (proposed) managing a key as part of a cache:

(6) Invalidating a key.

Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.

Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways goes better with WRITE permission. It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.

As for SEARCH permission, that currently covers:

(1) Finding keys in a keyring tree during a search.

(2) Permitting keyrings to be joined.

(3) Invalidation.

But these don't really belong together either, since these actions really
need to be controlled separately.

Finally, there are number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.


===============
WHAT IS CHANGED
===============

The SETATTR permission is split to create two new permissions:

(1) SET_SECURITY - which allows the key's owner, group and ACL to be
changed and a restriction to be placed on a keyring.

(2) REVOKE - which allows a key to be revoked.

The SEARCH permission is split to create:

(1) SEARCH - which allows a keyring to be search and a key to be found.

(2) JOIN - which allows a keyring to be joined as a session keyring.

(3) INVAL - which allows a key to be invalidated.

The WRITE permission is also split to create:

(1) WRITE - which allows a key's content to be altered and links to be
added, removed and replaced in a keyring.

(2) CLEAR - which allows a keyring to be cleared completely. This is
split out to make it possible to give just this to an administrator.

(3) REVOKE - see above.


Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together. An ACE specifies a subject, such as:

(*) Possessor - permitted to anyone who 'possesses' a key
(*) Owner - permitted to the key owner
(*) Group - permitted to the key group
(*) Everyone - permitted to everyone

Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.

Further subjects may be made available by later patches.

The ACE also specifies a permissions mask. The set of permissions is now:

VIEW Can view the key metadata
READ Can read the key content
WRITE Can update/modify the key content
SEARCH Can find the key by searching/requesting
LINK Can make a link to the key
SET_SECURITY Can change owner, ACL, expiry
INVAL Can invalidate
REVOKE Can revoke
JOIN Can join this keyring
CLEAR Can clear this keyring


The KEYCTL_SETPERM function is then deprecated.

The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.

The KEYCTL_INVALIDATE function then requires INVAL.

The KEYCTL_REVOKE function then requires REVOKE.

The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.

The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.


======================
BACKWARD COMPATIBILITY
======================

To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.

It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.

SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned
on if a keyring is being altered.

The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.

It will make the following mappings:

(1) INVAL, JOIN -> SEARCH

(2) SET_SECURITY -> SETATTR

(3) REVOKE -> WRITE if SETATTR isn't already set

(4) CLEAR -> WRITE

Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETATTR.


=======
TESTING
=======

This passes the keyutils testsuite for all but a couple of tests:

(1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
if the type doesn't have ->read(). You still can't actually read the
key.

(2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
work as Other has been replaced with Everyone in the ACL.

Signed-off-by: David Howells <dhowells@redhat.com>
diff 2e12256b Thu Jun 27 16:03:07 MDT 2019 David Howells <dhowells@redhat.com> keys: Replace uid/gid/perm permissions checking with an ACL

Replace the uid/gid/perm permissions checking on a key with an ACL to allow
the SETATTR and SEARCH permissions to be split. This will also allow a
greater range of subjects to represented.

============
WHY DO THIS?
============

The problem is that SETATTR and SEARCH cover a slew of actions, not all of
which should be grouped together.

For SETATTR, this includes actions that are about controlling access to a
key:

(1) Changing a key's ownership.

(2) Changing a key's security information.

(3) Setting a keyring's restriction.

And actions that are about managing a key's lifetime:

(4) Setting an expiry time.

(5) Revoking a key.

and (proposed) managing a key as part of a cache:

(6) Invalidating a key.

Managing a key's lifetime doesn't really have anything to do with
controlling access to that key.

Expiry time is awkward since it's more about the lifetime of the content
and so, in some ways goes better with WRITE permission. It can, however,
be set unconditionally by a process with an appropriate authorisation token
for instantiating a key, and can also be set by the key type driver when a
key is instantiated, so lumping it with the access-controlling actions is
probably okay.

As for SEARCH permission, that currently covers:

(1) Finding keys in a keyring tree during a search.

(2) Permitting keyrings to be joined.

(3) Invalidation.

But these don't really belong together either, since these actions really
need to be controlled separately.

Finally, there are number of special cases to do with granting the
administrator special rights to invalidate or clear keys that I would like
to handle with the ACL rather than key flags and special checks.


===============
WHAT IS CHANGED
===============

The SETATTR permission is split to create two new permissions:

(1) SET_SECURITY - which allows the key's owner, group and ACL to be
changed and a restriction to be placed on a keyring.

(2) REVOKE - which allows a key to be revoked.

The SEARCH permission is split to create:

(1) SEARCH - which allows a keyring to be search and a key to be found.

(2) JOIN - which allows a keyring to be joined as a session keyring.

(3) INVAL - which allows a key to be invalidated.

The WRITE permission is also split to create:

(1) WRITE - which allows a key's content to be altered and links to be
added, removed and replaced in a keyring.

(2) CLEAR - which allows a keyring to be cleared completely. This is
split out to make it possible to give just this to an administrator.

(3) REVOKE - see above.


Keys acquire ACLs which consist of a series of ACEs, and all that apply are
unioned together. An ACE specifies a subject, such as:

(*) Possessor - permitted to anyone who 'possesses' a key
(*) Owner - permitted to the key owner
(*) Group - permitted to the key group
(*) Everyone - permitted to everyone

Note that 'Other' has been replaced with 'Everyone' on the assumption that
you wouldn't grant a permit to 'Other' that you wouldn't also grant to
everyone else.

Further subjects may be made available by later patches.

The ACE also specifies a permissions mask. The set of permissions is now:

VIEW Can view the key metadata
READ Can read the key content
WRITE Can update/modify the key content
SEARCH Can find the key by searching/requesting
LINK Can make a link to the key
SET_SECURITY Can change owner, ACL, expiry
INVAL Can invalidate
REVOKE Can revoke
JOIN Can join this keyring
CLEAR Can clear this keyring


The KEYCTL_SETPERM function is then deprecated.

The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set,
or if the caller has a valid instantiation auth token.

The KEYCTL_INVALIDATE function then requires INVAL.

The KEYCTL_REVOKE function then requires REVOKE.

The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
existing keyring.

The JOIN permission is enabled by default for session keyrings and manually
created keyrings only.


======================
BACKWARD COMPATIBILITY
======================

To maintain backward compatibility, KEYCTL_SETPERM will translate the
permissions mask it is given into a new ACL for a key - unless
KEYCTL_SET_ACL has been called on that key, in which case an error will be
returned.

It will convert possessor, owner, group and other permissions into separate
ACEs, if each portion of the mask is non-zero.

SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE
permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned
on if a keyring is being altered.

The KEYCTL_DESCRIBE function translates the ACL back into a permissions
mask to return depending on possessor, owner, group and everyone ACEs.

It will make the following mappings:

(1) INVAL, JOIN -> SEARCH

(2) SET_SECURITY -> SETATTR

(3) REVOKE -> WRITE if SETATTR isn't already set

(4) CLEAR -> WRITE

Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
the value set with KEYCTL_SETATTR.


=======
TESTING
=======

This passes the keyutils testsuite for all but a couple of tests:

(1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
if the type doesn't have ->read(). You still can't actually read the
key.

(2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
work as Other has been replaced with Everyone in the ACL.

Signed-off-by: David Howells <dhowells@redhat.com>
diff b206f281 Wed Jun 26 14:02:32 MDT 2019 David Howells <dhowells@redhat.com> keys: Namespace keyring names

Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission). This isn't very container friendly, however.

Make the following changes:

(1) Make default session, process and thread keyring names begin with a
'.' instead of '_'.

(2) Keyrings whose names begin with a '.' aren't added to the list. Such
keyrings are system specials.

(3) Replace the global list with per-user_namespace lists. A keyring adds
its name to the list for the user_namespace that it is currently in.

(4) When a user_namespace is deleted, it just removes itself from the
keyring name list.

The global keyring_name_lock is retained for accessing the name lists.
This allows (4) to work.

This can be tested by:

# keyctl newring foo @s
995906392
# unshare -U
$ keyctl show
...
995906392 --alswrv 65534 65534 \_ keyring: foo
...
$ keyctl session foo
Joined session keyring: 935622349

As can be seen, a new session keyring was created.

The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
employing this feature.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
diff b206f281 Wed Jun 26 14:02:32 MDT 2019 David Howells <dhowells@redhat.com> keys: Namespace keyring names

Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission). This isn't very container friendly, however.

Make the following changes:

(1) Make default session, process and thread keyring names begin with a
'.' instead of '_'.

(2) Keyrings whose names begin with a '.' aren't added to the list. Such
keyrings are system specials.

(3) Replace the global list with per-user_namespace lists. A keyring adds
its name to the list for the user_namespace that it is currently in.

(4) When a user_namespace is deleted, it just removes itself from the
keyring name list.

The global keyring_name_lock is retained for accessing the name lists.
This allows (4) to work.

This can be tested by:

# keyctl newring foo @s
995906392
# unshare -U
$ keyctl show
...
995906392 --alswrv 65534 65534 \_ keyring: foo
...
$ keyctl session foo
Joined session keyring: 935622349

As can be seen, a new session keyring was created.

The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
employing this feature.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
diff 363b02da Wed Oct 04 09:43:25 MDT 2017 David Howells <dhowells@redhat.com> KEYS: Fix race between updating and finding a negative key

Consolidate KEY_FLAG_INSTANTIATED, KEY_FLAG_NEGATIVE and the rejection
error into one field such that:

(1) The instantiation state can be modified/read atomically.

(2) The error can be accessed atomically with the state.

(3) The error isn't stored unioned with the payload pointers.

This deals with the problem that the state is spread over three different
objects (two bits and a separate variable) and reading or updating them
atomically isn't practical, given that not only can uninstantiated keys
change into instantiated or rejected keys, but rejected keys can also turn
into instantiated keys - and someone accessing the key might not be using
any locking.

The main side effect of this problem is that what was held in the payload
may change, depending on the state. For instance, you might observe the
key to be in the rejected state. You then read the cached error, but if
the key semaphore wasn't locked, the key might've become instantiated
between the two reads - and you might now have something in hand that isn't
actually an error code.

The state is now KEY_IS_UNINSTANTIATED, KEY_IS_POSITIVE or a negative error
code if the key is negatively instantiated. The key_is_instantiated()
function is replaced with key_is_positive() to avoid confusion as negative
keys are also 'instantiated'.

Additionally, barriering is included:

(1) Order payload-set before state-set during instantiation.

(2) Order state-read before payload-read when using the key.

Further separate barriering is necessary if RCU is being used to access the
payload content after reading the payload pointers.

Fixes: 146aa8b1453b ("KEYS: Merge the type-specific data with the payload data")
Cc: stable@vger.kernel.org # v4.4+
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
diff 237bbd29 Mon Sep 18 12:37:03 MDT 2017 Eric Biggers <ebiggers@google.com> KEYS: prevent creating a different user's keyrings

It was possible for an unprivileged user to create the user and user
session keyrings for another user. For example:

sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u
keyctl add keyring _uid_ses.4000 "" @u
sleep 15' &
sleep 1
sudo -u '#4000' keyctl describe @u
sudo -u '#4000' keyctl describe @us

This is problematic because these "fake" keyrings won't have the right
permissions. In particular, the user who created them first will own
them and will have full access to them via the possessor permissions,
which can be used to compromise the security of a user's keys:

-4: alswrv-----v------------ 3000 0 keyring: _uid.4000
-5: alswrv-----v------------ 3000 0 keyring: _uid_ses.4000

Fix it by marking user and user session keyrings with a flag
KEY_FLAG_UID_KEYRING. Then, when searching for a user or user session
keyring by name, skip all keyrings that don't have the flag set.

Fixes: 69664cf16af4 ("keys: don't generate user and user session keyrings unless they're accessed")
Cc: <stable@vger.kernel.org> [v2.6.26+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
diff 5ac7eace Wed Apr 06 09:14:24 MDT 2016 David Howells <dhowells@redhat.com> KEYS: Add a facility to restrict new links into a keyring

Add a facility whereby proposed new links to be added to a keyring can be
vetted, permitting them to be rejected if necessary. This can be used to
block public keys from which the signature cannot be verified or for which
the signature verification fails. It could also be used to provide
blacklisting.

This affects operations like add_key(), KEYCTL_LINK and KEYCTL_INSTANTIATE.

To this end:

(1) A function pointer is added to the key struct that, if set, points to
the vetting function. This is called as:

int (*restrict_link)(struct key *keyring,
const struct key_type *key_type,
unsigned long key_flags,
const union key_payload *key_payload),

where 'keyring' will be the keyring being added to, key_type and
key_payload will describe the key being added and key_flags[*] can be
AND'ed with KEY_FLAG_TRUSTED.

[*] This parameter will be removed in a later patch when
KEY_FLAG_TRUSTED is removed.

The function should return 0 to allow the link to take place or an
error (typically -ENOKEY, -ENOPKG or -EKEYREJECTED) to reject the
link.

The pointer should not be set directly, but rather should be set
through keyring_alloc().

Note that if called during add_key(), preparse is called before this
method, but a key isn't actually allocated until after this function
is called.

(2) KEY_ALLOC_BYPASS_RESTRICTION is added. This can be passed to
key_create_or_update() or key_instantiate_and_link() to bypass the
restriction check.

(3) KEY_FLAG_TRUSTED_ONLY is removed. The entire contents of a keyring
with this restriction emplaced can be considered 'trustworthy' by
virtue of being in the keyring when that keyring is consulted.

(4) key_alloc() and keyring_alloc() take an extra argument that will be
used to set restrict_link in the new key. This ensures that the
pointer is set before the key is published, thus preventing a window
of unrestrictedness. Normally this argument will be NULL.

(5) As a temporary affair, keyring_restrict_trusted_only() is added. It
should be passed to keyring_alloc() as the extra argument instead of
setting KEY_FLAG_TRUSTED_ONLY on a keyring. This will be replaced in
a later patch with functions that look in the appropriate places for
authoritative keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
diff 0b0a8415 Mon Dec 01 15:52:53 MST 2014 David Howells <dhowells@redhat.com> KEYS: request_key() should reget expired keys rather than give EKEYEXPIRED

Since the keyring facility can be viewed as a cache (at least in some
applications), the local expiration time on the key should probably be viewed
as a 'needs updating after this time' property rather than an absolute 'anyone
now wanting to use this object is out of luck' property.

Since request_key() is the main interface for the usage of keys, this should
update or replace an expired key rather than issuing EKEYEXPIRED if the local
expiration has been reached (ie. it should refresh the cache).

For absolute conditions where refreshing the cache probably doesn't help, the
key can be negatively instantiated using KEYCTL_REJECT_KEY with EKEYEXPIRED
given as the error to issue. This will still cause request_key() to return
EKEYEXPIRED as that was explicitly set.

In the future, if the key type has an update op available, we might want to
upcall with the expired key and allow the upcall to update it. We would pass
a different operation name (the first column in /etc/request-key.conf) to the
request-key program.

request_key() returning EKEYEXPIRED is causing an NFS problem which Chuck
Lever describes thusly:

After about 10 minutes, my NFSv4 functional tests fail because the
ownership of the test files goes to "-2". Looking at /proc/keys
shows that the id_resolv keys that map to my test user ID have
expired. The ownership problem persists until the expired keys are
purged from the keyring, and fresh keys are obtained.

I bisected the problem to 3.13 commit b2a4df200d57 ("KEYS: Expand
the capacity of a keyring"). This commit inadvertantly changes the
API contract of the internal function keyring_search_aux().

The root cause appears to be that b2a4df200d57 made "no state check"
the default behavior. "No state check" means the keyring search
iterator function skips checking the key's expiry timeout, and
returns expired keys. request_key_and_link() depends on getting
an -EAGAIN result code to know when to perform an upcall to refresh
an expired key.

This patch can be tested directly by:

keyctl request2 user debug:fred a @s
keyctl timeout %user:debug:fred 3
sleep 4
keyctl request2 user debug:fred a @s

Without the patch, the last command gives error EKEYEXPIRED, but with the
command it gives a new key.

Reported-by: Carl Hetherington <cth@carlh.net>
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
diff 46291959 Tue Sep 16 10:36:02 MDT 2014 David Howells <dhowells@redhat.com> KEYS: Preparse match data

Preparse the match data. This provides several advantages:

(1) The preparser can reject invalid criteria up front.

(2) The preparser can convert the criteria to binary data if necessary (the
asymmetric key type really wants to do binary comparison of the key IDs).

(3) The preparser can set the type of search to be performed. This means
that it's not then a one-off setting in the key type.

(4) The preparser can set an appropriate comparator function.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
diff 4bdf0bc3 Tue Sep 24 03:35:15 MDT 2013 David Howells <dhowells@redhat.com> KEYS: Introduce a search context structure

Search functions pass around a bunch of arguments, each of which gets copied
with each call. Introduce a search context structure to hold these.

Whilst we're at it, create a search flag that indicates whether the search
should be directly to the description or whether it should iterate through all
keys looking for a non-description match.

This will be useful when keyrings use a generic data struct with generic
routines to manage their content as the search terms can just be passed
through to the iterator callback function.

Also, for future use, the data to be supplied to the match function is
separated from the description pointer in the search context. This makes it
clear which is being supplied.

Signed-off-by: David Howells <dhowells@redhat.com>
/linux-master/fs/nfsd/
H A Dnfsctl.cdiff 4b148854 Fri Jan 26 08:39:47 MST 2024 Josef Bacik <josef@toxicpanda.com> nfsd: make all of the nfsd stats per-network namespace

We have a global set of counters that we modify for all of the nfsd
operations, but now that we're exposing these stats across all network
namespaces we need to make the stats also be per-network namespace. We
already have some caching stats that are per-network namespace, so move
these definitions into the same counter and then adjust all the helpers
and users of these stats to provide the appropriate nfsd_net struct so
that the stats are maintained for the per-network namespace objects.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 4b471a8b Tue Feb 14 08:19:30 MST 2023 Chuck Lever <chuck.lever@oracle.com> NFSD: Clean up nfsd_symlink()

The pointer dentry is assigned a value that is never read, the
assignment is redundant and can be removed.

Cleans up clang-scan warning:
fs/nfsd/nfsctl.c:1231:2: warning: Value stored to 'dentry' is
never read [deadcode.DeadStores]
dentry = ERR_PTR(ret);

No need to initialize "int ret = -ENOMEM;" either.

These are vestiges of nfsd_mkdir(), from whence I copied
nfsd_symlink().

Reported-by: Colin Ian King <colin.i.king@gmail.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 4df750c9 Sat Jan 14 22:21:39 MST 2023 Chuck Lever <chuck.lever@oracle.com> NFSD: Replace /proc/fs/nfsd/supported_krb5_enctypes with a symlink

Now that I've added a file under /proc/net/rpc that is managed by
the SunRPC's Kerberos mechanism, replace NFSD's
supported_krb5_enctypes file with a symlink to the new SunRPC proc
file, which contains exactly the same content.

Remarkably, commit b0b0c0a26e84 ("nfsd: add proc file listing
kernel's gss_krb5 enctypes") added the nfsd_supported_krb5_enctypes
file in 2011, but this file has never been documented in nfsd(7).

Tested-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Simo Sorce <simo@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
diff 4df493a2 Mon Apr 08 22:13:37 MDT 2019 Trond Myklebust <trondmy@gmail.com> SUNRPC: Cache the process user cred in the RPC server listener

In order to be able to interpret uids and gids correctly in knfsd, we
should cache the user namespace of the process that created the RPC
server's listener. To do so, we refcount the credential of that process.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions

Prior to
e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.")

v4.0 could not be disabled without disabling all NFSv4 protocols.
So the 'versions' file contained ±4 ±4.1 ±4.2.
Writing "-4" would disable all v4 completely. Writing +4 would enabled those
minor versions that are currently enabled, either by default or otherwise.

After that commit, it was possible to disable v4.0 independently. To
maximize backward compatibility with use cases which never disabled
v4.0, the "versions" file would never contain "+4.0" - that was implied
by "+4", unless explicitly negated by "-4.0".

This introduced an inconsistency in that it was possible to disable all
minor versions, but still have the major version advertised.
e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised,
but all attempts to use it rejected.

Commit
d3635ff07e8c ("nfsd: fix configuration of supported minor versions")

and following removed this inconsistency. If all minor version were disabled,
the major would be disabled too. If any minor was enabled, the major would be
disabled.
This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0".
A consequence of this is that writing "-4" would only disable 4.0.
This is a regression against the earlier behaviour, in a use case that rpc.nfsd
actually uses.
The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files.
Previously, that would disable v4 completely. Now it will only disable v4.0.

Also "4.0" never appears in the "versions" file when read.
So if only v4.1 is available, the previous kernel would have reported
"+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which
could easily confuse.

This patch restores the implication that "+4" and "-4" apply more
globals and do not imply "4.0".
Specifically:
writing "-4" will disable all 4.x minor versions.
writing "+4" will enable all 4.1 minor version if none are currently enabled.
rpc.nfsd will list minor versions before major versions, so
rpc.nfsd -V 4.2 -N 4.1
will write "-4.1 +4.2 +2 +3 +4"
so it would be a regression for "+4" to enable always all versions.
reading "-4" implies that no v4.x are enabled
reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless
"-4.0" is also present. All other minor versions will explicitly be listed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
/linux-master/drivers/vhost/
H A Dvhost.cdiff c1ecd8e9 Mon Jun 26 17:23:05 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: allow userspace to create workers

For vhost-scsi with 3 vqs or more and a workload that tries to use
them in parallel like:

fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128 --numjobs=3

the single vhost worker thread will become a bottlneck and we are stuck
at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are
used.

To better utilize virtqueues and available CPUs, this patch allows
userspace to create workers and bind them to vqs. You can have N workers
per dev and also share N workers with M vqs on that dev.

This patch adds the interface related code and the next patch will hook
vhost-scsi into it. The patches do not try to hook net and vsock into
the interface because:

1. multiple workers don't seem to help vsock. The problem is that with
only 2 virtqueues we never fully use the existing worker when doing
bidirectional tests. This seems to match vhost-scsi where we don't see
the worker as a bottleneck until 3 virtqueues are used.

2. net already has a way to use multiple workers.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-16-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff 737bdb64 Mon Jun 26 17:22:53 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: add vhost_worker pointer to vhost_virtqueue

This patchset allows userspace to map vqs to different workers. This
patch adds a worker pointer to the vq so in later patches in this set
we can queue/flush specific vqs and their workers.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-4-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff 55d8122f Mon Apr 24 16:50:30 MDT 2023 Shannon Nelson <shannon.nelson@amd.com> vhost: support PACKED when setting-getting vring_base

Use the right structs for PACKED or split vqs when setting and
getting the vring base.

Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Message-Id: <20230424225031.18947-3-shannon.nelson@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
diff 4b13cbef Wed Jun 07 13:23:38 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: Fix worker hangs due to missed wake up calls

We can race where we have added work to the work_list, but
vhost_task_fn has passed that check but not yet set us into
TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and
just return.

This bug was intoduced in commit f9010dbdce91 ("fork, vhost: Use
CLONE_THREAD to fix freezer/ps regression") when I moved the setting
of TASK_INTERRUPTIBLE to simplfy the code and avoid get_signal from
logging warnings about being in the wrong state. This moves the setting
of TASK_INTERRUPTIBLE back to before we test if we need to stop the
task to avoid a possible race there as well. We then have vhost_worker
set TASK_RUNNING if it finds work similar to before.

Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230607192338.6041-3-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff a284f09e Wed Jun 07 13:23:37 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: Fix crash during early vhost_transport_send_pkt calls

If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we
can race where:
1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue
2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create.
3. vhost_worker_create will set the dev->worker pointer before setting
the worker->vtsk pointer.
4. thread0's vhost_work_queue will see the dev->worker pointer is
set and try to call vhost_task_wake using not yet set worker->vtsk
pointer.
5. We then crash since vtsk is NULL.

Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker
threads"), we only had the worker pointer so we could just check it to
see if VHOST_SET_OWNER has been done. After that commit we have the
vhost_worker and vhost_task pointer, so we can now hit the bug above.

This patch embeds the vhost_worker in the vhost_dev and moves the work
list initialization back to vhost_dev_init, so we can just check the
worker.vtsk pointer to check if VHOST_SET_OWNER has been done like
before.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230607192338.6041-2-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: syzbot+d0d442c22fa8db45ff0e@syzkaller.appspotmail.com
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
diff 4d8df0f5 Mon May 22 02:50:19 MDT 2023 Prathu Baronia <prathubaronia2011@gmail.com> vhost: use kzalloc() instead of kmalloc() followed by memset()

Use kzalloc() to allocate new zeroed out msg node instead of
memsetting a node allocated with kmalloc().

Signed-off-by: Prathu Baronia <prathubaronia2011@gmail.com>
Message-Id: <20230522085019.42914-1-prathubaronia2011@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
diff 4c809363 Thu Jan 13 07:11:34 MST 2022 Stefano Garzarella <sgarzare@redhat.com> vhost: remove avail_event arg from vhost_update_avail_event()

In vhost_update_avail_event() we never used the `avail_event` argument,
since its introduction in commit 2723feaa8ec6 ("vhost: set log when
updating used flags or avail event").

Let's remove it to clean up the code.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20220113141134.186773-1-sgarzare@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff 71c0b9a6 Wed Oct 30 10:22:17 MDT 2019 Will Deacon <will@kernel.org> vhost: Remove redundant use of read_barrier_depends() barrier

Since commit 76ebbe78f739 ("locking/barriers: Add implicit
smp_read_barrier_depends() to READ_ONCE()"), there is no need to use
smp_read_barrier_depends() outside of the Alpha architecture code.

Unfortunately, there is precisely _one_ user in the vhost code, and
there isn't an obvious READ_ONCE() access making the barrier
redundant. However, on closer inspection (thanks, Jason), it appears
that vring synchronisation between the producer and consumer occurs via
the 'avail_idx' field, which is followed up by an rmb() in
vhost_get_vq_desc(), making the read_barrier_depends() redundant on
Alpha.

Jason says:

| I'm also confused about the barrier here, basically in driver side
| we did:
|
| 1) allocate pages
| 2) store pages in indirect->addr
| 3) smp_wmb()
| 4) increase the avail idx (somehow a tail pointer of vring)
|
| in vhost we did:
|
| 1) read avail idx
| 2) smp_rmb()
| 3) read indirect->addr
| 4) read from indirect->addr
|
| It looks to me even the data dependency barrier is not necessary
| since we have rmb() which is sufficient for us to the correct
| indirect->addr and driver are not expected to do any writing to
| indirect->addr after avail idx is increased

Remove the redundant barrier invocation.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Jason Wang <jasowang@redhat.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
diff 71c0b9a6 Wed Oct 30 10:22:17 MDT 2019 Will Deacon <will@kernel.org> vhost: Remove redundant use of read_barrier_depends() barrier

Since commit 76ebbe78f739 ("locking/barriers: Add implicit
smp_read_barrier_depends() to READ_ONCE()"), there is no need to use
smp_read_barrier_depends() outside of the Alpha architecture code.

Unfortunately, there is precisely _one_ user in the vhost code, and
there isn't an obvious READ_ONCE() access making the barrier
redundant. However, on closer inspection (thanks, Jason), it appears
that vring synchronisation between the producer and consumer occurs via
the 'avail_idx' field, which is followed up by an rmb() in
vhost_get_vq_desc(), making the read_barrier_depends() redundant on
Alpha.

Jason says:

| I'm also confused about the barrier here, basically in driver side
| we did:
|
| 1) allocate pages
| 2) store pages in indirect->addr
| 3) smp_wmb()
| 4) increase the avail idx (somehow a tail pointer of vring)
|
| in vhost we did:
|
| 1) read avail idx
| 2) smp_rmb()
| 3) read indirect->addr
| 4) read from indirect->addr
|
| It looks to me even the data dependency barrier is not necessary
| since we have rmb() which is sufficient for us to the correct
| indirect->addr and driver are not expected to do any writing to
| indirect->addr after avail idx is increased

Remove the redundant barrier invocation.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Jason Wang <jasowang@redhat.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
diff 0bbe3066 Thu Mar 26 08:01:19 MDT 2020 Jason Wang <jasowang@redhat.com> vhost: factor out IOTLB

This patch factors out IOTLB into a dedicated module in order to be
reused by other modules like vringh. User may choose to enable the
automatic retiring by specifying VHOST_IOTLB_FLAG_RETIRE flag to fit
for the case of vhost device IOTLB implementation.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20200326140125.19794-4-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
/linux-master/include/net/
H A Dip.hdiff d9f28735 Wed Dec 06 06:44:20 MST 2023 David Laight <David.Laight@ACULAB.COM> Use READ/WRITE_ONCE() for IP local_port_range.

Commit 227b60f5102cd added a seqlock to ensure that the low and high
port numbers were always updated together.
This is overkill because the two 16bit port numbers can be held in
a u32 and read/written in a single instruction.

More recently 91d0b78c5177f added support for finer per-socket limits.
The user-supplied value is 'high << 16 | low' but they are held
separately and the socket options protected by the socket lock.

Use a u32 containing 'high << 16 | low' for both the 'net' and 'sk'
fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are
always updated together.

Change (the now trival) inet_get_local_port_range() to a static inline
to optimise the calling code.
(In particular avoiding returning integers by reference.)

Signed-off-by: David Laight <david.laight@aculab.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Acked-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/4e505d4198e946a8be03fb1b4c3072b0@AcuMS.aculab.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos()

Christoph Paasch reported a panic in TCP stack [1]

Indeed, we should not call sk_dst_reset() without holding
the socket lock, as __sk_dst_get() callers do not all rely
on bare RCU.

[1]
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321
Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69
RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293
RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01
RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002
RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0
R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580
R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000
FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0
Call Trace:
<TASK>
tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567
tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419
tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839
tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323
__inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676
tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021
tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336
__sock_sendmsg+0x83/0xd0 net/socket.c:730
__sys_sendto+0x20a/0x2a0 net/socket.c:2194
__do_sys_sendto net/socket.c:2206 [inline]

Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos()

Christoph Paasch reported a panic in TCP stack [1]

Indeed, we should not call sk_dst_reset() without holding
the socket lock, as __sk_dst_get() callers do not all rely
on bare RCU.

[1]
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321
Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69
RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293
RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01
RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002
RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0
R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580
R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000
FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0
Call Trace:
<TASK>
tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567
tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419
tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839
tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323
__inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676
tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021
tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336
__sock_sendmsg+0x83/0xd0 net/socket.c:730
__sys_sendto+0x20a/0x2a0 net/socket.c:2194
__do_sys_sendto net/socket.c:2206 [inline]

Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos()

Christoph Paasch reported a panic in TCP stack [1]

Indeed, we should not call sk_dst_reset() without holding
the socket lock, as __sk_dst_get() callers do not all rely
on bare RCU.

[1]
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321
Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69
RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293
RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01
RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002
RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0
R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580
R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000
FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0
Call Trace:
<TASK>
tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567
tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419
tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839
tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323
__inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676
tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021
tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336
__sock_sendmsg+0x83/0xd0 net/socket.c:730
__sys_sendto+0x20a/0x2a0 net/socket.c:2194
__do_sys_sendto net/socket.c:2206 [inline]

Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos()

Christoph Paasch reported a panic in TCP stack [1]

Indeed, we should not call sk_dst_reset() without holding
the socket lock, as __sk_dst_get() callers do not all rely
on bare RCU.

[1]
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321
Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69
RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293
RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01
RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002
RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0
R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580
R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000
FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0
Call Trace:
<TASK>
tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567
tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419
tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839
tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323
__inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676
tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021
tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336
__sock_sendmsg+0x83/0xd0 net/socket.c:730
__sys_sendto+0x20a/0x2a0 net/socket.c:2194
__do_sys_sendto net/socket.c:2206 [inline]

Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
diff 3c5b4d69 Fri Jul 28 09:03:15 MDT 2023 Eric Dumazet <edumazet@google.com> net: annotate data-races around sk->sk_mark

sk->sk_mark is often read while another thread could change the value.

Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 3632679d Mon May 22 06:08:20 MDT 2023 Nicolas Dichtel <nicolas.dichtel@6wind.com> ipv{4,6}/raw: fix output xfrm lookup wrt protocol

With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
protocol field of the flow structure, build by raw_sendmsg() /
rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy
lookup when some policies are defined with a protocol in the selector.

For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
specify the protocol. Just accept all values for IPPROTO_RAW socket.

For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
without breaking backward compatibility (the value of this field was never
checked). Let's add a new kind of control message, so that the userland
could specify which protocol is used.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
CC: stable@vger.kernel.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.

This approach has a couple of downsides:

a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.

Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.

# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port

In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].

b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.

IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.

The per-netns setting affects all sockets, so this approach can be used
only if:

- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.

For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:

system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.

Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

PORT_LO = 40_000
PORT_HI = 40_511

s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.

This approach has a couple of downsides:

a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.

Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.

# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port

In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].

b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.

IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.

The per-netns setting affects all sockets, so this approach can be used
only if:

- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.

For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:

system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.

Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

PORT_LO = 40_000
PORT_HI = 40_511

s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.

This approach has a couple of downsides:

a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.

Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.

# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port

In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].

b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.

IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.

The per-netns setting affects all sockets, so this approach can be used
only if:

- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.

For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:

system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.

Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

PORT_LO = 40_000
PORT_HI = 40_511

s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.

This approach has a couple of downsides:

a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.

Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.

# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port

In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].

b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.

IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.

The per-netns setting affects all sockets, so this approach can be used
only if:

- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.

For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:

system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.

Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

PORT_LO = 40_000
PORT_HI = 40_511

s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.

This approach has a couple of downsides:

a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.

Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.

# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port

In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].

b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.

IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.

The per-netns setting affects all sockets, so this approach can be used
only if:

- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.

For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:

system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.

Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

PORT_LO = 40_000
PORT_HI = 40_511

s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
H A Dipv6.hdiff 4bd0623f Wed Aug 16 02:15:41 MDT 2023 Eric Dumazet <edumazet@google.com> inet: move inet->transparent to inet->inet_flags

IP_TRANSPARENT socket option can now be set/read
without locking the socket.

v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent()
v4: rebased after commit 3f326a821b99 ("mptcp: change the mpc check helper to return a sk")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 21985f43 Thu Oct 06 12:53:46 MDT 2022 Kuniyuki Iwashima <kuniyu@amazon.com> udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).

Commit 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") forgot
to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6
socket into IPv4 with IPV6_ADDRFORM. After conversion, sk_prot is
changed to udp_prot and ->destroy() never cleans it up, resulting in
a memory leak.

This is due to the discrepancy between inet6_destroy_sock() and
IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM
to remove the difference.

However, this is not enough for now because rxpmtu can be changed
without lock_sock() after commit 03485f2adcde ("udpv6: Add lockless
sendmsg() support"). We will fix this case in the following patch.

Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and
remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy()
in the future.

Fixes: 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 21985f43 Thu Oct 06 12:53:46 MDT 2022 Kuniyuki Iwashima <kuniyu@amazon.com> udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).

Commit 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") forgot
to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6
socket into IPv4 with IPV6_ADDRFORM. After conversion, sk_prot is
changed to udp_prot and ->destroy() never cleans it up, resulting in
a memory leak.

This is due to the discrepancy between inet6_destroy_sock() and
IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM
to remove the difference.

However, this is not enough for now because rxpmtu can be changed
without lock_sock() after commit 03485f2adcde ("udpv6: Add lockless
sendmsg() support"). We will fix this case in the following patch.

Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and
remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy()
in the future.

Fixes: 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff f93431c8 Tue Jun 07 06:00:27 MDT 2022 Wang Yufen <wangyufen@huawei.com> ipv6: Fix signed integer overflow in __ip6_append_data

Resurrect ubsan overflow checks and ubsan report this warning,
fix it by change the variable [length] type to size_t.

UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19
2147479552 + 8567 cannot be represented in type 'int'
CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x214/0x230
show_stack+0x30/0x78
dump_stack_lvl+0xf8/0x118
dump_stack+0x18/0x30
ubsan_epilogue+0x18/0x60
handle_overflow+0xd0/0xf0
__ubsan_handle_add_overflow+0x34/0x44
__ip6_append_data.isra.48+0x1598/0x1688
ip6_append_data+0x128/0x260
udpv6_sendmsg+0x680/0xdd0
inet6_sendmsg+0x54/0x90
sock_sendmsg+0x70/0x88
____sys_sendmsg+0xe8/0x368
___sys_sendmsg+0x98/0xe0
__sys_sendmmsg+0xf4/0x3b8
__arm64_sys_sendmmsg+0x34/0x48
invoke_syscall+0x64/0x160
el0_svc_common.constprop.4+0x124/0x300
do_el0_svc+0x44/0xc8
el0_svc+0x3c/0x1e8
el0t_64_sync_handler+0x88/0xb0
el0t_64_sync+0x16c/0x170

Changes since v1:
-Change the variable [length] type to unsigned, as Eric Dumazet suggested.
Changes since v2:
-Don't change exthdrlen type in ip6_make_skb, as Paolo Abeni suggested.
Changes since v3:
-Don't change ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as
Jakub Kicinski suggested.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 4f6570d7 Tue Sep 24 09:01:14 MDT 2019 Eric Dumazet <edumazet@google.com> ipv6: add priority parameter to ip6_xmit()

Currently, ip6_xmit() sets skb->priority based on sk->sk_priority

This is not desirable for TCP since TCP shares the same ctl socket
for a given netns. We want to be able to send RST or ACK packets
with a non zero skb->priority.

This patch has no functional change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list

syzbot reported nasty use-after-free [1]

Lets remove frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. This seens not needed anyway.

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc_node mm/slab.c:3269 [inline]
kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
__alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
alloc_skb include/linux/skbuff.h:1058 [inline]
__ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3698
kfree_skbmem net/core/skbuff.c:625 [inline]
kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
__kfree_skb net/core/skbuff.c:682 [inline]
kfree_skb net/core/skbuff.c:699 [inline]
kfree_skb+0xf0/0x390 net/core/skbuff.c:693
kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
__dev_xmit_skb net/core/dev.c:3551 [inline]
__dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:511 [inline]
ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list

syzbot reported nasty use-after-free [1]

Lets remove frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. This seens not needed anyway.

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc_node mm/slab.c:3269 [inline]
kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
__alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
alloc_skb include/linux/skbuff.h:1058 [inline]
__ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3698
kfree_skbmem net/core/skbuff.c:625 [inline]
kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
__kfree_skb net/core/skbuff.c:682 [inline]
kfree_skb net/core/skbuff.c:699 [inline]
kfree_skb+0xf0/0x390 net/core/skbuff.c:693
kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
__dev_xmit_skb net/core/dev.c:3551 [inline]
__dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:511 [inline]
ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list

syzbot reported nasty use-after-free [1]

Lets remove frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. This seens not needed anyway.

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc_node mm/slab.c:3269 [inline]
kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
__alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
alloc_skb include/linux/skbuff.h:1058 [inline]
__ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3698
kfree_skbmem net/core/skbuff.c:625 [inline]
kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
__kfree_skb net/core/skbuff.c:682 [inline]
kfree_skb net/core/skbuff.c:699 [inline]
kfree_skb+0xf0/0x390 net/core/skbuff.c:693
kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
__dev_xmit_skb net/core/dev.c:3551 [inline]
__dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:511 [inline]
ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list

syzbot reported nasty use-after-free [1]

Lets remove frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. This seens not needed anyway.

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
slab_post_alloc_hook mm/slab.h:437 [inline]
slab_alloc_node mm/slab.c:3269 [inline]
kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
__alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
alloc_skb include/linux/skbuff.h:1058 [inline]
__ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kmem_cache_free+0x86/0x260 mm/slab.c:3698
kfree_skbmem net/core/skbuff.c:625 [inline]
kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
__kfree_skb net/core/skbuff.c:682 [inline]
kfree_skb net/core/skbuff.c:699 [inline]
kfree_skb+0xf0/0x390 net/core/skbuff.c:693
kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
__dev_xmit_skb net/core/dev.c:3551 [inline]
__dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
neigh_output include/net/neighbour.h:511 [inline]
ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
__ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
NF_HOOK_COND include/linux/netfilter.h:294 [inline]
ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
dst_output include/net/dst.h:433 [inline]
ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
sock_sendmsg_nosec net/socket.c:652 [inline]
sock_sendmsg+0xd7/0x130 net/socket.c:671
___sys_sendmsg+0x803/0x920 net/socket.c:2292
__sys_sendmsg+0x105/0x1d0 net/socket.c:2330
__do_sys_sendmsg net/socket.c:2339 [inline]
__se_sys_sendmsg net/socket.c:2337 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff bec1f6f6 Thu Apr 26 11:42:17 MDT 2018 Willem de Bruijn <willemb@google.com> udp: generate gso with UDP_SEGMENT

Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.

The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles

tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles

tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles

udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles

udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles

[*] after reverting commit 0a6b2a1dc2a2
("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff bec1f6f6 Thu Apr 26 11:42:17 MDT 2018 Willem de Bruijn <willemb@google.com> udp: generate gso with UDP_SEGMENT

Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.

The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles

tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles

tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles

udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles

udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles

[*] after reverting commit 0a6b2a1dc2a2
("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/lib/
H A Dvsprintf.cdiff 72fcce70 Fri Oct 27 08:13:58 MDT 2023 Alexey Dobriyan <adobriyan@gmail.com> vsprintf: uninline simple_strntoull(), reorder arguments

* uninline simple_strntoull(),
gcc overinlines and this function is not performance critical

* reorder arguments, so that appending INT_MAX as 4th argument
generates very efficient tail call

Space savings:

add/remove: 1/0 grow/shrink: 0/3 up/down: 27/-179 (-152)
Function old new delta
simple_strntoll - 27 +27
simple_strtoull 15 10 -5
simple_strtoll 41 7 -34
vsscanf 1930 1790 -140

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/all/82a2af6e-9b6c-4a09-89d7-ca90cc1cdad1@p183/
diff 72fcce70 Fri Oct 27 08:13:58 MDT 2023 Alexey Dobriyan <adobriyan@gmail.com> vsprintf: uninline simple_strntoull(), reorder arguments

* uninline simple_strntoull(),
gcc overinlines and this function is not performance critical

* reorder arguments, so that appending INT_MAX as 4th argument
generates very efficient tail call

Space savings:

add/remove: 1/0 grow/shrink: 0/3 up/down: 27/-179 (-152)
Function old new delta
simple_strntoll - 27 +27
simple_strtoull 15 10 -5
simple_strtoll 41 7 -34
vsscanf 1930 1790 -140

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/all/82a2af6e-9b6c-4a09-89d7-ca90cc1cdad1@p183/
diff 4c85c0be Sun Jan 29 21:25:13 MST 2023 Hyeonggon Yoo <42.hyeyoo@gmail.com> mm, printk: introduce new format %pGt for page_type

%pGp format is used to display 'flags' field of a struct page. However,
some page flags (i.e. PG_buddy, see page-flags.h for more details) are
stored in page_type field. To display human-readable output of page_type,
introduce %pGt format.

It is important to note the meaning of bits are different in page_type.
if page_type is 0xffffffff, no flags are set. Setting PG_buddy
(0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit
actually means setting a flag. Bits in page_type are inverted when
displaying type names.

Only values for which page_type_has_type() returns true are considered as
page_type, to avoid confusion with mapcount values. if it returns false,
only raw values are displayed and not page type names.

Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff ef62c8ff Thu Mar 24 19:09:02 MDT 2022 Waiman Long <longman@redhat.com> lib/vsprintf: avoid redundant work with 0 size

Patch series "mm/page_owner: Extend page_owner to show memcg information", v4.

While debugging the constant increase in percpu memory consumption on a
system that spawned large number of containers, it was found that a lot
of offline mem_cgroup structures remained in place without being freed.
Further investigation indicated that those mem_cgroup structures were
pinned by some pages.

In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are offline or not. With the enhanced page_owner tool, the
following is a typical page that pinned the mem_cgroup structure in my
test case:

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns
PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
prep_new_page+0xac/0xe0
get_page_from_freelist+0x1327/0x14d0
__alloc_pages+0x191/0x340
alloc_pages_vma+0x84/0x250
shmem_alloc_page+0x3f/0x90
shmem_alloc_and_acct_page+0x76/0x1c0
shmem_getpage_gfp+0x281/0x940
shmem_write_begin+0x36/0xe0
generic_perform_write+0xed/0x1d0
__generic_file_write_iter+0xdc/0x1b0
generic_file_write_iter+0x5d/0xb0
new_sync_write+0x11f/0x1b0
vfs_write+0x1ba/0x2a0
ksys_write+0x59/0xd0
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to offline memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8.

So the page was not freed because it was part of a shmem segment. That
is useful information that can help users to diagnose similar problems.

With cgroup v1, /proc/cgroups can be read to find out the total number
of memory cgroups (online + offline). With cgroup v2, the cgroup.stat
of the root cgroup can be read to find the number of dying cgroups (most
likely pinned by dying memcgs).

The page_owner feature is not supposed to be enabled for production
system due to its memory overhead. However, if it is suspected that
dying memcgs are increasing over time, a test environment with
page_owner enabled can then be set up with appropriate workload for
further analysis on what may be causing the increasing number of dying
memcgs.

This patch (of 4):

For *scnprintf(), vsnprintf() is always called even if the input size is
0. That is a waste of time, so just return 0 in this case.

Note that vsnprintf() will never return -1 to indicate an error. So
skipping the call to vsnprintf() when size is 0 will have no functional
impact at all.

Link: https://lkml.kernel.org/r/20220202203036.744010-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20220202203036.744010-2-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 839b395e Mon Nov 08 19:33:22 MST 2021 Alexey Dobriyan <adobriyan@gmail.com> lib: uninline simple_strntoull() as well

Codegen become bloated again after simple_strntoull() introduction

add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-224 (-224)
Function old new delta
simple_strtoul 5 2 -3
simple_strtol 23 20 -3
simple_strtoull 119 15 -104
simple_strtoll 155 41 -114

Link: https://lkml.kernel.org/r/YVmlB9yY4lvbNKYt@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Richard Fitzgerald <rf@opensource.cirrus.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c0891ac1 Mon Aug 02 14:40:32 MDT 2021 Alexey Dobriyan <adobriyan@gmail.com> isystem: ship and use stdarg.h

Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>.
stdarg.h is the only userspace header commonly used in the kernel.

GPL 2 version of <stdarg.h> can be extracted from
http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
diff ad65dcef Wed Jun 30 19:56:04 MDT 2021 Alexey Dobriyan <adobriyan@gmail.com> lib: uninline simple_strtoull()

Gcc inlines simple_strtoull() too agressively.

Given that all 4 signatures match, everything very efficiently calls or
tailcalls into simple_strtoull():

ffffffff81da0240 <simple_strtoll>:
ffffffff81da0240: 80 3f 2d cmp BYTE PTR [rdi],0x2d
ffffffff81da0243: 74 05 je ffffffff81da024a <simple_strtoll+0xa>
ffffffff81da0245: e9 76 ff ff ff jmp simple_strtoull
ffffffff81da024a: 48 83 c7 01 add rdi,0x1
ffffffff81da024e: e8 6d ff ff ff call simple_strtoull
ffffffff81da0253: 48 f7 d8 neg rax
ffffffff81da0256: c3 ret

Space savings (on F34-ish .config)

add/remove: 0/0 grow/shrink: 1/3 up/down: 52/-313 (-261)
Function old new delta
vsscanf 2167 2219 +52
simple_strtoul 72 2 -70
simple_strtoll 143 23 -120
simple_strtol 143 20 -123

Link: https://lkml.kernel.org/r/YMO2zoOQk2eF34tn@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 900fdc45 Fri May 14 10:12:04 MDT 2021 Richard Fitzgerald <rf@opensource.cirrus.com> lib: vsprintf: Fix handling of number field widths in vsscanf

The existing code attempted to handle numbers by doing a strto[u]l(),
ignoring the field width, and then repeatedly dividing to extract the
field out of the full converted value. If the string contains a run of
valid digits longer than will fit in a long or long long, this would
overflow and no amount of dividing can recover the correct value.

This patch fixes vsscanf() to obey number field widths when parsing
the number.

A new _parse_integer_limit() is added that takes a limit for the number
of characters to parse. The number field conversion in vsscanf is changed
to use this new function.

If a number starts with a radix prefix, the field width must be long
enough for at last one digit after the prefix. If not, it will be handled
like this:

sscanf("0x4", "%1i", &i): i=0, scanning continues with the 'x'
sscanf("0x4", "%2i", &i): i=0, scanning continues with the '4'

This is consistent with the observed behaviour of userland sscanf.

Note that this patch does NOT fix the problem of a single field value
overflowing the target type. So for example:

sscanf("123456789abcdef", "%x", &i);

Will not produce the correct result because the value obviously overflows
INT_MAX. But sscanf will report a successful conversion.

Note that where a very large number is used to mean "unlimited", the value
INT_MAX is used for consistency with the behaviour of vsnprintf().

Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210514161206.30821-2-rf@opensource.cirrus.com
diff c244297a Fri Mar 19 04:12:46 MDT 2021 Yafang Shao <laoar.shao@gmail.com> vsprintf: dump full information of page flags in pGp

Currently the pGp only shows the names of page flags, rather than
the full information including section, node, zone, last cpupid and
kasan tag. While it is not easy to parse these information manually
because there're so many flavors. Let's interpret them in pGp as well.

To be compitable with the existed format of pGp, the new introduced ones
also use '|' as the separator, then the user tools parsing pGp won't
need to make change, suggested by Matthew. The new information is
tracked onto the end of the existed one.

On example of the output in mm/slub.c as follows,
- Before the patch,
[ 6343.396602] Slab 0x000000004382e02b objects=33 used=3 fp=0x000000009ae06ffc flags=0x17ffffc0010200(slab|head)

- After the patch,
[ 8448.272530] Slab 0x0000000090797883 objects=33 used=3 fp=0x00000000790f1c26 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)

The documentation and test cases are also updated. The output of the
test cases as follows,
[68599.816764] test_printf: loaded.
[68599.819068] test_printf: all 388 tests passed
[68599.830367] test_printf: unloaded.

[lkp@intel.com: reported issues in the prev version in test_printf.c]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: kernel test robot <lkp@intel.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210319101246.73513-4-laoar.shao@gmail.com
diff 5ead723a Sun Feb 14 09:13:48 MST 2021 Timur Tabi <timur@kernel.org> lib/vsprintf: no_hash_pointers prints all addresses as unhashed

If the no_hash_pointers command line parameter is set, then
printk("%p") will print pointers as unhashed, which is useful for
debugging purposes. This change applies to any function that uses
vsprintf, such as print_hex_dump() and seq_buf_printf().

A large warning message is displayed if this option is enabled.
Unhashed pointers expose kernel addresses, which can be a security
risk.

Also update test_printf to skip the hashed pointer tests if the
command-line option is set.

Signed-off-by: Timur Tabi <timur@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210214161348.369023-4-timur@kernel.org
/linux-master/net/mac80211/
H A Dsta_info.cdiff 4d3acf43 Mon Aug 28 06:00:01 MDT 2023 Johannes Berg <johannes.berg@intel.com> wifi: mac80211: remove sta_mtx

We now hold the wiphy mutex everywhere that we use or
needed the sta_mtx, so we don't need this mutex any
more. Remove it.

Most of this change was done automatically with spatch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
diff 4c51541d Fri Sep 02 08:12:54 MDT 2022 Benjamin Berg <benjamin.berg@intel.com> wifi: mac80211: keep A-MSDU data in sta and per-link

The A-MSDU data needs to be stored per-link and aggregated into a single
value for the station. Add a new struct ieee_80211_sta_aggregates in
order to store this data and a new function
ieee80211_sta_recalc_aggregates to update the current data for the STA.

Note that in the non MLO case the pointer in ieee80211_sta will directly
reference the data in deflink.agg, which means that recalculation may be
skipped in that case.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
diff 4dde3c36 Mon Nov 29 06:32:42 MST 2021 Mordechay Goodstein <mordechay.goodstein@intel.com> mac80211: update channel context before station state

Currently channel context is updated only after station got an update about
new assoc state, this results in station using the old channel context.

Fix this by moving the update channel context before updating station,
enabling the driver to immediately use the updated channel context in
the new assoc state.

Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.1c80c17ffd8a.I94ae31378b363c1182cfdca46c4b7e7165cff984@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
diff 7bc40aed Thu Nov 12 03:22:04 MST 2020 Johannes Berg <johannes.berg@intel.com> mac80211: free sta in sta_info_insert_finish() on errors

If sta_info_insert_finish() fails, we currently keep the station
around and free it only in the caller, but there's only one such
caller and it always frees it immediately.

As syzbot found, another consequence of this split is that we can
put things that sleep only into __cleanup_single_sta() and not in
sta_info_free(), but this is the only place that requires such of
sta_info_free() now.

Change this to free the station in sta_info_insert_finish(), in
which case we can still sleep. This will also let us unify the
cleanup code later.

Cc: stable@vger.kernel.org
Fixes: dcd479e10a05 ("mac80211: always wind down STA state")
Reported-by: syzbot+32c6c38c4812d22f2f0b@syzkaller.appspotmail.com
Reported-by: syzbot+4c81fe92e372d26c4246@syzkaller.appspotmail.com
Reported-by: syzbot+6a7fe9faf0d1d61bc24a@syzkaller.appspotmail.com
Reported-by: syzbot+abed06851c5ffe010921@syzkaller.appspotmail.com
Reported-by: syzbot+b7aeb9318541a1c709f1@syzkaller.appspotmail.com
Reported-by: syzbot+d5a9416c6cafe53b5dd0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20201112112201.ee6b397b9453.I9c31d667a0ea2151441cc64ed6613d36c18a48e0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
diff 3ace10f5 Mon Nov 18 23:06:09 MST 2019 Kan Yan <kyan@google.com> mac80211: Implement Airtime-based Queue Limit (AQL)

In order for the Fq_CoDel algorithm integrated in mac80211 layer to operate
effectively to control excessive queueing latency, the CoDel algorithm
requires an accurate measure of how long packets stays in the queue, AKA
sojourn time. The sojourn time measured at the mac80211 layer doesn't
include queueing latency in the lower layer (firmware/hardware) and CoDel
expects lower layer to have a short queue. However, most 802.11ac chipsets
offload tasks such TX aggregation to firmware or hardware, thus have a deep
lower layer queue.

Without a mechanism to control the lower layer queue size, packets only
stay in mac80211 layer transiently before being sent to firmware queue.
As a result, the sojourn time measured by CoDel in the mac80211 layer is
almost always lower than the CoDel latency target, hence CoDel does little
to control the latency, even when the lower layer queue causes excessive
latency.

The Byte Queue Limits (BQL) mechanism is commonly used to address the
similar issue with wired network interface. However, this method cannot be
applied directly to the wireless network interface. "Bytes" is not a
suitable measure of queue depth in the wireless network, as the data rate
can vary dramatically from station to station in the same network, from a
few Mbps to over Gbps.

This patch implements an Airtime-based Queue Limit (AQL) to make CoDel work
effectively with wireless drivers that utilized firmware/hardware
offloading. AQL allows each txq to release just enough packets to the lower
layer to form 1-2 large aggregations to keep hardware fully utilized and
retains the rest of the frames in mac80211 layer to be controlled by the
CoDel algorithm.

Signed-off-by: Kan Yan <kyan@google.com>
[ Toke: Keep API to set pending airtime internal, fix nits in commit msg ]
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-4-kyan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
diff 3e493173 Wed Sep 11 07:03:05 MDT 2019 Jouni Malinen <jouni@codeaurora.org> mac80211: Do not send Layer 2 Update frame before authorization

The Layer 2 Update frame is used to update bridges when a station roams
to another AP even if that STA does not transmit any frames after the
reassociation. This behavior was described in IEEE Std 802.11F-2003 as
something that would happen based on MLME-ASSOCIATE.indication, i.e.,
before completing 4-way handshake. However, this IEEE trial-use
recommended practice document was published before RSN (IEEE Std
802.11i-2004) and as such, did not consider RSN use cases. Furthermore,
IEEE Std 802.11F-2003 was withdrawn in 2006 and as such, has not been
maintained amd should not be used anymore.

Sending out the Layer 2 Update frame immediately after association is
fine for open networks (and also when using SAE, FT protocol, or FILS
authentication when the station is actually authenticated by the time
association completes). However, it is not appropriate for cases where
RSN is used with PSK or EAP authentication since the station is actually
fully authenticated only once the 4-way handshake completes after
authentication and attackers might be able to use the unauthenticated
triggering of Layer 2 Update frame transmission to disrupt bridge
behavior.

Fix this by postponing transmission of the Layer 2 Update frame from
station entry addition to the point when the station entry is marked
authorized. Similarly, send out the VLAN binding update only if the STA
entry has already been authorized.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 3e493173 Wed Sep 11 07:03:05 MDT 2019 Jouni Malinen <jouni@codeaurora.org> mac80211: Do not send Layer 2 Update frame before authorization

The Layer 2 Update frame is used to update bridges when a station roams
to another AP even if that STA does not transmit any frames after the
reassociation. This behavior was described in IEEE Std 802.11F-2003 as
something that would happen based on MLME-ASSOCIATE.indication, i.e.,
before completing 4-way handshake. However, this IEEE trial-use
recommended practice document was published before RSN (IEEE Std
802.11i-2004) and as such, did not consider RSN use cases. Furthermore,
IEEE Std 802.11F-2003 was withdrawn in 2006 and as such, has not been
maintained amd should not be used anymore.

Sending out the Layer 2 Update frame immediately after association is
fine for open networks (and also when using SAE, FT protocol, or FILS
authentication when the station is actually authenticated by the time
association completes). However, it is not appropriate for cases where
RSN is used with PSK or EAP authentication since the station is actually
fully authenticated only once the 4-way handshake completes after
authentication and attackers might be able to use the unauthenticated
triggering of Layer 2 Update frame transmission to disrupt bridge
behavior.

Fix this by postponing transmission of the Layer 2 Update frame from
station entry addition to the point when the station entry is marked
authorized. Similarly, send out the VLAN binding update only if the STA
entry has already been authorized.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 41cbb0f5 Sat Jun 09 00:14:44 MDT 2018 Luca Coelho <luciano.coelho@intel.com> mac80211: add support for HE

Add support for HE in mac80211 conforming with P802.11ax_D1.4.

Johannes: Fix another bug with the buf_size comparison in agg-rx.c.

Signed-off-by: Liad Kaufman <liad.kaufman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Ido Yariv <idox.yariv@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
diff 4c02d62f Thu Oct 05 11:39:10 MDT 2017 Kees Cook <keescook@chromium.org> net/mac80211/mesh_plink: Convert timers to use timer_setup()

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly. This requires adding a pointer back
to the sta_info since container_of() can't resolve the sta_info.

Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
diff 4df864c1 Fri Jun 16 06:29:21 MDT 2017 Johannes Berg <johannes.berg@intel.com> networking: make skb_put & friends return void pointers

It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_put, __skb_put };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)

@@
expression E, SKB, LEN;
identifier fn = { skb_put, __skb_put };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/drivers/scsi/fcoe/
H A Dfcoe.cdiff 276f9aa2 Wed Mar 03 07:46:04 MST 2021 Lee Jones <lee.jones@linaro.org> scsi: fcoe: Fix function name fcoe_set_vport_symbolic_name() in description

Fixes the following W=1 kernel build warning(s):

drivers/scsi/fcoe/fcoe.c:2782: warning: expecting prototype for fcoe_vport_set_symbolic_name(). Prototype was for fcoe_set_vport_symbolic_name() instead

Link: https://lore.kernel.org/r/20210303144631.3175331-4-lee.jones@linaro.org
Cc: Hannes Reinecke <hare@suse.de>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
diff b7a9d0c6 Mon Apr 20 21:40:08 MDT 2020 Jason Yan <yanaijie@huawei.com> scsi: fcoe: remove unneeded semicolon in fcoe.c

Fix the following coccicheck warning:

drivers/scsi/fcoe/fcoe.c:1918:3-4: Unneeded semicolon
drivers/scsi/fcoe/fcoe.c:1930:3-4: Unneeded semicolon

Link: https://lore.kernel.org/r/20200421034008.27865-1-yanaijie@huawei.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
diff b7a9d0c6 Mon Apr 20 21:40:08 MDT 2020 Jason Yan <yanaijie@huawei.com> scsi: fcoe: remove unneeded semicolon in fcoe.c

Fix the following coccicheck warning:

drivers/scsi/fcoe/fcoe.c:1918:3-4: Unneeded semicolon
drivers/scsi/fcoe/fcoe.c:1930:3-4: Unneeded semicolon

Link: https://lore.kernel.org/r/20200421034008.27865-1-yanaijie@huawei.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
diff 4df864c1 Fri Jun 16 06:29:21 MDT 2017 Johannes Berg <johannes.berg@intel.com> networking: make skb_put & friends return void pointers

It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

@@
expression SKB, LEN;
typedef u8;
identifier fn = { skb_put, __skb_put };
@@
- *(fn(SKB, LEN))
+ *(u8 *)fn(SKB, LEN)

@@
expression E, SKB, LEN;
identifier fn = { skb_put, __skb_put };
type T;
@@
- E = ((T *)(fn(SKB, LEN)))
+ E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 4b9bc86d Tue Apr 12 09:16:54 MDT 2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de> fcoe: convert to kworker

The driver creates its own per-CPU threads which are updated based on
CPU hotplug events. It is also possible to use kworkers and remove some
of the kthread infrastrucure.

The code checked ->thread to decide if there is an active per-CPU
thread. By using the kworker infrastructure this is no longer
possible (or required). The thread pointer is saved in `kthread' instead
of `thread' so anything trying to use thread is caught by the
compiler. Currently only the bnx2fc driver is using struct fcoe_percpu_s
and the kthread member.

After a CPU went offline, we may still enqueue items on the "offline"
CPU. This isn't much of a problem. The work will be done on a random
CPU. The allocated crc_eof_page page won't be cleaned up. It is probably
expected that the CPU comes up at some point so it should not be a
problem. The crc_eof_page memory is released of course once the module
is removed.

This patch was only compile-tested due to -ENODEV.

Cc: Vasu Dev <vasu.dev@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: fcoe-devel@open-fcoe.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
diff 860eca2b Tue Sep 27 22:38:18 MDT 2011 Vasu Dev <vasu.dev@intel.com> [SCSI] fcoe: setup default initial value for DDP threshold

Currently fcoe_ddp_min doesn't have default value
so by default not used, so setting up default value
as 4k as this works better by avoiding overhead
of programing DDP for small IOs.

Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
diff 4bc71cb9 Fri Sep 02 21:34:30 MDT 2011 Jiri Pirko <jpirko@redhat.com> net: consolidate and fix ethtool_ops->get_settings calling

This patch does several things:
- introduces __ethtool_get_settings which is called from ethtool code and
from drivers as well. Put ASSERT_RTNL there.
- dev_ethtool_get_settings() is replaced by __ethtool_get_settings()
- changes calling in drivers so rtnl locking is respected. In
iboe_get_rate was previously ->get_settings() called unlocked. This
fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same
problem. Also fixed by calling __dev_get_by_index() instead of
dev_get_by_index() and holding rtnl_lock for both calls.
- introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create()
so bnx2fc_if_create() and fcoe_if_create() are called locked as they
are from other places.
- use __ethtool_get_settings() in bonding code

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v2->v3:
-removed dev_ethtool_get_settings()
-added ASSERT_RTNL into __ethtool_get_settings()
-prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock
around it and __ethtool_get_settings() call
v1->v2:
add missing export_symbol
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits]
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 848e7d5b Thu Aug 25 01:40:47 MDT 2011 Robert Love <robert.w.love@intel.com> [SCSI] fcoe: Fix deadlock between fip's recv_work and rtnl

The rtnl cannot be held durrng the fcoe_interface_put.
If it is the last reference on the fcoe_interface the
fcoe_ctlr_destroy will be called as a part of the
cleanup, ultimately calling cancel_work_sync(&fip->recv_work);

If we are processing a flogi response we will be in
the recv_work context and we will lock the rtnl to
add a new unicast MAC address. This is how the deadlock
can occur.

The fix is simply to move the rtnl_lock/unlock into
fcoe_interface_cleanup so that it can be unlocked before
fcoe_interface_put is called.

Here is the lockdep report:

Jul 21 11:26:35 bubba [ 223.870702]
ul 21 11:26:35 bubba [ 223.870704] =======================================================
Jul 21 11:26:35 bubba [ 223.871255] [ INFO: possible circular locking dependency detected ]
Jul 21 11:26:35 bubba [ 223.871530] 3.0.0-rc7+ #1
Jul 21 11:26:35 bubba [ 223.871797] -------------------------------------------------------
Jul 21 11:26:35 bubba [ 223.872072] lockdeptest.sh/3464 is trying to acquire lock:
Jul 21 11:26:35 bubba [ 223.872345] ((&fip->recv_work)
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffff810531f1>] wait_on_work+0x0/0xbd
Jul 21 11:26:35 bubba [ 223.873022]
Jul 21 11:26:35 bubba [ 223.873023] but task is already holding lock:
Jul 21 11:26:35 bubba [ 223.873555] (rtnl_mutex
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffff813e8233>] rtnl_lock+0x12/0x14
Jul 21 11:26:35 bubba [ 223.874229]
Jul 21 11:26:35 bubba [ 223.874230] which lock already depends on the new lock.
Jul 21 11:26:35 bubba [ 223.874231]
Jul 21 11:26:35 bubba [ 223.875032]
Jul 21 11:26:35 bubba [ 223.875033] the existing dependency chain (in reverse order) is:
Jul 21 11:26:35 bubba [ 223.875573]
Jul 21 11:26:35 bubba [ 223.875573] -> #1
Jul 21 11:26:35 bubba (rtnl_mutex
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba :
Jul 21 11:26:35 bubba [ 223.876301]
Jul 21 11:26:35 bubba [<ffffffff8106c14a>] lock_acquire+0xd2/0xf7
Jul 21 11:26:35 bubba [ 223.876645]
Jul 21 11:26:35 bubba [<ffffffff8151d975>] __mutex_lock_common+0x47/0x30d
Jul 21 11:26:35 bubba [ 223.876991]
Jul 21 11:26:35 bubba [<ffffffff8151dd36>] mutex_lock_nested+0x3b/0x40
Jul 21 11:26:35 bubba [ 223.877334]
Jul 21 11:26:35 bubba [<ffffffff813e8233>] rtnl_lock+0x12/0x14
Jul 21 11:26:35 bubba [ 223.877675]
Jul 21 11:26:35 bubba [<ffffffffa003d5a0>] fcoe_update_src_mac+0x2b/0x80 [fcoe]
Jul 21 11:26:35 bubba [ 223.878022]
Jul 21 11:26:35 bubba [<ffffffffa003d698>] fcoe_flogi_resp+0x5e/0x79 [fcoe]
Jul 21 11:26:35 bubba [ 223.878366]
Jul 21 11:26:35 bubba [<ffffffffa001566f>] fc_exch_recv+0x7f5/0x9da [libfc]
Jul 21 11:26:35 bubba [ 223.878713]
Jul 21 11:26:35 bubba [<ffffffffa00327d8>] fcoe_ctlr_recv_work+0x71f/0x10dc [libfcoe]
Jul 21 11:26:35 bubba [ 223.879258]
Jul 21 11:26:35 bubba [<ffffffff81053761>] process_one_work+0x1d7/0x347
Jul 21 11:26:35 bubba [ 223.879601]
Jul 21 11:26:35 bubba [<ffffffff81054ade>] worker_thread+0xf8/0x17c
Jul 21 11:26:35 bubba [ 223.879944]
Jul 21 11:26:35 bubba [<ffffffff81058184>] kthread+0x7d/0x85
Jul 21 11:26:35 bubba [ 223.880287]
Jul 21 11:26:35 bubba [<ffffffff81526414>] kernel_thread_helper+0x4/0x10
Jul 21 11:26:35 bubba [ 223.880634]
Jul 21 11:26:35 bubba [ 223.880635] -> #0
Jul 21 11:26:35 bubba ((&fip->recv_work)
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba :
Jul 21 11:26:35 bubba [ 223.881357]
Jul 21 11:26:35 bubba [<ffffffff8106b93e>] __lock_acquire+0xb1d/0xe2c
Jul 21 11:26:35 bubba [ 223.881695]
Jul 21 11:26:35 bubba [<ffffffff8106c14a>] lock_acquire+0xd2/0xf7
Jul 21 11:26:35 bubba [ 223.882033]
Jul 21 11:26:35 bubba [<ffffffff81053241>] wait_on_work+0x50/0xbd
Jul 21 11:26:35 bubba [ 223.882378]
Jul 21 11:26:35 bubba [<ffffffff81053b32>] __cancel_work_timer+0xb6/0xf4
Jul 21 11:26:35 bubba [ 223.882718]
Jul 21 11:26:35 bubba [<ffffffff81053b8a>] cancel_work_sync+0xb/0xd
Jul 21 11:26:35 bubba [ 223.883057]
Jul 21 11:26:35 bubba [<ffffffffa00317e6>] fcoe_ctlr_destroy+0x1d/0x67 [libfcoe]
Jul 21 11:26:35 bubba [ 223.883399]
Jul 21 11:26:35 bubba [<ffffffffa003e51e>] fcoe_interface_release+0x21/0x45 [fcoe]
Jul 21 11:26:35 bubba [ 223.883940]
Jul 21 11:26:35 bubba [<ffffffff811fbbe6>] kref_put+0x43/0x4d
Jul 21 11:26:35 bubba [ 223.884280]
Jul 21 11:26:35 bubba [<ffffffffa003ebba>] fcoe_interface_put+0x17/0x19 [fcoe]
Jul 21 11:26:35 bubba [ 223.884624]
Jul 21 11:26:35 bubba [<ffffffffa003f2a6>] fcoe_interface_cleanup+0x188/0x193 [fcoe]
Jul 21 11:26:35 bubba [ 223.885163]
Jul 21 11:26:35 bubba [<ffffffffa003f303>] fcoe_destroy+0x52/0x72 [fcoe]
Jul 21 11:26:35 bubba [ 223.885502]
Jul 21 11:26:35 bubba [<ffffffffa00340a4>] fcoe_transport_destroy+0xab/0x110 [libfcoe]
Jul 21 11:26:35 bubba [ 223.886045]
Jul 21 11:26:35 bubba [<ffffffff81056153>] param_attr_store+0x43/0x62
Jul 21 11:26:35 bubba [ 223.886385]
Jul 21 11:26:35 bubba [<ffffffff8105602d>] module_attr_store+0x21/0x25
Jul 21 11:26:35 bubba [ 223.886728]
Jul 21 11:26:35 bubba [<ffffffff8114c23d>] sysfs_write_file+0x103/0x13f
Jul 21 11:26:35 bubba [ 223.887068]
Jul 21 11:26:35 bubba [<ffffffff810f3e7b>] vfs_write+0xa7/0xfa
Jul 21 11:26:35 bubba [ 223.887406]
Jul 21 11:26:35 bubba [<ffffffff810f4073>] sys_write+0x45/0x69
Jul 21 11:26:35 bubba [ 223.887742]
Jul 21 11:26:35 bubba [<ffffffff815252bb>] system_call_fastpath+0x16/0x1b
Jul 21 11:26:35 bubba [ 223.888083]
Jul 21 11:26:35 bubba [ 223.888084] other info that might help us debug this:
Jul 21 11:26:35 bubba [ 223.888085]
Jul 21 11:26:35 bubba [ 223.888879] Possible unsafe locking scenario:
Jul 21 11:26:35 bubba [ 223.888881]
Jul 21 11:26:35 bubba [ 223.889411] CPU0 CPU1
Jul 21 11:26:35 bubba [ 223.889683] ---- ----
Jul 21 11:26:35 bubba [ 223.889955] lock(
Jul 21 11:26:35 bubba rtnl_mutex
Jul 21 11:26:35 bubba );
Jul 21 11:26:35 bubba [ 223.890349] lock(
Jul 21 11:26:35 bubba (&fip->recv_work)
Jul 21 11:26:35 bubba );
Jul 21 11:26:35 bubba [ 223.890751] lock(
Jul 21 11:26:35 bubba rtnl_mutex
Jul 21 11:26:35 bubba );
Jul 21 11:26:35 bubba [ 223.891154] lock(
Jul 21 11:26:35 bubba (&fip->recv_work)
Jul 21 11:26:35 bubba );
Jul 21 11:26:35 bubba [ 223.891549]
Jul 21 11:26:35 bubba [ 223.891550] *** DEADLOCK ***
Jul 21 11:26:35 bubba [ 223.891551]
Jul 21 11:26:35 bubba [ 223.892347] 6 locks held by lockdeptest.sh/3464:
Jul 21 11:26:35 bubba [ 223.892621] #0:
Jul 21 11:26:35 bubba (&buffer->mutex
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffff8114c171>] sysfs_write_file+0x37/0x13f
Jul 21 11:26:35 bubba [ 223.893359] #1:
Jul 21 11:26:35 bubba (s_active
Jul 21 11:26:35 bubba ){++++.+}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffff8114c21c>] sysfs_write_file+0xe2/0x13f
Jul 21 11:26:35 bubba [ 223.894094] #2:
Jul 21 11:26:35 bubba (param_lock
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffff81056146>] param_attr_store+0x36/0x62
Jul 21 11:26:35 bubba [ 223.894835] #3:
Jul 21 11:26:35 bubba (ft_mutex
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffffa0034017>] fcoe_transport_destroy+0x1e/0x110 [libfcoe]
Jul 21 11:26:35 bubba [ 223.895574] #4:
Jul 21 11:26:35 bubba (fcoe_config_mutex
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffffa003f2c9>] fcoe_destroy+0x18/0x72 [fcoe]
Jul 21 11:26:35 bubba [ 223.896314] #5:
Jul 21 11:26:35 bubba (rtnl_mutex
Jul 21 11:26:35 bubba ){+.+.+.}
Jul 21 11:26:35 bubba , at:
Jul 21 11:26:35 bubba [<ffffffff813e8233>] rtnl_lock+0x12/0x14
Jul 21 11:26:35 bubba [ 223.897047]
Jul 21 11:26:35 bubba [ 223.897048] stack backtrace:
Jul 21 11:26:35 bubba [ 223.897578] Pid: 3464, comm: lockdeptest.sh Not tainted 3.0.0-rc7+ #1
Jul 21 11:26:35 bubba [ 223.897853] Call Trace:
Jul 21 11:26:35 bubba [ 223.898128] [<ffffffff81068e16>] print_circular_bug+0x1f8/0x209
Jul 21 11:26:35 bubba [ 223.898416] [<ffffffff8106b93e>] __lock_acquire+0xb1d/0xe2c
Jul 21 11:26:35 bubba [ 223.898699] [<ffffffff810531f1>] ? wait_on_cpu_work+0xe6/0xe6
Jul 21 11:26:35 bubba [ 223.898982] [<ffffffff8106c14a>] lock_acquire+0xd2/0xf7
Jul 21 11:26:35 bubba [ 223.899263] [<ffffffff810531f1>] ? wait_on_cpu_work+0xe6/0xe6
Jul 21 11:26:35 bubba [ 223.899547] [<ffffffff8104a097>] ? mod_timer+0x8f/0x98
Jul 21 11:26:35 bubba [ 223.899827] [<ffffffff81053241>] wait_on_work+0x50/0xbd
Jul 21 11:26:35 bubba [ 223.900108] [<ffffffff810531f1>] ? wait_on_cpu_work+0xe6/0xe6
Jul 21 11:26:35 bubba [ 223.900390] [<ffffffff81053b32>] __cancel_work_timer+0xb6/0xf4
Jul 21 11:26:35 bubba [ 223.900671] [<ffffffff81053b8a>] cancel_work_sync+0xb/0xd
Jul 21 11:26:35 bubba [ 223.900953] [<ffffffffa00317e6>] fcoe_ctlr_destroy+0x1d/0x67 [libfcoe]
Jul 21 11:26:35 bubba [ 223.901237] [<ffffffffa003e51e>] fcoe_interface_release+0x21/0x45 [fcoe]
Jul 21 11:26:35 bubba [ 223.901522] [<ffffffffa003e4fd>] ? fcoe_enable+0x6b/0x6b [fcoe]
Jul 21 11:26:35 bubba [ 223.901803] [<ffffffff811fbbe6>] kref_put+0x43/0x4d
Jul 21 11:26:35 bubba [ 223.902083] [<ffffffffa003ebba>] fcoe_interface_put+0x17/0x19 [fcoe]
Jul 21 11:26:35 bubba [ 223.902367] [<ffffffffa003f2a6>] fcoe_interface_cleanup+0x188/0x193 [fcoe]
Jul 21 11:26:35 bubba [ 223.902653] [<ffffffff8151dd36>] ? mutex_lock_nested+0x3b/0x40
Jul 21 11:26:35 bubba [ 223.902939] [<ffffffffa003f303>] fcoe_destroy+0x52/0x72 [fcoe]
Jul 21 11:26:35 bubba [ 223.903223] [<ffffffffa00340a4>] fcoe_transport_destroy+0xab/0x110 [libfcoe]
Jul 21 11:26:35 bubba [ 223.903508] [<ffffffff81056153>] param_attr_store+0x43/0x62
Jul 21 11:26:35 bubba [ 223.903792] [<ffffffff8105602d>] module_attr_store+0x21/0x25
Jul 21 11:26:35 bubba [ 223.904075] [<ffffffff8114c23d>] sysfs_write_file+0x103/0x13f
Jul 21 11:26:35 bubba [ 223.904357] [<ffffffff810f3e7b>] vfs_write+0xa7/0xfa
Jul 21 11:26:35 bubba [ 223.904642] [<ffffffff810f51d6>] ? fget_light+0x35/0x96
Jul 21 11:26:35 bubba [ 223.904923] [<ffffffff810f4073>] sys_write+0x45/0x69
Jul 21 11:26:35 bubba [ 223.905204] [<ffffffff815252bb>] system_call_fastpath+0x16/0x1b
Jul 21 11:26:36 bubba [ 223.964438] ixgbe 0000:05:00.0: eth3: detected SFP+: 5
Jul 21 11:26:37 bubba [ 225.196702] ixgbe 0000:05:00.0: eth3: NIC Link is Up 10 Gbps, Flow Control: None

Signed-off-by: Robert Love <robert.w.love@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Reviewed-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
diff e10f8c66 Tue Jul 20 16:20:30 MDT 2010 Joe Eykholt <jeykholt@cisco.com> [SCSI] libfcoe: fcoe: fnic: add FIP VN2VN point-to-multipoint support

The FC-BB-6 committee is proposing a new FIP usage model called
VN_port to VN_port mode. It allows VN_ports to discover each other
over a loss-free L2 Ethernet without any FCF or Fibre-channel fabric
services. This is point-to-multipoint. There is also a variant
of this called point-to-point which provides for making sure there
is just one pair of ports operating over the Ethernet fabric.

We add these new states: VNMP_START, _PROBE1, _PROBE2, _CLAIM, and _UP.
These usually go quickly in that sequence. After waiting a random
amount of time up to 100 ms in START, we select a pseudo-random
proposed locally-unique port ID and send out probes in states PROBE1
and PROBE2, 100 ms apart. If no probe responses are heard, we
proceed to CLAIM state 400 ms later and send a claim notification.
We wait another 400 ms to receive claim responses, which give us
a list of the other nodes on the network, including their FC-4
capabilities. After another 400 ms we go to VNMP_UP state and
should start interoperating with any of the nodes for whic we
receivec claim responses. More details are in the spec.j

Add the new mode as FIP_MODE_VN2VN. The driver must specify
explicitly that it wants to operate in this mode. There is
no automatic detection between point-to-multipoint and fabric
mode, and the local port initialization is affected, so it isn't
anticipated that there will ever be any such automatic switchover.

It may eventually be possible to have both fabric and VN2VN
modes on the same L2 network, which may be done by two separate
local VN_ports (lports).

When in VN2VN mode, FIP replaces libfc's fabric-oriented discovery
module with its own simple code that adds remote ports as they
are discovered from incoming claim notifications and responses.
These hooks are placed by fcoe_disc_init().

A linear list of discovered vn_ports is maintained under the
fcoe_ctlr struct. It is expected to be short for now, and
accessed infrequently. It is kept under RCU for lock-ordering
reasons. The lport and/or rport mutexes may be held when we
need to lookup a fcoe_vnport during an ELS send.

Change fcoe_ctlr_encaps() to lookup the destination vn_port in
the list of peers for the destination MAC address of the
FIP-encapsulated frame.

Add a new function fcoe_disc_init() to initialize just the
discovery portion of libfcoe for VN2VN mode.

Signed-off-by: Joe Eykholt <jeykholt@cisco.com>
Signed-off-by: Robert Love <robert.w.love@intel.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/include/linux/
H A Dkernel.hdiff b229baa3 Tue Jul 18 15:11:44 MDT 2023 Andy Shevchenko <andriy.shevchenko@linux.intel.com> kernel.h: split out COUNT_ARGS() and CONCATENATE() to args.h

Patch series "kernel.h: Split out a couple of macros to args.h", v4.

There are macros in kernel.h that can be used outside of that header.
Split them to args.h and replace open coded variants.


This patch (of 4):

kernel.h is being used as a dump for all kinds of stuff for a long time.
The COUNT_ARGS() and CONCATENATE() macros may be used in some places
without need of the full kernel.h dependency train with it.

Here is the attempt on cleaning it up by splitting out these macros().

While at it, include new header where it's being used.

Link: https://lkml.kernel.org/r/20230718211147.18647-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20230718211147.18647-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [PCI]
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: Daniel Latypov <dlatypov@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Gow <davidgow@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a20deb3a Mon Nov 08 19:33:51 MST 2021 Kefeng Wang <wangkefeng.wang@huawei.com> sections: move and rename core_kernel_data() to is_kernel_core_data()

Move core_kernel_data() into sections.h and rename it to
is_kernel_core_data(), also make it return bool value, then update all the
callers.

Link: https://lkml.kernel.org/r/20210930071143.63410-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff c0891ac1 Mon Aug 02 14:40:32 MDT 2021 Alexey Dobriyan <adobriyan@gmail.com> isystem: ship and use stdarg.h

Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>.
stdarg.h is the only userspace header commonly used in the kernel.

GPL 2 version of <stdarg.h> can be extracted from
http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
diff 4c527293 Wed Jun 30 19:56:10 MDT 2021 Andy Shevchenko <andriy.shevchenko@linux.intel.com> kernel.h: split out kstrtox() and simple_strtox() to a separate header

kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out kstrtox() and
simple_strtox() helpers.

At the same time convert users in header and lib folders to use new
header. Though for time being include new header back to kernel.h to
avoid twisted indirected includes for existing users.

[andy.shevchenko@gmail.com: fix documentation references]
Link: https://lkml.kernel.org/r/20210615220003.377901-1-andy.shevchenko@gmail.com

Link: https://lkml.kernel.org/r/20210611185815.44103-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Francis Laniel <laniel_francis@privacyrequired.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Kars Mulder <kerneldev@karsmulder.nl>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff ef0f2685 Tue Aug 11 19:34:56 MDT 2020 Kars Mulder <kerneldev@karsmulder.nl> kstrto*: do not describe simple_strto*() as obsolete/replaced

The documentation of the kstrto*() functions describes kstrto*() as
"replacements" of the "obsolete" simple_strto*() functions. Both of these
terms are inaccurate: they're not replacements because they have different
behaviour, and the simple_strto*() are not obsolete because there are
cases where they have benefits over kstrto*().

Remove usage of the terms "replacement" and "obsolete" in reference to
simple_strto*(), and instead use the term "preferred over".

Fixes: 4c925d6031f71 ("kstrto*: add documentation")
Fixes: 885e68e8b7b13 ("kernel.h: update comment about simple_strto<foo>() functions")
Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Eldad Zack <eldad@fogrefinery.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Mans Rullgard <mans@mansr.com>
Cc: Petr Mladek <pmladek@suse.com>
Link: http://lkml.kernel.org/r/29b9-5f234c80-13-4e3aa200@244003027
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff ef0f2685 Tue Aug 11 19:34:56 MDT 2020 Kars Mulder <kerneldev@karsmulder.nl> kstrto*: do not describe simple_strto*() as obsolete/replaced

The documentation of the kstrto*() functions describes kstrto*() as
"replacements" of the "obsolete" simple_strto*() functions. Both of these
terms are inaccurate: they're not replacements because they have different
behaviour, and the simple_strto*() are not obsolete because there are
cases where they have benefits over kstrto*().

Remove usage of the terms "replacement" and "obsolete" in reference to
simple_strto*(), and instead use the term "preferred over".

Fixes: 4c925d6031f71 ("kstrto*: add documentation")
Fixes: 885e68e8b7b13 ("kernel.h: update comment about simple_strto<foo>() functions")
Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Eldad Zack <eldad@fogrefinery.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Mans Rullgard <mans@mansr.com>
Cc: Petr Mladek <pmladek@suse.com>
Link: http://lkml.kernel.org/r/29b9-5f234c80-13-4e3aa200@244003027
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff b642e44e Tue Aug 11 19:34:53 MDT 2020 Kars Mulder <kerneldev@karsmulder.nl> kstrto*: correct documentation references to simple_strto*()

The documentation of the kstrto*() functions reference the simple_strtoull
function by "used as a replacement for [the obsolete] simple_strtoull".
All these functions describes themselves as replacements for the function
simple_strtoull, even though a function like kstrtol() would be more aptly
described as a replacement of simple_strtol().

Fix these references by making the documentation of kstrto*() reference
the closest simple_strto*() equivalent available. The functions
kstrto[u]int() do not have direct simple_strto[u]int() equivalences, so
these are made to refer to simple_strto[u]l() instead.

Furthermore, add parentheses after function names, as is standard in
kernel documentation.

Fixes: 4c925d6031f71 ("kstrto*: add documentation")
Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Eldad Zack <eldad@fogrefinery.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Mans Rullgard <mans@mansr.com>
Cc: Petr Mladek <pmladek@suse.com>
Link: http://lkml.kernel.org/r/1ee1-5f234c00-f3-165a6440@234394593
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 229f5879 Wed Jul 22 05:03:05 MDT 2020 Kishon Vijay Abraham I <kishon@ti.com> linux/kernel.h: Add PTR_ALIGN_DOWN macro

Add a macro for aligning down a pointer. This is useful to get an
aligned register address when a device allows only word access and
doesn't allow half word or byte access.

Link: https://lore.kernel.org/r/20200722110317.4744-4-kishon@ti.com
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Rob Herring <robh@kernel.org>
diff 4e139c77 Fri Feb 14 06:39:19 MST 2020 Thomas Gleixner <tglx@linutronix.de> sched: Provide cant_migrate()

Some code pathes rely on preempt_disable() to prevent migration on a non RT
enabled kernel. These preempt_disable/enable() pairs are substituted by
migrate_disable/enable() pairs or other forms of RT specific protection. On
RT these protections prevent migration but not preemption. Obviously a
cant_sleep() check in such a section will trigger on RT because preemption
is not disabled.

Provide a cant_migrate() macro which maps to cant_sleep() on a non RT
kernel and an empty placeholder for RT for now. The placeholder will be
changed to a proper debug check along with the RT specific migration
protection mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200214161503.070487511@linutronix.de
diff 312364f3 Mon Aug 26 14:14:23 MDT 2019 Daniel Vetter <daniel.vetter@ffwll.ch> kernel.h: Add non_block_start/end()

In some special cases we must not block, but there's not a spinlock,
preempt-off, irqs-off or similar critical section already that arms the
might_sleep() debug checks. Add a non_block_start/end() pair to annotate
these.

This will be used in the oom paths of mmu-notifiers, where blocking is not
allowed to make sure there's forward progress. Quoting Michal:

"The notifier is called from quite a restricted context - oom_reaper -
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well but we mostly do care about it to make a forward
progress. Checking for sleepable context is the best thing we could come
up with that would describe these demands at least partially."

Peter also asked whether we want to catch spinlocks on top, but Michal
said those are less of a problem because spinlocks can't have an indirect
dependency upon the page allocator and hence close the loop with the oom
reaper.

Suggested by Michal Hocko.

Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
H A Dsched.hdiff fa3bea4e Thu Feb 01 22:02:37 MST 2024 Gregory Price <gourry.memverge@gmail.com> mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be an wise option.

However, if those NUMA nodes consist of different types of memory such as
socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based
interleave policy does not optimally distribute data to make use of their
different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows
each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes. Weighted interleave
allows for proportional distribution of memory across multiple numa nodes,
preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight
distribution is (2:1).

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches the
behavior of standard 1:1 round-robin interleave. An extension will be
added in the future to allow default values to be registered at kernel and
device bringup time.

The policy allocates a number of pages equal to the set weights. For
example, if the weights are (2,1), then 2 pages will be allocated on node0
for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

Some high level notes about the pieces of weighted interleave:

current->il_prev:
Tracks the node previously allocated from.

current->il_weight:
The active weight of the current node (current->il_prev)
When this reaches 0, current->il_prev is set to the next node
and current->il_weight is set to the next weight.

weighted_interleave_nodes:
Counts the number of allocations as they occur, and applies the
weight for the current node. When the weight reaches 0, switch
to the next node. Operates only on task->mempolicy.

weighted_interleave_nid:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the node based on the given index.
Operates on VMA policies.

bulk_array_weighted_interleave:
Gets the total weight of the nodemask as well as each individual
node weight, then calculates the number of "interleave rounds" as
well as any delta ("partial round"). Calculates the number of
pages for each node and allocates them.

If a node was scheduled for interleave via interleave_nodes, the
current weight will be allocated first.

Operates only on the task->mempolicy.

One piece of complexity is the interaction between a recent refactor which
split the logic to acquire the "ilx" (interleave index) of an allocation
and the actually application of the interleave. If a call to
alloc_pages_mpol() were made with a weighted-interleave policy and ilx set
to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA
policy - violating the description above.

An inspection of all callers of alloc_pages_mpol() shows that all external
callers set ilx to `0`, an index value, or will call get_vma_policy() to
acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks
all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the
`weighted_interleave_nodes()` and `weighted_interleave_nid()` policy
requirements (task/vma respectively).

Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff 0eb5085c Thu Nov 16 06:36:38 MST 2023 Heiko Carstens <hca@linux.ibm.com> arch: remove ARCH_TASK_STRUCT_ON_STACK

IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ON_STACK.
IA-64 was removed with commit cf8e8658100d ("arch: Remove Itanium (IA-64)
architecture"). Therefore remove support for ARCH_TASK_STRUCT_ON_STACK
as well.

Note: this also reveals a potential bug in powerpc code, which makes use of
__init_task_data without selecting ARCH_TASK_STRUCT_ON_STACK which makes
__init_task_data a no-op. This is broken since commit d11ed3ab3166 ("Expand
INIT_TASK() in init/init_task.c and remove") from 2018 and needs to be
addressed separately.

Link: https://lkml.kernel.org/r/20231116133638.1636277-4-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff e3ff7c60 Fri Feb 24 09:50:00 MST 2023 Josh Poimboeuf <jpoimboe@kernel.org> livepatch,sched: Add livepatch task switching to cond_resched()

There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.

Fix it by patching tasks in cond_resched().

There are four different flavors of cond_resched(), depending on the
kernel configuration. Hook into all of them.

A more elegant solution might be to use a preempt notifier. However,
non-ORC unwinders can't unwind a preempted task reliably.

[1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/
[2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org
diff ee3e3ac0 Tue Nov 22 13:39:05 MST 2022 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> rseq: Introduce extensible rseq ABI

Introduce the extensible rseq ABI, where the feature size supported by
the kernel and the required alignment are communicated to user-space
through ELF auxiliary vectors.

This allows user-space to call rseq registration with a rseq_len of
either 32 bytes for the original struct rseq size (which includes
padding), or larger.

If rseq_len is larger than 32 bytes, then it must be large enough to
contain the feature size communicated to user-space through ELF
auxiliary vectors.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221122203932.231377-4-mathieu.desnoyers@efficios.com
diff f80be457 Thu Sep 15 09:03:45 MDT 2022 Alexander Potapenko <glider@google.com> kmsan: add KMSAN runtime core

For each memory location KernelMemorySanitizer maintains two types of
metadata:

1. The so-called shadow of that location - а byte:byte mapping describing
whether or not individual bits of memory are initialized (shadow is 0)
or not (shadow is 1).
2. The origins of that location - а 4-byte:4-byte mapping containing
4-byte IDs of the stack traces where uninitialized values were
created.

Each struct page now contains pointers to two struct pages holding KMSAN
metadata (shadow and origins) for the original struct page. Utility
routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata
creation, addressing, copying and checking. mm/kmsan/report.c performs
error reporting in the cases an uninitialized value is used in a way that
leads to undefined behavior.

KMSAN compiler instrumentation is responsible for tracking the metadata
along with the kernel memory. mm/kmsan/instrumentation.c provides the
implementation for instrumentation hooks that are called from files
compiled with -fsanitize=kernel-memory.

To aid parameter passing (also done at instrumentation level), each
task_struct now contains a struct kmsan_task_state used to track the
metadata of function parameters and return values for that task.

Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares
CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The
KMSAN_SANITIZE:=n Makefile directive can be used to completely disable
KMSAN instrumentation for certain files.

Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly
created stack memory initialized.

Users can also use functions from include/linux/kmsan-checks.h to mark
certain memory regions as uninitialized or initialized (this is called
"poisoning" and "unpoisoning") or check that a particular region is
initialized.

Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff f80be457 Thu Sep 15 09:03:45 MDT 2022 Alexander Potapenko <glider@google.com> kmsan: add KMSAN runtime core

For each memory location KernelMemorySanitizer maintains two types of
metadata:

1. The so-called shadow of that location - а byte:byte mapping describing
whether or not individual bits of memory are initialized (shadow is 0)
or not (shadow is 1).
2. The origins of that location - а 4-byte:4-byte mapping containing
4-byte IDs of the stack traces where uninitialized values were
created.

Each struct page now contains pointers to two struct pages holding KMSAN
metadata (shadow and origins) for the original struct page. Utility
routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata
creation, addressing, copying and checking. mm/kmsan/report.c performs
error reporting in the cases an uninitialized value is used in a way that
leads to undefined behavior.

KMSAN compiler instrumentation is responsible for tracking the metadata
along with the kernel memory. mm/kmsan/instrumentation.c provides the
implementation for instrumentation hooks that are called from files
compiled with -fsanitize=kernel-memory.

To aid parameter passing (also done at instrumentation level), each
task_struct now contains a struct kmsan_task_state used to track the
metadata of function parameters and return values for that task.

Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares
CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The
KMSAN_SANITIZE:=n Makefile directive can be used to completely disable
KMSAN instrumentation for certain files.

Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly
created stack memory initialized.

Users can also use functions from include/linux/kmsan-checks.h to mark
certain memory regions as uninitialized or initialized (this is called
"poisoning" and "unpoisoning") or check that a particular region is
initialized.

Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff f80be457 Thu Sep 15 09:03:45 MDT 2022 Alexander Potapenko <glider@google.com> kmsan: add KMSAN runtime core

For each memory location KernelMemorySanitizer maintains two types of
metadata:

1. The so-called shadow of that location - а byte:byte mapping describing
whether or not individual bits of memory are initialized (shadow is 0)
or not (shadow is 1).
2. The origins of that location - а 4-byte:4-byte mapping containing
4-byte IDs of the stack traces where uninitialized values were
created.

Each struct page now contains pointers to two struct pages holding KMSAN
metadata (shadow and origins) for the original struct page. Utility
routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata
creation, addressing, copying and checking. mm/kmsan/report.c performs
error reporting in the cases an uninitialized value is used in a way that
leads to undefined behavior.

KMSAN compiler instrumentation is responsible for tracking the metadata
along with the kernel memory. mm/kmsan/instrumentation.c provides the
implementation for instrumentation hooks that are called from files
compiled with -fsanitize=kernel-memory.

To aid parameter passing (also done at instrumentation level), each
task_struct now contains a struct kmsan_task_state used to track the
metadata of function parameters and return values for that task.

Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares
CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The
KMSAN_SANITIZE:=n Makefile directive can be used to completely disable
KMSAN instrumentation for certain files.

Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly
created stack memory initialized.

Users can also use functions from include/linux/kmsan-checks.h to mark
certain memory regions as uninitialized or initialized (this is called
"poisoning" and "unpoisoning") or check that a particular region is
initialized.

Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff ec1c86b2 Sun Sep 18 02:00:02 MDT 2022 Yu Zhao <yuzhao@google.com> mm: multi-gen LRU: groundwork

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim. These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.

There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction. The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then. This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff ec1c86b2 Sun Sep 18 02:00:02 MDT 2022 Yu Zhao <yuzhao@google.com> mm: multi-gen LRU: groundwork

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim. These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.

There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction. The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then. This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff bb447999 Tue Jun 21 03:04:10 MDT 2022 Dietmar Eggemann <dietmar.eggemann@arm.com> sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()

effective_cpu_util() already has a `int cpu' parameter which allows to
retrieve the CPU capacity scale factor (or maximum CPU capacity) inside
this function via an arch_scale_cpu_capacity(cpu).

A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
per_cpu(cpu_scale, cpu) on such a system.
On all other architectures it is a a compile-time constant
(SCHED_CAPACITY_SCALE).

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
/linux-master/fs/nfs/
H A Dsuper.cdiff 4e04143c Thu Mar 16 07:07:51 MDT 2023 Ondrej Mosnacek <omosnace@redhat.com> fs_context: drop the unused lsm_flags member

This isn't ever used by VFS now, and it couldn't even work. Any FS that
uses the SECURITY_LSM_NATIVE_LABELS flag needs to also process the
value returned back from the LSM, so it needs to do its
security_sb_set_mnt_opts() call on its own anyway.

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
diff e33c267a Tue May 31 21:22:24 MDT 2022 Roman Gushchin <roman.gushchin@linux.dev> mm: shrinkers: provide shrinkers with names

Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.

This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.

In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.

The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40

[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff e33c267a Tue May 31 21:22:24 MDT 2022 Roman Gushchin <roman.gushchin@linux.dev> mm: shrinkers: provide shrinkers with names

Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.

This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.

In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.

The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40

[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff a6b5a28e Sat Nov 14 11:43:54 MST 2020 Dave Wysochanski <dwysocha@redhat.com> nfs: Convert to new fscache volume/cookie API

Change the nfs filesystem to support fscache's indexing rewrite and
reenable caching in nfs.

The following changes have been made:

(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole.

(2) The session cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). That takes three parameters: a string
representing the "volume" in the index, a string naming the cache to
use (or NULL) and a u64 that conveys coherency metadata for the
volume.

For nfs, I've made it render the volume name string as:

"nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]"

(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.

fscache_acquire_cookie() is passed the same keying and coherency
information as before.

(4) fscache_enable/disable_cookie() have been removed.

Call fscache_use_cookie() and fscache_unuse_cookie() when a file is
opened or closed to prevent a cache file from being culled and to keep
resources to hand that are needed to do I/O.

If a file is opened for writing, we invalidate it with
FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache,
thereby making it cease caching until all currently open files are
closed. This should give the same behaviour as the uptream code.
Making the cache store local modifications isn't straightforward for
NFS, so that's left for future patches.

(5) fscache_invalidate() now needs to be given uptodate auxiliary data and
a file size. It also takes a flag to indicate if this was due to a
DIO write.

(6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file
to which a DIO write is made.

(7) Call fscache_note_page_release() from nfs_release_page().

(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.

(9) The functions to read and write data to/from the cache are stubbed out
pending a conversion to use netfslib.

Changes
=======
ver #3:
- Added missing =n fallback for nfs_fscache_release_file()[1][2].

ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.
- Remove NFS_INO_FSCACHE as it's no longer used.
- Need to unuse a cookie on file-release, not inode-clear.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: linux-nfs@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1]
Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2]
Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
diff ca1a199e Mon Apr 15 18:19:40 MDT 2019 Al Viro <viro@zeniv.linux.org.uk> nfs{,4}: switch to ->free_inode()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff 4f253e1e Mon May 15 16:18:11 MDT 2017 Jan Kara <jack@suse.cz> nfs: Mark unnecessarily extern functions as static

nfs_initialise_sb() and nfs_clone_super() are declared as extern even
though they are used only in fs/nfs/super.c. Mark them as static.

Also remove explicit 'inline' directive from nfs_initialise_sb() and
leave it upto compiler to decide whether inlining is worth it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
diff 9e08ef1a Wed Nov 13 19:00:17 MST 2013 NeilBrown <neilb@suse.de> NFS: correctly report misuse of "migration" mount option.

The current test on valid use of the "migration" mount option can never
report an error as it will only do so if
mnt->version !=4 && mnt->minor_version != 0
(and some other condition), but if that test would succeed, then the previous
test has already gone-to out_minorversion_mismatch.

So change the && to an || to get correct semantics.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff 4d4b69dd Fri Oct 18 13:15:19 MDT 2013 Weston Andros Adamson <dros@netapp.com> NFS: add support for multiple sec= mount options

This patch adds support for multiple security options which can be
specified using a colon-delimited list of security flavors (the same
syntax as nfsd's exports file).

This is useful, for instance, when NFSv4.x mounts cross SECINFO
boundaries. With this patch a user can use "sec=krb5i,krb5p"
to mount a remote filesystem using krb5i, but can still cross
into krb5p-only exports.

New mounts will try all security options before failing. NFSv4.x
SECINFO results will be compared against the sec= flavors to
find the first flavor in both lists or if no match is found will
return -EPERM.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff cdf66442 Tue Jun 05 12:59:54 MDT 2012 Bryan Schumaker <bjschuma@netapp.com> NFS4: Set parsed mount data version to 4

This patch only affects mounting through "-t nfs4" since it doesn't set
up an nfs version to use in the mount data. The nfs client was trying
to mount using NFS v0, causing either a BUG() or a protocol not
supported message.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff 3862279a Fri Mar 02 12:06:39 MST 2012 Trond Myklebust <Trond.Myklebust@netapp.com> NFS: Consolidate the parsing of the '-ov4.x' and '-overs=4.x' mount options

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
/linux-master/net/ipv4/
H A Dudp.cdiff 4bb3ba7b Thu Mar 07 07:23:50 MST 2024 Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> udp: fix incorrect parameter validation in the udp_lib_getsockopt() function

The 'len' variable can't be negative when assigned the result of
'min_t' because all 'min_t' parameters are cast to unsigned int,
and then the minimum one is chosen.

To fix the logic, check 'len' as read from 'optlen',
where the types of relevant variables are (signed) int.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff f796feab Mon Feb 19 16:00:01 MST 2024 Paolo Abeni <pabeni@redhat.com> udp: add local "peek offset enabled" flag

We want to re-organize the struct sock layout. The sk_peek_off
field location is problematic, as most protocols want it in the
RX read area, while UDP wants it on a cacheline different from
sk_receive_queue.

Create a local (inside udp_sock) copy of the 'peek offset is enabled'
flag and place it inside the same cacheline of reader_queue.

Check such flag before reading sk_peek_off. This will save potential
false sharing and cache misses in the fast-path.

Tested under UDP flood with small packets. The struct sock layout
update causes a 4% performance drop, and this patch restores completely
the original tput.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/67ab679c15fbf49fa05b3ffe05d91c47ab84f147.1708426665.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF)

syzbot reported a lockdep violation [1] involving af_unix
support of SO_PEEK_OFF.

Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
sk_peek_off field), there is really no point to enforce a pointless
thread safety in the kernel.

After this patch :

- setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.

- skb_consume_udp() no longer has to acquire the socket lock.

- af_unix no longer needs a special version of sk_set_peek_off(),
because it does not lock u->iolock anymore.

As a followup, we could replace prot->set_peek_off to be a boolean
and avoid an indirect call, since we always use sk_set_peek_off().

[1]

WARNING: possible circular locking dependency detected
6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted

syz-executor.2/30025 is trying to acquire lock:
ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789

but task is already holding lock:
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
lock_sock_nested+0x48/0x100 net/core/sock.c:3524
lock_sock include/net/sock.h:1691 [inline]
__unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
___sys_recvmsg net/socket.c:2845 [inline]
do_recvmmsg+0x474/0xae0 net/socket.c:2939
__sys_recvmmsg net/socket.c:3018 [inline]
__do_sys_recvmmsg net/socket.c:3041 [inline]
__se_sys_recvmmsg net/socket.c:3034 [inline]
__x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

-> #0 (&u->iolock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(sk_lock-AF_UNIX);
lock(&u->iolock);
lock(sk_lock-AF_UNIX);
lock(&u->iolock);

*** DEADLOCK ***

1 lock held by syz-executor.2/30025:
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

stack backtrace:
CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f78a1c7dda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8

Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Daan De Meyer <daan.j.demeyer@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF)

syzbot reported a lockdep violation [1] involving af_unix
support of SO_PEEK_OFF.

Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
sk_peek_off field), there is really no point to enforce a pointless
thread safety in the kernel.

After this patch :

- setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.

- skb_consume_udp() no longer has to acquire the socket lock.

- af_unix no longer needs a special version of sk_set_peek_off(),
because it does not lock u->iolock anymore.

As a followup, we could replace prot->set_peek_off to be a boolean
and avoid an indirect call, since we always use sk_set_peek_off().

[1]

WARNING: possible circular locking dependency detected
6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted

syz-executor.2/30025 is trying to acquire lock:
ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789

but task is already holding lock:
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
lock_sock_nested+0x48/0x100 net/core/sock.c:3524
lock_sock include/net/sock.h:1691 [inline]
__unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
___sys_recvmsg net/socket.c:2845 [inline]
do_recvmmsg+0x474/0xae0 net/socket.c:2939
__sys_recvmmsg net/socket.c:3018 [inline]
__do_sys_recvmmsg net/socket.c:3041 [inline]
__se_sys_recvmmsg net/socket.c:3034 [inline]
__x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

-> #0 (&u->iolock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(sk_lock-AF_UNIX);
lock(&u->iolock);
lock(sk_lock-AF_UNIX);
lock(&u->iolock);

*** DEADLOCK ***

1 lock held by syz-executor.2/30025:
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

stack backtrace:
CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f78a1c7dda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8

Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Daan De Meyer <daan.j.demeyer@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF)

syzbot reported a lockdep violation [1] involving af_unix
support of SO_PEEK_OFF.

Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
sk_peek_off field), there is really no point to enforce a pointless
thread safety in the kernel.

After this patch :

- setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.

- skb_consume_udp() no longer has to acquire the socket lock.

- af_unix no longer needs a special version of sk_set_peek_off(),
because it does not lock u->iolock anymore.

As a followup, we could replace prot->set_peek_off to be a boolean
and avoid an indirect call, since we always use sk_set_peek_off().

[1]

WARNING: possible circular locking dependency detected
6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted

syz-executor.2/30025 is trying to acquire lock:
ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789

but task is already holding lock:
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
lock_sock_nested+0x48/0x100 net/core/sock.c:3524
lock_sock include/net/sock.h:1691 [inline]
__unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
___sys_recvmsg net/socket.c:2845 [inline]
do_recvmmsg+0x474/0xae0 net/socket.c:2939
__sys_recvmmsg net/socket.c:3018 [inline]
__do_sys_recvmmsg net/socket.c:3041 [inline]
__se_sys_recvmmsg net/socket.c:3034 [inline]
__x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

-> #0 (&u->iolock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(sk_lock-AF_UNIX);
lock(&u->iolock);
lock(sk_lock-AF_UNIX);
lock(&u->iolock);

*** DEADLOCK ***

1 lock held by syz-executor.2/30025:
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

stack backtrace:
CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f78a1c7dda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8

Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Daan De Meyer <daan.j.demeyer@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF)

syzbot reported a lockdep violation [1] involving af_unix
support of SO_PEEK_OFF.

Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
sk_peek_off field), there is really no point to enforce a pointless
thread safety in the kernel.

After this patch :

- setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.

- skb_consume_udp() no longer has to acquire the socket lock.

- af_unix no longer needs a special version of sk_set_peek_off(),
because it does not lock u->iolock anymore.

As a followup, we could replace prot->set_peek_off to be a boolean
and avoid an indirect call, since we always use sk_set_peek_off().

[1]

WARNING: possible circular locking dependency detected
6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted

syz-executor.2/30025 is trying to acquire lock:
ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789

but task is already holding lock:
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
lock_sock_nested+0x48/0x100 net/core/sock.c:3524
lock_sock include/net/sock.h:1691 [inline]
__unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
___sys_recvmsg net/socket.c:2845 [inline]
do_recvmmsg+0x474/0xae0 net/socket.c:2939
__sys_recvmmsg net/socket.c:3018 [inline]
__do_sys_recvmmsg net/socket.c:3041 [inline]
__se_sys_recvmmsg net/socket.c:3034 [inline]
__x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

-> #0 (&u->iolock){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(sk_lock-AF_UNIX);
lock(&u->iolock);
lock(sk_lock-AF_UNIX);
lock(&u->iolock);

*** DEADLOCK ***

1 lock held by syz-executor.2/30025:
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
#0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193

stack backtrace:
CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
check_prev_add kernel/locking/lockdep.c:3134 [inline]
check_prevs_add kernel/locking/lockdep.c:3253 [inline]
validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
__lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
__mutex_lock_common kernel/locking/mutex.c:608 [inline]
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
sk_setsockopt+0x207e/0x3360
do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
__sys_setsockopt+0x1ad/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xf9/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f78a1c7dda9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8

Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Daan De Meyer <daan.j.demeyer@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 482521d8 Fri Jan 12 03:44:27 MST 2024 Eric Dumazet <edumazet@google.com> udp: annotate data-races around up->pending

up->pending can be read without holding the socket lock,
as pointed out by syzbot [1]

Add READ_ONCE() in lockless contexts, and WRITE_ONCE()
on write side.

[1]
BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg

write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1:
udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x257/0x310 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b

read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0:
udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586
___sys_sendmsg net/socket.c:2640 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2726
__do_sys_sendmmsg net/socket.c:2755 [inline]
__se_sys_sendmmsg net/socket.c:2752 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b

value changed: 0x00000000 -> 0x0000000a

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G W 6.7.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 482521d8 Fri Jan 12 03:44:27 MST 2024 Eric Dumazet <edumazet@google.com> udp: annotate data-races around up->pending

up->pending can be read without holding the socket lock,
as pointed out by syzbot [1]

Add READ_ONCE() in lockless contexts, and WRITE_ONCE()
on write side.

[1]
BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg

write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1:
udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x257/0x310 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b

read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0:
udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586
___sys_sendmsg net/socket.c:2640 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2726
__do_sys_sendmmsg net/socket.c:2755 [inline]
__se_sys_sendmmsg net/socket.c:2752 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b

value changed: 0x00000000 -> 0x0000000a

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G W 6.7.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff 0f495f76 Thu Jul 20 09:30:08 MDT 2023 Lorenz Bauer <lmb@isovalent.com> net: remove duplicate reuseport_lookup functions

There are currently four copies of reuseport_lookup: one each for
(TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of
those functions as well. This is already the case for sk_lookup
helpers (inet,inet6,udp4,udp6)_lookup_run_bpf.

There are two differences between the reuseport_lookup helpers:

1. They call different hash functions depending on protocol
2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED

Move the check for sk_state into the caller and use the INDIRECT_CALL
infrastructure to cut down the helpers to one per IP version.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-4-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
diff e1d001fa Fri Jun 09 09:27:42 MDT 2023 Breno Leitao <leitao@debian.org> net: ioctl: Use kernel memory on protocol ioctl callbacks

Most of the ioctls to net protocols operates directly on userspace
argument (arg). Usually doing get_user()/put_user() directly in the
ioctl callback. This is not flexible, because it is hard to reuse these
functions without passing userspace buffers.

Change the "struct proto" ioctls to avoid touching userspace memory and
operate on kernel buffers, i.e., all protocol's ioctl callbacks is
adapted to operate on a kernel memory other than on userspace (so, no
more {put,get}_user() and friends being called in the ioctl callback).

This changes the "struct proto" ioctl format in the following way:

int (*ioctl)(struct sock *sk, int cmd,
- unsigned long arg);
+ int *karg);

(Important to say that this patch does not touch the "struct proto_ops"
protocols)

So, the "karg" argument, which is passed to the ioctl callback, is a
pointer allocated to kernel space memory (inside a function wrapper).
This buffer (karg) may contain input argument (copied from userspace in
a prep function) and it might return a value/buffer, which is copied
back to userspace if necessary. There is not one-size-fits-all format
(that is I am using 'may' above), but basically, there are three type of
ioctls:

1) Do not read from userspace, returns a result to userspace
2) Read an input parameter from userspace, and does not return anything
to userspace
3) Read an input from userspace, and return a buffer to userspace.

The default case (1) (where no input parameter is given, and an "int" is
returned to userspace) encompasses more than 90% of the cases, but there
are two other exceptions. Here is a list of exceptions:

* Protocol RAW:
* cmd = SIOCGETVIFCNT:
* input and output = struct sioc_vif_req
* cmd = SIOCGETSGCNT
* input and output = struct sioc_sg_req
* Explanation: for the SIOCGETVIFCNT case, userspace passes the input
argument, which is struct sioc_vif_req. Then the callback populates
the struct, which is copied back to userspace.

* Protocol RAW6:
* cmd = SIOCGETMIFCNT_IN6
* input and output = struct sioc_mif_req6
* cmd = SIOCGETSGCNT_IN6
* input and output = struct sioc_sg_req6

* Protocol PHONET:
* cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE
* input int (4 bytes)
* Nothing is copied back to userspace.

For the exception cases, functions sock_sk_ioctl_inout() will
copy the userspace input, and copy it back to kernel space.

The wrapper that prepare the buffer and put the buffer back to user is
sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now
calls sk_ioctl(), which will handle all cases.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Completed in 2093 milliseconds