Searched +hist:4 +hist:be929be (Results 1 - 19 of 19) sorted by relevance
/linux-master/fs/ocfs2/ | ||
H A D | blockcheck.c | diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/drivers/dma/ | ||
H A D | timb_dma.c | diff 4f339d6e Tue Sep 19 07:32:01 MDT 2023 Uwe Kleine-König <u.kleine-koenig@pengutronix.de> dmaengine: timb_dma: Convert to platform remove callback returning void The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new() which already returns void. Eventually after all drivers are converted, .remove_new() is renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Link: https://lore.kernel.org/r/20230919133207.1400430-54-u.kleine-koenig@pengutronix.de Signed-off-by: Vinod Koul <vkoul@kernel.org> diff 4bf27b8b Fri Dec 21 16:09:59 MST 2012 Greg Kroah-Hartman <gregkh@linuxfoundation.org> Drivers: dma: remove __dev* attributes. CONFIG_HOTPLUG is going away as an option. As a result, the __dev* markings need to be removed. This change removes the use of __devinit, __devexit_p, __devinitconst, and __devexit from these drivers. Based on patches originally written by Bill Pemberton, but redone by me in order to handle some of the coding style issues better, by hand. Cc: Bill Pemberton <wfp5p@virginia.edu> Cc: Viresh Kumar <viresh.linux@gmail.com> Cc: Dan Williams <djbw@fb.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: Barry Song <baohua.song@csr.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Jassi Brar <jassisinghbrar@gmail.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Bill Pemberton <wfp5p@virginia.edu> Cc: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff 4d4e58de Tue Mar 06 15:34:06 MST 2012 Russell King - ARM Linux <linux@arm.linux.org.uk> dmaengine: move last completed cookie into generic dma_chan structure Every DMA engine implementation declares a last completed dma cookie in their private dma channel structures. This is pointless, and forces driver specific code. Move this out into the common dma_chan structure. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Tested-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Jassi Brar <jassisinghbrar@gmail.com> [imx-sdma.c & mxs-dma.c] Tested-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@linux.intel.com> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/net/9p/ | ||
H A D | protocol.c | diff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size() This new function calculates a buffer size suitable for holding the intended 9p request or response. For rather small message types (which applies to almost all 9p message types actually) simply use hard coded values. For some variable-length and potentially large message types calculate a more precise value according to what data is actually transmitted to avoid unnecessarily huge buffers. So p9_msg_buf_size() divides the individual 9p message types into 3 message size categories: - dynamically calculated message size (i.e. potentially large) - 8k hard coded message size - 4k hard coded message size As for the latter two hard coded message types: for most 9p message types it is pretty obvious whether they would always fit into 4k or 8k. But for some of them it depends on the maximum directory entry name length allowed by OS and filesystem for determining into which of the two size categories they would fit into. Currently Linux supports directory entry names up to NAME_MAX (255), however when comparing the limitation of individual filesystems, ReiserFS theoretically supports up to slightly below 4k long names. So in order to make this code more future proof, and as revisiting it later on is a bit tedious and has the potential to miss out details, the decision [1] was made to take 4k as basis as for max. name length. Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1] Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> diff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size() This new function calculates a buffer size suitable for holding the intended 9p request or response. For rather small message types (which applies to almost all 9p message types actually) simply use hard coded values. For some variable-length and potentially large message types calculate a more precise value according to what data is actually transmitted to avoid unnecessarily huge buffers. So p9_msg_buf_size() divides the individual 9p message types into 3 message size categories: - dynamically calculated message size (i.e. potentially large) - 8k hard coded message size - 4k hard coded message size As for the latter two hard coded message types: for most 9p message types it is pretty obvious whether they would always fit into 4k or 8k. But for some of them it depends on the maximum directory entry name length allowed by OS and filesystem for determining into which of the two size categories they would fit into. Currently Linux supports directory entry names up to NAME_MAX (255), however when comparing the limitation of individual filesystems, ReiserFS theoretically supports up to slightly below 4k long names. So in order to make this code more future proof, and as revisiting it later on is a bit tedious and has the potential to miss out details, the decision [1] was made to take 4k as basis as for max. name length. Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1] Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> diff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size() This new function calculates a buffer size suitable for holding the intended 9p request or response. For rather small message types (which applies to almost all 9p message types actually) simply use hard coded values. For some variable-length and potentially large message types calculate a more precise value according to what data is actually transmitted to avoid unnecessarily huge buffers. So p9_msg_buf_size() divides the individual 9p message types into 3 message size categories: - dynamically calculated message size (i.e. potentially large) - 8k hard coded message size - 4k hard coded message size As for the latter two hard coded message types: for most 9p message types it is pretty obvious whether they would always fit into 4k or 8k. But for some of them it depends on the maximum directory entry name length allowed by OS and filesystem for determining into which of the two size categories they would fit into. Currently Linux supports directory entry names up to NAME_MAX (255), however when comparing the limitation of individual filesystems, ReiserFS theoretically supports up to slightly below 4k long names. So in order to make this code more future proof, and as revisiting it later on is a bit tedious and has the potential to miss out details, the decision [1] was made to take 4k as basis as for max. name length. Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1] Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> diff 1effdbf9 Fri Jul 15 15:32:34 MDT 2022 Christian Schoenebeck <linux_oss@crudebyte.com> net/9p: add p9_msg_buf_size() This new function calculates a buffer size suitable for holding the intended 9p request or response. For rather small message types (which applies to almost all 9p message types actually) simply use hard coded values. For some variable-length and potentially large message types calculate a more precise value according to what data is actually transmitted to avoid unnecessarily huge buffers. So p9_msg_buf_size() divides the individual 9p message types into 3 message size categories: - dynamically calculated message size (i.e. potentially large) - 8k hard coded message size - 4k hard coded message size As for the latter two hard coded message types: for most 9p message types it is pretty obvious whether they would always fit into 4k or 8k. But for some of them it depends on the maximum directory entry name length allowed by OS and filesystem for determining into which of the two size categories they would fit into. Currently Linux supports directory entry names up to NAME_MAX (255), however when comparing the limitation of individual filesystems, ReiserFS theoretically supports up to slightly below 4k long names. So in order to make this code more future proof, and as revisiting it later on is a bit tedious and has the potential to miss out details, the decision [1] was made to take 4k as basis as for max. name length. Link: https://lkml.kernel.org/r/bd6be891cf67e867688e8c8796d06408bfafa0d9.1657920926.git.linux_oss@crudebyte.com Link: https://lore.kernel.org/all/5564296.oo812IJUPE@silver/ [1] Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> diff ef5305f1 Fri Sep 07 09:36:08 MDT 2018 Dominique Martinet <dominique.martinet@cea.fr> 9p: p9dirent_read: check network-provided name length strcpy to dirent->d_name could overflow the buffer, use strscpy to check the provided string length and error out if the size was too big. While we are here, make the function return an error when the pdu parsing failed, instead of returning the pdu offset as if it had been a success... Link: http://lkml.kernel.org/r/1536339057-21974-4-git-send-email-asmadeus@codewreck.org Addresses-Coverity-ID: 139133 ("Copy into fixed size buffer") Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> diff 6da2ec56 Tue Jun 12 14:55:00 MDT 2018 Kees Cook <keescook@chromium.org> treewide: kmalloc() -> kmalloc_array() The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org> diff 4f3b35c1 Wed Apr 01 17:57:53 MDT 2015 Al Viro <viro@zeniv.linux.org.uk> net/9p: switch the guts of p9_client_{read,write}() to iov_iter ... and have get_user_pages_fast() mapping fewer pages than requested to generate a short read/write. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 348b5901 Sat Aug 06 13:16:59 MDT 2011 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> net/9p: Convert net/9p protocol dumps to tracepoints This helps in more control over debugging. root@qemu-img-64:~# ls /pass/123 ls: cannot access /pass/123: No such file or directory root@qemu-img-64:~# cat /sys/kernel/debug/tracing/trace # tracer: nop # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | ls-1536 [001] 70.928584: 9p_protocol_dump: clnt 18446612132784021504 P9_TWALK(tag = 1) 000: 16 00 00 00 6e 01 00 01 00 00 00 02 00 00 00 01 010: 00 03 00 31 32 33 00 00 00 ff ff ff ff 00 00 00 ls-1536 [001] 70.928587: <stack trace> => trace_9p_protocol_dump => p9pdu_finalize => p9_client_rpc => p9_client_walk => v9fs_vfs_lookup => d_alloc_and_lookup => walk_component => path_lookupat ls-1536 [000] 70.929696: 9p_protocol_dump: clnt 18446612132784021504 P9_RLERROR(tag = 1) 000: 0b 00 00 00 07 01 00 02 00 00 00 4e 03 00 02 00 010: 00 00 00 00 03 00 02 00 00 00 00 00 ff 43 00 00 ls-1536 [000] 70.929697: <stack trace> => trace_9p_protocol_dump => p9_client_rpc => p9_client_walk => v9fs_vfs_lookup => d_alloc_and_lookup => walk_component => path_lookupat => do_path_lookup Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol. SYNOPSIS size[4] Tsetattr tag[2] attr[n] size[4] Rsetattr tag[2] DESCRIPTION The setattr command changes some of the file status information. attr resembles the iattr structure used in Linux kernel. It specifies which status parameter is to be changed and to what value. It is laid out as follows: valid[4] specifies which status information is to be changed. Possible values are: ATTR_MODE (1 << 0) ATTR_UID (1 << 1) ATTR_GID (1 << 2) ATTR_SIZE (1 << 3) ATTR_ATIME (1 << 4) ATTR_MTIME (1 << 5) ATTR_ATIME_SET (1 << 7) ATTR_MTIME_SET (1 << 8) The last two bits represent whether the time information is being sent by the client's user space. In the absense of these bits the server always uses server's time. mode[4] File permission bits uid[4] Owner id of file gid[4] Group id of the file size[8] File size atime_sec[8] Time of last file access, seconds atime_nsec[8] Time of last file access, nanoseconds mtime_sec[8] Time of last file modification, seconds mtime_nsec[8] Time of last file modification, nanoseconds Explanation of the patches: -------------------------- *) The kernel just copies relevent contents of iattr structure to p9_iattr_dotl structure and passes it down to the client. The only check it has is calling inode_change_ok() *) The p9_iattr_dotl structure does not have ctime and ia_file parameters because I don't think these are needed in our case. The client user space can request updating just ctime by calling chown(fd, -1, -1). This is handled on server side without a need for putting ctime on the wire. *) The server currently supports changing mode, time, ownership and size of the file. *) 9P RFC says "Either all the changes in wstat request happen, or none of them does: if the request succeeds, all changes were made; if it fails, none were." I have not done anything to implement this specifically because I don't see a reason. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol. SYNOPSIS size[4] Tsetattr tag[2] attr[n] size[4] Rsetattr tag[2] DESCRIPTION The setattr command changes some of the file status information. attr resembles the iattr structure used in Linux kernel. It specifies which status parameter is to be changed and to what value. It is laid out as follows: valid[4] specifies which status information is to be changed. Possible values are: ATTR_MODE (1 << 0) ATTR_UID (1 << 1) ATTR_GID (1 << 2) ATTR_SIZE (1 << 3) ATTR_ATIME (1 << 4) ATTR_MTIME (1 << 5) ATTR_ATIME_SET (1 << 7) ATTR_MTIME_SET (1 << 8) The last two bits represent whether the time information is being sent by the client's user space. In the absense of these bits the server always uses server's time. mode[4] File permission bits uid[4] Owner id of file gid[4] Group id of the file size[8] File size atime_sec[8] Time of last file access, seconds atime_nsec[8] Time of last file access, nanoseconds mtime_sec[8] Time of last file modification, seconds mtime_nsec[8] Time of last file modification, nanoseconds Explanation of the patches: -------------------------- *) The kernel just copies relevent contents of iattr structure to p9_iattr_dotl structure and passes it down to the client. The only check it has is calling inode_change_ok() *) The p9_iattr_dotl structure does not have ctime and ia_file parameters because I don't think these are needed in our case. The client user space can request updating just ctime by calling chown(fd, -1, -1). This is handled on server side without a need for putting ctime on the wire. *) The server currently supports changing mode, time, ownership and size of the file. *) 9P RFC says "Either all the changes in wstat request happen, or none of them does: if the request succeeds, all changes were made; if it fails, none were." I have not done anything to implement this specifically because I don't see a reason. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol. SYNOPSIS size[4] Tsetattr tag[2] attr[n] size[4] Rsetattr tag[2] DESCRIPTION The setattr command changes some of the file status information. attr resembles the iattr structure used in Linux kernel. It specifies which status parameter is to be changed and to what value. It is laid out as follows: valid[4] specifies which status information is to be changed. Possible values are: ATTR_MODE (1 << 0) ATTR_UID (1 << 1) ATTR_GID (1 << 2) ATTR_SIZE (1 << 3) ATTR_ATIME (1 << 4) ATTR_MTIME (1 << 5) ATTR_ATIME_SET (1 << 7) ATTR_MTIME_SET (1 << 8) The last two bits represent whether the time information is being sent by the client's user space. In the absense of these bits the server always uses server's time. mode[4] File permission bits uid[4] Owner id of file gid[4] Group id of the file size[8] File size atime_sec[8] Time of last file access, seconds atime_nsec[8] Time of last file access, nanoseconds mtime_sec[8] Time of last file modification, seconds mtime_nsec[8] Time of last file modification, nanoseconds Explanation of the patches: -------------------------- *) The kernel just copies relevent contents of iattr structure to p9_iattr_dotl structure and passes it down to the client. The only check it has is calling inode_change_ok() *) The p9_iattr_dotl structure does not have ctime and ia_file parameters because I don't think these are needed in our case. The client user space can request updating just ctime by calling chown(fd, -1, -1). This is handled on server side without a need for putting ctime on the wire. *) The server currently supports changing mode, time, ownership and size of the file. *) 9P RFC says "Either all the changes in wstat request happen, or none of them does: if the request succeeds, all changes were made; if it fails, none were." I have not done anything to implement this specifically because I don't see a reason. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol. SYNOPSIS size[4] Tsetattr tag[2] attr[n] size[4] Rsetattr tag[2] DESCRIPTION The setattr command changes some of the file status information. attr resembles the iattr structure used in Linux kernel. It specifies which status parameter is to be changed and to what value. It is laid out as follows: valid[4] specifies which status information is to be changed. Possible values are: ATTR_MODE (1 << 0) ATTR_UID (1 << 1) ATTR_GID (1 << 2) ATTR_SIZE (1 << 3) ATTR_ATIME (1 << 4) ATTR_MTIME (1 << 5) ATTR_ATIME_SET (1 << 7) ATTR_MTIME_SET (1 << 8) The last two bits represent whether the time information is being sent by the client's user space. In the absense of these bits the server always uses server's time. mode[4] File permission bits uid[4] Owner id of file gid[4] Group id of the file size[8] File size atime_sec[8] Time of last file access, seconds atime_nsec[8] Time of last file access, nanoseconds mtime_sec[8] Time of last file modification, seconds mtime_nsec[8] Time of last file modification, nanoseconds Explanation of the patches: -------------------------- *) The kernel just copies relevent contents of iattr structure to p9_iattr_dotl structure and passes it down to the client. The only check it has is calling inode_change_ok() *) The p9_iattr_dotl structure does not have ctime and ia_file parameters because I don't think these are needed in our case. The client user space can request updating just ctime by calling chown(fd, -1, -1). This is handled on server side without a need for putting ctime on the wire. *) The server currently supports changing mode, time, ownership and size of the file. *) 9P RFC says "Either all the changes in wstat request happen, or none of them does: if the request succeeds, all changes were made; if it fails, none were." I have not done anything to implement this specifically because I don't see a reason. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol. SYNOPSIS size[4] Tsetattr tag[2] attr[n] size[4] Rsetattr tag[2] DESCRIPTION The setattr command changes some of the file status information. attr resembles the iattr structure used in Linux kernel. It specifies which status parameter is to be changed and to what value. It is laid out as follows: valid[4] specifies which status information is to be changed. Possible values are: ATTR_MODE (1 << 0) ATTR_UID (1 << 1) ATTR_GID (1 << 2) ATTR_SIZE (1 << 3) ATTR_ATIME (1 << 4) ATTR_MTIME (1 << 5) ATTR_ATIME_SET (1 << 7) ATTR_MTIME_SET (1 << 8) The last two bits represent whether the time information is being sent by the client's user space. In the absense of these bits the server always uses server's time. mode[4] File permission bits uid[4] Owner id of file gid[4] Group id of the file size[8] File size atime_sec[8] Time of last file access, seconds atime_nsec[8] Time of last file access, nanoseconds mtime_sec[8] Time of last file modification, seconds mtime_nsec[8] Time of last file modification, nanoseconds Explanation of the patches: -------------------------- *) The kernel just copies relevent contents of iattr structure to p9_iattr_dotl structure and passes it down to the client. The only check it has is calling inode_change_ok() *) The p9_iattr_dotl structure does not have ctime and ia_file parameters because I don't think these are needed in our case. The client user space can request updating just ctime by calling chown(fd, -1, -1). This is handled on server side without a need for putting ctime on the wire. *) The server currently supports changing mode, time, ownership and size of the file. *) 9P RFC says "Either all the changes in wstat request happen, or none of them does: if the request succeeds, all changes were made; if it fails, none were." I have not done anything to implement this specifically because I don't see a reason. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol. SYNOPSIS size[4] Tsetattr tag[2] attr[n] size[4] Rsetattr tag[2] DESCRIPTION The setattr command changes some of the file status information. attr resembles the iattr structure used in Linux kernel. It specifies which status parameter is to be changed and to what value. It is laid out as follows: valid[4] specifies which status information is to be changed. Possible values are: ATTR_MODE (1 << 0) ATTR_UID (1 << 1) ATTR_GID (1 << 2) ATTR_SIZE (1 << 3) ATTR_ATIME (1 << 4) ATTR_MTIME (1 << 5) ATTR_ATIME_SET (1 << 7) ATTR_MTIME_SET (1 << 8) The last two bits represent whether the time information is being sent by the client's user space. In the absense of these bits the server always uses server's time. mode[4] File permission bits uid[4] Owner id of file gid[4] Group id of the file size[8] File size atime_sec[8] Time of last file access, seconds atime_nsec[8] Time of last file access, nanoseconds mtime_sec[8] Time of last file modification, seconds mtime_nsec[8] Time of last file modification, nanoseconds Explanation of the patches: -------------------------- *) The kernel just copies relevent contents of iattr structure to p9_iattr_dotl structure and passes it down to the client. The only check it has is calling inode_change_ok() *) The p9_iattr_dotl structure does not have ctime and ia_file parameters because I don't think these are needed in our case. The client user space can request updating just ctime by calling chown(fd, -1, -1). This is handled on server side without a need for putting ctime on the wire. *) The server currently supports changing mode, time, ownership and size of the file. *) 9P RFC says "Either all the changes in wstat request happen, or none of them does: if the request succeeds, all changes were made; if it fails, none were." I have not done anything to implement this specifically because I don't see a reason. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> diff 87d7845a Fri Jun 18 00:20:10 MDT 2010 Sripathi Kodi <sripathik@in.ibm.com> 9p: Implement client side of setattr for 9P2000.L protocol. SYNOPSIS size[4] Tsetattr tag[2] attr[n] size[4] Rsetattr tag[2] DESCRIPTION The setattr command changes some of the file status information. attr resembles the iattr structure used in Linux kernel. It specifies which status parameter is to be changed and to what value. It is laid out as follows: valid[4] specifies which status information is to be changed. Possible values are: ATTR_MODE (1 << 0) ATTR_UID (1 << 1) ATTR_GID (1 << 2) ATTR_SIZE (1 << 3) ATTR_ATIME (1 << 4) ATTR_MTIME (1 << 5) ATTR_ATIME_SET (1 << 7) ATTR_MTIME_SET (1 << 8) The last two bits represent whether the time information is being sent by the client's user space. In the absense of these bits the server always uses server's time. mode[4] File permission bits uid[4] Owner id of file gid[4] Group id of the file size[8] File size atime_sec[8] Time of last file access, seconds atime_nsec[8] Time of last file access, nanoseconds mtime_sec[8] Time of last file modification, seconds mtime_nsec[8] Time of last file modification, nanoseconds Explanation of the patches: -------------------------- *) The kernel just copies relevent contents of iattr structure to p9_iattr_dotl structure and passes it down to the client. The only check it has is calling inode_change_ok() *) The p9_iattr_dotl structure does not have ctime and ia_file parameters because I don't think these are needed in our case. The client user space can request updating just ctime by calling chown(fd, -1, -1). This is handled on server side without a need for putting ctime on the wire. *) The server currently supports changing mode, time, ownership and size of the file. *) 9P RFC says "Either all the changes in wstat request happen, or none of them does: if the request succeeds, all changes were made; if it fails, none were." I have not done anything to implement this specifically because I don't see a reason. Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> |
/linux-master/net/sunrpc/ | ||
H A D | rpcb_clnt.c | diff 4f3ed837 Wed Oct 11 02:00:22 MDT 2023 Dan Carpenter <dan.carpenter@linaro.org> SUNRPC: Add an IS_ERR() check back to where it was This IS_ERR() check was deleted during in a cleanup because, at the time, the rpcb_call_async() function could not return an error pointer. That changed in commit 25cf32ad5dba ("SUNRPC: Handle allocation failure in rpc_new_task()") and now it can return an error pointer. Put the check back. A related revert was done in commit 13bd90141804 ("Revert "SUNRPC: Remove unreachable error condition""). Fixes: 037e910b52b0 ("SUNRPC: Remove unreachable error condition in rpcb_getport_async()") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> diff 81c88b18 Thu Dec 20 08:35:11 MST 2018 J. Bruce Fields <bfields@redhat.com> sunrpc: handle ENOMEM in rpcb_getport_async If we ignore the error we'll hit a null dereference a little later. Reported-by: syzbot+4b98281f2401ab849f4b@syzkaller.appspotmail.com Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> diff 4e0038b6 Thu Mar 01 15:01:05 MST 2012 Trond Myklebust <Trond.Myklebust@netapp.com> SUNRPC: Move clnt->cl_server into struct rpc_xprt When the cl_xprt field is updated, the cl_server field will also have to change. Since the contents of cl_server follow the remote endpoint of cl_xprt, just move that field to the rpc_xprt. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> [ cel: simplify check_gss_callback_principal(), whitespace changes ] [ cel: forward ported to 3.4 ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 5a0e3ad6 Wed Mar 24 02:04:11 MDT 2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> diff c526611d Thu Dec 03 13:58:56 MST 2009 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Use a cached RPC client and transport for rpcbind upcalls The kernel's rpcbind client creates and deletes an rpc_clnt and its underlying transport socket for every upcall to the local rpcbind daemon. When starting a typical NFS server on IPv4 and IPv6, the NFS service itself does three upcalls (one per version) times two upcalls (one per transport) times two upcalls (one per address family), making 12, plus another one for the initial call to unregister previous NFS services. Starting the NLM service adds an additional 13 upcalls, for similar reasons. (Currently the NFS service doesn't start IPv6 listeners, but it will soon enough). Instead, let's create an rpc_clnt for rpcbind upcalls during the first local rpcbind query, and cache it. This saves the overhead of creating and destroying an rpc_clnt and a socket for every upcall. The new logic also prevents the kernel from attempting an RPCB_SET or RPCB_UNSET if it knows from the start that the local portmapper does not support rpcbind protocol version 4. This will cut down on the number of rpcbind upcalls in legacy environments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon Introduce a new API to register RPC services on IPv6 interfaces to allow the NFS server and lockd to advertise on IPv6 networks. Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind protocol version 4 to contact the local rpcbind daemon. The version 4 SET/UNSET procedures allow services to register address families besides AF_INET, register at specific network interfaces, and register transport protocols besides UDP and TCP. All of this functionality is exposed via the new rpcb_v4_register() kernel API. A user-space rpcbind daemon implementation that supports version 4 of the rpcbind protocol is required in order to make use of this new API. Note that rpcbind version 3 is sufficient to support the new rpcbind facilities listed above, but most extant implementations use version 4. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon Introduce a new API to register RPC services on IPv6 interfaces to allow the NFS server and lockd to advertise on IPv6 networks. Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind protocol version 4 to contact the local rpcbind daemon. The version 4 SET/UNSET procedures allow services to register address families besides AF_INET, register at specific network interfaces, and register transport protocols besides UDP and TCP. All of this functionality is exposed via the new rpcb_v4_register() kernel API. A user-space rpcbind daemon implementation that supports version 4 of the rpcbind protocol is required in order to make use of this new API. Note that rpcbind version 3 is sufficient to support the new rpcbind facilities listed above, but most extant implementations use version 4. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon Introduce a new API to register RPC services on IPv6 interfaces to allow the NFS server and lockd to advertise on IPv6 networks. Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind protocol version 4 to contact the local rpcbind daemon. The version 4 SET/UNSET procedures allow services to register address families besides AF_INET, register at specific network interfaces, and register transport protocols besides UDP and TCP. All of this functionality is exposed via the new rpcb_v4_register() kernel API. A user-space rpcbind daemon implementation that supports version 4 of the rpcbind protocol is required in order to make use of this new API. Note that rpcbind version 3 is sufficient to support the new rpcbind facilities listed above, but most extant implementations use version 4. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> diff c2e1b09f Mon Jul 14 14:03:30 MDT 2008 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon Introduce a new API to register RPC services on IPv6 interfaces to allow the NFS server and lockd to advertise on IPv6 networks. Unlike rpcb_register(), the new rpcb_v4_register() function uses rpcbind protocol version 4 to contact the local rpcbind daemon. The version 4 SET/UNSET procedures allow services to register address families besides AF_INET, register at specific network interfaces, and register transport protocols besides UDP and TCP. All of this functionality is exposed via the new rpcb_v4_register() kernel API. A user-space rpcbind daemon implementation that supports version 4 of the rpcbind protocol is required in order to make use of this new API. Note that rpcbind version 3 is sufficient to support the new rpcbind facilities listed above, but most extant implementations use version 4. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> |
/linux-master/net/dccp/ | ||
H A D | options.c | diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> diff f17a37c9 Wed Nov 10 13:20:07 MST 2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Ack Vector interface clean-up This patch brings the Ack Vector interface up to date. Its main purpose is to lay the basis for the subsequent patches of this set, which will use the new data structure fields and routines. There are no real algorithmic changes, rather an adaptation: (1) Replaced the static Ack Vector size (2) with a #define so that it can be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems to be sufficient for the moment) and added a solution so that computing the ECN nonce will continue to work - even with larger Ack Vectors. (2) Replaced the #defines for Ack Vector states with a complete enum. (3) Replaced #defines to compute Ack Vector length and state with general purpose routines (inlines), and updated code to use these. (4) Added a `tail' field (conversion to circular buffer in subsequent patch). (5) Updated the (outdated) documentation for Ack Vector struct. (6) All sequence number containers now trimmed to 48 bits. (7) Removal of unused bits: * removed dccpav_ack_nonce from struct dccp_ackvec, since this is already redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record); * removed Elapsed Time for Ack Vectors (it was nowhere used); * replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since the code needs to be able to remember the old run length; * reduced the de-/allocation routines (redundant / duplicate tests). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> diff 59b80802 Mon Jun 21 19:14:35 MDT 2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: make implementation of Syn-RTT symmetric This patch is thanks to Andre Noll who reported the issue and helped testing. The Syn-RTT sampled during the initial handshake currently only works for the client sending the DCCP-Request. TFRC penalizes the absence of an RTT sample with a very slow initial speed (1 packet per second), which delays slow-start significantly, resulting in sluggish performance. This patch mirrors the "Syn RTT" principle by adding a timestamp also onto the DCCP-Response, producing an RTT sample when the (Data)Ack completing the handshake arrives. Also changed the documentation to 'TFRC' since Syn RTTs are also used by CCID-4. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 883ca833 Fri Jan 16 16:36:32 MST 2009 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Initialisation and type-checking of feature sysctls This patch takes care of initialising and type-checking sysctls related to feature negotiation. Type checking is important since some of the sysctls now directly impact the feature-negotiation process. The sysctls are initialised with the known default values for each feature. For the type-checking the value constraints from RFC 4340 are used: * Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes), tested and confirmed that it works up to 4294967295 - for Gbps speed; * Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer); * CCIDs are between 0 .. 255; * request_retries, retries1, retries2 also between 0..255 for good measure; * tx_qlen is checked to be non-negative; * sync_ratelimit remains as before. Notes: ------ 1. Die s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c. 2. As pointed out by Arnaldo, the pattern of type-checking repeats itself in other places, sometimes with exactly the same kind of definitions (e.g. "static int zero;"). It may be a good idea (kernel janitors?) to consolidate type checking. For the sake of keeping the changeset small and in order not to affect other subsystems, I have not strived to generalise here. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net> diff 6fdd34d4 Mon Dec 08 02:19:06 MST 2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Phase out the use of boolean Ack Vector sysctl This removes the use of the sysctl and the minisock variable for the Send Ack Vector feature, as it now is handled fully dynamically via feature negotiation (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per RFC 4341, 4.). Using a sysctl in parallel to this implementation would open the door to crashes, since much of the code relies on tests of the boolean minisock / sysctl variable. Thus, this patch replaces all tests of type if (dccp_msk(sk)->dccpms_send_ack_vector) /* ... */ with if (dp->dccps_hc_rx_ackvec != NULL) /* ... */ The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature negotiation concluded that Ack Vectors are to be used on the half-connection. Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child), so that the test is a valid one. The activation handler for Ack Vectors is called as soon as the feature negotiation has concluded at the * server when the Ack marking the transition RESPOND => OPEN arrives; * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN. Adding the sequence number of the Response packet to the Ack Vector has been removed, since (a) connection establishment implies that the Response has been received; (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e. this entry will always be ignored; (c) it can not be used for anything useful - to detect loss for instance, only packets received after the loss can serve as pseudo-dupacks. There was a FIXME to change the error code when dccp_ackvec_add() fails. I removed this after finding out that: * the check whether ackno < ISN is already made earlier, * this Response is likely the 1st packet with an Ackno that the client gets, * so when dccp_ackvec_add() fails, the reason is likely not a packet error. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net> diff dd9c0e36 Sun Nov 16 23:55:08 MST 2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Deprecate Ack Ratio sysctl This patch deprecates the Ack Ratio sysctl, since * Ack Ratio is entirely ignored by CCID-3 and CCID-4, * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1); * even if it would work in CCID-2, there is no point for a user to change it: - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2), - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts (since waiting for Acks which will never arrive in this window), - cwnd is not a user-configurable value. The only reasonable place for Ack Ratio is to print it for debugging. It is planned to do this later on, as part of e.g. dccp_probe. With this patch Ack Ratio is now under full control of feature negotiation: * Ack Ratio is resolved as a dependency of the selected CCID; * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to the default of 2, following RFC 4340, 11.3 - "New connections start with Ack Ratio 2 for both endpoints"; * what happens then is part of another patch set, since it concerns the dynamic update of Ack Ratio while the connection is in full flight. Thanks to Tomasz Grobelny for discussion leading up to this patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 68b1de15 Wed Sep 03 23:30:19 MDT 2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Algorithm to update buffer state This provides a routine to consistently update the buffer state when the peer acknowledges receipt of Ack Vectors; updating state in the list of Ack Vectors as well as in the circular buffer. While based on RFC 4340, several additional (and necessary) precautions were added to protect the consistency of the buffer state. These additions are essential, since analysis and experience showed that the basic algorithm was insufficient for this task (which lead to problems that were hard to debug). The algorithm now * deals with HC-sender acknowledging to HC-receiver and vice versa, * keeps track of the last unacknowledged but received seqno in tail_ackno, * has special cases to reset the overflow condition when appropriate, * is protected against receiving older information (would mess up buffer state). Note: The older code performed an unnecessary step, where the sender cleared Ack Vector state by parsing the Ack Vector received by the HC-receiver. Doing this was entirely redundant, since * the receiver always puts the full acknowledgment window (groups 2,3 in 11.4.2) into the Ack Vectors it sends; hence the HC-receiver is only interested in the highest state that the HC-sender received; * this means that the acknowledgment number on the (Data)Ack from the HC-sender is sufficient; and work done in parsing earlier state is not necessary, since the later state subsumes the earlier one (see also RFC 4340, A.4). This older interface (dccp_ackvec_parse()) is therefore removed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> |
/linux-master/ipc/ | ||
H A D | msg.c | diff 18319498 Thu Sep 02 15:55:31 MDT 2021 Vasily Averin <vvs@virtuozzo.com> memcg: enable accounting of ipc resources When user creates IPC objects it forces kernel to allocate memory for these long-living objects. It makes sense to account them to restrict the host's memory consumption from inside the memcg-limited container. This patch enables accounting for IPC shared memory segments, messages semaphores and semaphore's undo lists. Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff a11ddb37 Sat May 22 18:41:49 MDT 2021 Varad Gautam <varad.gautam@suse.com> ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry do_mq_timedreceive calls wq_sleep with a stack local address. The sender (do_mq_timedsend) uses this address to later call pipelined_send. This leads to a very hard to trigger race where a do_mq_timedreceive call might return and leave do_mq_timedsend to rely on an invalid address, causing the following crash: RIP: 0010:wake_q_add_safe+0x13/0x60 Call Trace: __x64_sys_mq_timedsend+0x2a9/0x490 do_syscall_64+0x80/0x680 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5928e40343 The race occurs as: 1. do_mq_timedreceive calls wq_sleep with the address of `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it holds a valid `struct ext_wait_queue *` as long as the stack has not been overwritten. 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call __pipelined_op. 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY). Here is where the race window begins. (`this` is `ewq_addr`.) 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it will see `state == STATE_READY` and break. 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's stack. (Although the address may not get overwritten until another function happens to touch it, which means it can persist around for an indefinite time.) 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a `struct ext_wait_queue *`, and uses it to find a task_struct to pass to the wake_q_add_safe call. In the lucky case where nothing has overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct. In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a bogus address as the receiver's task_struct causing the crash. do_mq_timedsend::__pipelined_op() should not dereference `this` after setting STATE_READY, as the receiver counterpart is now free to return. Change __pipelined_op to call wake_q_add_safe on the receiver's task_struct returned by get_task_struct, instead of dereferencing `this` which sits on the receiver's stack. As Manfred pointed out, the race potentially also exists in ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix those in the same way. Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers") Fixes: 8116b54e7e23ef ("ipc/sem.c: document and update memory barriers") Fixes: 0d97a82ba830d8 ("ipc/msg.c: update and document memory barriers") Signed-off-by: Varad Gautam <varad.gautam@suse.com> Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4b78e201 Sun Jun 07 22:40:07 MDT 2020 Jules Irenge <jbi.octave@gmail.com> ipc/msg: add missing annotation for freeque() Sparse reports a warning at freeque() warning: context imbalance in freeque() - unexpected unlock The root cause is the missing annotation at freeque() Add the missing __releases(RCU) annotation Add the missing __releases(&msq->q_perm) annotation Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lu Shuaibing <shuaibinglu@126.com> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Link: http://lkml.kernel.org/r/20200403160505.2832-2-jbi.octave@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions A use of uninitialized memory in msgctl_down() because msqid64 in ksys_msgctl hasn't been initialized. The local | msqid64 | is created in ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64 is never initialized before msgctl_down() checks msqid64->msg_qbytes. KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool) reports: ================================================================== BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300 Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022 CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x75/0xae __kumsan_report+0x17c/0x3e6 kumsan_report+0xe/0x20 msgctl_down+0x94/0x300 ksys_msgctl.constprop.14+0xef/0x260 do_syscall_64+0x7e/0x1f0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4400e9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970 R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x100000000000000() raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kumsan: bad access detected ================================================================== Syzkaller reproducer: msgctl$IPC_RMID(0x0, 0x0) C reproducer: // autogenerated by syzkaller (https://github.com/google/syzkaller) int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); syscall(__NR_msgctl, 0, 0, 0); return 0; } [natechancellor@gmail.com: adjust indentation in ksys_msgctl] Link: https://github.com/ClangBuiltLinux/linux/issues/829 Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com Signed-off-by: Lu Shuaibing <shuaibinglu@126.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: NeilBrown <neilb@suse.com> From: Andrew Morton <akpm@linux-foundation.org> Subject: drivers/block/null_blk_main.c: fix layout Each line here overflows 80 cols by exactly one character. Delete one tab per line to fix. Cc: Shaohua Li <shli@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions A use of uninitialized memory in msgctl_down() because msqid64 in ksys_msgctl hasn't been initialized. The local | msqid64 | is created in ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64 is never initialized before msgctl_down() checks msqid64->msg_qbytes. KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool) reports: ================================================================== BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300 Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022 CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x75/0xae __kumsan_report+0x17c/0x3e6 kumsan_report+0xe/0x20 msgctl_down+0x94/0x300 ksys_msgctl.constprop.14+0xef/0x260 do_syscall_64+0x7e/0x1f0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4400e9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970 R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x100000000000000() raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kumsan: bad access detected ================================================================== Syzkaller reproducer: msgctl$IPC_RMID(0x0, 0x0) C reproducer: // autogenerated by syzkaller (https://github.com/google/syzkaller) int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); syscall(__NR_msgctl, 0, 0, 0); return 0; } [natechancellor@gmail.com: adjust indentation in ksys_msgctl] Link: https://github.com/ClangBuiltLinux/linux/issues/829 Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com Signed-off-by: Lu Shuaibing <shuaibinglu@126.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: NeilBrown <neilb@suse.com> From: Andrew Morton <akpm@linux-foundation.org> Subject: drivers/block/null_blk_main.c: fix layout Each line here overflows 80 cols by exactly one character. Delete one tab per line to fix. Cc: Shaohua Li <shli@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions A use of uninitialized memory in msgctl_down() because msqid64 in ksys_msgctl hasn't been initialized. The local | msqid64 | is created in ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64 is never initialized before msgctl_down() checks msqid64->msg_qbytes. KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool) reports: ================================================================== BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300 Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022 CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x75/0xae __kumsan_report+0x17c/0x3e6 kumsan_report+0xe/0x20 msgctl_down+0x94/0x300 ksys_msgctl.constprop.14+0xef/0x260 do_syscall_64+0x7e/0x1f0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4400e9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970 R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x100000000000000() raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kumsan: bad access detected ================================================================== Syzkaller reproducer: msgctl$IPC_RMID(0x0, 0x0) C reproducer: // autogenerated by syzkaller (https://github.com/google/syzkaller) int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); syscall(__NR_msgctl, 0, 0, 0); return 0; } [natechancellor@gmail.com: adjust indentation in ksys_msgctl] Link: https://github.com/ClangBuiltLinux/linux/issues/829 Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com Signed-off-by: Lu Shuaibing <shuaibinglu@126.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: NeilBrown <neilb@suse.com> From: Andrew Morton <akpm@linux-foundation.org> Subject: drivers/block/null_blk_main.c: fix layout Each line here overflows 80 cols by exactly one character. Delete one tab per line to fix. Cc: Shaohua Li <shli@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 889b3317 Mon Feb 03 18:34:46 MST 2020 Lu Shuaibing <shuaibinglu@126.com> ipc/msg.c: consolidate all xxxctl_down() functions A use of uninitialized memory in msgctl_down() because msqid64 in ksys_msgctl hasn't been initialized. The local | msqid64 | is created in ksys_msgctl() and then passed into msgctl_down(). Along the way msqid64 is never initialized before msgctl_down() checks msqid64->msg_qbytes. KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool) reports: ================================================================== BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300 Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022 CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x75/0xae __kumsan_report+0x17c/0x3e6 kumsan_report+0xe/0x20 msgctl_down+0x94/0x300 ksys_msgctl.constprop.14+0xef/0x260 do_syscall_64+0x7e/0x1f0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x4400e9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970 R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x100000000000000() raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kumsan: bad access detected ================================================================== Syzkaller reproducer: msgctl$IPC_RMID(0x0, 0x0) C reproducer: // autogenerated by syzkaller (https://github.com/google/syzkaller) int main(void) { syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0); syscall(__NR_msgctl, 0, 0, 0); return 0; } [natechancellor@gmail.com: adjust indentation in ksys_msgctl] Link: https://github.com/ClangBuiltLinux/linux/issues/829 Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com Signed-off-by: Lu Shuaibing <shuaibinglu@126.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Suggested-by: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: NeilBrown <neilb@suse.com> From: Andrew Morton <akpm@linux-foundation.org> Subject: drivers/block/null_blk_main.c: fix layout Each line here overflows 80 cols by exactly one character. Delete one tab per line to fix. Cc: Shaohua Li <shli@fb.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 39cfffd7 Tue Aug 21 23:01:29 MDT 2018 Manfred Spraul <manfred@colorfullife.com> ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid() ipc_addid() is impossible to use: - for certain failures, the caller must not use ipc_rcu_putref(), because the reference counter is not yet initialized. - for other failures, the caller must use ipc_rcu_putref(), because parallel operations could be ongoing already. The patch cleans that up, by initializing the refcount early, and by modifying all callers. The issues is related to the finding of syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an issue with reading kern_ipc_perm.seq, here both read and write to already released memory could happen. Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 23c8cec8 Tue Apr 10 17:35:30 MDT 2018 Davidlohr Bueso <dave@stgolabs.net> ipc/msg: introduce msgctl(MSG_STAT_ANY) There is a permission discrepancy when consulting msq ipc object metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the msq metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm. This patch introduces a new MSG_STAT_ANY command such that the msq ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Robert Kettler <robert.kettler@outlook.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
H A D | util.c | diff 20401d10 Tue Sep 07 21:00:53 MDT 2021 Rafael Aquini <aquini@redhat.com> ipc: replace costly bailout check in sysvipc_find_ipc() sysvipc_find_ipc() was left with a costly way to check if the offset position fed to it is bigger than the total number of IPC IDs in use. So much so that the time it takes to iterate over /proc/sysvipc/* files grows exponentially for a custom benchmark that creates "N" SYSV shm segments and then times the read of /proc/sysvipc/shm (milliseconds): 12 msecs to read 1024 segs from /proc/sysvipc/shm 18 msecs to read 2048 segs from /proc/sysvipc/shm 65 msecs to read 4096 segs from /proc/sysvipc/shm 325 msecs to read 8192 segs from /proc/sysvipc/shm 1303 msecs to read 16384 segs from /proc/sysvipc/shm 5182 msecs to read 32768 segs from /proc/sysvipc/shm The root problem lies with the loop that computes the total amount of ids in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than "ids->in_use". That is a quite inneficient way to get to the maximum index in the id lookup table, specially when that value is already provided by struct ipc_ids.max_idx. This patch follows up on the optimization introduced via commit 15df03c879836 ("sysvipc: make get_maxid O(1) again") and gets rid of the aforementioned costly loop replacing it by a simpler checkpoint based on ipc_get_maxidx() returned value, which allows for a smooth linear increase in time complexity for the same custom benchmark: 2 msecs to read 1024 segs from /proc/sysvipc/shm 2 msecs to read 2048 segs from /proc/sysvipc/shm 4 msecs to read 4096 segs from /proc/sysvipc/shm 9 msecs to read 8192 segs from /proc/sysvipc/shm 19 msecs to read 16384 segs from /proc/sysvipc/shm 39 msecs to read 32768 segs from /proc/sysvipc/shm Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Waiman Long <llong@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff d919b33d Mon Apr 06 21:09:01 MDT 2020 Alexey Dobriyan <adobriyan@gmail.com> proc: faster open/read/close with "permanent" files Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include <chrono> #include <iostream> #include <thread> #include <vector> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector<std::thread> T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration<double> dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 39cfffd7 Tue Aug 21 23:01:29 MDT 2018 Manfred Spraul <manfred@colorfullife.com> ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid() ipc_addid() is impossible to use: - for certain failures, the caller must not use ipc_rcu_putref(), because the reference counter is not yet initialized. - for other failures, the caller must use ipc_rcu_putref(), because parallel operations could be ongoing already. The patch cleans that up, by initializing the refcount early, and by modifying all callers. The issues is related to the finding of syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com: syzbot found an issue with reading kern_ipc_perm.seq, here both read and write to already released memory could happen. Link: http://lkml.kernel.org/r/20180712185241.4017-4-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ebf66799 Fri Nov 17 16:31:15 MST 2017 Davidlohr Bueso <dave@stgolabs.net> sysvipc: properly name ipc_addid() limit parameter This is better understood as a limit, instead of size; exactly like the function comment indicates. Rename it. Link: http://lkml.kernel.org/r/20170831172049.14576-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff b8fd9983 Fri Nov 17 16:31:08 MST 2017 Davidlohr Bueso <dave@stgolabs.net> sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE Patch series "sysvipc: ipc-key management improvements". Here are a few improvements I spotted while eyeballing Guillaume's rhashtable implementation for ipc keys. The first and fourth patches are the interesting ones, the middle two are trivial. This patch (of 4): The next_id object-allocation functionality was introduced in commit 03f595668017 ("ipc: add sysctl to specify desired next object id"). Given that these new entries are _only_ exported under the CONFIG_CHECKPOINT_RESTORE option, there is no point for the common case to even know about ->next_id. As such rewrite ipc_buildid() such that it can do away with the field as well as unnecessary branches when adding a new identifier. The end result also better differentiates both cases, so the code ends up being cleaner; albeit the small duplications regarding the default case. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170831172049.14576-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff b2441318 Wed Nov 01 08:07:57 MDT 2017 Greg Kroah-Hartman <gregkh@linuxfoundation.org> License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff 9405c03e Fri Sep 08 17:17:45 MDT 2017 Elena Reshetova <elena.reshetova@intel.com> ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/1499417992-3238-4-git-send-email-elena.reshetova@intel.com Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: <arozansk@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4d2bff5e Tue Apr 30 20:15:19 MDT 2013 Davidlohr Bueso <davidlohr.bueso@hp.com> ipc: introduce obtaining a lockless ipc object Through ipc_lock() and therefore ipc_lock_check() we currently return the locked ipc object. This is not necessary for all situations and can, therefore, cause unnecessary ipc lock contention. Introduce analogous ipc_obtain_object() and ipc_obtain_object_check() functions that only lookup and return the ipc object. Both these functions must be called within the RCU read critical section. [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()] Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Chegu Vinod <chegu_vinod@hp.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/security/keys/ | ||
H A D | keyring.c | diff 2e12256b Thu Jun 27 16:03:07 MDT 2019 David Howells <dhowells@redhat.com> keys: Replace uid/gid/perm permissions checking with an ACL Replace the uid/gid/perm permissions checking on a key with an ACL to allow the SETATTR and SEARCH permissions to be split. This will also allow a greater range of subjects to represented. ============ WHY DO THIS? ============ The problem is that SETATTR and SEARCH cover a slew of actions, not all of which should be grouped together. For SETATTR, this includes actions that are about controlling access to a key: (1) Changing a key's ownership. (2) Changing a key's security information. (3) Setting a keyring's restriction. And actions that are about managing a key's lifetime: (4) Setting an expiry time. (5) Revoking a key. and (proposed) managing a key as part of a cache: (6) Invalidating a key. Managing a key's lifetime doesn't really have anything to do with controlling access to that key. Expiry time is awkward since it's more about the lifetime of the content and so, in some ways goes better with WRITE permission. It can, however, be set unconditionally by a process with an appropriate authorisation token for instantiating a key, and can also be set by the key type driver when a key is instantiated, so lumping it with the access-controlling actions is probably okay. As for SEARCH permission, that currently covers: (1) Finding keys in a keyring tree during a search. (2) Permitting keyrings to be joined. (3) Invalidation. But these don't really belong together either, since these actions really need to be controlled separately. Finally, there are number of special cases to do with granting the administrator special rights to invalidate or clear keys that I would like to handle with the ACL rather than key flags and special checks. =============== WHAT IS CHANGED =============== The SETATTR permission is split to create two new permissions: (1) SET_SECURITY - which allows the key's owner, group and ACL to be changed and a restriction to be placed on a keyring. (2) REVOKE - which allows a key to be revoked. The SEARCH permission is split to create: (1) SEARCH - which allows a keyring to be search and a key to be found. (2) JOIN - which allows a keyring to be joined as a session keyring. (3) INVAL - which allows a key to be invalidated. The WRITE permission is also split to create: (1) WRITE - which allows a key's content to be altered and links to be added, removed and replaced in a keyring. (2) CLEAR - which allows a keyring to be cleared completely. This is split out to make it possible to give just this to an administrator. (3) REVOKE - see above. Keys acquire ACLs which consist of a series of ACEs, and all that apply are unioned together. An ACE specifies a subject, such as: (*) Possessor - permitted to anyone who 'possesses' a key (*) Owner - permitted to the key owner (*) Group - permitted to the key group (*) Everyone - permitted to everyone Note that 'Other' has been replaced with 'Everyone' on the assumption that you wouldn't grant a permit to 'Other' that you wouldn't also grant to everyone else. Further subjects may be made available by later patches. The ACE also specifies a permissions mask. The set of permissions is now: VIEW Can view the key metadata READ Can read the key content WRITE Can update/modify the key content SEARCH Can find the key by searching/requesting LINK Can make a link to the key SET_SECURITY Can change owner, ACL, expiry INVAL Can invalidate REVOKE Can revoke JOIN Can join this keyring CLEAR Can clear this keyring The KEYCTL_SETPERM function is then deprecated. The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set, or if the caller has a valid instantiation auth token. The KEYCTL_INVALIDATE function then requires INVAL. The KEYCTL_REVOKE function then requires REVOKE. The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an existing keyring. The JOIN permission is enabled by default for session keyrings and manually created keyrings only. ====================== BACKWARD COMPATIBILITY ====================== To maintain backward compatibility, KEYCTL_SETPERM will translate the permissions mask it is given into a new ACL for a key - unless KEYCTL_SET_ACL has been called on that key, in which case an error will be returned. It will convert possessor, owner, group and other permissions into separate ACEs, if each portion of the mask is non-zero. SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned on if a keyring is being altered. The KEYCTL_DESCRIBE function translates the ACL back into a permissions mask to return depending on possessor, owner, group and everyone ACEs. It will make the following mappings: (1) INVAL, JOIN -> SEARCH (2) SET_SECURITY -> SETATTR (3) REVOKE -> WRITE if SETATTR isn't already set (4) CLEAR -> WRITE Note that the value subsequently returned by KEYCTL_DESCRIBE may not match the value set with KEYCTL_SETATTR. ======= TESTING ======= This passes the keyutils testsuite for all but a couple of tests: (1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed if the type doesn't have ->read(). You still can't actually read the key. (2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't work as Other has been replaced with Everyone in the ACL. Signed-off-by: David Howells <dhowells@redhat.com> diff 2e12256b Thu Jun 27 16:03:07 MDT 2019 David Howells <dhowells@redhat.com> keys: Replace uid/gid/perm permissions checking with an ACL Replace the uid/gid/perm permissions checking on a key with an ACL to allow the SETATTR and SEARCH permissions to be split. This will also allow a greater range of subjects to represented. ============ WHY DO THIS? ============ The problem is that SETATTR and SEARCH cover a slew of actions, not all of which should be grouped together. For SETATTR, this includes actions that are about controlling access to a key: (1) Changing a key's ownership. (2) Changing a key's security information. (3) Setting a keyring's restriction. And actions that are about managing a key's lifetime: (4) Setting an expiry time. (5) Revoking a key. and (proposed) managing a key as part of a cache: (6) Invalidating a key. Managing a key's lifetime doesn't really have anything to do with controlling access to that key. Expiry time is awkward since it's more about the lifetime of the content and so, in some ways goes better with WRITE permission. It can, however, be set unconditionally by a process with an appropriate authorisation token for instantiating a key, and can also be set by the key type driver when a key is instantiated, so lumping it with the access-controlling actions is probably okay. As for SEARCH permission, that currently covers: (1) Finding keys in a keyring tree during a search. (2) Permitting keyrings to be joined. (3) Invalidation. But these don't really belong together either, since these actions really need to be controlled separately. Finally, there are number of special cases to do with granting the administrator special rights to invalidate or clear keys that I would like to handle with the ACL rather than key flags and special checks. =============== WHAT IS CHANGED =============== The SETATTR permission is split to create two new permissions: (1) SET_SECURITY - which allows the key's owner, group and ACL to be changed and a restriction to be placed on a keyring. (2) REVOKE - which allows a key to be revoked. The SEARCH permission is split to create: (1) SEARCH - which allows a keyring to be search and a key to be found. (2) JOIN - which allows a keyring to be joined as a session keyring. (3) INVAL - which allows a key to be invalidated. The WRITE permission is also split to create: (1) WRITE - which allows a key's content to be altered and links to be added, removed and replaced in a keyring. (2) CLEAR - which allows a keyring to be cleared completely. This is split out to make it possible to give just this to an administrator. (3) REVOKE - see above. Keys acquire ACLs which consist of a series of ACEs, and all that apply are unioned together. An ACE specifies a subject, such as: (*) Possessor - permitted to anyone who 'possesses' a key (*) Owner - permitted to the key owner (*) Group - permitted to the key group (*) Everyone - permitted to everyone Note that 'Other' has been replaced with 'Everyone' on the assumption that you wouldn't grant a permit to 'Other' that you wouldn't also grant to everyone else. Further subjects may be made available by later patches. The ACE also specifies a permissions mask. The set of permissions is now: VIEW Can view the key metadata READ Can read the key content WRITE Can update/modify the key content SEARCH Can find the key by searching/requesting LINK Can make a link to the key SET_SECURITY Can change owner, ACL, expiry INVAL Can invalidate REVOKE Can revoke JOIN Can join this keyring CLEAR Can clear this keyring The KEYCTL_SETPERM function is then deprecated. The KEYCTL_SET_TIMEOUT function then is permitted if SET_SECURITY is set, or if the caller has a valid instantiation auth token. The KEYCTL_INVALIDATE function then requires INVAL. The KEYCTL_REVOKE function then requires REVOKE. The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an existing keyring. The JOIN permission is enabled by default for session keyrings and manually created keyrings only. ====================== BACKWARD COMPATIBILITY ====================== To maintain backward compatibility, KEYCTL_SETPERM will translate the permissions mask it is given into a new ACL for a key - unless KEYCTL_SET_ACL has been called on that key, in which case an error will be returned. It will convert possessor, owner, group and other permissions into separate ACEs, if each portion of the mask is non-zero. SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned on if a keyring is being altered. The KEYCTL_DESCRIBE function translates the ACL back into a permissions mask to return depending on possessor, owner, group and everyone ACEs. It will make the following mappings: (1) INVAL, JOIN -> SEARCH (2) SET_SECURITY -> SETATTR (3) REVOKE -> WRITE if SETATTR isn't already set (4) CLEAR -> WRITE Note that the value subsequently returned by KEYCTL_DESCRIBE may not match the value set with KEYCTL_SETATTR. ======= TESTING ======= This passes the keyutils testsuite for all but a couple of tests: (1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed if the type doesn't have ->read(). You still can't actually read the key. (2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't work as Other has been replaced with Everyone in the ACL. Signed-off-by: David Howells <dhowells@redhat.com> diff b206f281 Wed Jun 26 14:02:32 MDT 2019 David Howells <dhowells@redhat.com> keys: Namespace keyring names Keyring names are held in a single global list that any process can pick from by means of keyctl_join_session_keyring (provided the keyring grants Search permission). This isn't very container friendly, however. Make the following changes: (1) Make default session, process and thread keyring names begin with a '.' instead of '_'. (2) Keyrings whose names begin with a '.' aren't added to the list. Such keyrings are system specials. (3) Replace the global list with per-user_namespace lists. A keyring adds its name to the list for the user_namespace that it is currently in. (4) When a user_namespace is deleted, it just removes itself from the keyring name list. The global keyring_name_lock is retained for accessing the name lists. This allows (4) to work. This can be tested by: # keyctl newring foo @s 995906392 # unshare -U $ keyctl show ... 995906392 --alswrv 65534 65534 \_ keyring: foo ... $ keyctl session foo Joined session keyring: 935622349 As can be seen, a new session keyring was created. The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is employing this feature. Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric W. Biederman <ebiederm@xmission.com> diff b206f281 Wed Jun 26 14:02:32 MDT 2019 David Howells <dhowells@redhat.com> keys: Namespace keyring names Keyring names are held in a single global list that any process can pick from by means of keyctl_join_session_keyring (provided the keyring grants Search permission). This isn't very container friendly, however. Make the following changes: (1) Make default session, process and thread keyring names begin with a '.' instead of '_'. (2) Keyrings whose names begin with a '.' aren't added to the list. Such keyrings are system specials. (3) Replace the global list with per-user_namespace lists. A keyring adds its name to the list for the user_namespace that it is currently in. (4) When a user_namespace is deleted, it just removes itself from the keyring name list. The global keyring_name_lock is retained for accessing the name lists. This allows (4) to work. This can be tested by: # keyctl newring foo @s 995906392 # unshare -U $ keyctl show ... 995906392 --alswrv 65534 65534 \_ keyring: foo ... $ keyctl session foo Joined session keyring: 935622349 As can be seen, a new session keyring was created. The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is employing this feature. Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric W. Biederman <ebiederm@xmission.com> diff 363b02da Wed Oct 04 09:43:25 MDT 2017 David Howells <dhowells@redhat.com> KEYS: Fix race between updating and finding a negative key Consolidate KEY_FLAG_INSTANTIATED, KEY_FLAG_NEGATIVE and the rejection error into one field such that: (1) The instantiation state can be modified/read atomically. (2) The error can be accessed atomically with the state. (3) The error isn't stored unioned with the payload pointers. This deals with the problem that the state is spread over three different objects (two bits and a separate variable) and reading or updating them atomically isn't practical, given that not only can uninstantiated keys change into instantiated or rejected keys, but rejected keys can also turn into instantiated keys - and someone accessing the key might not be using any locking. The main side effect of this problem is that what was held in the payload may change, depending on the state. For instance, you might observe the key to be in the rejected state. You then read the cached error, but if the key semaphore wasn't locked, the key might've become instantiated between the two reads - and you might now have something in hand that isn't actually an error code. The state is now KEY_IS_UNINSTANTIATED, KEY_IS_POSITIVE or a negative error code if the key is negatively instantiated. The key_is_instantiated() function is replaced with key_is_positive() to avoid confusion as negative keys are also 'instantiated'. Additionally, barriering is included: (1) Order payload-set before state-set during instantiation. (2) Order state-read before payload-read when using the key. Further separate barriering is necessary if RCU is being used to access the payload content after reading the payload pointers. Fixes: 146aa8b1453b ("KEYS: Merge the type-specific data with the payload data") Cc: stable@vger.kernel.org # v4.4+ Reported-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Eric Biggers <ebiggers@google.com> diff 237bbd29 Mon Sep 18 12:37:03 MDT 2017 Eric Biggers <ebiggers@google.com> KEYS: prevent creating a different user's keyrings It was possible for an unprivileged user to create the user and user session keyrings for another user. For example: sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u keyctl add keyring _uid_ses.4000 "" @u sleep 15' & sleep 1 sudo -u '#4000' keyctl describe @u sudo -u '#4000' keyctl describe @us This is problematic because these "fake" keyrings won't have the right permissions. In particular, the user who created them first will own them and will have full access to them via the possessor permissions, which can be used to compromise the security of a user's keys: -4: alswrv-----v------------ 3000 0 keyring: _uid.4000 -5: alswrv-----v------------ 3000 0 keyring: _uid_ses.4000 Fix it by marking user and user session keyrings with a flag KEY_FLAG_UID_KEYRING. Then, when searching for a user or user session keyring by name, skip all keyrings that don't have the flag set. Fixes: 69664cf16af4 ("keys: don't generate user and user session keyrings unless they're accessed") Cc: <stable@vger.kernel.org> [v2.6.26+] Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> diff 5ac7eace Wed Apr 06 09:14:24 MDT 2016 David Howells <dhowells@redhat.com> KEYS: Add a facility to restrict new links into a keyring Add a facility whereby proposed new links to be added to a keyring can be vetted, permitting them to be rejected if necessary. This can be used to block public keys from which the signature cannot be verified or for which the signature verification fails. It could also be used to provide blacklisting. This affects operations like add_key(), KEYCTL_LINK and KEYCTL_INSTANTIATE. To this end: (1) A function pointer is added to the key struct that, if set, points to the vetting function. This is called as: int (*restrict_link)(struct key *keyring, const struct key_type *key_type, unsigned long key_flags, const union key_payload *key_payload), where 'keyring' will be the keyring being added to, key_type and key_payload will describe the key being added and key_flags[*] can be AND'ed with KEY_FLAG_TRUSTED. [*] This parameter will be removed in a later patch when KEY_FLAG_TRUSTED is removed. The function should return 0 to allow the link to take place or an error (typically -ENOKEY, -ENOPKG or -EKEYREJECTED) to reject the link. The pointer should not be set directly, but rather should be set through keyring_alloc(). Note that if called during add_key(), preparse is called before this method, but a key isn't actually allocated until after this function is called. (2) KEY_ALLOC_BYPASS_RESTRICTION is added. This can be passed to key_create_or_update() or key_instantiate_and_link() to bypass the restriction check. (3) KEY_FLAG_TRUSTED_ONLY is removed. The entire contents of a keyring with this restriction emplaced can be considered 'trustworthy' by virtue of being in the keyring when that keyring is consulted. (4) key_alloc() and keyring_alloc() take an extra argument that will be used to set restrict_link in the new key. This ensures that the pointer is set before the key is published, thus preventing a window of unrestrictedness. Normally this argument will be NULL. (5) As a temporary affair, keyring_restrict_trusted_only() is added. It should be passed to keyring_alloc() as the extra argument instead of setting KEY_FLAG_TRUSTED_ONLY on a keyring. This will be replaced in a later patch with functions that look in the appropriate places for authoritative keys. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com> diff 0b0a8415 Mon Dec 01 15:52:53 MST 2014 David Howells <dhowells@redhat.com> KEYS: request_key() should reget expired keys rather than give EKEYEXPIRED Since the keyring facility can be viewed as a cache (at least in some applications), the local expiration time on the key should probably be viewed as a 'needs updating after this time' property rather than an absolute 'anyone now wanting to use this object is out of luck' property. Since request_key() is the main interface for the usage of keys, this should update or replace an expired key rather than issuing EKEYEXPIRED if the local expiration has been reached (ie. it should refresh the cache). For absolute conditions where refreshing the cache probably doesn't help, the key can be negatively instantiated using KEYCTL_REJECT_KEY with EKEYEXPIRED given as the error to issue. This will still cause request_key() to return EKEYEXPIRED as that was explicitly set. In the future, if the key type has an update op available, we might want to upcall with the expired key and allow the upcall to update it. We would pass a different operation name (the first column in /etc/request-key.conf) to the request-key program. request_key() returning EKEYEXPIRED is causing an NFS problem which Chuck Lever describes thusly: After about 10 minutes, my NFSv4 functional tests fail because the ownership of the test files goes to "-2". Looking at /proc/keys shows that the id_resolv keys that map to my test user ID have expired. The ownership problem persists until the expired keys are purged from the keyring, and fresh keys are obtained. I bisected the problem to 3.13 commit b2a4df200d57 ("KEYS: Expand the capacity of a keyring"). This commit inadvertantly changes the API contract of the internal function keyring_search_aux(). The root cause appears to be that b2a4df200d57 made "no state check" the default behavior. "No state check" means the keyring search iterator function skips checking the key's expiry timeout, and returns expired keys. request_key_and_link() depends on getting an -EAGAIN result code to know when to perform an upcall to refresh an expired key. This patch can be tested directly by: keyctl request2 user debug:fred a @s keyctl timeout %user:debug:fred 3 sleep 4 keyctl request2 user debug:fred a @s Without the patch, the last command gives error EKEYEXPIRED, but with the command it gives a new key. Reported-by: Carl Hetherington <cth@carlh.net> Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> diff 46291959 Tue Sep 16 10:36:02 MDT 2014 David Howells <dhowells@redhat.com> KEYS: Preparse match data Preparse the match data. This provides several advantages: (1) The preparser can reject invalid criteria up front. (2) The preparser can convert the criteria to binary data if necessary (the asymmetric key type really wants to do binary comparison of the key IDs). (3) The preparser can set the type of search to be performed. This means that it's not then a one-off setting in the key type. (4) The preparser can set an appropriate comparator function. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> diff 4bdf0bc3 Tue Sep 24 03:35:15 MDT 2013 David Howells <dhowells@redhat.com> KEYS: Introduce a search context structure Search functions pass around a bunch of arguments, each of which gets copied with each call. Introduce a search context structure to hold these. Whilst we're at it, create a search flag that indicates whether the search should be directly to the description or whether it should iterate through all keys looking for a non-description match. This will be useful when keyrings use a generic data struct with generic routines to manage their content as the search terms can just be passed through to the iterator callback function. Also, for future use, the data to be supplied to the match function is separated from the description pointer in the search context. This makes it clear which is being supplied. Signed-off-by: David Howells <dhowells@redhat.com> |
/linux-master/fs/nfsd/ | ||
H A D | nfsctl.c | diff 4b148854 Fri Jan 26 08:39:47 MST 2024 Josef Bacik <josef@toxicpanda.com> nfsd: make all of the nfsd stats per-network namespace We have a global set of counters that we modify for all of the nfsd operations, but now that we're exposing these stats across all network namespaces we need to make the stats also be per-network namespace. We already have some caching stats that are per-network namespace, so move these definitions into the same counter and then adjust all the helpers and users of these stats to provide the appropriate nfsd_net struct so that the stats are maintained for the per-network namespace objects. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> diff 4b471a8b Tue Feb 14 08:19:30 MST 2023 Chuck Lever <chuck.lever@oracle.com> NFSD: Clean up nfsd_symlink() The pointer dentry is assigned a value that is never read, the assignment is redundant and can be removed. Cleans up clang-scan warning: fs/nfsd/nfsctl.c:1231:2: warning: Value stored to 'dentry' is never read [deadcode.DeadStores] dentry = ERR_PTR(ret); No need to initialize "int ret = -ENOMEM;" either. These are vestiges of nfsd_mkdir(), from whence I copied nfsd_symlink(). Reported-by: Colin Ian King <colin.i.king@gmail.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> diff 4df750c9 Sat Jan 14 22:21:39 MST 2023 Chuck Lever <chuck.lever@oracle.com> NFSD: Replace /proc/fs/nfsd/supported_krb5_enctypes with a symlink Now that I've added a file under /proc/net/rpc that is managed by the SunRPC's Kerberos mechanism, replace NFSD's supported_krb5_enctypes file with a symlink to the new SunRPC proc file, which contains exactly the same content. Remarkably, commit b0b0c0a26e84 ("nfsd: add proc file listing kernel's gss_krb5 enctypes") added the nfsd_supported_krb5_enctypes file in 2011, but this file has never been documented in nfsd(7). Tested-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Simo Sorce <simo@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> diff 3f649ab7 Wed Jun 03 14:09:38 MDT 2020 Kees Cook <keescook@chromium.org> treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> diff 4df493a2 Mon Apr 08 22:13:37 MDT 2019 Trond Myklebust <trondmy@gmail.com> SUNRPC: Cache the process user cred in the RPC server listener In order to be able to interpret uids and gids correctly in knfsd, we should cache the user namespace of the process that created the RPC server's listener. To do so, we refcount the credential of that process. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> diff abcb4dac Thu Mar 09 17:36:39 MST 2017 NeilBrown <neilb@suse.com> NFSD: further refinement of content of /proc/fs/nfsd/versions Prior to e35659f1b03c ("NFSD: correctly range-check v4.x minor version when setting versions.") v4.0 could not be disabled without disabling all NFSv4 protocols. So the 'versions' file contained ±4 ±4.1 ±4.2. Writing "-4" would disable all v4 completely. Writing +4 would enabled those minor versions that are currently enabled, either by default or otherwise. After that commit, it was possible to disable v4.0 independently. To maximize backward compatibility with use cases which never disabled v4.0, the "versions" file would never contain "+4.0" - that was implied by "+4", unless explicitly negated by "-4.0". This introduced an inconsistency in that it was possible to disable all minor versions, but still have the major version advertised. e.g. "-4.0 -4.1 -4.2 +4" would result in NFSv4 support being advertised, but all attempts to use it rejected. Commit d3635ff07e8c ("nfsd: fix configuration of supported minor versions") and following removed this inconsistency. If all minor version were disabled, the major would be disabled too. If any minor was enabled, the major would be disabled. This patch also treated "+4" as equivalent to "+4.0" and "-4" as "-4.0". A consequence of this is that writing "-4" would only disable 4.0. This is a regression against the earlier behaviour, in a use case that rpc.nfsd actually uses. The command "rpc.nfsd -N 4" will write "+2 +3 -4" to the versions files. Previously, that would disable v4 completely. Now it will only disable v4.0. Also "4.0" never appears in the "versions" file when read. So if only v4.1 is available, the previous kernel would have reported "+4 -4.0 +4.1 -4.2" the current kernel reports "-4 +4.1 -4.2" which could easily confuse. This patch restores the implication that "+4" and "-4" apply more globals and do not imply "4.0". Specifically: writing "-4" will disable all 4.x minor versions. writing "+4" will enable all 4.1 minor version if none are currently enabled. rpc.nfsd will list minor versions before major versions, so rpc.nfsd -V 4.2 -N 4.1 will write "-4.1 +4.2 +2 +3 +4" so it would be a regression for "+4" to enable always all versions. reading "-4" implies that no v4.x are enabled reading "+4" implies that some v4.x are enabled, and that v4.0 is enabled unless "-4.0" is also present. All other minor versions will explicitly be listed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> |
/linux-master/drivers/vhost/ | ||
H A D | vhost.c | diff c1ecd8e9 Mon Jun 26 17:23:05 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: allow userspace to create workers For vhost-scsi with 3 vqs or more and a workload that tries to use them in parallel like: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=3 the single vhost worker thread will become a bottlneck and we are stuck at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are used. To better utilize virtqueues and available CPUs, this patch allows userspace to create workers and bind them to vqs. You can have N workers per dev and also share N workers with M vqs on that dev. This patch adds the interface related code and the next patch will hook vhost-scsi into it. The patches do not try to hook net and vsock into the interface because: 1. multiple workers don't seem to help vsock. The problem is that with only 2 virtqueues we never fully use the existing worker when doing bidirectional tests. This seems to match vhost-scsi where we don't see the worker as a bottleneck until 3 virtqueues are used. 2. net already has a way to use multiple workers. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-16-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> diff 737bdb64 Mon Jun 26 17:22:53 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: add vhost_worker pointer to vhost_virtqueue This patchset allows userspace to map vqs to different workers. This patch adds a worker pointer to the vq so in later patches in this set we can queue/flush specific vqs and their workers. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230626232307.97930-4-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> diff 55d8122f Mon Apr 24 16:50:30 MDT 2023 Shannon Nelson <shannon.nelson@amd.com> vhost: support PACKED when setting-getting vring_base Use the right structs for PACKED or split vqs when setting and getting the vring base. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Message-Id: <20230424225031.18947-3-shannon.nelson@amd.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> diff 4b13cbef Wed Jun 07 13:23:38 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: Fix worker hangs due to missed wake up calls We can race where we have added work to the work_list, but vhost_task_fn has passed that check but not yet set us into TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and just return. This bug was intoduced in commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") when I moved the setting of TASK_INTERRUPTIBLE to simplfy the code and avoid get_signal from logging warnings about being in the wrong state. This moves the setting of TASK_INTERRUPTIBLE back to before we test if we need to stop the task to avoid a possible race there as well. We then have vhost_worker set TASK_RUNNING if it finds work similar to before. Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230607192338.6041-3-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> diff a284f09e Wed Jun 07 13:23:37 MDT 2023 Mike Christie <michael.christie@oracle.com> vhost: Fix crash during early vhost_transport_send_pkt calls If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we can race where: 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create. 3. vhost_worker_create will set the dev->worker pointer before setting the worker->vtsk pointer. 4. thread0's vhost_work_queue will see the dev->worker pointer is set and try to call vhost_task_wake using not yet set worker->vtsk pointer. 5. We then crash since vtsk is NULL. Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"), we only had the worker pointer so we could just check it to see if VHOST_SET_OWNER has been done. After that commit we have the vhost_worker and vhost_task pointer, so we can now hit the bug above. This patch embeds the vhost_worker in the vhost_dev and moves the work list initialization back to vhost_dev_init, so we can just check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done like before. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20230607192338.6041-2-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reported-by: syzbot+d0d442c22fa8db45ff0e@syzkaller.appspotmail.com Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> diff 4d8df0f5 Mon May 22 02:50:19 MDT 2023 Prathu Baronia <prathubaronia2011@gmail.com> vhost: use kzalloc() instead of kmalloc() followed by memset() Use kzalloc() to allocate new zeroed out msg node instead of memsetting a node allocated with kmalloc(). Signed-off-by: Prathu Baronia <prathubaronia2011@gmail.com> Message-Id: <20230522085019.42914-1-prathubaronia2011@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> diff 4c809363 Thu Jan 13 07:11:34 MST 2022 Stefano Garzarella <sgarzare@redhat.com> vhost: remove avail_event arg from vhost_update_avail_event() In vhost_update_avail_event() we never used the `avail_event` argument, since its introduction in commit 2723feaa8ec6 ("vhost: set log when updating used flags or avail event"). Let's remove it to clean up the code. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20220113141134.186773-1-sgarzare@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> diff 71c0b9a6 Wed Oct 30 10:22:17 MDT 2019 Will Deacon <will@kernel.org> vhost: Remove redundant use of read_barrier_depends() barrier Since commit 76ebbe78f739 ("locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()"), there is no need to use smp_read_barrier_depends() outside of the Alpha architecture code. Unfortunately, there is precisely _one_ user in the vhost code, and there isn't an obvious READ_ONCE() access making the barrier redundant. However, on closer inspection (thanks, Jason), it appears that vring synchronisation between the producer and consumer occurs via the 'avail_idx' field, which is followed up by an rmb() in vhost_get_vq_desc(), making the read_barrier_depends() redundant on Alpha. Jason says: | I'm also confused about the barrier here, basically in driver side | we did: | | 1) allocate pages | 2) store pages in indirect->addr | 3) smp_wmb() | 4) increase the avail idx (somehow a tail pointer of vring) | | in vhost we did: | | 1) read avail idx | 2) smp_rmb() | 3) read indirect->addr | 4) read from indirect->addr | | It looks to me even the data dependency barrier is not necessary | since we have rmb() which is sufficient for us to the correct | indirect->addr and driver are not expected to do any writing to | indirect->addr after avail idx is increased Remove the redundant barrier invocation. Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Jason Wang <jasowang@redhat.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> diff 71c0b9a6 Wed Oct 30 10:22:17 MDT 2019 Will Deacon <will@kernel.org> vhost: Remove redundant use of read_barrier_depends() barrier Since commit 76ebbe78f739 ("locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()"), there is no need to use smp_read_barrier_depends() outside of the Alpha architecture code. Unfortunately, there is precisely _one_ user in the vhost code, and there isn't an obvious READ_ONCE() access making the barrier redundant. However, on closer inspection (thanks, Jason), it appears that vring synchronisation between the producer and consumer occurs via the 'avail_idx' field, which is followed up by an rmb() in vhost_get_vq_desc(), making the read_barrier_depends() redundant on Alpha. Jason says: | I'm also confused about the barrier here, basically in driver side | we did: | | 1) allocate pages | 2) store pages in indirect->addr | 3) smp_wmb() | 4) increase the avail idx (somehow a tail pointer of vring) | | in vhost we did: | | 1) read avail idx | 2) smp_rmb() | 3) read indirect->addr | 4) read from indirect->addr | | It looks to me even the data dependency barrier is not necessary | since we have rmb() which is sufficient for us to the correct | indirect->addr and driver are not expected to do any writing to | indirect->addr after avail idx is increased Remove the redundant barrier invocation. Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Jason Wang <jasowang@redhat.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> diff 0bbe3066 Thu Mar 26 08:01:19 MDT 2020 Jason Wang <jasowang@redhat.com> vhost: factor out IOTLB This patch factors out IOTLB into a dedicated module in order to be reused by other modules like vringh. User may choose to enable the automatic retiring by specifying VHOST_IOTLB_FLAG_RETIRE flag to fit for the case of vhost device IOTLB implementation. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200326140125.19794-4-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> |
/linux-master/include/net/ | ||
H A D | ip.h | diff d9f28735 Wed Dec 06 06:44:20 MST 2023 David Laight <David.Laight@ACULAB.COM> Use READ/WRITE_ONCE() for IP local_port_range. Commit 227b60f5102cd added a seqlock to ensure that the low and high port numbers were always updated together. This is overkill because the two 16bit port numbers can be held in a u32 and read/written in a single instruction. More recently 91d0b78c5177f added support for finer per-socket limits. The user-supplied value is 'high << 16 | low' but they are held separately and the socket options protected by the socket lock. Use a u32 containing 'high << 16 | low' for both the 'net' and 'sk' fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are always updated together. Change (the now trival) inet_get_local_port_range() to a static inline to optimise the calling code. (In particular avoiding returning integers by reference.) Signed-off-by: David Laight <david.laight@aculab.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/4e505d4198e946a8be03fb1b4c3072b0@AcuMS.aculab.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos() Christoph Paasch reported a panic in TCP stack [1] Indeed, we should not call sk_dst_reset() without holding the socket lock, as __sk_dst_get() callers do not all rely on bare RCU. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321 Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69 RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293 RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01 RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002 RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0 R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580 R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000 FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0 Call Trace: <TASK> tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567 tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419 tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839 tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323 __inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676 tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021 tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336 __sock_sendmsg+0x83/0xd0 net/socket.c:730 __sys_sendto+0x20a/0x2a0 net/socket.c:2194 __do_sys_sendto net/socket.c:2206 [inline] Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS") Reported-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos() Christoph Paasch reported a panic in TCP stack [1] Indeed, we should not call sk_dst_reset() without holding the socket lock, as __sk_dst_get() callers do not all rely on bare RCU. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321 Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69 RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293 RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01 RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002 RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0 R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580 R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000 FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0 Call Trace: <TASK> tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567 tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419 tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839 tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323 __inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676 tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021 tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336 __sock_sendmsg+0x83/0xd0 net/socket.c:730 __sys_sendto+0x20a/0x2a0 net/socket.c:2194 __do_sys_sendto net/socket.c:2206 [inline] Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS") Reported-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos() Christoph Paasch reported a panic in TCP stack [1] Indeed, we should not call sk_dst_reset() without holding the socket lock, as __sk_dst_get() callers do not all rely on bare RCU. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321 Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69 RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293 RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01 RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002 RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0 R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580 R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000 FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0 Call Trace: <TASK> tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567 tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419 tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839 tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323 __inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676 tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021 tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336 __sock_sendmsg+0x83/0xd0 net/socket.c:730 __sys_sendto+0x20a/0x2a0 net/socket.c:2194 __do_sys_sendto net/socket.c:2206 [inline] Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS") Reported-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> diff 878d951c Wed Oct 18 03:00:13 MDT 2023 Eric Dumazet <edumazet@google.com> inet: lock the socket in ip_sock_set_tos() Christoph Paasch reported a panic in TCP stack [1] Indeed, we should not call sk_dst_reset() without holding the socket lock, as __sk_dst_get() callers do not all rely on bare RCU. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321 Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69 RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293 RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01 RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002 RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0 R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580 R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000 FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0 Call Trace: <TASK> tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567 tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419 tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839 tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323 __inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676 tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021 tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336 __sock_sendmsg+0x83/0xd0 net/socket.c:730 __sys_sendto+0x20a/0x2a0 net/socket.c:2194 __do_sys_sendto net/socket.c:2206 [inline] Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS") Reported-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231018090014.345158-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> diff 3c5b4d69 Fri Jul 28 09:03:15 MDT 2023 Eric Dumazet <edumazet@google.com> net: annotate data-races around sk->sk_mark sk->sk_mark is often read while another thread could change the value. Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 3632679d Mon May 22 06:08:20 MDT 2023 Nicolas Dichtel <nicolas.dichtel@6wind.com> ipv{4,6}/raw: fix output xfrm lookup wrt protocol With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the protocol field of the flow structure, build by raw_sendmsg() / rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy lookup when some policies are defined with a protocol in the selector. For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to specify the protocol. Just accept all values for IPPROTO_RAW socket. For ipv4, the sin_port field of 'struct sockaddr_in' could not be used without breaking backward compatibility (the value of this field was never checked). Let's add a new kind of control message, so that the userland could specify which protocol is used. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 91d0b78c Tue Jan 24 06:36:43 MST 2023 Jakub Sitnicki <jakub@cloudflare.com> inet: Add IP_LOCAL_PORT_RANGE socket option Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
H A D | ipv6.h | diff 4bd0623f Wed Aug 16 02:15:41 MDT 2023 Eric Dumazet <edumazet@google.com> inet: move inet->transparent to inet->inet_flags IP_TRANSPARENT socket option can now be set/read without locking the socket. v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent() v4: rebased after commit 3f326a821b99 ("mptcp: change the mpc check helper to return a sk") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> diff 21985f43 Thu Oct 06 12:53:46 MDT 2022 Kuniyuki Iwashima <kuniyu@amazon.com> udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM). Commit 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") forgot to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6 socket into IPv4 with IPV6_ADDRFORM. After conversion, sk_prot is changed to udp_prot and ->destroy() never cleans it up, resulting in a memory leak. This is due to the discrepancy between inet6_destroy_sock() and IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM to remove the difference. However, this is not enough for now because rxpmtu can be changed without lock_sock() after commit 03485f2adcde ("udpv6: Add lockless sendmsg() support"). We will fix this case in the following patch. Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy() in the future. Fixes: 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 21985f43 Thu Oct 06 12:53:46 MDT 2022 Kuniyuki Iwashima <kuniyu@amazon.com> udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM). Commit 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") forgot to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6 socket into IPv4 with IPV6_ADDRFORM. After conversion, sk_prot is changed to udp_prot and ->destroy() never cleans it up, resulting in a memory leak. This is due to the discrepancy between inet6_destroy_sock() and IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM to remove the difference. However, this is not enough for now because rxpmtu can be changed without lock_sock() after commit 03485f2adcde ("udpv6: Add lockless sendmsg() support"). We will fix this case in the following patch. Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy() in the future. Fixes: 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff f93431c8 Tue Jun 07 06:00:27 MDT 2022 Wang Yufen <wangyufen@huawei.com> ipv6: Fix signed integer overflow in __ip6_append_data Resurrect ubsan overflow checks and ubsan report this warning, fix it by change the variable [length] type to size_t. UBSAN: signed-integer-overflow in net/ipv6/ip6_output.c:1489:19 2147479552 + 8567 cannot be represented in type 'int' CPU: 0 PID: 253 Comm: err Not tainted 5.16.0+ #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x214/0x230 show_stack+0x30/0x78 dump_stack_lvl+0xf8/0x118 dump_stack+0x18/0x30 ubsan_epilogue+0x18/0x60 handle_overflow+0xd0/0xf0 __ubsan_handle_add_overflow+0x34/0x44 __ip6_append_data.isra.48+0x1598/0x1688 ip6_append_data+0x128/0x260 udpv6_sendmsg+0x680/0xdd0 inet6_sendmsg+0x54/0x90 sock_sendmsg+0x70/0x88 ____sys_sendmsg+0xe8/0x368 ___sys_sendmsg+0x98/0xe0 __sys_sendmmsg+0xf4/0x3b8 __arm64_sys_sendmmsg+0x34/0x48 invoke_syscall+0x64/0x160 el0_svc_common.constprop.4+0x124/0x300 do_el0_svc+0x44/0xc8 el0_svc+0x3c/0x1e8 el0t_64_sync_handler+0x88/0xb0 el0t_64_sync+0x16c/0x170 Changes since v1: -Change the variable [length] type to unsigned, as Eric Dumazet suggested. Changes since v2: -Don't change exthdrlen type in ip6_make_skb, as Paolo Abeni suggested. Changes since v3: -Don't change ulen type in udpv6_sendmsg and l2tp_ip6_sendmsg, as Jakub Kicinski suggested. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Yufen <wangyufen@huawei.com> Link: https://lore.kernel.org/r/20220607120028.845916-1-wangyufen@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 4f6570d7 Tue Sep 24 09:01:14 MDT 2019 Eric Dumazet <edumazet@google.com> ipv6: add priority parameter to ip6_xmit() Currently, ip6_xmit() sets skb->priority based on sk->sk_priority This is not desirable for TCP since TCP shares the same ctl socket for a given netns. We want to be able to send RST or ACK packets with a non zero skb->priority. This patch has no functional change. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list syzbot reported nasty use-after-free [1] Lets remove frag_list field from structs ip_fraglist_iter and ip6_fraglist_iter. This seens not needed anyway. [1] : BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947 CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x44add9 Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9 RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf Allocated by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc_node mm/slab.c:3269 [inline] kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199 alloc_skb include/linux/skbuff.h:1058 [inline] __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 kfree_skbmem net/core/skbuff.c:625 [inline] kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619 __kfree_skb net/core/skbuff.c:682 [inline] kfree_skb net/core/skbuff.c:699 [inline] kfree_skb+0xf0/0x390 net/core/skbuff.c:693 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708 __dev_xmit_skb net/core/dev.c:3551 [inline] __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888085a3cbc0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 0 bytes inside of 224-byte region [ffff888085a3cbc0, ffff888085a3cca0) The buggy address belongs to the page: page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0 raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter") Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net> diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list syzbot reported nasty use-after-free [1] Lets remove frag_list field from structs ip_fraglist_iter and ip6_fraglist_iter. This seens not needed anyway. [1] : BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947 CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x44add9 Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9 RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf Allocated by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc_node mm/slab.c:3269 [inline] kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199 alloc_skb include/linux/skbuff.h:1058 [inline] __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 kfree_skbmem net/core/skbuff.c:625 [inline] kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619 __kfree_skb net/core/skbuff.c:682 [inline] kfree_skb net/core/skbuff.c:699 [inline] kfree_skb+0xf0/0x390 net/core/skbuff.c:693 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708 __dev_xmit_skb net/core/dev.c:3551 [inline] __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888085a3cbc0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 0 bytes inside of 224-byte region [ffff888085a3cbc0, ffff888085a3cca0) The buggy address belongs to the page: page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0 raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter") Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net> diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list syzbot reported nasty use-after-free [1] Lets remove frag_list field from structs ip_fraglist_iter and ip6_fraglist_iter. This seens not needed anyway. [1] : BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947 CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x44add9 Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9 RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf Allocated by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc_node mm/slab.c:3269 [inline] kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199 alloc_skb include/linux/skbuff.h:1058 [inline] __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 kfree_skbmem net/core/skbuff.c:625 [inline] kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619 __kfree_skb net/core/skbuff.c:682 [inline] kfree_skb net/core/skbuff.c:699 [inline] kfree_skb+0xf0/0x390 net/core/skbuff.c:693 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708 __dev_xmit_skb net/core/dev.c:3551 [inline] __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888085a3cbc0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 0 bytes inside of 224-byte region [ffff888085a3cbc0, ffff888085a3cca0) The buggy address belongs to the page: page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0 raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter") Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net> diff b7034146 Sun Jun 02 12:24:18 MDT 2019 Eric Dumazet <edumazet@google.com> net: fix use-after-free in kfree_skb_list syzbot reported nasty use-after-free [1] Lets remove frag_list field from structs ip_fraglist_iter and ip6_fraglist_iter. This seens not needed anyway. [1] : BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947 CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x44add9 Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9 RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf Allocated by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc_node mm/slab.c:3269 [inline] kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199 alloc_skb include/linux/skbuff.h:1058 [inline] __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 kfree_skbmem net/core/skbuff.c:625 [inline] kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619 __kfree_skb net/core/skbuff.c:682 [inline] kfree_skb net/core/skbuff.c:699 [inline] kfree_skb+0xf0/0x390 net/core/skbuff.c:693 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708 __dev_xmit_skb net/core/dev.c:3551 [inline] __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888085a3cbc0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 0 bytes inside of 224-byte region [ffff888085a3cbc0, ffff888085a3cca0) The buggy address belongs to the page: page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0 raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter") Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net> diff bec1f6f6 Thu Apr 26 11:42:17 MDT 2018 Willem de Bruijn <willemb@google.com> udp: generate gso with UDP_SEGMENT Support generic segmentation offload for udp datagrams. Callers can concatenate and send at once the payload of multiple datagrams with the same destination. To set segment size, the caller sets socket option UDP_SEGMENT to the length of each discrete payload. This value must be smaller than or equal to the relevant MTU. A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a per send call basis. Total byte length may then exceed MTU. If not an exact multiple of segment size, the last segment will be shorter. The implementation adds a gso_size field to the udp socket, ip(v6) cmsg cookie and inet_cork structure to be able to set the value at setsockopt or cmsg time and to work with both lockless and corked paths. Initial benchmark numbers show UDP GSO about as expensive as TCP GSO. tcp tso 3197 MB/s 54232 msg/s 54232 calls/s 6,457,754,262 cycles tcp gso 1765 MB/s 29939 msg/s 29939 calls/s 11,203,021,806 cycles tcp without tso/gso * 739 MB/s 12548 msg/s 12548 calls/s 11,205,483,630 cycles udp 876 MB/s 14873 msg/s 624666 calls/s 11,205,777,429 cycles udp gso 2139 MB/s 36282 msg/s 36282 calls/s 11,204,374,561 cycles [*] after reverting commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") Measured total system cycles ('-a') for one core while pinning both the network receive path and benchmark process to that core: perf stat -a -C 12 -e cycles \ ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4 Note the reduction in calls/s with GSO. Bytes per syscall drops increases from 1470 to 61818. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff bec1f6f6 Thu Apr 26 11:42:17 MDT 2018 Willem de Bruijn <willemb@google.com> udp: generate gso with UDP_SEGMENT Support generic segmentation offload for udp datagrams. Callers can concatenate and send at once the payload of multiple datagrams with the same destination. To set segment size, the caller sets socket option UDP_SEGMENT to the length of each discrete payload. This value must be smaller than or equal to the relevant MTU. A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a per send call basis. Total byte length may then exceed MTU. If not an exact multiple of segment size, the last segment will be shorter. The implementation adds a gso_size field to the udp socket, ip(v6) cmsg cookie and inet_cork structure to be able to set the value at setsockopt or cmsg time and to work with both lockless and corked paths. Initial benchmark numbers show UDP GSO about as expensive as TCP GSO. tcp tso 3197 MB/s 54232 msg/s 54232 calls/s 6,457,754,262 cycles tcp gso 1765 MB/s 29939 msg/s 29939 calls/s 11,203,021,806 cycles tcp without tso/gso * 739 MB/s 12548 msg/s 12548 calls/s 11,205,483,630 cycles udp 876 MB/s 14873 msg/s 624666 calls/s 11,205,777,429 cycles udp gso 2139 MB/s 36282 msg/s 36282 calls/s 11,204,374,561 cycles [*] after reverting commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") Measured total system cycles ('-a') for one core while pinning both the network receive path and benchmark process to that core: perf stat -a -C 12 -e cycles \ ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4 Note the reduction in calls/s with GSO. Bytes per syscall drops increases from 1470 to 61818. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
/linux-master/lib/ | ||
H A D | vsprintf.c | diff 72fcce70 Fri Oct 27 08:13:58 MDT 2023 Alexey Dobriyan <adobriyan@gmail.com> vsprintf: uninline simple_strntoull(), reorder arguments * uninline simple_strntoull(), gcc overinlines and this function is not performance critical * reorder arguments, so that appending INT_MAX as 4th argument generates very efficient tail call Space savings: add/remove: 1/0 grow/shrink: 0/3 up/down: 27/-179 (-152) Function old new delta simple_strntoll - 27 +27 simple_strtoull 15 10 -5 simple_strtoll 41 7 -34 vsscanf 1930 1790 -140 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/all/82a2af6e-9b6c-4a09-89d7-ca90cc1cdad1@p183/ diff 72fcce70 Fri Oct 27 08:13:58 MDT 2023 Alexey Dobriyan <adobriyan@gmail.com> vsprintf: uninline simple_strntoull(), reorder arguments * uninline simple_strntoull(), gcc overinlines and this function is not performance critical * reorder arguments, so that appending INT_MAX as 4th argument generates very efficient tail call Space savings: add/remove: 1/0 grow/shrink: 0/3 up/down: 27/-179 (-152) Function old new delta simple_strntoll - 27 +27 simple_strtoull 15 10 -5 simple_strtoll 41 7 -34 vsscanf 1930 1790 -140 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/all/82a2af6e-9b6c-4a09-89d7-ca90cc1cdad1@p183/ diff 4c85c0be Sun Jan 29 21:25:13 MST 2023 Hyeonggon Yoo <42.hyeyoo@gmail.com> mm, printk: introduce new format %pGt for page_type %pGp format is used to display 'flags' field of a struct page. However, some page flags (i.e. PG_buddy, see page-flags.h for more details) are stored in page_type field. To display human-readable output of page_type, introduce %pGt format. It is important to note the meaning of bits are different in page_type. if page_type is 0xffffffff, no flags are set. Setting PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit actually means setting a flag. Bits in page_type are inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values. if it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part] Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff ef62c8ff Thu Mar 24 19:09:02 MDT 2022 Waiman Long <longman@redhat.com> lib/vsprintf: avoid redundant work with 0 size Patch series "mm/page_owner: Extend page_owner to show memcg information", v4. While debugging the constant increase in percpu memory consumption on a system that spawned large number of containers, it was found that a lot of offline mem_cgroup structures remained in place without being freed. Further investigation indicated that those mem_cgroup structures were pinned by some pages. In order to find out what those pages are, the existing page_owner debugging tool is extended to show memory cgroup information and whether those memcgs are offline or not. With the enhanced page_owner tool, the following is a typical page that pinned the mem_cgroup structure in my test case: Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff) prep_new_page+0xac/0xe0 get_page_from_freelist+0x1327/0x14d0 __alloc_pages+0x191/0x340 alloc_pages_vma+0x84/0x250 shmem_alloc_page+0x3f/0x90 shmem_alloc_and_acct_page+0x76/0x1c0 shmem_getpage_gfp+0x281/0x940 shmem_write_begin+0x36/0xe0 generic_perform_write+0xed/0x1d0 __generic_file_write_iter+0xdc/0x1b0 generic_file_write_iter+0x5d/0xb0 new_sync_write+0x11f/0x1b0 vfs_write+0x1ba/0x2a0 ksys_write+0x59/0xd0 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Charged to offline memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8. So the page was not freed because it was part of a shmem segment. That is useful information that can help users to diagnose similar problems. With cgroup v1, /proc/cgroups can be read to find out the total number of memory cgroups (online + offline). With cgroup v2, the cgroup.stat of the root cgroup can be read to find the number of dying cgroups (most likely pinned by dying memcgs). The page_owner feature is not supposed to be enabled for production system due to its memory overhead. However, if it is suspected that dying memcgs are increasing over time, a test environment with page_owner enabled can then be set up with appropriate workload for further analysis on what may be causing the increasing number of dying memcgs. This patch (of 4): For *scnprintf(), vsnprintf() is always called even if the input size is 0. That is a waste of time, so just return 0 in this case. Note that vsnprintf() will never return -1 to indicate an error. So skipping the call to vsnprintf() when size is 0 will have no functional impact at all. Link: https://lkml.kernel.org/r/20220202203036.744010-1-longman@redhat.com Link: https://lkml.kernel.org/r/20220202203036.744010-2-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Ira Weiny <ira.weiny@intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 839b395e Mon Nov 08 19:33:22 MST 2021 Alexey Dobriyan <adobriyan@gmail.com> lib: uninline simple_strntoull() as well Codegen become bloated again after simple_strntoull() introduction add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-224 (-224) Function old new delta simple_strtoul 5 2 -3 simple_strtol 23 20 -3 simple_strtoull 119 15 -104 simple_strtoll 155 41 -114 Link: https://lkml.kernel.org/r/YVmlB9yY4lvbNKYt@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Richard Fitzgerald <rf@opensource.cirrus.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff c0891ac1 Mon Aug 02 14:40:32 MDT 2021 Alexey Dobriyan <adobriyan@gmail.com> isystem: ship and use stdarg.h Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>. stdarg.h is the only userspace header commonly used in the kernel. GPL 2 version of <stdarg.h> can be extracted from http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> diff ad65dcef Wed Jun 30 19:56:04 MDT 2021 Alexey Dobriyan <adobriyan@gmail.com> lib: uninline simple_strtoull() Gcc inlines simple_strtoull() too agressively. Given that all 4 signatures match, everything very efficiently calls or tailcalls into simple_strtoull(): ffffffff81da0240 <simple_strtoll>: ffffffff81da0240: 80 3f 2d cmp BYTE PTR [rdi],0x2d ffffffff81da0243: 74 05 je ffffffff81da024a <simple_strtoll+0xa> ffffffff81da0245: e9 76 ff ff ff jmp simple_strtoull ffffffff81da024a: 48 83 c7 01 add rdi,0x1 ffffffff81da024e: e8 6d ff ff ff call simple_strtoull ffffffff81da0253: 48 f7 d8 neg rax ffffffff81da0256: c3 ret Space savings (on F34-ish .config) add/remove: 0/0 grow/shrink: 1/3 up/down: 52/-313 (-261) Function old new delta vsscanf 2167 2219 +52 simple_strtoul 72 2 -70 simple_strtoll 143 23 -120 simple_strtol 143 20 -123 Link: https://lkml.kernel.org/r/YMO2zoOQk2eF34tn@localhost.localdomain Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 900fdc45 Fri May 14 10:12:04 MDT 2021 Richard Fitzgerald <rf@opensource.cirrus.com> lib: vsprintf: Fix handling of number field widths in vsscanf The existing code attempted to handle numbers by doing a strto[u]l(), ignoring the field width, and then repeatedly dividing to extract the field out of the full converted value. If the string contains a run of valid digits longer than will fit in a long or long long, this would overflow and no amount of dividing can recover the correct value. This patch fixes vsscanf() to obey number field widths when parsing the number. A new _parse_integer_limit() is added that takes a limit for the number of characters to parse. The number field conversion in vsscanf is changed to use this new function. If a number starts with a radix prefix, the field width must be long enough for at last one digit after the prefix. If not, it will be handled like this: sscanf("0x4", "%1i", &i): i=0, scanning continues with the 'x' sscanf("0x4", "%2i", &i): i=0, scanning continues with the '4' This is consistent with the observed behaviour of userland sscanf. Note that this patch does NOT fix the problem of a single field value overflowing the target type. So for example: sscanf("123456789abcdef", "%x", &i); Will not produce the correct result because the value obviously overflows INT_MAX. But sscanf will report a successful conversion. Note that where a very large number is used to mean "unlimited", the value INT_MAX is used for consistency with the behaviour of vsnprintf(). Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210514161206.30821-2-rf@opensource.cirrus.com diff c244297a Fri Mar 19 04:12:46 MDT 2021 Yafang Shao <laoar.shao@gmail.com> vsprintf: dump full information of page flags in pGp Currently the pGp only shows the names of page flags, rather than the full information including section, node, zone, last cpupid and kasan tag. While it is not easy to parse these information manually because there're so many flavors. Let's interpret them in pGp as well. To be compitable with the existed format of pGp, the new introduced ones also use '|' as the separator, then the user tools parsing pGp won't need to make change, suggested by Matthew. The new information is tracked onto the end of the existed one. On example of the output in mm/slub.c as follows, - Before the patch, [ 6343.396602] Slab 0x000000004382e02b objects=33 used=3 fp=0x000000009ae06ffc flags=0x17ffffc0010200(slab|head) - After the patch, [ 8448.272530] Slab 0x0000000090797883 objects=33 used=3 fp=0x00000000790f1c26 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff) The documentation and test cases are also updated. The output of the test cases as follows, [68599.816764] test_printf: loaded. [68599.819068] test_printf: all 388 tests passed [68599.830367] test_printf: unloaded. [lkp@intel.com: reported issues in the prev version in test_printf.c] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: kernel test robot <lkp@intel.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210319101246.73513-4-laoar.shao@gmail.com diff 5ead723a Sun Feb 14 09:13:48 MST 2021 Timur Tabi <timur@kernel.org> lib/vsprintf: no_hash_pointers prints all addresses as unhashed If the no_hash_pointers command line parameter is set, then printk("%p") will print pointers as unhashed, which is useful for debugging purposes. This change applies to any function that uses vsprintf, such as print_hex_dump() and seq_buf_printf(). A large warning message is displayed if this option is enabled. Unhashed pointers expose kernel addresses, which can be a security risk. Also update test_printf to skip the hashed pointer tests if the command-line option is set. Signed-off-by: Timur Tabi <timur@kernel.org> Acked-by: Petr Mladek <pmladek@suse.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Marco Elver <elver@google.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210214161348.369023-4-timur@kernel.org |
/linux-master/net/mac80211/ | ||
H A D | sta_info.c | diff 4d3acf43 Mon Aug 28 06:00:01 MDT 2023 Johannes Berg <johannes.berg@intel.com> wifi: mac80211: remove sta_mtx We now hold the wiphy mutex everywhere that we use or needed the sta_mtx, so we don't need this mutex any more. Remove it. Most of this change was done automatically with spatch. Signed-off-by: Johannes Berg <johannes.berg@intel.com> diff 4c51541d Fri Sep 02 08:12:54 MDT 2022 Benjamin Berg <benjamin.berg@intel.com> wifi: mac80211: keep A-MSDU data in sta and per-link The A-MSDU data needs to be stored per-link and aggregated into a single value for the station. Add a new struct ieee_80211_sta_aggregates in order to store this data and a new function ieee80211_sta_recalc_aggregates to update the current data for the STA. Note that in the non MLO case the pointer in ieee80211_sta will directly reference the data in deflink.agg, which means that recalculation may be skipped in that case. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> diff 4dde3c36 Mon Nov 29 06:32:42 MST 2021 Mordechay Goodstein <mordechay.goodstein@intel.com> mac80211: update channel context before station state Currently channel context is updated only after station got an update about new assoc state, this results in station using the old channel context. Fix this by moving the update channel context before updating station, enabling the driver to immediately use the updated channel context in the new assoc state. Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20211129152938.1c80c17ffd8a.I94ae31378b363c1182cfdca46c4b7e7165cff984@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> diff 7bc40aed Thu Nov 12 03:22:04 MST 2020 Johannes Berg <johannes.berg@intel.com> mac80211: free sta in sta_info_insert_finish() on errors If sta_info_insert_finish() fails, we currently keep the station around and free it only in the caller, but there's only one such caller and it always frees it immediately. As syzbot found, another consequence of this split is that we can put things that sleep only into __cleanup_single_sta() and not in sta_info_free(), but this is the only place that requires such of sta_info_free() now. Change this to free the station in sta_info_insert_finish(), in which case we can still sleep. This will also let us unify the cleanup code later. Cc: stable@vger.kernel.org Fixes: dcd479e10a05 ("mac80211: always wind down STA state") Reported-by: syzbot+32c6c38c4812d22f2f0b@syzkaller.appspotmail.com Reported-by: syzbot+4c81fe92e372d26c4246@syzkaller.appspotmail.com Reported-by: syzbot+6a7fe9faf0d1d61bc24a@syzkaller.appspotmail.com Reported-by: syzbot+abed06851c5ffe010921@syzkaller.appspotmail.com Reported-by: syzbot+b7aeb9318541a1c709f1@syzkaller.appspotmail.com Reported-by: syzbot+d5a9416c6cafe53b5dd0@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20201112112201.ee6b397b9453.I9c31d667a0ea2151441cc64ed6613d36c18a48e0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> diff 3ace10f5 Mon Nov 18 23:06:09 MST 2019 Kan Yan <kyan@google.com> mac80211: Implement Airtime-based Queue Limit (AQL) In order for the Fq_CoDel algorithm integrated in mac80211 layer to operate effectively to control excessive queueing latency, the CoDel algorithm requires an accurate measure of how long packets stays in the queue, AKA sojourn time. The sojourn time measured at the mac80211 layer doesn't include queueing latency in the lower layer (firmware/hardware) and CoDel expects lower layer to have a short queue. However, most 802.11ac chipsets offload tasks such TX aggregation to firmware or hardware, thus have a deep lower layer queue. Without a mechanism to control the lower layer queue size, packets only stay in mac80211 layer transiently before being sent to firmware queue. As a result, the sojourn time measured by CoDel in the mac80211 layer is almost always lower than the CoDel latency target, hence CoDel does little to control the latency, even when the lower layer queue causes excessive latency. The Byte Queue Limits (BQL) mechanism is commonly used to address the similar issue with wired network interface. However, this method cannot be applied directly to the wireless network interface. "Bytes" is not a suitable measure of queue depth in the wireless network, as the data rate can vary dramatically from station to station in the same network, from a few Mbps to over Gbps. This patch implements an Airtime-based Queue Limit (AQL) to make CoDel work effectively with wireless drivers that utilized firmware/hardware offloading. AQL allows each txq to release just enough packets to the lower layer to form 1-2 large aggregations to keep hardware fully utilized and retains the rest of the frames in mac80211 layer to be controlled by the CoDel algorithm. Signed-off-by: Kan Yan <kyan@google.com> [ Toke: Keep API to set pending airtime internal, fix nits in commit msg ] Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20191119060610.76681-4-kyan@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> diff 3e493173 Wed Sep 11 07:03:05 MDT 2019 Jouni Malinen <jouni@codeaurora.org> mac80211: Do not send Layer 2 Update frame before authorization The Layer 2 Update frame is used to update bridges when a station roams to another AP even if that STA does not transmit any frames after the reassociation. This behavior was described in IEEE Std 802.11F-2003 as something that would happen based on MLME-ASSOCIATE.indication, i.e., before completing 4-way handshake. However, this IEEE trial-use recommended practice document was published before RSN (IEEE Std 802.11i-2004) and as such, did not consider RSN use cases. Furthermore, IEEE Std 802.11F-2003 was withdrawn in 2006 and as such, has not been maintained amd should not be used anymore. Sending out the Layer 2 Update frame immediately after association is fine for open networks (and also when using SAE, FT protocol, or FILS authentication when the station is actually authenticated by the time association completes). However, it is not appropriate for cases where RSN is used with PSK or EAP authentication since the station is actually fully authenticated only once the 4-way handshake completes after authentication and attackers might be able to use the unauthenticated triggering of Layer 2 Update frame transmission to disrupt bridge behavior. Fix this by postponing transmission of the Layer 2 Update frame from station entry addition to the point when the station entry is marked authorized. Similarly, send out the VLAN binding update only if the STA entry has already been authorized. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net> diff 3e493173 Wed Sep 11 07:03:05 MDT 2019 Jouni Malinen <jouni@codeaurora.org> mac80211: Do not send Layer 2 Update frame before authorization The Layer 2 Update frame is used to update bridges when a station roams to another AP even if that STA does not transmit any frames after the reassociation. This behavior was described in IEEE Std 802.11F-2003 as something that would happen based on MLME-ASSOCIATE.indication, i.e., before completing 4-way handshake. However, this IEEE trial-use recommended practice document was published before RSN (IEEE Std 802.11i-2004) and as such, did not consider RSN use cases. Furthermore, IEEE Std 802.11F-2003 was withdrawn in 2006 and as such, has not been maintained amd should not be used anymore. Sending out the Layer 2 Update frame immediately after association is fine for open networks (and also when using SAE, FT protocol, or FILS authentication when the station is actually authenticated by the time association completes). However, it is not appropriate for cases where RSN is used with PSK or EAP authentication since the station is actually fully authenticated only once the 4-way handshake completes after authentication and attackers might be able to use the unauthenticated triggering of Layer 2 Update frame transmission to disrupt bridge behavior. Fix this by postponing transmission of the Layer 2 Update frame from station entry addition to the point when the station entry is marked authorized. Similarly, send out the VLAN binding update only if the STA entry has already been authorized. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net> diff 41cbb0f5 Sat Jun 09 00:14:44 MDT 2018 Luca Coelho <luciano.coelho@intel.com> mac80211: add support for HE Add support for HE in mac80211 conforming with P802.11ax_D1.4. Johannes: Fix another bug with the buf_size comparison in agg-rx.c. Signed-off-by: Liad Kaufman <liad.kaufman@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Ido Yariv <idox.yariv@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> diff 4c02d62f Thu Oct 05 11:39:10 MDT 2017 Kees Cook <keescook@chromium.org> net/mac80211/mesh_plink: Convert timers to use timer_setup() In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. This requires adding a pointer back to the sta_info since container_of() can't resolve the sta_info. Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: netdev@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> diff 4df864c1 Fri Jun 16 06:29:21 MDT 2017 Johannes Berg <johannes.berg@intel.com> networking: make skb_put & friends return void pointers It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not. Make these functions (skb_put, __skb_put and pskb_put) return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_put, __skb_put }; @@ - *(fn(SKB, LEN)) + *(u8 *)fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_put, __skb_put }; type T; @@ - E = ((T *)(fn(SKB, LEN))) + E = fn(SKB, LEN) which actually doesn't cover pskb_put since there are only three users overall. A handful of stragglers were converted manually, notably a macro in drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many instances in net/bluetooth/hci_sock.c. In the former file, I also had to fix one whitespace problem spatch introduced. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
/linux-master/drivers/scsi/fcoe/ | ||
H A D | fcoe.c | diff 276f9aa2 Wed Mar 03 07:46:04 MST 2021 Lee Jones <lee.jones@linaro.org> scsi: fcoe: Fix function name fcoe_set_vport_symbolic_name() in description Fixes the following W=1 kernel build warning(s): drivers/scsi/fcoe/fcoe.c:2782: warning: expecting prototype for fcoe_vport_set_symbolic_name(). Prototype was for fcoe_set_vport_symbolic_name() instead Link: https://lore.kernel.org/r/20210303144631.3175331-4-lee.jones@linaro.org Cc: Hannes Reinecke <hare@suse.de> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: linux-scsi@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> diff b7a9d0c6 Mon Apr 20 21:40:08 MDT 2020 Jason Yan <yanaijie@huawei.com> scsi: fcoe: remove unneeded semicolon in fcoe.c Fix the following coccicheck warning: drivers/scsi/fcoe/fcoe.c:1918:3-4: Unneeded semicolon drivers/scsi/fcoe/fcoe.c:1930:3-4: Unneeded semicolon Link: https://lore.kernel.org/r/20200421034008.27865-1-yanaijie@huawei.com Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> diff b7a9d0c6 Mon Apr 20 21:40:08 MDT 2020 Jason Yan <yanaijie@huawei.com> scsi: fcoe: remove unneeded semicolon in fcoe.c Fix the following coccicheck warning: drivers/scsi/fcoe/fcoe.c:1918:3-4: Unneeded semicolon drivers/scsi/fcoe/fcoe.c:1930:3-4: Unneeded semicolon Link: https://lore.kernel.org/r/20200421034008.27865-1-yanaijie@huawei.com Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> diff 4df864c1 Fri Jun 16 06:29:21 MDT 2017 Johannes Berg <johannes.berg@intel.com> networking: make skb_put & friends return void pointers It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not. Make these functions (skb_put, __skb_put and pskb_put) return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_put, __skb_put }; @@ - *(fn(SKB, LEN)) + *(u8 *)fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_put, __skb_put }; type T; @@ - E = ((T *)(fn(SKB, LEN))) + E = fn(SKB, LEN) which actually doesn't cover pskb_put since there are only three users overall. A handful of stragglers were converted manually, notably a macro in drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many instances in net/bluetooth/hci_sock.c. In the former file, I also had to fix one whitespace problem spatch introduced. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 4b9bc86d Tue Apr 12 09:16:54 MDT 2016 Sebastian Andrzej Siewior <bigeasy@linutronix.de> fcoe: convert to kworker The driver creates its own per-CPU threads which are updated based on CPU hotplug events. It is also possible to use kworkers and remove some of the kthread infrastrucure. The code checked ->thread to decide if there is an active per-CPU thread. By using the kworker infrastructure this is no longer possible (or required). The thread pointer is saved in `kthread' instead of `thread' so anything trying to use thread is caught by the compiler. Currently only the bnx2fc driver is using struct fcoe_percpu_s and the kthread member. After a CPU went offline, we may still enqueue items on the "offline" CPU. This isn't much of a problem. The work will be done on a random CPU. The allocated crc_eof_page page won't be cleaned up. It is probably expected that the CPU comes up at some point so it should not be a problem. The crc_eof_page memory is released of course once the module is removed. This patch was only compile-tested due to -ENODEV. Cc: Vasu Dev <vasu.dev@intel.com> Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: fcoe-devel@open-fcoe.org Cc: linux-scsi@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> diff 860eca2b Tue Sep 27 22:38:18 MDT 2011 Vasu Dev <vasu.dev@intel.com> [SCSI] fcoe: setup default initial value for DDP threshold Currently fcoe_ddp_min doesn't have default value so by default not used, so setting up default value as 4k as this works better by avoiding overhead of programing DDP for small IOs. Signed-off-by: Vasu Dev <vasu.dev@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Signed-off-by: Yi Zou <yi.zou@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> diff 4bc71cb9 Fri Sep 02 21:34:30 MDT 2011 Jiri Pirko <jpirko@redhat.com> net: consolidate and fix ethtool_ops->get_settings calling This patch does several things: - introduces __ethtool_get_settings which is called from ethtool code and from drivers as well. Put ASSERT_RTNL there. - dev_ethtool_get_settings() is replaced by __ethtool_get_settings() - changes calling in drivers so rtnl locking is respected. In iboe_get_rate was previously ->get_settings() called unlocked. This fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same problem. Also fixed by calling __dev_get_by_index() instead of dev_get_by_index() and holding rtnl_lock for both calls. - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create() so bnx2fc_if_create() and fcoe_if_create() are called locked as they are from other places. - use __ethtool_get_settings() in bonding code Signed-off-by: Jiri Pirko <jpirko@redhat.com> v2->v3: -removed dev_ethtool_get_settings() -added ASSERT_RTNL into __ethtool_get_settings() -prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock around it and __ethtool_get_settings() call v1->v2: add missing export_symbol Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits] Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net> diff 848e7d5b Thu Aug 25 01:40:47 MDT 2011 Robert Love <robert.w.love@intel.com> [SCSI] fcoe: Fix deadlock between fip's recv_work and rtnl The rtnl cannot be held durrng the fcoe_interface_put. If it is the last reference on the fcoe_interface the fcoe_ctlr_destroy will be called as a part of the cleanup, ultimately calling cancel_work_sync(&fip->recv_work); If we are processing a flogi response we will be in the recv_work context and we will lock the rtnl to add a new unicast MAC address. This is how the deadlock can occur. The fix is simply to move the rtnl_lock/unlock into fcoe_interface_cleanup so that it can be unlocked before fcoe_interface_put is called. Here is the lockdep report: Jul 21 11:26:35 bubba [ 223.870702] ul 21 11:26:35 bubba [ 223.870704] ======================================================= Jul 21 11:26:35 bubba [ 223.871255] [ INFO: possible circular locking dependency detected ] Jul 21 11:26:35 bubba [ 223.871530] 3.0.0-rc7+ #1 Jul 21 11:26:35 bubba [ 223.871797] ------------------------------------------------------- Jul 21 11:26:35 bubba [ 223.872072] lockdeptest.sh/3464 is trying to acquire lock: Jul 21 11:26:35 bubba [ 223.872345] ((&fip->recv_work) Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffff810531f1>] wait_on_work+0x0/0xbd Jul 21 11:26:35 bubba [ 223.873022] Jul 21 11:26:35 bubba [ 223.873023] but task is already holding lock: Jul 21 11:26:35 bubba [ 223.873555] (rtnl_mutex Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffff813e8233>] rtnl_lock+0x12/0x14 Jul 21 11:26:35 bubba [ 223.874229] Jul 21 11:26:35 bubba [ 223.874230] which lock already depends on the new lock. Jul 21 11:26:35 bubba [ 223.874231] Jul 21 11:26:35 bubba [ 223.875032] Jul 21 11:26:35 bubba [ 223.875033] the existing dependency chain (in reverse order) is: Jul 21 11:26:35 bubba [ 223.875573] Jul 21 11:26:35 bubba [ 223.875573] -> #1 Jul 21 11:26:35 bubba (rtnl_mutex Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba : Jul 21 11:26:35 bubba [ 223.876301] Jul 21 11:26:35 bubba [<ffffffff8106c14a>] lock_acquire+0xd2/0xf7 Jul 21 11:26:35 bubba [ 223.876645] Jul 21 11:26:35 bubba [<ffffffff8151d975>] __mutex_lock_common+0x47/0x30d Jul 21 11:26:35 bubba [ 223.876991] Jul 21 11:26:35 bubba [<ffffffff8151dd36>] mutex_lock_nested+0x3b/0x40 Jul 21 11:26:35 bubba [ 223.877334] Jul 21 11:26:35 bubba [<ffffffff813e8233>] rtnl_lock+0x12/0x14 Jul 21 11:26:35 bubba [ 223.877675] Jul 21 11:26:35 bubba [<ffffffffa003d5a0>] fcoe_update_src_mac+0x2b/0x80 [fcoe] Jul 21 11:26:35 bubba [ 223.878022] Jul 21 11:26:35 bubba [<ffffffffa003d698>] fcoe_flogi_resp+0x5e/0x79 [fcoe] Jul 21 11:26:35 bubba [ 223.878366] Jul 21 11:26:35 bubba [<ffffffffa001566f>] fc_exch_recv+0x7f5/0x9da [libfc] Jul 21 11:26:35 bubba [ 223.878713] Jul 21 11:26:35 bubba [<ffffffffa00327d8>] fcoe_ctlr_recv_work+0x71f/0x10dc [libfcoe] Jul 21 11:26:35 bubba [ 223.879258] Jul 21 11:26:35 bubba [<ffffffff81053761>] process_one_work+0x1d7/0x347 Jul 21 11:26:35 bubba [ 223.879601] Jul 21 11:26:35 bubba [<ffffffff81054ade>] worker_thread+0xf8/0x17c Jul 21 11:26:35 bubba [ 223.879944] Jul 21 11:26:35 bubba [<ffffffff81058184>] kthread+0x7d/0x85 Jul 21 11:26:35 bubba [ 223.880287] Jul 21 11:26:35 bubba [<ffffffff81526414>] kernel_thread_helper+0x4/0x10 Jul 21 11:26:35 bubba [ 223.880634] Jul 21 11:26:35 bubba [ 223.880635] -> #0 Jul 21 11:26:35 bubba ((&fip->recv_work) Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba : Jul 21 11:26:35 bubba [ 223.881357] Jul 21 11:26:35 bubba [<ffffffff8106b93e>] __lock_acquire+0xb1d/0xe2c Jul 21 11:26:35 bubba [ 223.881695] Jul 21 11:26:35 bubba [<ffffffff8106c14a>] lock_acquire+0xd2/0xf7 Jul 21 11:26:35 bubba [ 223.882033] Jul 21 11:26:35 bubba [<ffffffff81053241>] wait_on_work+0x50/0xbd Jul 21 11:26:35 bubba [ 223.882378] Jul 21 11:26:35 bubba [<ffffffff81053b32>] __cancel_work_timer+0xb6/0xf4 Jul 21 11:26:35 bubba [ 223.882718] Jul 21 11:26:35 bubba [<ffffffff81053b8a>] cancel_work_sync+0xb/0xd Jul 21 11:26:35 bubba [ 223.883057] Jul 21 11:26:35 bubba [<ffffffffa00317e6>] fcoe_ctlr_destroy+0x1d/0x67 [libfcoe] Jul 21 11:26:35 bubba [ 223.883399] Jul 21 11:26:35 bubba [<ffffffffa003e51e>] fcoe_interface_release+0x21/0x45 [fcoe] Jul 21 11:26:35 bubba [ 223.883940] Jul 21 11:26:35 bubba [<ffffffff811fbbe6>] kref_put+0x43/0x4d Jul 21 11:26:35 bubba [ 223.884280] Jul 21 11:26:35 bubba [<ffffffffa003ebba>] fcoe_interface_put+0x17/0x19 [fcoe] Jul 21 11:26:35 bubba [ 223.884624] Jul 21 11:26:35 bubba [<ffffffffa003f2a6>] fcoe_interface_cleanup+0x188/0x193 [fcoe] Jul 21 11:26:35 bubba [ 223.885163] Jul 21 11:26:35 bubba [<ffffffffa003f303>] fcoe_destroy+0x52/0x72 [fcoe] Jul 21 11:26:35 bubba [ 223.885502] Jul 21 11:26:35 bubba [<ffffffffa00340a4>] fcoe_transport_destroy+0xab/0x110 [libfcoe] Jul 21 11:26:35 bubba [ 223.886045] Jul 21 11:26:35 bubba [<ffffffff81056153>] param_attr_store+0x43/0x62 Jul 21 11:26:35 bubba [ 223.886385] Jul 21 11:26:35 bubba [<ffffffff8105602d>] module_attr_store+0x21/0x25 Jul 21 11:26:35 bubba [ 223.886728] Jul 21 11:26:35 bubba [<ffffffff8114c23d>] sysfs_write_file+0x103/0x13f Jul 21 11:26:35 bubba [ 223.887068] Jul 21 11:26:35 bubba [<ffffffff810f3e7b>] vfs_write+0xa7/0xfa Jul 21 11:26:35 bubba [ 223.887406] Jul 21 11:26:35 bubba [<ffffffff810f4073>] sys_write+0x45/0x69 Jul 21 11:26:35 bubba [ 223.887742] Jul 21 11:26:35 bubba [<ffffffff815252bb>] system_call_fastpath+0x16/0x1b Jul 21 11:26:35 bubba [ 223.888083] Jul 21 11:26:35 bubba [ 223.888084] other info that might help us debug this: Jul 21 11:26:35 bubba [ 223.888085] Jul 21 11:26:35 bubba [ 223.888879] Possible unsafe locking scenario: Jul 21 11:26:35 bubba [ 223.888881] Jul 21 11:26:35 bubba [ 223.889411] CPU0 CPU1 Jul 21 11:26:35 bubba [ 223.889683] ---- ---- Jul 21 11:26:35 bubba [ 223.889955] lock( Jul 21 11:26:35 bubba rtnl_mutex Jul 21 11:26:35 bubba ); Jul 21 11:26:35 bubba [ 223.890349] lock( Jul 21 11:26:35 bubba (&fip->recv_work) Jul 21 11:26:35 bubba ); Jul 21 11:26:35 bubba [ 223.890751] lock( Jul 21 11:26:35 bubba rtnl_mutex Jul 21 11:26:35 bubba ); Jul 21 11:26:35 bubba [ 223.891154] lock( Jul 21 11:26:35 bubba (&fip->recv_work) Jul 21 11:26:35 bubba ); Jul 21 11:26:35 bubba [ 223.891549] Jul 21 11:26:35 bubba [ 223.891550] *** DEADLOCK *** Jul 21 11:26:35 bubba [ 223.891551] Jul 21 11:26:35 bubba [ 223.892347] 6 locks held by lockdeptest.sh/3464: Jul 21 11:26:35 bubba [ 223.892621] #0: Jul 21 11:26:35 bubba (&buffer->mutex Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffff8114c171>] sysfs_write_file+0x37/0x13f Jul 21 11:26:35 bubba [ 223.893359] #1: Jul 21 11:26:35 bubba (s_active Jul 21 11:26:35 bubba ){++++.+} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffff8114c21c>] sysfs_write_file+0xe2/0x13f Jul 21 11:26:35 bubba [ 223.894094] #2: Jul 21 11:26:35 bubba (param_lock Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffff81056146>] param_attr_store+0x36/0x62 Jul 21 11:26:35 bubba [ 223.894835] #3: Jul 21 11:26:35 bubba (ft_mutex Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffffa0034017>] fcoe_transport_destroy+0x1e/0x110 [libfcoe] Jul 21 11:26:35 bubba [ 223.895574] #4: Jul 21 11:26:35 bubba (fcoe_config_mutex Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffffa003f2c9>] fcoe_destroy+0x18/0x72 [fcoe] Jul 21 11:26:35 bubba [ 223.896314] #5: Jul 21 11:26:35 bubba (rtnl_mutex Jul 21 11:26:35 bubba ){+.+.+.} Jul 21 11:26:35 bubba , at: Jul 21 11:26:35 bubba [<ffffffff813e8233>] rtnl_lock+0x12/0x14 Jul 21 11:26:35 bubba [ 223.897047] Jul 21 11:26:35 bubba [ 223.897048] stack backtrace: Jul 21 11:26:35 bubba [ 223.897578] Pid: 3464, comm: lockdeptest.sh Not tainted 3.0.0-rc7+ #1 Jul 21 11:26:35 bubba [ 223.897853] Call Trace: Jul 21 11:26:35 bubba [ 223.898128] [<ffffffff81068e16>] print_circular_bug+0x1f8/0x209 Jul 21 11:26:35 bubba [ 223.898416] [<ffffffff8106b93e>] __lock_acquire+0xb1d/0xe2c Jul 21 11:26:35 bubba [ 223.898699] [<ffffffff810531f1>] ? wait_on_cpu_work+0xe6/0xe6 Jul 21 11:26:35 bubba [ 223.898982] [<ffffffff8106c14a>] lock_acquire+0xd2/0xf7 Jul 21 11:26:35 bubba [ 223.899263] [<ffffffff810531f1>] ? wait_on_cpu_work+0xe6/0xe6 Jul 21 11:26:35 bubba [ 223.899547] [<ffffffff8104a097>] ? mod_timer+0x8f/0x98 Jul 21 11:26:35 bubba [ 223.899827] [<ffffffff81053241>] wait_on_work+0x50/0xbd Jul 21 11:26:35 bubba [ 223.900108] [<ffffffff810531f1>] ? wait_on_cpu_work+0xe6/0xe6 Jul 21 11:26:35 bubba [ 223.900390] [<ffffffff81053b32>] __cancel_work_timer+0xb6/0xf4 Jul 21 11:26:35 bubba [ 223.900671] [<ffffffff81053b8a>] cancel_work_sync+0xb/0xd Jul 21 11:26:35 bubba [ 223.900953] [<ffffffffa00317e6>] fcoe_ctlr_destroy+0x1d/0x67 [libfcoe] Jul 21 11:26:35 bubba [ 223.901237] [<ffffffffa003e51e>] fcoe_interface_release+0x21/0x45 [fcoe] Jul 21 11:26:35 bubba [ 223.901522] [<ffffffffa003e4fd>] ? fcoe_enable+0x6b/0x6b [fcoe] Jul 21 11:26:35 bubba [ 223.901803] [<ffffffff811fbbe6>] kref_put+0x43/0x4d Jul 21 11:26:35 bubba [ 223.902083] [<ffffffffa003ebba>] fcoe_interface_put+0x17/0x19 [fcoe] Jul 21 11:26:35 bubba [ 223.902367] [<ffffffffa003f2a6>] fcoe_interface_cleanup+0x188/0x193 [fcoe] Jul 21 11:26:35 bubba [ 223.902653] [<ffffffff8151dd36>] ? mutex_lock_nested+0x3b/0x40 Jul 21 11:26:35 bubba [ 223.902939] [<ffffffffa003f303>] fcoe_destroy+0x52/0x72 [fcoe] Jul 21 11:26:35 bubba [ 223.903223] [<ffffffffa00340a4>] fcoe_transport_destroy+0xab/0x110 [libfcoe] Jul 21 11:26:35 bubba [ 223.903508] [<ffffffff81056153>] param_attr_store+0x43/0x62 Jul 21 11:26:35 bubba [ 223.903792] [<ffffffff8105602d>] module_attr_store+0x21/0x25 Jul 21 11:26:35 bubba [ 223.904075] [<ffffffff8114c23d>] sysfs_write_file+0x103/0x13f Jul 21 11:26:35 bubba [ 223.904357] [<ffffffff810f3e7b>] vfs_write+0xa7/0xfa Jul 21 11:26:35 bubba [ 223.904642] [<ffffffff810f51d6>] ? fget_light+0x35/0x96 Jul 21 11:26:35 bubba [ 223.904923] [<ffffffff810f4073>] sys_write+0x45/0x69 Jul 21 11:26:35 bubba [ 223.905204] [<ffffffff815252bb>] system_call_fastpath+0x16/0x1b Jul 21 11:26:36 bubba [ 223.964438] ixgbe 0000:05:00.0: eth3: detected SFP+: 5 Jul 21 11:26:37 bubba [ 225.196702] ixgbe 0000:05:00.0: eth3: NIC Link is Up 10 Gbps, Flow Control: None Signed-off-by: Robert Love <robert.w.love@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Reviewed-by: Yi Zou <yi.zou@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> diff e10f8c66 Tue Jul 20 16:20:30 MDT 2010 Joe Eykholt <jeykholt@cisco.com> [SCSI] libfcoe: fcoe: fnic: add FIP VN2VN point-to-multipoint support The FC-BB-6 committee is proposing a new FIP usage model called VN_port to VN_port mode. It allows VN_ports to discover each other over a loss-free L2 Ethernet without any FCF or Fibre-channel fabric services. This is point-to-multipoint. There is also a variant of this called point-to-point which provides for making sure there is just one pair of ports operating over the Ethernet fabric. We add these new states: VNMP_START, _PROBE1, _PROBE2, _CLAIM, and _UP. These usually go quickly in that sequence. After waiting a random amount of time up to 100 ms in START, we select a pseudo-random proposed locally-unique port ID and send out probes in states PROBE1 and PROBE2, 100 ms apart. If no probe responses are heard, we proceed to CLAIM state 400 ms later and send a claim notification. We wait another 400 ms to receive claim responses, which give us a list of the other nodes on the network, including their FC-4 capabilities. After another 400 ms we go to VNMP_UP state and should start interoperating with any of the nodes for whic we receivec claim responses. More details are in the spec.j Add the new mode as FIP_MODE_VN2VN. The driver must specify explicitly that it wants to operate in this mode. There is no automatic detection between point-to-multipoint and fabric mode, and the local port initialization is affected, so it isn't anticipated that there will ever be any such automatic switchover. It may eventually be possible to have both fabric and VN2VN modes on the same L2 network, which may be done by two separate local VN_ports (lports). When in VN2VN mode, FIP replaces libfc's fabric-oriented discovery module with its own simple code that adds remote ports as they are discovered from incoming claim notifications and responses. These hooks are placed by fcoe_disc_init(). A linear list of discovered vn_ports is maintained under the fcoe_ctlr struct. It is expected to be short for now, and accessed infrequently. It is kept under RCU for lock-ordering reasons. The lport and/or rport mutexes may be held when we need to lookup a fcoe_vnport during an ELS send. Change fcoe_ctlr_encaps() to lookup the destination vn_port in the list of peers for the destination MAC address of the FIP-encapsulated frame. Add a new function fcoe_disc_init() to initialize just the discovery portion of libfcoe for VN2VN mode. Signed-off-by: Joe Eykholt <jeykholt@cisco.com> Signed-off-by: Robert Love <robert.w.love@intel.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 4be929be Mon May 24 15:33:03 MDT 2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
/linux-master/include/linux/ | ||
H A D | kernel.h | diff b229baa3 Tue Jul 18 15:11:44 MDT 2023 Andy Shevchenko <andriy.shevchenko@linux.intel.com> kernel.h: split out COUNT_ARGS() and CONCATENATE() to args.h Patch series "kernel.h: Split out a couple of macros to args.h", v4. There are macros in kernel.h that can be used outside of that header. Split them to args.h and replace open coded variants. This patch (of 4): kernel.h is being used as a dump for all kinds of stuff for a long time. The COUNT_ARGS() and CONCATENATE() macros may be used in some places without need of the full kernel.h dependency train with it. Here is the attempt on cleaning it up by splitting out these macros(). While at it, include new header where it's being used. Link: https://lkml.kernel.org/r/20230718211147.18647-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20230718211147.18647-2-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> [PCI] Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Daniel Latypov <dlatypov@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Gow <davidgow@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff a20deb3a Mon Nov 08 19:33:51 MST 2021 Kefeng Wang <wangkefeng.wang@huawei.com> sections: move and rename core_kernel_data() to is_kernel_core_data() Move core_kernel_data() into sections.h and rename it to is_kernel_core_data(), also make it return bool value, then update all the callers. Link: https://lkml.kernel.org/r/20210930071143.63410-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexander Potapenko <glider@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Mackerras <paulus@samba.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff c0891ac1 Mon Aug 02 14:40:32 MDT 2021 Alexey Dobriyan <adobriyan@gmail.com> isystem: ship and use stdarg.h Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>. stdarg.h is the only userspace header commonly used in the kernel. GPL 2 version of <stdarg.h> can be extracted from http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> diff 4c527293 Wed Jun 30 19:56:10 MDT 2021 Andy Shevchenko <andriy.shevchenko@linux.intel.com> kernel.h: split out kstrtox() and simple_strtox() to a separate header kernel.h is being used as a dump for all kinds of stuff for a long time. Here is the attempt to start cleaning it up by splitting out kstrtox() and simple_strtox() helpers. At the same time convert users in header and lib folders to use new header. Though for time being include new header back to kernel.h to avoid twisted indirected includes for existing users. [andy.shevchenko@gmail.com: fix documentation references] Link: https://lkml.kernel.org/r/20210615220003.377901-1-andy.shevchenko@gmail.com Link: https://lkml.kernel.org/r/20210611185815.44103-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Francis Laniel <laniel_francis@privacyrequired.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Kars Mulder <kerneldev@karsmulder.nl> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ef0f2685 Tue Aug 11 19:34:56 MDT 2020 Kars Mulder <kerneldev@karsmulder.nl> kstrto*: do not describe simple_strto*() as obsolete/replaced The documentation of the kstrto*() functions describes kstrto*() as "replacements" of the "obsolete" simple_strto*() functions. Both of these terms are inaccurate: they're not replacements because they have different behaviour, and the simple_strto*() are not obsolete because there are cases where they have benefits over kstrto*(). Remove usage of the terms "replacement" and "obsolete" in reference to simple_strto*(), and instead use the term "preferred over". Fixes: 4c925d6031f71 ("kstrto*: add documentation") Fixes: 885e68e8b7b13 ("kernel.h: update comment about simple_strto<foo>() functions") Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Eldad Zack <eldad@fogrefinery.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Mans Rullgard <mans@mansr.com> Cc: Petr Mladek <pmladek@suse.com> Link: http://lkml.kernel.org/r/29b9-5f234c80-13-4e3aa200@244003027 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff ef0f2685 Tue Aug 11 19:34:56 MDT 2020 Kars Mulder <kerneldev@karsmulder.nl> kstrto*: do not describe simple_strto*() as obsolete/replaced The documentation of the kstrto*() functions describes kstrto*() as "replacements" of the "obsolete" simple_strto*() functions. Both of these terms are inaccurate: they're not replacements because they have different behaviour, and the simple_strto*() are not obsolete because there are cases where they have benefits over kstrto*(). Remove usage of the terms "replacement" and "obsolete" in reference to simple_strto*(), and instead use the term "preferred over". Fixes: 4c925d6031f71 ("kstrto*: add documentation") Fixes: 885e68e8b7b13 ("kernel.h: update comment about simple_strto<foo>() functions") Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Eldad Zack <eldad@fogrefinery.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Mans Rullgard <mans@mansr.com> Cc: Petr Mladek <pmladek@suse.com> Link: http://lkml.kernel.org/r/29b9-5f234c80-13-4e3aa200@244003027 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff b642e44e Tue Aug 11 19:34:53 MDT 2020 Kars Mulder <kerneldev@karsmulder.nl> kstrto*: correct documentation references to simple_strto*() The documentation of the kstrto*() functions reference the simple_strtoull function by "used as a replacement for [the obsolete] simple_strtoull". All these functions describes themselves as replacements for the function simple_strtoull, even though a function like kstrtol() would be more aptly described as a replacement of simple_strtol(). Fix these references by making the documentation of kstrto*() reference the closest simple_strto*() equivalent available. The functions kstrto[u]int() do not have direct simple_strto[u]int() equivalences, so these are made to refer to simple_strto[u]l() instead. Furthermore, add parentheses after function names, as is standard in kernel documentation. Fixes: 4c925d6031f71 ("kstrto*: add documentation") Signed-off-by: Kars Mulder <kerneldev@karsmulder.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Eldad Zack <eldad@fogrefinery.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Mans Rullgard <mans@mansr.com> Cc: Petr Mladek <pmladek@suse.com> Link: http://lkml.kernel.org/r/1ee1-5f234c00-f3-165a6440@234394593 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> diff 229f5879 Wed Jul 22 05:03:05 MDT 2020 Kishon Vijay Abraham I <kishon@ti.com> linux/kernel.h: Add PTR_ALIGN_DOWN macro Add a macro for aligning down a pointer. This is useful to get an aligned register address when a device allows only word access and doesn't allow half word or byte access. Link: https://lore.kernel.org/r/20200722110317.4744-4-kishon@ti.com Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Acked-by: Rob Herring <robh@kernel.org> diff 4e139c77 Fri Feb 14 06:39:19 MST 2020 Thomas Gleixner <tglx@linutronix.de> sched: Provide cant_migrate() Some code pathes rely on preempt_disable() to prevent migration on a non RT enabled kernel. These preempt_disable/enable() pairs are substituted by migrate_disable/enable() pairs or other forms of RT specific protection. On RT these protections prevent migration but not preemption. Obviously a cant_sleep() check in such a section will trigger on RT because preemption is not disabled. Provide a cant_migrate() macro which maps to cant_sleep() on a non RT kernel and an empty placeholder for RT for now. The placeholder will be changed to a proper debug check along with the RT specific migration protection mechanism. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200214161503.070487511@linutronix.de diff 312364f3 Mon Aug 26 14:14:23 MDT 2019 Daniel Vetter <daniel.vetter@ffwll.ch> kernel.h: Add non_block_start/end() In some special cases we must not block, but there's not a spinlock, preempt-off, irqs-off or similar critical section already that arms the might_sleep() debug checks. Add a non_block_start/end() pair to annotate these. This will be used in the oom paths of mmu-notifiers, where blocking is not allowed to make sure there's forward progress. Quoting Michal: "The notifier is called from quite a restricted context - oom_reaper - which shouldn't depend on any locks or sleepable conditionals. The code should be swift as well but we mostly do care about it to make a forward progress. Checking for sleepable context is the best thing we could come up with that would describe these demands at least partially." Peter also asked whether we want to catch spinlocks on top, but Michal said those are less of a problem because spinlocks can't have an indirect dependency upon the page allocator and hence close the loop with the oom reaper. Suggested by Michal Hocko. Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch Acked-by: Christian König <christian.koenig@amd.com> (v1) Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
H A D | sched.h | diff fa3bea4e Thu Feb 01 22:02:37 MST 2024 Gregory Price <gourry.memverge@gmail.com> mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be an wise option. However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics. Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node. For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1). Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/ For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time. The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1. The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2). Some high level notes about the pieces of weighted interleave: current->il_prev: Tracks the node previously allocated from. current->il_weight: The active weight of the current node (current->il_prev) When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight. weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. Operates only on task->mempolicy. weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies. bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them. If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first. Operates only on the task->mempolicy. One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actually application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above. An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx. For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively). Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com> Signed-off-by: Gregory Price <gregory.price@memverge.com> Co-developed-by: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Rakie Kim <rakie.kim@sk.com> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff 0eb5085c Thu Nov 16 06:36:38 MST 2023 Heiko Carstens <hca@linux.ibm.com> arch: remove ARCH_TASK_STRUCT_ON_STACK IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ON_STACK. IA-64 was removed with commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture"). Therefore remove support for ARCH_TASK_STRUCT_ON_STACK as well. Note: this also reveals a potential bug in powerpc code, which makes use of __init_task_data without selecting ARCH_TASK_STRUCT_ON_STACK which makes __init_task_data a no-op. This is broken since commit d11ed3ab3166 ("Expand INIT_TASK() in init/init_task.c and remove") from 2018 and needs to be addressed separately. Link: https://lkml.kernel.org/r/20231116133638.1636277-4-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff e3ff7c60 Fri Feb 24 09:50:00 MST 2023 Josh Poimboeuf <jpoimboe@kernel.org> livepatch,sched: Add livepatch task switching to cond_resched() There have been reports [1][2] of live patches failing to complete within a reasonable amount of time due to CPU-bound kthreads. Fix it by patching tasks in cond_resched(). There are four different flavors of cond_resched(), depending on the kernel configuration. Hook into all of them. A more elegant solution might be to use a preempt notifier. However, non-ORC unwinders can't unwind a preempted task reliably. [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/ [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org diff ee3e3ac0 Tue Nov 22 13:39:05 MST 2022 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> rseq: Introduce extensible rseq ABI Introduce the extensible rseq ABI, where the feature size supported by the kernel and the required alignment are communicated to user-space through ELF auxiliary vectors. This allows user-space to call rseq registration with a rseq_len of either 32 bytes for the original struct rseq size (which includes padding), or larger. If rseq_len is larger than 32 bytes, then it must be large enough to contain the feature size communicated to user-space through ELF auxiliary vectors. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-4-mathieu.desnoyers@efficios.com diff f80be457 Thu Sep 15 09:03:45 MDT 2022 Alexander Potapenko <glider@google.com> kmsan: add KMSAN runtime core For each memory location KernelMemorySanitizer maintains two types of metadata: 1. The so-called shadow of that location - а byte:byte mapping describing whether or not individual bits of memory are initialized (shadow is 0) or not (shadow is 1). 2. The origins of that location - а 4-byte:4-byte mapping containing 4-byte IDs of the stack traces where uninitialized values were created. Each struct page now contains pointers to two struct pages holding KMSAN metadata (shadow and origins) for the original struct page. Utility routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata creation, addressing, copying and checking. mm/kmsan/report.c performs error reporting in the cases an uninitialized value is used in a way that leads to undefined behavior. KMSAN compiler instrumentation is responsible for tracking the metadata along with the kernel memory. mm/kmsan/instrumentation.c provides the implementation for instrumentation hooks that are called from files compiled with -fsanitize=kernel-memory. To aid parameter passing (also done at instrumentation level), each task_struct now contains a struct kmsan_task_state used to track the metadata of function parameters and return values for that task. Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The KMSAN_SANITIZE:=n Makefile directive can be used to completely disable KMSAN instrumentation for certain files. Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly created stack memory initialized. Users can also use functions from include/linux/kmsan-checks.h to mark certain memory regions as uninitialized or initialized (this is called "poisoning" and "unpoisoning") or check that a particular region is initialized. Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff f80be457 Thu Sep 15 09:03:45 MDT 2022 Alexander Potapenko <glider@google.com> kmsan: add KMSAN runtime core For each memory location KernelMemorySanitizer maintains two types of metadata: 1. The so-called shadow of that location - а byte:byte mapping describing whether or not individual bits of memory are initialized (shadow is 0) or not (shadow is 1). 2. The origins of that location - а 4-byte:4-byte mapping containing 4-byte IDs of the stack traces where uninitialized values were created. Each struct page now contains pointers to two struct pages holding KMSAN metadata (shadow and origins) for the original struct page. Utility routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata creation, addressing, copying and checking. mm/kmsan/report.c performs error reporting in the cases an uninitialized value is used in a way that leads to undefined behavior. KMSAN compiler instrumentation is responsible for tracking the metadata along with the kernel memory. mm/kmsan/instrumentation.c provides the implementation for instrumentation hooks that are called from files compiled with -fsanitize=kernel-memory. To aid parameter passing (also done at instrumentation level), each task_struct now contains a struct kmsan_task_state used to track the metadata of function parameters and return values for that task. Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The KMSAN_SANITIZE:=n Makefile directive can be used to completely disable KMSAN instrumentation for certain files. Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly created stack memory initialized. Users can also use functions from include/linux/kmsan-checks.h to mark certain memory regions as uninitialized or initialized (this is called "poisoning" and "unpoisoning") or check that a particular region is initialized. Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff f80be457 Thu Sep 15 09:03:45 MDT 2022 Alexander Potapenko <glider@google.com> kmsan: add KMSAN runtime core For each memory location KernelMemorySanitizer maintains two types of metadata: 1. The so-called shadow of that location - а byte:byte mapping describing whether or not individual bits of memory are initialized (shadow is 0) or not (shadow is 1). 2. The origins of that location - а 4-byte:4-byte mapping containing 4-byte IDs of the stack traces where uninitialized values were created. Each struct page now contains pointers to two struct pages holding KMSAN metadata (shadow and origins) for the original struct page. Utility routines in mm/kmsan/core.c and mm/kmsan/shadow.c handle the metadata creation, addressing, copying and checking. mm/kmsan/report.c performs error reporting in the cases an uninitialized value is used in a way that leads to undefined behavior. KMSAN compiler instrumentation is responsible for tracking the metadata along with the kernel memory. mm/kmsan/instrumentation.c provides the implementation for instrumentation hooks that are called from files compiled with -fsanitize=kernel-memory. To aid parameter passing (also done at instrumentation level), each task_struct now contains a struct kmsan_task_state used to track the metadata of function parameters and return values for that task. Finally, this patch provides CONFIG_KMSAN that enables KMSAN, and declares CFLAGS_KMSAN, which are applied to files compiled with KMSAN. The KMSAN_SANITIZE:=n Makefile directive can be used to completely disable KMSAN instrumentation for certain files. Similarly, KMSAN_ENABLE_CHECKS:=n disables KMSAN checks and makes newly created stack memory initialized. Users can also use functions from include/linux/kmsan-checks.h to mark certain memory regions as uninitialized or initialized (this is called "poisoning" and "unpoisoning") or check that a particular region is initialized. Link: https://lkml.kernel.org/r/20220915150417.722975-12-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff ec1c86b2 Sun Sep 18 02:00:02 MDT 2022 Yu Zhao <yuzhao@google.com> mm: multi-gen LRU: groundwork Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. [1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff ec1c86b2 Sun Sep 18 02:00:02 MDT 2022 Yu Zhao <yuzhao@google.com> mm: multi-gen LRU: groundwork Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These techniques are commonly used to optimize job scheduling (bin packing) in data centers [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. [1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff bb447999 Tue Jun 21 03:04:10 MDT 2022 Dietmar Eggemann <dietmar.eggemann@arm.com> sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a a compile-time constant (SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com |
/linux-master/fs/nfs/ | ||
H A D | super.c | diff 4e04143c Thu Mar 16 07:07:51 MDT 2023 Ondrej Mosnacek <omosnace@redhat.com> fs_context: drop the unused lsm_flags member This isn't ever used by VFS now, and it couldn't even work. Any FS that uses the SECURITY_LSM_NATIVE_LABELS flag needs to also process the value returned back from the LSM, so it needs to do its security_sb_set_mnt_opts() call on its own anyway. Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> diff e33c267a Tue May 31 21:22:24 MDT 2022 Roman Gushchin <roman.gushchin@linux.dev> mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff e33c267a Tue May 31 21:22:24 MDT 2022 Roman Gushchin <roman.gushchin@linux.dev> mm: shrinkers: provide shrinkers with names Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> diff a6b5a28e Sat Nov 14 11:43:54 MST 2020 Dave Wysochanski <dwysocha@redhat.com> nfs: Convert to new fscache volume/cookie API Change the nfs filesystem to support fscache's indexing rewrite and reenable caching in nfs. The following changes have been made: (1) The fscache_netfs struct is no more, and there's no need to register the filesystem as a whole. (2) The session cookie is now an fscache_volume cookie, allocated with fscache_acquire_volume(). That takes three parameters: a string representing the "volume" in the index, a string naming the cache to use (or NULL) and a u64 that conveys coherency metadata for the volume. For nfs, I've made it render the volume name string as: "nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]" (3) The fscache_cookie_def is no more and needed information is passed directly to fscache_acquire_cookie(). The cache no longer calls back into the filesystem, but rather metadata changes are indicated at other times. fscache_acquire_cookie() is passed the same keying and coherency information as before. (4) fscache_enable/disable_cookie() have been removed. Call fscache_use_cookie() and fscache_unuse_cookie() when a file is opened or closed to prevent a cache file from being culled and to keep resources to hand that are needed to do I/O. If a file is opened for writing, we invalidate it with FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache, thereby making it cease caching until all currently open files are closed. This should give the same behaviour as the uptream code. Making the cache store local modifications isn't straightforward for NFS, so that's left for future patches. (5) fscache_invalidate() now needs to be given uptodate auxiliary data and a file size. It also takes a flag to indicate if this was due to a DIO write. (6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file to which a DIO write is made. (7) Call fscache_note_page_release() from nfs_release_page(). (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for PG_fscache to be cleared. (9) The functions to read and write data to/from the cache are stubbed out pending a conversion to use netfslib. Changes ======= ver #3: - Added missing =n fallback for nfs_fscache_release_file()[1][2]. ver #2: - Use gfpflags_allow_blocking() rather than using flag directly. - fscache_acquire_volume() now returns errors. - Remove NFS_INO_FSCACHE as it's no longer used. - Need to unuse a cookie on file-release, not inode-clear. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Co-developed-by: David Howells <dhowells@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Jeff Layton <jlayton@kernel.org> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna.schumaker@netapp.com> cc: linux-nfs@vger.kernel.org cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1] Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2] Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4 diff ca1a199e Mon Apr 15 18:19:40 MDT 2019 Al Viro <viro@zeniv.linux.org.uk> nfs{,4}: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> diff 4f253e1e Mon May 15 16:18:11 MDT 2017 Jan Kara <jack@suse.cz> nfs: Mark unnecessarily extern functions as static nfs_initialise_sb() and nfs_clone_super() are declared as extern even though they are used only in fs/nfs/super.c. Mark them as static. Also remove explicit 'inline' directive from nfs_initialise_sb() and leave it upto compiler to decide whether inlining is worth it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> diff 9e08ef1a Wed Nov 13 19:00:17 MST 2013 NeilBrown <neilb@suse.de> NFS: correctly report misuse of "migration" mount option. The current test on valid use of the "migration" mount option can never report an error as it will only do so if mnt->version !=4 && mnt->minor_version != 0 (and some other condition), but if that test would succeed, then the previous test has already gone-to out_minorversion_mismatch. So change the && to an || to get correct semantics. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> diff 4d4b69dd Fri Oct 18 13:15:19 MDT 2013 Weston Andros Adamson <dros@netapp.com> NFS: add support for multiple sec= mount options This patch adds support for multiple security options which can be specified using a colon-delimited list of security flavors (the same syntax as nfsd's exports file). This is useful, for instance, when NFSv4.x mounts cross SECINFO boundaries. With this patch a user can use "sec=krb5i,krb5p" to mount a remote filesystem using krb5i, but can still cross into krb5p-only exports. New mounts will try all security options before failing. NFSv4.x SECINFO results will be compared against the sec= flavors to find the first flavor in both lists or if no match is found will return -EPERM. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> diff cdf66442 Tue Jun 05 12:59:54 MDT 2012 Bryan Schumaker <bjschuma@netapp.com> NFS4: Set parsed mount data version to 4 This patch only affects mounting through "-t nfs4" since it doesn't set up an nfs version to use in the mount data. The nfs client was trying to mount using NFS v0, causing either a BUG() or a protocol not supported message. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> diff 3862279a Fri Mar 02 12:06:39 MST 2012 Trond Myklebust <Trond.Myklebust@netapp.com> NFS: Consolidate the parsing of the '-ov4.x' and '-overs=4.x' mount options Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> |
/linux-master/net/ipv4/ | ||
H A D | udp.c | diff 4bb3ba7b Thu Mar 07 07:23:50 MST 2024 Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> udp: fix incorrect parameter validation in the udp_lib_getsockopt() function The 'len' variable can't be negative when assigned the result of 'min_t' because all 'min_t' parameters are cast to unsigned int, and then the minimum one is chosen. To fix the logic, check 'len' as read from 'optlen', where the types of relevant variables are (signed) int. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Signed-off-by: David S. Miller <davem@davemloft.net> diff f796feab Mon Feb 19 16:00:01 MST 2024 Paolo Abeni <pabeni@redhat.com> udp: add local "peek offset enabled" flag We want to re-organize the struct sock layout. The sk_peek_off field location is problematic, as most protocols want it in the RX read area, while UDP wants it on a cacheline different from sk_receive_queue. Create a local (inside udp_sock) copy of the 'peek offset is enabled' flag and place it inside the same cacheline of reader_queue. Check such flag before reading sk_peek_off. This will save potential false sharing and cache misses in the fast-path. Tested under UDP flood with small packets. The struct sock layout update causes a 4% performance drop, and this patch restores completely the original tput. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/67ab679c15fbf49fa05b3ffe05d91c47ab84f147.1708426665.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF) syzbot reported a lockdep violation [1] involving af_unix support of SO_PEEK_OFF. Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket sk_peek_off field), there is really no point to enforce a pointless thread safety in the kernel. After this patch : - setsockopt(SO_PEEK_OFF) no longer acquires the socket lock. - skb_consume_udp() no longer has to acquire the socket lock. - af_unix no longer needs a special version of sk_set_peek_off(), because it does not lock u->iolock anymore. As a followup, we could replace prot->set_peek_off to be a boolean and avoid an indirect call, since we always use sk_set_peek_off(). [1] WARNING: possible circular locking dependency detected 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted syz-executor.2/30025 is trying to acquire lock: ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 but task is already holding lock: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}: lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 lock_sock_nested+0x48/0x100 net/core/sock.c:3524 lock_sock include/net/sock.h:1691 [inline] __unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415 sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046 ____sys_recvmsg+0x3c0/0x470 net/socket.c:2801 ___sys_recvmsg net/socket.c:2845 [inline] do_recvmmsg+0x474/0xae0 net/socket.c:2939 __sys_recvmmsg net/socket.c:3018 [inline] __do_sys_recvmmsg net/socket.c:3041 [inline] __se_sys_recvmmsg net/socket.c:3034 [inline] __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 -> #0 (&u->iolock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_UNIX); lock(&u->iolock); lock(sk_lock-AF_UNIX); lock(&u->iolock); *** DEADLOCK *** 1 lock held by syz-executor.2/30025: #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 stack backtrace: CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f78a1c7dda9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9 RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006 RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000 R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8 Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daan De Meyer <daan.j.demeyer@gmail.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF) syzbot reported a lockdep violation [1] involving af_unix support of SO_PEEK_OFF. Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket sk_peek_off field), there is really no point to enforce a pointless thread safety in the kernel. After this patch : - setsockopt(SO_PEEK_OFF) no longer acquires the socket lock. - skb_consume_udp() no longer has to acquire the socket lock. - af_unix no longer needs a special version of sk_set_peek_off(), because it does not lock u->iolock anymore. As a followup, we could replace prot->set_peek_off to be a boolean and avoid an indirect call, since we always use sk_set_peek_off(). [1] WARNING: possible circular locking dependency detected 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted syz-executor.2/30025 is trying to acquire lock: ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 but task is already holding lock: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}: lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 lock_sock_nested+0x48/0x100 net/core/sock.c:3524 lock_sock include/net/sock.h:1691 [inline] __unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415 sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046 ____sys_recvmsg+0x3c0/0x470 net/socket.c:2801 ___sys_recvmsg net/socket.c:2845 [inline] do_recvmmsg+0x474/0xae0 net/socket.c:2939 __sys_recvmmsg net/socket.c:3018 [inline] __do_sys_recvmmsg net/socket.c:3041 [inline] __se_sys_recvmmsg net/socket.c:3034 [inline] __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 -> #0 (&u->iolock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_UNIX); lock(&u->iolock); lock(sk_lock-AF_UNIX); lock(&u->iolock); *** DEADLOCK *** 1 lock held by syz-executor.2/30025: #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 stack backtrace: CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f78a1c7dda9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9 RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006 RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000 R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8 Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daan De Meyer <daan.j.demeyer@gmail.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF) syzbot reported a lockdep violation [1] involving af_unix support of SO_PEEK_OFF. Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket sk_peek_off field), there is really no point to enforce a pointless thread safety in the kernel. After this patch : - setsockopt(SO_PEEK_OFF) no longer acquires the socket lock. - skb_consume_udp() no longer has to acquire the socket lock. - af_unix no longer needs a special version of sk_set_peek_off(), because it does not lock u->iolock anymore. As a followup, we could replace prot->set_peek_off to be a boolean and avoid an indirect call, since we always use sk_set_peek_off(). [1] WARNING: possible circular locking dependency detected 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted syz-executor.2/30025 is trying to acquire lock: ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 but task is already holding lock: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}: lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 lock_sock_nested+0x48/0x100 net/core/sock.c:3524 lock_sock include/net/sock.h:1691 [inline] __unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415 sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046 ____sys_recvmsg+0x3c0/0x470 net/socket.c:2801 ___sys_recvmsg net/socket.c:2845 [inline] do_recvmmsg+0x474/0xae0 net/socket.c:2939 __sys_recvmmsg net/socket.c:3018 [inline] __do_sys_recvmmsg net/socket.c:3041 [inline] __se_sys_recvmmsg net/socket.c:3034 [inline] __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 -> #0 (&u->iolock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_UNIX); lock(&u->iolock); lock(sk_lock-AF_UNIX); lock(&u->iolock); *** DEADLOCK *** 1 lock held by syz-executor.2/30025: #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 stack backtrace: CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f78a1c7dda9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9 RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006 RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000 R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8 Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daan De Meyer <daan.j.demeyer@gmail.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 56667da7 Mon Feb 19 07:12:20 MST 2024 Eric Dumazet <edumazet@google.com> net: implement lockless setsockopt(SO_PEEK_OFF) syzbot reported a lockdep violation [1] involving af_unix support of SO_PEEK_OFF. Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket sk_peek_off field), there is really no point to enforce a pointless thread safety in the kernel. After this patch : - setsockopt(SO_PEEK_OFF) no longer acquires the socket lock. - skb_consume_udp() no longer has to acquire the socket lock. - af_unix no longer needs a special version of sk_set_peek_off(), because it does not lock u->iolock anymore. As a followup, we could replace prot->set_peek_off to be a boolean and avoid an indirect call, since we always use sk_set_peek_off(). [1] WARNING: possible circular locking dependency detected 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Not tainted syz-executor.2/30025 is trying to acquire lock: ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 but task is already holding lock: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}: lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 lock_sock_nested+0x48/0x100 net/core/sock.c:3524 lock_sock include/net/sock.h:1691 [inline] __unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415 sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046 ____sys_recvmsg+0x3c0/0x470 net/socket.c:2801 ___sys_recvmsg net/socket.c:2845 [inline] do_recvmmsg+0x474/0xae0 net/socket.c:2939 __sys_recvmmsg net/socket.c:3018 [inline] __do_sys_recvmmsg net/socket.c:3041 [inline] __se_sys_recvmmsg net/socket.c:3034 [inline] __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 -> #0 (&u->iolock){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_UNIX); lock(&u->iolock); lock(sk_lock-AF_UNIX); lock(&u->iolock); *** DEADLOCK *** 1 lock held by syz-executor.2/30025: #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline] #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193 stack backtrace: CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e91e2b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869 __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754 __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789 sk_setsockopt+0x207e/0x3360 do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307 __sys_setsockopt+0x1ad/0x250 net/socket.c:2334 __do_sys_setsockopt net/socket.c:2343 [inline] __se_sys_setsockopt net/socket.c:2340 [inline] __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f78a1c7dda9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9 RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006 RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000 R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8 Fixes: 859051dd165e ("bpf: Implement cgroup sockaddr hooks for unix sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daan De Meyer <daan.j.demeyer@gmail.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 482521d8 Fri Jan 12 03:44:27 MST 2024 Eric Dumazet <edumazet@google.com> udp: annotate data-races around up->pending up->pending can be read without holding the socket lock, as pointed out by syzbot [1] Add READ_ONCE() in lockless contexts, and WRITE_ONCE() on write side. [1] BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1: udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596 inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] __sys_sendto+0x257/0x310 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0x78/0x90 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0: udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373 inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586 ___sys_sendmsg net/socket.c:2640 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2726 __do_sys_sendmmsg net/socket.c:2755 [inline] __se_sys_sendmmsg net/socket.c:2752 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b value changed: 0x00000000 -> 0x0000000a Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G W 6.7.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 482521d8 Fri Jan 12 03:44:27 MST 2024 Eric Dumazet <edumazet@google.com> udp: annotate data-races around up->pending up->pending can be read without holding the socket lock, as pointed out by syzbot [1] Add READ_ONCE() in lockless contexts, and WRITE_ONCE() on write side. [1] BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1: udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596 inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] __sys_sendto+0x257/0x310 net/socket.c:2192 __do_sys_sendto net/socket.c:2204 [inline] __se_sys_sendto net/socket.c:2200 [inline] __x64_sys_sendto+0x78/0x90 net/socket.c:2200 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0: udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373 inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586 ___sys_sendmsg net/socket.c:2640 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2726 __do_sys_sendmmsg net/socket.c:2755 [inline] __se_sys_sendmmsg net/socket.c:2752 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b value changed: 0x00000000 -> 0x0000000a Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G W 6.7.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff 0f495f76 Thu Jul 20 09:30:08 MDT 2023 Lorenz Bauer <lmb@isovalent.com> net: remove duplicate reuseport_lookup functions There are currently four copies of reuseport_lookup: one each for (TCP, UDP)x(IPv4, IPv6). This forces us to duplicate all callers of those functions as well. This is already the case for sk_lookup helpers (inet,inet6,udp4,udp6)_lookup_run_bpf. There are two differences between the reuseport_lookup helpers: 1. They call different hash functions depending on protocol 2. UDP reuseport_lookup checks that sk_state != TCP_ESTABLISHED Move the check for sk_state into the caller and use the INDIRECT_CALL infrastructure to cut down the helpers to one per IP version. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-4-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> diff e1d001fa Fri Jun 09 09:27:42 MDT 2023 Breno Leitao <leitao@debian.org> net: ioctl: Use kernel memory on protocol ioctl callbacks Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl callback. This is not flexible, because it is hard to reuse these functions without passing userspace buffers. Change the "struct proto" ioctls to avoid touching userspace memory and operate on kernel buffers, i.e., all protocol's ioctl callbacks is adapted to operate on a kernel memory other than on userspace (so, no more {put,get}_user() and friends being called in the ioctl callback). This changes the "struct proto" ioctl format in the following way: int (*ioctl)(struct sock *sk, int cmd, - unsigned long arg); + int *karg); (Important to say that this patch does not touch the "struct proto_ops" protocols) So, the "karg" argument, which is passed to the ioctl callback, is a pointer allocated to kernel space memory (inside a function wrapper). This buffer (karg) may contain input argument (copied from userspace in a prep function) and it might return a value/buffer, which is copied back to userspace if necessary. There is not one-size-fits-all format (that is I am using 'may' above), but basically, there are three type of ioctls: 1) Do not read from userspace, returns a result to userspace 2) Read an input parameter from userspace, and does not return anything to userspace 3) Read an input from userspace, and return a buffer to userspace. The default case (1) (where no input parameter is given, and an "int" is returned to userspace) encompasses more than 90% of the cases, but there are two other exceptions. Here is a list of exceptions: * Protocol RAW: * cmd = SIOCGETVIFCNT: * input and output = struct sioc_vif_req * cmd = SIOCGETSGCNT * input and output = struct sioc_sg_req * Explanation: for the SIOCGETVIFCNT case, userspace passes the input argument, which is struct sioc_vif_req. Then the callback populates the struct, which is copied back to userspace. * Protocol RAW6: * cmd = SIOCGETMIFCNT_IN6 * input and output = struct sioc_mif_req6 * cmd = SIOCGETSGCNT_IN6 * input and output = struct sioc_sg_req6 * Protocol PHONET: * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE * input int (4 bytes) * Nothing is copied back to userspace. For the exception cases, functions sock_sk_ioctl_inout() will copy the userspace input, and copy it back to kernel space. The wrapper that prepare the buffer and put the buffer back to user is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now calls sk_ioctl(), which will handle all cases. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
Completed in 2093 milliseconds