Searched hist:715 (Results 176 - 200 of 271) sorted by relevance

1234567891011

/linux-master/tools/lib/bpf/
H A Dlibbpf.hdiff 715c5ce4 Wed May 12 17:41:22 MDT 2021 Kumar Kartikeya Dwivedi <memxor@gmail.com> libbpf: Add low level TC-BPF management API

This adds functions that wrap the netlink API used for adding, manipulating,
and removing traffic control filters.

The API summary:

A bpf_tc_hook represents a location where a TC-BPF filter can be attached.
This means that creating a hook leads to creation of the backing qdisc,
while destruction either removes all filters attached to a hook, or destroys
qdisc if requested explicitly (as discussed below).

The TC-BPF API functions operate on this bpf_tc_hook to attach, replace,
query, and detach tc filters. All functions return 0 on success, and a
negative error code on failure.

bpf_tc_hook_create - Create a hook
Parameters:
@hook - Cannot be NULL, ifindex > 0, attach_point must be set to
proper enum constant. Note that parent must be unset when
attach_point is one of BPF_TC_INGRESS or BPF_TC_EGRESS. Note
that as an exception BPF_TC_INGRESS|BPF_TC_EGRESS is also a
valid value for attach_point.

Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

bpf_tc_hook_destroy - Destroy a hook
Parameters:
@hook - Cannot be NULL. The behaviour depends on value of
attach_point. If BPF_TC_INGRESS, all filters attached to
the ingress hook will be detached. If BPF_TC_EGRESS, all
filters attached to the egress hook will be detached. If
BPF_TC_INGRESS|BPF_TC_EGRESS, the clsact qdisc will be
deleted, also detaching all filters. As before, parent must
be unset for these attach_points, and set for BPF_TC_CUSTOM.

It is advised that if the qdisc is operated on by many programs,
then the program at least check that there are no other existing
filters before deleting the clsact qdisc. An example is shown
below:

DECLARE_LIBBPF_OPTS(bpf_tc_hook, .ifindex = if_nametoindex("lo"),
.attach_point = BPF_TC_INGRESS);
/* set opts as NULL, as we're not really interested in
* getting any info for a particular filter, but just
* detecting its presence.
*/
r = bpf_tc_query(&hook, NULL);
if (r == -ENOENT) {
/* no filters */
hook.attach_point = BPF_TC_INGRESS|BPF_TC_EGREESS;
return bpf_tc_hook_destroy(&hook);
} else {
/* failed or r == 0, the latter means filters do exist */
return r;
}

Note that there is a small race between checking for no
filters and deleting the qdisc. This is currently unavoidable.

Returns -EOPNOTSUPP when hook has attach_point as BPF_TC_CUSTOM.

bpf_tc_attach - Attach a filter to a hook
Parameters:
@hook - Cannot be NULL. Represents the hook the filter will be
attached to. Requirements for ifindex and attach_point are
same as described in bpf_tc_hook_create, but BPF_TC_CUSTOM
is also supported. In that case, parent must be set to the
handle where the filter will be attached (using BPF_TC_PARENT).
E.g. to set parent to 1:16 like in tc command line, the
equivalent would be BPF_TC_PARENT(1, 16).

@opts - Cannot be NULL. The following opts are optional:
* handle - The handle of the filter
* priority - The priority of the filter
Must be >= 0 and <= UINT16_MAX
Note that when left unset, they will be auto-allocated by
the kernel. The following opts must be set:
* prog_fd - The fd of the loaded SCHED_CLS prog
The following opts must be unset:
* prog_id - The ID of the BPF prog
The following opts are optional:
* flags - Currently only BPF_TC_F_REPLACE is allowed. It
allows replacing an existing filter instead of
failing with -EEXIST.
The following opts will be filled by bpf_tc_attach on a
successful attach operation if they are unset:
* handle - The handle of the attached filter
* priority - The priority of the attached filter
* prog_id - The ID of the attached SCHED_CLS prog
This way, the user can know what the auto allocated values
for optional opts like handle and priority are for the newly
attached filter, if they were unset.

Note that some other attributes are set to fixed default
values listed below (this holds for all bpf_tc_* APIs):
protocol as ETH_P_ALL, direct action mode, chain index of 0,
and class ID of 0 (this can be set by writing to the
skb->tc_classid field from the BPF program).

bpf_tc_detach
Parameters:
@hook - Cannot be NULL. Represents the hook the filter will be
detached from. Requirements are same as described above
in bpf_tc_attach.

@opts - Cannot be NULL. The following opts must be set:
* handle, priority
The following opts must be unset:
* prog_fd, prog_id, flags

bpf_tc_query
Parameters:
@hook - Cannot be NULL. Represents the hook where the filter lookup will
be performed. Requirements are same as described above in
bpf_tc_attach().

@opts - Cannot be NULL. The following opts must be set:
* handle, priority
The following opts must be unset:
* prog_fd, prog_id, flags
The following fields will be filled by bpf_tc_query upon a
successful lookup:
* prog_id

Some usage examples (using BPF skeleton infrastructure):

BPF program (test_tc_bpf.c):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int cls(struct __sk_buff *skb)
{
return 0;
}

Userspace loader:

struct test_tc_bpf *skel = NULL;
int fd, r;

skel = test_tc_bpf__open_and_load();
if (!skel)
return -ENOMEM;

fd = bpf_program__fd(skel->progs.cls);

DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex =
if_nametoindex("lo"), .attach_point =
BPF_TC_INGRESS);
/* Create clsact qdisc */
r = bpf_tc_hook_create(&hook);
if (r < 0)
goto end;

DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = fd);
r = bpf_tc_attach(&hook, &opts);
if (r < 0)
goto end;
/* Print the auto allocated handle and priority */
printf("Handle=%u", opts.handle);
printf("Priority=%u", opts.priority);

opts.prog_fd = opts.prog_id = 0;
bpf_tc_detach(&hook, &opts);
end:
test_tc_bpf__destroy(skel);

This is equivalent to doing the following using tc command line:
# tc qdisc add dev lo clsact
# tc filter add dev lo ingress bpf obj foo.o sec classifier da
# tc filter del dev lo ingress handle <h> prio <p> bpf
... where the handle and priority can be found using:
# tc filter show dev lo ingress

Another example replacing a filter (extending prior example):

/* We can also choose both (or one), let's try replacing an
* existing filter.
*/
DECLARE_LIBBPF_OPTS(bpf_tc_opts, replace_opts, .handle =
opts.handle, .priority = opts.priority,
.prog_fd = fd);
r = bpf_tc_attach(&hook, &replace_opts);
if (r == -EEXIST) {
/* Expected, now use BPF_TC_F_REPLACE to replace it */
replace_opts.flags = BPF_TC_F_REPLACE;
return bpf_tc_attach(&hook, &replace_opts);
} else if (r < 0) {
return r;
}
/* There must be no existing filter with these
* attributes, so cleanup and return an error.
*/
replace_opts.prog_fd = replace_opts.prog_id = 0;
bpf_tc_detach(&hook, &replace_opts);
return -1;

To obtain info of a particular filter:

/* Find info for filter with handle 1 and priority 50 */
DECLARE_LIBBPF_OPTS(bpf_tc_opts, info_opts, .handle = 1,
.priority = 50);
r = bpf_tc_query(&hook, &info_opts);
if (r == -ENOENT)
printf("Filter not found");
else if (r < 0)
return r;
printf("Prog ID: %u", info_opts.prog_id);
return 0;

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> # libbpf API design
[ Daniel: also did major patch cleanup ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210512103451.989420-3-memxor@gmail.com
diff 715f8db9 Tue Nov 03 04:21:05 MST 2015 Namhyung Kim <namhyung@kernel.org> tools lib bpf: Fix compiler warning on CentOS 6

CC libbpf.o
cc1: warnings being treated as errors
libbpf.c: In function 'bpf_program__title':
libbpf.c:1037: error: declaration of 'dup' shadows a global declaration
/usr/include/unistd.h:528: error: shadowed declaration is here
mv: cannot stat `./.libbpf.o.tmp': No such file or directory
make[3]: *** [libbpf.o] Error 1
make[2]: *** [libbpf-in.o] Error 2
make[1]: *** [/linux/tools/lib/bpf/libbpf.a] Error 2
make[1]: *** Waiting for unfinished jobs....

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1446549665-2342-1-git-send-email-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
/linux-master/kernel/time/
H A Dposix-cpu-timers.cdiff 715eb7a9 Mon Jan 30 20:09:33 MST 2017 Frederic Weisbecker <fweisbec@gmail.com> timers/posix-timers: Use TICK_NSEC instead of a dynamically ad-hoc calculated version

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-18-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
/linux-master/drivers/cpuidle/governors/
H A Dmenu.cdiff 7024b18c Tue Feb 16 12:19:18 MST 2016 Rasmus Villemoes <linux@rasmusvillemoes.dk> cpuidle: menu: avoid expensive square root computation

Computing the integer square root is a rather expensive operation, at
least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64
bit platforms, doing an extra comparison to a constant (variance <=
U64_MAX/36).

On 64 bit platforms, this does mean that we add a restriction on the
range of the variance where we end up using the estimate (since
previously the stddev <= ULONG_MAX was a tautology), but on the other
hand, we extend the range quite substantially on 32 bit platforms - in
both cases, we now allow standard deviations up to 715 seconds, which
is for example guaranteed if all observations are less than 1430
seconds.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
/linux-master/net/sctp/
H A Dinput.cdiff 715f5552 Sat Sep 10 09:11:23 MDT 2016 Xin Long <lucien.xin@gmail.com> sctp: hold the transport before using it in sctp_hash_cmp

Since commit 4f0087812648 ("sctp: apply rhashtable api to send/recv
path"), sctp uses transport rhashtable with .obj_cmpfn sctp_hash_cmp,
in which it compares the members of the transport with the rhashtable
args to check if it's the right transport.

But sctp uses the transport without holding it in sctp_hash_cmp, it can
cause a use-after-free panic. As after it gets transport from hashtable,
another CPU may close the sk and free the asoc. In sctp_association_free,
it frees all the transports, meanwhile, the assoc's refcnt may be reduced
to 0, assoc can be destroyed by sctp_association_destroy.

So after that, transport->assoc is actually an unavailable memory address
in sctp_hash_cmp. Although sctp_hash_cmp is under rcu_read_lock, it still
can not avoid this, as assoc is not freed by RCU.

This patch is to hold the transport before checking it's members with
sctp_transport_hold, in which it checks the refcnt first, holds it if
it's not 0.

Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/drivers/net/ethernet/broadcom/
H A Dbcmsysport.cdiff 715a0227 Sun Jun 19 12:39:08 MDT 2016 Philippe Reynes <tremyfr@gmail.com> net: ethernet: bcmsysport: use phydev from struct net_device

The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phydev in the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/drivers/gpu/drm/qxl/
H A Dqxl_display.cdiff 715a11fa Mon Feb 27 13:43:16 MST 2017 Gabriel Krisman Bertazi <krisman@collabora.co.uk> drm: qxl: Consolidate bo reservation when pinning

Every attempt to pin/unpin objects in memory requires
qxl_bo_reserve/unreserve calls around the pinning operation to protect
the object from concurrent access, which causes that call sequence to be
reproduced every place where pinning is needed. In some cases, that
sequence was not executed correctly, resulting in potential unprotected
pinning operations.

This commit encapsulates the reservation inside a new wrapper to make
sure it is always handled properly. In cases where reservation must be
done beforehand, for some reason, one can use the unprotected version
__qxl_bo_pin/unpin.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Reviewed-by: Gustavo Padovan <gustavo.padovan@collabora.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170227204328.18761-3-krisman@collabora.co.uk
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
/linux-master/net/netfilter/
H A Dx_tables.cdiff 715cf35a Thu Jan 31 05:49:16 MST 2008 Alexey Dobriyan <adobriyan@sw.ru> [NETFILTER]: x_tables: netns propagation for /proc/net/*_tables_names

Propagate netns together with AF down to ->start/->next/->stop
iterators. Choose table based on netns and AF for showing.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/net/sched/
H A Dcls_u32.cdiff 715df5ec Wed Jan 24 01:54:13 MST 2018 Jakub Kicinski <kuba@kernel.org> net: sched: propagate extack to cls->destroy callbacks

Propagate extack to cls->destroy callbacks when called from
non-error paths. On error paths pass NULL to avoid overwriting
the failure message.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/kernel/locking/
H A Drtmutex.cdiff 715f7f9e Sun Aug 15 15:28:30 MDT 2021 Peter Zijlstra <peterz@infradead.org> locking/rtmutex: Squash !RT tasks to DEFAULT_PRIO

Ensure all !RT tasks have the same prio such that they end up in FIFO
order and aren't split up according to nice level.

The reason why nice levels were taken into account so far is historical. In
the early days of the rtmutex code it was done to give the PI boosting and
deboosting a larger coverage.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211303.938676930@linutronix.de
H A Drwsem.cdiff 5cfd92e1 Mon May 20 14:59:14 MDT 2019 Waiman Long <longman@redhat.com> locking/rwsem: Adaptive disabling of reader optimistic spinning

Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and set the nospinnable bits, there are two main
reasons for that.

1) The reader critical section is long, perhaps the task sleeps after
acquiring the read lock.
2) There are just too many readers contending the lock causing it to
take a while to service all of them.

In the former case, long reader critical section will impede the progress
of writers which is usually more important for system performance.
In the later case, reader optimistic spinning tends to make the reader
groups that contain readers that acquire the lock together smaller
leading to more of them. That may hurt performance in some cases. In
other words, the setting of nonspinnable bits indicates that reader
optimistic spinning may not be helpful for those workloads that cause it.

Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.

Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinable bit.

This is to discourage reader optmistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effect of
this feature.

In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.

On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various number of threads. The
number of operations done before reader optimistic spinning patches,
this patch and after this patch were:

Threads Before rspin Before patch After patch %change
------- ------------ ------------ ----------- -------
20 5541068 5345484 5455667 -3.5%/ +2.1%
40 10185150 7292313 9219276 -28.5%/+26.4%
60 8196733 6460517 7181209 -21.2%/+11.2%
80 9508864 6739559 8107025 -29.1%/+20.3%

This patch doesn't recover all the lost performance, but it is more
than half. Given the fact that reader optimistic spinning does benefit
some workloads, this is a good compromise.

Using the rwsem locking microbenchmark with very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:

# of Threads Pre-patch Post-patch
------------ --------- ----------
2 4,730 4,969
4 4,814 4,786
8 4,866 4,815
16 4,715 4,511
32 3,338 3,500
64 3,212 3,389
80 3,110 3,044

When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.

Before patch:

40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

After patch:

40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
/linux-master/drivers/pinctrl/
H A DMakefilediff 3b588e43 Tue Aug 07 15:25:26 MDT 2018 Tomer Maimon <tmaimon77@gmail.com> pinctrl: nuvoton: add NPCM7xx pinctrl and GPIO driver

Add Nuvoton BMC NPCM750/730/715/705 Pinmux and
GPIO controller driver.

Signed-off-by: Tomer Maimon <tmaimon77@gmail.com>
[Add back select GPIO_GENERIC]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
H A DKconfigdiff 3b588e43 Tue Aug 07 15:25:26 MDT 2018 Tomer Maimon <tmaimon77@gmail.com> pinctrl: nuvoton: add NPCM7xx pinctrl and GPIO driver

Add Nuvoton BMC NPCM750/730/715/705 Pinmux and
GPIO controller driver.

Signed-off-by: Tomer Maimon <tmaimon77@gmail.com>
[Add back select GPIO_GENERIC]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
/linux-master/drivers/net/ethernet/freescale/enetc/
H A Denetc.cdiff 715bf261 Tue Sep 27 15:52:02 MDT 2022 Vladimir Oltean <vladimir.oltean@nxp.com> net: enetc: cache accesses to &priv->si->hw

The &priv->si->hw construct dereferences 2 pointers and makes lines
longer than they need to be, in turn making the code harder to read.

Replace &priv->si->hw accesses with a "hw" variable when there are 2 or
more accesses within a function that dereference this. This includes
loops, since &priv->si->hw is a loop invariant.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
/linux-master/include/linux/
H A Dkprobes.hdiff ef53d9c5 Fri Jul 25 02:46:04 MDT 2008 Srinivasa D S <srinivasa@in.ibm.com> kprobes: improve kretprobe scalability with hashed locking

Currently list of kretprobe instances are stored in kretprobe object (as
used_instances,free_instances) and in kretprobe hash table. We have one
global kretprobe lock to serialise the access to these lists. This causes
only one kretprobe handler to execute at a time. Hence affects system
performance, particularly on SMP systems and when return probe is set on
lot of functions (like on all systemcalls).

Solution proposed here gives fine-grain locks that performs better on SMP
system compared to present kretprobe implementation.

Solution:

1) Instead of having one global lock to protect kretprobe instances
present in kretprobe object and kretprobe hash table. We will have
two locks, one lock for protecting kretprobe hash table and another
lock for kretporbe object.

2) We hold lock present in kretprobe object while we modify kretprobe
instance in kretprobe object and we hold per-hash-list lock while
modifying kretprobe instances present in that hash list. To prevent
deadlock, we never grab a per-hash-list lock while holding a kretprobe
lock.

3) We can remove used_instances from struct kretprobe, as we can
track used instances of kretprobe instances using kretprobe hash
table.

Time duration for kernel compilation ("make -j 8") on a 8-way ppc64 system
with return probes set on all systemcalls looks like this.

cacheline non-cacheline Un-patched kernel
aligned patch aligned patch
===============================================================================
real 9m46.784s 9m54.412s 10m2.450s
user 40m5.715s 40m7.142s 40m4.273s
sys 2m57.754s 2m58.583s 3m17.430s
===========================================================

Time duration for kernel compilation ("make -j 8) on the same system, when
kernel is not probed.
=========================
real 9m26.389s
user 40m8.775s
sys 2m7.283s
=========================

Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
H A Dmigrate.hdiff 715cbfd6 Fri May 07 13:05:06 MDT 2021 Matthew Wilcox (Oracle) <willy@infradead.org> mm/migrate: Add folio_migrate_copy()

This is the folio equivalent of migrate_page_copy(), which is retained
as a wrapper for filesystems which are not yet converted to folios.
Also convert copy_huge_page() to folio_copy().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
/linux-master/arch/x86/entry/syscalls/
H A Dsyscall_64.tbldiff ff388fe5 Mon Apr 15 10:35:20 MDT 2024 Jeff Xu <jeffxu@chromium.org> mseal: wire up mseal syscall

Patch series "Introduce mseal", v10.

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory range
against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW) and
no-execute (NX) bits. Linux has supported NX since the release of kernel
version 2.6.8 in August 2004 [1]. The memory permission feature improves
the security stance on memory corruption bugs, as an attacker cannot
simply write to arbitrary memory and point the code to it. The memory
must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects the
VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime. A
similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
[4]. Also, Chrome wants to adopt this feature for their CFI work [2] and
this patchset has been designed to be compatible with the Chrome use case.

Two system calls are involved in sealing the map: mmap() and mseal().

The new mseal() is an syscall on 64 bit CPU, and with following signature:

int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space, therefore can
be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in the
case of libc, sealing is only applied to read-only (RO) or read-execute
(RX) memory segments (such as .text and .RELRO) to prevent them from
becoming writable, the lifetime of those mappings are tied to the lifetime
of the process.

Chrome wants to seal two large address space reservations that are managed
by different allocators. The memory is mapped RW- and RWX respectively
but write access to it is restricted using pkeys (or in the future ARM
permission overlay extensions). The lifetime of those mappings are not
tied to the lifetime of the process, therefore, while the memory is
sealed, the allocators still need to free or discard the unused memory.
For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a security
risk. For example if a jump instruction crosses a page boundary and the
second page gets discarded, it will overwrite the target bytes with zeros
and change the control flow. Checking write-permission before the discard
operation allows us to control when the operation is valid. In this case,
the madvise will only succeed if the executing thread has PKEY write
permissions and PKRU changes are protected in software by control-flow
integrity.

Although the initial version of this patch series is targeting the Chrome
browser as its first user, it became evident during upstream discussions
that we would also want to ensure that the patch set eventually is a
complete solution for memory sealing and compatible with other use cases.
The specific scenario currently in mind is glibc's use case of loading and
sealing ELF executables. To this end, Stephen is working on a change to
glibc to add sealing support to the dynamic linker, which will seal all
non-writable segments at startup. Once this work is completed, all
applications will be able to automatically benefit from these new
protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental in
shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.

MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made,
when any segment within the given memory range is sealed.

To measure the performance impact of this loop, two tests are developed.
[8]

The first is measuring the time taken for a particular system call,
by using clock_gettime(CLOCK_MONOTONIC). The second is using
PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
similar results.

The tests have roughly below sequence:
for (i = 0; i < 1000, i++)
create 1000 mappings (1 page per VMA)
start the sampling
for (j = 0; j < 1000, j++)
mprotect one mapping
stop and save the sample
delete 1000 mappings
calculates all samples.

Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory, Chromebook.

Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%

The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%

Based on the result, for 6.8 kernel, sealing check adds
20-40 nano seconds, or around 50-100 CPU cycles, per VMA.

In addition, I applied the sealing to 5.10 kernel:
The first test (measuring time)
syscall__ vmas t tmseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%

The second test (measuring cpu cycle)
syscall__ nbr_vma cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%

For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30
CPU cycles, there is even decrease in some cases.

It might be interesting to compare 5.10 and 6.8 kernel
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%

The second test (measuring cpu cycle)
syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%

From 5.10 to 6.8
munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma.
mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma.
madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma.

In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten times
greater for munmap and mprotect.

When I discuss the mm performance with Brian Makin, an engineer who worked
on performance, it was brought to my attention that such performance
benchmarks, which measuring millions of mm syscall in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service. Also this is tested using a single HW and ChromeOS, the data
from another HW or distribution might be different. It might be best to
take this data with a grain of salt.


This patch (of 5):

Wire up mseal syscall for all architectures.

Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Jann Horn <jannh@google.com> [Bug #2]
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
/linux-master/include/uapi/asm-generic/
H A Dunistd.hdiff ff388fe5 Mon Apr 15 10:35:20 MDT 2024 Jeff Xu <jeffxu@chromium.org> mseal: wire up mseal syscall

Patch series "Introduce mseal", v10.

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory range
against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW) and
no-execute (NX) bits. Linux has supported NX since the release of kernel
version 2.6.8 in August 2004 [1]. The memory permission feature improves
the security stance on memory corruption bugs, as an attacker cannot
simply write to arbitrary memory and point the code to it. The memory
must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects the
VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime. A
similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
[4]. Also, Chrome wants to adopt this feature for their CFI work [2] and
this patchset has been designed to be compatible with the Chrome use case.

Two system calls are involved in sealing the map: mmap() and mseal().

The new mseal() is an syscall on 64 bit CPU, and with following signature:

int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space, therefore can
be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in the
case of libc, sealing is only applied to read-only (RO) or read-execute
(RX) memory segments (such as .text and .RELRO) to prevent them from
becoming writable, the lifetime of those mappings are tied to the lifetime
of the process.

Chrome wants to seal two large address space reservations that are managed
by different allocators. The memory is mapped RW- and RWX respectively
but write access to it is restricted using pkeys (or in the future ARM
permission overlay extensions). The lifetime of those mappings are not
tied to the lifetime of the process, therefore, while the memory is
sealed, the allocators still need to free or discard the unused memory.
For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a security
risk. For example if a jump instruction crosses a page boundary and the
second page gets discarded, it will overwrite the target bytes with zeros
and change the control flow. Checking write-permission before the discard
operation allows us to control when the operation is valid. In this case,
the madvise will only succeed if the executing thread has PKEY write
permissions and PKRU changes are protected in software by control-flow
integrity.

Although the initial version of this patch series is targeting the Chrome
browser as its first user, it became evident during upstream discussions
that we would also want to ensure that the patch set eventually is a
complete solution for memory sealing and compatible with other use cases.
The specific scenario currently in mind is glibc's use case of loading and
sealing ELF executables. To this end, Stephen is working on a change to
glibc to add sealing support to the dynamic linker, which will seal all
non-writable segments at startup. Once this work is completed, all
applications will be able to automatically benefit from these new
protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental in
shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.

MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made,
when any segment within the given memory range is sealed.

To measure the performance impact of this loop, two tests are developed.
[8]

The first is measuring the time taken for a particular system call,
by using clock_gettime(CLOCK_MONOTONIC). The second is using
PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
similar results.

The tests have roughly below sequence:
for (i = 0; i < 1000, i++)
create 1000 mappings (1 page per VMA)
start the sampling
for (j = 0; j < 1000, j++)
mprotect one mapping
stop and save the sample
delete 1000 mappings
calculates all samples.

Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory, Chromebook.

Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%

The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%

Based on the result, for 6.8 kernel, sealing check adds
20-40 nano seconds, or around 50-100 CPU cycles, per VMA.

In addition, I applied the sealing to 5.10 kernel:
The first test (measuring time)
syscall__ vmas t tmseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%

The second test (measuring cpu cycle)
syscall__ nbr_vma cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%

For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30
CPU cycles, there is even decrease in some cases.

It might be interesting to compare 5.10 and 6.8 kernel
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%

The second test (measuring cpu cycle)
syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%

From 5.10 to 6.8
munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma.
mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma.
madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma.

In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten times
greater for munmap and mprotect.

When I discuss the mm performance with Brian Makin, an engineer who worked
on performance, it was brought to my attention that such performance
benchmarks, which measuring millions of mm syscall in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service. Also this is tested using a single HW and ChromeOS, the data
from another HW or distribution might be different. It might be best to
take this data with a grain of salt.


This patch (of 5):

Wire up mseal syscall for all architectures.

Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Jann Horn <jannh@google.com> [Bug #2]
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
/linux-master/sound/usb/
H A Dpcm.cdiff 715a1705 Tue Sep 18 10:49:46 MDT 2012 Dylan Reid <dgreid@chromium.org> ALSA: usb-audio: set period_bytes in substream.

Set the peiod_bytes member of snd_usb_substream. It was no longer being
set, but will be needed to resume properly in a future commit.

Signed-off-by: Dylan Reid <dgreid@chromium.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
/linux-master/mm/kasan/
H A Dreport.cdiff c068664c Thu Mar 24 19:13:06 MDT 2022 Andrey Konovalov <andreyknvl@gmail.com> kasan: respect KASAN_BIT_REPORTED in all reporting routines

Currently, only kasan_report() checks the KASAN_BIT_REPORTED and
KASAN_BIT_MULTI_SHOT flags.

Make other reporting routines check these flags as well.

Also add explanatory comments.

Note that the current->kasan_depth check is split out into
report_suppressed() and only called for kasan_report().

Link: https://lkml.kernel.org/r/715e346b10b398e29ba1b425299dcd79e29d58ce.1646237226.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/tools/perf/util/
H A Dstat-shadow.cdiff 6859bc0e Mon Mar 15 08:30:47 MDT 2021 Changbin Du <changbin.du@intel.com> perf stat: Improve readability of shadow stats

This adds function convert_unit_double() and selects appropriate
unit for shadow stats between K/M/G.

$ sudo perf stat -a -- sleep 1

Before: Unit 'M' is selected even the number is very small.

Performance counter stats for 'system wide':

4,003.06 msec cpu-clock # 3.998 CPUs utilized
16,179 context-switches # 0.004 M/sec
161 cpu-migrations # 0.040 K/sec
4,699 page-faults # 0.001 M/sec
6,135,801,925 cycles # 1.533 GHz (83.21%)
5,783,308,491 stalled-cycles-frontend # 94.26% frontend cycles idle (83.21%)
4,543,694,050 stalled-cycles-backend # 74.05% backend cycles idle (66.49%)
4,720,130,587 instructions # 0.77 insn per cycle
# 1.23 stalled cycles per insn (83.28%)
753,848,078 branches # 188.318 M/sec (83.61%)
37,457,747 branch-misses # 4.97% of all branches (83.48%)

1.001283725 seconds time elapsed

After:

$ sudo perf stat -a -- sleep 2

Performance counter stats for 'system wide':

8,005.52 msec cpu-clock # 3.999 CPUs utilized
10,715 context-switches # 1.338 K/sec
785 cpu-migrations # 98.057 /sec
102 page-faults # 12.741 /sec
1,948,202,279 cycles # 0.243 GHz
2,816,470,932 stalled-cycles-frontend # 144.57% frontend cycles idle
2,661,172,207 stalled-cycles-backend # 136.60% backend cycles idle
464,172,105 instructions # 0.24 insn per cycle
# 6.07 stalled cycles per insn
91,567,662 branches # 11.438 M/sec
7,756,054 branch-misses # 8.47% of all branches

2.002040043 seconds time elapsed

v2:
o do not change 'sec' to 'cpu-sec'.
o use convert_unit_double to implement convert_unit.

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210315143047.3867-1-changbin.du@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
/linux-master/net/ipv4/
H A Draw.cdiff 715b49ef Wed Jan 18 18:44:07 MST 2006 Alan Cox <alan@lxorguk.ukuu.org.uk> [PATCH] EDAC: atomic scrub operations

EDAC requires a way to scrub memory if an ECC error is found and the chipset
does not do the work automatically. That means rewriting memory locations
atomically with respect to all CPUs _and_ bus masters. That means we can't
use atomic_add(foo, 0) as it gets optimised for non-SMP

This adds a function to include/asm-foo/atomic.h for the platforms currently
supported which implements a scrub of a mapped block.

It also adjusts a few other files include order where atomic.h is included
before types.h as this now causes an error as atomic_scrub uses u32.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/linux-master/drivers/gpu/drm/
H A Ddrm_connector.cdiff f85d9e59 Sun Oct 10 16:44:59 MDT 2021 Randy Dunlap <rdunlap@infradead.org> drm/connector: fix all kernel-doc warnings

Clean up all of the kernel-doc issues in drm_connector.c:

drivers/gpu/drm/drm_connector.c:2611: warning: Excess function parameter 'connector' description in 'drm_connector_oob_hotplug_event'
drivers/gpu/drm/drm_connector.c:2611: warning: Function parameter or member 'connector_fwnode' not described in 'drm_connector_oob_hotplug_event'

drm_connector.c:630: warning: No description found for return value of 'drm_get_connector_status_name'
drm_connector.c:715: warning: No description found for return value of 'drm_connector_list_iter_next'
drm_connector.c:785: warning: No description found for return value of 'drm_get_subpixel_order_name'
drm_connector.c:816: warning: No description found for return value of 'drm_display_info_set_bus_formats'
drm_connector.c:1331: warning: No description found for return value of 'drm_mode_create_dvi_i_properties'
drm_connector.c:1412: warning: No description found for return value of 'drm_connector_attach_content_type_property'
drm_connector.c:1492: warning: No description found for return value of 'drm_mode_create_tv_margin_properties'
drm_connector.c:1534: warning: No description found for return value of 'drm_mode_create_tv_properties'
drm_connector.c:1627: warning: No description found for return value of 'drm_mode_create_scaling_mode_property'
drm_connector.c:1944: warning: No description found for return value of 'drm_mode_create_suggested_offset_properties'

drm_connector.c:2315: warning: missing initial short description on line:
* drm_connector_set_panel_orientation_with_quirk -

[The last warning listed is probably a quirk/bug in scripts/kernel-doc.]

Fixes: 613051dac40d ("drm: locking&new iterators for connector_list")
Fixes: 522171951761 ("drm: Extract drm_connector.[hc]")
Fixes: b3c6c8bfe378 ("drm: document drm_display_info")
Fixes: 50525c332b55 ("drm: content-type property for HDMI connector")
Fixes: 6c4f52dca36f ("drm/connector: Allow creation of margin props alone")
Fixes: 69654c632d80 ("drm/connector: Split out orientation quirk detection (v2)")
Fixes: 72ad49682dde ("drm/connector: Add support for out-of-band hotplug notification (v3)")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Derek Basehore <dbasehore@chromium.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20211010224459.3603-1-rdunlap@infradead.org
/linux-master/net/bpf/
H A Dtest_run.cdiff a6763080 Wed Feb 02 13:53:20 MST 2022 Lorenzo Bianconi <lorenzo@kernel.org> bpf: test_run: Fix OOB access in bpf_prog_test_run_xdp

Fix the following kasan issue reported by syzbot:

BUG: KASAN: slab-out-of-bounds in __skb_frag_set_page include/linux/skbuff.h:3242 [inline]
BUG: KASAN: slab-out-of-bounds in bpf_prog_test_run_xdp+0x10ac/0x1150 net/bpf/test_run.c:972
Write of size 8 at addr ffff888048c75000 by task syz-executor.5/23405

CPU: 1 PID: 23405 Comm: syz-executor.5 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
__skb_frag_set_page include/linux/skbuff.h:3242 [inline]
bpf_prog_test_run_xdp+0x10ac/0x1150 net/bpf/test_run.c:972
bpf_prog_test_run kernel/bpf/syscall.c:3356 [inline]
__sys_bpf+0x1858/0x59a0 kernel/bpf/syscall.c:4658
__do_sys_bpf kernel/bpf/syscall.c:4744 [inline]
__se_sys_bpf kernel/bpf/syscall.c:4742 [inline]
__x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:4742
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4ea30dd059
RSP: 002b:00007f4ea1a52168 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007f4ea31eff60 RCX: 00007f4ea30dd059
RDX: 0000000000000048 RSI: 0000000020000000 RDI: 000000000000000a
RBP: 00007f4ea313708d R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc8367c5af R14: 00007f4ea1a52300 R15: 0000000000022000
</TASK>

Allocated by task 23405:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:437 [inline]
____kasan_kmalloc mm/kasan/common.c:516 [inline]
____kasan_kmalloc mm/kasan/common.c:475 [inline]
__kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:525
kmalloc include/linux/slab.h:586 [inline]
kzalloc include/linux/slab.h:715 [inline]
bpf_test_init.isra.0+0x9f/0x150 net/bpf/test_run.c:411
bpf_prog_test_run_xdp+0x2f8/0x1150 net/bpf/test_run.c:941
bpf_prog_test_run kernel/bpf/syscall.c:3356 [inline]
__sys_bpf+0x1858/0x59a0 kernel/bpf/syscall.c:4658
__do_sys_bpf kernel/bpf/syscall.c:4744 [inline]
__se_sys_bpf kernel/bpf/syscall.c:4742 [inline]
__x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:4742
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff888048c74000
which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 0 bytes to the right of
4096-byte region [ffff888048c74000, ffff888048c75000)
The buggy address belongs to the page:
page:ffffea0001231c00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x48c70
head:ffffea0001231c00 order:3 compound_mapcount:0 compound_pincount:0
flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 dead000000000100 dead000000000122 ffff888010c42140
raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
prep_new_page mm/page_alloc.c:2434 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
alloc_slab_page mm/slub.c:1799 [inline]
allocate_slab mm/slub.c:1944 [inline]
new_slab+0x28a/0x3b0 mm/slub.c:2004
___slab_alloc+0x87c/0xe90 mm/slub.c:3018
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
slab_alloc_node mm/slub.c:3196 [inline]
__kmalloc_node_track_caller+0x2cb/0x360 mm/slub.c:4957
kmalloc_reserve net/core/skbuff.c:354 [inline]
__alloc_skb+0xde/0x340 net/core/skbuff.c:426
alloc_skb include/linux/skbuff.h:1159 [inline]
nsim_dev_trap_skb_build drivers/net/netdevsim/dev.c:745 [inline]
nsim_dev_trap_report drivers/net/netdevsim/dev.c:802 [inline]
nsim_dev_trap_report_work+0x29a/0xbc0 drivers/net/netdevsim/dev.c:843
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1352 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
free_unref_page_prepare mm/page_alloc.c:3325 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3404
qlink_free mm/kasan/quarantine.c:157 [inline]
qlist_free_all+0x6d/0x160 mm/kasan/quarantine.c:176
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:283
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:447
kasan_slab_alloc include/linux/kasan.h:260 [inline]
slab_post_alloc_hook mm/slab.h:732 [inline]
slab_alloc_node mm/slub.c:3230 [inline]
slab_alloc mm/slub.c:3238 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
getname_flags.part.0+0x50/0x4f0 fs/namei.c:138
getname_flags include/linux/audit.h:323 [inline]
getname+0x8e/0xd0 fs/namei.c:217
do_sys_openat2+0xf5/0x4d0 fs/open.c:1208
do_sys_open fs/open.c:1230 [inline]
__do_sys_openat fs/open.c:1246 [inline]
__se_sys_openat fs/open.c:1241 [inline]
__x64_sys_openat+0x13f/0x1f0 fs/open.c:1241
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Memory state around the buggy address:
ffff888048c74f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888048c74f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
ffff888048c75080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888048c75100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

Fixes: 1c19499825246 ("bpf: introduce frags support to bpf_prog_test_run_xdp()")
Reported-by: syzbot+6d70ca7438345077c549@syzkaller.appspotmail.com
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/688c26f9dd6e885e58e8e834ede3f0139bb7fa95.1643835097.git.lorenzo@kernel.org
/linux-master/net/core/
H A Dnet_namespace.cdiff 6af2d5ff Thu Dec 01 18:21:32 MST 2016 Alexey Dobriyan <adobriyan@gmail.com> netns: fix net_generic() "id - 1" bloat

net_generic() function is both a) inline and b) used ~600 times.

It has the following code inside

...
ptr = ng->ptr[id - 1];
...

"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.

We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.

Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.

Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function old new delta
tipc_nametbl_insert_publ 1250 1270 +20
nlmclnt_lookup_host 686 703 +17
nfsd4_encode_fattr 5930 5941 +11
nfs_get_client 1050 1061 +11
register_pernet_operations 333 342 +9
tcf_mirred_init 843 849 +6
tcf_bpf_init 1143 1149 +6
gss_setup_upcall 990 994 +4
idmap_name_to_id 432 434 +2
ops_init 274 275 +1
nfsd_inject_forget_client 259 260 +1
nfs4_alloc_client 612 613 +1
tunnel_key_walker 164 163 -1

...

tipc_bcbase_select_primary 392 360 -32
mac80211_hwsim_new_radio 2808 2767 -41
ipip6_tunnel_ioctl 2228 2186 -42
tipc_bcast_rcv 715 672 -43
tipc_link_build_proto_msg 1140 1089 -51
nfsd4_lock 3851 3796 -55
tipc_mon_rcv 1012 956 -56
Total: Before=156643951, After=156639743, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/linux-master/drivers/vhost/
H A Dnet.cdiff db688c24 Tue Apr 24 02:34:36 MDT 2018 Paolo Abeni <pabeni@redhat.com> vhost_net: use packet weight for rx handler, too

Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
tx polling to 2 * vq size"), we need a packet-based limit for
handler_rx, too - elsewhere, under rx flood with small packets,
tx can be delayed for a very long time, even without busypolling.

The pkt limit applied to handle_rx must be the same applied by
handle_tx, or we will get unfair scheduling between rx and tx.
Tying such limit to the queue length makes it less effective for
large queue length values and can introduce large process
scheduler latencies, so a constant valued is used - likewise
the existing bytes limit.

The selected limit has been validated with PVP[1] performance
test with different queue sizes:

queue size 256 512 1024

baseline 366 354 362
weight 128 715 723 670
weight 256 740 745 733
weight 512 600 460 583
weight 1024 423 427 418

A packet weight of 256 gives peek performances in under all the
tested scenarios.

No measurable regression in unidirectional performance tests has
been detected.

[1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Completed in 1107 milliseconds

1234567891011