#
f6e0a498 |
|
14-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: move dev->state into net_device_read_txrx group dev->state can be read in rx and tx fast paths. netif_running() which needs dev->state is called from - enqueue_to_backlog() [RX path] - __dev_direct_xmit() [TX path] Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240314200845.3050179-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
e5b7aefe |
|
08-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: gro: move two declarations to include/net/gro.h Move gro_find_receive_by_type() and gro_find_complete_by_type() to include/net/gro.h where they belong. Also use _NET_GRO_H instead of _NET_IPV6_GRO_H to protect include/net/gro.h from multiple inclusions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240308102230.296224-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
ab63a238 |
|
06-Mar-2024 |
Jakub Kicinski <kuba@kernel.org> |
netdev: add per-queue statistics The ethtool-nl family does a good job exposing various protocol related and IEEE/IETF statistics which used to get dumped under ethtool -S, with creative names. Queue stats don't have a netlink API, yet, and remain a lion's share of ethtool -S output for new drivers. Not only is that bad because the names differ driver to driver but it's also bug-prone. Intuitively drivers try to report only the stats for active queues, but querying ethtool stats involves multiple system calls, and the number of stats is read separately from the stats themselves. Worse still when user space asks for values of the stats, it doesn't inform the kernel how big the buffer is. If number of stats increases in the meantime kernel will overflow user buffer. Add a netlink API for dumping queue stats. Queue information is exposed via the netdev-genl family, so add the stats there. Support per-queue and sum-for-device dumps. Latter will be useful when subsequent patches add more interesting common stats than just bytes and packets. The API does not currently distinguish between HW and SW stats. The expectation is that the source of the stats will either not matter much (good packets) or be obvious (skb alloc errors). Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/20240306195509.1502746-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
490a79fa |
|
06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: introduce include/net/rps.h Move RPS related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
71c0de9b |
|
06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: move dev_rx_weight to net_hotdata dev_rx_weight is read from process_backlog(). Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
26722dc7 |
|
06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: move dev_tx_weight to net_hotdata dev_tx_weight is used in tx fast path. Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
edbc666c |
|
06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: move netdev_max_backlog to net_hotdata netdev_max_backlog is used in rx fat path. Move it to net_hodata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0b91fa4b |
|
06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: move ptype_all into net_hotdata ptype_all is used in rx/tx fast paths. Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2658b5a8 |
|
06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: introduce struct net_hotdata Instead of spreading networking critical fields all over the places, add a custom net_hotdata structure so that we can precisely control its layout. In this first patch, move : - gro_normal_batch used in rx (GRO stack) - offload_base used in rx and tx (GRO and TSO stacks) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8afc7a78 |
|
22-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
ipv6: prepare inet6_fill_ifinfo() for RCU protection We want to use RCU protection instead of RTNL for inet6_fill_ifinfo(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b3ef46c |
|
12-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
net: remove dev_base_lock dev_base_lock is not needed anymore, all remaining users also hold RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6a2968ee |
|
12-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
net: add netdev_set_operstate() helper dev_base_lock is going away, add netdev_set_operstate() helper so that hsr does not have to know core internals. Remove dev_base_lock acquisition from rfc2863_policy() v3: use an "unsigned int" for dev->operstate, so that try_cmpxchg() can work on all arches. ( https://lore.kernel.org/oe-kbuild-all/202402081918.OLyGaea3-lkp@intel.com/ ) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4d42b37d |
|
12-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
net: convert dev->reg_state to u8 Prepares things so that dev->reg_state reads can be lockless, by adding WRITE_ONCE() on write side. READ_ONCE()/WRITE_ONCE() do not support bitfields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4d2bb0bf |
|
12-Feb-2024 |
Lorenzo Bianconi <lorenzo@kernel.org> |
xdp: rely on skb pointer reference in do_xdp_generic and netif_receive_generic_xdp Rely on skb pointer reference instead of the skb pointer in do_xdp_generic and netif_receive_generic_xdp routine signatures. This is a preliminary patch to add multi-buff support for xdp running in generic mode where we will need to reallocate the skb to avoid linearization and we will need to make it visible to do_xdp_generic() caller. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/c09415b1f48c8620ef4d76deed35050a7bddf7c2.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d160c66c |
|
04-Feb-2024 |
Amit Cohen <amcohen@nvidia.com> |
net: Do not return value from init_dummy_netdev() init_dummy_netdev() always returns zero and all the callers do not check the returned value. Set the function to not return value, as it is not really used today. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240205103022.440946-1-amcohen@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fd2bc419 |
|
03-Oct-2023 |
Leon Romanovsky <leon@kernel.org> |
xfrm: generalize xdo_dev_state_update_curlft to allow statistics update In order to allow drivers to fill all statistics, change the name of xdo_dev_state_update_curlft to be xdo_dev_state_update_stats. Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
289e9225 |
|
04-Mar-2024 |
Jakub Kicinski <kuba@kernel.org> |
dpll: move all dpll<>netdev helpers to dpll code Older versions of GCC really want to know the full definition of the type involved in rcu_assign_pointer(). struct dpll_pin is defined in a local header, net/core can't reach it. Move all the netdev <> dpll code into dpll, where the type is known. Otherwise we'd need multiple function calls to jump between the compilation units. This is the same problem the commit under fixes was trying to address, but with rcu_assign_pointer() not rcu_dereference(). Some of the exports are not needed, networking core can't be a module, we only need exports for the helpers used by drivers. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/all/35a869c8-52e8-177-1d4d-e57578b99b6@linux-m68k.org/ Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240305013532.694866-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
eb2c11b2 |
|
28-Feb-2024 |
Arnd Bergmann <arnd@arndb.de> |
net: bql: fix building with BQL disabled It is now possible to disable BQL, but that causes the cpsw driver to break: drivers/net/ethernet/ti/am65-cpsw-nuss.c:297:28: error: no member named 'dql' in 'struct netdev_queue' 297 | dql_avail(&netif_txq->dql), There is already a helper function in net/sch_generic.h that could be used to help here. Move its implementation into the common linux/netdevice.h along with the other bql interfaces and change both users over to the new interface. Fixes: ea7f3cfaa588 ("net: bql: allow the config to be disabled") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0d60d8df |
|
22-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
dpll: rely on rcu for netdev_dpll_pin() This fixes a possible UAF in if_nlmsg_size(), which can run without RTNL. Add rcu protection to "struct dpll_pin" Move netdev_dpll_pin() from netdevice.h to dpll.h to decrease name pollution. Note: This looks possible to no longer acquire RTNL in netdev_dpll_pin_assign() later in net-next. v2: do not force rcu_read_lock() in rtnl_dpll_pin_size() (Jiri Pirko) Fixes: 5f1842692880 ("netdev: expose DPLL pin handle for netdevice") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240223123208.3543319-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c353c7b7 |
|
08-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
net-device: move lstats in net_device_read_txrx dev->lstats is notably used from loopback ndo_start_xmit() and other virtual drivers. Per cpu stats updates are dirtying per-cpu data, but the pointer itself is read-only. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe1eb24b |
|
04-Jan-2024 |
Jakub Kicinski <kuba@kernel.org> |
Revert "Introduce PHY listing and link_topology tracking" This reverts commit 32bb4515e34469975abc936deb0a116c4a445817. This reverts commit d078d480639a4f3b5fc2d56247afa38e0956483a. This reverts commit fcc4b105caa4b844bf043375bf799c20a9c99db1. This reverts commit 345237dbc1bdbb274c9fb9ec38976261ff4a40b8. This reverts commit 7db69ec9cfb8b4ab50420262631fb2d1908b25bf. This reverts commit 95132a018f00f5dad38bdcfd4180d1af955d46f6. This reverts commit 63d5eaf35ac36cad00cfb3809d794ef0078c822b. This reverts commit c29451aefcb42359905d18678de38e52eccb3bb5. This reverts commit 2ab0edb505faa9ac90dee1732571390f074e8113. This reverts commit dedd702a35793ab462fce4c737eeba0badf9718e. This reverts commit 034fcc210349b873ece7356905be5c6ca11eef2a. This reverts commit 9c5625f559ad6fe9f6f733c11475bf470e637d34. This reverts commit 02018c544ef113e980a2349eba89003d6f399d22. Looks like we need more time for reviews, and incremental changes will be hard to make sense of. So revert. Link: https://lore.kernel.org/all/ZZP6FV5sXEf+xd58@shell.armlinux.org.uk/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d3d344a1 |
|
02-Jan-2024 |
Eric Dumazet <edumazet@google.com> |
net-device: move xdp_prog to net_device_read_rx xdp_prog is used in receive path, both from XDP enabled drivers and from netif_elide_gro(). This patch also removes two 4-bytes holes. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240102162220.750823-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
993498e5 |
|
21-Dec-2023 |
Eric Dumazet <edumazet@google.com> |
net-device: move gso_partial_features to net_device_read_tx dev->gso_partial_features is read from tx fast path for GSO packets. Move it to appropriate section to avoid a cache line miss. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02018c54 |
|
21-Dec-2023 |
Maxime Chevallier <maxime.chevallier@bootlin.com> |
net: phy: Introduce ethernet link topology representation Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. The reason being that most of the logic for these configuration, coming from either ethtool netlink or ioctls tend to use netdev->phydev, which in multi-phy systems will reference the PHY closest to the MAC. Introduce a numbering scheme allowing to enumerate PHY devices that belong to any netdev, which can in turn allow userspace to take more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the phy is attached to. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1a36e0f5 |
|
17-Dec-2023 |
Ido Schimmel <idosch@nvidia.com> |
net: Add MDB bulk deletion device operation Add MDB net device operation that will be invoked by rtnetlink code in response to received 'RTM_DELMDB' messages with the 'NLM_F_BULK' flag set. Subsequent patches will implement the operation in the bridge and VXLAN drivers. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
facd15df |
|
04-Dec-2023 |
Johannes Berg <johannes.berg@intel.com> |
net: core: synchronize link-watch when carrier is queried There are multiple ways to query for the carrier state: through rtnetlink, sysfs, and (possibly) ethtool. Synchronize linkwatch work before these operations so that we don't have a situation where userspace queries the carrier state between the driver's carrier off->on transition and linkwatch running and expects it to work, when really (at least) TX cannot work until linkwatch has run. I previously posted a longer explanation of how this applies to wireless [1] but with this wireless can simply query the state before sending data, to ensure the kernel is ready for it. [1] https://lore.kernel.org/all/346b21d87c69f817ea3c37caceb34f1f56255884.camel@sipsolutions.net/ Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231204214706.303c62768415.I1caedccae72ee5a45c9085c5eb49c145ce1c0dd5@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
43a71cd6 |
|
04-Dec-2023 |
Coco Li <lixiaoyan@google.com> |
net-device: reorganize net_device fast path variables Reorganize fast path variables on tx-txrx-rx order Fastpath variables end after npinfo. Below data generated with pahole on x86 architecture. Fast path variables span cache lines before change: 12 Fast path variables span cache lines after change: 4 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231204201232.520025-2-lixiaoyan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
26793bfb |
|
01-Dec-2023 |
Amritha Nambiar <amritha.nambiar@intel.com> |
net: Add NAPI IRQ support Add support to associate the interrupt vector number for a NAPI instance. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147334728.5260.13221803396905901904.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2a502ff0 |
|
01-Dec-2023 |
Amritha Nambiar <amritha.nambiar@intel.com> |
net: Add queue and napi association Add the napi pointer in netdev queue for tracking the napi instance for each queue. This achieves the queue<->napi mapping. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Link: https://lore.kernel.org/r/170147331483.5260.15723438819994285695.stgit@anambiarhost.jf.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
48eb03dd |
|
27-Nov-2023 |
Stanislav Fomichev <sdf@google.com> |
xsk: Add TX timestamp and TX checksum offload support This change actually defines the (initial) metadata layout that should be used by AF_XDP userspace (xsk_tx_metadata). The first field is flags which requests appropriate offloads, followed by the offload-specific fields. The supported per-device offloads are exported via netlink (new xsk-flags). The offloads themselves are still implemented in a bit of a framework-y fashion that's left from my initial kfunc attempt. I'm introducing new xsk_tx_metadata_ops which drivers are supposed to implement. The drivers are also supposed to call xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since xsk_tx_metadata_{request,_complete} are static inline, we don't incur any extra overhead doing indirect calls. The benefit of this scheme is as follows: - keeps all metadata layout parsing away from driver code - makes it easy to grep and see which drivers implement what - don't need any extra flags to maintain to keep track of what offloads are implemented; if the callback is implemented - the offload is supported (used by netlink reporting code) Two offloads are defined right now: 1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset 2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata area upon completion (tx_timestamp field) XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass metadata pointer). The struct is forward-compatible and can be extended in the future by appending more fields. Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
083772c9 |
|
26-Nov-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: page_pool: record pools per netdev Link the page pools with netdevs. This needs to be netns compatible so we have two options. Either we record the pools per netns and have to worry about moving them as the netdev gets moved. Or we record them directly on the netdev so they move with the netdev without any extra work. Implement the latter option. Since pools may outlast netdev we need a place to store orphans. In time honored tradition use loopback for this purpose. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
289354f2 |
|
18-Nov-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: partial revert of the "Make timestamping selectable: series Revert following commits: commit acec05fb78ab ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask") commit 11d55be06df0 ("net: ethtool: Add a command to expose current time stamping layer") commit bb8645b00ced ("netlink: specs: Introduce new netlink command to get current timestamp") commit d905f9c75329 ("net: ethtool: Add a command to list available time stamping layers") commit aed5004ee7a0 ("netlink: specs: Introduce new netlink command to list available time stamping layers") commit 51bdf3165f01 ("net: Replace hwtstamp_source by timestamping layer") commit 0f7f463d4821 ("net: Change the API of PHY default timestamp to MAC") commit 091fab122869 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp") commit 152c75e1d002 ("net: ethtool: ts: Let the active time stamping layer be selectable") commit ee60ea6be0d3 ("netlink: specs: Introduce time stamping set command") They need more time for reviews. Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0f7f463d |
|
13-Nov-2023 |
Kory Maincent <kory.maincent@bootlin.com> |
net: Change the API of PHY default timestamp to MAC Change the API to select MAC default time stamping instead of the PHY. Indeed the PHY is closer to the wire therefore theoretically it has less delay than the MAC timestamping but the reality is different. Due to lower time stamping clock frequency, latency in the MDIO bus and no PHC hardware synchronization between different PHY, the PHY PTP is often less precise than the MAC. The exception is for PHY designed specially for PTP case but these devices are not very widespread. For not breaking the compatibility I introduce a default_timestamp flag in phy_device that is set by the phy driver to know we are using the old API behavior. The phy_set_timestamp function is called at each call of phy_attach_direct. In case of MAC driver using phylink this function is called when the interface is turned up. Then if the interface goes down and up again the last choice of timestamp will be overwritten by the default choice. A solution could be to cache the timestamp status but it can bring other issues. In case of SFP, if we change the module, it doesn't make sense to blindly re-set the timestamp back to PHY, if the new module has a PHY with mediocre timestamping capabilities. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
011dd3b3 |
|
13-Nov-2023 |
Kory Maincent <kory.maincent@bootlin.com> |
net: Make dev_set_hwtstamp_phylib accessible Make the dev_set_hwtstamp_phylib function accessible in prevision to use it from ethtool to reset the tstamp current configuration. Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34d21de9 |
|
13-Nov-2023 |
Daniel Borkmann <daniel@iogearbox.net> |
net: Move {l,t,d}stats allocation to core and convert veth & vrf Move {l,t,d}stats allocation to the core and let netdevs pick the stats type they need. That way the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc) - all happening in the core. Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
79e0c5be |
|
13-Nov-2023 |
Daniel Borkmann <daniel@iogearbox.net> |
net, vrf: Move dstats structure to core Just move struct pcpu_dstats out of the vrf into the core, and streamline the field names slightly, so they better align with the {t,l}stats ones. No functional change otherwise. A conversion of the u64s to u64_stats_t could be done at a separate point in future. This move is needed as we are moving the {t,l,d}stats allocation/freeing to the core. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
62f47bf9 |
|
25-Oct-2023 |
Ido Schimmel <idosch@nvidia.com> |
net: Add MDB get device operation Add MDB net device operation that will be invoked by rtnetlink code in response to received RTM_GETMDB messages. Subsequent patches will implement the operation in the bridge and VXLAN drivers. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f3eb217 |
|
18-Oct-2023 |
Christian Marangi <ansuelsmth@gmail.com> |
net: introduce napi_is_scheduled helper We currently have napi_if_scheduled_mark_missed that can be used to check if napi is scheduled but that does more thing than simply checking it and return a bool. Some driver already implement custom function to check if napi is scheduled. Drop these custom function and introduce napi_is_scheduled that simply check if napi is scheduled atomically. Update any driver and code that implement a similar check and instead use this new helper. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
38985e8c |
|
09-Oct-2023 |
Amit Cohen <amcohen@nvidia.com> |
net: Handle bulk delete policy in bridge driver The merge commit 92716869375b ("Merge branch 'br-flush-filtering'") added support for FDB flushing in bridge driver. The following patches will extend VXLAN driver to support FDB flushing as well. The netlink message for bulk delete is shared between the drivers. With the existing implementation, there is no way to prevent user from flushing with attributes that are not supported per driver. For example, when VNI will be added, user will not get an error for flush FDB entries in bridge with VNI, although this attribute is not relevant for bridge. As preparation for support of FDB flush in VXLAN driver, move the policy to be handled in bridge driver, later a new policy for VXLAN will be added in VXLAN driver. Do not pass 'vid' as part of ndo_fdb_del_bulk(), as this field is relevant only for bridge. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
73382e91 |
|
09-Oct-2023 |
Christian Marangi <ansuelsmth@gmail.com> |
netdev: replace napi_reschedule with napi_schedule Now that napi_schedule return a bool, we can drop napi_reschedule that does the same exact function. The function comes from a very old commit bfe13f54f502 ("ibm_emac: Convert to use napi_struct independent of struct net_device") and the purpose is actually deprecated in favour of different logic. Convert every user of napi_reschedule to napi_schedule. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com> # ath10k Acked-by: Nick Child <nnac123@linux.ibm.com> # ibm Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for can/dev/rx-offload.c Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20231009133754.9834-3-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0a779003 |
|
09-Oct-2023 |
Christian Marangi <ansuelsmth@gmail.com> |
netdev: make napi_schedule return bool on NAPI successful schedule Change napi_schedule to return a bool on NAPI successful schedule. This might be useful for some driver to do additional steps after a NAPI has been scheduled. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231009133754.9834-2-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5247dbf1 |
|
09-Oct-2023 |
Yajun Deng <yajun.deng@linux.dev> |
net/core: Introduce netdev_core_stats_inc() Although there is a kfree_skb_reason() helper function that can be used to find the reason why this skb is dropped, but most callers didn't increase one of rx_dropped, tx_dropped, rx_nohandler and rx_otherhost_dropped. For the users, people are more concerned about why the dropped in ip is increasing. Introduce netdev_core_stats_inc() for trace the caller of dev_core_stats_*_inc(). Also, add __code to netdev_core_stats_alloc(), as it's called with small probability. And add noinline make sure netdev_core_stats_inc was never inlined. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0b068c71 |
|
21-Sep-2023 |
Eric Dumazet <edumazet@google.com> |
net: add DEV_STATS_READ() helper Companion of DEV_STATS_INC() & DEV_STATS_ADD(). This is going to be used in the series. Use it in macsec_get_stats64(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b230812b |
|
17-Sep-2023 |
Sujuan Chen <sujuan.chen@mediatek.com> |
net: ethernet: mtk_wed: introduce partial AMSDU offload support for MT7988 Introduce partial AMSDU offload support for MT7988 SoC in order to merge in hw packets belonging to the same AMSDU before passing them to the WLAN nic. Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
5f184269 |
|
13-Sep-2023 |
Jiri Pirko <jiri@resnulli.us> |
netdev: expose DPLL pin handle for netdevice In case netdevice represents a SyncE port, the user needs to understand the connection between netdevice and associated DPLL pin. There might me multiple netdevices pointing to the same pin, in case of VF/SF implementation. Add a IFLA Netlink attribute to nest the DPLL pin handle, similar to how it is implemented for devlink port. Add a struct dpll_pin pointer to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set() and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear the DPLL pin relationship to netdev. Note that during the lifetime of struct dpll_pin the pin handle does not change. Therefore it is save to access it lockless. It is drivers responsibility to call netdev_dpll_pin_clear() before dpll_pin_put(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
680ee045 |
|
02-Aug-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: invert the netdevice.h vs xdp.h dependency xdp.h is far more specific and is included in only 67 other files vs netdevice.h's 1538 include sites. Make xdp.h include netdevice.h, instead of the other way around. This decreases the incremental allmodconfig builds size when xdp.h is touched from 5947 to 662 objects. Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h is a mega-header in its own right so it's nice to avoid xdp.h getting included there as well. The only unfortunate part is that the typedef for xdp_features_t has to move to netdevice.h, since its embedded in struct netdevice. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
49e47a5b |
|
02-Aug-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: move struct netdev_rx_queue out of netdevice.h struct netdev_rx_queue is touched in only a few places and having it defined in netdevice.h brings in the dependency on xdp.h, because struct xdp_rxq_info gets embedded in struct netdev_rx_queue. In prep for removal of xdp.h from netdevice.h move all the netdev_rx_queue stuff to a new header. We could technically break the new header up to avoid the sysfs.h include but it's so rarely included it doesn't seem to be worth it at this point. Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
f11e5bd1 |
|
01-Aug-2023 |
Mateusz Kowalski <mko@redhat.com> |
bonding: support balance-alb with openvswitch Commit d5410ac7b0ba ("net:bonding:support balance-alb interface with vlan to bridge") introduced a support for balance-alb mode for interfaces connected to the linux bridge by fixing missing matching of MAC entry in FDB. In our testing we discovered that it still does not work when the bond is connected to the OVS bridge as show in diagram below: eth1(mac:eth1_mac)--bond0(balance-alb,mac:eth0_mac)--eth0(mac:eth0_mac) | bond0.150(mac:eth0_mac) | ovs_bridge(ip:bridge_ip,mac:eth0_mac) This patch fixes it by checking not only if the device is a bridge but also if it is an openvswitch. Signed-off-by: Mateusz Kowalski <mko@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/9fe7297c-609e-208b-c77b-3ceef6eb51a4@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
fd770e85 |
|
01-Aug-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers It is desirable that the new .ndo_hwtstamp_set() API gives more uniformity, less overhead and future flexibility w.r.t. the PHY timestamping behavior. Currently there are some drivers which allow PHY timestamping through the procedure mentioned in Documentation/networking/timestamping.rst. They don't do anything locally if phy_has_hwtstamp() is set, except for lan966x which installs PTP packet traps. Centralize that behavior in a new dev_set_hwtstamp_phylib() code function, which calls either phy_mii_ioctl() for the phylib PHY, or .ndo_hwtstamp_set() of the netdev, based on a single policy (currently simplistic: phy_has_hwtstamp()). Any driver converted to .ndo_hwtstamp_set() will automatically opt into the centralized phylib timestamping policy. Unconverted drivers still get to choose whether they let the PHY handle timestamping or not. Netdev drivers with integrated PHY drivers that don't use phylib presumably don't set dev->phydev, and those will always see HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping policy will remain 100% up to them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e47d01fe |
|
01-Aug-2023 |
Maxim Georgiev <glipus@gmail.com> |
net: add hwtstamping helpers for stackable net devices The stackable net devices with hwtstamping support (vlan, macvlan, bonding) only pass the hwtstamping ops to the lower (real) device. These drivers are the first that need to be converted to the new timestamping API, because if they aren't prepared to handle that, then no real device driver cannot be converted to the new API either. After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and bond_eth_ioctl() have in common, here we propose two generic implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which can be called by those 3 drivers, with "dev" being their lower device. These helpers cover both cases, when the lower driver is converted to the new API or unconverted. We need some hacks in case of an unconverted driver, namely to stuff some pointers in struct kernel_hwtstamp_config which shouldn't have been there (since the new API isn't supposed to need it). These will be removed when all drivers will have been converted to the new API. Signed-off-by: Maxim Georgiev <glipus@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
66f72230 |
|
01-Aug-2023 |
Maxim Georgiev <glipus@gmail.com> |
net: add NDOs for configuring hardware timestamping Current hardware timestamping API for NICs requires implementing .ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP. That API has some boilerplate such as request parameter translation between user and kernel address spaces, handling possible translation failures correctly, etc. Since it is the same all across the board, it would be desirable to handle it through generic code. Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which implement that boilerplate and allow drivers to just act upon requests. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Maxim Georgiev <glipus@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
84e00d9b |
|
26-Jul-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: convert some netlink netdev iterators to depend on the xarray Reap the benefits of easier iteration thanks to the xarray. Convert just the genetlink ones, those are easier to test. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d0358c1a |
|
26-Jul-2023 |
YueHaibing <yuehaibing@huawei.com> |
net: Remove unused declaration dev_restart() This is not used, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230726143715.24700-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
a097627d |
|
21-Jul-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
net: add missing net_device::xdp_zc_max_segs description Cited commit under 'Fixes' tag introduced new member to struct net_device without providing description of it - fix it. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/all/20230720141613.61488b9e@canb.auug.org.au/ Fixes: 13ce2daa259a ("xsk: add new netlink attribute dedicated for ZC max frags") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Simon Horman <simon.horman@corigine.com> # build-tested Link: https://lore.kernel.org/r/20230721145808.596298-1-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
e420bed0 |
|
19-Jul-2023 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: Add fd-based tcx multi-prog infra with link support This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work which we also presented at LPC [0] last year and a recent update at LSF/MM/BPF this year [3] is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment. Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard to debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each others toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example: - From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1] - From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2] BPF links for tc can co-exist with non-link attachments, and the semantics are in line also with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve mentioned issue of safe ownership model as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware. Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way, is getting into layering hacks given the two object models are vastly different. We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code, just the BPF loader mechanics for attaching is sufficient. For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differ between the classic cls_bpf attachment and the fd-based one. For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar as with classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective. The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given its all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline. tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at a point in future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail. The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB. Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work. [0] https://lpc.events/event/16/contributions/1353/ [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
13ce2daa |
|
19-Jul-2023 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
xsk: add new netlink attribute dedicated for ZC max frags Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will carry maximum fragments that underlying ZC driver is able to handle on TX side. It is going to be included in netlink response only when driver supports ZC. Any value higher than 1 implies multi-buffer ZC support on underlying device. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
9a5cb797 |
|
21-Jun-2023 |
Gilad Sever <gilad9366@gmail.com> |
bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings When calling bpf_sk_lookup_tcp(), bpf_sk_lookup_udp() or bpf_skc_lookup_tcp() from tc/xdp ingress, VRF socket bindings aren't respoected, i.e. unbound sockets are returned, and bound sockets aren't found. VRF binding is determined by the sdif argument to sk_lookup(), however when called from tc the IP SKB control block isn't initialized and thus inet{,6}_sdif() always returns 0. Fix by calculating sdif for the tc/xdp flows by observing the device's l3 enslaved state. The cg/sk_skb hooking points which are expected to support inet{,6}_sdif() pass sdif=-1 which makes __bpf_skc_lookup() use the existing logic. Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") Signed-off-by: Gilad Sever <gilad9366@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Eyal Birger <eyal.birger@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/20230621104211.301902-4-gilad9366@gmail.com
|
#
70f7457a |
|
12-Jun-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: create device lookup API with reference tracking New users of dev_get_by_index() and dev_get_by_name() keep getting added and it would be nice to steer them towards the APIs with reference tracking. Add variants of those calls which allocate the reference tracker and use them in a couple of places. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d457a0e3 |
|
08-Jun-2023 |
Eric Dumazet <edumazet@google.com> |
net: move gso declarations and functions to their own files Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d636fc5d |
|
06-Jun-2023 |
Eric Dumazet <edumazet@google.com> |
net: sched: add rcu annotations around qdisc->qdisc_sleeping syzbot reported a race around qdisc->qdisc_sleeping [1] It is time we add proper annotations to reads and writes to/from qdisc->qdisc_sleeping. [1] BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1: qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331 __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174 tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547 rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0: dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115 qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103 tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693 rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023 Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@nvidia.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c3b74a9 |
|
06-Jun-2023 |
Eric Dumazet <edumazet@google.com> |
rfs: annotate lockless accesses to RFS sock flow table Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table. This also prevents a (smart ?) compiler to remove the condition in: if (table->ents[index] != newval) table->ents[index] = newval; We need the condition to avoid dirtying a shared cache line. Fixes: fec5e652e58f ("rfs: Receive Flow Steering") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
87eff2ec |
|
21-Apr-2023 |
Eric Dumazet <edumazet@google.com> |
net: optimize napi_threaded_poll() vs RPS/RFS We use napi_threaded_poll() in order to reduce our softirq dependency. We can add a followup of 821eba962d95 ("net: optimize napi_schedule_rps()") to further remove the need of firing NET_RX_SOFTIRQ whenever RPS/RFS are used. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b8285cc |
|
19-Apr-2023 |
Johannes Berg <johannes.berg@intel.com> |
net: move dropreason.h to dropreason-core.h This will, after the next patch, hold only the core drop reasons and minimal infrastructure. Fix a small kernel-doc issue while at it, to avoid the move triggering a checker. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8c48eea3 |
|
12-Apr-2023 |
Jakub Kicinski <kuba@kernel.org> |
page_pool: allow caching from safely localized NAPI Recent patches to mlx5 mentioned a regression when moving from driver local page pool to only using the generic page pool code. Page pool has two recycling paths (1) direct one, which runs in safe NAPI context (basically consumer context, so producing can be lockless); and (2) via a ptr_ring, which takes a spin lock because the freeing can happen from any CPU; producer and consumer may run concurrently. Since the page pool code was added, Eric introduced a revised version of deferred skb freeing. TCP skbs are now usually returned to the CPU which allocated them, and freed in softirq context. This places the freeing (producing of pages back to the pool) enticingly close to the allocation (consumer). If we can prove that we're freeing in the same softirq context in which the consumer NAPI will run - lockless use of the cache is perfectly fine, no need for the lock. Let drivers link the page pool to a NAPI instance. If the NAPI instance is scheduled on the same CPU on which we're freeing - place the pages in the direct cache. With that and patched bnxt (XDP enabled to engage the page pool, sigh, bnxt really needs page pool work :() I see a 2.6% perf boost with a TCP stream test (app on a different physical core than softirq). The CPU use of relevant functions decreases as expected: page_pool_refill_alloc_cache 1.17% -> 0% _raw_spin_lock 2.41% -> 0.98% Only consider lockless path to be safe when NAPI is scheduled - in practice this should cover majority if not all of steady state workloads. It's usually the NAPI kicking in that causes the skb flush. The main case we'll miss out on is when application runs on the same CPU as NAPI. In that case we don't use the deferred skb free path. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
301f227f |
|
06-Apr-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: piggy back on the memory barrier in bql when waking queues Drivers call netdev_tx_completed_queue() right before netif_txq_maybe_wake(). If BQL is enabled netdev_tx_completed_queue() should issue a memory barrier, so we can depend on that separating the stop check from the consumer index update, instead of adding another barrier in netif_txq_maybe_wake(). This matters more than the barriers on the xmit path, because the wake condition is almost always true. So we issue the consumer side barrier often. Wrap netdev_tx_completed_queue() in a local helper to issue the barrier even if BQL is disabled. Keep the same semantics as netdev_tx_completed_queue() (barrier only if bytes != 0) to make it clear that the barrier is conditional. Plus since macro gets pkt/byte counts as arguments now - we can skip waking if there were no packets completed. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c91c46de |
|
06-Apr-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: provide macros for commonly copied lockless queue stop/wake code A lot of drivers follow the same scheme to stop / start queues without introducing locks between xmit and NAPI tx completions. I'm guessing they all copy'n'paste each other's code. The original code dates back all the way to e1000 and Linux 2.6.19. Smaller drivers shy away from the scheme and introduce a lock which may cause deadlocks in netpoll. Provide macros which encapsulate the necessary logic. The macros do not prevent false wake ups, the extra barrier required to close that race is not worth it. See discussion in: https://lore.kernel.org/all/c39312a2-4537-14b4-270c-9fe1fbb91e89@gmail.com/ Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5a178186 |
|
06-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a netdev notifier for DSA to reject PTP on DSA master"), due to a desire to convert DSA's attempt to deny TX timestamping on a DSA master to something that doesn't block the kernel-wide API conversion from ndo_eth_ioctl() to ndo_hwtstamp_set(). What was required was a mechanism that did not depend on ndo_eth_ioctl(), and what was provided was a mechanism that did not depend on ndo_eth_ioctl(), while at the same time introducing something that wasn't absolutely necessary - a new netdev notifier. There have been objections from Jakub Kicinski that using notifiers in general when they are not absolutely necessary creates complications to the control flow and difficulties to maintainers who look at the code. So there is a desire to not use notifiers. In addition to that, the notifier chain gets called even if there is no DSA in the system and no one is interested in applying any restriction. Take the model of udp_tunnel_nic_ops and introduce a stub mechanism, through which net/core/dev_ioctl.c can call into DSA even when CONFIG_NET_DSA=m. Compared to the code that existed prior to the notifier conversion, aka what was added in commits: - 4cfab3566710 ("net: dsa: Add wrappers for overloaded ndo_ops") - 3369afba1e46 ("net: Call into DSA netdevice_ops wrappers") this is different because we are not overloading any struct net_device_ops of the DSA master anymore, but rather, we are exposing a rather specific functionality which is orthogonal to which API is used to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set(). Also, what is similar is that both approaches use function pointers to get from built-in code to DSA. There is no point in replicating the function pointers towards __dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr). Instead, it is sufficient to introduce a singleton struct dsa_stubs, built into the kernel, which contains a single function pointer to __dsa_master_hwtstamp_validate(). I find this approach preferable to what we had originally, because dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any attempts to add any data encapsulation and hide DSA data structures from the outside world. Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
88c0a6b5 |
|
02-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: create a netdev notifier for DSA to reject PTP on DSA master The fact that PTP 2-step TX timestamping is broken on DSA switches if the master also timestamps the same packets is documented by commit f685e609a301 ("net: dsa: Deny PTP on master if switch supports it"). We attempt to help the users avoid shooting themselves in the foot by making DSA reject the timestamping ioctls on an interface that is a DSA master, and the switch tree beneath it contains switches which are aware of PTP. The only problem is that there isn't an established way of intercepting ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network stack by creating a struct dsa_netdevice_ops with overlaid function pointers that are manually checked from the relevant call sites. There used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only one left. There is an ongoing effort to migrate driver-visible hardware timestamping control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set() model, but DSA actively prevents that migration, since dsa_master_ioctl() is currently coded to manually call the master's legacy ndo_eth_ioctl(), and so, whenever a network device driver would be converted to the new API, DSA's restrictions would be circumvented, because any device could be used as a DSA master. The established way for unrelated modules to react on a net device event is via netdevice notifiers. So we create a new notifier which gets called whenever there is an attempt to change hardware timestamping settings on a device. Finally, there is another reason why a netdev notifier will be a good idea, besides strictly DSA, and this has to do with PHY timestamping. With ndo_eth_ioctl(), all MAC drivers must manually call phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP, otherwise they must pass this ioctl to the PHY driver via phy_mii_ioctl(). With the new ndo_hwtstamp_set() API, it will be desirable to simply not make any calls into the MAC device driver when timestamping should be performed at the PHY level. But there exist drivers, such as the lan966x switch, which need to install packet traps for PTP regardless of whether they are the layer that provides the hardware timestamps, or the PHY is. That would be impossible to support with the new API. The proposal there, too, is to introduce a netdev notifier which acts as a better cue for switching drivers to add or remove PTP packet traps, than ndo_hwtstamp_set(). The one introduced here "almost" works there as well, except for the fact that packet traps should only be installed if the PHY driver succeeded to enable hardware timestamping, whereas here, we need to deny hardware timestamping on the DSA master before it actually gets enabled. This is why this notifier is called "PRE_", and the notifier that would get used for PHY timestamping and packet traps would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing. In expectation of future netlink UAPI, we also pass a non-NULL extack pointer to the netdev notifier, and we make DSA populate it with an informative reason for the rejection. To avoid making it go to waste, we make the ioctl-based dev_set_hwtstamp() create a fake extack and print the message to the kernel log. Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/ Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dd2d6604 |
|
30-Mar-2023 |
Jakub Kicinski <kuba@kernel.org> |
net: minor reshuffle of napi_struct napi_id is read by GRO and drivers to mark skbs, and it currently sits at the end of the structure, in a mostly unused cache line. Move it up into a hole, and separate the clearly control path fields from the important ones. Before: struct napi_struct { struct list_head poll_list; /* 0 16 */ long unsigned int state; /* 16 8 */ int weight; /* 24 4 */ int defer_hard_irqs_count; /* 28 4 */ long unsigned int gro_bitmask; /* 32 8 */ int (*poll)(struct napi_struct *, int); /* 40 8 */ int poll_owner; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ struct net_device * dev; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct gro_list gro_hash[8]; /* 64 192 */ /* --- cacheline 4 boundary (256 bytes) --- */ struct sk_buff * skb; /* 256 8 */ struct list_head rx_list; /* 264 16 */ int rx_count; /* 280 4 */ /* XXX 4 bytes hole, try to pack */ struct hrtimer timer; /* 288 64 */ /* XXX last struct has 4 bytes of padding */ /* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */ struct list_head dev_list; /* 352 16 */ struct hlist_node napi_hash_node; /* 368 16 */ /* --- cacheline 6 boundary (384 bytes) --- */ unsigned int napi_id; /* 384 4 */ /* XXX 4 bytes hole, try to pack */ struct task_struct * thread; /* 392 8 */ /* size: 400, cachelines: 7, members: 17 */ /* sum members: 388, holes: 3, sum holes: 12 */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; After: struct napi_struct { struct list_head poll_list; /* 0 16 */ long unsigned int state; /* 16 8 */ int weight; /* 24 4 */ int defer_hard_irqs_count; /* 28 4 */ long unsigned int gro_bitmask; /* 32 8 */ int (*poll)(struct napi_struct *, int); /* 40 8 */ int poll_owner; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ struct net_device * dev; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct gro_list gro_hash[8]; /* 64 192 */ /* --- cacheline 4 boundary (256 bytes) --- */ struct sk_buff * skb; /* 256 8 */ struct list_head rx_list; /* 264 16 */ int rx_count; /* 280 4 */ unsigned int napi_id; /* 284 4 */ struct hrtimer timer; /* 288 64 */ /* XXX last struct has 4 bytes of padding */ /* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */ struct task_struct * thread; /* 352 8 */ struct list_head dev_list; /* 360 16 */ struct hlist_node napi_hash_node; /* 376 16 */ /* size: 392, cachelines: 7, members: 17 */ /* sum members: 388, holes: 1, sum holes: 4 */ /* paddings: 1, sum paddings: 4 */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8))); Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c59647c0 |
|
28-Mar-2023 |
Eric Dumazet <edumazet@google.com> |
net: add softnet_data.in_net_rx_action We want to make two optimizations in napi_schedule_rps() and ____napi_schedule() which require to know if these helpers are called from net_rx_action(), instead of being called from other contexts. sd.in_net_rx_action is only read/written by the owning cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Tested-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
3eb8eea2 |
|
21-Mar-2023 |
Jakub Kicinski <kuba@kernel.org> |
docs: networking: document NAPI Add basic documentation about NAPI. We can stop linking to the ancient doc on the LF wiki. Link: https://lore.kernel.org/all/20230315223044.471002-1-kuba@kernel.org/ Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> # for ctucanfd-driver.rst Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20230322053848.198452-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1cc6571f |
|
21-Mar-2023 |
Nick Child <nnac123@linux.ibm.com> |
netdev: Enforce index cap in netdev_get_tx_queue When requesting a TX queue at a given index, warn on out-of-bounds referencing if the index is greater than the allocated number of queues. Specifically, since this function is used heavily in the networking stack use DEBUG_NET_WARN_ON_ONCE to avoid executing a new branch on every packet. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://lore.kernel.org/r/20230321150725.127229-2-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8c44fa12 |
|
15-Mar-2023 |
Ido Schimmel <idosch@nvidia.com> |
net: Add MDB net device operations Add MDB net device operations that will be invoked by rtnetlink code in response to received RTM_{NEW,DEL,GET}MDB messages. Subsequent patches will implement these operations in the bridge and VXLAN drivers. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
40bbae58 |
|
06-Mar-2023 |
Eric Dumazet <edumazet@google.com> |
net: remove enum skb_free_reason enum skb_drop_reason is more generic, we can adopt it instead. Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason(). This means drivers can use more precise drop reasons if they want to. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0cd917a4 |
|
12-Apr-2023 |
Jesper Dangaard Brouer <brouer@redhat.com> |
xdp: rss hash types representation The RSS hash type specifies what portion of packet data NIC hardware used when calculating RSS hash value. The RSS types are focused on Internet traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4 primarily TCP vs UDP, but some hardware supports SCTP. Hardware RSS types are differently encoded for each hardware NIC. Most hardware represent RSS hash type as a number. Determining L3 vs L4 often requires a mapping table as there often isn't a pattern or sorting according to ISO layer. The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that contains both BITs for the L3/L4 types, and combinations to be used by drivers for their mapping tables. The enum xdp_rss_type_bits get exposed to BPF via BTF, and it is up to the BPF-programmer to match using these defines. This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding a pointer value argument for provide the RSS hash type. Change signature for all xmo_rx_hash calls in drivers to make it compile. The RSS type implementations for each driver comes as separate patches. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
4b397c06 |
|
10-Mar-2023 |
Eric Dumazet <edumazet@google.com> |
net: tunnels: annotate lockless accesses to dev->needed_headroom IP tunnels can apparently update dev->needed_headroom in their xmit path. This patch takes care of three tunnels xmit, and also the core LL_RESERVED_SPACE() and LL_RESERVED_SPACE_EXTRA() helpers. More changes might be needed for completeness. BUG: KCSAN: data-race in ip_tunnel_xmit / ip_tunnel_xmit read to 0xffff88815b9da0ec of 2 bytes by task 888 on cpu 1: ip_tunnel_xmit+0x1270/0x1730 net/ipv4/ip_tunnel.c:803 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228 ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:444 [inline] ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126 iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 write to 0xffff88815b9da0ec of 2 bytes by task 2379 on cpu 0: ip_tunnel_xmit+0x1294/0x1730 net/ipv4/ip_tunnel.c:804 __gre_xmit net/ipv4/ip_gre.c:469 [inline] ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661 __netdev_start_xmit include/linux/netdevice.h:4881 [inline] netdev_start_xmit include/linux/netdevice.h:4895 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596 __dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3051 [inline] neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623 neigh_output include/net/neighbour.h:546 [inline] ip6_finish_output2+0x9bc/0xc50 net/ipv6/ip6_output.c:134 __ip6_finish_output net/ipv6/ip6_output.c:195 [inline] ip6_finish_output+0x39a/0x4e0 net/ipv6/ip6_output.c:206 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip6_output+0xeb/0x220 net/ipv6/ip6_output.c:227 dst_output include/net/dst.h:444 [inline] NF_HOOK include/linux/netfilter.h:302 [inline] mld_sendpack+0x438/0x6a0 net/ipv6/mcast.c:1820 mld_send_cr net/ipv6/mcast.c:2121 [inline] mld_ifc_work+0x519/0x7b0 net/ipv6/mcast.c:2653 process_one_work+0x3e6/0x750 kernel/workqueue.c:2390 worker_thread+0x5f2/0xa10 kernel/workqueue.c:2537 kthread+0x1ac/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 value changed: 0x0dd4 -> 0x0e14 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 2379 Comm: kworker/0:0 Not tainted 6.3.0-rc1-syzkaller-00002-g8ca09d5fa354-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023 Workqueue: mld mld_ifc_work Fixes: 8eb30be0352d ("ipv6: Create ip6_tnl_xmit") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230310191109.2384387-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
50bcfe8d |
|
17-Feb-2023 |
Paolo Abeni <pabeni@redhat.com> |
net: make default_rps_mask a per netns attribute That really was meant to be a per netns attribute from the beginning. The idea is that once proper isolation is in place in the main namespace, additional demux in the child namespaces will be redundant. Let's make child netns default rps mask empty by default. To avoid bloating the netns with a possibly large cpumask, allocate it on-demand during the first write operation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
605cfa1b |
|
07-Feb-2023 |
Paolo Abeni <pabeni@redhat.com> |
net: introduce default_rps_mask netns attribute If RPS is enabled, this allows configuring a default rps mask, which is effective since receive queue creation time. A default RPS mask allows the system admin to ensure proper isolation, avoiding races at network namespace or device creation time. The default RPS mask is initially empty, and can be modified via a newly added sysctl entry. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
26759bee |
|
06-Feb-2023 |
Lorenzo Bianconi <lorenzo@kernel.org> |
net, xdp: Add missing xdp_features description Add missing xdp_features field description in the struct net_device documentation. This patch fix the following warning: [...] ./include/linux/netdevice.h:2375: warning: Function parameter or member 'xdp_features' not described in 'net_device' [...] Fixes: d3d854fd6a1d ("netdev-genl: create a simple family for netdev stuff") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/7878544903d855b49e838c9d59f715bde0b5e63b.1675705948.git.lorenzo@kernel.org
|
#
d3d854fd |
|
01-Feb-2023 |
Jakub Kicinski <kuba@kernel.org> |
netdev-genl: create a simple family for netdev stuff Add a Netlink spec-compatible family for netdevs. This is a very simple implementation without much thought going into it. It allows us to reap all the benefits of Netlink specs, one can use the generic client to issue the commands: $ ./cli.py --spec netdev.yaml --dump dev_get [{'ifindex': 1, 'xdp-features': set()}, {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}}, {'ifindex': 3, 'xdp-features': {'rx-sg'}}] the generic python library does not have flags-by-name support, yet, but we also don't have to carry strings in the messages, as user space can get the names from the spec. Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Co-developed-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
9eefedd5 |
|
28-Jan-2023 |
Xin Long <lucien.xin@gmail.com> |
net: add gso_ipv4_max_size and gro_ipv4_max_size per device This patch introduces gso_ipv4_max_size and gro_ipv4_max_size per device and adds netlink attributes for them, so that IPV4 BIG TCP can be guarded by a separate tunable in the next patch. To not break the old application using "gso/gro_max_size" for IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size" in netif_set_gso/gro_max_size() if the new size isn't greater than GSO_LEGACY_MAX_SIZE, so that nothing will change even if userspace doesn't realize the new netlink attributes. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
7681a4f5 |
|
24-Jan-2023 |
Leon Romanovsky <leon@kernel.org> |
xfrm: extend add state callback to set failure reason Almost all validation logic is in the drivers, but they are missing reliable way to convey failure reason to userspace applications. Let's use extack to return this information to users. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
3089386d |
|
24-Jan-2023 |
Leon Romanovsky <leon@kernel.org> |
xfrm: extend add policy callback to set failure reason Almost all validation logic is in the drivers, but they are missing reliable way to convey failure reason to userspace applications. Let's use extack to return this information to users. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
3d76a4d3 |
|
19-Jan-2023 |
Stanislav Fomichev <sdf@google.com> |
bpf: XDP metadata RX kfuncs Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible XDP metatada kfuncs. Not all devices have to implement them. If kfunc is not supported by the target device, the default implementation is called instead. The verifier, at load time, replaces a call to the generic kfunc with a call to the per-device one. Per-device kfunc pointers are stored in separate struct xdp_metadata_ops. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-8-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
#
b20b8aec |
|
15-Feb-2023 |
Ido Schimmel <idosch@nvidia.com> |
devlink: Fix netdev notifier chain corruption Cited commit changed devlink to register its netdev notifier block on the global netdev notifier chain instead of on the per network namespace one. However, when changing the network namespace of the devlink instance, devlink still tries to unregister its notifier block from the chain of the old namespace and register it on the chain of the new namespace. This results in corruption of the notifier chains, as the same notifier block is registered on two different chains: The global one and the per network namespace one. In turn, this causes other problems such as the inability to dismantle namespaces due to netdev reference count issues. Fix by preventing devlink from moving its notifier block between namespaces. Reproducer: # echo "10 1" > /sys/bus/netdevsim/new_device # ip netns add test123 # devlink dev reload netdevsim/netdevsim10 netns test123 # ip netns del test123 [ 71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 71.938348] leaked reference. Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
8a321cf7 |
|
09-Dec-2022 |
Xin Long <lucien.xin@gmail.com> |
net: add IFF_NO_ADDRCONF and use it in bonding to prevent ipv6 addrconf Currently, in bonding it reused the IFF_SLAVE flag and checked it in ipv6 addrconf to prevent ipv6 addrconf. However, it is not a proper flag to use for no ipv6 addrconf, for bonding it has to move IFF_SLAVE flag setting ahead of dev_open() in bond_enslave(). Also, IFF_MASTER/SLAVE are historical flags used in bonding and eql, as Jiri mentioned, the new devices like Team, Failover do not use this flag. So as Jiri suggested, this patch adds IFF_NO_ADDRCONF in priv_flags of the device to indicate no ipv6 addconf, and uses it in bonding and moves IFF_SLAVE flag setting back to its original place. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f3da86dc |
|
02-Dec-2022 |
Leon Romanovsky <leon@kernel.org> |
xfrm: add support to HW update soft and hard limits Both in RX and TX, the traffic that performs IPsec packet offload transformation is accounted by HW. It is needed to properly handle hard limits that require to drop the packet. It means that XFRM core needs to update internal counters with the one that accounted by the HW, so new callbacks are introduced in this patch. In case of soft or hard limit is occurred, the driver should call to xfrm_state_check_expire() that will perform key rekeying exactly as done by XFRM core. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
919e43fa |
|
02-Dec-2022 |
Leon Romanovsky <leon@kernel.org> |
xfrm: add an interface to offload policy Extend netlink interface to add and delete XFRM policy from the device. This functionality is a first step to implement packet IPsec offload solution. Signed-off-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
d9360708 |
|
30-Nov-2022 |
Heiner Kallweit <hkallweit1@gmail.com> |
net: add netdev_sw_irq_coalesce_default_on() Add a helper for drivers wanting to set SW IRQ coalescing by default. The related sysfs attributes can be used to override the default values. Follow Jakub's suggestion and put this functionality into net core so that drivers wanting to use software interrupt coalescing per default don't have to open-code it. Note that this function needs to be called before the netdevice is registered. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c1c5097 |
|
15-Nov-2022 |
Eric Dumazet <edumazet@google.com> |
net: add atomic_long_t to net_device_stats fields Long standing KCSAN issues are caused by data-race around some dev->stats changes. Most performance critical paths already use per-cpu variables, or per-queue ones. It is reasonable (and more correct) to use atomic operations for the slow paths. This patch adds an union for each field of net_device_stats, so that we can convert paths that are not yet protected by a spinlock or a mutex. netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64 Note that the memcpy() we were using on 64bit arches had no provision to avoid load-tearing, while atomic_long_read() is providing the needed protection at no cost. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14d898f3 |
|
08-Nov-2022 |
Toke Høiland-Jørgensen <toke@redhat.com> |
dev: Move received_rps counter next to RPS members in softnet data Move the received_rps counter value next to the other RPS-related members in softnet_data. This closes two four-byte holes in the structure, making room for another pointer in the first two cache lines without bumping the xmit struct to its own line. Acked-by: Song Liu <song@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221108140601.149971-2-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
3e52fba0 |
|
08-Nov-2022 |
Jiri Pirko <jiri@nvidia.com> |
net: introduce a helper to move notifier block to different namespace Currently, net_dev() netdev notifier variant follows the netdev with per-net notifier from namespace to namespace. This is implemented by move_netdevice_notifiers_dev_net() helper. For devlink it is needed to re-register per-net notifier during devlink reload. Introduce a new helper called move_netdevice_notifier_net() and share the unregister/register code with existing move_netdevice_notifiers_dev_net() helper. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
bd039b5e |
|
07-Nov-2022 |
Andy Ren <andy.ren@getcruise.com> |
net/core: Allow live renaming when an interface is up Allow a network interface to be renamed when the interface is up. As described in the netconsole documentation [1], when netconsole is used as a built-in, it will bring up the specified interface as soon as possible. As a result, user space will not be able to rename the interface since the kernel disallows renaming of interfaces that are administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set by the kernel. The original solution [2] to this problem was to add a new parameter to the netconsole configuration parameters that allows renaming of the interface used by netconsole while it is administratively up. However, during the discussion that followed, it became apparent that we have no reason to keep the current restriction and instead we should allow user space to rename interfaces regardless of their administrative state: 1. The restriction was put in place over 20 years ago when renaming was only possible via IOCTL and before rtnetlink started notifying user space about such changes like it does today. 2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version 5.2 and no regressions were reported. 3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about the administrative state of interface. Therefore, allow user space to rename running interfaces by removing the restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in possible triage by emitting a message to the kernel log that an interface was renamed while UP. [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/ Signed-off-by: Andy Ren <andy.ren@getcruise.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77df1db8 |
|
02-Nov-2022 |
Jiri Pirko <jiri@nvidia.com> |
net: remove unused ndo_get_devlink_port Remove ndo_get_devlink_port which is no longer used alongside with the implementations in drivers. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
02a68a47 |
|
02-Nov-2022 |
Jiri Pirko <jiri@nvidia.com> |
net: devlink: track netdev with devlink_port assigned Currently, ethernet drivers are using devlink_port_type_eth_set() and devlink_port_type_clear() to set devlink port type and link to related netdev. Instead of calling them directly, let the driver use SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let devlink to track it. Note the devlink port pointer is static during the time netdevice is registered. In devlink code, use per-namespace netdev notifier to track the netdevices with devlink_port assigned and change the internal devlink_port type and related type pointer accordingly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1d997f10 |
|
28-Oct-2022 |
Hangbin Liu <liuhangbin@gmail.com> |
rtnetlink: pass netlink message header and portid to rtnl_configure_link() This patch pass netlink message header and portid to rtnl_configure_link() All the functions in this call chain need to add the parameters so we can use them in the last call rtnl_notify(), and notify the userspace about the new link info if NLM_F_ECHO flag is set. - rtnl_configure_link() - __dev_notify_flags() - rtmsg_ifinfo() - rtmsg_ifinfo_event() - rtmsg_ifinfo_build_skb() - rtmsg_ifinfo_send() - rtnl_notify() Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub suggested. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f3fb589a |
|
27-Oct-2022 |
Juhee Kang <claudiajkang@gmail.com> |
net: remove unused netdev_unregistering() Currently, use dev->reg_state == NETREG_UNREGISTERING to check the status which is NETREG_UNREGISTERING, rather than using netdev_unregistering. Also, A helper function which is netdev_unregistering on nedevice.h is no longer used. Thus, netdev_unregistering removes from netdevice.h. Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fc8695eb |
|
14-Oct-2022 |
Jakub Kicinski <kuba@kernel.org> |
Revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}" This reverts commit 854701ba4c39afae2362ba19a580c461cb183e4f. We have more violations around, which leads to: WARNING: CPU: 2 PID: 1 at include/linux/cpumask.h:110 __netif_set_xps_queue+0x14e/0x770 Let's back this out and retry with a larger clean up in -next. Fixes: 854701ba4c39 ("net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}") Link: https://lore.kernel.org/all/20221014030459.3272206-2-guoren@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
854701ba |
|
19-Sep-2022 |
Yury Norov <yury.norov@gmail.com> |
net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and} The functions require to be passed with a cpu index prior to one that is the first to start search, so the valid input range is [-1, nr_cpu_ids-1). However, the code checks against [-1, nr_cpu_ids). Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yury Norov <yury.norov@gmail.com>
|
#
aac4daa8 |
|
27-Sep-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: query offload capabilities through ndo_setup_tc() When adding optional new features to Qdisc offloads, existing drivers must reject the new configuration until they are coded up to act on it. Since modifying all drivers in lockstep with the changes in the Qdisc can create problems of its own, it would be nice if there existed an automatic opt-in mechanism for offloading optional features. Jakub proposes that we multiplex one more kind of call through ndo_setup_tc(): one where the driver populates a Qdisc-specific capability structure. First user will be taprio in further changes. Here we are introducing the definitions for the base functionality. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220923163310.3192733-3-vladimir.oltean@nxp.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
dbae2b06 |
|
28-Sep-2022 |
Paolo Abeni <pabeni@redhat.com> |
net: skb: introduce and use a single page frag cache After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
b48b89f9 |
|
27-Sep-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: drop the weight argument from netif_napi_add We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight(). Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
db01868b |
|
10-Sep-2022 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: introduce iterators over synced hw addresses Some network drivers use __dev_mc_sync()/__dev_uc_sync() and therefore program the hardware only with addresses with a non-zero sync_cnt. Some of the above drivers also need to save/restore the address filtering lists when certain events happen, and they need to walk through the struct net_device :: uc and struct net_device :: mc lists. But these lists contain unsynced addresses too. To keep the appearance of an elementary form of data encapsulation, provide iterators through these lists that only look at entries with a non-zero sync_cnt, instead of filtering entries out from device drivers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
c3f760ef |
|
31-Aug-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: remove netif_tx_napi_add() All callers are now gone. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
977f1aa5 |
|
31-Aug-2022 |
Eric Dumazet <edumazet@google.com> |
net: bql: add more documentation Add some documentation for netdev_tx_sent_queue() and netdev_tx_completed_queue() Stating that netdev_tx_completed_queue() must be called once per TX completion round is apparently not obvious for everybody. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b9030780 |
|
22-Aug-2022 |
Uros Bizjak <ubizjak@gmail.com> |
netdev: Use try_cmpxchg in napi_if_scheduled_mark_missed Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in napi_if_scheduled_mark_missed. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20220822143243.2798-1-ubizjak@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1202cdd6 |
|
17-Aug-2022 |
Stephen Hemminger <stephen@networkplumber.org> |
Remove DECnet support from kernel DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux kernel. It has been "Orphaned" in kernel since 2010. The iproute2 support for DECnet was dropped in 5.0 release. The documentation link on Sourceforge says it is abandoned there as well. Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET. The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Ahern <dsahern@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a5612ca1 |
|
23-Aug-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
net: Fix data-races around sysctl_devconf_inherit_init_net. While reading sysctl_devconf_inherit_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 856c395cfa63 ("net: introduce a knob to control whether to inherit devconf config") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
af67508e |
|
23-Aug-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
net: Fix data-races around sysctl_fb_tunnels_only_for_init_net. While reading sysctl_fb_tunnels_only_for_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9962acef |
|
08-Jun-2022 |
Eric Dumazet <edumazet@google.com> |
net: adopt u64_stats_t in struct pcpu_sw_netstats As explained in commit 316580b69d0a ("u64_stats: provide u64_stats_t type") we should use u64_stats_t and related accessors to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d62607c3 |
|
07-Jun-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: rename reference+tracking helpers Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn but we can rename the tracking ones which are relatively recent and should be the default for new code. Rename: dev_hold_track() -> netdev_hold() dev_put_track() -> netdev_put() dev_replace_track() -> netdev_ref_replace() Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
3b89b511 |
|
23-Jun-2022 |
Dan Carpenter <dan.carpenter@oracle.com> |
net: fix IFF_TX_SKB_NO_LINEAR definition The "1<<31" shift has a sign extension bug so IFF_TX_SKB_NO_LINEAR is 0xffffffff80000000 instead of 0x0000000080000000. Fixes: c2ff53d8049f ("net: Add priv_flags for allow tx skb without linear") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/YrRrcGttfEVnf85Q@kili Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c304eddc |
|
19-May-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: wrap the wireless pointers in struct net_device in an ifdef Most protocol-specific pointers in struct net_device are under a respective ifdef. Wireless is the notable exception. Since there's a sizable number of custom-built kernels for datacenter workloads which don't build wireless it seems reasonable to ifdefy those pointers as well. While at it move IPv4 and IPv6 pointers up, those are special for obvious reasons. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154 Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97e719a8 |
|
15-May-2022 |
Eric Dumazet <edumazet@google.com> |
net: fix possible race in skb_attempt_defer_free() A cpu can observe sd->defer_count reaching 128, and call smp_call_function_single_async() Problem is that the remote CPU can clear sd->defer_count before the IPI is run/acknowledged. Other cpus can queue more packets and also decide to call smp_call_function_single_async() while the pending IPI was not yet delivered. This is a common issue with smp_call_function_single_async(). Callers must ensure correct synchronization and serialization. I triggered this issue while experimenting smaller threshold. Performing the call to smp_call_function_single_async() under sd->defer_lock protection did not solve the problem. Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async() to insert locked csd") replaced an informative WARN_ON_ONCE() with a return of -EBUSY, which is often ignored. Test of CSD_FLAG_LOCK presence is racy anyway. Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0fe79f28 |
|
13-May-2022 |
Alexander Duyck <alexanderduyck@fb.com> |
net: allow gro_max_size to exceed 65536 Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34b92e8d |
|
13-May-2022 |
Eric Dumazet <edumazet@google.com> |
net: limit GSO_MAX_SIZE to 524280 bytes Make sure we will not overflow shinfo->gso_segs Minimal TCP MSS size is 8 bytes, and shinfo->gso_segs is a 16bit field. TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h, it seems cleaner to not bring tcp details into include/linux/netdevice.h Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c4e983c |
|
13-May-2022 |
Alexander Duyck <alexanderduyck@fb.com> |
net: allow gso_max_size to exceed 65536 The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b87be9e |
|
09-May-2022 |
Eric Dumazet <edumazet@google.com> |
net: add include/net/net_debug.h Remove from include/linux/netdevice.h helpers that send debug/info/warnings to syslog. We plan adding more helpers in following patches. v2: added two includes, and 'struct net_device' forward declaration to avoid compile errors (kernel bots) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97dc7cd9 |
|
06-May-2022 |
Gerhard Engleder <gerhard@engleder-embedded.com> |
ptp: Support late timestamp determination If a physical clock supports a free running cycle counter, then timestamps shall be based on this time too. For TX it is known in advance before the transmission if a timestamp based on the free running cycle counter is needed. For RX it is impossible to know which timestamp is needed before the packet is received and assigned to a socket. Support late timestamp determination by a network device. Therefore, an address/cookie is stored within the new netdev_data field of struct skb_shared_hwtstamps. This address/cookie is provided to a new network device function called ndo_get_tstamp(), which returns a timestamp based on the normal/adjustable time or based on the free running cycle counter. If function is not supported, then timestamp handling is not changed. This mechanism is intended for RX, but TX use is also possible. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
ca4567f1 |
|
05-May-2022 |
Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com> |
rtnetlink: add extack support in fdb del handlers Add extack support to .ndo_fdb_del in netdevice.h and all related methods. Signed-off-by: Alaa Mohamed <eng.alaamohamedsoliman.am@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
744d49da |
|
05-May-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: move netif_set_gso_max helpers These are now internal to the core, no need to expose them. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14d7b812 |
|
05-May-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: don't allow user space to lift the device limits Up until commit 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") the gso_max_segs and gso_max_size of a device were not controlled from user space. The quoted commit added the ability to control them because of the following setup: netns A | netns B veth<->veth eth0 If eth0 has TSO limitations and user wants to efficiently forward traffic between eth0 and the veths they should copy the TSO limitations of eth0 onto the veths. This would happen automatically for macvlans or ipvlan but veth users are not so lucky (given the loose coupling). Unfortunately the commit in question allowed users to also override the limits on real HW devices. It may be useful to control the max GSO size and someone may be using that ability (not that I know of any user), so create a separate set of knobs to reliably record the TSO limitations. Validate the user requests. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6df6398f |
|
05-May-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: add netif_inherit_tso_max() To make later patches smaller create a helper for inheriting the TSO limitations of a lower device. The TSO in the name is not an accident, subsequent patches will replace GSO with TSO in more names. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
58caed3d |
|
02-May-2022 |
Jakub Kicinski <kuba@kernel.org> |
netdev: reshuffle netif_napi_add() APIs to allow dropping weight Most drivers should not have to worry about selecting the right weight for their NAPI instances and pass NAPI_POLL_WEIGHT. It'd be best if we didn't require the argument at all and selected the default internally. This change prepares the ground for such reshuffling, allowing for a smooth transition. The following API should remain after the next release cycle: netif_napi_add() netif_napi_add_weight() netif_napi_add_tx() netif_napi_add_tx_weight() Where the _weight() variants take an explicit weight argument. I opted for a _weight() suffix rather than a __ prefix, because we use __ in places to mean that caller needs to also issue a synchronize_net() call. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c526fd8f |
|
28-Apr-2022 |
Pavel Begunkov <asml.silence@gmail.com> |
net: inline dev_queue_xmit() Inline dev_queue_xmit() and dev_queue_xmit_accel(), they both are small proxy functions doing nothing but redirecting the control flow to __dev_queue_xmit(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
68822bdf |
|
22-Apr-2022 |
Eric Dumazet <edumazet@google.com> |
net: generalize skb freeing deferral to per-cpu lists Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions (*) 57.81% [kernel] [k] copy_user_enhanced_fast_string (*) 12.87% [kernel] [k] skb_release_data (*) 4.25% [kernel] [k] __free_one_page (*) 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter (*) 1.59% [kernel] [k] free_unref_page_commit (*) 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter (*) 1.01% [kernel] [k] kfree (*) 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue (*) 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog (*) 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked (*) 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq (*) 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv (*) 0.33% [kernel] [k] __put_page (*) 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2f1e85b1 |
|
15-Apr-2022 |
Tonghao Zhang <xiangxia.m.yue@gmail.com> |
net: sched: use queue_mapping to pick tx queue This patch fixes issue: * If we install tc filters with act_skbedit in clsact hook. It doesn't work, because netdev_core_pick_tx() overwrites queue_mapping. $ tc filter ... action skbedit queue_mapping 1 And this patch is useful: * We can use FQ + EDT to implement efficient policies. Tx queues are picked by xps, ndo_select_queue of netdev driver, or skb hash in netdev_core_pick_tx(). In fact, the netdev driver, and skb hash are _not_ under control. xps uses the CPUs map to select Tx queues, but we can't figure out which task_struct of pod/containter running on this cpu in most case. We can use clsact filters to classify one pod/container traffic to one Tx queue. Why ? In containter networking environment, there are two kinds of pod/ containter/net-namespace. One kind (e.g. P1, P2), the high throughput is key in these applications. But avoid running out of network resource, the outbound traffic of these pods is limited, using or sharing one dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods (e.g. Pn), the low latency of data access is key. And the traffic is not limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc. This choice provides two benefits. First, contention on the HTB/FQ Qdisc lock is significantly reduced since fewer CPUs contend for the same queue. More importantly, Qdisc contention can be eliminated completely if each CPU has its own FIFO Qdisc for the second kind of pods. There must be a mechanism in place to support classifying traffic based on pods/container to different Tx queues. Note that clsact is outside of Qdisc while Qdisc can run a classifier to select a sub-queue under the lock. In general recording the decision in the skb seems a little heavy handed. This patch introduces a per-CPU variable, suggested by Eric. The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit(). - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag is set in qdisc->enqueue() though tx queue has been selected in netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared firstly in __dev_queue_xmit(), is useful: - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy: For example, eth0, macvlan in pod, which root Qdisc install skbedit queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping because there is no filters in clsact or tx Qdisc of this netdev. Same action taked in eth0, ixgbe in Host. - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it in __dev_queue_xmit when processing next packets. For performance reasons, use the static key. If user does not config the NET_EGRESS, the patch will not be compiled. +----+ +----+ +----+ | P1 | | P2 | | Pn | +----+ +----+ +----+ | | | +-----------+-----------+ | | clsact/skbedit | MQ v +-----------+-----------+ | q0 | q1 | qn v v v HTB/FQ HTB/FQ ... FIFO Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Talal Ahmad <talalahmad@google.com> Cc: Kevin Hao <haokexin@gmail.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Antoine Tenart <atenart@kernel.org> Cc: Wei Wang <weiwan@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
1306d536 |
|
13-Apr-2022 |
Nikolay Aleksandrov <razor@blackwall.org> |
net: add ndo_fdb_del_bulk Add a new netdev op called ndo_fdb_del_bulk, it will be later used for driver-specific bulk delete implementation dispatched from rtnetlink. The first user will be the bridge, we need it to signal to rtnetlink from the driver that we support bulk delete operation (NLM_F_BULK). Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
794c24e9 |
|
06-Apr-2022 |
Jeffrey Ji <jeffreyji@google.com> |
net-core: rx_otherhost_dropped to core_stats Increment rx_otherhost_dropped counter when packet dropped due to mismatched dest MAC addr. An example when this drop can occur is when manually crafting raw packets that will be consumed by a user space application via a tap device. For testing purposes local traffic was generated using trafgen for the client and netcat to start a server Tested: Created 2 netns, sent 1 packet using trafgen from 1 to the other with "{eth(daddr=$INCORRECT_MAC...}", verified that iproute2 showed the counter was incremented. (Also had to modify iproute2 to show the stat, additional patch for that coming next.) Signed-off-by: Jeffrey Ji <jeffreyji@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220406172600.1141083-1-jeffreyjilinux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
6264f58c |
|
06-Apr-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: extract a few internals from netdevice.h There's a number of functions and static variables used under net/core/ but not from the outside. We currently dump most of them into netdevice.h. That bad for many reasons: - netdevice.h is very cluttered, hard to figure out what the APIs are; - netdevice.h is very long; - we have to touch netdevice.h more which causes expensive incremental builds. Create a header under net/core/ and move some declarations. The new header is also a bit of a catch-all but that's fine, if we create more specific headers people will likely over-think where their declaration fit best. And end up putting them in netdevice.h, again. More work should be done on splitting netdevice.h into more targeted headers, but that'd be more time consuming so small steps. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
a333215e |
|
05-Apr-2022 |
Felix Fietkau <nbd@nbd.name> |
net: ethernet: mtk_eth_soc: implement flow offloading to WED devices This allows hardware flow offloading from Ethernet to WLAN on MT7622 SoC Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0b5c21bb |
|
04-Apr-2022 |
Johannes Berg <johannes.berg@intel.com> |
net: ensure net_todo_list is processed quickly In [1], Will raised a potential issue that the cfg80211 code, which does (from a locking perspective) rtnl_lock() wiphy_lock() rtnl_unlock() might be suspectible to ABBA deadlocks, because rtnl_unlock() calls netdev_run_todo(), which might end up calling rtnl_lock() again, which could then deadlock (see the comment in the code added here for the scenario). Some back and forth and thinking ensued, but clearly this can't happen if the net_todo_list is empty at the rtnl_unlock() here. Clearly, the code here cannot actually put an entry on it, and all other users of rtnl_unlock() will empty it since that will always go through netdev_run_todo(), emptying the list. So the only other way to get there would be to add to the list and then unlock the RTNL without going through rtnl_unlock(), which is only possible through __rtnl_unlock(). However, this isn't exported and not used in many places, and none of them seem to be able to unregister before using it. Therefore, add a WARN_ON() in the code to ensure this invariant won't be broken, so that the cfg80211 (or any similar) code stays safe. [1] https://lore.kernel.org/r/Yjzpo3TfZxtKPMAG@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20220404113847.0ee02e4a70da.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
cf2df74e |
|
09-May-2022 |
Felix Fietkau <nbd@nbd.name> |
net: fix dev_fill_forward_path with pppoe + bridge When calling dev_fill_forward_path on a pppoe device, the provided destination address is invalid. In order for the bridge fdb lookup to succeed, the pppoe code needs to update ctx->daddr to the correct value. Fix this by storing the address inside struct net_device_path_ctx Fixes: f6efc675c9dd ("net: ppp: resolve forwarding path for bridge pppoe devices") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
6510ea97 |
|
25-Apr-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
net: Use this_cpu_inc() to increment net->core_stats The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes netdev_core_stats_alloc() to return a per-CPU pointer. netdev_core_stats_alloc() will allocate memory on its first invocation which breaks on PREEMPT_RT because it requires non-atomic context for memory allocation. This can be avoided by enabling preemption in netdev_core_stats_alloc() assuming the caller always disables preemption. It might be better to replace local_inc() with this_cpu_inc() now that dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does not rely on already disabled preemption. This results in less instructions on x86-64: local_inc: | incl %gs:__preempt_count(%rip) # __preempt_count | movq 488(%rdi), %rax # _1->core_stats, _22 | testq %rax, %rax # _22 | je .L585 #, | add %gs:this_cpu_off(%rip), %rax # this_cpu_off, tcp_ptr__ | .L586: | testq %rax, %rax # _27 | je .L587 #, | incq (%rax) # _6->a.counter | .L587: | decl %gs:__preempt_count(%rip) # __preempt_count this_cpu_inc(), this patch: | movq 488(%rdi), %rax # _1->core_stats, _5 | testq %rax, %rax # _5 | je .L591 #, | .L585: | incq %gs:(%rax) # _18->rx_dropped Use unsigned long as type for the counter. Use this_cpu_inc() to increment the counter. Use a plain read of the counter. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f32404ae |
|
25-Mar-2022 |
Johannes Berg <johannes.berg@intel.com> |
net: move net_unlink_todo() out of the header There's no reason for this to be in netdevice.h, it's all just used in dev.c. Also make it no longer inline and let the compiler decide to do that by itself. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20220325225023.f49b9056fe1c.I6b901a2df00000837a9bd251a8dd259bd23f5ded@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4a0cb83b |
|
21-Mar-2022 |
Jakub Kicinski <kuba@kernel.org> |
netdevice: add missing dm_private kdoc Building htmldocs complains: include/linux/netdevice.h:2295: warning: Function parameter or member 'dm_private' not described in 'net_device' Fixes: b26ef81c46ed ("drop_monitor: remove quadratic behavior") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220322051053.1883186-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
#
046e1537 |
|
15-Mar-2022 |
Íñigo Huguet <ihuguet@redhat.com> |
net: set default rss queues num to physical cores / 2 Network drivers can call to netif_get_num_default_rss_queues to get the default number of receive queues to use. Right now, this default number is min(8, num_online_cpus()). Instead, as suggested by Jakub, use the number of physical cores divided by 2 as a way to avoid wasting CPU resources and to avoid using both CPU threads, but still allowing to scale for high-end processors with many cores. As an exception, select 2 queues for processors with 2 cores, because otherwise it won't take any advantage of RSS despite being SMP capable. Tested: Processor Intel Xeon E5-2620 (2 sockets, 6 cores/socket, 2 threads/core). NIC Broadcom NetXtreme II BCM57810 (10GBps). Ran some tests with `perf stat iperf3 -R`, with parallelisms of 1, 8 and 24, getting the following results: - Number of queues: 6 (instead of 8) - Network throughput: not affected - CPU usage: utilized 0.05-0.12 CPUs more than before (having 24 CPUs this is only 0.2-0.5% higher) - Reduced the number of context switches by 7-50%, being more noticeable when using a higher number of parallel threads. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> Link: https://lore.kernel.org/r/20220315091832.13873-1-ihuguet@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
fc93db15 |
|
12-Mar-2022 |
Eric Dumazet <edumazet@google.com> |
net: disable preemption in dev_core_stats_XXX_inc() helpers syzbot was kind enough to remind us that dev->{tx_dropped|rx_dropped} could be increased in process context. BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor413/3593 caller is netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298 CPU: 1 PID: 3593 Comm: syz-executor413 Not tainted 5.17.0-rc7-syzkaller-02426-g97aeb877de7f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 check_preemption_disabled+0x16b/0x170 lib/smp_processor_id.c:49 netdev_core_stats_alloc+0x98/0x110 net/core/dev.c:10298 dev_core_stats include/linux/netdevice.h:3855 [inline] dev_core_stats_rx_dropped_inc include/linux/netdevice.h:3866 [inline] tun_get_user+0x3455/0x3ab0 drivers/net/tun.c:1800 tun_chr_write_iter+0xe1/0x200 drivers/net/tun.c:2015 call_write_iter include/linux/fs.h:2074 [inline] new_sync_write+0x431/0x660 fs/read_write.c:503 vfs_write+0x7cd/0xae0 fs/read_write.c:590 ksys_write+0x12d/0x250 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2cf4f887e3 Code: 5d 41 5c 41 5d 41 5e e9 9b fd ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18 RSP: 002b:00007ffd50dd5fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007ffd50dd6000 RCX: 00007f2cf4f887e3 RDX: 000000000000002a RSI: 0000000000000000 RDI: 00000000000000c8 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd50dd5ff0 R14: 00007ffd50dd5fe8 R15: 00007ffd50dd5fe4 </TASK> Fixes: 625788b58445 ("net: add per-cpu storage and net->core_stats") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: jeffreyji <jeffreyji@google.com> Cc: Brian Vazquez <brianvv@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220312214505.3294762-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
625788b5 |
|
10-Mar-2022 |
Eric Dumazet <edumazet@google.com> |
net: add per-cpu storage and net->core_stats Before adding yet another possibly contended atomic_long_t, it is time to add per-cpu storage for existing ones: dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler Because many devices do not have to increment such counters, allocate the per-cpu storage on demand, so that dev_get_stats() does not have to spend considerable time folding zero counters. Note that some drivers have abused these counters which were supposed to be only used by core networking stack. v4: should use per_cpu_ptr() in dev_get_stats() (Jakub) v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo) v2: add a missing include (reported by kernel test robot <lkp@intel.com>) Change in netdev_core_stats_alloc() (Jakub) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: jeffreyji <jeffreyji@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2655926a |
|
06-Mar-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
net: Remove netif_rx_any_context() and netif_rx_ni(). Remove netif_rx_any_context and netif_rx_ni() because there are no more users in tree. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9309f97a |
|
02-Mar-2022 |
Petr Machata <petrm@nvidia.com> |
net: dev: Add hardware stats support Offloading switch device drivers may be able to collect statistics of the traffic taking place in the HW datapath that pertains to a certain soft netdevice, such as VLAN. Add the necessary infrastructure to allow exposing these statistics to the offloaded netdevice in question. The API was shaped by the following considerations: - Collection of HW statistics is not free: there may be a finite number of counters, and the act of counting may have a performance impact. It is therefore necessary to allow toggling whether HW counting should be done for any particular SW netdevice. - As the drivers are loaded and removed, a particular device may get offloaded and unoffloaded again. At the same time, the statistics values need to stay monotonic (modulo the eventual 64-bit wraparound), increasing only to reflect traffic measured in the device. To that end, the netdevice keeps around a lazily-allocated copy of struct rtnl_link_stats64. Device drivers then contribute to the values kept therein at various points. Even as the driver goes away, the struct stays around to maintain the statistics values. - Different HW devices may be able to count different things. The motivation behind this patch in particular is exposure of HW counters on Nvidia Spectrum switches, where the only practical approach to counting traffic on offloaded soft netdevices currently is to use router interface counters, and count L3 traffic. Correspondingly that is the statistics suite added in this patch. Other devices may be able to measure different kinds of traffic, and for that reason, the APIs are built to allow uniform access to different statistics suites. - Because soft netdevices and offloading drivers are only loosely bound, a netdevice uses a notifier chain to communicate with the drivers. Several new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages to the offloading drivers. - Devices can have various conditions for when a particular counter is available. As the device is configured and reconfigured, the device offload may become or cease being suitable for counter binding. A netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to ping offloading drivers and determine whether anyone currently implements a given statistics suite. This information can then be propagated to user space. When the driver decides to unoffload a netdevice, it can use a newly-added function, netdev_offload_xstats_report_delta(), to record outstanding collected statistics, before destroying the HW counter. This patch adds a helper, call_netdevice_notifiers_info_robust(), for dispatching a notifier with the possibility of unwind when one of the consumers bails. Given the wish to eventually get rid of the global notifier block altogether, this helper only invokes the per-netns notifier block. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b26ef81c |
|
22-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
drop_monitor: remove quadratic behavior drop_monitor is using an unique list on which all netdevices in the host have an element, regardless of their netns. This scales poorly, not only at device unregister time (what I caught during my netns dismantle stress tests), but also at packet processing time whenever trace_napi_poll_hit() is called. If the intent was to avoid adding one pointer in 'struct net_device' then surely we prefer O(1) behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
baebdf48 |
|
11-Feb-2022 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
net: dev: Makes sure netif_rx() can be invoked in any context. Dave suggested a while ago (eleven years by now) "Let's make netif_rx() work in all contexts and get rid of netif_rx_ni()". Eric agreed and pointed out that modern devices should use netif_receive_skb() to avoid the overhead. In the meantime someone added another variant, netif_rx_any_context(), which behaves as suggested. netif_rx() must be invoked with disabled bottom halves to ensure that pending softirqs, which were raised within the function, are handled. netif_rx_ni() can be invoked only from process context (bottom halves must be enabled) because the function handles pending softirqs without checking if bottom halves were disabled or not. netif_rx_any_context() invokes on the former functions by checking in_interrupts(). netif_rx() could be taught to handle both cases (disabled and enabled bottom halves) by simply disabling bottom halves while invoking netif_rx_internal(). The local_bh_enable() invocation will then invoke pending softirqs only if the BH-disable counter drops to zero. Eric is concerned about the overhead of BH-disable+enable especially in regard to the loopback driver. As critical as this driver is, it will receive a shortcut to avoid the additional overhead which is not needed. Add a local_bh_disable() section in netif_rx() to ensure softirqs are handled if needed. Provide __netif_rx() which does not disable BH and has a lockdep assert to ensure that interrupts are disabled. Use this shortcut in the loopback driver and in drivers/net/*.c. Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they can be removed once they are no more users left. Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b2309a71 |
|
07-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
net: add dev->dev_registered_tracker Convert one dev_hold()/dev_put() pair in register_netdevice() and unregister_netdevice_many() to dev_hold_track() and dev_put_track(). This would allow to detect a rogue dev_put() a bit earlier. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4c6c11ea |
|
04-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
net: refine dev_put()/dev_hold() debugging We are still chasing some syzbot reports where we think a rogue dev_put() is called with no corresponding prior dev_hold(). Unfortunately it eats a reference on dev->dev_refcnt taken by innocent dev_hold_track(), meaning that the refcount saturation splat comes too late to be useful. Make sure that 'not tracked' dev_put() and dev_hold() better use CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure: Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free() to be called with a NULL @trackerp parameter, and to use a separate refcount only to detect too many put() even in the following case: dev_hold_track(dev, tracker_1, GFP_ATOMIC); dev_hold(dev); dev_put(dev); dev_put(dev); // Should complain loudly here. dev_put_track(dev, tracker_1); // instead of here Add clarification about netdev_tracker_alloc() role. v2: I replaced the dev_put() in linkwatch_do_dev() with __dev_put() because callers called netdev_tracker_free(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
23c7f8d7 |
|
07-Mar-2022 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Fix esp GSO on inter address family tunnels. The esp tunnel GSO handlers use skb_mac_gso_segment to push the inner packet to the segmentation handlers. However, skb_mac_gso_segment takes the Ethernet Protocol ID from 'skb->protocol' which is wrong for inter address family tunnels. We fix this by introducing a new skb_eth_gso_segment function. This function can be used if it is necessary to pass the Ethernet Protocol ID directly to the segmentation handler. First users of this function will be the esp4 and esp6 tunnel segmentation handlers. Fixes: c35fe4106b92 ("xfrm: Add mode handlers for IPsec on layer 2") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
5891cd5e |
|
11-Feb-2022 |
Eric Dumazet <edumazet@google.com> |
net_sched: add __rcu annotation to netdev->qdisc syzbot found a data-race [1] which lead me to add __rcu annotations to netdev->qdisc, and proper accessors to get LOCKDEP support. [1] BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1: attach_default_qdiscs net/sched/sch_generic.c:1167 [inline] dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221 __dev_open+0x2e9/0x3a0 net/core/dev.c:1416 __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139 rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150 __rtnl_newlink net/core/rtnetlink.c:3489 [inline] rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529 rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0: qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323 __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050 tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211 rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e1383-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 470502de5bdb ("net: sched: unlock rules update API") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@mellanox.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
47934e06 |
|
18-Jan-2022 |
Congyu Liu <liu3101@purdue.edu> |
net: fix information leakage in /proc/net/ptype In one net namespace, after creating a packet socket without binding it to a device, users in other net namespaces can observe the new `packet_type` added by this packet socket by reading `/proc/net/ptype` file. This is minor information leakage as packet socket is namespace aware. Add a net pointer in `packet_type` to keep the net namespace of of corresponding packet socket. In `ptype_seq_show`, this net pointer must be checked when it is not NULL. Fixes: 2feb27dbe00c ("[NETNS]: Minor information leak via /proc/net/ptype file.") Signed-off-by: Congyu Liu <liu3101@purdue.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eac1b93c |
|
05-Jan-2022 |
Coco Li <lixiaoyan@google.com> |
gro: add ability to control gro max packet size Eric Dumazet suggested to allow users to modify max GRO packet size. We have seen GRO being disabled by users of appliances (such as wifi access points) because of claimed bufferbloat issues, or some work arounds in sch_cake, to split GRO/GSO packets. Instead of disabling GRO completely, one can chose to limit the maximum packet size of GRO packets, depending on their latency constraints. This patch adds a per device gro_max_size attribute that can be changed with ip link command. ip link set dev eth0 gro_max_size 16000 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d6c6d0bb |
|
28-Dec-2021 |
Lukas Bulwahn <lukas.bulwahn@gmail.com> |
net: remove references to CONFIG_IRDA in network header files Commit d64c2a76123f ("staging: irda: remove the irda network stack and drivers") removes the config IRDA. Remove the remaining references to this non-existing config in the network header files. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20211229113620.19368-1-lukas.bulwahn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
8cbfe939 |
|
17-Dec-2021 |
Baowen Zheng <baowen.zheng@corigine.com> |
flow_offload: allow user to offload tc action to net device Use flow_indr_dev_register/flow_indr_dev_setup_offload to offload tc action. We need to call tc_cleanup_flow_action to clean up tc action entry since in tc_setup_action, some actions may hold dev refcnt, especially the mirror action. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f1d9268e |
|
14-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: add net device refcount tracker to struct packet_type Most notable changes are in af_packet, tipc ones are trivial. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9280ac2e |
|
14-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: dev_replace_track() cleanup Use existing helpers (netdev_tracker_free() and netdev_tracker_alloc()) to remove ifdefery. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211214151515.312535-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
9ba74e6c |
|
10-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: add networking namespace refcount tracker We have 100+ syzbot reports about netns being dismantled too soon, still unresolved as of today. We think a missing get_net() or an extra put_net() is the root cause. In order to find the bug(s), and be able to spot future ones, this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers to precisely pair all put_net() with corresponding get_net(). To use these helpers, each data structure owning a refcount should also use a "netns_tracker" to pair the get and put. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
f12bf6f3 |
|
06-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: watchdog: add net device refcount tracker Add a netdevice_tracker inside struct net_device, to track the self reference when a device has an active watchdog timer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
63f13937 |
|
04-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: linkwatch: add net device refcount tracker Add a netdevice_tracker inside struct net_device, to track the self reference when a device is in lweventlist. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
9038c320 |
|
04-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: dst: add net device refcount tracking to dst_entry We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
0b688f24 |
|
04-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: add net device refcount tracker to struct netdev_queue This will help debugging pesky netdev reference leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
80e8921b |
|
04-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: add net device refcount tracker to struct netdev_rx_queue This helps debugging net device refcount leaks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
4d92b95f |
|
04-Dec-2021 |
Eric Dumazet <edumazet@google.com> |
net: add net device refcount tracker infrastructure net device are refcounted. Over the years we had numerous bugs caused by imbalanced dev_hold() and dev_put() calls. The general idea is to be able to precisely pair each decrement with a corresponding prior increment. Both share a cookie, basically a pointer to private data storing stack traces. This patch adds dev_hold_track() and dev_put_track(). To use these helpers, each data structure owning a refcount should also use a "netdevice_tracker" to pair the hold and put. netdevice_tracker dev_tracker; ... dev_hold_track(dev, &dev_tracker, GFP_ATOMIC); ... dev_put_track(dev, &dev_tracker); Whenever a leak happens, we will get precise stack traces of the point dev_hold_track() happened, at device dismantle phase. We will also get a stack trace if too many dev_put_track() for the same netdevice_tracker are attempted. This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
2106efda |
|
22-Nov-2021 |
Jakub Kicinski <kuba@kernel.org> |
net: remove .ndo_change_proto_down .ndo_change_proto_down was added seemingly to enable out-of-tree implementations. Over 2.5yrs later we still have no real users upstream. Hardwire the generic implementation for now, we can revert once real users materialize. (rocker is a test vehicle, not a user.) We need to drop the optimization on the sysfs side, because unlike ndos priv_flags will be changed at runtime, so we'd need READ_ONCE/WRITE_ONCE everywhere.. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6d872df3 |
|
19-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: annotate accesses to dev->gso_max_segs dev->gso_max_segs is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add netif_set_gso_max_segs() helper. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4b66d216 |
|
19-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: annotate accesses to dev->gso_max_size dev->gso_max_size is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size() where we can to better document what is going on. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d07b26f5 |
|
19-Nov-2021 |
Jakub Kicinski <kuba@kernel.org> |
dev_addr: add a modification check netdev->dev_addr should only be modified via helpers, but someone may be casting off the const. Add a runtime check to catch abuses. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
adeef3e3 |
|
19-Nov-2021 |
Jakub Kicinski <kuba@kernel.org> |
net: constify netdev->dev_addr Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. We converted all users to make modifications via appropriate helpers, make netdev->dev_addr const. The update helpers need to upcast from the buffer to struct netdev_hw_addr. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dab8fe32 |
|
16-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: do not inline netif_tx_lock()/netif_tx_unlock() These are not fast path, there is no point in inlining them. Also provide netif_freeze_queues()/netif_unfreeze_queues() so that we can use them from dev_watchdog() in the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5337824f |
|
16-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: annotate accesses to queue->trans_start In following patches, dev_watchdog() will no longer stop all queues. It will read queue->trans_start locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8160fb43 |
|
16-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: use an atomic_long_t for queue->trans_timeout tx_timeout_show() assumed dev_watchdog() would stop all the queues, to fetch queue->trans_timeout under protection of the queue->_xmit_lock. As we want to no longer disrupt transmits, we use an atomic_long_t instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: david decotigny <david.decotigny@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7071732c |
|
15-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: use .data.once section in netdev_level_once() Same rationale than prior patch : using the dedicated section avoid holes and pack all these bool values. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
587652bb |
|
15-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: gro: populate net/core/gro.c Move gro code and data from net/core/dev.c to net/core/gro.c to ease maintenance. gro_normal_list() and gro_normal_one() are inlined because they are called from both files. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e456a18a |
|
15-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: gro: move skb_gro_receive into net/core/gro.c net/core/gro.c will contain all core gro functions, to shrink net/core/skbuff.c and net/core/dev.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0b935d7f |
|
15-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: gro: move skb_gro_receive_list to udp_offload.c This helper is used once, no need to keep it in fat net/core/skbuff.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4721031c |
|
15-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: move gro definitions to include/net/gro.h include/linux/netdevice.h became too big, move gro stuff into include/net/gro.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b62e3317 |
|
16-Dec-2021 |
Xiang wangx <wangxiang@cdjrlc.com> |
net: fix typo in a comment The double 'as' in a comment is repeated, thus it should be removed. Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a10d8c8 |
|
30-Nov-2021 |
Eric Dumazet <edumazet@google.com> |
net: annotate data-races on txq->xmit_lock_owner syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner without annotations. No serious issue there, let's document what is happening there. BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0: __netif_tx_unlock include/linux/netdevice.h:4437 [inline] __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_hh_output include/net/neighbour.h:511 [inline] neigh_output include/net/neighbour.h:525 [inline] ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1: __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213 dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265 macvlan_queue_xmit drivers/net/macvlan.c:543 [inline] macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567 __netdev_start_xmit include/linux/netdevice.h:4987 [inline] netdev_start_xmit include/linux/netdevice.h:5001 [inline] xmit_one+0x105/0x2f0 net/core/dev.c:3590 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342 __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817 __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194 dev_queue_xmit+0x13/0x20 net/core/dev.c:4259 neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523 neigh_output include/net/neighbour.h:527 [inline] ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:450 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508 ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898 call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421 expire_timers+0x116/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x410 kernel/time/timer.c:1734 run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747 __do_softirq+0x158/0x2de kernel/softirq.c:558 __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x37/0x70 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443 folio_test_anon include/linux/page-flags.h:581 [inline] PageAnon include/linux/page-flags.h:586 [inline] zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347 zap_pmd_range mm/memory.c:1467 [inline] zap_pud_range mm/memory.c:1496 [inline] zap_p4d_range mm/memory.c:1517 [inline] unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538 unmap_single_vma+0x157/0x210 mm/memory.c:1583 unmap_vmas+0xd0/0x180 mm/memory.c:1615 exit_mmap+0x23d/0x470 mm/mmap.c:3170 __mmput+0x27/0x1b0 kernel/fork.c:1113 mmput+0x3d/0x50 kernel/fork.c:1134 exit_mm+0xdb/0x170 kernel/exit.c:507 do_exit+0x608/0x17a0 kernel/exit.c:819 do_group_exit+0xce/0x180 kernel/exit.c:929 get_signal+0xfc3/0x1550 kernel/signal.c:2852 arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300 do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffffff Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G W 5.16.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
29cbcd85 |
|
16-Oct-2021 |
Ahmed S. Darwish <a.darwish@linutronix.de> |
net: sched: Remove Qdisc::running sequence counter The Qdisc::running sequence counter has two uses: 1. Reliably reading qdisc's tc statistics while the qdisc is running (a seqcount read/retry loop at gnet_stats_add_basic()). 2. As a flag, indicating whether the qdisc in question is running (without any retry loops). For the first usage, the Qdisc::running sequence counter write section, qdisc_run_begin() => qdisc_run_end(), covers a much wider area than what is actually needed: the raw qdisc's bstats update. A u64_stats sync point was thus introduced (in previous commits) inside the bstats structure itself. A local u64_stats write section is then started and stopped for the bstats updates. Use that u64_stats sync point mechanism for the bstats read/retry loop at gnet_stats_add_basic(). For the second qdisc->running usage, a __QDISC_STATE_RUNNING bit flag, accessed with atomic bitops, is sufficient. Using a bit flag instead of a sequence counter at qdisc_run_begin/end() and qdisc_is_running() leads to the SMP barriers implicitly added through raw_read_seqcount() and write_seqcount_begin/end() getting removed. All call sites have been surveyed though, and no required ordering was identified. Now that the qdisc->running sequence counter is no longer used, remove it. Note, using u64_stats implies no sequence counter protection for 64-bit architectures. This can lead to the qdisc tc statistics "packets" vs. "bytes" values getting out of sync on rare occasions. The individual values will still be valid. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
42df6e1d |
|
08-Oct-2021 |
Lukas Wunner <lukas@wunner.de> |
netfilter: Introduce egress hook Support classifying packets with netfilter on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a follow-up commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) The egress hook introduced herein complements the ingress hook added by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. Alternatively or in addition to netfilter, packets can be classified with traffic control (tc). On ingress, packets are classified first by tc, then by netfilter. On egress, the order is reversed for symmetry. Conceptually, tc and netfilter can be thought of as layers, with netfilter layered above tc. Traffic control is capable of redirecting packets to another interface (man 8 tc-mirred). E.g., an ingress packet may be redirected from the host namespace to a container via a veth connection: tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) In this case, netfilter egress classifying is not performed when leaving the host namespace! That's because the packet is still on the tc layer. If tc redirects the packet to a physical interface in the host namespace such that it leaves the system, the packet is never subjected to netfilter egress classifying. That is only logical since it hasn't passed through netfilter ingress classifying either. Packets can alternatively be redirected at the netfilter layer using nft fwd. Such a packet *is* subjected to netfilter egress classifying since it has reached the netfilter layer. Internally, the skb->nf_skip_egress flag controls whether netfilter is invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may be called recursively by tunnel drivers such as vxlan, the flag is reverted to false after sch_handle_egress(). This ensures that netfilter is applied both on the overlay and underlying network. Interaction between tc and netfilter is possible by setting and querying skb->mark. If netfilter egress classifying is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise: Before: 1537 1538 1538 1537 1538 1537 Mb/sec After: 1536 1534 1539 1539 1539 1540 Mb/sec Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec When netfilter egress classifying is enabled on at least one interface, a minimal performance penalty is incurred for every egress packet, even if the interface it's transmitted over doesn't have any netfilter egress rules configured. That is caused by checking dev->nf_hooks_egress against NULL. Measurements were performed on a Core i7-3615QM. Commands to reproduce: ip link add dev foo type dummy ip link set dev foo up modprobe pktgen echo "add_device foo" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1 Accept all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' Drop all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,' Apply this patch when measuring packet drops to avoid errors in dmesg: https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/ Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Laura García Liébana <nevola@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
9974cb5c |
|
13-Oct-2021 |
Chen Wandun <chenwandun@huawei.com> |
net: delete redundant function declaration The implement of function netdev_all_upper_get_next_dev_rcu has been removed in: commit f1170fd462c6 ("net: Remove all_adj_list and its references") so delete redundant declaration in header file. Fixes: f1170fd462c6 ("net: Remove all_adj_list and its references") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Link: https://lore.kernel.org/r/20211013094702.3931071-1-chenwandun@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
40af35fd |
|
12-Oct-2021 |
Jakub Kicinski <kuba@kernel.org> |
netdevice: demote the type of some dev_addr_set() helpers __dev_addr_set() and dev_addr_mod() and pretty low level, let the arguments be void, there's no chance for confusion in callers converted to use them. Keep u8 in dev_addr_set() because some of the callers are converted from a loop and we want to make sure assignments are not from an array of a different type. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
01992152 |
|
09-Oct-2021 |
Juhee Kang <claudiajkang@gmail.com> |
mlxsw: spectrum: use netif_is_macsec() instead of open code Open code which is dev->priv_flags & IFF_MACSEC has already defined as netif_is_macsec(). So use netif_is_macsec() instead of open code. This patch doesn't change logic. Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
75ea27d0 |
|
07-Oct-2021 |
Antoine Tenart <atenart@kernel.org> |
net: introduce a function to check if a netdev name is in use __dev_get_by_name is currently used to either retrieve a net device reference using its name or to check if a name is already used by a registered net device (per ns). In the later case there is no need to return a reference to a net device. Introduce a new helper, netdev_name_in_use, to check if a name is currently used by a registered net device without leaking a reference the corresponding net device. This helper uses netdev_name_node_lookup instead of __dev_get_by_name as we don't need the extra logic retrieving a reference to the corresponding net device. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
48eab831 |
|
02-Sep-2021 |
Jakub Kicinski <kuba@kernel.org> |
net: create netdev->dev_addr assignment helpers Recent work on converting address list to a tree made it obvious we need an abstraction around writing netdev->dev_addr. Without such abstraction updating the main device address is invisible to the core. Introduce a number of helpers which for now just wrap memcpy() but in the future can make necessary changes to the address tree. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d0efb162 |
|
26-Aug-2021 |
Peter Collingbourne <pcc@google.com> |
net: don't unconditionally copy_from_user a struct ifreq for socket ioctls A common implementation of isatty(3) involves calling a ioctl passing a dummy struct argument and checking whether the syscall failed -- bionic and glibc use TCGETS (passing a struct termios), and musl uses TIOCGWINSZ (passing a struct winsize). If the FD is a socket, we will copy sizeof(struct ifreq) bytes of data from the argument and return -EFAULT if that fails. The result is that the isatty implementations may return a non-POSIX-compliant value in errno in the case where part of the dummy struct argument is inaccessible, as both struct termios and struct winsize are smaller than struct ifreq (at least on arm64). Although there is usually enough stack space following the argument on the stack that this did not present a practical problem up to now, with MTE stack instrumentation it's more likely for the copy to fail, as the memory following the struct may have a different tag. Fix the problem by adding an early check for whether the ioctl is a valid socket ioctl, and return -ENOTTY if it isn't. Fixes: 44c02a2c3dc5 ("dev_ioctl(): move copyin/copyout to callers") Link: https://linux-review.googlesource.com/id/I869da6cf6daabc3e4b7b82ac979683ba05e27d4d Signed-off-by: Peter Collingbourne <pcc@google.com> Cc: <stable@vger.kernel.org> # 4.19 Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
406f42fa |
|
19-Aug-2021 |
Gilad Naaman <gnaaman@drivenets.com> |
net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically. The source of most of the slow down is the `dev_addr_lists.c` module, which mainatins a linked list of HW addresses. When using IPv6, this list grows for each IPv6 address added on a VLAN, since each IPv6 address has a multicast HW address associated with it. When performing any modification to the involved links, this list is traversed many times, often for nothing, all while holding the RTNL lock. Instead, this patch adds an auxilliary rbtree which cuts down traversal time significantly. Performance can be seen with the following script: #!/bin/bash ip netns del test || true 2>/dev/null ip netns add test echo 1 | ip netns exec test tee /proc/sys/net/ipv6/conf/all/keep_addr_on_down > /dev/null set -e ip -n test link add foo type veth peer name bar ip -n test link add b1 type bond ip -n test link add florp type vrf table 10 ip -n test link set bar master b1 ip -n test link set foo up ip -n test link set bar up ip -n test link set b1 up ip -n test link set florp up VLAN_COUNT=1500 BASE_DEV=b1 echo Creating vlans ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test link add link $BASE_DEV name foo.\$i type vlan id \$i; done" echo Bringing them up ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test link set foo.\$i up; done" echo Assiging IPv6 Addresses ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test address add dev foo.\$i 2000::\$i/64; done" echo Attaching to VRF ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT); do ip -n test link set foo.\$i master florp; done" On an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz machine, the performance before the patch is (truncated): Creating vlans real 108.35 Bringing them up real 4.96 Assiging IPv6 Addresses real 19.22 Attaching to VRF real 458.84 After the patch: Creating vlans real 5.59 Bringing them up real 5.07 Assiging IPv6 Addresses real 5.64 Attaching to VRF real 25.37 Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Lu Wei <luwei32@huawei.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95d1d249 |
|
23-Aug-2021 |
Jakub Kicinski <kuba@kernel.org> |
netdevice: move xdp_rxq within netdev_rx_queue Both struct netdev_rx_queue and struct xdp_rxq_info are cacheline aligned. This causes extra padding before and after the xdp_rxq member. Move the member upfront, so that it's naturally aligned. Before: /* size: 256, cachelines: 4, members: 6 */ /* sum members: 160, holes: 1, sum holes: 40 */ /* padding: 56 */ /* paddings: 1, sum paddings: 36 */ /* forced alignments: 1, forced holes: 1, sum forced holes: 40 */ After: /* size: 192, cachelines: 3, members: 6 */ /* padding: 32 */ /* paddings: 1, sum paddings: 36 */ /* forced alignments: 1 */ Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/20210823180135.1153608-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
afa79d08 |
|
13-Aug-2021 |
Changbin Du <changbin.du@intel.com> |
net: in_irq() cleanup Replace the obsolete and ambiguos macro in_irq() with new macro in_hardirq(). Signed-off-by: Changbin Du <changbin.du@gmail.com> Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
879af96f |
|
30-Jul-2021 |
Jussi Maki <joamaki@gmail.com> |
net, core: Add support for XDP redirection to slave device This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX into XDP_REDIRECT after BPF program run when the ingress device is a bond slave. The dev_xdp_prog_count is exposed so that slave devices can be checked for loaded XDP programs in order to avoid the situation where both bond master and slave have programs loaded according to xdp_state. Signed-off-by: Jussi Maki <joamaki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
|
#
b37a4668 |
|
05-Aug-2021 |
Yajun Deng <yajun.deng@linux.dev> |
netdevice: add the case if dev is NULL Add the case if dev is NULL in dev_{put, hold}, so the caller doesn't need to care whether dev is NULL or not. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
271e5b7d |
|
03-Aug-2021 |
Jakub Kicinski <kuba@kernel.org> |
net: add netif_set_real_num_queues() for device reconfig netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues() can fail which breaks drivers trying to implement reconfiguration in a way that can't leave the device half-broken. In other words those functions are incompatible with prepare/commit approach. Luckily setting real number of queues can fail only if the number is increased, meaning that if we order operations correctly we can guarantee ending up with either new config (success), or the old one (on error). Provide a helper implementing such logic so that drivers don't have to duplicate it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5ea2f5ff |
|
03-Aug-2021 |
Arnd Bergmann <arnd@arndb.de> |
move netdev_boot_setup into Space.c This is now only used by a handful of old ISA drivers, and can be moved into the file they already all depend on. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
583be982 |
|
28-Jul-2021 |
Jeremy Kerr <jk@codeconstruct.com.au> |
mctp: Add device handling and netlink interface This change adds the infrastructure for managing MCTP netdevices; we add a pointer to the AF_MCTP-specific data to struct netdevice, and hook up the rtnetlink operations for adding and removing addresses. Includes changes from Matt Johnston <matt@codeconstruct.com.au>. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d9d00bd |
|
27-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
net: bonding: move ioctl handling to private ndo operation All other user triggered operations are gone from ndo_ioctl, so move the SIOCBOND family into a custom operation as well. The .ndo_ioctl() helper is no longer called by the dev_ioctl.c code now, but there are still a few definitions in obsolete wireless drivers as well as the appletalk and ieee802154 layers to call SIOCSIFADDR/SIOCGIFADDR helpers from inside the kernel. Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ad7eab2a |
|
27-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
net: split out ndo_siowandev ioctl In order to further reduce the scope of ndo_do_ioctl(), move out the SIOCWANDEV handling into a new network device operation function. Adjust the prototype to only pass the if_settings sub-structure in place of the ifreq, and remove the redundant 'cmd' argument in the process. Cc: Krzysztof Halasa <khc@pm.waw.pl> Cc: "Jan \"Yenya\" Kasprzak" <kas@fi.muni.cz> Cc: Kevin Curtis <kevin.curtis@farsite.co.uk> Cc: Zhao Qiang <qiang.zhao@nxp.com> Cc: Martin Schiller <ms@dev.tdt.de> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: linux-x25@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a7605370 |
|
27-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
dev_ioctl: split out ndo_eth_ioctl Most users of ndo_do_ioctl are ethernet drivers that implement the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP. Separate these from the few drivers that use ndo_do_ioctl to implement SIOCBOND, SIOCBR and SIOCWANDEV commands. This is a purely cosmetic change intended to help readers find their way through the implementation. Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Vladimir Oltean <olteanv@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: linux-rdma@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a554bf96 |
|
27-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
dev_ioctl: pass SIOCDEVPRIVATE data separately The compat handlers for SIOCDEVPRIVATE are incorrect for any driver that passes data as part of struct ifreq rather than as an ifr_data pointer, or that passes data back this way, since the compat_ifr_data_ioctl() helper overwrites the ifr_data pointer and does not copy anything back out. Since all drivers using devprivate commands are now converted to the new .ndo_siocdevprivate callback, fix this by adding the missing piece and passing the pointer separately the whole way. This further unifies the native and compat logic for socket ioctls, as the new code now passes the correct pointer as well as the correct data for both native and compat ioctls. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b9067f5d |
|
27-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
net: split out SIOCDEVPRIVATE handling from dev_ioctl SIOCDEVPRIVATE ioctl commands are mainly used in really old drivers, and they have a number of problems: - They hide behind the normal .ndo_do_ioctl function that is also used for other things in modern drivers, so it's hard to spot a driver that actually uses one of these - Since drivers use a number different calling conventions, it is impossible to support compat mode for them in a generic way. - With all drivers using the same 16 commands codes, there is no way to introspect the data being passed through things like strace. Add a new net_device_ops callback pointer, to address the first two of these. Separating them from .ndo_do_ioctl makes it easy to grep for drivers with a .ndo_siocdevprivate callback, and the unwieldy name hopefully makes it easier to spot in code review. By passing the ifreq structure and the ifr_data pointer separately, it is no longer necessary to overload these, and the driver can use either one for a given command. Cc: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29c49648 |
|
22-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
net: socket: rework compat_ifreq_ioctl() compat_ifreq_ioctl() is one of the last users of copy_in_user() and compat_alloc_user_space(), as it attempts to convert the 'struct ifreq' arguments from 32-bit to 64-bit format as used by dev_ioctl() and a couple of socket family specific interpretations. The current implementation works correctly when calling dev_ioctl(), inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl() and packet_ioctl(). The ioctl handlers for x25, netrom, rose and x25 do not interpret the arguments and only block the corresponding commands, so they do not care. For af_inet6 and af_decnet however, the compat conversion is slightly incorrect, as it will copy more data than the native handler accesses, both of them use a structure that is shorter than ifreq. Replace the copy_in_user() conversion with a pair of accessor functions to read and write the ifreq data in place with the correct length where needed, while leaving the other ones to copy the (already compatible) structures directly. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
876f0bf9 |
|
22-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
net: socket: simplify dev_ifconf handling The dev_ifconf() calling conventions make compat handling more complicated than necessary, simplify this by moving the in_compat_syscall() check into the function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b0e99d03 |
|
22-Jul-2021 |
Arnd Bergmann <arnd@arndb.de> |
net: socket: remove register_gifconf Since dynamic registration of the gifconf() helper is only used for IPv4, and this can not be in a loadable module, this can be simplified noticeably by turning it into a direct function call as a preparation for cleaning up the compat handling. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe21cb91 |
|
02-Jul-2021 |
Kumar Kartikeya Dwivedi <memxor@gmail.com> |
net: core: Split out code to run generic XDP prog This helper can later be utilized in code that runs cpumap and devmap programs in generic redirect mode and adjust skb based on changes made to xdp_buff. When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a generic redirect path invokes devmap/cpumap prog if set, it must __skb_pull again as we expect mac header to be pulled. It also drops the skb_reset_mac_len call after do_xdp_generic, as the mac_header and network_header are advanced by the same offset, so the difference (mac_len) remains constant. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
|
#
ff70202b |
|
24-Jun-2021 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
dev_forward_skb: do not scrub skb mark within the same name space The goal is to keep the mark during a bpf_redirect(), like it is done for legacy encapsulation / decapsulation, when there is no x-netns. This was initially done in commit 213dd74aee76 ("skbuff: Do not scrub skb mark within the same name space"). When the call to skb_scrub_packet() was added in dev_forward_skb() (commit 8b27f27797ca ("skb: allow skb_scrub_packet() to be used by tunnels")), the second argument (xnet) was set to true to force a call to skb_orphan(). At this time, the mark was always cleanned up by skb_scrub_packet(), whatever xnet value was. This call to skb_orphan() was removed later in commit 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb."). But this 'true' stayed here without any real reason. Let's correctly set xnet in ____dev_forward_skb(), this function has access to the previous interface and to the new interface. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
490dceca |
|
19-May-2021 |
Jakub Kicinski <kuba@kernel.org> |
mlx5: count all link events mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This breaks link flap detection based on Linux carrier state transition count as netif_carrier_on() does nothing if carrier is already on. Make sure we count such events. netif_carrier_event() increments the counters and fires the linkwatch events. The latter is not necessary for the use case but seems like the right thing to do. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
0854fa82 |
|
07-Apr-2021 |
Andrei Vagin <avagin@gmail.com> |
net: remove the new_ifindex argument from dev_change_net_namespace Here is only one place where we want to specify new_ifindex. In all other cases, callers pass 0 as new_ifindex. It looks reasonable to add a low-level function with new_ifindex and to convert dev_change_net_namespace to a static inline wrapper. Fixes: eeb85a14ee34 ("net: Allow to specify ifindex when device is moved to another namespace") Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eeb85a14 |
|
05-Apr-2021 |
Andrei Vagin <avagin@gmail.com> |
net: Allow to specify ifindex when device is moved to another namespace Currently, we can specify ifindex on link creation. This change allows to specify ifindex when a device is moved to another network namespace. Even now, a device ifindex can be changed if there is another device with the same ifindex in the target namespace. So this change doesn't introduce completely new behavior, it adds more control to the process. CRIU users want to restore containers with pre-created network devices. A user will provide network devices and instructions where they have to be restored, then CRIU will restore network namespaces and move devices into them. The problem is that devices have to be restored with the same indexes that they have before C/R. Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
26267bf9 |
|
23-Mar-2021 |
Felix Fietkau <nbd@nbd.name> |
netfilter: flowtable: bridge vlan hardware offload and switchdev The switch might have already added the VLAN tag through PVID hardware offload. Keep this extra VLAN in the flowtable but skip it on egress. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0994d492 |
|
23-Mar-2021 |
Felix Fietkau <nbd@nbd.name> |
net: dsa: resolve forwarding path for dsa slave ports Add .ndo_fill_forward_path for dsa slave port devices Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6efc675 |
|
23-Mar-2021 |
Felix Fietkau <nbd@nbd.name> |
net: ppp: resolve forwarding path for bridge pppoe devices Pass on the PPPoE session ID, destination hardware address and the real device. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bcf2766b |
|
23-Mar-2021 |
Felix Fietkau <nbd@nbd.name> |
net: bridge: resolve forwarding path for VLAN tag actions in bridge devices Depending on the VLAN settings of the bridge and the port, the bridge can either add or remove a tag. When vlan filtering is enabled, the fdb lookup also needs to know the VLAN tag/proto for the destination address To provide this, keep track of the stack of VLAN tags for the path in the lookup context Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ec9d16ba |
|
23-Mar-2021 |
Pablo Neira Ayuso <pablo@netfilter.org> |
net: bridge: resolve forwarding path for bridge devices Add .ndo_fill_forward_path for bridge devices. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e4417d69 |
|
23-Mar-2021 |
Pablo Neira Ayuso <pablo@netfilter.org> |
net: 8021q: resolve forwarding path for vlan devices Add .ndo_fill_forward_path for vlan devices. For instance, assuming the following topology: IP forwarding / \ eth0.100 eth0 | eth0 . . . ethX ab:cd:ef:ab:cd:ef For packets going through IP forwarding to eth0.100 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: eth0.100 -> eth0 Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ddb94eaf |
|
23-Mar-2021 |
Pablo Neira Ayuso <pablo@netfilter.org> |
net: resolve forwarding path from virtual netdevice and HW destination address This patch adds dev_fill_forward_path() which resolves the path to reach the real netdevice from the IP forwarding side. This function takes as input the netdevice and the destination hardware address and it walks down the devices calling .ndo_fill_forward_path() for each device until the real device is found. For instance, assuming the following topology: IP forwarding / \ br0 eth0 / \ eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity. ethX is the interface in another box which is connected to the eth1 bridge port. For packets going through IP forwarding to br0 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: br0 -> eth1 .ndo_fill_forward_path for br0 looks up at the FDB for the bridge port from the destination MAC address to get the bridge port eth1. This information allows to create a fast path that bypasses the classic bridge and IP forwarding paths, so packets go directly from the bridge port eth1 to eth0 (wan interface) and vice versa. fast path .------------------------. / \ | IP forwarding | | / \ \/ | br0 eth0 . / \ -> eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5aa3afe1 |
|
23-Mar-2021 |
Dmitry Vyukov <dvyukov@google.com> |
net: make unregister netdev warning timeout configurable netdev_wait_allrefs() issues a warning if refcount does not drop to 0 after 10 seconds. While 10 second wait generally should not happen under normal workload in normal environment, it seems to fire falsely very often during fuzzing and/or in qemu emulation (~10x slower). At least it's not possible to understand if it's really a false positive or not. Automated testing generally bumps all timeouts to very high values to avoid flake failures. Add net.core.netdev_unregister_timeout_secs sysctl to make the timeout configurable for automated testing systems. Lowering the timeout may also be useful for e.g. manual bisection. The default value matches the current behavior. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877 Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
add2d736 |
|
22-Mar-2021 |
Eric Dumazet <edumazet@google.com> |
net: set initial device refcount to 1 When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the initial net device refcount was 0. When CONFIG_PCPU_DEV_REFCNT is not set, this means the first dev_hold() triggers an illegal refcount operation (addition on 0) refcount_t: addition on 0; use-after-free. WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4 Fix is to change initial (and final) refcount to be 1. Also add a missing kerneldoc piece, as reported by Stephen Rothwell. Fixes: 919067cc845f ("net: add CONFIG_PCPU_DEV_REFCNT") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <groeck@google.com> Tested-by: Guenter Roeck <groeck@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
744b8376 |
|
22-Mar-2021 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: move the ptype_all and ptype_base declarations to include/linux/netdevice.h ptype_all and ptype_base are declared in net/core/dev.c as non-static, because they are used by net-procfs.c too. However, a "make W=1" build complains that there was no previous declaration of ptype_all and ptype_base in a header file, so this way of declaring things constitutes a violation of coding style. Let's move the extern declarations of ptype_all and ptype_base to the linux/netdevice.h file, which is included by net-procfs.c too. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f57bac3c |
|
21-Mar-2021 |
Vincent Mailhol <mailhol.vincent@wanadoo.fr> |
netdev: add netdev_queue_set_dql_min_limit() Add a function to set the dynamic queue limit minimum value. Some specific drivers might have legitimate reasons to configure dql.min_limit to a given value. Typically, this is the case when the PDU of the protocol is smaller than the packet size to used to carry those frames to the device. Concrete example: a CAN (Control Area Network) device with an USB 2.0 interface. The PDU of classical CAN protocol are roughly 16 bytes but the USB packet size (which is used to carry the CAN frames to the device) might be up to 512 bytes. Wen small traffic burst occurs, BQL algorithm is not able to immediately adjust and this would result in having to send many small USB packets (i.e packet of 16 bytes for each CAN frame). Filling up the USB packet with CAN frames is relatively fast (small latency issue) but the gain of not having to send several small USB packets is huge (big throughput increase). In this case, forcing dql.min_limit to a given value that would allow to stuff the USB packet is always a win. This function is to be used by network drivers which are able to prove through a rationale and through empirical tests on several environment (with other applications, heavy context switching, virtualization...), that they constantly reach better performances with a specific predefined dql.min_limit value with no noticeable latency impact. Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
919067cc |
|
19-Mar-2021 |
Eric Dumazet <edumazet@google.com> |
net: add CONFIG_PCPU_DEV_REFCNT I was working on a syzbot issue, claiming one device could not be dismantled because its refcount was -1 unregister_netdevice: waiting for sit0 to become free. Usage count = -1 It would be nice if syzbot could trigger a warning at the time this reference count became negative. This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults to per cpu variables (as before this patch) on SMP builds. v2: free_dev label in alloc_netdev_mqs() is moved to avoid a compiler warning (-Wunused-label), as reported by kernel test robot <lkp@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
044ab86d |
|
18-Mar-2021 |
Antoine Tenart <atenart@kernel.org> |
net: move the xps maps to an array Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in net_device. That will simplify a lot the code removing the need for lots of if/else conditionals as the correct map will be available using its offset in the array. This should not modify the xps maps behaviour in any way. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5478fcd0 |
|
18-Mar-2021 |
Antoine Tenart <atenart@kernel.org> |
net: embed nr_ids in the xps maps Embed nr_ids (the number of cpu for the xps cpus map, and the number of rxqs for the xps cpus map) in dev_maps. That will help not accessing out of bound memory if those values change after dev_maps was allocated. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
255c04a8 |
|
18-Mar-2021 |
Antoine Tenart <atenart@kernel.org> |
net: embed num_tc in the xps maps The xps cpus/rxqs map is accessed using dev->num_tc, which is used when allocating the map. But later updates of dev->num_tc can lead to having a mismatch between the maps and how they're accessed. In such cases the map values do not make any sense and out of bound accesses can occur (that can be easily seen using KASAN). This patch aims at fixing this by embedding num_tc into the maps, using the value at the time the map is created. This brings two improvements: - The maps can be accessed using the embedded num_tc, so we know for sure we won't have out of bound accesses. - Checks can be made before accessing the maps so we know the values retrieved will make sense. We also update __netif_set_xps_queue to conditionally copy old maps from dev_maps in the new one only if the number of traffic classes from both maps match. Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a126a43 |
|
16-Sep-2020 |
Roi Dayan <roid@nvidia.com> |
net: Change dev parameter to const in netif_device_present() Not all ndos check the present bit before calling the ndo and the driver may want to check it. Sometimes the dev parameter passed as const so we pass it to netif_device_present() as const. Since netif_device_present() doesn't modify dev parameter anyway, declare it as const. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
#
c2ff53d8 |
|
18-Feb-2021 |
Xuan Zhuo <xuanzhuo@linux.alibaba.com> |
net: Add priv_flags for allow tx skb without linear In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Suggested-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-3-alobakin@pm.me
|
#
2463e073 |
|
18-Feb-2021 |
Alexander Lobakin <alobakin@pm.me> |
netdevice: Add missing IFF_PHONY_HEADROOM self-definition This is harmless for now, but can be fatal for future refactors. Fixes: 871b642adebe3 ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-2-alobakin@pm.me
|
#
cb038357 |
|
16-Mar-2021 |
Wei Wang <weiwan@google.com> |
net: fix race between napi kthread mode and busy poll Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to determine if the kthread owns this napi and could call napi->poll() on it. However, if socket busy poll is enabled, it is possible that the busy poll thread grabs this SCHED bit (after the previous napi->poll() invokes napi_complete_done() and clears SCHED bit) and tries to poll on the same napi. napi_disable() could grab the SCHED bit as well. This patch tries to fix this race by adding a new bit NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in ____napi_schedule() if the threaded mode is enabled, and gets cleared in napi_complete_done(), and we only poll the napi in kthread if this bit is set. This helps distinguish the ownership of the napi between kthread and other scenarios and fixes the race issue. Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support") Reported-by: Martin Zaharinov <micron10@gmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Wei Wang <weiwan@google.com> Cc: Alexander Duyck <alexanderduyck@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c95c34f0 |
|
03-Mar-2021 |
Maciej Fijalkowski <maciej.fijalkowski@intel.com> |
xsk: Remove dangling function declaration from header file xdp_umem_query() is dead for a long time, drop the declaration from include/linux/netdevice.h Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/20210303185636.18070-2-maciej.fijalkowski@intel.com
|
#
4e096a18 |
|
23-Feb-2021 |
Oleksij Rempel <linux@rempel-privat.de> |
net: introduce CAN specific pointer in the struct net_device Since 20dd3850bcf8 ("can: Speed up CAN frame receiption by using ml_priv") the CAN framework uses per device specific data in the AF_CAN protocol. For this purpose the struct net_device->ml_priv is used. Later the ml_priv usage in CAN was extended for other users, one of them being CAN_J1939. Later in the kernel ml_priv was converted to an union, used by other drivers. E.g. the tun driver started storing it's stats pointer. Since tun devices can claim to be a CAN device, CAN specific protocols will wrongly interpret this pointer, which will cause system crashes. Mostly this issue is visible in the CAN_J1939 stack. To fix this issue, we request a dedicated CAN pointer within the net_device struct. Reported-by: syzbot+5138c4dd15a0401bec7b@syzkaller.appspotmail.com Fixes: 20dd3850bcf8 ("can: Speed up CAN frame receiption by using ml_priv") Fixes: ffd956eef69b ("can: introduce CAN midlayer private and allocate it automatically") Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Fixes: 497a5757ce4e ("tun: switch to net core provided statistics counters") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20210223070127.4538-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
5f7d5728 |
|
09-Feb-2021 |
Jesper Dangaard Brouer <brouer@redhat.com> |
bpf: Drop MTU check when doing TC-BPF redirect to ingress The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/ V15: - missing static for function declaration V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/161287790971.790810.11785274340154740591.stgit@firesoul
|
#
3b23a32a |
|
11-Feb-2021 |
Cong Wang <cong.wang@bytedance.com> |
net: fix dev_ifsioc_locked() race condition dev_ifsioc_locked() is called with only RCU read lock, so when there is a parallel writer changing the mac address, it could get a partially updated mac address, as shown below: Thread 1 Thread 2 // eth_commit_mac_addr_change() memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN); // dev_ifsioc_locked() memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,...); Close this race condition by guarding them with a RW semaphore, like netdev_get_name(). We can not use seqlock here as it does not allow blocking. The writers already take RTNL anyway, so this does not affect the slow path. To avoid bothering existing dev_set_mac_address() callers in drivers, introduce a new wrapper just for user-facing callers on ioctl and rtnetlink paths. Note, bonding also changes slave mac addresses but that requires a separate patch due to the complexity of bonding code. Fixes: 3710becf8a58 ("net: RCU locking for simple ioctl()") Reported-by: "Gong, Sishuai" <sishuai@purdue.edu> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5fdd2f0e |
|
08-Feb-2021 |
Wei Wang <weiwan@google.com> |
net: add sysfs attribute to control napi threaded mode This patch adds a new sysfs attribute to the network device class. Said attribute provides a per-device control to enable/disable the threaded mode for all the napi instances of the given network device, without the need for a device up/down. User sets it to 1 or 0 to enable or disable threaded mode. Note: when switching between threaded and the current softirq based mode for a napi instance, it will not immediately take effect if the napi is currently being polled. The mode switch will happen for the next time napi_schedule() is called. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Co-developed-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29863d41 |
|
08-Feb-2021 |
Wei Wang <weiwan@google.com> |
net: implement threaded-able napi poll loop support This patch allows running each napi poll loop inside its own kernel thread. The kthread is created during netif_napi_add() if dev->threaded is set. And threaded mode is enabled in napi_enable(). We will provide a way to set dev->threaded and enable threaded mode without a device up/down in the following patch. Once that threaded mode is enabled and the kthread is started, napi_schedule() will wake-up such thread instead of scheduling the softirq. The threaded poll loop behaves quite likely the net_rx_action, but it does not have to manipulate local irqs and uses an explicit scheduling point based on netdev_budget. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
28af22c6 |
|
26-Jan-2021 |
Jesper Dangaard Brouer <brouer@redhat.com> |
net: adjust net_device layout for cacheline usage The current layout of net_device is not optimal for cacheline usage. The member adj_list.lower linked list is split between cacheline 2 and 3. The ifindex is placed together with stats (struct net_device_stats), although most modern drivers don't update this stats member. The members netdev_ops, mtu and hard_header_len are placed on three different cachelines. These members are accessed for XDP redirect into devmap, which were noticeably with perf tool. When not using the map redirect variant (like TC-BPF does), then ifindex is also used, which is placed on a separate fourth cacheline. These members are also accessed during forwarding with regular network stack. The members priv_flags and flags are on fast-path for network stack transmit path in __dev_queue_xmit (currently located together with mtu cacheline). This patch creates a read mostly cacheline, with the purpose of keeping the above mentioned members on the same cacheline. Some netdev_features_t members also becomes part of this cacheline, which is on purpose, as function netif_skb_features() is on fast-path via validate_xmit_skb(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/161168277983.410784.12401225493601624417.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d03b195b |
|
19-Jan-2021 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
sch_htb: Hierarchical QoS hardware offload HTB doesn't scale well because of contention on a single lock, and it also consumes CPU. This patch adds support for offloading HTB to hardware that supports hierarchical rate limiting. In the offload mode, HTB passes control commands to the driver using ndo_setup_tc. The driver has to replicate the whole hierarchy of classes and their settings (rate, ceil) in the NIC. Every modification of the HTB tree caused by the admin results in ndo_setup_tc being called. After this setup, the HTB algorithm is done completely in the NIC. An SQ (send queue) is created for every leaf class and attached to the hierarchy, so that the NIC can calculate and obey aggregated rate limits, too. In the future, it can be changed, so that multiple SQs will back a single leaf class. ndo_select_queue is responsible for selecting the right queue that serves the traffic class of each packet. The data path works as follows: a packet is classified by clsact, the driver selects a hardware queue according to its class, and the packet is enqueued into this queue's qdisc. This solution addresses two main problems of scaling HTB: 1. Contention by flow classification. Currently the filters are attached to the HTB instance as follows: # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10 It's possible to move classification to clsact egress hook, which is thread-safe and lock-free: # tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10 This way classification still happens in software, but the lock contention is eliminated, and it happens before selecting the TX queue, allowing the driver to translate the class to the corresponding hardware queue in ndo_select_queue. Note that this is already compatible with non-offloaded HTB and doesn't require changes to the kernel nor iproute2. 2. Contention by handling packets. HTB is not multi-queue, it attaches to a whole net device, and handling of all packets takes the same lock. When HTB is offloaded, it registers itself as a multi-queue qdisc, similarly to mq: HTB is attached to the netdev, and each queue has its own qdisc. Some features of HTB may be not supported by some particular hardware, for example, the maximum number of classes may be limited, the granularity of rate and ceil parameters may be different, etc. - so, the offload is not enabled by default, a new parameter is used to enable it: # tc qdisc replace dev eth0 root handle 1: htb offload Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
7b8fc010 |
|
18-Jan-2021 |
Jarod Wilson <jarod@redhat.com> |
bonding: add a vlan+srcmac tx hashing option This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interest switch topologies, where 802.3ad isn't an option. They're currently running a proprietary solution that effectively achieves load-balancing of VMs and bandwidth utilization improvements with a similar form of transmission algorithm. Basically, each VM has it's own vlan, so it always sends its traffic out the same interface, unless that interface fails. Traffic gets split between the interfaces, maintaining a consistent path, with failover still available if an interface goes down. Unlike bond_eth_hash(), this hash function is using the full source MAC address instead of just the last byte, as there are so few components to the hash, and in the no-vlan case, we would be returning just the last byte of the source MAC as the hash value. It's entirely possible to have two NICs in a bond with the same last byte of their MAC, but not the same MAC, so this adjustment should guarantee distinct hashes in all cases. This has been rudimetarily tested to provide similar results to the proprietary solution it is aiming to replace. A patch for iproute2 is also posted, to properly support the new mode there as well. Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Thomas Davis <tadavis@lbl.gov> Signed-off-by: Jarod Wilson <jarod@redhat.com> Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
719a402c |
|
17-Jan-2021 |
Tariq Toukan <tariqt@nvidia.com> |
net: netdevice: Add operation ndo_sk_get_lower_dev ndo_sk_get_lower_dev returns the lower netdev that corresponds to a given socket. Additionally, we implement a helper netdev_sk_get_lowest_dev() to get the lowest one in chain. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1d11fa69 |
|
08-Jan-2021 |
Eric Dumazet <edumazet@google.com> |
net-gro: remove GRO_DROP GRO_DROP can only be returned from napi_gro_frags() if the skb has not been allocated by a prior napi_get_frags() Since drivers must use napi_get_frags() and test its result before populating the skb with metadata, we can safely remove GRO_DROP since it offers no practical use. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
30bfce10 |
|
06-Jan-2021 |
Jakub Kicinski <kuba@kernel.org> |
net: remove ndo_udp_tunnel_* callbacks All UDP tunnel port management is now routed via udp_tunnel_nic infra directly. Remove the old callbacks. Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
3aa6bce9 |
|
05-Feb-2021 |
Edwin Peer <edwin.peer@broadcom.com> |
net: watchdog: hold device global xmit lock during tx disable Prevent netif_tx_disable() running concurrently with dev_watchdog() by taking the device global xmit lock. Otherwise, the recommended: netif_carrier_off(dev); netif_tx_disable(dev); driver shutdown sequence can happen after the watchdog has already checked carrier, resulting in possible false alarms. This is because netif_tx_lock() only sets the frozen bit without maintaining the locks on the individual queues. Fixes: c3f26a269c24 ("netdev: Fix lockdep warnings in multiqueue configurations.") Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7061eb8c |
|
14-Dec-2020 |
Lijun Pan <ljp@linux.ibm.com> |
net: core: introduce __netdev_notify_peers There are some use cases for netdev_notify_peers in the context when rtnl lock is already held. Introduce lockless version of netdev_notify_peers call to save the extra code to call call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev); call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev); After that, convert netdev_notify_peers to call the new helper. Suggested-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Lijun Pan <ljp@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
c214550f |
|
29-Nov-2020 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net: delete __dev_getfirstbyhwtype The last user of the RTNL brother of dev_getfirstbyhwtype (the latter being synchronized under RCU) has been deleted in commit b4db2b35fc44 ("afs: Use core kernel UUID generation"). Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20201129200550.2433401-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
7fd3253a |
|
30-Nov-2020 |
Björn Töpel <bjorn@kernel.org> |
net: Introduce preferred busy-polling The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is an opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavy loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral feature"), and allows for a user to defer interrupts to be enabled and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will timeout, and regular softirq handling will resume. In summary; Heavy traffic applications that prefer busy-polling over softirq processing should use this option. Example usage: $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will timeout and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
|
#
36ccdf85 |
|
23-Nov-2020 |
Björn Töpel <bjorn@kernel.org> |
net, xsk: Avoid taking multiple skbuff references Commit 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY") addressed the problem that packets were discarded from the Tx AF_XDP ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was bumping the skbuff reference count, so that the buffer would not be freed by dev_direct_xmit(). A reference count larger than one means that the skbuff is "shared", which is not the case. If the "shared" skbuff is sent to the generic XDP receive path, netif_receive_generic_xdp(), and pskb_expand_head() is entered the BUG_ON(skb_shared(skb)) will trigger. This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(), where a user can select the skbuff free policy. This allows AF_XDP to avoid bumping the reference count, but still keep the NETDEV_TX_BUSY behavior. Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY") Reported-by: Yonghong Song <yhs@fb.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com
|
#
d5496990 |
|
20-Nov-2020 |
Eyal Birger <eyal.birger@gmail.com> |
net/packet: fix packet receive on L3 devices without visible hard header In the patchset merged by commit b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which did not have header_ops were given one for the purpose of protocol parsing on af_packet transmit path. That change made af_packet receive path regard these devices as having a visible L3 header and therefore aligned incoming skb->data to point to the skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not reset their mac_header prior to ingress and therefore their incoming packets became malformed. Ideally these devices would reset their mac headers, or af_packet would be able to rely on dev->hard_header_len being 0 for such cases, but it seems this is not the case. Fix by changing af_packet RX ll visibility criteria to include the existence of a '.create()' header operation, which is used when creating a device hard header - via dev_hard_header() - by upper layers, and does not exist in these L3 devices. As this predicate may be useful in other situations, add it as a common dev_has_header() helper in netdevice.h. Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
cc69837f |
|
20-Nov-2020 |
Jakub Kicinski <kuba@kernel.org> |
net: don't include ethtool.h from netdevice.h linux/netdevice.h is included in very many places, touching any of its dependecies causes large incremental builds. Drop the linux/ethtool.h include, linux/netdevice.h just needs a forward declaration of struct ethtool_ops. Fix all the places which made use of this implicit include. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
270f3385 |
|
16-Nov-2020 |
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> |
net: core: fix some kernel-doc markups Some identifiers have different names between their prototypes and the kernel-doc markup. In the specific case of netif_subqueue_stopped(), keep the current markup for __netif_subqueue_stopped(), adding a new one for netif_subqueue_stopped(). Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
a1839426 |
|
07-Nov-2020 |
Heiner Kallweit <hkallweit1@gmail.com> |
net: core: add dev_get_tstats64 as a ndo_get_stats64 implementation It's a frequent pattern to use netdev->stats for the less frequently accessed counters and per-cpu counters for the frequently accessed counters (rx/tx bytes/packets). Add a default ndo_get_stats64() implementation for this use case. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
81b01894 |
|
29-Oct-2020 |
Heiner Kallweit <hkallweit1@gmail.com> |
net: core: add devm_netdev_alloc_pcpu_stats We have netdev_alloc_pcpu_stats(), and we have devm_alloc_percpu(). Add a managed version of netdev_alloc_pcpu_stats, e.g. for allocating the per-cpu stats in the probe() callback of a driver. It needs to be a macro for dealing properly with the type argument. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
d3fd6548 |
|
29-Oct-2020 |
Heiner Kallweit <hkallweit1@gmail.com> |
net: core: add dev_sw_netstats_tx_add Add dev_sw_netstats_tx_add(), complementing already existing dev_sw_netstats_rx_add(). Other than dev_sw_netstats_rx_add allow to pass the number of packets as function argument. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
44fa32f0 |
|
12-Oct-2020 |
Heiner Kallweit <hkallweit1@gmail.com> |
net: add function dev_fetch_sw_netstats for fetching pcpu_sw_netstats In several places the same code is used to populate rtnl_link_stats64 fields with data from pcpu_sw_netstats. Therefore factor out this code to a new function dev_fetch_sw_netstats(). v2: - constify argument netstats - don't ignore netstats being NULL or an ERRPTR - switch to EXPORT_SYMBOL_GPL Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/6d16a338-52f5-df69-0020-6bc771a7d498@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
9aa1206e |
|
10-Oct-2020 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: Add redirect_peer helper Add an efficient ingress to ingress netns switch that can be used out of tc BPF programs in order to redirect traffic from host ns ingress into a container veth device ingress without having to go via CPU backlog queue [0]. For local containers this can also be utilized and path via CPU backlog queue only needs to be taken once, not twice. On a high level this borrows from ipvlan which does similar switch in __netif_receive_skb_core() and then iterates via another_round. This helps to reduce latency for mentioned use cases. Pod to remote pod with redirect(), TCP_RR [1]: # percpu_netperf 10.217.1.33 RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 ) MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 ) STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 ) MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 ) P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 ) P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 ) P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 ) TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 ) Pod to remote pod with redirect_peer(), TCP_RR: # percpu_netperf 10.217.1.33 RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 ) MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 ) STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 ) MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 ) P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 ) P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 ) P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 ) TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 ) [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
|
#
451b05f4 |
|
05-Oct-2020 |
Fabian Frederick <fabf@skynet.be> |
net: netdevice.h: sw_netstats_rx_add helper some drivers/network protocols update rx bytes/packets under u64_stats_update_begin/end sequence. Add a specific helper like dev_lstats_add() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8e1b3884 |
|
01-Oct-2020 |
Taehee Yoo <ap420073@gmail.com> |
net: remove NETDEV_HW_ADDR_T_SLAVE NETDEV_HW_ADDR_T_SLAVE is not used anymore, remove it. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a93bdcb9 |
|
01-Oct-2020 |
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> |
net: core: document two new elements of struct net_device As warned by "make htmldocs", there are two new struct elements that aren't documented: ../include/linux/netdevice.h:2159: warning: Function parameter or member 'unlink_list' not described in 'net_device' ../include/linux/netdevice.h:2159: warning: Function parameter or member 'nested_level' not described in 'net_device' Fixes: 1fc70edb7d7b ("net: core: add nested_level variable in net_device") Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c11171a4 |
|
29-Sep-2020 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
net: Add netif_rx_any_context() Quite some drivers make conditional decisions based on in_interrupt() to invoke either netif_rx() or netif_rx_ni(). Conditionals based on in_interrupt() or other variants of preempt count checks in drivers should not exist for various reasons and Linus clearly requested to either split the code pathes or pass an argument to the common functions which provides the context. This is obviously the correct solution, but for some of the affected drivers this needs a major rewrite due to their convoluted structure. As in_interrupt() usage in drivers needs to be phased out, provide netif_rx_any_context() as a stop gap for these drivers. This confines the in_interrupt() conditional to core code which in turn allows to remove the access to this check for driver code and provides one central place to do further modifications once the driver maze is cleaned up. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1fc70edb |
|
25-Sep-2020 |
Taehee Yoo <ap420073@gmail.com> |
net: core: add nested_level variable in net_device This patch is to add a new variable 'nested_level' into the net_device structure. This variable will be used as a parameter of spin_lock_nested() of dev->addr_list_lock. netif_addr_lock() can be called recursively so spin_lock_nested() is used instead of spin_lock() and dev->lower_level is used as a parameter of spin_lock_nested(). But, dev->lower_level value can be updated while it is being used. So, lockdep would warn a possible deadlock scenario. When a stacked interface is deleted, netif_{uc | mc}_sync() is called recursively. So, spin_lock_nested() is called recursively too. At this moment, the dev->lower_level variable is used as a parameter of it. dev->lower_level value is updated when interfaces are being unlinked/linked immediately. Thus, After unlinking, dev->lower_level shouldn't be a parameter of spin_lock_nested(). A (macvlan) | B (vlan) | C (bridge) | D (macvlan) | E (vlan) | F (bridge) A->lower_level : 6 B->lower_level : 5 C->lower_level : 4 D->lower_level : 3 E->lower_level : 2 F->lower_level : 1 When an interface 'A' is removed, it releases resources. At this moment, netif_addr_lock() would be called. Then, netdev_upper_dev_unlink() is called recursively. Then dev->lower_level is updated. There is no problem. But, when the bridge module is removed, 'C' and 'F' interfaces are removed at once. If 'F' is removed first, a lower_level value is like below. A->lower_level : 5 B->lower_level : 4 C->lower_level : 3 D->lower_level : 2 E->lower_level : 1 F->lower_level : 1 Then, 'C' is removed. at this moment, netif_addr_lock() is called recursively. The ordering is like this. C(3)->D(2)->E(1)->F(1) At this moment, the lower_level value of 'E' and 'F' are the same. So, lockdep warns a possible deadlock scenario. In order to avoid this problem, a new variable 'nested_level' is added. This value is the same as dev->lower_level - 1. But this value is updated in rtnl_unlock(). So, this variable can be used as a parameter of spin_lock_nested() safely in the rtnl context. Test commands: ip link add br0 type bridge vlan_filtering 1 ip link add vlan1 link br0 type vlan id 10 ip link add macvlan2 link vlan1 type macvlan ip link add br3 type bridge vlan_filtering 1 ip link set macvlan2 master br3 ip link add vlan4 link br3 type vlan id 10 ip link add macvlan5 link vlan4 type macvlan ip link add br6 type bridge vlan_filtering 1 ip link set macvlan5 master br6 ip link add vlan7 link br6 type vlan id 10 ip link add macvlan8 link vlan7 type macvlan ip link set br0 up ip link set vlan1 up ip link set macvlan2 up ip link set br3 up ip link set vlan4 up ip link set macvlan5 up ip link set br6 up ip link set vlan7 up ip link set macvlan8 up modprobe -rv bridge Splat looks like: [ 36.057436][ T744] WARNING: possible recursive locking detected [ 36.058848][ T744] 5.9.0-rc6+ #728 Not tainted [ 36.059959][ T744] -------------------------------------------- [ 36.061391][ T744] ip/744 is trying to acquire lock: [ 36.062590][ T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30 [ 36.064922][ T744] [ 36.064922][ T744] but task is already holding lock: [ 36.066626][ T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60 [ 36.068851][ T744] [ 36.068851][ T744] other info that might help us debug this: [ 36.070731][ T744] Possible unsafe locking scenario: [ 36.070731][ T744] [ 36.072497][ T744] CPU0 [ 36.073238][ T744] ---- [ 36.074007][ T744] lock(&vlan_netdev_addr_lock_key); [ 36.075290][ T744] lock(&vlan_netdev_addr_lock_key); [ 36.076590][ T744] [ 36.076590][ T744] *** DEADLOCK *** [ 36.076590][ T744] [ 36.078515][ T744] May be due to missing lock nesting notation [ 36.078515][ T744] [ 36.080491][ T744] 3 locks held by ip/744: [ 36.081471][ T744] #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490 [ 36.083614][ T744] #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60 [ 36.085942][ T744] #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80 [ 36.088400][ T744] [ 36.088400][ T744] stack backtrace: [ 36.089772][ T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728 [ 36.091364][ T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 36.093630][ T744] Call Trace: [ 36.094416][ T744] dump_stack+0x77/0x9b [ 36.095385][ T744] __lock_acquire+0xbc3/0x1f40 [ 36.096522][ T744] lock_acquire+0xb4/0x3b0 [ 36.097540][ T744] ? dev_set_rx_mode+0x19/0x30 [ 36.098657][ T744] ? rtmsg_ifinfo+0x1f/0x30 [ 36.099711][ T744] ? __dev_notify_flags+0xa5/0xf0 [ 36.100874][ T744] ? rtnl_is_locked+0x11/0x20 [ 36.101967][ T744] ? __dev_set_promiscuity+0x7b/0x1a0 [ 36.103230][ T744] _raw_spin_lock_bh+0x38/0x70 [ 36.104348][ T744] ? dev_set_rx_mode+0x19/0x30 [ 36.105461][ T744] dev_set_rx_mode+0x19/0x30 [ 36.106532][ T744] dev_set_promiscuity+0x36/0x50 [ 36.107692][ T744] __dev_set_promiscuity+0x123/0x1a0 [ 36.108929][ T744] dev_set_promiscuity+0x1e/0x50 [ 36.110093][ T744] br_port_set_promisc+0x1f/0x40 [bridge] [ 36.111415][ T744] br_manage_promisc+0x8b/0xe0 [bridge] [ 36.112728][ T744] __dev_set_promiscuity+0x123/0x1a0 [ 36.113967][ T744] ? __hw_addr_sync_one+0x23/0x50 [ 36.115135][ T744] __dev_set_rx_mode+0x68/0x90 [ 36.116249][ T744] dev_uc_sync+0x70/0x80 [ 36.117244][ T744] dev_uc_add+0x50/0x60 [ 36.118223][ T744] macvlan_open+0x18e/0x1f0 [macvlan] [ 36.119470][ T744] __dev_open+0xd6/0x170 [ 36.120470][ T744] __dev_change_flags+0x181/0x1d0 [ 36.121644][ T744] dev_change_flags+0x23/0x60 [ 36.122741][ T744] do_setlink+0x30a/0x11e0 [ 36.123778][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.124929][ T744] ? __nla_validate_parse.part.6+0x45/0x8e0 [ 36.126309][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.127457][ T744] __rtnl_newlink+0x546/0x8e0 [ 36.128560][ T744] ? lock_acquire+0xb4/0x3b0 [ 36.129623][ T744] ? deactivate_slab.isra.85+0x6a1/0x850 [ 36.130946][ T744] ? __lock_acquire+0x92c/0x1f40 [ 36.132102][ T744] ? lock_acquire+0xb4/0x3b0 [ 36.133176][ T744] ? is_bpf_text_address+0x5/0xe0 [ 36.134364][ T744] ? rtnl_newlink+0x2e/0x70 [ 36.135445][ T744] ? rcu_read_lock_sched_held+0x32/0x60 [ 36.136771][ T744] ? kmem_cache_alloc_trace+0x2d8/0x380 [ 36.138070][ T744] ? rtnl_newlink+0x2e/0x70 [ 36.139164][ T744] rtnl_newlink+0x47/0x70 [ ... ] Fixes: 845e0ebb4408 ("net: change addr_list_lock back to static key") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eff74233 |
|
25-Sep-2020 |
Taehee Yoo <ap420073@gmail.com> |
net: core: introduce struct netdev_nested_priv for nested interface infrastructure Functions related to nested interface infrastructure such as netdev_walk_all_{ upper | lower }_dev() pass both private functions and "data" pointer to handle their own things. At this point, the data pointer type is void *. In order to make it easier to expand common variables and functions, this new netdev_nested_priv structure is added. In the following patch, a new member variable will be added into this struct to fix the lockdep issue. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3753d977 |
|
17-Sep-2020 |
Mahesh Bandewar <maheshb@google.com> |
net: fix build without CONFIG_SYSCTL definition Earlier commit 316cdaa1158a ("net: add option to not create fall-back tunnels in root-ns as well") removed the CONFIG_SYSCTL to enable the kernel-commandline to work. However, this variable gets defined only when CONFIG_SYSCTL option is selected. With this change the behavior would default to creating fall-back tunnels in all namespaces when CONFIG_SYSCTL is not selected and the kernel commandline option will be ignored. Fixes: 316cdaa1158a ("net: add option to not create fall-back tunnels in root-ns as well") Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2492c205 |
|
16-Sep-2020 |
YueHaibing <yuehaibing@huawei.com> |
netdev: Remove unused functions There is no callers in tree, so can remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4d092dd2 |
|
09-Sep-2020 |
Jakub Kicinski <kuba@kernel.org> |
net: manage napi add/del idempotence explicitly To RCUify napi->dev_list we need to replace list_del_init() with list_del_rcu(). There is no _init() version for RCU for obvious reasons. Up until now netif_napi_del() was idempotent so to make sure it remains such add a bit which is set when NAPI is listed, and cleared when it removed. Since we don't expect multiple calls to netif_napi_add() to be correct, add a warning on that side. Now that napi_hash_add / napi_hash_del are only called by napi_add / del we can actually steal its bit. We just need to make sure hash node is initialized correctly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5198d545 |
|
09-Sep-2020 |
Jakub Kicinski <kuba@kernel.org> |
net: remove napi_hash_del() from driver-facing API We allow drivers to call napi_hash_del() before calling netif_napi_del() to batch RCU grace periods. This makes the API asymmetric and leaks internal implementation details. Soon we will want the grace period to protect more than just the NAPI hash table. Restructure the API and have drivers call a new function - __netif_napi_del() if they want to take care of RCU waits. Note that only core was checking the return status from napi_hash_del() so the new helper does not report if the NAPI was actually deleted. Some notes on driver oddness: - veth observed the grace period before calling netif_napi_del() but that should not matter - myri10ge observed normal RCU flavor - bnx2x and enic did not actually observe the grace period (unless they did so implicitly) - virtio_net and enic only unhashed Rx NAPIs The last two points seem to indicate that the calls to napi_hash_del() were a left over rather than an optimization. Regardless, it's easy enough to correct them. This patch may introduce extra synchronize_net() calls for interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on free_netdev() to call netif_napi_del(). This seems inevitable since we want to use RCU for netpoll dev->napi_list traversal, and almost no drivers set IFF_DISABLE_NETPOLL. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ffa59b0b |
|
06-Sep-2020 |
Randy Dunlap <rdunlap@infradead.org> |
netdevice.h: fix xdp_state kernel-doc warning Fix kernel-doc warning in <linux/netdevice.h>: ../include/linux/netdevice.h:2158: warning: Function parameter or member 'xdp_state' not described in 'net_device' Fixes: 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Andrii Nakryiko <andriin@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
eb02d39a |
|
06-Sep-2020 |
Randy Dunlap <rdunlap@infradead.org> |
netdevice.h: fix proto_down_reason kernel-doc warning Fix kernel-doc warning in <linux/netdevice.h>: ../include/linux/netdevice.h:2158: warning: Function parameter or member 'proto_down_reason' not described in 'net_device' Fixes: 829eb208e80d ("rtnetlink: add support for protodown reason") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
#
1742b3d5 |
|
28-Aug-2020 |
Magnus Karlsson <magnus.karlsson@intel.com> |
xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umem Replace the explicit umem reference passed to the driver in AF_XDP zero-copy mode with the buffer pool instead. This in preparation for extending the functionality of the zero-copy mode so that umems can be shared between queues on the same netdev and also between netdevs. In this commit, only an umem reference has been added to the buffer pool struct. But later commits will add other entities to it. These are going to be entities that are different between different queue ids and netdevs even though the umem is shared between them. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
|
#
316cdaa1 |
|
26-Aug-2020 |
Mahesh Bandewar <maheshb@google.com> |
net: add option to not create fall-back tunnels in root-ns as well The sysctl that was added earlier by commit 79134e6ce2c ("net: do not create fallback tunnels for non-default namespaces") to create fall-back only in root-ns. This patch enhances that behavior to provide option not to create fallback tunnels in root-ns as well. Since modules that create fallback tunnels could be built-in and setting the sysctl value after booting is pointless, so added a kernel cmdline options to change this default. The default setting is preseved for backward compatibility. The kernel command line option of fb_tunnels=initns will set the sysctl value to 1 and will create fallback tunnels only in initns while kernel cmdline fb_tunnels=none will set the sysctl value to 2 and fallback tunnels are skipped in every netns. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Maciej Zenczykowski <maze@google.com> Cc: Jian Yang <jianyang@google.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f468f21b |
|
26-Aug-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net: Take common prefetch code structure into a function Many device drivers use the same prefetch code structure to deal with small L1 cacheline size. Take this code into a function and call it from the drivers. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df23bb18 |
|
03-Aug-2020 |
Stefano Brivio <sbrivio@redhat.com> |
ipv4: route: Ignore output interface in FIB lookup for PMTU route Currently, processes sending traffic to a local bridge with an encapsulation device as a port don't get ICMP errors if they exceed the PMTU of the encapsulated link. David Ahern suggested this as a hack, but it actually looks like the correct solution: when we update the PMTU for a given destination by means of updating or creating a route exception, the encapsulation might trigger this because of PMTU discovery happening either on the encapsulation device itself, or its lower layer. This happens on bridged encapsulations only. The output interface shouldn't matter, because we already have a valid destination. Drop the output interface restriction from the associated route lookup. For UDP tunnels, we will now have a route exception created for the encapsulation itself, with a MTU value reflecting its headroom, which allows a bridge forwarding IP packets originated locally to deliver errors back to the sending socket. The behaviour is now consistent with IPv6 and verified with selftests pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in this series. v2: - reset output interface only for bridge ports (David Ahern) - add and use netif_is_any_bridge_port() helper (David Ahern) Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
829eb208 |
|
31-Jul-2020 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
rtnetlink: add support for protodown reason netdev protodown is a mechanism that allows protocols to hold an interface down. It was initially introduced in the kernel to hold links down by a multihoming protocol. There was also an attempt to introduce protodown reason at the time but was rejected. protodown and protodown reason is supported by almost every switching and routing platform. It was ok for a while to live without a protodown reason. But, its become more critical now given more than one protocol may need to keep a link down on a system at the same time. eg: vrrp peer node, port security, multihoming protocol. Its common for Network operators and protocol developers to look for such a reason on a networking box (Its also known as errDisable by most networking operators) This patch adds support for link protodown reason attribute. There are two ways to maintain protodown reasons. (a) enumerate every possible reason code in kernel - A protocol developer has to make a request and have that appear in a certain kernel version (b) provide the bits in the kernel, and allow user-space (sysadmin or NOS distributions) to manage the bit-to-reasonname map. - This makes extending reason codes easier (kind of like the iproute2 table to vrf-name map /etc/iproute2/rt_tables.d/) This patch takes approach (b). a few things about the patch: - It treats the protodown reason bits as counter to indicate active protodown users - Since protodown attribute is already an exposed UAPI, the reason is not enforced on a protodown set. Its a no-op if not used. the patch follows the below algorithm: - presence of reason bits set indicates protodown is in use - user can set protodown and protodown reason in a single or multiple setlink operations - setlink operation to clear protodown, will return -EBUSY if there are active protodown reason bits - reason is not included in link dumps if not used example with patched iproute2: $cat /etc/iproute2/protodown_reasons.d/r.conf 0 mlag 1 evpn 2 vrrp 3 psecurity $ip link set dev vxlan0 protodown on protodown_reason vrrp on $ip link set dev vxlan0 protodown_reason mlag on $ip link show 14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp> $ip link set dev vxlan0 protodown_reason mlag off $ip link set dev vxlan0 protodown off protodown_reason vrrp off Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e8407fde |
|
22-Jul-2020 |
Andrii Nakryiko <andriin@fb.com> |
bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commands Now that BPF program/link management is centralized in generic net_device code, kernel code never queries program id from drivers, so XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary. This patch removes all the implementations of those commands in kernel, along the xdp_attachment_query(). This patch was compile-tested on allyesconfig. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com
|
#
aa8d3a71 |
|
22-Jul-2020 |
Andrii Nakryiko <andriin@fb.com> |
bpf, xdp: Add bpf_link-based XDP attachment API Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through BPF_LINK_CREATE command. bpf_xdp_link is mutually exclusive with direct BPF program attachment, previous BPF program should be detached prior to attempting to create a new bpf_xdp_link attachment (for a given XDP mode). Once BPF link is attached, it can't be replaced by other BPF program attachment or link attachment. It will be detached only when the last BPF link FD is closed. bpf_xdp_link will be auto-detached when net_device is shutdown, similarly to how other BPF links behave (cgroup, flow_dissector). At that point bpf_link will become defunct, but won't be destroyed until last FD is closed. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-5-andriin@fb.com
|
#
7f0a8382 |
|
22-Jul-2020 |
Andrii Nakryiko <andriin@fb.com> |
bpf, xdp: Maintain info on attached XDP BPF programs in net_device Instead of delegating to drivers, maintain information about which BPF programs are attached in which XDP modes (generic/skb, driver, or hardware) locally in net_device. This effectively obsoletes XDP_QUERY_PROG command. Such re-organization simplifies existing code already. But it also allows to further add bpf_link-based XDP attachments without drivers having to know about any of this at all, which seems like a good setup. XDP_SETUP_PROG/XDP_SETUP_PROG_HW are just low-level commands to driver to install/uninstall active BPF program. All the higher-level concerns about prog/link interaction will be contained within generic driver-agnostic logic. All the XDP_QUERY_PROG calls to driver in dev_xdp_uninstall() were removed. It's not clear for me why dev_xdp_uninstall() were passing previous prog_flags when resetting installed programs. That seems unnecessary, plus most drivers don't populate prog_flags anyways. Having XDP_SETUP_PROG vs XDP_SETUP_PROG_HW should be enough of an indicator of what is required of driver to correctly reset active BPF program. dev_xdp_uninstall() is also generalized as an iteration over all three supported mode. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200722064603.3350758-3-andriin@fb.com
|
#
cc4e3835 |
|
09-Jul-2020 |
Jakub Kicinski <kuba@kernel.org> |
udp_tunnel: add central NIC RX port offload infrastructure Cater to devices which: (a) may want to sleep in the callbacks; (b) only have IPv4 support; (c) need all the programming to happen while the netdev is up. Drivers attach UDP tunnel offload info struct to their netdevs, where they declare how many UDP ports of various tunnel types they support. Core takes care of tracking which ports to offload. Use a fixed-size array since this matches what almost all drivers do, and avoids a complexity and uncertainty around memory allocations in an atomic context. Make sure that tunnel drivers don't try to replay the ports when new NIC netdev is registered. Automatic replays would mess up reference counting, and will be removed completely once all drivers are converted. v4: - use a #define NULL to avoid build issues with CONFIG_INET=n. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c45a918 |
|
23-Jun-2020 |
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> |
net: netdevice.h: add a description for napi_defer_hard_irqs Changeset 6f8b12d661d0 ("net: napi: add hard irqs deferral feature") added a new element at struct net_device. Add a description for it, based on what's described at the changeset which added such feature. Fixes: 6f8b12d661d0 ("net: napi: add hard irqs deferral feature") Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/807a3840e7bc1562adefadb0535c9f47e6ab52e0.1592895969.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
#
fb7861d1 |
|
16-Jun-2020 |
Taehee Yoo <ap420073@gmail.com> |
net: core: reduce recursion limit value In the current code, ->ndo_start_xmit() can be executed recursively only 10 times because of stack memory. But, in the case of the vxlan, 10 recursion limit value results in a stack overflow. In the current code, the nested interface is limited by 8 depth. There is no critical reason that the recursion limitation value should be 10. So, it would be good to be the same value with the limitation value of nesting interface depth. Test commands: ip link add vxlan10 type vxlan vni 10 dstport 4789 srcport 4789 4789 ip link set vxlan10 up ip a a 192.168.10.1/24 dev vxlan10 ip n a 192.168.10.2 dev vxlan10 lladdr fc:22:33:44:55:66 nud permanent for i in {9..0} do let A=$i+1 ip link add vxlan$i type vxlan vni $i dstport 4789 srcport 4789 4789 ip link set vxlan$i up ip a a 192.168.$i.1/24 dev vxlan$i ip n a 192.168.$i.2 dev vxlan$i lladdr fc:22:33:44:55:66 nud permanent bridge fdb add fc:22:33:44:55:66 dev vxlan$A dst 192.168.$i.2 self done hping3 192.168.10.2 -2 -d 60000 Splat looks like: [ 103.814237][ T1127] ============================================================================= [ 103.871955][ T1127] BUG kmalloc-2k (Tainted: G B ): Padding overwritten. 0x00000000897a2e4f-0x000 [ 103.873187][ T1127] ----------------------------------------------------------------------------- [ 103.873187][ T1127] [ 103.874252][ T1127] INFO: Slab 0x000000005cccc724 objects=5 used=5 fp=0x0000000000000000 flags=0x10000000001020 [ 103.881323][ T1127] CPU: 3 PID: 1127 Comm: hping3 Tainted: G B 5.7.0+ #575 [ 103.882131][ T1127] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 103.883006][ T1127] Call Trace: [ 103.883324][ T1127] dump_stack+0x96/0xdb [ 103.883716][ T1127] slab_err+0xad/0xd0 [ 103.884106][ T1127] ? _raw_spin_unlock+0x1f/0x30 [ 103.884620][ T1127] ? get_partial_node.isra.78+0x140/0x360 [ 103.885214][ T1127] slab_pad_check.part.53+0xf7/0x160 [ 103.885769][ T1127] ? pskb_expand_head+0x110/0xe10 [ 103.886316][ T1127] check_slab+0x97/0xb0 [ 103.886763][ T1127] alloc_debug_processing+0x84/0x1a0 [ 103.887308][ T1127] ___slab_alloc+0x5a5/0x630 [ 103.887765][ T1127] ? pskb_expand_head+0x110/0xe10 [ 103.888265][ T1127] ? lock_downgrade+0x730/0x730 [ 103.888762][ T1127] ? pskb_expand_head+0x110/0xe10 [ 103.889244][ T1127] ? __slab_alloc+0x3e/0x80 [ 103.889675][ T1127] __slab_alloc+0x3e/0x80 [ 103.890108][ T1127] __kmalloc_node_track_caller+0xc7/0x420 [ ... ] Fixes: 11a766ce915f ("net: Increase xmit RECURSION_LIMIT to 10.") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
845e0ebb |
|
08-Jun-2020 |
Cong Wang <xiyou.wangcong@gmail.com> |
net: change addr_list_lock back to static key The dynamic key update for addr_list_lock still causes troubles, for example the following race condition still exists: CPU 0: CPU 1: (RCU read lock) (RTNL lock) dev_mc_seq_show() netdev_update_lockdep_key() -> lockdep_unregister_key() -> netif_addr_lock_bh() because lockdep doesn't provide an API to update it atomically. Therefore, we have to move it back to static keys and use subclass for nest locking like before. In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key changes"), I already reverted most parts of commit ab92d68fc22f ("net: core: add generic lockdep keys"). This patch reverts the rest and also part of commit f3b0a18bb6cb ("net: remove unnecessary variables and callback"). After this patch, addr_list_lock changes back to using static keys and subclasses to satisfy lockdep. Thanks to dev->lower_level, we do not have to change back to ->ndo_get_lock_subclass(). And hopefully this reduces some syzbot lockdep noises too. Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com Cc: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ceabef7d |
|
07-Jun-2020 |
Orson Zhai <orson.zhai@unisoc.com> |
dynamic_debug: add an option to enable dynamic debug for modules only Instead of enabling dynamic debug globally with CONFIG_DYNAMIC_DEBUG, CONFIG_DYNAMIC_DEBUG_CORE will only enable core function of dynamic debug. With the DYNAMIC_DEBUG_MODULE defined for any modules, dynamic debug will be tied to them. This is useful for people who only want to enable dynamic debug for kernel modules without worrying about kernel image size and memory consumption is increasing too much. [orson.zhai@unisoc.com: v2] Link: http://lkml.kernel.org/r/1587408228-10861-1-git-send-email-orson.unisoc@gmail.com Signed-off-by: Orson Zhai <orson.zhai@unisoc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Petr Mladek <pmladek@suse.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Randy Dunlap <rdunlap@infradead.org> Link: http://lkml.kernel.org/r/1586521984-5890-1-git-send-email-orson.unisoc@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cd16627f |
|
23-May-2020 |
Bartosz Golaszewski <bgolaszewski@baylibre.com> |
net: devres: provide devm_register_netdev() Provide devm_register_netdev() - a device resource managed variant of register_netdev(). This new helper will only work for net_device structs that are also already managed by devres. Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
607259a6 |
|
19-May-2020 |
Christoph Hellwig <hch@lst.de> |
net: add a new ndo_tunnel_ioctl method This method is used to properly allow kernel callers of the IPv4 route management ioctls. The exsting ip_tunnel_ioctl helper is renamed to ip_tunnel_ctl to better reflect that it doesn't directly implement ioctls touching user memory, and is used for the guts of ndo_tunnel_ctl implementations. A new ip_tunnel_ioctl helper is added that can be wired up directly to the ndo_do_ioctl method and takes care of the copy to and from userspace. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1a33e10e |
|
02-May-2020 |
Cong Wang <xiyou.wangcong@gmail.com> |
net: partially revert dynamic lockdep key changes This patch reverts the folowing commits: commit 064ff66e2bef84f1153087612032b5b9eab005bd "bonding: add missing netdev_update_lockdep_key()" commit 53d374979ef147ab51f5d632dfe20b14aebeccd0 "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()" commit 1f26c0d3d24125992ab0026b0dab16c08df947c7 "net: fix kernel-doc warning in <linux/netdevice.h>" commit ab92d68fc22f9afab480153bd82a20f6e2533769 "net: core: add generic lockdep keys" but keeps the addr_list_lock_key because we still lock addr_list_lock nestedly on stack devices, unlikely xmit_lock this is safe because we don't take addr_list_lock on any fast path. Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cff9f12b |
|
30-Apr-2020 |
Maor Gottlieb <maorg@mellanox.com> |
net/core: Introduce netdev_get_xmit_slave Add new ndo to get the xmit slave of master device. The reference counters are not incremented so the caller must be careful with locks. User can ask to get the xmit slave assume all the slaves can transmit by set all_slaves arg to true. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
6f8b12d6 |
|
22-Apr-2020 |
Eric Dumazet <edumazet@google.com> |
net: napi: add hard irqs deferral feature Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer") we added the ability to arm one high resolution timer, that we used to keep not-complete packets in GRO engine a bit longer, hoping that further frames might be added to them. Since then, we added the napi_complete_done() interface, and commit 364b6055738b ("net: busy-poll: return busypolling status to drivers") allowed drivers to avoid re-arming NIC interrupts if we made a promise that their NAPI poll() handler would be called in the near future. This infrastructure can be leveraged, thanks to a new device parameter, which allows to arm the napi hrtimer, instead of re-arming the device hard IRQ. We have noticed that on some servers with 32 RX queues or more, the chit-chat between the NIC and the host caused by IRQ delivery and re-arming could hurt throughput by ~20% on 100Gbit NIC. In contrast, hrtimers are using local (percpu) resources and might have lower cost. The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy than gro_flush_timeout (/sys/class/net/ethX/) By default, both gro_flush_timeout and napi_defer_hard_irqs are zero. This patch does not change the prior behavior of gro_flush_timeout if used alone : NIC hard irqs should be rearmed as before. One concrete usage can be : echo 20000 >/sys/class/net/eth1/gro_flush_timeout echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs If at least one packet is retired, then we will reset napi counter to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans of the queue. On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ avoidance was only possible if napi->poll() was exhausting its budget and not call napi_complete_done(). This feature also can be used to work around some non-optimal NIC irq coalescing strategies. Having the ability to insert XX usec delays between each napi->poll() can increase cache efficiency, since we increase batch sizes. It also keeps serving cpus not idle too long, reducing tail latencies. Co-developed-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eec517cd |
|
19-Apr-2020 |
Andrew Lunn <andrew@lunn.ch> |
net: Add IF_OPER_TESTING RFC 2863 defines the operational state testing. Add support for this state, both as a IF_LINK_MODE_ and __LINK_STATE_. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
92234c8f |
|
25-Mar-2020 |
Toke Høiland-Jørgensen <toke@redhat.com> |
xdp: Support specifying expected existing program when attaching XDP While it is currently possible for userspace to specify that an existing XDP program should not be replaced when attaching to an interface, there is no mechanism to safely replace a specific XDP program with another. This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be set along with IFLA_XDP_FD. If set, the kernel will check that the program currently loaded on the interface matches the expected one, and fail the operation if it does not. This corresponds to a 'cmpxchg' memory operation. Setting the new attribute with a negative value means that no program is expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST flag. A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly request checking of the EXPECTED_FD attribute. This is needed for userspace to discover whether the kernel supports the new attribute. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk
|
#
30e9bb84 |
|
25-Mar-2020 |
Antoine Tenart <atenart@kernel.org> |
net: add a reference to MACsec ops in net_device This patch adds a reference to MACsec ops to the net_device structure, allowing net device drivers to implement offloading operations for MACsec. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Mark Starovoytov <mstarovoitov@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
357b6cc5 |
|
18-Mar-2020 |
Daniel Borkmann <daniel@iogearbox.net> |
netfilter: revert introduction of egress hook This reverts the following commits: 8537f78647c0 ("netfilter: Introduce egress hook") 5418d3881e1f ("netfilter: Generalize ingress hook") b030f194aed2 ("netfilter: Rename ingress hook include file") >From the discussion in [0], the author's main motivation to add a hook in fast path is for an out of tree kernel module, which is a red flag to begin with. Other mentioned potential use cases like NAT{64,46} is on future extensions w/o concrete code in the tree yet. Revert as suggested [1] given the weak justification to add more hooks to critical fast-path. [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/ [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Miller <davem@davemloft.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Alexei Starovoitov <ast@kernel.org> Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8537f786 |
|
10-Mar-2020 |
Lukas Wunner <lukas@wunner.de> |
netfilter: Introduce egress hook Commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key") introduced the ability to classify packets on ingress. Allow the same on egress. Position the hook immediately before a packet is handed to tc and then sent out on an interface, thereby mirroring the ingress order. This order allows marking packets in the netfilter egress hook and subsequently using the mark in tc. Another benefit of this order is consistency with a lot of existing documentation which says that egress tc is performed after netfilter hooks. Egress hooks already exist for the most common protocols, such as NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because they are executed earlier during packet processing. However for more exotic protocols, there is currently no provision to apply netfilter on egress. A common workaround is to enslave the interface to a bridge and use ebtables, or to resort to tc. But when the ingress hook was introduced, consensus was that users should be given the choice to use netfilter or tc, whichever tool suits their needs best: https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/ This hook is also useful for NAT46/NAT64, tunneling and filtering of locally generated af_packet traffic such as dhclient. There have also been occasional user requests for a netfilter egress hook in the past, e.g.: https://www.spinics.net/lists/netfilter/msg50038.html Performance measurements with pktgen surprisingly show a speedup rather than a slowdown with this commit: * Without this commit: Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags) 2920481pps 1401Mb/sec (1401830880bps) errors: 0 * With this commit: Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags) 2941410pps 1411Mb/sec (1411876800bps) errors: 0 * Without this commit + tc egress: Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags) 2562631pps 1230Mb/sec (1230062880bps) errors: 0 * With this commit + tc egress: Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags) 2659259pps 1276Mb/sec (1276444320bps) errors: 0 * With this commit + nft egress: Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags) 2413320pps 1158Mb/sec (1158393600bps) errors: 0 Tested on a bare-metal Core i7-3615QM, each measurement was performed three times to verify that the numbers are stable. Commands to perform a measurement: modprobe pktgen echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000 Commands for testing tc egress: tc qdisc add dev lo clsact tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32 Commands for testing nft egress: nft add table netdev t nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \} nft add rule netdev t co ip daddr 4.3.2.1/32 drop All testing was performed on the loopback interface to avoid distorting measurements by the packet handling in the low-level Ethernet driver. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
aaca9408 |
|
05-Mar-2020 |
Petr Machata <petrm@mellanox.com> |
net: sched: Make FIFO Qdisc offloadable Invoke ndo_setup_tc() as appropriate to signal init / replacement, destroying and dumping of pFIFO / bFIFO Qdisc. A lot of the FIFO logic is used for pFIFO_head_drop as well, but that's a semantically very different Qdisc that isn't really in the same boat as pFIFO / bFIFO. Split some of the functions to keep the Qdisc intact. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb4cf02d |
|
02-Mar-2020 |
Gustavo A. R. Silva <gustavo@embeddedor.com> |
netdevice: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7151affe |
|
15-Feb-2020 |
Taehee Yoo <ap420073@gmail.com> |
net: export netdev_next_lower_dev_rcu() netdev_next_lower_dev_rcu() will be used to implement a function, which is to walk all lower interfaces. There are already functions that they walk their lower interface. (netdev_walk_all_lower_dev_rcu, netdev_walk_all_lower_dev()). But, there would be cases that couldn't be covered by given netdev_walk_all_lower_dev_{rcu}() function. So, some modules would want to implement own function, which is to walk all lower interfaces. In the next patch, netdev_next_lower_dev_rcu() will be used. In addition, this patch removes two unused prototypes in netdevice.h. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a1fa83bd |
|
12-Feb-2020 |
Randy Dunlap <rdunlap@infradead.org> |
netdevice.h: fix all kernel-doc and Sphinx warnings Eliminate all kernel-doc and Sphinx warnings in <linux/netdevice.h>. Fixes these warnings: ../include/linux/netdevice.h:2100: warning: Function parameter or member 'gso_partial_features' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'l3mdev_ops' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xfrmdev_ops' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'tlsdev_ops' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'name_assign_type' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'ieee802154_ptr' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'mpls_ptr' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xdp_prog' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'gro_flush_timeout' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xdp_bulkq' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xps_cpus_map' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'xps_rxqs_map' not described in 'net_device' ../include/linux/netdevice.h:2100: warning: Function parameter or member 'qdisc_hash' not described in 'net_device' ../include/linux/netdevice.h:3552: WARNING: Inline emphasis start-string without end-string. ../include/linux/netdevice.h:3552: WARNING: Inline emphasis start-string without end-string. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6a94b8cc |
|
26-Jan-2020 |
Michal Kubecek <mkubecek@suse.cz> |
ethtool: provide message mask with DEBUG_GET request Implement DEBUG_GET request to get debugging settings for a device. At the moment, only message mask corresponding to message level as reported by ETHTOOL_GMSGLVL ioctl request is provided. (It is called message level in ioctl interface but almost all drivers interpret it as a bit mask.) As part of the implementation, provide symbolic names for message mask bits as ETH_SS_MSG_CLASSES string set. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93642e14 |
|
24-Jan-2020 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce dev_net notifier register/unregister variants Introduce dev_net variants of netdev notifier register/unregister functions and allow per-net notifier to follow the netdevice into the namespace it is moved to. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3a1296a3 |
|
25-Jan-2020 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Support GRO/GSO fraglist chaining. This patch adds the core functions to chain/unchain GSO skbs at the frag_list pointer. This also adds a new GSO type SKB_GSO_FRAGLIST and a is_flist flag to napi_gro_cb which indicates that this flow will be GROed by fraglist chaining. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b335832 |
|
25-Jan-2020 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Add fraglist GRO/GSO feature flags This adds new Fraglist GRO/GSO feature flags. They will be used to configure fraglist GRO/GSO what will be implemented with some followup paches. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ef6aadcc |
|
24-Jan-2020 |
Petr Machata <petrm@mellanox.com> |
net: sched: Make TBF Qdisc offloadable Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying and dumping of TBF Qdisc. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d836f5c6 |
|
21-Jan-2020 |
Eric Dumazet <edumazet@google.com> |
net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link() rtnl_create_link() needs to apply dev->min_mtu and dev->max_mtu checks that we apply in do_setlink() Otherwise malicious users can crash the kernel, for example after an integer overflow : BUG: KASAN: use-after-free in memset include/linux/string.h:365 [inline] BUG: KASAN: use-after-free in __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238 Write of size 32 at addr ffff88819f20b9c0 by task swapper/0/0 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506 kasan_report+0x12/0x20 mm/kasan/common.c:639 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192 memset+0x24/0x40 mm/kasan/common.c:108 memset include/linux/string.h:365 [inline] __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238 alloc_skb include/linux/skbuff.h:1049 [inline] alloc_skb_with_frags+0x93/0x590 net/core/skbuff.c:5664 sock_alloc_send_pskb+0x7ad/0x920 net/core/sock.c:2242 sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2259 mld_newpack+0x1d7/0x7f0 net/ipv6/mcast.c:1609 add_grhead.isra.0+0x299/0x370 net/ipv6/mcast.c:1713 add_grec+0x7db/0x10b0 net/ipv6/mcast.c:1844 mld_send_cr net/ipv6/mcast.c:1970 [inline] mld_ifc_timer_expire+0x3d3/0x950 net/ipv6/mcast.c:2477 call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404 expire_timers kernel/time/timer.c:1449 [inline] __run_timers kernel/time/timer.c:1773 [inline] __run_timers kernel/time/timer.c:1740 [inline] run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786 __do_softirq+0x262/0x98c kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0x19b/0x1e0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1137 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829 </IRQ> RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61 Code: 98 6b ea f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 44 1c 60 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 34 1c 60 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 4e 5d 9a f9 e8 79 RSP: 0018:ffffffff89807ce8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13 RAX: 1ffffffff13266ae RBX: ffffffff8987a1c0 RCX: 0000000000000000 RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffffffff8987aa54 RBP: ffffffff89807d18 R08: ffffffff8987a1c0 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000 R13: ffffffff8a799980 R14: 0000000000000000 R15: 0000000000000000 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:690 default_idle_call+0x84/0xb0 kernel/sched/idle.c:94 cpuidle_idle_call kernel/sched/idle.c:154 [inline] do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361 rest_init+0x23b/0x371 init/main.c:451 arch_call_rest_init+0xe/0x1b start_kernel+0x904/0x943 init/main.c:784 x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490 x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:471 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242 The buggy address belongs to the page: page:ffffea00067c82c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 raw: 057ffe0000000000 ffffea00067c82c8 ffffea00067c82c8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88819f20b880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88819f20b900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff88819f20b980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88819f20ba00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88819f20ba80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
75ccae62 |
|
16-Jan-2020 |
Toke Høiland-Jørgensen <toke@redhat.com> |
xdp: Move devmap bulk queue into struct net_device Commit 96360004b862 ("xdp: Make devmap flush_list common for all map instances"), changed devmap flushing to be a global operation instead of a per-map operation. However, the queue structure used for bulking was still allocated as part of the containing map. This patch moves the devmap bulk queue into struct net_device. The motivation for this is reusing it for the non-map variant of XDP_REDIRECT, which will be changed in a subsequent commit. To avoid other fields of struct net_device moving to different cache lines, we also move a couple of other members around. We defer the actual allocation of the bulk queue structure until the NETDEV_REGISTER notification devmap.c. This makes it possible to check for ndo_xdp_xmit support before allocating the structure, which is not possible at the time struct net_device is allocated. However, we keep the freeing in free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER. Because of this change, we lose the reference back to the map that originated the redirect, so change the tracepoint to always return 0 as the map ID and index. Otherwise no functional change is intended with this patch. After this patch, the relevant part of struct net_device looks like this, according to pahole: /* --- cacheline 14 boundary (896 bytes) --- */ struct netdev_queue * _tx __attribute__((__aligned__(64))); /* 896 8 */ unsigned int num_tx_queues; /* 904 4 */ unsigned int real_num_tx_queues; /* 908 4 */ struct Qdisc * qdisc; /* 912 8 */ unsigned int tx_queue_len; /* 920 4 */ spinlock_t tx_global_lock; /* 924 4 */ struct xdp_dev_bulk_queue * xdp_bulkq; /* 928 8 */ struct xps_dev_maps * xps_cpus_map; /* 936 8 */ struct xps_dev_maps * xps_rxqs_map; /* 944 8 */ struct mini_Qdisc * miniq_egress; /* 952 8 */ /* --- cacheline 15 boundary (960 bytes) --- */ struct hlist_head qdisc_hash[16]; /* 960 128 */ /* --- cacheline 17 boundary (1088 bytes) --- */ struct timer_list watchdog_timer; /* 1088 40 */ /* XXX last struct has 4 bytes of padding */ int watchdog_timeo; /* 1128 4 */ /* XXX 4 bytes hole, try to pack */ struct list_head todo_list; /* 1136 16 */ /* --- cacheline 18 boundary (1152 bytes) --- */ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk
|
#
b39c78b2 |
|
02-Jan-2020 |
Li RongQing <lirongqing@baidu.com> |
net: remove the check argument from __skb_gro_checksum_convert The argument is always ignored, so remove it. Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b08d6c1 |
|
27-Dec-2019 |
Michal Kubecek <mkubecek@suse.cz> |
ethtool: support for netlink notifications Add infrastructure for ethtool netlink notifications. There is only one multicast group "monitor" which is used to notify userspace about changes and actions performed. Notification messages (types using suffix _NTF) share the format with replies to GET requests. Notifications are supposed to be broadcasted on every configuration change, whether it is done using the netlink interface or ioctl one. Netlink SET requests only trigger a notification if some data is actually changed. To trigger an ethtool notification, both ethtool netlink and external code use ethtool_notify() helper. This helper requires RTNL to be held and may sleep. Handlers sending messages for specific notification message types are registered in ethnl_notify_handlers array. As notifications can be triggered from other code, ethnl_ok flag is used to prevent an attempt to send notification before genetlink family is registered. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d35eb52b |
|
18-Dec-2019 |
Petr Machata <petrm@mellanox.com> |
net: sch_ets: Make the ETS qdisc offloadable Add hooks at appropriate points to make it possible to offload the ETS Qdisc. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1f26c0d3 |
|
16-Dec-2019 |
Randy Dunlap <rdunlap@infradead.org> |
net: fix kernel-doc warning in <linux/netdevice.h> Fix missing '*' kernel-doc notation that causes this warning: ../include/linux/netdevice.h:1779: warning: bad line: spinlock Fixes: ab92d68fc22f ("net: core: add generic lockdep keys") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f5e70c8 |
|
19-Nov-2019 |
Lukas Wunner <lukas@wunner.de> |
netfilter: Document ingress hook Amend kerneldoc of struct net_device to fix a "make htmldocs" warning: include/linux/netdevice.h:2045: warning: Function parameter or member 'nf_hooks_ingress' not described in 'net_device' Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
0290bd29 |
|
10-Dec-2019 |
Michael S. Tsirkin <mst@redhat.com> |
netdev: pass the stuck queue to the timeout handler This allows incrementing the correct timeout statistic without any mess. Down the road, devices can learn to reset just the specific queue. The patch was generated with the following script: use strict; use warnings; our $^I = '.bak'; my @work = ( ["arch/m68k/emu/nfeth.c", "nfeth_tx_timeout"], ["arch/um/drivers/net_kern.c", "uml_net_tx_timeout"], ["arch/um/drivers/vector_kern.c", "vector_net_tx_timeout"], ["arch/xtensa/platforms/iss/network.c", "iss_net_tx_timeout"], ["drivers/char/pcmcia/synclink_cs.c", "hdlcdev_tx_timeout"], ["drivers/infiniband/ulp/ipoib/ipoib_main.c", "ipoib_timeout"], ["drivers/infiniband/ulp/ipoib/ipoib_main.c", "ipoib_timeout"], ["drivers/message/fusion/mptlan.c", "mpt_lan_tx_timeout"], ["drivers/misc/sgi-xp/xpnet.c", "xpnet_dev_tx_timeout"], ["drivers/net/appletalk/cops.c", "cops_timeout"], ["drivers/net/arcnet/arcdevice.h", "arcnet_timeout"], ["drivers/net/arcnet/arcnet.c", "arcnet_timeout"], ["drivers/net/arcnet/com20020.c", "arcnet_timeout"], ["drivers/net/ethernet/3com/3c509.c", "el3_tx_timeout"], ["drivers/net/ethernet/3com/3c515.c", "corkscrew_timeout"], ["drivers/net/ethernet/3com/3c574_cs.c", "el3_tx_timeout"], ["drivers/net/ethernet/3com/3c589_cs.c", "el3_tx_timeout"], ["drivers/net/ethernet/3com/3c59x.c", "vortex_tx_timeout"], ["drivers/net/ethernet/3com/3c59x.c", "vortex_tx_timeout"], ["drivers/net/ethernet/3com/typhoon.c", "typhoon_tx_timeout"], ["drivers/net/ethernet/8390/8390.h", "ei_tx_timeout"], ["drivers/net/ethernet/8390/8390.h", "eip_tx_timeout"], ["drivers/net/ethernet/8390/8390.c", "ei_tx_timeout"], ["drivers/net/ethernet/8390/8390p.c", "eip_tx_timeout"], ["drivers/net/ethernet/8390/ax88796.c", "ax_ei_tx_timeout"], ["drivers/net/ethernet/8390/axnet_cs.c", "axnet_tx_timeout"], ["drivers/net/ethernet/8390/etherh.c", "__ei_tx_timeout"], ["drivers/net/ethernet/8390/hydra.c", "__ei_tx_timeout"], ["drivers/net/ethernet/8390/mac8390.c", "__ei_tx_timeout"], ["drivers/net/ethernet/8390/mcf8390.c", "__ei_tx_timeout"], ["drivers/net/ethernet/8390/lib8390.c", "__ei_tx_timeout"], ["drivers/net/ethernet/8390/ne2k-pci.c", "ei_tx_timeout"], ["drivers/net/ethernet/8390/pcnet_cs.c", "ei_tx_timeout"], ["drivers/net/ethernet/8390/smc-ultra.c", "ei_tx_timeout"], ["drivers/net/ethernet/8390/wd.c", "ei_tx_timeout"], ["drivers/net/ethernet/8390/zorro8390.c", "__ei_tx_timeout"], ["drivers/net/ethernet/adaptec/starfire.c", "tx_timeout"], ["drivers/net/ethernet/agere/et131x.c", "et131x_tx_timeout"], ["drivers/net/ethernet/allwinner/sun4i-emac.c", "emac_timeout"], ["drivers/net/ethernet/alteon/acenic.c", "ace_watchdog"], ["drivers/net/ethernet/amazon/ena/ena_netdev.c", "ena_tx_timeout"], ["drivers/net/ethernet/amd/7990.h", "lance_tx_timeout"], ["drivers/net/ethernet/amd/7990.c", "lance_tx_timeout"], ["drivers/net/ethernet/amd/a2065.c", "lance_tx_timeout"], ["drivers/net/ethernet/amd/am79c961a.c", "am79c961_timeout"], ["drivers/net/ethernet/amd/amd8111e.c", "amd8111e_tx_timeout"], ["drivers/net/ethernet/amd/ariadne.c", "ariadne_tx_timeout"], ["drivers/net/ethernet/amd/atarilance.c", "lance_tx_timeout"], ["drivers/net/ethernet/amd/au1000_eth.c", "au1000_tx_timeout"], ["drivers/net/ethernet/amd/declance.c", "lance_tx_timeout"], ["drivers/net/ethernet/amd/lance.c", "lance_tx_timeout"], ["drivers/net/ethernet/amd/mvme147.c", "lance_tx_timeout"], ["drivers/net/ethernet/amd/ni65.c", "ni65_timeout"], ["drivers/net/ethernet/amd/nmclan_cs.c", "mace_tx_timeout"], ["drivers/net/ethernet/amd/pcnet32.c", "pcnet32_tx_timeout"], ["drivers/net/ethernet/amd/sunlance.c", "lance_tx_timeout"], ["drivers/net/ethernet/amd/xgbe/xgbe-drv.c", "xgbe_tx_timeout"], ["drivers/net/ethernet/apm/xgene-v2/main.c", "xge_timeout"], ["drivers/net/ethernet/apm/xgene/xgene_enet_main.c", "xgene_enet_timeout"], ["drivers/net/ethernet/apple/macmace.c", "mace_tx_timeout"], ["drivers/net/ethernet/atheros/ag71xx.c", "ag71xx_tx_timeout"], ["drivers/net/ethernet/atheros/alx/main.c", "alx_tx_timeout"], ["drivers/net/ethernet/atheros/atl1c/atl1c_main.c", "atl1c_tx_timeout"], ["drivers/net/ethernet/atheros/atl1e/atl1e_main.c", "atl1e_tx_timeout"], ["drivers/net/ethernet/atheros/atlx/atl.c", "atlx_tx_timeout"], ["drivers/net/ethernet/atheros/atlx/atl1.c", "atlx_tx_timeout"], ["drivers/net/ethernet/atheros/atlx/atl2.c", "atl2_tx_timeout"], ["drivers/net/ethernet/broadcom/b44.c", "b44_tx_timeout"], ["drivers/net/ethernet/broadcom/bcmsysport.c", "bcm_sysport_tx_timeout"], ["drivers/net/ethernet/broadcom/bnx2.c", "bnx2_tx_timeout"], ["drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h", "bnx2x_tx_timeout"], ["drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c", "bnx2x_tx_timeout"], ["drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c", "bnx2x_tx_timeout"], ["drivers/net/ethernet/broadcom/bnxt/bnxt.c", "bnxt_tx_timeout"], ["drivers/net/ethernet/broadcom/genet/bcmgenet.c", "bcmgenet_timeout"], ["drivers/net/ethernet/broadcom/sb1250-mac.c", "sbmac_tx_timeout"], ["drivers/net/ethernet/broadcom/tg3.c", "tg3_tx_timeout"], ["drivers/net/ethernet/calxeda/xgmac.c", "xgmac_tx_timeout"], ["drivers/net/ethernet/cavium/liquidio/lio_main.c", "liquidio_tx_timeout"], ["drivers/net/ethernet/cavium/liquidio/lio_vf_main.c", "liquidio_tx_timeout"], ["drivers/net/ethernet/cavium/liquidio/lio_vf_rep.c", "lio_vf_rep_tx_timeout"], ["drivers/net/ethernet/cavium/thunder/nicvf_main.c", "nicvf_tx_timeout"], ["drivers/net/ethernet/cirrus/cs89x0.c", "net_timeout"], ["drivers/net/ethernet/cisco/enic/enic_main.c", "enic_tx_timeout"], ["drivers/net/ethernet/cisco/enic/enic_main.c", "enic_tx_timeout"], ["drivers/net/ethernet/cortina/gemini.c", "gmac_tx_timeout"], ["drivers/net/ethernet/davicom/dm9000.c", "dm9000_timeout"], ["drivers/net/ethernet/dec/tulip/de2104x.c", "de_tx_timeout"], ["drivers/net/ethernet/dec/tulip/tulip_core.c", "tulip_tx_timeout"], ["drivers/net/ethernet/dec/tulip/winbond-840.c", "tx_timeout"], ["drivers/net/ethernet/dlink/dl2k.c", "rio_tx_timeout"], ["drivers/net/ethernet/dlink/sundance.c", "tx_timeout"], ["drivers/net/ethernet/emulex/benet/be_main.c", "be_tx_timeout"], ["drivers/net/ethernet/ethoc.c", "ethoc_tx_timeout"], ["drivers/net/ethernet/faraday/ftgmac100.c", "ftgmac100_tx_timeout"], ["drivers/net/ethernet/fealnx.c", "fealnx_tx_timeout"], ["drivers/net/ethernet/freescale/dpaa/dpaa_eth.c", "dpaa_tx_timeout"], ["drivers/net/ethernet/freescale/fec_main.c", "fec_timeout"], ["drivers/net/ethernet/freescale/fec_mpc52xx.c", "mpc52xx_fec_tx_timeout"], ["drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c", "fs_timeout"], ["drivers/net/ethernet/freescale/gianfar.c", "gfar_timeout"], ["drivers/net/ethernet/freescale/ucc_geth.c", "ucc_geth_timeout"], ["drivers/net/ethernet/fujitsu/fmvj18x_cs.c", "fjn_tx_timeout"], ["drivers/net/ethernet/google/gve/gve_main.c", "gve_tx_timeout"], ["drivers/net/ethernet/hisilicon/hip04_eth.c", "hip04_timeout"], ["drivers/net/ethernet/hisilicon/hix5hd2_gmac.c", "hix5hd2_net_timeout"], ["drivers/net/ethernet/hisilicon/hns/hns_enet.c", "hns_nic_net_timeout"], ["drivers/net/ethernet/hisilicon/hns3/hns3_enet.c", "hns3_nic_net_timeout"], ["drivers/net/ethernet/huawei/hinic/hinic_main.c", "hinic_tx_timeout"], ["drivers/net/ethernet/i825xx/82596.c", "i596_tx_timeout"], ["drivers/net/ethernet/i825xx/ether1.c", "ether1_timeout"], ["drivers/net/ethernet/i825xx/lib82596.c", "i596_tx_timeout"], ["drivers/net/ethernet/i825xx/sun3_82586.c", "sun3_82586_timeout"], ["drivers/net/ethernet/ibm/ehea/ehea_main.c", "ehea_tx_watchdog"], ["drivers/net/ethernet/ibm/emac/core.c", "emac_tx_timeout"], ["drivers/net/ethernet/ibm/emac/core.c", "emac_tx_timeout"], ["drivers/net/ethernet/ibm/ibmvnic.c", "ibmvnic_tx_timeout"], ["drivers/net/ethernet/intel/e100.c", "e100_tx_timeout"], ["drivers/net/ethernet/intel/e1000/e1000_main.c", "e1000_tx_timeout"], ["drivers/net/ethernet/intel/e1000e/netdev.c", "e1000_tx_timeout"], ["drivers/net/ethernet/intel/fm10k/fm10k_netdev.c", "fm10k_tx_timeout"], ["drivers/net/ethernet/intel/i40e/i40e_main.c", "i40e_tx_timeout"], ["drivers/net/ethernet/intel/iavf/iavf_main.c", "iavf_tx_timeout"], ["drivers/net/ethernet/intel/ice/ice_main.c", "ice_tx_timeout"], ["drivers/net/ethernet/intel/ice/ice_main.c", "ice_tx_timeout"], ["drivers/net/ethernet/intel/igb/igb_main.c", "igb_tx_timeout"], ["drivers/net/ethernet/intel/igbvf/netdev.c", "igbvf_tx_timeout"], ["drivers/net/ethernet/intel/ixgb/ixgb_main.c", "ixgb_tx_timeout"], ["drivers/net/ethernet/intel/ixgbe/ixgbe_debugfs.c", "adapter->netdev->netdev_ops->ndo_tx_timeout(adapter->netdev);"], ["drivers/net/ethernet/intel/ixgbe/ixgbe_main.c", "ixgbe_tx_timeout"], ["drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c", "ixgbevf_tx_timeout"], ["drivers/net/ethernet/jme.c", "jme_tx_timeout"], ["drivers/net/ethernet/korina.c", "korina_tx_timeout"], ["drivers/net/ethernet/lantiq_etop.c", "ltq_etop_tx_timeout"], ["drivers/net/ethernet/marvell/mv643xx_eth.c", "mv643xx_eth_tx_timeout"], ["drivers/net/ethernet/marvell/pxa168_eth.c", "pxa168_eth_tx_timeout"], ["drivers/net/ethernet/marvell/skge.c", "skge_tx_timeout"], ["drivers/net/ethernet/marvell/sky2.c", "sky2_tx_timeout"], ["drivers/net/ethernet/marvell/sky2.c", "sky2_tx_timeout"], ["drivers/net/ethernet/mediatek/mtk_eth_soc.c", "mtk_tx_timeout"], ["drivers/net/ethernet/mellanox/mlx4/en_netdev.c", "mlx4_en_tx_timeout"], ["drivers/net/ethernet/mellanox/mlx4/en_netdev.c", "mlx4_en_tx_timeout"], ["drivers/net/ethernet/mellanox/mlx5/core/en_main.c", "mlx5e_tx_timeout"], ["drivers/net/ethernet/micrel/ks8842.c", "ks8842_tx_timeout"], ["drivers/net/ethernet/micrel/ksz884x.c", "netdev_tx_timeout"], ["drivers/net/ethernet/microchip/enc28j60.c", "enc28j60_tx_timeout"], ["drivers/net/ethernet/microchip/encx24j600.c", "encx24j600_tx_timeout"], ["drivers/net/ethernet/natsemi/sonic.h", "sonic_tx_timeout"], ["drivers/net/ethernet/natsemi/sonic.c", "sonic_tx_timeout"], ["drivers/net/ethernet/natsemi/jazzsonic.c", "sonic_tx_timeout"], ["drivers/net/ethernet/natsemi/macsonic.c", "sonic_tx_timeout"], ["drivers/net/ethernet/natsemi/natsemi.c", "ns_tx_timeout"], ["drivers/net/ethernet/natsemi/ns83820.c", "ns83820_tx_timeout"], ["drivers/net/ethernet/natsemi/xtsonic.c", "sonic_tx_timeout"], ["drivers/net/ethernet/neterion/s2io.h", "s2io_tx_watchdog"], ["drivers/net/ethernet/neterion/s2io.c", "s2io_tx_watchdog"], ["drivers/net/ethernet/neterion/vxge/vxge-main.c", "vxge_tx_watchdog"], ["drivers/net/ethernet/netronome/nfp/nfp_net_common.c", "nfp_net_tx_timeout"], ["drivers/net/ethernet/nvidia/forcedeth.c", "nv_tx_timeout"], ["drivers/net/ethernet/nvidia/forcedeth.c", "nv_tx_timeout"], ["drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c", "pch_gbe_tx_timeout"], ["drivers/net/ethernet/packetengines/hamachi.c", "hamachi_tx_timeout"], ["drivers/net/ethernet/packetengines/yellowfin.c", "yellowfin_tx_timeout"], ["drivers/net/ethernet/pensando/ionic/ionic_lif.c", "ionic_tx_timeout"], ["drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c", "netxen_tx_timeout"], ["drivers/net/ethernet/qlogic/qla3xxx.c", "ql3xxx_tx_timeout"], ["drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c", "qlcnic_tx_timeout"], ["drivers/net/ethernet/qualcomm/emac/emac.c", "emac_tx_timeout"], ["drivers/net/ethernet/qualcomm/qca_spi.c", "qcaspi_netdev_tx_timeout"], ["drivers/net/ethernet/qualcomm/qca_uart.c", "qcauart_netdev_tx_timeout"], ["drivers/net/ethernet/rdc/r6040.c", "r6040_tx_timeout"], ["drivers/net/ethernet/realtek/8139cp.c", "cp_tx_timeout"], ["drivers/net/ethernet/realtek/8139too.c", "rtl8139_tx_timeout"], ["drivers/net/ethernet/realtek/atp.c", "tx_timeout"], ["drivers/net/ethernet/realtek/r8169_main.c", "rtl8169_tx_timeout"], ["drivers/net/ethernet/renesas/ravb_main.c", "ravb_tx_timeout"], ["drivers/net/ethernet/renesas/sh_eth.c", "sh_eth_tx_timeout"], ["drivers/net/ethernet/renesas/sh_eth.c", "sh_eth_tx_timeout"], ["drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c", "sxgbe_tx_timeout"], ["drivers/net/ethernet/seeq/ether3.c", "ether3_timeout"], ["drivers/net/ethernet/seeq/sgiseeq.c", "timeout"], ["drivers/net/ethernet/sfc/efx.c", "efx_watchdog"], ["drivers/net/ethernet/sfc/falcon/efx.c", "ef4_watchdog"], ["drivers/net/ethernet/sgi/ioc3-eth.c", "ioc3_timeout"], ["drivers/net/ethernet/sgi/meth.c", "meth_tx_timeout"], ["drivers/net/ethernet/silan/sc92031.c", "sc92031_tx_timeout"], ["drivers/net/ethernet/sis/sis190.c", "sis190_tx_timeout"], ["drivers/net/ethernet/sis/sis900.c", "sis900_tx_timeout"], ["drivers/net/ethernet/smsc/epic100.c", "epic_tx_timeout"], ["drivers/net/ethernet/smsc/smc911x.c", "smc911x_timeout"], ["drivers/net/ethernet/smsc/smc9194.c", "smc_timeout"], ["drivers/net/ethernet/smsc/smc91c92_cs.c", "smc_tx_timeout"], ["drivers/net/ethernet/smsc/smc91x.c", "smc_timeout"], ["drivers/net/ethernet/stmicro/stmmac/stmmac_main.c", "stmmac_tx_timeout"], ["drivers/net/ethernet/sun/cassini.c", "cas_tx_timeout"], ["drivers/net/ethernet/sun/ldmvsw.c", "sunvnet_tx_timeout_common"], ["drivers/net/ethernet/sun/niu.c", "niu_tx_timeout"], ["drivers/net/ethernet/sun/sunbmac.c", "bigmac_tx_timeout"], ["drivers/net/ethernet/sun/sungem.c", "gem_tx_timeout"], ["drivers/net/ethernet/sun/sunhme.c", "happy_meal_tx_timeout"], ["drivers/net/ethernet/sun/sunqe.c", "qe_tx_timeout"], ["drivers/net/ethernet/sun/sunvnet.c", "sunvnet_tx_timeout_common"], ["drivers/net/ethernet/sun/sunvnet_common.c", "sunvnet_tx_timeout_common"], ["drivers/net/ethernet/sun/sunvnet_common.h", "sunvnet_tx_timeout_common"], ["drivers/net/ethernet/synopsys/dwc-xlgmac-net.c", "xlgmac_tx_timeout"], ["drivers/net/ethernet/ti/cpmac.c", "cpmac_tx_timeout"], ["drivers/net/ethernet/ti/cpsw.c", "cpsw_ndo_tx_timeout"], ["drivers/net/ethernet/ti/cpsw_priv.c", "cpsw_ndo_tx_timeout"], ["drivers/net/ethernet/ti/cpsw_priv.h", "cpsw_ndo_tx_timeout"], ["drivers/net/ethernet/ti/davinci_emac.c", "emac_dev_tx_timeout"], ["drivers/net/ethernet/ti/netcp_core.c", "netcp_ndo_tx_timeout"], ["drivers/net/ethernet/ti/tlan.c", "tlan_tx_timeout"], ["drivers/net/ethernet/toshiba/ps3_gelic_net.h", "gelic_net_tx_timeout"], ["drivers/net/ethernet/toshiba/ps3_gelic_net.c", "gelic_net_tx_timeout"], ["drivers/net/ethernet/toshiba/ps3_gelic_wireless.c", "gelic_net_tx_timeout"], ["drivers/net/ethernet/toshiba/spider_net.c", "spider_net_tx_timeout"], ["drivers/net/ethernet/toshiba/tc35815.c", "tc35815_tx_timeout"], ["drivers/net/ethernet/via/via-rhine.c", "rhine_tx_timeout"], ["drivers/net/ethernet/wiznet/w5100.c", "w5100_tx_timeout"], ["drivers/net/ethernet/wiznet/w5300.c", "w5300_tx_timeout"], ["drivers/net/ethernet/xilinx/xilinx_emaclite.c", "xemaclite_tx_timeout"], ["drivers/net/ethernet/xircom/xirc2ps_cs.c", "xirc_tx_timeout"], ["drivers/net/fjes/fjes_main.c", "fjes_tx_retry"], ["drivers/net/slip/slip.c", "sl_tx_timeout"], ["include/linux/usb/usbnet.h", "usbnet_tx_timeout"], ["drivers/net/usb/aqc111.c", "usbnet_tx_timeout"], ["drivers/net/usb/asix_devices.c", "usbnet_tx_timeout"], ["drivers/net/usb/asix_devices.c", "usbnet_tx_timeout"], ["drivers/net/usb/asix_devices.c", "usbnet_tx_timeout"], ["drivers/net/usb/ax88172a.c", "usbnet_tx_timeout"], ["drivers/net/usb/ax88179_178a.c", "usbnet_tx_timeout"], ["drivers/net/usb/catc.c", "catc_tx_timeout"], ["drivers/net/usb/cdc_mbim.c", "usbnet_tx_timeout"], ["drivers/net/usb/cdc_ncm.c", "usbnet_tx_timeout"], ["drivers/net/usb/dm9601.c", "usbnet_tx_timeout"], ["drivers/net/usb/hso.c", "hso_net_tx_timeout"], ["drivers/net/usb/int51x1.c", "usbnet_tx_timeout"], ["drivers/net/usb/ipheth.c", "ipheth_tx_timeout"], ["drivers/net/usb/kaweth.c", "kaweth_tx_timeout"], ["drivers/net/usb/lan78xx.c", "lan78xx_tx_timeout"], ["drivers/net/usb/mcs7830.c", "usbnet_tx_timeout"], ["drivers/net/usb/pegasus.c", "pegasus_tx_timeout"], ["drivers/net/usb/qmi_wwan.c", "usbnet_tx_timeout"], ["drivers/net/usb/r8152.c", "rtl8152_tx_timeout"], ["drivers/net/usb/rndis_host.c", "usbnet_tx_timeout"], ["drivers/net/usb/rtl8150.c", "rtl8150_tx_timeout"], ["drivers/net/usb/sierra_net.c", "usbnet_tx_timeout"], ["drivers/net/usb/smsc75xx.c", "usbnet_tx_timeout"], ["drivers/net/usb/smsc95xx.c", "usbnet_tx_timeout"], ["drivers/net/usb/sr9700.c", "usbnet_tx_timeout"], ["drivers/net/usb/sr9800.c", "usbnet_tx_timeout"], ["drivers/net/usb/usbnet.c", "usbnet_tx_timeout"], ["drivers/net/vmxnet3/vmxnet3_drv.c", "vmxnet3_tx_timeout"], ["drivers/net/wan/cosa.c", "cosa_net_timeout"], ["drivers/net/wan/farsync.c", "fst_tx_timeout"], ["drivers/net/wan/fsl_ucc_hdlc.c", "uhdlc_tx_timeout"], ["drivers/net/wan/lmc/lmc_main.c", "lmc_driver_timeout"], ["drivers/net/wan/x25_asy.c", "x25_asy_timeout"], ["drivers/net/wimax/i2400m/netdev.c", "i2400m_tx_timeout"], ["drivers/net/wireless/intel/ipw2x00/ipw2100.c", "ipw2100_tx_timeout"], ["drivers/net/wireless/intersil/hostap/hostap_main.c", "prism2_tx_timeout"], ["drivers/net/wireless/intersil/hostap/hostap_main.c", "prism2_tx_timeout"], ["drivers/net/wireless/intersil/hostap/hostap_main.c", "prism2_tx_timeout"], ["drivers/net/wireless/intersil/orinoco/main.c", "orinoco_tx_timeout"], ["drivers/net/wireless/intersil/orinoco/orinoco_usb.c", "orinoco_tx_timeout"], ["drivers/net/wireless/intersil/orinoco/orinoco.h", "orinoco_tx_timeout"], ["drivers/net/wireless/intersil/prism54/islpci_dev.c", "islpci_eth_tx_timeout"], ["drivers/net/wireless/intersil/prism54/islpci_eth.c", "islpci_eth_tx_timeout"], ["drivers/net/wireless/intersil/prism54/islpci_eth.h", "islpci_eth_tx_timeout"], ["drivers/net/wireless/marvell/mwifiex/main.c", "mwifiex_tx_timeout"], ["drivers/net/wireless/quantenna/qtnfmac/core.c", "qtnf_netdev_tx_timeout"], ["drivers/net/wireless/quantenna/qtnfmac/core.h", "qtnf_netdev_tx_timeout"], ["drivers/net/wireless/rndis_wlan.c", "usbnet_tx_timeout"], ["drivers/net/wireless/wl3501_cs.c", "wl3501_tx_timeout"], ["drivers/net/wireless/zydas/zd1201.c", "zd1201_tx_timeout"], ["drivers/s390/net/qeth_core.h", "qeth_tx_timeout"], ["drivers/s390/net/qeth_core_main.c", "qeth_tx_timeout"], ["drivers/s390/net/qeth_l2_main.c", "qeth_tx_timeout"], ["drivers/s390/net/qeth_l2_main.c", "qeth_tx_timeout"], ["drivers/s390/net/qeth_l3_main.c", "qeth_tx_timeout"], ["drivers/s390/net/qeth_l3_main.c", "qeth_tx_timeout"], ["drivers/staging/ks7010/ks_wlan_net.c", "ks_wlan_tx_timeout"], ["drivers/staging/qlge/qlge_main.c", "qlge_tx_timeout"], ["drivers/staging/rtl8192e/rtl8192e/rtl_core.c", "_rtl92e_tx_timeout"], ["drivers/staging/rtl8192u/r8192U_core.c", "tx_timeout"], ["drivers/staging/unisys/visornic/visornic_main.c", "visornic_xmit_timeout"], ["drivers/staging/wlan-ng/p80211netdev.c", "p80211knetdev_tx_timeout"], ["drivers/tty/n_gsm.c", "gsm_mux_net_tx_timeout"], ["drivers/tty/synclink.c", "hdlcdev_tx_timeout"], ["drivers/tty/synclink_gt.c", "hdlcdev_tx_timeout"], ["drivers/tty/synclinkmp.c", "hdlcdev_tx_timeout"], ["net/atm/lec.c", "lec_tx_timeout"], ["net/bluetooth/bnep/netdev.c", "bnep_net_timeout"] ); for my $p (@work) { my @pair = @$p; my $file = $pair[0]; my $func = $pair[1]; print STDERR $file , ": ", $func,"\n"; our @ARGV = ($file); while (<ARGV>) { if (m/($func\s*\(struct\s+net_device\s+\*[A-Za-z_]?[A-Za-z-0-9_]*)(\))/) { print STDERR "found $1+$2 in $file\n"; } if (s/($func\s*\(struct\s+net_device\s+\*[A-Za-z_]?[A-Za-z-0-9_]*)(\))/$1, unsigned int txqueue$2/) { print STDERR "$func found in $file\n"; } print; } } where the list of files and functions is simply from: git grep ndo_tx_timeout, with manual addition of headers in the rare cases where the function is from a header, then manually changing the few places which actually call ndo_tx_timeout. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Heiner Kallweit <hkallweit1@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Shannon Nelson <snelson@pensando.io> Reviewed-by: Martin Habets <mhabets@solarflare.com> changes from v9: fixup a forward declaration changes from v9: more leftovers from v3 change changes from v8: fix up a missing direct call to timeout rebased on net-next changes from v7: fixup leftovers from v3 change changes from v6: fix typo in rtl driver changes from v5: add missing files (allow any net device argument name) changes from v4: add a missing driver header changes from v3: change queue # to unsigned Changes from v2: added headers Changes from v1: Fix errors found by kbuild: generalize the pattern a bit, to pick up a couple of instances missed by the previous version. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
501a90c9 |
|
05-Dec-2019 |
Eric Dumazet <edumazet@google.com> |
inet: protect against too small mtu values. syzbot was once again able to crash a host by setting a very small mtu on loopback device. Let's make inetdev_valid_mtu() available in include/net/ip.h, and use it in ip_setup_cork(), so that we protect both ip_append_page() and __ip_append_data() Also add a READ_ONCE() when the device mtu is read. Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(), even if other code paths might write over this field. Add a big comment in include/linux/netdevice.h about dev->mtu needing READ_ONCE()/WRITE_ONCE() annotations. Hopefully we will add the missing ones in followup patches. [1] refcount_t: saturated; leaking memory. WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 panic+0x2e3/0x75c kernel/panic.c:221 __warn.cold+0x2f/0x3e kernel/panic.c:582 report_bug+0x289/0x300 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:174 [inline] fixup_bug arch/x86/kernel/traps.c:169 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027 RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22 Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89 RSP: 0018:ffff88809689f550 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1 R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001 R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40 refcount_add include/linux/refcount.h:193 [inline] skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999 sock_wmalloc+0xf1/0x120 net/core/sock.c:2096 ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383 udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276 inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821 kernel_sendpage+0x92/0xf0 net/socket.c:3794 sock_sendpage+0x8b/0xc0 net/socket.c:936 pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458 splice_from_pipe_feed fs/splice.c:512 [inline] __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636 splice_from_pipe+0x108/0x170 fs/splice.c:671 generic_splice_sendpage+0x3c/0x50 fs/splice.c:842 do_splice_from fs/splice.c:861 [inline] direct_splice_actor+0x123/0x190 fs/splice.c:1035 splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990 do_splice_direct+0x1da/0x2a0 fs/splice.c:1078 do_sendfile+0x597/0xd00 fs/read_write.c:1464 __do_sys_sendfile64 fs/read_write.c:1525 [inline] __se_sys_sendfile64 fs/read_write.c:1511 [inline] __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441409 Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409 RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005 RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180 R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000 Kernel Offset: disabled Rebooting in 86400 seconds.. Fixes: 1470ddf7f8ce ("inet: Remove explicit write references to sk/inet in ip_append_data") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
30aad417 |
|
06-Nov-2019 |
Danit Goldberg <danitg@mellanox.com> |
net/core: Add support for getting VF GUIDs Introduce a new ndo: ndo_get_vf_guid, to get from the net device the port and node GUID. New applications can choose to use this interface to show GUIDs with iproute2 with commands such as: - ip link show ib4 ib4: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 256 link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff vf 0 link/infiniband 00:00:0a:2d:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:44:36:8d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof checking off, NODE_GUID 22:44:33:00:33:11:00:33, PORT_GUID 10:21:33:12:00:11:22:10, link-state disable, trust off, query_rss off Signed-off-by: Danit Goldberg <danitg@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
#
c29f74e0 |
|
11-Nov-2019 |
Pablo Neira Ayuso <pablo@netfilter.org> |
netfilter: nf_flow_table: hardware offload support This patch adds the dataplane hardware offload to the flowtable infrastructure. Three new flags represent the hardware state of this flow: * FLOW_OFFLOAD_HW: This flow entry resides in the hardware. * FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove from hardware. This might be triggered by either packet path (via TCP RST/FIN packet) or via aging. * FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from the hardware, the software garbage collector can remove it from the software flowtable. This patch supports for: * IPv4 only. * Aging via FLOW_CLS_STATS, no packet and byte counter synchronization at this stage. This patch also adds the action callback that specifies how to convert the flow entry into the flow_rule object that is passed to the driver. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd2f4737 |
|
07-Nov-2019 |
Eric Dumazet <edumazet@google.com> |
net: use u64_stats_t in struct pcpu_lstats In order to fix the data-race found by KCSAN, we can use the new u64_stats_t type and its accessors instead of plain u64 fields. This will still generate optimal code for both 32 and 64 bit platforms. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dd5382a0 |
|
07-Nov-2019 |
Eric Dumazet <edumazet@google.com> |
net: provide dev_lstats_add() helper Many network drivers need it and hand-coded the same function. In order to ease u64_stats_t adoption, it is time to factorize. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
de7d5084 |
|
07-Nov-2019 |
Eric Dumazet <edumazet@google.com> |
net: provide dev_lstats_read() helper Many network drivers use hand-coded implementation of the same thing, let's factorize things so that u64_stats_t adoption is done once. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3b0a18b |
|
21-Oct-2019 |
Taehee Yoo <ap420073@gmail.com> |
net: remove unnecessary variables and callback This patch removes variables and callback these are related to the nested device structure. devices that can be nested have their own nest_level variable that represents the depth of nested devices. In the previous patch, new {lower/upper}_level variables are added and they replace old private nest_level variable. So, this patch removes all 'nest_level' variables. In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added to get lockdep subclass value, which is actually lower nested depth value. But now, they use the dynamic lockdep key to avoid lockdep warning instead of the subclass. So, this patch removes ->ndo_get_lock_subclass() callback. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32b6d34f |
|
21-Oct-2019 |
Taehee Yoo <ap420073@gmail.com> |
net: core: add ignore flag to netdev_adjacent structure In order to link an adjacent node, netdev_upper_dev_link() is used and in order to unlink an adjacent node, netdev_upper_dev_unlink() is used. unlink operation does not fail, but link operation can fail. In order to exchange adjacent nodes, we should unlink an old adjacent node first. then, link a new adjacent node. If link operation is failed, we should link an old adjacent node again. But this link operation can fail too. It eventually breaks the adjacent link relationship. This patch adds an ignore flag into the netdev_adjacent structure. If this flag is set, netdev_upper_dev_link() ignores an old adjacent node for a moment. This patch also adds new functions for other modules. netdev_adjacent_change_prepare() netdev_adjacent_change_commit() netdev_adjacent_change_abort() netdev_adjacent_change_prepare() inserts new device into adjacent list but new device is not allowed to use immediately. If netdev_adjacent_change_prepare() fails, it internally rollbacks adjacent list so that we don't need any other action. netdev_adjacent_change_commit() deletes old device in the adjacent list and allows new device to use. netdev_adjacent_change_abort() rollbacks adjacent list. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ab92d68f |
|
21-Oct-2019 |
Taehee Yoo <ap420073@gmail.com> |
net: core: add generic lockdep keys Some interface types could be nested. (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..) These interface types should set lockdep class because, without lockdep class key, lockdep always warn about unexisting circular locking. In the current code, these interfaces have their own lockdep class keys and these manage itself. So that there are so many duplicate code around the /driver/net and /net/. This patch adds new generic lockdep keys and some helper functions for it. This patch does below changes. a) Add lockdep class keys in struct net_device - qdisc_running, xmit, addr_list, qdisc_busylock - these keys are used as dynamic lockdep key. b) When net_device is being allocated, lockdep keys are registered. - alloc_netdev_mqs() c) When net_device is being free'd llockdep keys are unregistered. - free_netdev() d) Add generic lockdep key helper function - netdev_register_lockdep_key() - netdev_unregister_lockdep_key() - netdev_update_lockdep_key() e) Remove unnecessary generic lockdep macro and functions f) Remove unnecessary lockdep code of each interfaces. After this patch, each interface modules don't need to maintain their lockdep keys. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5343da4c |
|
21-Oct-2019 |
Taehee Yoo <ap420073@gmail.com> |
net: core: limit nested device depth Current code doesn't limit the number of nested devices. Nested devices would be handled recursively and this needs huge stack memory. So, unlimited nested devices could make stack overflow. This patch adds upper_level and lower_level, they are common variables and represent maximum lower/upper depth. When upper/lower device is attached or dettached, {lower/upper}_level are updated. and if maximum depth is bigger than 8, attach routine fails and returns -EMLINK. In addition, this patch converts recursive routine of netdev_walk_all_{lower/upper} to iterator routine. Test commands: ip link add dummy0 type dummy ip link add link dummy0 name vlan1 type vlan id 1 ip link set vlan1 up for i in {2..55} do let A=$i-1 ip link add vlan$i link vlan$A type vlan id $i done ip link del dummy0 Splat looks like: [ 155.513226][ T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850 [ 155.514162][ T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908 [ 155.515048][ T908] [ 155.515333][ T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96 [ 155.516147][ T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 155.517233][ T908] Call Trace: [ 155.517627][ T908] [ 155.517918][ T908] Allocated by task 0: [ 155.518412][ T908] (stack is not available) [ 155.518955][ T908] [ 155.519228][ T908] Freed by task 0: [ 155.519885][ T908] (stack is not available) [ 155.520452][ T908] [ 155.520729][ T908] The buggy address belongs to the object at ffff8880608a6ac0 [ 155.520729][ T908] which belongs to the cache names_cache of size 4096 [ 155.522387][ T908] The buggy address is located 512 bytes inside of [ 155.522387][ T908] 4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0) [ 155.523920][ T908] The buggy address belongs to the page: [ 155.524552][ T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0 [ 155.525836][ T908] flags: 0x100000000010200(slab|head) [ 155.526445][ T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0 [ 155.527424][ T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 [ 155.528429][ T908] page dumped because: kasan: bad access detected [ 155.529158][ T908] [ 155.529410][ T908] Memory state around the buggy address: [ 155.530060][ T908] ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 155.530971][ T908] ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3 [ 155.531889][ T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 155.532806][ T908] ^ [ 155.533509][ T908] ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00 [ 155.534436][ T908] ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00 [ ... ] Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bacb7e18 |
|
08-Oct-2019 |
Eric Dumazet <edumazet@google.com> |
Revert "tun: call dev_get_valid_name() before register_netdevice()" This reverts commit 0ad646c81b2182f7fa67ec0c8c825e0ee165696d. As noticed by Jakub, this is no longer needed after commit 11fc7d5a0a2d ("tun: fix memory leak in error path") This no longer exports dev_get_valid_name() for the exclusive use of tun driver. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
|
#
a30c7b42 |
|
30-Sep-2019 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce per-netns netdevice notifiers Often the code for example in drivers is interested in getting notifier call only from certain network namespace. In addition to the existing global netdevice notifier chain introduce per-netns chains and allow users to register to that. Eventually this would eliminate unnecessary overhead in case there are many netdevices in many network namespaces. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
afa0df59 |
|
30-Sep-2019 |
Jiri Pirko <jiri@mellanox.com> |
net: push loops and nb calls into helper functions Push iterations over net namespaces and netdevices from register_netdevice_notifier() and unregister_netdevice_notifier() into helper functions. Along with that introduce continue_reverse macros to make the code a bit nicer allowing to get rid of "last" marks. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
36fbf1e5 |
|
30-Sep-2019 |
Jiri Pirko <jiri@mellanox.com> |
net: rtnetlink: add linkprop commands to add and delete alternative ifnames Add two commands to add and delete list of link properties. Implement the first property type along - alternative ifnames. Each net device can have multiple alternative names. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ff927412 |
|
30-Sep-2019 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce name_node struct to be used in hashlist Introduce name_node structure to hold name of device and put it into hashlist instead of putting there struct net_device directly. Add a necessary infrastructure to manipulate the hashlist. This prepares the code to use the same hashlist for alternative names introduced later in this set. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9c66d156 |
|
14-Sep-2019 |
Vinicius Costa Gomes <vinicius.gomes@intel.com> |
taprio: Add support for hardware offloading This allows taprio to offload the schedule enforcement to capable network cards, resulting in more precise windows and less CPU usage. The gate mask acts on traffic classes (groups of queues of same priority), as specified in IEEE 802.1Q-2018, and following the existing taprio and mqprio semantics. It is up to the driver to perform conversion between tc and individual netdev queues if for some reason it needs to make that distinction. Full offload is requested from the network interface by specifying "flags 2" in the tc qdisc creation command, which in turn corresponds to the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit. The important detail here is the clockid which is implicitly /dev/ptpN for full offload, and hence not configurable. A reference counting API is added to support the use case where Ethernet drivers need to keep the taprio offload structure locally (i.e. they are a multi-port switch driver, and configuring a port depends on the settings of other ports as well). The refcount_t variable is kept in a private structure (__tc_taprio_qopt_offload) and not exposed to drivers. In the future, the private structure might also be expanded with a backpointer to taprio_sched *q, to implement the notification system described in the patch (of when admin became oper, or an error occurred, etc, so the offload can be monitored with 'tc qdisc show'). Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3acd33d |
|
22-Aug-2019 |
xiaolinkui <xiaolinkui@kylinos.cn> |
net: use unlikely for dql_avail case This is an unlikely case, use unlikely() on it seems logical. Signed-off-by: xiaolinkui <xiaolinkui@kylinos.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9116e5e2 |
|
14-Aug-2019 |
Magnus Karlsson <magnus.karlsson@intel.com> |
xsk: replace ndo_xsk_async_xmit with ndo_xsk_wakeup This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. This new ndo provides the same functionality as before but with the addition of a new flags field that is used to specifiy if Rx, Tx or both should be woken up. The previous ndo only woke up Tx, as implied by the name. The i40e and ixgbe drivers (which are all the supported ones) are updated with this new interface. This new ndo will be used by the new need_wakeup functionality of XDP sockets that need to be able to wake up both Rx and Tx driver processing. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
323ebb61 |
|
06-Aug-2019 |
Edward Cree <ecree@solarflare.com> |
net: use listified RX for handling GRO_NORMAL skbs When GRO decides not to coalesce a packet, in napi_frags_finish(), instead of passing it to the stack immediately, place it on a list in the napi struct. Then, at flush time (napi_complete_done(), napi_poll(), or napi_busy_loop()), call netif_receive_skb_list_internal() on the list. We'd like to do that in napi_gro_flush(), but it's not called if !napi->gro_bitmask, so we have to do it in the callers instead. (There are a handful of drivers that call napi_gro_flush() themselves, but it's not clear why, or whether this will affect them.) Because a full 64 packets is an inefficiently large batch, also consume the list whenever it exceeds gro_normal_batch, a new net/core sysctl that defaults to 8. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4de83b88 |
|
01-Jul-2019 |
Mahesh Bandewar <maheshb@google.com> |
loopback: create blackhole net device similar to loopack. Create a blackhole net device that can be used for "dead" dst entries instead of loopback device. This blackhole device differs from loopback in few aspects: (a) It's not per-ns. (b) MTU on this device is ETH_MIN_MTU (c) The xmit function is essentially kfree_skb(). and (d) since it's not registered it won't have ifindex. Lower MTU effectively make the device not pass the MTU check during the route check when a dst associated with the skb is dead. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2874c5fd |
|
27-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
da68b4ad |
|
25-Apr-2019 |
Jakub Kicinski <kuba@kernel.org> |
net/tls: move definition of tls ops into net/tls.h There seems to be no reason for tls_ops to be defined in netdevice.h which is included in a lot of places. Don't wrap the struct/enum declaration in ifdefs, it trickles down unnecessary ifdefs into driver code. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8065a779 |
|
08-Apr-2019 |
Si-Wei Liu <si-wei.liu@oracle.com> |
failover: allow name change on IFF_UP slave interfaces When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to listen for the rename events on failover slaves. Userspace component interacting with slaves is expected to be changed to operate on failover master interface instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and userspace components should only deal with failover master in the long run. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
28b05b92 |
|
03-Apr-2019 |
Florian Westphal <fw@strlen.de> |
net: use correct this_cpu primitive in dev_recursion_level syzbot reports: BUG: using __this_cpu_read() in preemptible code: caller is dev_recursion_level include/linux/netdevice.h:3052 [inline] __this_cpu_preempt_check+0x246/0x270 lib/smp_processor_id.c:47 dev_recursion_level include/linux/netdevice.h:3052 [inline] ip6_skb_dst_mtu include/net/ip6_route.h:245 [inline] I erronously downgraded a this_cpu_read to __this_cpu_read when moving dev_recursion_level() around. Reported-by: syzbot+51471b4aae195285a4a3@syzkaller.appspotmail.com Fixes: 97cdcf37b57e ("net: place xmit recursion in softnet data") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b16f9ee |
|
01-Apr-2019 |
Florian Westphal <fw@strlen.de> |
net: move skb->xmit_more hint to softnet data There are two reasons for this. First, the xmit_more flag conceptually doesn't fit into the skb, as xmit_more is not a property related to the skb. Its only a hint to the driver that the stack is about to transmit another packet immediately. Second, it was only done this way to not have to pass another argument to ndo_start_xmit(). We can place xmit_more in the softnet data, next to the device recursion. The recursion counter is already written to on each transmit. The "more" indicator is placed right next to it. Drivers can use the netdev_xmit_more() helper instead of skb->xmit_more to check the "more packets coming" hint. skb->xmit_more is retained (but always 0) to not cause build breakage. This change takes care of the simple s/skb->xmit_more/netdev_xmit_more()/ conversions. Remaining drivers are converted in the next patches. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
97cdcf37 |
|
01-Apr-2019 |
Florian Westphal <fw@strlen.de> |
net: place xmit recursion in softnet data This fills a hole in softnet data, so no change in structure size. Also prepares for xmit_more placement in the same spot; skb->xmit_more will be removed in followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5dc37bb9 |
|
28-Mar-2019 |
Jiri Pirko <jiri@mellanox.com> |
net: replace ndo_get_devlink with ndo_get_devlink_port Follow-up patch is going to need a devlink port instance according to a netdev. Devlink port instance should be always available when devlink is used. So change the recently introduced ndo_get_devlink to ndo_get_devlink_port. With that, adjust the wrapper for the only user to get devlink pointer. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dc05360f |
|
22-Mar-2019 |
Eric Dumazet <edumazet@google.com> |
net: convert rps_needed and rfs_needed to new static branch api We prefer static_branch_unlikely() over static_key_false() these days. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a350ecce |
|
20-Mar-2019 |
Paolo Abeni <pabeni@redhat.com> |
net: remove 'fallback' argument from dev->ndo_select_queue() After the previous patch, all the callers of ndo_select_queue() provide as a 'fallback' argument netdev_pick_tx. The only exceptions are nested calls to ndo_select_queue(), which pass down the 'fallback' available in the current scope - still netdev_pick_tx. We can drop such argument and replace fallback() invocation with netdev_pick_tx(). This avoids an indirect call per xmit packet in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen) with device drivers implementing such ndo. It also clean the code a bit. Tested with ixgbe and CONFIG_FCOE=m With pktgen using queue xmit: threads vanilla patched (kpps) (kpps) 1 2334 2428 2 4166 4278 4 7895 8100 v1 -> v2: - rebased after helper's name change Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b71b5837 |
|
20-Mar-2019 |
Paolo Abeni <pabeni@redhat.com> |
packet: rework packet_pick_tx_queue() to use common code selection Currently packet_pick_tx_queue() is the only caller of ndo_select_queue() using a fallback argument other than netdev_pick_tx. Leveraging rx queue, we can obtain a similar queue selection behavior using core helpers. After this change, ndo_select_queue() is always invoked with netdev_pick_tx() as fallback. We can change ndo_select_queue() signature in a followup patch, dropping an indirect call per transmitted packet in some scenarios (e.g. TCP syn and XDP generic xmit) This changes slightly how af packet queue selection happens when PACKET_QDISC_BYPASS is set. It's now more similar to plan dev_queue_xmit() tacking in account both XPS and TC mapping. v1 -> v2: - rebased after helper name change RFC -> v1: - initialize sender_cpu to the expected value Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4bd97d51 |
|
20-Mar-2019 |
Paolo Abeni <pabeni@redhat.com> |
net: dev: rename queue selection helpers. With the following patches, we are going to use __netdev_pick_tx() in many modules. Rename it to netdev_pick_tx(), to make it clear is a public API. Also rename the existing netdev_pick_tx() to netdev_core_pick_tx(), to avoid name clashes. Suggested-by: Eric Dumazet <edumazet@google.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d705f07 |
|
27-Feb-2019 |
Florian Fainelli <f.fainelli@gmail.com> |
net: Remove switchdev_ops Now that we have converted all possible callers to using a switchdev notifier for attributes we do not have a need for implementing switchdev_ops anymore, and this can be removed from all drivers the net_device structure. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f4d7b3e2 |
|
27-Feb-2019 |
Andy Shevchenko <andriy.shevchenko@linux.intel.com> |
net: dev: Use unsigned integer as an argument to left-shift 1 << 31 is Undefined Behaviour according to the C standard. Use U type modifier to avoid theoretical overflow. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b473b0d2 |
|
25-Feb-2019 |
Jakub Kicinski <kuba@kernel.org> |
devlink: create a special NDO for getting the devlink instance Instead of iterating over all devlink ports add a NDO which will return the devlink instance from the driver. v2: add the netdev_to_devlink() helper (Michal) v3: check that devlink has ops (Florian) v4: hold devlink_mutex (Jiri) Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b5899679 |
|
22-Feb-2019 |
Andy Roulin <aroulin@cumulusnetworks.com> |
net: dev: add generic protodown handler Introduce dev_change_proto_down_generic, a generic ndo_change_proto_down implementation, which sets the netdev carrier state according to proto_down. This adds the ability to set protodown on vxlan and macvlan devices in a generic way for use by control protocols like VRRPD. Signed-off-by: Andy Roulin <aroulin@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e78b2915 |
|
20-Feb-2019 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net: Introduce parse_protocol header_ops callback Introduce a new optional header_ops callback called parse_protocol and a wrapper function dev_parse_header_protocol, similar to dev_parse_header. The new callback's purpose is to extract the protocol number from the L2 header, the format of which is known to the driver, but not to the upper layers of the stack. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f8ebfaf6 |
|
13-Feb-2019 |
Jan Sokolowski <jan.sokolowski@intel.com> |
net: bpf: remove XDP_QUERY_XSK_UMEM enumerator Commit c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id") moved the umem query code to the AF_XDP core, and therefore removed the need to query the netdevice for a umem. This patch removes XDP_QUERY_XSK_UMEM and all code that implement that behavior, which is just dead code. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
d6abc596 |
|
06-Feb-2019 |
Florian Fainelli <f.fainelli@gmail.com> |
net: Introduce ndo_get_port_parent_id() In preparation for getting rid of switchdev_ops, create a dedicated NDO operation for getting the port's parent identifier. There are essentially two classes of drivers that need to implement getting the port's parent ID which are VF/PF drivers with a built-in switch, and pure switchdev drivers such as mlxsw, ocelot, dsa etc. We introduce a helper function: dev_get_port_parent_id() which supports recursion into the lower devices to obtain the first port's parent ID. Convert the bridge, core and ipv4 multicast routing code to check for such ndo_get_port_parent_id() and call the helper function when valid before falling back to switchdev_port_attr_get(). This will allow us to convert all relevant drivers in one go instead of having to implement both switchdev_port_attr_get() and ndo_get_port_parent_id() operations, then get rid of switchdev_port_attr_get(). Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce3fdb69 |
|
02-Feb-2019 |
Joe Perches <joe@perches.com> |
netdevice.h: Add __cold to netdev_<level> logging functions Add __cold to the netdev_<level> logging functions similar to the use of __cold in the generic printk function. Using __cold moves all the netdev_<level> logging functions out-of-line possibly improving code locality and runtime performance. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d5256083 |
|
29-Jan-2019 |
Daniel Borkmann <daniel@iogearbox.net> |
ipvlan, l3mdev: fix broken l3s mode wrt local routes While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin, I ran into the issue that while l3 mode is working fine, l3s mode does not have any connectivity to kube-apiserver and hence all pods end up in Error state as well. The ipvlan master device sits on top of a bond device and hostns traffic to kube-apiserver (also running in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573 where the latter is the address of the bond0. While in l3 mode, a curl to https://10.152.183.1:443 or to https://139.178.29.207:37573 works fine from hostns, neither of them do in case of l3s. In the latter only a curl to https://127.0.0.1:37573 appeared to work where for local addresses of bond0 I saw kernel suddenly starting to emit ARP requests to query HW address of bond0 which remained unanswered and neighbor entries in INCOMPLETE state. These ARP requests only happen while in l3s. Debugging this further, I found the issue is that l3s mode is piggy- backing on l3 master device, and in this case local routes are using l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback"). I found that reverting them back into using the net->loopback_dev fixed ipvlan l3s connectivity and got everything working for the CNI. Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the l3mdev paper in [0] the only sole reason why ipvlan l3s is relying on l3 master device is to get the l3mdev_ip_rcv() receive hook for setting the dst entry of the input route without adding its own ipvlan specific hacks into the receive path, however, any l3 domain semantics beyond just that are breaking l3s operation. Note that ipvlan also has the ability to dynamically switch its internal operation from l3 to l3s for all ports via ipvlan_set_port_mode() at runtime. In any case, l3 vs l3s soley distinguishes itself by 'de-confusing' netfilter through switching skb->dev to ipvlan slave device late in NF_INET_LOCAL_IN before handing the skb to L4. Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which, if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook without any additional l3mdev semantics on top. This should also have minimal impact since dev->priv_flags is already hot in cache. With this set, l3s mode is working fine and I also get things like masquerading pod traffic on the ipvlan master properly working. [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback") Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
856c395c |
|
18-Jan-2019 |
Cong Wang <xiyou.wangcong@gmail.com> |
net: introduce a knob to control whether to inherit devconf config There have been many people complaining about the inconsistent behaviors of IPv4 and IPv6 devconf when creating new network namespaces. Currently, for IPv4, we inherit all current settings from init_net, but for IPv6 we reset all setting to default. This patch introduces a new /proc file /proc/sys/net/core/devconf_inherit_init_net to control the behavior of whether to inhert sysctl current settings from init_net. This file itself is only available in init_net. As demonstrated below: Initial setup in init_net: # cat /proc/sys/net/ipv4/conf/all/rp_filter 2 # cat /proc/sys/net/ipv6/conf/all/accept_dad 1 Default value 0 (current behavior): # ip netns del test # ip netns add test # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter 2 # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad 0 Set to 1 (inherit from init_net): # echo 1 > /proc/sys/net/core/devconf_inherit_init_net # ip netns del test # ip netns add test # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter 2 # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad 1 Set to 2 (reset to default): # echo 2 > /proc/sys/net/core/devconf_inherit_init_net # ip netns del test # ip netns add test # ip netns exec test cat /proc/sys/net/ipv4/conf/all/rp_filter 0 # ip netns exec test cat /proc/sys/net/ipv6/conf/all/accept_dad 0 Set to a value out of range (invalid): # echo 3 > /proc/sys/net/core/devconf_inherit_init_net -bash: echo: write error: Invalid argument # echo -1 > /proc/sys/net/core/devconf_inherit_init_net -bash: echo: write error: Invalid argument Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com> Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
87b0984e |
|
16-Jan-2019 |
Petr Machata <petrm@mellanox.com> |
net: Add extack argument to ndo_fdb_add() Drivers may not be able to support certain FDB entries, and an error code is insufficient to give clear hints as to the reasons of rejection. In order to make it possible to communicate the rejection reason, extend ndo_fdb_add() with an extack argument. Adapt the existing implementations of ndo_fdb_add() to take the parameter (and ignore it). Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add(). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b2f94b2 |
|
15-Dec-2018 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
net: rtnetlink: support for fdb get This patch adds support for fdb get similar to route get. arguments can be any of the following (similar to fdb add/del/dump): [bridge, mac, vlan] or [bridge_port, mac, vlan, flags=[NTF_MASTER]] or [dev, mac, [vni|vlan], flags=[NTF_SELF]] Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d59cdf94 |
|
13-Dec-2018 |
Petr Machata <petrm@mellanox.com> |
net: dev: Issue NETDEV_PRE_CHANGEADDR When a device address is about to be changed, or an address added to the list of device HW addresses, it is necessary to ensure that all interested parties can support the address. Therefore, send the NETDEV_PRE_CHANGEADDR notification, and if anyone bails on it, do not change the address. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1570415f |
|
13-Dec-2018 |
Petr Machata <petrm@mellanox.com> |
net: dev: Add NETDEV_PRE_CHANGEADDR The NETDEV_CHANGEADDR notification is emitted after a device address changes. Extending this message to allow vetoing is certainly possible, but several other notification types have instead adopted a simple two-stage approach: first a "pre" notification is sent to make sure all interested parties are OK with a change that's about to be done. Then the change is done, and afterwards a "post" notification is sent. This dual approach is easier to use: when the change is vetoed, nothing has changed yet, and it's therefore unnecessary to roll anything back. Therefore adopt it for NETDEV_CHANGEADDR as well. To that end, add NETDEV_PRE_CHANGEADDR and an info structure to go along with it. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3a37a963 |
|
13-Dec-2018 |
Petr Machata <petrm@mellanox.com> |
net: dev: Add extack argument to dev_set_mac_address() A follow-up patch will add a notifier type NETDEV_PRE_CHANGEADDR, which allows vetoing of MAC address changes. One prominent path to that notification is through dev_set_mac_address(). Therefore give this function an extack argument, so that it can be packed together with the notification. Thus a textual reason for rejection (or a warning) can be communicated back to the user. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2fd527b7 |
|
12-Dec-2018 |
Petr Machata <petrm@mellanox.com> |
net: ndo_bridge_setlink: Add extack Drivers may not be able to implement a VLAN addition or reconfiguration. In those cases it's desirable to explain to the user that it was rejected (and why). To that end, add extack argument to ndo_bridge_setlink. Adapt all users to that change. Following patches will use the new argument in the bridge driver. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6d040321 |
|
06-Dec-2018 |
Petr Machata <petrm@mellanox.com> |
net: core: dev: Add extack argument to __dev_change_flags() In order to pass extack together with NETDEV_PRE_UP notifications, it's necessary to route the extack to __dev_open() from diverse (possibly indirect) callers. The last missing API is __dev_change_flags(). Therefore extend __dev_change_flags() with and extra extack argument and update the two existing users. Since the function declaration line is changed anyway, name the struct net_device argument to placate checkpatch. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
567c5e13 |
|
06-Dec-2018 |
Petr Machata <petrm@mellanox.com> |
net: core: dev: Add extack argument to dev_change_flags() In order to pass extack together with NETDEV_PRE_UP notifications, it's necessary to route the extack to __dev_open() from diverse (possibly indirect) callers. One prominent API through which the notification is invoked is dev_change_flags(). Therefore extend dev_change_flags() with and extra extack argument and update all users. Most of the calls end up just encoding NULL, but several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available. Since the function declaration line is changed anyway, name the other function arguments to placate checkpatch. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
00f54e68 |
|
06-Dec-2018 |
Petr Machata <petrm@mellanox.com> |
net: core: dev: Add extack argument to dev_open() In order to pass extack together with NETDEV_PRE_UP notifications, it's necessary to route the extack to __dev_open() from diverse (possibly indirect) callers. One prominent API through which the notification is invoked is dev_open(). Therefore extend dev_open() with and extra extack argument and update all users. Most of the calls end up just encoding NULL, but bond and team drivers have the extack readily available. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
620344c4 |
|
25-Nov-2018 |
Heiner Kallweit <hkallweit1@gmail.com> |
net: core: add __netdev_sent_queue as variant of __netdev_tx_sent_queue Similar to netdev_sent_queue add helper __netdev_sent_queue as variant of __netdev_tx_sent_queue. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2183435c |
|
23-Nov-2018 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: fixup type in netdev_start_xmit() Return code should be formally "netdev_tx_t". Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
890d8d23 |
|
19-Nov-2018 |
Jakub Kicinski <kuba@kernel.org> |
net: sched: gred: add basic Qdisc offload Add basic offload for the GRED Qdisc. Inform the drivers any time Qdisc or virtual queue configuration changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9a5ee462 |
|
16-Nov-2018 |
Eric Dumazet <edumazet@google.com> |
net: align pcpu_sw_netstats and pcpu_lstats structs Do not risk spanning these small structures on two cache lines, it is absolutely not worth it. For 32bit arches, the hint might not be enough, but we do not really care anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7fe50ac8 |
|
12-Nov-2018 |
Cong Wang <xiyou.wangcong@gmail.com> |
net: dump more useful information in netdev_rx_csum_fault() Currently netdev_rx_csum_fault() only shows a device name, we need more information about the skb for debugging csum failures. Sample output: ens3: hw csum failure dev features: 0x0000000000014b89 skb len=84 data_len=0 pkt_type=0 gso_size=0 gso_type=0 nr_frags=0 ip_summed=0 csum=0 csum_complete_sw=0 csum_valid=0 csum_level=0 Note, I use pr_err() just to be consistent with the existing one. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
98b0e5f6 |
|
12-Nov-2018 |
Jakub Kicinski <kuba@kernel.org> |
net: sched: provide notification for graft on root Drivers are currently not notified when a Qdisc is grafted as root. This requires special casing Qdiscs added with parent = TC_H_ROOT in the driver. Also there is no notification sent to the driver when an existing Qdisc is grafted as root. Add this very simple notifications, drivers should now be able to track their Qdisc tree fully. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eb911947 |
|
09-Nov-2018 |
Quentin Monnet <quentin@isovalent.com> |
bpf: pass destroy() as a callback and remove its ndo_bpf subcommand As part of the transition from ndo_bpf() to callbacks attached to struct bpf_offload_dev for some of the eBPF offload operations, move the functions related to program destruction to the struct and remove the subcommand that was used to call them through the NDO. Remove function __bpf_offload_ndo(), which is no longer used. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
b07ade27 |
|
09-Nov-2018 |
Quentin Monnet <quentin@isovalent.com> |
bpf: pass translate() as a callback and remove its ndo_bpf subcommand As part of the transition from ndo_bpf() to callbacks attached to struct bpf_offload_dev for some of the eBPF offload operations, move the functions related to code translation to the struct and remove the subcommand that was used to call them through the NDO. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
00db12c3 |
|
09-Nov-2018 |
Quentin Monnet <quentin@isovalent.com> |
bpf: call verifier_prep from its callback in struct bpf_offload_dev In a way similar to the change previously brought to the verify_insn hook and to the finalize callback, switch to the newly added ops in struct bpf_prog_offload for calling the functions used to prepare driver verifiers. Since the dev_ops pointer in struct bpf_prog_offload is no longer used by any callback, we can now remove it from struct bpf_prog_offload. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
e7946760 |
|
08-Nov-2018 |
Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> |
net: core: dev_addr_lists: add auxiliary func to handle reference address updates In order to avoid all table update, and only remove or add new address, the auxiliary function exists, named __hw_addr_sync_dev(). It allows end driver do nothing when nothing changed and add/rm when concrete address is firstly added or lastly removed. But it doesn't include cases when an address of real device or vlan was reused by other vlans or vlan/macval devices. For handaling events when address was reused/unreused the patch adds new auxiliary routine - __hw_addr_ref_sync_dev(). It allows to do nothing when nothing was changed and do updates only for an address being added/reused/deleted/unreused. Thus, clone address changes for vlans can be mirrored in the table. The function is exclusive with __hw_addr_sync_dev(). It's responsibility of the end driver to identify address vlan device, if it needs so. Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3e59020a |
|
31-Oct-2018 |
Eric Dumazet <edumazet@google.com> |
net: bql: add __netdev_tx_sent_queue() When qdisc_run() tries to use BQL budget to bulk-dequeue a batch of packets, GSO can later transform this list in another list of skbs, and each skb is sent to device ndo_start_xmit(), one at a time, with skb->xmit_more being set to one but for last skb. Problem is that very often, BQL limit is hit in the middle of the packet train, forcing dev_hard_start_xmit() to stop the bulk send and requeue the end of the list. BQL role is to avoid head of line blocking, making sure a qdisc can deliver high priority packets before low priority ones. But there is no way requeued packets can be bypassed by fresh packets in the qdisc. Aborting the bulk send increases TX softirqs, and hot cache lines (after skb_segment()) are wasted. Note that for TSO packets, we never split a packet in the middle because of BQL limit being hit. Drivers should be able to update BQL counters without flipping/caring about BQL status, if the current skb has xmit_more set. Upper layers are ultimately responsible to stop sending another packet train when BQL limit is hit. Code template in a driver might look like the following : send_doorbell = __netdev_tx_sent_queue(tx_queue, nr_bytes, skb->xmit_more); Note that __netdev_tx_sent_queue() use is not mandatory, since following patch will change dev_hard_start_xmit() to not care about BQL status. But it is highly recommended so that xmit_more full benefits can be reached (less doorbells sent, and less atomic operations as well) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9f9a742d |
|
09-Oct-2018 |
Maciej W. Rozycki <macro@linux-mips.org> |
FDDI: defza: Support capturing outgoing SMT traffic DEC FDDIcontroller 700 (DEFZA) uses a Tx/Rx queue pair to communicate SMT frames with adapter's firmware. Any SMT frame received from the RMC via the Rx queue is queued back by the driver to the SMT Rx queue for the firmware to process. Similarly the firmware uses the SMT Tx queue to supply the driver with SMT frames which are queued back to the Tx queue for the RMC to send to the ring. When a network tap is attached to an FDDI interface handled by `defza' any incoming SMT frames captured are queued to our usual processing of network data received, which in turn delivers them to any listening taps. However the outgoing SMT frames produced by the firmware bypass our network protocol stack and are therefore not delivered to taps. This in turn means that taps are missing a part of network traffic sent by the adapter, which may make it more difficult to track down network problems or do general traffic analysis. Call `dev_queue_xmit_nit' then in the SMT Tx path, having checked that a network tap is attached, with a newly-created `dev_nit_active' helper wrapping the usual condition used in the transmit path. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
af7d6cce |
|
09-Oct-2018 |
Sabrina Dubroca <sd@queasysnail.net> |
net: ipv4: update fnhe_pmtu when first hop's MTU changes Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions"), exceptions get deprecated separately from cached routes. In particular, administrative changes don't clear PMTU anymore. As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes"), the PMTU discovered before the local MTU change can become stale: - if the local MTU is now lower than the PMTU, that PMTU is now incorrect - if the local MTU was the lowest value in the path, and is increased, we might discover a higher PMTU Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those cases. If the exception was locked, the discovered PMTU was smaller than the minimal accepted PMTU. In that case, if the new local MTU is smaller than the current PMTU, let PMTU discovery figure out if locking of the exception is still needed. To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU notifier. By the time the notifier is called, dev->mtu has been changed. This patch adds the old MTU as additional information in the notifier structure, and a new call_netdevice_notifiers_u32() function. Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
661b8d1b |
|
01-Oct-2018 |
Magnus Karlsson <magnus.karlsson@intel.com> |
net: add umem reference in netdev{_rx}_queue These references to the umem will be used to store information on what kind of AF_XDP umem that is bound to a queue id, if any. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
61941143 |
|
24-Sep-2018 |
Heiner Kallweit <hkallweit1@gmail.com> |
net: core: add member wol_enabled to struct net_device Add flag wol_enabled to struct net_device indicating whether Wake-on-LAN is enabled. As first user phy_suspend() will use it to decide whether PHY can be suspended or not. Fixes: f1e911d5d0df ("r8169: add basic phylib support") Fixes: e8cfd9d6c772 ("net: phy: call state machine synchronously in phy_stop") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14d73416 |
|
17-Sep-2018 |
Li RongQing <lirongqing@baidu.com> |
veth: rename pcpu_vstats as pcpu_lstats struct pcpu_vstats and pcpu_lstats have same members and usage, and pcpu_lstats is used in many files, so rename pcpu_vstats as pcpu_lstats to reduce duplicate definition Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
52bb6677 |
|
14-Sep-2018 |
Li RongQing <lirongqing@baidu.com> |
net: move definition of pcpu_lstats to header file pcpu_lstats is defined in several files, so unify them as one and move to header file Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fa788d98 |
|
03-Sep-2018 |
Vincent Whitchurch <vincent.whitchurch@axis.com> |
packet: add sockopt to ignore outgoing packets Currently, the only way to ignore outgoing packets on a packet socket is via the BPF filter. With MSG_ZEROCOPY, packets that are looped into AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even if the filter run from packet_rcv() would reject them. So the presence of a packet socket on the interface takes away the benefits of MSG_ZEROCOPY, even if the packet socket is not interested in outgoing packets. (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily cloned, but the cost for that is much lower.) Add a socket option to allow AF_PACKET sockets to ignore outgoing packets to solve this. Note that the *BSDs already have something similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT. The first intended user is lldpd. Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c5c95810 |
|
28-Aug-2018 |
Magnus Karlsson <magnus.karlsson@intel.com> |
net: add napi_if_scheduled_mark_missed The function napi_if_scheduled_mark_missed is used to check if the NAPI context is scheduled, if so set NAPIF_STATE_MISSED and return true. Used by the AF_XDP zero-copy i40e Tx code implementation in order to make sure that irq affinity is honored by the napi context. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
c9fbb2d2 |
|
10-Aug-2018 |
Krzysztof Kozlowski <krzk@kernel.org> |
net: Provide stub for __netif_set_xps_queue if there is no CONFIG_XPS Building virtio_net driver without CONFIG_XPS fails with: drivers/net/virtio_net.c: In function ‘virtnet_set_affinity’: drivers/net/virtio_net.c:1910:3: error: implicit declaration of function ‘__netif_set_xps_queue’ [-Werror=implicit-function-declaration] __netif_set_xps_queue(vi->dev, mask, i, false); ^ Fixes: 4d99f6602cb5 ("net: allow to call netif_reset_xps_queues() under cpus_read_lock") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
84c6b868 |
|
30-Jul-2018 |
Jakub Kicinski <kuba@kernel.org> |
xsk: don't allow umem replace at stack level Currently drivers have to check if they already have a umem installed for a given queue and return an error if so. Make better use of XDP_QUERY_XSK_UMEM and move this functionality to the core. We need to keep rtnl across the calls now. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c29c2ebd |
|
30-Jul-2018 |
Jakub Kicinski <kuba@kernel.org> |
net: update real_num_rx_queues even when !CONFIG_SYSFS We used to depend on real_num_rx_queues as a upper bound for sanity checks. For AF_XDP socket validation it's useful if the check behaves the same regardless of CONFIG_SYSFS setting. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a4c53be |
|
27-Jul-2018 |
Stephen Hemminger <stephen@networkplumber.org> |
net: report invalid mtu value via netlink extack If an invalid MTU value is set through rtnetlink return extra error information instead of putting message in kernel log. For other cases where there is no visible API, keep the error report in the log. Example: # ip li set dev enp12s0 mtu 10000 Error: mtu greater than device maximum. # ifconfig enp12s0 mtu 10000 SIOCSIFMTU: Invalid argument # dmesg | tail -1 [ 2047.795467] enp12s0: mtu greater than device maximum Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d9f37d01 |
|
13-Jul-2018 |
Li RongQing <lirongqing@baidu.com> |
net: convert gro_count to bitmask gro_hash size is 192 bytes, and uses 3 cache lines, if there is few flows, gro_hash may be not fully used, so it is unnecessary to iterate all gro_hash in napi_gro_flush(), to occupy unnecessary cacheline. convert gro_count to a bitmask, and rename it as gro_bitmask, each bit represents a element of gro_hash, only flush a gro_hash element if the related bit is set, to speed up napi_gro_flush(). and update gro_bitmask only if it will be changed, to reduce cache update Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
16e4edc2 |
|
13-Jul-2018 |
Boris Pismenny <borisp@mellanox.com> |
net: Add TLS rx resync NDO Add new netdev tls op for resynchronizing HW tls context Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a25717d2 |
|
11-Jul-2018 |
Jakub Kicinski <kuba@kernel.org> |
xdp: support simultaneous driver and hw XDP attachment Split the query of HW-attached program from the software one. Introduce new .ndo_bpf command to query HW-attached program. This will allow drivers to install different programs in HW and SW at the same time. Netlink can now also carry multiple programs on dump (in which case mode will be set to XDP_ATTACHED_MULTI and user has to check per-attachment point attributes, IFLA_XDP_PROG_ID will not be present). We reuse IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size() doesn't need to be updated. Note that the installation side is still not there, since all drivers currently reject installing more than one program at the time. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
6b867589 |
|
11-Jul-2018 |
Jakub Kicinski <kuba@kernel.org> |
xdp: don't make drivers report attachment mode prog_attached of struct netdev_bpf should have been superseded by simply setting prog_id long time ago, but we kept it around to allow offloading drivers to communicate attachment mode (drv vs hw). Subsequently drivers were also allowed to report back attachment flags (prog_flags), and since nowadays only programs attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell the attachment mode from the flags driver reports. Remove prog_attached member. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
8ec56fc3 |
|
08-Jul-2018 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: allow fallback function to pass netdev For most of these calls we can just pass NULL through to the fallback function as the sb_dev. The only cases where we cannot are the cases where we might be dealing with either an upper device or a driver that would have configured things to support an sb_dev itself. The only driver that has any significant change in this patch set should be ixgbe as we can drop the redundant functionality that existed in both the ndo_select_queue function and the fallback function that was passed through to us. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
4f49dec9 |
|
08-Jul-2018 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: allow ndo_select_queue to pass netdev This patch makes it so that instead of passing a void pointer as the accel_priv we instead pass a net_device pointer as sb_dev. Making this change allows us to pass the subordinate device through to the fallback function eventually so that we can keep the actual code in the ndo_select_queue call as focused on possible on the exception cases. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
a4ea8a3d |
|
08-Jul-2018 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add generic ndo_select_queue functions This patch adds a generic version of the ndo_select_queue functions for either returning 0 or selecting a queue based on the processor ID. This is generally meant to just reduce the number of functions we have to change in the future when we have to deal with ndo_select_queue changes. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
eadec877 |
|
08-Jul-2018 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add support for subordinate traffic classes to netdev_pick_tx This change makes it so that we can support the concept of subordinate device traffic classes to the core networking code. In doing this we can start pulling out the driver specific bits needed to support selecting a queue based on an upper device. The solution at is currently stands is only partially implemented. I have the start of some XPS bits in here, but I would still need to allow for configuration of the XPS maps on the queues reserved for the subordinate devices. For now I am using the reference to the sb_dev XPS map as just a way to skip the lookup of the lower device XPS map for now as that would result in the wrong queue being picked. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
ffcfe25b |
|
08-Jul-2018 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add support for subordinate device traffic classes This patch is meant to provide the basic tools needed to allow us to create subordinate device traffic classes. The general idea here is to allow subdividing the queues of a device into queue groups accessible through an upper device such as a macvlan. The idea here is to enforce the idea that an upper device has to be a single queue device, ideally with IFF_NO_QUQUE set. With that being the case we can pretty much guarantee that the tc_to_txq mappings and XPS maps for the upper device are unused. As such we could reuse those in order to support subdividing the lower device and distributing those queues between the subordinate devices. In order to distinguish between a regular set of traffic classes and if a device is carrying subordinate traffic classes I changed num_tc from a u8 to a s16 value and use the negative values to represent the subordinate pool values. So starting at -1 and running to -32768 we can encode those as pool values, and the existing values of 0 to 15 can be maintained. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
6312fe77 |
|
05-Jul-2018 |
Li RongQing <lirongqing@baidu.com> |
net: limit each hash list length to MAX_GRO_SKBS After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash table.")' there is 8 hash buckets, which allows more flows to be held for merging. but MAX_GRO_SKBS, the total held skb for merging, is 8 skb still, limit the hash table performance. keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, not the total 8 skb Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
25db26a9 |
|
03-Jul-2018 |
Vinicius Costa Gomes <vinicius.gomes@intel.com> |
net/sched: Introduce the ETF Qdisc The ETF (Earliest TxTime First) qdisc uses the information added earlier in this series (the socket option SO_TXTIME and the new role of sk_buff->tstamp) to schedule packets transmission based on absolute time. For some workloads, just bandwidth enforcement is not enough, and precise control of the transmission of packets is necessary. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \ clockid CLOCK_TAI In this example, the Qdisc will provide SW best-effort for the control of the transmission time to the network adapter, the time stamp in the socket will be in reference to the clockid CLOCK_TAI and packets will leave the qdisc "delta" (100000) nanoseconds before its transmission time. The ETF qdisc will buffer packets sorted by their txtime. It will drop packets on enqueue() if their skbuff clockid does not match the clock reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped if it expires while being enqueued. The qdisc also supports the SO_TXTIME deadline mode. For this mode, it will dequeue a packet as soon as possible and change the skb timestamp to 'now' during etf_dequeue(). Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid to be configured, but it's been decided that usage of CLOCK_TAI should be enforced until we decide to allow for other clockids to be used. The rationale here is that PTP times are usually in the TAI scale, thus no other clocks should be necessary. For now, the qdisc will return EINVAL if any clocks other than CLOCK_TAI are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
17266ee9 |
|
02-Jul-2018 |
Edward Cree <ecree@solarflare.com> |
net: ipv4: listified version of ip_rcv Also involved adding a way to run a netfilter hook over a list of packets. Rather than attempting to make netfilter know about lists (which would be a major project in itself) we just let it call the regular okfn (in this case ip_rcv_finish()) for any packets it steals, and have it give us back a list of packets it's synchronously accepted (which normally NF_HOOK would automatically call okfn() on, but we want to be able to potentially pass the list to a listified version of okfn().) The netfilter hooks themselves are indirect calls that still happen per- packet (see nf_hook_entry_hookfn()), but again, changing that can be left for future work. There is potential for out-of-order receives if the netfilter hook ends up synchronously stealing packets, as they will be processed before any accepts earlier in the list. However, it was already possible for an asynchronous accept to cause out-of-order receives, so presumably this is considered OK. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6ad8c1b |
|
02-Jul-2018 |
Edward Cree <ecree@solarflare.com> |
net: core: trivial netif_receive_skb_list() entry point Just calls netif_receive_skb() in a loop. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
603d4cf8 |
|
30-Jun-2018 |
Sabrina Dubroca <sd@queasysnail.net> |
net: fix use-after-free in GRO with ESP Since the addition of GRO for ESP, gro_receive can consume the skb and return -EINPROGRESS. In that case, the lower layer GRO handler cannot touch the skb anymore. Commit 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") converted some of the gro_receive handlers that can lead to ESP's gro_receive so that they wouldn't access the skb when -EINPROGRESS is returned, but missed other spots, mainly in tunneling protocols. This patch finishes the conversion to using skb_gro_flush_final(), and adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and GUE. Fixes: 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
80d19669 |
|
29-Jun-2018 |
Amritha Nambiar <amritha.nambiar@intel.com> |
net: Refactor XPS for CPUs and Rx queues Refactor XPS code to support Tx queue selection based on CPU(s) map or Rx queue(s) map. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
07d78363 |
|
23-Jun-2018 |
David Miller <davem@davemloft.net> |
net: Convert NAPI gro list into a small hash table. Improve the performance of GRO receive by splitting flows into multiple hash chains. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4546c25 |
|
23-Jun-2018 |
David Miller <davem@davemloft.net> |
net: Convert GRO SKB handling to list_head. Manage pending per-NAPI GRO packets via list_head. Return an SKB pointer from the GRO receive handlers. When GRO receive handlers return non-NULL, it means that this SKB needs to be completed at this time and removed from the NAPI queue. Several operations are greatly simplified by this transformation, especially timing out the oldest SKB in the list when gro_count exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue in reverse order. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3760c7e |
|
04-Jun-2018 |
Magnus Karlsson <magnus.karlsson@intel.com> |
net: added netdevice operation for Tx Added ndo_xsk_async_xmit. This ndo "kicks" the netdev to start to pull userland AF_XDP Tx frames from a NAPI context. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
74515c57 |
|
04-Jun-2018 |
Björn Töpel <bjorn@kernel.org> |
net: xdp: added bpf_netdev_command XDP_{QUERY, SETUP}_XSK_UMEM Extend ndo_bpf with two new commands used for query zero-copy support and register an UMEM to a queue_id of a netdev. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
189454e8 |
|
05-Jun-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
net: remove net_device operation ndo_xdp_flush All drivers are cleaned up and no references to ndo_xdp_flush are left in drivers, it is time to remove the net_device_ops operation ndo_xdp_flush. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
42b33468 |
|
31-May-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
xdp: add flags argument to ndo_xdp_xmit API This patch only change the API and reject any use of flags. This is an intermediate step that allows us to implement the flush flag operation later, for each individual driver in a separate patch. The plan is to implement flush operation via XDP_XMIT_FLUSH flag and then remove XDP_XMIT_FLAGS_NONE when done. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
f971b132 |
|
25-May-2018 |
Jakub Kicinski <kuba@kernel.org> |
net: sched: mq: add simple offload notification mq offload is trivial, we just need to let the device know that the root qdisc is mq. Alternative approach would be to export qdisc_lookup() and make drivers check the root type themselves, but notification via ndo_setup_tc is more in line with other qdiscs. Note that mq doesn't hold any stats on it's own, it just adds up stats of its children. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
30c8bd5a |
|
24-May-2018 |
Sridhar Samudrala <sridhar.samudrala@intel.com> |
net: Introduce generic failover module The failover module provides a generic interface for paravirtual drivers to register a netdev and a set of ops with a failover instance. The ops are used as event handlers that get called to handle netdev register/ unregister/link change/name change events on slave pci ethernet devices with the same mac address as the failover netdev. This enables paravirtual drivers to use a VF as an accelerated low latency datapath. It also allows migration of VMs with direct attached VFs by failing over to the paravirtual datapath when the VF is unplugged. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f44aa9ef |
|
23-May-2018 |
John Hurley <john.hurley@netronome.com> |
net: include hash policy in LAG changeupper info LAG upper event notifiers contain the tx type used by the LAG device. Extend this to also include the hash policy used for tx types that utilize hashing. Signed-off-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
735fc405 |
|
24-May-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
xdp: change ndo_xdp_xmit API to support bulking This patch change the API for ndo_xdp_xmit to support bulking xdp_frames. When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also the net_device->ndo_xdp_xmit() call. Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed performance improved: for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps With frames avail as a bulk inside the driver ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX. Testing without CONFIG_RETPOLINE show the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per frame producer locking, but instead amortize the locking cost over the bulk. V2: Fix compile errors reported by kbuild test robot <lkp@intel.com> V4: Isolated ndo, driver changes and callers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
865b03f2 |
|
02-May-2018 |
Magnus Karlsson <magnus.karlsson@intel.com> |
dev: packet: make packet_direct_xmit a common function The new dev_direct_xmit will be used by AF_XDP in later commits. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
e283de3a |
|
30-Apr-2018 |
Florian Fainelli <f.fainelli@gmail.com> |
net: core: Inline netdev_features_size_check() We do not require this inline function to be used in multiple different locations, just inline it where it gets used in register_netdevice(). Suggested-by: David Miller <davem@davemloft.net> Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a5c37c63 |
|
30-Apr-2018 |
Ilya Lesokhin <ilyal@mellanox.com> |
net: Add TLS offload netdev ops Add new netdev ops to add and delete tls context Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3ac305c3 |
|
27-Apr-2018 |
Florian Fainelli <f.fainelli@gmail.com> |
net: core: Assert the size of netdev_featres_t We have about 53 netdev_features_t bits defined and counting, add a build time check to catch when an u64 type will not be enough and we will have to convert that to a bitmap. This is done in register_netdevice() for convenience. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b837d48 |
|
27-Apr-2018 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash I am dropping the export of __skb_tx_hash as after my patches nobody is using it outside of the net/core/dev.c file. In addition I am renaming and repurposing it to just be a static declaration of skb_tx_hash since that was the only user for it at this point. By doing this the compiler can inline it into __netdev_pick_tx as that will improve performance. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
83aa025f |
|
26-Apr-2018 |
Willem de Bruijn <willemb@google.com> |
udp: add gso support to virtual devices Virtual devices such as tunnels and bonding can handle large packets. Only segment packets when reaching a physical or loopback device. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
44fa2dbd |
|
17-Apr-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
xdp: transition into using xdp_frame for ndo_xdp_xmit Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync. This builds towards changing the API further to become a bulk API, because xdp_buff is not a queue-able object while xdp_frame is. V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP") V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e679c9c1 |
|
28-Mar-2018 |
Russell King <rmk+kernel@armlinux.org.uk> |
sfp/phylink: move module EEPROM ethtool access into netdev core ethtool Provide a pointer to the SFP bus in struct net_device, so that the ethtool module EEPROM methods can access the SFP directly, rather than needing every user to provide a hook for it. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9daae9bd |
|
28-Mar-2018 |
Gal Pressman <galp@mellanox.com> |
net: Call add/kill vid ndo on vlan filter feature toggling NETIF_F_HW_VLAN_[CS]TAG_FILTER features require more than just a bit flip in dev->features in order to keep the driver in a consistent state. These features notify the driver of each added/removed vlan, but toggling of vlan-filter does not notify the driver accordingly for each of the existing vlans. This patch implements a similar solution to NETIF_F_RX_UDP_TUNNEL_PORT behavior (which notifies the driver about UDP ports in the same manner that vids are reported). Each toggling of the features propagates to the 8021q module, which iterates over the vlans and call add/kill ndo accordingly. Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
070f2d7e |
|
23-Mar-2018 |
Kirill Tkhai <ktkhai@virtuozzo.com> |
net: Drop NETDEV_UNREGISTER_FINAL Last user is gone after bdf5bd7f2132 "rds: tcp: remove register_netdevice_notifier infrastructure.", so we can remove this netdevice command. This allows to delete rtnl_lock() in netdev_run_todo(), which is hot path for net namespace unregistration. dev_change_net_namespace() and netdev_wait_allrefs() have rcu_barrier() before NETDEV_UNREGISTER_FINAL call, and the source commits say they were introduced to delemit the call with NETDEV_UNREGISTER, but this patch leaves them on the places, since they require additional analysis, whether we need in them for something else. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ede2762d |
|
23-Mar-2018 |
Kirill Tkhai <ktkhai@virtuozzo.com> |
net: Make NETDEV_XXX commands enum { } This patch is preparation to drop NETDEV_UNREGISTER_FINAL. Since the cmd is used in usnic_ib_netdev_event_to_string() to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL from everywhere, we'd have holes in event2str[] in this function. Instead of that, let's make NETDEV_XXX commands names available for everyone, and to define netdev_cmd_to_name() in the way we won't have to shaffle names after their numbers are changed. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
be9fc097 |
|
12-Mar-2018 |
Arnd Bergmann <arnd@arndb.de> |
net: fix sysctl_fb_tunnels_only_for_init_net link error The new variable is only available when CONFIG_SYSCTL is enabled, otherwise we get a link error: net/ipv4/ip_tunnel.o: In function `ip_tunnel_init_net': ip_tunnel.c:(.text+0x278b): undefined reference to `sysctl_fb_tunnels_only_for_init_net' net/ipv6/sit.o: In function `sit_init_net': sit.c:(.init.text+0x4c): undefined reference to `sysctl_fb_tunnels_only_for_init_net' net/ipv6/ip6_tunnel.o: In function `ip6_tnl_init_net': ip6_tunnel.c:(.init.text+0x39): undefined reference to `sysctl_fb_tunnels_only_for_init_net' This adds an extra condition, keeping the traditional behavior when CONFIG_SYSCTL is disabled. Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5426250 |
|
09-Mar-2018 |
Paolo Abeni <pabeni@redhat.com> |
net: introduce IFF_NO_RX_HANDLER Some network devices - notably ipvlan slave - are not compatible with any kind of rx_handler. Currently the hook can be installed but any configuration (bridge, bond, macsec, ...) is nonfunctional. This change allocates a priv_flag bit to mark such devices and explicitly forbid installing a rx_handler if such bit is set. The new bit is used by ipvlan slave device. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
79134e6c |
|
08-Mar-2018 |
Eric Dumazet <edumazet@google.com> |
net: do not create fallback tunnels for non-default namespaces fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0, ip6tnl0, ip6gre0) are automatically created when the corresponding module is loaded. These tunnels are also automatically created when a new network namespace is created, at a great cost. In many cases, netns are used for isolation purposes, and these extra network devices are a waste of resources. We are using thousands of netns per host, and hit the netns creation/delete bottleneck a lot. (Many thanks to Kirill for recent work on this) Add a new sysctl so that we can opt-out from this automatic creation. Note that these tunnels are still created for the initial namespace, to be the least intrusive for typical setups. Tested: lpk43:~# cat add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net lpk43:~# time ./add_del_unshare.sh real 0m37.521s user 0m0.886s sys 7m7.084s lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net lpk43:~# time ./add_del_unshare.sh real 0m4.761s user 0m0.851s sys 1m8.343s lpk43:~# Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ec54cb4 |
|
06-Mar-2018 |
Paolo Abeni <pabeni@redhat.com> |
net: unpollute priv_flags space the ipvlan device driver defines and uses 2 bits inside the priv_flags net_device field. Such bits and the related helper are used only inside the ipvlan device driver, and the core networking does not need to be aware of them. This change moves netif_is_ipvlan* helper in the ipvlan driver and re-implement them looking for ipvlan specific symbols instead of using priv_flags. Overall this frees two bits inside priv_flags - and move the following ones to avoid gaps - without any intended functional change. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
89e58148 |
|
13-Feb-2018 |
David Ahern <dsa@cumulusnetworks.com> |
net: Make atalk_ptr depend on ATALK or IRDA Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
19ff13f2 |
|
13-Feb-2018 |
David Ahern <dsahern@gmail.com> |
net: Make ax25_ptr depend on CONFIG_AX25 Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
330c7272 |
|
13-Feb-2018 |
David Ahern <dsahern@gmail.com> |
net: Make dn_ptr depend on CONFIG_DECNET Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
65073a67 |
|
30-Jan-2018 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: fix null pointer deref in bpf_prog_test_run_xdp syzkaller was able to generate the following XDP program ... (18) r0 = 0x0 (61) r5 = *(u32 *)(r1 +12) (04) (u32) r0 += (u32) 0 (95) exit ... and trigger a NULL pointer dereference in ___bpf_prog_run() via bpf_prog_test_run_xdp() where this was attempted to run. Reason is that recent xdp_rxq_info addition to XDP programs updated all drivers, but not bpf_prog_test_run_xdp(), where xdp_buff is set up. Thus when context rewriter does the deref on the netdev it's NULL at runtime. Fix it by using xdp_rxq from loopback dev. __netif_get_rx_queue() helper can also be reused in various other locations later on. Fixes: 02dd3291b2f0 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs") Reported-by: syzbot+1eb094057b338eb1fc00@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
6a643ddb |
|
25-Jan-2018 |
Cong Wang <xiyou.wangcong@gmail.com> |
net: introduce helper dev_change_tx_queue_len() This patch promotes the local change_tx_queue_len() to a core helper function, dev_change_tx_queue_len(), so that rtnetlink and net-sysfs could share the code. This also prepares for the following patch. Note, the -EFAULT in the original code doesn't make sense, we should propagate the errno from notifiers. Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
44c02a2c |
|
04-Oct-2017 |
Al Viro <viro@zeniv.linux.org.uk> |
dev_ioctl(): move copyin/copyout to callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
36fd633e |
|
26-Jun-2017 |
Al Viro <viro@zeniv.linux.org.uk> |
net: separate SIOCGIFCONF handling from dev_ioctl() Only two of dev_ioctl() callers may pass SIOCGIFCONF to it. Separating that codepath from the rest of dev_ioctl() allows both to simplify dev_ioctl() itself (all other cases work with struct ifreq *) *and* seriously simplify the compat side of that beast: all it takes is passing to inet_gifconf() an extra argument - the size of individual records (sizeof(struct ifreq) or sizeof(struct compat_ifreq)). With dev_ifconf() called directly from sock_do_ioctl()/compat_dev_ifconf() that's easy to arrange. As the result, compat side of SIOCGIFCONF doesn't need any allocations, copy_in_user() back and forth, etc. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
9e55e5d3 |
|
22-Jan-2018 |
Florian Fainelli <f.fainelli@gmail.com> |
net: core: Fix kernel-doc for carrier_* attributes Fix the documentation warning: include/linux/netdevice.h:1939: warning: Excess struct member 'carrier_changes' description in 'net_device' Reported-by: kbuild test robot <fengguang.wu@intel.com> Fixes: b2d3bcfa26a7 ("net: core: Expose number of link up/down transitions") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b2d3bcfa |
|
18-Jan-2018 |
David Decotigny <decot@googlers.com> |
net: core: Expose number of link up/down transitions Expose the number of times the link has been going UP or DOWN, and update the "carrier_changes" counter to be the sum of these two events. While at it, also update the sysfs-class-net documentation to cover: carrier_changes (3.15), carrier_up_count (4.16) and carrier_down_count (4.16) Signed-off-by: David Decotigny <decot@googlers.com> [Florian: * rebase * add documentation * merge carrier_changes with up/down counters] Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
50bd870a |
|
14-Jan-2018 |
Yossef Efraim <yossefe@mellanox.com> |
xfrm: Add ESN support for IPSec HW offload This patch adds ESN support to IPsec device offload. Adding new xfrm device operation to synchronize device ESN. Signed-off-by: Yossef Efraim <yossefe@mellanox.com> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
a3884572 |
|
11-Jan-2018 |
Jakub Kicinski <kuba@kernel.org> |
bpf: offload: add map offload infrastructure BPF map offload follow similar path to program offload. At creation time users may specify ifindex of the device on which they want to create the map. Map will be validated by the kernel's .map_alloc_check callback and device driver will be called for the actual allocation. Map will have an empty set of operations associated with it (save for alloc and free callbacks). The real device callbacks are kept in map->offload->dev_ops because they have slightly different signatures. Map operations are called in process context so the driver may communicate with HW freely, msleep(), wait() etc. Map alloc and free callbacks are muxed via existing .ndo_bpf, and are always called with rtnl lock held. Maps and programs are guaranteed to be destroyed before .ndo_uninit (i.e. before unregister_netdev() returns). Map callbacks are invoked with bpf_devs_lock *read* locked, drivers must take care of exclusive locking if necessary. All offload-specific branches are marked with unlikely() (through bpf_map_is_dev_bound()), given that branch penalty will be negligible compared to IO anyway, and we don't want to penalize SW path unnecessarily. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
7fdb61b4 |
|
13-Jan-2018 |
Nogah Frankel <nogahf@mellanox.com> |
net: sch: prio: Add offload ability to PRIO qdisc Add the ability to offload PRIO qdisc by using ndo_setup_tc. There are three commands for PRIO offloading: * TC_PRIO_REPLACE: handles set and tune * TC_PRIO_DESTROY: handles qdisc destroy * TC_PRIO_STATS: updates the qdiscs counters (given as reference) Like RED qdisc, the indication of whether PRIO is being offloaded is being set and updated as part of the dump function. It is so because the driver could decide to offload or not based on the qdisc parent, which could change without notifying the qdisc. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Reviewed-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd3ba214 |
|
09-Jan-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
net: fix xdp_rxq_info build issue when CONFIG_SYSFS is not set The commit e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info") removed some ifdef CONFIG_SYSFS in net/core/dev.c, but forgot to remove the corresponding ifdef's in include/linux/netdevice.h. Fixes: e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e1cfe3d0 |
|
06-Jan-2018 |
Gal Pressman <galp@mellanox.com> |
net: No line break on netdev_WARN* formatting Remove the unnecessary line break between the netdev name and reg state to the actual message that should be printed. For example, this: [86730.307236] ------------[ cut here ]------------ [86730.313496] netdevice: enp27s0f0 Message from the driver [...] Will be replaced with: [86770.259289] ------------[ cut here ]------------ [86770.265191] netdevice: enp27s0f0: Message from the driver [...] Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
72dd831e |
|
06-Jan-2018 |
Gal Pressman <galp@mellanox.com> |
net: Fix netdev_WARN_ONCE macro netdev_WARN_ONCE is broken (whoops..), this fix will remove the unnecessary "condition" parameter, add the missing comma and change "arg" to "args". Fixes: 375ef2b1f0d0 ("net: Introduce netdev_*_once functions") Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e817f856 |
|
03-Jan-2018 |
Jesper Dangaard Brouer <brouer@redhat.com> |
xdp: generic XDP handling of xdp_rxq_info Hook points for xdp_rxq_info: * reg : netif_alloc_rx_queues * unreg: netif_free_rx_queues The net_device have some members (num_rx_queues + real_num_rx_queues) and data-area (dev->_rx with struct netdev_rx_queue's) that were primarily used for exporting information about RPS (CONFIG_RPS) queues to sysfs (CONFIG_SYSFS). For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info, and remove some of the CONFIG_SYSFS ifdefs. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
cae1927c |
|
27-Dec-2017 |
Jakub Kicinski <kuba@kernel.org> |
bpf: offload: allow netdev to disappear while verifier is running To allow verifier instruction callbacks without any extra locking NETDEV_UNREGISTER notification would wait on a waitqueue for verifier to finish. This design decision was made when rtnl lock was providing all the locking. Use the read/write lock instead and remove the workqueue. Verifier will now call into the offload code, so dev_ops are moved to offload structure. Since verifier calls are all under bpf_prog_is_dev_bound() we no longer need static inline implementations to please builds with CONFIG_NET=n. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
9cb0d21d |
|
19-Dec-2017 |
Shannon Nelson <shannon.nelson@oracle.com> |
xfrm: wrap xfrmdev_ops with offload config There's no reason to define netdev->xfrmdev_ops if the offload facility is not CONFIG'd in. Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
f53c7239 |
|
20-Dec-2017 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Add asynchronous callbacks for xfrm on layer 2. This patch implements asynchronous crypto callbacks and a backlog handler that can be used when IPsec is done at layer 2 in the TX path. It also extends the skb validate functions so that we can update the driver transmit return codes based on async crypto operation or to indicate that we queued the packet in a backlog queue. Joint work with: Aviv Heller <avivh@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
92f0292b |
|
01-Dec-2017 |
Jakub Kicinski <kuba@kernel.org> |
net: xdp: report flags program was installed with on query Some drivers enforce that flags on program replacement and removal must match the flags passed on install. This leaves the possibility open to enable simultaneous loading of XDP programs both to HW and DRV. Allow such drivers to report the flags back to the stack. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
118b4aa2 |
|
01-Dec-2017 |
Jakub Kicinski <kuba@kernel.org> |
net: xdp: avoid output parameters when querying XDP prog The output parameters will get unwieldy if we want to add more information about the program. Simply pass the entire struct netdev_bpf in. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
0c19f846 |
|
21-Nov-2017 |
Willem de Bruijn <willemb@google.com> |
net: accept UFO datagrams from tuntap and packet Tuntap and similar devices can inject GSO packets. Accept type VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively. Processes are expected to use feature negotiation such as TUNSETOFFLOAD to detect supported offload types and refrain from injecting other packets. This process breaks down with live migration: guest kernels do not renegotiate flags, so destination hosts need to expose all features that the source host does. Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677. This patch introduces nearly(*) no new code to simplify verification. It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP insertion and software UFO segmentation. It does not reinstate protocol stack support, hardware offload (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception of VIRTIO_NET_HDR_GSO_UDP packets in tuntap. To support SKB_GSO_UDP reappearing in the stack, also reinstate logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD by squashing in commit 939912216fa8 ("net: skb_needs_check() removes CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1 ("net: avoid skb_warn_bad_offload false positives on UFO"). (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id, ipv6_proxy_select_ident is changed to return a __be32 and this is assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted at the end of the enum to minimize code churn. Tested Booted a v4.13 guest kernel with QEMU. On a host kernel before this patch `ethtool -k eth0` shows UFO disabled. After the patch, it is enabled, same as on a v4.13 host kernel. A UFO packet sent from the guest appears on the tap device: host: nc -l -p -u 8000 & tcpdump -n -i tap0 guest: dd if=/dev/zero of=payload.txt bs=1 count=2000 nc -u 192.16.1.1 8000 < payload.txt Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds, packets arriving fragmented: ./with_tap_pair.sh ./tap_send_ufo tap0 tap1 (from https://github.com/wdebruij/kerneltools/tree/master/tests) Changes v1 -> v2 - simplified set_offload change (review comment) - documented test procedure Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com> Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.") Reported-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
54985120 |
|
07-Nov-2017 |
Girish Moodalbail <girish.moodalbail@oracle.com> |
net: fix incorrect comment with regard to VLAN packet handling The commit bcc6d4790361 ("net: vlan: make non-hw-accel rx path similar to hw-accel") unified accel and non-accel path for VLAN RX. With that fix we do not register any packet_type handler for VLANs anymore, so fix the incorrect comment. Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
375ef2b1 |
|
17-Sep-2017 |
Gal Pressman <galp@mellanox.com> |
net: Introduce netdev_*_once functions Extend the net device error logging with netdev_*_once macros. netdev_*_once are the equivalents of the dev_*_once macros which are useful for messages that should only be logged once. Also add netdev_WARN_ONCE, which is the "once" extension for the already existing netdev_WARN macro. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
#
8521db4c |
|
05-Nov-2017 |
Nogah Frankel <nogahf@mellanox.com> |
net_sch: cbs: Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS to match the new convention.. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
575ed7d3 |
|
05-Nov-2017 |
Nogah Frankel <nogahf@mellanox.com> |
net_sch: mqprio: Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO to match the new convention. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
602f3baf |
|
05-Nov-2017 |
Nogah Frankel <nogahf@mellanox.com> |
net_sch: red: Add offload ability to RED qdisc Add the ability to offload RED qdisc by using ndo_setup_tc. There are four commands for RED offloading: * TC_RED_SET: handles set and change. * TC_RED_DESTROY: handle qdisc destroy. * TC_RED_STATS: update the qdiscs counters (given as reference) * TC_RED_XSTAT: returns red xstats. Whether RED is being offloaded is being determined every time dump action is being called because parent change of this qdisc could change its offload state but doesn't require any RED function to be called. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ab3f0063 |
|
03-Nov-2017 |
Jakub Kicinski <kuba@kernel.org> |
bpf: offload: add infrastructure for loading programs for a specific netdev The fact that we don't know which device the program is going to be used on is quite limiting in current eBPF infrastructure. We have to reverse or limit the changes which kernel makes to the loaded bytecode if we want it to be offloaded to a networking device. We also have to invent new APIs for debugging and troubleshooting support. Make it possible to load programs for a specific netdev. This helps us to bring the debug information closer to the core eBPF infrastructure (e.g. we will be able to reuse the verifer log in device JIT). It allows device JITs to perform translation on the original bytecode. __bpf_prog_get() when called to get a reference for an attachment point will now refuse to give it if program has a device assigned. Following patches will add a version of that function which passes the expected netdev in. @type argument in __bpf_prog_get() is renamed to attach_type to make it clearer that it's only set on attachment. All calls to ndo_bpf are protected by rtnl, only verifier callbacks are not. We need a wait queue to make sure netdev doesn't get destroyed while verifier is still running and calling its driver. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f4e63525 |
|
03-Nov-2017 |
Jakub Kicinski <kuba@kernel.org> |
net: bpf: rename ndo_xdp to ndo_bpf ndo_xdp is a control path callback for setting up XDP in the driver. We can reuse it for other forms of communication between the eBPF stack and the drivers. Rename the callback and associated structures and definitions. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
46209401 |
|
03-Nov-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath In sch_handle_egress and sch_handle_ingress tp->q is used only in order to update stats. So stats and filter list are the only things that are needed in clsact qdisc fastpath processing. Introduce new mini_Qdisc struct to hold those items. Also, introduce a helper to swap the mini_Qdisc structures in case filter list head changes. This removes need for tp->q usage without added overhead. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d0bd028 |
|
16-Oct-2017 |
Vinicius Costa Gomes <vinicius.gomes@intel.com> |
net/sched: Add support for HW offloading for CBS This adds support for offloading the CBS algorithm to the controller, if supported. Drivers wanting to support CBS offload must implement the .ndo_setup_tc callback and handle the TC_SETUP_CBS (introduced here) type. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Henrik Austad <henrik@austad.us> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
8c4083b3 |
|
19-Oct-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: add block bind/unbind notif. and extended block_get/put Introduce new type of ndo_setup_tc message to propage binding/unbinding of a block to driver. Call this ndo whenever qdisc gets/puts a block. Alongside with this, there's need to propagate binder type from qdisc code down to the notifier. So introduce extended variants of block_get/put in order to pass this info. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1c601d82 |
|
15-Oct-2017 |
Jesper Dangaard Brouer <brouer@redhat.com> |
bpf: cpumap xdp_buff to skb conversion and allocation This patch makes cpumap functional, by adding SKB allocation and invoking the network stack on the dequeuing CPU. For constructing the SKB on the remote CPU, the xdp_buff in converted into a struct xdp_pkt, and it mapped into the top headroom of the packet, to avoid allocating separate mem. For now, struct xdp_pkt is just a cpumap internal data structure, with info carried between enqueue to dequeue. If a driver doesn't have enough headroom it is simply dropped, with return code -EOVERFLOW. This will be picked up the xdp tracepoint infrastructure, to allow users to catch this. V2: take into account xdp->data_meta V4: - Drop busypoll tricks, keeping it more simple. - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei V5: correct RCU read protection around __netif_receive_skb_core. V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0ad646c8 |
|
13-Oct-2017 |
Cong Wang <xiyou.wangcong@gmail.com> |
tun: call dev_get_valid_name() before register_netdevice() register_netdevice() could fail early when we have an invalid dev name, in which case ->ndo_uninit() is not called. For tun device, this is a problem because a timer etc. are already initialized and it expects ->ndo_uninit() to clean them up. We could move these initializations into a ->ndo_init() so that register_netdevice() knows better, however this is still complicated due to the logic in tun_detach(). Therefore, I choose to just call dev_get_valid_name() before register_netdevice(), which is quicker and much easier to audit. And for this specific case, it is already enough. Fixes: 96442e42429e ("tuntap: choose the txq based on rxq") Reported-by: Dmitry Alexeev <avekceeb@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
42ab19ee |
|
04-Oct-2017 |
David Ahern <dsahern@gmail.com> |
net: Add extack to upper device linking Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
33eaf2a6 |
|
04-Oct-2017 |
David Ahern <dsahern@gmail.com> |
net: Add extack to ndo_add_slave Pass extack to do_set_master and down to ndo_add_slave Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
51d0c047 |
|
04-Oct-2017 |
David Ahern <dsahern@gmail.com> |
net: Add extack to netdev_notifier_info Add netlink_ext_ack to netdev_notifier_info to allow notifier handlers to return errors to userspace. Clean up the initialization in dev.c such that extack is easily added in subsequent patches where relevant. Specifically, remove the init call in call_netdevice_notifiers_info and have callers initalize on stack when info is declared. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c557001 |
|
02-Oct-2017 |
Florian Westphal <fw@strlen.de> |
net: core: decouple ifalias get/set from rtnl lock Device alias can be set by either rtnetlink (rtnl is held) or sysfs. rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose. Add an extra mutex for it and use rcu to protect concurrent accesses. This allows the sysfs path to not take rtnl and would later allow to not hold it when dumping ifalias. Based on suggestion from Eric Dumazet. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f657a60 |
|
29-Sep-2017 |
Vivien Didelot <vivien.didelot@gmail.com> |
net: dsa: change dsa_ptr for a dsa_port With DSA, a master net device (CPU facing interface) has a dsa_ptr pointer to which hangs a dsa_switch_tree. This is not correct because a master interface is wired to a dedicated switch port, and because we can theoretically have several master interfaces pointing to several CPU ports of the same switch fabric. Change the master interface's dsa_ptr for the CPU dsa_port pointer. This is a step towards supporting multiple CPU ports. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
25cc72a3 |
|
01-Sep-2017 |
Ido Schimmel <idosch@mellanox.com> |
mlxsw: spectrum: Forbid linking to devices that have uppers The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the device in case a port is enslaved to a master netdev such as bridge or bond. Since the driver ignores events unrelated to its ports and their uppers, it's possible to engineer situations in which the device's data path differs from the kernel's. One example to such a situation is when a port is enslaved to a bond that is already enslaved to a bridge. When the bond was enslaved the driver ignored the event - as the bond wasn't one of its uppers - and therefore a bridge port instance isn't created in the device. Until such configurations are supported forbid them by checking that the upper device doesn't have uppers of its own. Fixes: 0d65fc13042f ("mlxsw: spectrum: Implement LAG port join/leave") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Nogah Frankel <nogahf@mellanox.com> Tested-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f16ded59 |
|
31-Aug-2017 |
Rami Rosen <rami.rosen@intel.com> |
net: fix two typos in net_device_ops documentation. This patch fixes two trivial typos in net_device_ops documentation, related to ndo_xdp_flush callback. Signed-off-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0dd5759d |
|
29-Aug-2017 |
Dave Jiang <dave.jiang@intel.com> |
net: remove dmaengine.h inclusion from netdevice.h Since the removal of NET_DMA, dmaengine.h header file shouldn't be needed by netdevice.h anymore. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
966a9671 |
|
07-Aug-2017 |
Ying Huang <ying.huang@intel.com> |
smp: Avoid using two cache lines for struct call_single_data struct call_single_data is used in IPIs to transfer information between CPUs. Its size is bigger than sizeof(unsigned long) and less than cache line size. Currently it is not allocated with any explicit alignment requirements. This makes it possible for allocated call_single_data to cross two cache lines, which results in double the number of the cache lines that need to be transferred among CPUs. This can be fixed by requiring call_single_data to be aligned with the size of call_single_data. Currently the size of call_single_data is the power of 2. If we add new fields to call_single_data, we may need to add padding to make sure the size of new definition is the power of 2 as well. Fortunately, this is enforced by GCC, which will report bad sizes. To set alignment requirements of call_single_data to the size of call_single_data, a struct definition and a typedef is used. To test the effect of the patch, I used the vm-scalability multiple thread swap test case (swap-w-seq-mt). The test will create multiple threads and each thread will eat memory until all RAM and part of swap is used, so that huge number of IPIs are triggered when unmapping memory. In the test, the throughput of memory writing improves ~5% compared with misaligned call_single_data, because of faster IPIs. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Huang, Ying <ying.huang@intel.com> [ Add call_single_data_t and align with size of call_single_data. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
960632ec |
|
23-Aug-2017 |
Aaron Conole <aconole@bytheb.org> |
netfilter: convert hook list to an array This converts the storage and layout of netfilter hook entries from a linked list to an array. After this commit, hook entries will be stored adjacent in memory. The next pointer is no longer required. The ops pointers are stored at the end of the array as they are only used in the register/unregister path and in the legacy br_netfilter code. nf_unregister_net_hooks() is slower than needed as it just calls nf_unregister_net_hook in a loop (i.e. at least n synchronize_net() calls), this will be addressed in followup patch. Test setup: - ixgbe 10gbit - netperf UDP_STREAM, 64 byte packets - 5 hooks: (raw + mangle prerouting, mangle+filter input, inet filter): empty mangle and raw prerouting, mangle and filter input hooks: 353.9 this patch: 364.2 Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
718ad681 |
|
18-Aug-2017 |
stephen hemminger <stephen@networkplumber.org> |
net: drop unused attribute argument from sysfs queue funcs The show and store functions don't need/use the attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
737aec57 |
|
18-Aug-2017 |
stephen hemminger <stephen@networkplumber.org> |
net: constify net_ns_type_operations This can be const. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b793dc5c |
|
18-Aug-2017 |
stephen hemminger <stephen@networkplumber.org> |
net: constify netdev_class_file These functions are wrapper arount class_create_file which can take a const attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c497478 |
|
11-Aug-2017 |
Jason Wang <jasowang@redhat.com> |
net: export some generic xdp helpers This patch tries to export some generic xdp helpers to drivers. This can let driver to do XDP for a specific skb. This is useful for the case when the packet is hard to be processed at page level directly (e.g jumbo/GSO frame). With this patch, there's no need for driver to forbid the XDP set when configuration is not suitable. Instead, it can defer the XDP for packets that is hard to be processed directly after skb is created. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
de4784ca |
|
07-Aug-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: get rid of struct tc_to_netdev Get rid of struct tc_to_netdev which is now just unnecessary container and rather pass per-type structures down to drivers directly. Along with that, consolidate the naming of per-type structure variables in cls_*. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5fd9fc4e |
|
07-Aug-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: push cls related args into cls_common structure As ndo_setup_tc is generic offload op for whole tc subsystem, does not really make sense to have cls-specific args. So move them under cls_common structurure which is embedded in all cls structs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3e0e8266 |
|
07-Aug-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: make egress_dev flag part of flower offload struct Since this is specific to flower now, make it part of the flower offload struct. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ade9b658 |
|
07-Aug-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: rename TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL In order to be aligned with the rest of the types, rename TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2572ac53 |
|
07-Aug-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: sched: make type an argument for ndo_setup_tc Since the type is always present, push it to be a separate argument to ndo_setup_tc. On the way, name the type enum and use it for arg type. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
296d8ee3 |
|
20-Jul-2017 |
Sabrina Dubroca <sd@queasysnail.net> |
net: add infrastructure to un-offload UDP tunnel port This adds a new NETDEV_UDP_TUNNEL_DROP_INFO event, similar to NETDEV_UDP_TUNNEL_PUSH_INFO, to signal to un-offload ports. This also adds udp_tunnel_drop_rx_port(), which calls ndo_udp_tunnel_del. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7051b88a |
|
18-Jul-2017 |
stephen hemminger <stephen@networkplumber.org> |
net: make dev_close and related functions void There is no useful return value from dev_close. All paths return 0. Change dev_close and helper functions to void. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d9d30adf |
|
03-Jul-2017 |
David S. Miller <davem@davemloft.net> |
net: Kill NETIF_F_UFO and SKB_GSO_UDP. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
11393cc9 |
|
17-Jul-2017 |
John Fastabend <john.fastabend@gmail.com> |
xdp: Add batching support to redirect map For performance reasons we want to avoid updating the tail pointer in the driver tx ring as much as possible. To accomplish this we add batching support to the redirect path in XDP. This adds another ndo op "xdp_flush" that is used to inform the driver that it should bump the tail pointer on the TX ring. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
814abfab |
|
17-Jul-2017 |
John Fastabend <john.fastabend@gmail.com> |
xdp: add bpf_redirect helper function This adds support for a bpf_redirect helper function to the XDP infrastructure. For now this only supports redirecting to the egress path of a port. In order to support drivers handling a xdp_buff natively this patches uses a new ndo operation ndo_xdp_xmit() that takes pushes a xdp_buff to the specified device. If the program specifies either (a) an unknown device or (b) a device that does not support the operation a BPF warning is thrown and the XDP_ABORTED error code is returned. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f51048c3 |
|
06-Jul-2017 |
WANG Cong <xiyou.wangcong@gmail.com> |
bonding: avoid NETDEV_CHANGEMTU event when unregistering slave As Hongjun/Nicolas summarized in their original patch: " When a device changes from one netns to another, it's first unregistered, then the netns reference is updated and the dev is registered in the new netns. Thus, when a slave moves to another netns, it is first unregistered. This triggers a NETDEV_UNREGISTER event which is caught by the bonding driver. The driver calls bond_release(), which calls dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in the old netns). " This is a very special case, because the device is being unregistered no one should still care about the NETDEV_CHANGEMTU event triggered at this point, we can avoid broadcasting this event on this path, and avoid touching inetdev_event()/addrconf_notify() path. It requires to export __dev_set_mtu() to bonding driver. Reported-by: Hongjun Li <hongjun.li@6wind.com> Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3ecab38 |
|
25-Jun-2017 |
Mintz, Yuval <Yuval.Mintz@cavium.com> |
net: Remove ndo_dfwd_start_xmit Looks like commit f663dd9aaf9e ("net: core: explicitly select a txq before doing l2 forwarding") has removed the need for this dedicated xmit function [it even explicitly states so in its commit log message] but it hasn't removed the definition of the ndo. Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> CC: Jason Wang <jasowang@redhat.com> CC: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce158e58 |
|
21-Jun-2017 |
Jakub Kicinski <kuba@kernel.org> |
xdp: add reporting of offload mode Extend the XDP_ATTACHED_* values to include offloaded mode. Let drivers report whether program is installed in the driver or the HW by changing the prog_attached field from bool to u8 (type of the netlink attribute). Exploit the fact that the value of XDP_ATTACHED_DRV is 1, therefore since all drivers currently assign the mode with double negation: mode = !!xdp_prog; no drivers have to be modified. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee5d032f |
|
21-Jun-2017 |
Jakub Kicinski <kuba@kernel.org> |
xdp: add HW offload mode flag for installing programs Add an installation-time flag for requesting that the program be installed only if it can be offloaded to HW. Internally new command for ndo_xdp is added, this way we avoid putting checks into drivers since they all return -EINVAL on an unknown command. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32d60277 |
|
21-Jun-2017 |
Jakub Kicinski <kuba@kernel.org> |
xdp: pass XDP flags into install handlers Pass XDP flags to the xdp ndo. This will allow drivers to look at the mode flags and make decisions about offload. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
58038695 |
|
15-Jun-2017 |
Martin KaFai Lau <kafai@fb.com> |
net: Add IFLA_XDP_PROG_ID Expose prog_id through IFLA_XDP_PROG_ID. This patch makes modification to generic_xdp. The later patches will modify other xdp-supported drivers. prog_id is added to struct net_dev_xdp. iproute2 patch will be followed. Here is how the 'ip link' will look like: > ip link show eth0 3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp(prog_id:1) qdisc fq_codel state UP mode DEFAULT group default qlen 1000 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
db46a0e1 |
|
14-Jun-2017 |
Magnus Damm <damm+renesas@opensource.se> |
net: update undefined ->ndo_change_mtu() comment Update ->ndo_change_mtu() callback comment to remove text about returning error in case of undefined callback. This change makes the comment match the existing code behavior. Signed-off-by: Magnus Damm <damm+renesas@opensource.se> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8397ed36 |
|
07-Jun-2017 |
David Ahern <dsahern@gmail.com> |
net: ipv6: Release route when device is unregistering Roopa reported attempts to delete a bond device that is referenced in a multipath route is hanging: $ ifdown bond2 # ifupdown2 command that deletes virtual devices unregister_netdevice: waiting for bond2 to become free. Usage count = 2 Steps to reproduce: echo 1 > /proc/sys/net/ipv6/conf/all/ignore_routes_with_linkdown ip link add dev bond12 type bond ip link add dev bond13 type bond ip addr add 2001:db8:2::0/64 dev bond12 ip addr add 2001:db8:3::0/64 dev bond13 ip route add 2001:db8:33::0/64 nexthop via 2001:db8:2::2 nexthop via 2001:db8:3::2 ip link del dev bond12 ip link del dev bond13 The root cause is the recent change to keep routes on a linkdown. Update the check to detect when the device is unregistering and release the route for that case. Fixes: a1a22c12060e4 ("net: ipv6: Keep nexthop of multipath route on admin down") Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a5fcf8a6 |
|
06-Jun-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: propagate tc filter chain index down the ndo_setup_tc call We need to push the chain index down to the drivers, so they have the information to which chain the rule belongs. For now, no driver supports multichain offload, so only chain 0 is supported. This is needed to prevent chain squashes during offload for now. Later this will be used to implement multichain offload. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf124db5 |
|
07-May-2017 |
David S. Miller <davem@davemloft.net> |
net: Fix inconsistent teardown and release of private netdev state. Network devices can allocate reasources and private memory using netdev_ops->ndo_init(). However, the release of these resources can occur in one of two different places. Either netdev_ops->ndo_uninit() or netdev->destructor(). The decision of which operation frees the resources depends upon whether it is necessary for all netdev refs to be released before it is safe to perform the freeing. netdev_ops->ndo_uninit() presumably can occur right after the NETDEV_UNREGISTER notifier completes and the unicast and multicast address lists are flushed. netdev->destructor(), on the other hand, does not run until the netdev references all go away. Further complicating the situation is that netdev->destructor() almost universally does also a free_netdev(). This creates a problem for the logic in register_netdevice(). Because all callers of register_netdevice() manage the freeing of the netdev, and invoke free_netdev(dev) if register_netdevice() fails. If netdev_ops->ndo_init() succeeds, but something else fails inside of register_netdevice(), it does call ndo_ops->ndo_uninit(). But it is not able to invoke netdev->destructor(). This is because netdev->destructor() will do a free_netdev() and then the caller of register_netdevice() will do the same. However, this means that the resources that would normally be released by netdev->destructor() will not be. Over the years drivers have added local hacks to deal with this, by invoking their destructor parts by hand when register_netdevice() fails. Many drivers do not try to deal with this, and instead we have leaks. Let's close this hole by formalizing the distinction between what private things need to be freed up by netdev->destructor() and whether the driver needs unregister_netdevice() to perform the free_netdev(). netdev->priv_destructor() performs all actions to free up the private resources that used to be freed by netdev->destructor(), except for free_netdev(). netdev->needs_free_netdev is a boolean that indicates whether free_netdev() should be done at the end of unregister_netdevice(). Now, register_netdevice() can sanely release all resources after ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit() and netdev->priv_destructor(). And at the end of unregister_netdevice(), we invoke netdev->priv_destructor() and optionally call free_netdev(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
90b602f8 |
|
19-May-2017 |
Miroslav Lichvar <mlichvar@redhat.com> |
net: add function to retrieve original skb device using NAPI ID Since commit b68581778cd0 ("net: Make skb->skb_iif always track skb->dev") skbs don't have the original index of the interface which received the packet. This information is now needed for a new control message related to hardware timestamping. Instead of adding a new field to skb, we can find the device by the NAPI ID if it is available, i.e. CONFIG_NET_RX_BUSY_POLL is enabled and the driver is using NAPI. Add dev_get_by_napi_id() and also skb_napi_id() to hide the CONFIG_NET_RX_BUSY_POLL ifdef. CC: Richard Cochran <richardcochran@gmail.com> Suggested-by: Willem de Bruijn <willemb@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
43c26a1a |
|
18-May-2017 |
Davide Caratti <dcaratti@redhat.com> |
net: more accurate checksumming in validate_xmit_skb() skb_csum_hwoffload_help() uses netdev features and skb->csum_not_inet to determine if skb needs software computation of Internet Checksum or crc32c (or nothing, if this computation can be done by the hardware). Use it in place of skb_checksum_help() in validate_xmit_skb() to avoid corruption of non-GSO SCTP packets having skb->ip_summed equal to CHECKSUM_PARTIAL. While at it, remove references to skb_csum_off_chk* functions, since they are not present anymore in Linux _ see commit cf53b1da73bd ("Revert "net: Add driver helper functions to determine checksum offloadability""). Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
219f1d79 |
|
18-May-2017 |
Davide Caratti <dcaratti@redhat.com> |
sk_buff: remove support for csum_bad in sk_buff This bit was introduced with commit 5a21232983aa ("net: Support for csum_bad in skbuff") to reduce the stack workload when processing RX packets carrying a wrong Internet Checksum. Up to now, only one driver and GRO core are setting it. Suggested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b72b5bf6 |
|
18-May-2017 |
Davide Caratti <dcaratti@redhat.com> |
net: introduce skb_crc32c_csum_help skb_crc32c_csum_help is like skb_checksum_help, but it is designed for checksumming SCTP packets using crc32c (see RFC3309), provided that libcrc32c.ko has been loaded before. In case libcrc32c is not loaded, invoking skb_crc32c_csum_help on a skb results in one the following printouts: warn_crc32c_csum_update: attempt to compute crc32c without libcrc32c.ko warn_crc32c_csum_combine: attempt to compute crc32c without libcrc32c.ko Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0cd29503 |
|
17-May-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: make struct net_device::tx_queue_len unsigned int 4 billion packet queue is something unthinkable so use 32-bit value for now. Space savings on x86_64: add/remove: 0/0 grow/shrink: 3/70 up/down: 16/-131 (-115) function old new delta change_tx_queue_len 94 108 +14 qdisc_create 1176 1177 +1 alloc_netdev_mqs 1124 1125 +1 xenvif_alloc 533 532 -1 x25_asy_setup 167 166 -1 ... tun_queue_resize 945 940 -5 pfifo_fast_enqueue 167 162 -5 qfq_init_qdisc 168 158 -10 tap_queue_resize 810 799 -11 transmit 719 698 -21 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d651983d |
|
12-May-2017 |
Mauro Carvalho Chehab <mchehab@kernel.org> |
net: fix some identation issues at kernel-doc markups Sphinx is very pedantic with regards to identation and escape sequences: ./include/net/sock.h:1967: ERROR: Unexpected indentation. ./include/net/sock.h:1969: ERROR: Unexpected indentation. ./include/net/sock.h:1970: WARNING: Block quote ends without a blank line; unexpected unindent. ./include/net/sock.h:1971: WARNING: Block quote ends without a blank line; unexpected unindent. ./include/net/sock.h:2268: WARNING: Inline emphasis start-string without end-string. ./net/core/sock.c:2686: ERROR: Unexpected indentation. ./net/core/sock.c:2687: WARNING: Block quote ends without a blank line; unexpected unindent. ./net/core/datagram.c:182: WARNING: Inline emphasis start-string without end-string. ./include/linux/netdevice.h:1444: ERROR: Unexpected indentation. ./drivers/net/phy/phy.c:381: ERROR: Unexpected indentation. ./drivers/net/phy/phy.c:382: WARNING: Block quote ends without a blank line; unexpected unindent. - Fix spacing where needed; - Properly escape constants; - Use a literal block for a race description. No functional changes. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
|
#
d67b9cd2 |
|
11-May-2017 |
Daniel Borkmann <daniel@iogearbox.net> |
xdp: refine xdp api with regards to generic xdp While working on the iproute2 generic XDP frontend, I noticed that as of right now it's possible to have native *and* generic XDP programs loaded both at the same time for the case when a driver supports native XDP. The intended model for generic XDP from b5cdae3291f7 ("net: Generic XDP") is, however, that only one out of the two can be present at once which is also indicated as such in the XDP netlink dump part. The main rationale for generic XDP is to ease accessibility (in case a driver does not yet have XDP support) and to generically provide a semantical model as an example for driver developers wanting to add XDP support. The generic XDP option for an XDP aware driver can still be useful for comparing and testing both implementations. However, it is not intended to have a second XDP processing stage or layer with exactly the same functionality of the first native stage. Only reason could be to have a partial fallback for future XDP features that are not supported yet in the native implementation and we probably also shouldn't strive for such fallback and instead encourage native feature support in the first place. Given there's currently no such fallback issue or use case, lets not go there yet if we don't need to. Therefore, change semantics for loading XDP and bail out if the user tries to load a generic XDP program when a native one is present and vice versa. Another alternative to bailing out would be to handle the transition from one flavor to another gracefully, but that would require to bring the device down, exchange both types of programs, and bring it up again in order to avoid a tiny window where a packet could hit both hooks. Given this complicates the logic for just a debugging feature in the native case, I went with the simpler variant. For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae3291f7 and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all or just a subset of flags that were used for loading the XDP prog is suboptimal in the long run since not all flags are useful for dumping and if we start to reuse the same flag definitions for load and dump, then we'll waste bit space. What we really just want is to dump the mode for now. Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0), a program is running at the native driver layer (1). Thus, add a mode that says that a program is running at generic XDP layer (2). Applications will handle this fine in that older binaries will just indicate that something is attached at XDP layer, effectively this is similar to IFLA_XDP_FLAGS attr that we would have had modulo the redundancy. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ddf9f970 |
|
30-Apr-2017 |
Jakub Kicinski <kuba@kernel.org> |
xdp: propagate extended ack to XDP setup Drivers usually have a number of restrictions for running XDP - most common being buffer sizes, LRO and number of rings. Even though some drivers try to be helpful and print error messages experience shows that users don't often consult kernel logs on netlink errors. Try to use the new extended ack mechanism to carry the message back to user space. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8ecbc40a |
|
25-Apr-2017 |
Zhang Shengju <zhangshengju@cmss.chinamobile.com> |
net: update comment for netif_dormant() function This patch updates the comment for netif_dormant() function to reflect the intended usage. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7acedaf5 |
|
25-Apr-2017 |
Eric Dumazet <edumazet@google.com> |
net: move xdp_prog field in RX cache lines (struct net_device, xdp_prog) field should be moved in RX cache lines, reducing latencies when a single packet is received on idle host, since netif_elide_gro() needs it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b5cdae32 |
|
18-Apr-2017 |
David S. Miller <davem@davemloft.net> |
net: Generic XDP This provides a generic SKB based non-optimized XDP path which is used if either the driver lacks a specific XDP implementation, or the user requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE. It is arguable that perhaps I should have required something like this as part of the initial XDP feature merge. I believe this is critical for two reasons: 1) Accessibility. More people can play with XDP with less dependencies. Yes I know we have XDP support in virtio_net, but that just creates another depedency for learning how to use this facility. I wrote this to make life easier for the XDP newbies. 2) As a model for what the expected semantics are. If there is a pure generic core implementation, it serves as a semantic example for driver folks adding XDP support. One thing I have not tried to address here is the issue of XDP_PACKET_HEADROOM, thanks to Daniel for spotting that. It seems incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or whatever even if the XDP program doesn't try to push headers at all. I think we really need the verifier to somehow propagate whether certain XDP helpers are used or not. v5: - Handle both negative and positive offset after running prog - Fix mac length in XDP_TX case (Alexei) - Use rcu_dereference_protected() in free_netdev (kbuild test robot) v4: - Fix MAC header adjustmnet before calling prog (David Ahern) - Disable LRO when generic XDP is installed (Michael Chan) - Bypass qdisc et al. on XDP_TX and record the event (Alexei) - Do not perform generic XDP on reinjected packets (DaveM) v3: - Make sure XDP program sees packet at MAC header, push back MAC header if we do XDP_TX. (Alexei) - Elide GRO when generic XDP is in use. (Alexei) - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic XDP even if the driver has an XDP implementation. (Alexei) - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS attribute. (Daniel) v2: - Add some "fall through" comments in switch statements based upon feedback from Andrew Lunn - Use RCU for generic xdp_prog, thanks to Johannes Berg. Tested-by: Andy Gospodarek <andy@greyhouse.net> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7acf8a1e |
|
18-Apr-2017 |
Matthew Whitehead <tedheadster@gmail.com> |
Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning Constants used for tuning are generally a bad idea, especially as hardware changes over time. Replace the constant 2 jiffies with sysctl variable netdev_budget_usecs to enable sysadmins to tune the softirq processing. Also document the variable. For example, a very fast machine might tune this to 1000 microseconds, while my regression testing 486DX-25 needs it to be 4000 microseconds on a nearly idle network to prevent time_squeeze from being incremented. Version 2: changed jiffies to microseconds for predictable units. Signed-off-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5be66141 |
|
18-Apr-2017 |
Jiri Pirko <jiri@mellanox.com> |
net: add netif_is_ovs_port helper To find out if a netdev is an OVS port. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d77e38e6 |
|
14-Apr-2017 |
Steffen Klassert <steffen.klassert@secunet.com> |
xfrm: Add an IPsec hardware offloading API This patch adds all the bits that are needed to do IPsec hardware offload for IPsec states and ESP packets. We add xfrmdev_ops to the net_device. xfrmdev_ops has function pointers that are needed to manage the xfrm states in the hardware and to do a per packet offloading decision. Joint work with: Ilan Tayari <ilant@mellanox.com> Guy Shapiro <guysh@mellanox.com> Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
c7ef8f0c |
|
14-Apr-2017 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Add ESP offload features This patch adds netdev features to configure IPsec offloads. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
d92be7a4 |
|
10-Apr-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: make struct net_device::min_header_len 8-bit This field is never big enough to warrant 16-bitness. 8-bit accesses enjoy shorted encoding on i386/x86_64 than 16-bit accesses: add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-10 (-10) function old new delta loopback_setup 169 164 -5 ether_setup 148 143 -5 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b3dc2f3 |
|
10-Apr-2017 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: neigh: make ->hh_len 32-bit Using 16-bit ->hh_len doesn't save any memory, save some .text instead: add/remove: 0/0 grow/shrink: 1/6 up/down: 2/-19 (-17) function old new delta neigh_update 2312 2314 +2 fwnet_header_cache 199 197 -2 eth_header_cache 101 99 -2 ip6_finish_output2 2371 2368 -3 vrf_finish_output6 1522 1518 -4 vrf_finish_output 1413 1409 -4 ip_finish_output2 1627 1623 -4 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6e970a0 |
|
28-Mar-2017 |
Andrew Lunn <andrew@lunn.ch> |
net: break include loop netdevice.h, dsa.h, devlink.h There is an include loop between netdevice.h, dsa.h, devlink.h because of NETDEV_ALIGN, making it impossible to use devlink structures in dsa.h. Break this loop by taking dsa.h out of netdevice.h, add a forward declaration of dsa_switch_tree and netdev_set_default_ethtool_ops() function, which is what netdevice.h requires. No longer having dsa.h in netdevice.h means the includes in dsa.h no longer get included. This breaks a few other files which depend on these includes. Add these directly in the affected file. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
56f36acd |
|
15-Mar-2017 |
Amritha Nambiar <amritha.nambiar@intel.com> |
mqprio: Modify mqprio to pass user parameters via ndo_setup_tc. The configurable priority to traffic class mapping and the user specified queue ranges are used to configure the traffic class, overriding the hardware defaults when the 'hw' option is set to 0. However, when the 'hw' option is non-zero, the hardware QOS defaults are used. This patch makes it so that we can pass the data the user provided to ndo_setup_tc. This allows us to pull in the queue configuration if the user requested it as well as any additional hardware offload type requested by using a value other than 1 for the hw value. Finally it also provides a means for the device driver to return the level supported for the offload type via the qopt->hw value. Previously we were just always assuming the value to be 1, in the future values beyond just 1 may be supported. Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
39e6c820 |
|
28-Feb-2017 |
Eric Dumazet <edumazet@google.com> |
net: solve a NAPI race While playing with mlx4 hardware timestamping of RX packets, I found that some packets were received by TCP stack with a ~200 ms delay... Since the timestamp was provided by the NIC, and my probe was added in tcp_v4_rcv() while in BH handler, I was confident it was not a sender issue, or a drop in the network. This would happen with a very low probability, but hurting RPC workloads. A NAPI driver normally arms the IRQ after the napi_complete_done(), after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab it. Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit while IRQ are not disabled, we might have later an IRQ firing and finding this bit set, right before napi_complete_done() clears it. This can happen with busy polling users, or if gro_flush_timeout is used. But some other uses of napi_schedule() in drivers can cause this as well. thread 1 thread 2 (could be on same cpu, or not) // busy polling or napi_watchdog() napi_schedule(); ... napi->poll() device polling: read 2 packets from ring buffer Additional 3rd packet is available. device hard irq // does nothing because NAPI_STATE_SCHED bit is owned by thread 1 napi_schedule(); napi_complete_done(napi, 2); rearm_irq(); Note that rearm_irq() will not force the device to send an additional IRQ for the packet it already signaled (3rd packet in my example) This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep() can set if it could not grab NAPI_STATE_SCHED Then napi_complete_done() properly reschedules the napi to make sure we do not miss something. Since we manipulate multiple bits at once, use cmpxchg() like in sk_busy_loop() to provide proper transactions. In v2, I changed napi_watchdog() to use a relaxed variant of napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point. In v3, I added more details in the changelog and clears NAPI_STATE_MISSED in busy_poll_stop() In v4, I added the ideas given by Alexander Duyck in v3 review Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
25393d3f |
|
15-Feb-2017 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Prepare gro for packet consuming gro callbacks The upcomming IPsec ESP gro callbacks will consume the skb, so prepare for that. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
5f114163 |
|
15-Feb-2017 |
Steffen Klassert <steffen.klassert@secunet.com> |
net: Add a skb_gro_flush_final helper. Add a skb_gro_flush_final helper to prepare for consuming skbs in call_gro_receive. We will extend this helper to not touch the skb if the skb is consumed by a gro callback with a followup patch. We need this to handle the upcomming IPsec ESP callbacks as they reinject the skb to the napi_gro_receive asynchronous. The handler is used in all gro_receive functions that can call the ESP gro handlers. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
#
fb585b44 |
|
10-Feb-2017 |
Tobias Klauser <tklauser@distanz.ch> |
net: make net_device members garp_port and mrp_port conditional garp_port is only used in net/802/garp.c which is only compiled with CONFIG_GARP enabled. Same goes for mrp_port which is only used in net/802/mrp.c with CONFIG_MRP enabled. Only include the two members in struct net_device if their respective CONFIG_* is enabled. This saves a few bytes in struct net_device in case CONFIG_GARP or CONFIG_MRP are not enabled. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
217e6fa2 |
|
07-Feb-2017 |
Willem de Bruijn <willemb@google.com> |
net: introduce device min_header_len The stack must not pass packets to device drivers that are shorter than the minimum link layer header length. Previously, packet sockets would drop packets smaller than or equal to dev->hard_header_len, but this has false positives. Zero length payload is used over Ethernet. Other link layer protocols support variable length headers. Support for validation of these protocols removed the min length check for all protocols. Introduce an explicit dev->min_header_len parameter and drop all packets below this value. Initially, set it to non-zero only for Ethernet and loopback. Other protocols can follow in a patch to net-next. Fixes: 9ed988cd5915 ("packet: validate variable length ll headers") Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a8eca326 |
|
06-Feb-2017 |
Ido Schimmel <idosch@mellanox.com> |
net: remove ndo_neigh_{construct, destroy} from stacked devices In commit 18bfb924f000 ("net: introduce default neigh_construct/destroy ndo calls for L2 upper devices") we added these ndos to stacked devices such as team and bond, so that calls will be propagated to mlxsw. However, previous commit removed the reliance on these ndos and no new users of these ndos have appeared since above mentioned commit. We can therefore safely remove this dead code. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02c1602e |
|
04-Feb-2017 |
Eric Dumazet <edumazet@google.com> |
net: remove __napi_complete() All __napi_complete() callers have been converted to use the more standard napi_complete_done(), we can now remove this NAPI method for good. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
79e7fff4 |
|
02-Feb-2017 |
Eric Dumazet <edumazet@google.com> |
net: remove support for per driver ndo_busy_poll() We added generic support for busy polling in NAPI layer in linux-4.5 No network driver uses ndo_busy_poll() anymore, we can get rid of the pointer in struct net_device_ops, and its use in sk_busy_loop() Saves NETIF_F_BUSY_POLL features bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1a2a1444 |
|
31-Jan-2017 |
Dimitris Michailidis <dmichail@google.com> |
net: fix ndo_features_check/ndo_fix_features comment ordering Commit cdba756f5803a2 ("net: move ndo_features_check() close to ndo_start_xmit()") inadvertently moved the doc comment for .ndo_fix_features instead of .ndo_features_check. Fix the comment ordering. Fixes: cdba756f5803a2 ("net: move ndo_features_check() close to ndo_start_xmit()") Signed-off-by: Dimitris Michailidis <dmichail@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f617f276 |
|
27-Jan-2017 |
Edward Cree <ecree@solarflare.com> |
net: implement netif_cond_dbg macro For reporting things that may or may not be serious, depending on some condition, netif_cond_dbg will check the condition and print the report at either dbg (if the condition is true) or the specified level. Suggested-by: Jon Cooper <jcooper@solarflare.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6a4bc2b4 |
|
26-Jan-2017 |
Florian Fainelli <f.fainelli@gmail.com> |
net: Fix ndo_setup_tc comment Commit 16e5cc647173 ("net: rework setup_tc ndo op to consume general tc operand") changed the ndo_setup_tc() signature, but did not update the comments in netdevice.h, so do that now. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4a7c9726 |
|
18-Jan-2017 |
Tobias Klauser <tklauser@distanz.ch> |
net: Remove usage of net_device last_rx member The network stack no longer uses the last_rx member of struct net_device since the bonding driver switched to use its own private last_rx in commit 9f242738376d ("bonding: use last_arp_rx in slave_last_rx()"). However, some drivers still (ab)use the field for their own purposes and some driver just update it without actually using it. Previously, there was an accompanying comment for the last_rx member added in commit 4dc89133f49b ("net: add a comment on netdev->last_rx") which asked drivers not to update is, unless really needed. However, this commend was removed in commit f8ff080dacec ("bonding: remove useless updating of slave->dev->last_rx"), so some drivers added later on still did update last_rx. Remove all usage of last_rx and switch three drivers (sky2, atp and smc91c92_cs) which actually read and write it to use their own private copy in netdev_priv. Compile-tested with allyesconfig and allmodconfig on x86 and arm. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Mirko Lindner <mlindner@marvell.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
738b35cc |
|
11-Jan-2017 |
Florian Fainelli <f.fainelli@gmail.com> |
net: core: Make netif_wake_subqueue a wrapper netif_wake_subqueue() is duplicating the same thing that netif_tx_wake_queue() does, so make it call it directly after looking up the queue from the index. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
57ea52a8 |
|
10-Jan-2017 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Disable frag0 optimization on IPv6 ext headers The GRO fast path caches the frag0 address. This address becomes invalid if frag0 is modified by pskb_may_pull or its variants. So whenever that happens we must disable the frag0 optimization. This is usually done through the combination of gro_header_hard and gro_header_slow, however, the IPv6 extension header path did the pulling directly and would continue to use the GRO fast path incorrectly. This patch fixes it by disabling the fast path when we enter the IPv6 extension header path. Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address") Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bc1f4470 |
|
06-Jan-2017 |
stephen hemminger <stephen@networkplumber.org> |
net: make ndo_get_stats64 a void function The network device operation for reading statistics is only called in one place, and it ignores the return value. Having a structure return value is potentially confusing because some future driver could incorrectly assume that the return value was used. Fix all drivers with ndo_get_stats64 to have a void function. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d48b53f |
|
29-Dec-2016 |
Matthias Tafelmeier <matthias.tafelmeier@gmx.net> |
net: dev_weight: TX/RX orthogonality Oftenly, introducing side effects on packet processing on the other half of the stack by adjusting one of TX/RX via sysctl is not desirable. There are cases of demand for asymmetric, orthogonal configurability. This holds true especially for nodes where RPS for RFS usage on top is configured and therefore use the 'old dev_weight'. This is quite a common base configuration setup nowadays, even with NICs of superior processing support (e.g. aRFS). A good example use case are nodes acting as noSQL data bases with a large number of tiny requests and rather fewer but large packets as responses. It's affordable to have large budget and rx dev_weights for the requests. But as a side effect having this large a number on TX processed in one run can overwhelm drivers. This patch therefore introduces an independent configurability via sysctl to userland. Signed-off-by: Matthias Tafelmeier <matthias.tafelmeier@gmx.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
13bfff25 |
|
07-Dec-2016 |
Eric Dumazet <edumazet@google.com> |
net: rfs: add a jump label RFS is not commonly used, so add a jump label to avoid some conditionals in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7091d8c7 |
|
01-Dec-2016 |
Hadar Hen Zion <hadarh@mellanox.com> |
net/sched: cls_flower: Add offload support using egress Hardware device In order to support hardware offloading when the device given by the tc rule is different from the Hardware underline device, extract the mirred (egress) device from the tc action when a filter is added, using the new tc_action_ops, get_dev(). Flower caches the information about the mirred device and use it for calling ndo_setup_tc in filter change, update stats and delete. Calling ndo_setup_tc of the mirred (egress) device instead of the ingress device will allow a resolution between the software ingress device and the underline hardware device. The resolution will take place inside the offloading driver using 'egress_device' flag added to tc_to_netdev struct which is provided to the offloading driver. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
85de8576 |
|
28-Nov-2016 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf, xdp: allow to pass flags to dev_change_xdp_fd Add an IFLA_XDP_FLAGS attribute that can be passed for setting up XDP along with IFLA_XDP_FD, which eventually allows user space to implement typical add/replace/delete logic for programs. Right now, calling into dev_change_xdp_fd() will always replace previous programs. When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more graceful when requested by returning -EBUSY in case we try to attach a new program, but we find that another one is already attached. This will be used by upcoming front-end for iproute2 as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a717f4f |
|
23-Nov-2016 |
Michael S. Tsirkin <mst@redhat.com> |
netdevice: fix sparse warning for HARD_TX_LOCK sparse warns about context imbalance in any code that uses HARD_TX_LOCK/UNLOCK - this is because it's unable to determine that flags don't change so lock and unlock are paired. Seems easy enough to fix by adding __acquire/__release calls. With this patch af_packet.c is now sparse-clean, Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3df5b3c6 |
|
22-Nov-2016 |
Or Gerlitz <ogerlitz@mellanox.com> |
net: Add net-device param to the get offloaded stats ndo Some drivers would need to check few internal matters for that. To be used in downstream mlx5 commit. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
920c1cd3 |
|
21-Nov-2016 |
Randy Dunlap <rdunlap@infradead.org> |
netdevice.h: fix kernel-doc warning Fix kernel-doc warning in <linux/netdevice.h> (missing ':'): ..//include/linux/netdevice.h:1904: warning: No description found for parameter 'prio_tc_map[TC_BITMASK + 1]' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
89c4b442 |
|
16-Nov-2016 |
Eric Dumazet <edumazet@google.com> |
netpoll: more efficient locking Callers of netpoll_poll_lock() own NAPI_STATE_SCHED Callers of netpoll_poll_unlock() have BH blocked between the NAPI_STATE_SCHED being cleared and poll_lock is released. We can avoid the spinlock which has no contention, and use cmpxchg() on poll_owner which we need to set anyway. This removes a possible lockdep violation after the cited commit, since sk_busy_loop() re-enables BH before calling busy_poll_stop() Fixes: 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
364b6055 |
|
15-Nov-2016 |
Eric Dumazet <edumazet@google.com> |
net: busy-poll: return busypolling status to drivers NAPI drivers use napi_complete_done() or napi_complete() when they drained RX ring and right before re-enabling device interrupts. In busy polling, we can avoid interrupts being delivered since we are polling RX ring in a controlled loop. Drivers can chose to use napi_complete_done() return value to reduce interrupts overhead while busy polling is active. This is optional, legacy drivers should work fine even if not updated. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
217f6974 |
|
15-Nov-2016 |
Eric Dumazet <edumazet@google.com> |
net: busy-poll: allow preemption in sk_busy_loop() After commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job"), sk_busy_loop() needs a bit of care : softirqs might be delayed since we do not allow preemption yet. This patch adds preemptiom points in sk_busy_loop(), and makes sure no unnecessary cache line dirtying or atomic operations are done while looping. A new flag is added into napi->state : NAPI_STATE_IN_BUSY_POLL This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED, so that sk_busy_loop() does not have to grab it again. Similarly, netpoll_poll_lock() is done one time. This gives about 10 to 20 % improvement in various busy polling tests, especially when many threads are busy polling in configurations with large number of NIC queues. This should allow experimenting with bigger delays without hurting overall latencies. Tested: On a 40Gb mlx4 NIC, 32 RX/TX queues. echo 70 >/proc/sys/net/core/busy_read for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done Before: After: 1: 90072 92819 2: 157289 184007 3: 235772 213504 4: 344074 357513 5: 394755 458267 6: 461151 487819 7: 549116 625963 8: 544423 716219 9: 720460 738446 10: 794686 837612 11: 915998 923960 12: 937507 925107 13: 1019677 971506 14: 1046831 1113650 15: 1114154 1148902 16: 1105221 1179263 17: 1266552 1299585 18: 1258454 1383817 19: 1341453 1312194 20: 1363557 1488487 21: 1387979 1501004 22: 1417552 1601683 23: 1550049 1642002 24: 1568876 1601915 25: 1560239 1683607 26: 1640207 1745211 27: 1706540 1723574 28: 1638518 1722036 29: 1734309 1757447 30: 1782007 1855436 31: 1724806 1888539 32: 1717716 1944297 33: 1778716 1869118 34: 1805738 1983466 35: 1815694 2020758 36: 1893059 2035632 37: 1843406 2034653 38: 1888830 2086580 39: 1972827 2143567 40: 1877729 2181851 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4e3264d2 |
|
09-Nov-2016 |
Martin KaFai Lau <kafai@fb.com> |
bpf: Fix bpf_redirect to an ipip/ip6tnl dev If the bpf program calls bpf_redirect(dev, 0) and dev is an ipip/ip6tnl, it currently includes the mac header. e.g. If dev is ipip, the end result is IP-EthHdr-IP instead of IP-IP. The fix is to pull the mac header. At ingress, skb_postpull_rcsum() is not needed because the ethhdr should have been pulled once already and then got pushed back just before calling the bpf_prog. At egress, this patch calls skb_postpull_rcsum(). If bpf_redirect(dev, BPF_F_INGRESS) is called, it also fails now because it calls dev_forward_skb() which eventually calls eth_type_trans(skb, dev). The eth_type_trans() will set skb->type = PACKET_OTHERHOST because the mac address does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST will eventually cause the ip_rcv() errors out. To fix this, ____dev_forward_skb() is added. Joint work with Daniel Borkmann. Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
149d6ad8 |
|
08-Nov-2016 |
Eric Dumazet <edumazet@google.com> |
net: napi_hash_add() is no longer exported There are no more users except from net/core/dev.c napi_hash_add() can now be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
184c449f |
|
28-Oct-2016 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add support for XPS with QoS via traffic classes This patch adds support for setting and using XPS when QoS via traffic classes is enabled. With this change we will factor in the priority and traffic class mapping of the packet and use that information to correctly select the queue. This allows us to define a set of queues for a given traffic class via mqprio and then configure the XPS mapping for those queues so that the traffic flows can avoid head-of-line blocking between the individual CPUs if so desired. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8d059b0f |
|
28-Oct-2016 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add sysfs value to determine queue traffic class Add a sysfs attribute for a Tx queue that allows us to determine the traffic class for a given queue. This will allow us to more easily determine this in the future. It is needed as XPS will take the traffic class for a group of queues into account in order to avoid pulling traffic from one traffic class into another. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9cf1f6a8 |
|
28-Oct-2016 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Move functions for configuring traffic classes out of inline headers The functions for configuring the traffic class to queue mappings have other effects that need to be addressed. Instead of trying to export a bunch of new functions just relocate the functions so that we can instrument them directly with the functionality they will need. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fcd91dd4 |
|
20-Oct-2016 |
Sabrina Dubroca <sd@queasysnail.net> |
net: add recursion limit to GRO Currently, GRO can do unlimited recursion through the gro_receive handlers. This was fixed for tunneling protocols by limiting tunnel GRO to one level with encap_mark, but both VLAN and TEB still have this problem. Thus, the kernel is vulnerable to a stack overflow, if we receive a packet composed entirely of VLAN headers. This patch adds a recursion counter to the GRO layer to prevent stack overflow. When a gro_receive function hits the recursion limit, GRO is aborted for this skb and it is processed normally. This recursion counter is put in the GRO CB, but could be turned into a percpu counter if we run out of space in the CB. Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report. Fixes: CVE-2016-7039 Fixes: 9b174d88c257 ("net: Add Transparent Ethernet Bridging GRO support.") Fixes: 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated vlan") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Jiri Benc <jbenc@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e4961b07 |
|
19-Oct-2016 |
Ido Schimmel <idosch@mellanox.com> |
net: core: Correctly iterate over lower adjacency list Tamir reported the following trace when processing ARP requests received via a vlan device on top of a VLAN-aware bridge: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0] [...] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.8.0-rc7 #1 Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016 task: ffff88017edfea40 task.stack: ffff88017ee10000 RIP: 0010:[<ffffffff815dcc73>] [<ffffffff815dcc73>] netdev_all_lower_get_next_rcu+0x33/0x60 [...] Call Trace: <IRQ> [<ffffffffa015de0a>] mlxsw_sp_port_lower_dev_hold+0x5a/0xa0 [mlxsw_spectrum] [<ffffffffa016f1b0>] mlxsw_sp_router_netevent_event+0x80/0x150 [mlxsw_spectrum] [<ffffffff810ad07a>] notifier_call_chain+0x4a/0x70 [<ffffffff810ad13a>] atomic_notifier_call_chain+0x1a/0x20 [<ffffffff815ee77b>] call_netevent_notifiers+0x1b/0x20 [<ffffffff815f2eb6>] neigh_update+0x306/0x740 [<ffffffff815f38ce>] neigh_event_ns+0x4e/0xb0 [<ffffffff8165ea3f>] arp_process+0x66f/0x700 [<ffffffff8170214c>] ? common_interrupt+0x8c/0x8c [<ffffffff8165ec29>] arp_rcv+0x139/0x1d0 [<ffffffff816e505a>] ? vlan_do_receive+0xda/0x320 [<ffffffff815e3794>] __netif_receive_skb_core+0x524/0xab0 [<ffffffff815e6830>] ? dev_queue_xmit+0x10/0x20 [<ffffffffa06d612d>] ? br_forward_finish+0x3d/0xc0 [bridge] [<ffffffffa06e5796>] ? br_handle_vlan+0xf6/0x1b0 [bridge] [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60 [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0 [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70 [<ffffffffa06d7856>] br_pass_frame_up+0xc6/0x160 [bridge] [<ffffffffa06d63d7>] ? deliver_clone+0x37/0x50 [bridge] [<ffffffffa06d656c>] ? br_flood+0xcc/0x160 [bridge] [<ffffffffa06d7b14>] br_handle_frame_finish+0x224/0x4f0 [bridge] [<ffffffffa06d7f94>] br_handle_frame+0x174/0x300 [bridge] [<ffffffff815e3599>] __netif_receive_skb_core+0x329/0xab0 [<ffffffff81374815>] ? find_next_bit+0x15/0x20 [<ffffffff8135e802>] ? cpumask_next_and+0x32/0x50 [<ffffffff810c9968>] ? load_balance+0x178/0x9b0 [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60 [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0 [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70 [<ffffffffa01544a1>] mlxsw_sp_rx_listener_func+0x61/0xb0 [mlxsw_spectrum] [<ffffffffa005c9f7>] mlxsw_core_skb_receive+0x187/0x200 [mlxsw_core] [<ffffffffa007332a>] mlxsw_pci_cq_tasklet+0x63a/0x9b0 [mlxsw_pci] [<ffffffff81091986>] tasklet_action+0xf6/0x110 [<ffffffff81704556>] __do_softirq+0xf6/0x280 [<ffffffff8109213f>] irq_exit+0xdf/0xf0 [<ffffffff817042b4>] do_IRQ+0x54/0xd0 [<ffffffff8170214c>] common_interrupt+0x8c/0x8c The problem is that netdev_all_lower_get_next_rcu() never advances the iterator, thereby causing the loop over the lower adjacency list to run forever. Fix this by advancing the iterator and avoid the infinite loop. Fixes: 7ce856aaaf13 ("mlxsw: spectrum: Add couple of lower device helper functions") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Tamir Winetroub <tamirw@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f1170fd4 |
|
17-Oct-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: Remove all_adj_list and its references Only direct adjacencies are maintained. All upper or lower devices can be learned via the new walk API which recursively walks the adj_list for upper devices or lower devices. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1a3f060c |
|
17-Oct-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: Introduce new api for walking upper and lower devices This patch introduces netdev_walk_all_upper_dev_rcu, netdev_walk_all_lower_dev and netdev_walk_all_lower_dev_rcu. These functions recursively walk the adj_list of devices to determine all upper and lower devices. The functions take a callback function that is invoked for each device in the list. If the callback returns non-0, the walk is terminated and the functions return that code back to callers. v3 - simplified netdev_has_upper_dev_all_rcu and __netdev_has_upper_dev and removed typecast as suggested by Stephen v2 - fixed definition of netdev_next_lower_dev_rcu to mirror the upper_dev version. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf53b1da |
|
11-Oct-2016 |
stephen hemminger <stephen@networkplumber.org> |
Revert "net: Add driver helper functions to determine checksum offloadability" This reverts commit 6ae23ad36253a8033c5714c52b691b84456487c5. The code has been in kernel since 4.4 but there are no in tree code that uses. Unused code is broken code, remove it. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61e84623 |
|
07-Oct-2016 |
Jarod Wilson <jarod@redhat.com> |
net: centralize net_device min/max MTU checking While looking into an MTU issue with sfc, I started noticing that almost every NIC driver with an ndo_change_mtu function implemented almost exactly the same range checks, and in many cases, that was the only practical thing their ndo_change_mtu function was doing. Quite a few drivers have either 68, 64, 60 or 46 as their minimum MTU value checked, and then various sizes from 1500 to 65535 for their maximum MTU value. We can remove a whole lot of redundant code here if we simple store min_mtu and max_mtu in net_device, and check against those in net/core/dev.c's dev_set_mtu(). In theory, there should be zero functional change with this patch, it just puts the infrastructure in place. Subsequent patches will attempt to start using said infrastructure, with theoretically zero change in functionality. CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3b37f11 |
|
21-Sep-2016 |
Aaron Conole <aconole@bytheb.org> |
netfilter: replace list_head with single linked list The netfilter hook list never uses the prev pointer, and so can be trimmed to be a simple singly-linked list. In addition to having a more light weight structure for hook traversal, struct net becomes 5568 bytes (down from 6400) and struct net_device becomes 2176 bytes (down from 2240). Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
79aab093 |
|
21-Sep-2016 |
Moshe Shemesh <moshe@mellanox.com> |
net: Update API for VF vlan protocol 802.1ad support Introduce new rtnl UAPI that exposes a list of vlans per VF, giving the ability for user-space application to specify it for the VF, as an option to support 802.1ad. We adjusted IP Link tool to support this option. For future use cases, the new UAPI supports multiple vlans. For now we limit the list size to a single vlan in kernel. Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward compatibility with older versions of IP Link tool. Add a vlan protocol parameter to the ndo_set_vf_vlan callback. We kept 802.1Q as the drivers' default vlan protocol. Suitable ip link tool command examples: Set vf vlan protocol 802.1ad: ip link set eth0 vf 1 vlan 100 proto 802.1ad Set vf to VST (802.1Q) mode: ip link set eth0 vf 1 vlan 100 proto 802.1Q Or by omitting the new parameter ip link set eth0 vf 1 vlan 100 Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
332ae8e2 |
|
21-Sep-2016 |
Jakub Kicinski <kuba@kernel.org> |
net: cls_bpf: add hardware offload This patch adds hardware offload capability to cls_bpf classifier, similar to what have been done with U32 and flower. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2c9d85d4 |
|
16-Sep-2016 |
Nogah Frankel <nogahf@mellanox.com> |
netdevice: Add offload statistics ndo Add a new ndo to return statistics for offloaded operation. Since there can be many different offloaded operation with many stats types, the ndo gets an attribute id by which it knows which stats are wanted. The ndo also gets a void pointer to be cast according to the attribute id. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24b27fc4 |
|
01-Sep-2016 |
Mahesh Bandewar <maheshb@google.com> |
bonding: Fix bonding crash Following few steps will crash kernel - (a) Create bonding master > modprobe bonding miimon=50 (b) Create macvlan bridge on eth2 > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \ type macvlan (c) Now try adding eth2 into the bond > echo +eth2 > /sys/class/net/bond0/bonding/slaves <crash> Bonding does lots of things before checking if the device enslaved is busy or not. In this case when the notifier call-chain sends notifications, the bond_netdev_event() assumes that the rx_handler /rx_handler_data is registered while the bond_enslave() hasn't progressed far enough to register rx_handler for the new slave. This patch adds a rx_handler check that can be performed right at the beginning of the enslave code to avoid getting into this situation. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d297653d |
|
30-Aug-2016 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
rtnetlink: fdb dump: optimize by saving last interface markers fdb dumps spanning multiple skb's currently restart from the first interface again for every skb. This results in unnecessary iterations on the already visited interfaces and their fdb entries. In large scale setups, we have seen this to slow down fdb dumps considerably. On a system with 30k macs we see fdb dumps spanning across more than 300 skbs. To fix the problem, this patch replaces the existing single fdb marker with three markers: netdev hash entries, netdevs and fdb index to continue where we left off instead of restarting from the first netdev. This is consistent with link dumps. In the process of fixing the performance issue, this patch also re-implements fix done by commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump") (with an internal fix from Wilson Kok) in the following ways: - change ndo_fdb_dump handlers to return error code instead of the last fdb index - use cb->args strictly for dump frag markers and not error codes. This is consistent with other dump functions. Below results were taken on a system with 1000 netdevs and 35085 fdb entries: before patch: $time bridge fdb show | wc -l 15065 real 1m11.791s user 0m0.070s sys 1m8.395s (existing code does not return all macs) after patch: $time bridge fdb show | wc -l 35085 real 0m2.017s user 0m0.113s sys 0m1.942s Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6bc506b4 |
|
25-Aug-2016 |
Ido Schimmel <idosch@mellanox.com> |
bridge: switchdev: Add forward mark support for stacked devices switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of port netdevs so that packets being flooded by the device won't be flooded twice. It works by assigning a unique identifier (the ifindex of the first bridge port) to bridge ports sharing the same parent ID. This prevents packets from being flooded twice by the same switch, but will flood packets through bridge ports belonging to a different switch. This method is problematic when stacked devices are taken into account, such as VLANs. In such cases, a physical port netdev can have upper devices being members in two different bridges, thus requiring two different 'offload_fwd_mark's to be configured on the port netdev, which is impossible. The main problem is that packet and netdev marking is performed at the physical netdev level, whereas flooding occurs between bridge ports, which are not necessarily port netdevs. Instead, packet and netdev marking should really be done in the bridge driver with the switch driver only telling it which packets it already forwarded. The bridge driver will mark such packets using the mark assigned to the ingress bridge port and will prevent the packet from being forwarded through any bridge port sharing the same mark (i.e. having the same parent ID). Remove the current switchdev 'offload_fwd_mark' implementation and instead implement the proposed method. In addition, make rocker - the sole user of the mark - use the proposed method. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
952fcfd0 |
|
12-Aug-2016 |
Sabrina Dubroca <sd@queasysnail.net> |
net: remove type_check from dev_get_nest_level() The idea for type_check in dev_get_nest_level() was to count the number of nested devices of the same type (currently, only macvlan or vlan devices). This prevented the false positive lockdep warning on configurations such as: eth0 <--- macvlan0 <--- vlan0 <--- macvlan1 However, this doesn't prevent a warning on a configuration such as: eth0 <--- macvlan0 <--- vlan0 eth1 <--- vlan1 <--- macvlan1 In this case, all the locks end up with a nesting subclass of 1, so lockdep thinks that there is still a deadlock: - in the first case we have (macvlan_netdev_addr_lock_key, 1) and then take (vlan_netdev_xmit_lock_key, 1) - in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then take (macvlan_netdev_addr_lock_key, 1) By removing the linktype check in dev_get_nest_level() and always incrementing the nesting depth, lockdep considers this configuration valid. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
59cc1f61 |
|
10-Aug-2016 |
Jiri Kosina <jkosina@suse.cz> |
net: sched: convert qdisc linked list to hashtable Convert the per-device linked list into a hashtable. The primary motivation for this change is that currently, we're not tracking all the qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup performed over the linked list by qdisc_match_from_root() is rather expensive. The ultimate goal is to get rid of hidden qdiscs completely, which will bring much more determinism in user experience. Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b87f7936 |
|
20-Jul-2016 |
Yotam Gigi <yotamg@mellanox.com> |
net/sched: Add match-all classifier hw offloading. Following the work that have been done on offloading classifiers like u32 and flower, now the match-all classifier hw offloading is possible. if the interface supports tc offloading. To control the offloading, two tc flags have been introduced: skip_sw and skip_hw. Typical usage: tc filter add dev eth25 parent ffff: \ matchall skip_sw \ action mirred egress mirror \ dev eth27 Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a7862b45 |
|
19-Jul-2016 |
Brenden Blanco <bblanco@plumgrid.com> |
net: add ndo to setup/query xdp prog in adapter rx Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP filter. The single op is used for both setup/query of the xdp program, modelled after ndo_setup_tc. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
18d3df3e |
|
14-Jul-2016 |
Paolo Abeni <pabeni@redhat.com> |
vlan: use a valid default mtu value for vlan over macsec macsec can't cope with mtu frames which need vlan tag insertion, and vlan device set the default mtu equal to the underlying dev's one. By default vlan over macsec devices use invalid mtu, dropping all the large packets. This patch adds a netif helper to check if an upper vlan device needs mtu reduction. The helper is used during vlan devices initialization to set a valid default and during mtu updating to forbid invalid, too bit, mtu values. The helper currently only check if the lower dev is a macsec device, if we get more users, we need to update only the helper (possibly reserving an additional IFF bit). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
18bfb924 |
|
05-Jul-2016 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce default neigh_construct/destroy ndo calls for L2 upper devices L2 upper device needs to propagate neigh_construct/destroy calls down to lower devices. Do this by defining default ndo functions and use them in team, bond, bridge and vlan. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
503eebc2 |
|
05-Jul-2016 |
Jiri Pirko <jiri@mellanox.com> |
net: add dev arg to ndo_neigh_construct/destroy As the following patch will allow upper devices to follow the call down lower devices, we need to add dev here and not rely on n->dev. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7ce856aa |
|
04-Jul-2016 |
Jiri Pirko <jiri@mellanox.com> |
mlxsw: spectrum: Add couple of lower device helper functions Add functions that iterate over lower devices and find port device. As a dependency add netdev_for_each_all_lower_dev and netdev_for_each_all_lower_dev_rcu macro with netdev_all_lower_get_next and netdev_all_lower_get_next_rcu shelpers. Also, add functions to return mlxsw struct according to lower device found and mlxsw_port struct with a reference to lower device. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
08294a26 |
|
30-Jun-2016 |
Jason Wang <jasowang@redhat.com> |
net: introduce NETDEV_CHANGE_TX_QUEUE_LEN This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this will be triggered when tx_queue_len. It could be used by net device who want to do some processing at that time. An example is tun who may want to resize tx array when tx_queue_len is changed. Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1938ee1f |
|
16-Jun-2016 |
Alexander Duyck <aduyck@mirantis.com> |
net: Remove deprecated tunnel specific UDP offload functions Now that we have all the drivers using udp_tunnel_get_rx_ports, ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the function calls that were specific to VXLAN and GENEVE. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7c46a640 |
|
16-Jun-2016 |
Alexander Duyck <aduyck@mirantis.com> |
net: Merge VXLAN and GENEVE push notifiers into a single notifier This patch merges the notifiers for VXLAN and GENEVE into a single UDP tunnel notifier. The idea is that we will want to only have to make one notifier call to receive the list of ports for VXLAN and GENEVE tunnels that need to be offloaded. In addition we add a new set of ndo functions named ndo_udp_tunnel_add and ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data such as port and address family as tunnels are added and removed. The tunnel meta-data is now transported in a structure named udp_tunnel_info which for now carries the type, address family, and port number. In the future this could be updated so that we can include a tuple of values including things such as the destination IP address and other fields. I also ended up going with a naming scheme that consisted of using the prefix udp_tunnel on function names. I applied this to the notifier and ndo ops as well so that it hopefully points to the fact that these are primarily used in the udp_tunnel functions. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f997c55c |
|
15-Jun-2016 |
Alexander Aring <aar@pengutronix.de> |
ipv6: introduce neighbour discovery ops This patch introduces neighbour discovery ops callback structure. The idea is to separate the handling for 6LoWPAN into the 6lowpan module. These callback offers 6lowpan different handling, such as 802.15.4 short address handling or RFC6775 (Neighbor Discovery Optimization for IPv6 over 6LoWPANs). Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8626a0c8 |
|
15-Jun-2016 |
Alexander Aring <aar@pengutronix.de> |
6lowpan: add private neighbour data This patch will introduce a 6lowpan neighbour private data. Like the interface private data we handle private data for generic 6lowpan and for link-layer specific 6lowpan. The current first use case if to save the short address for a 802.15.4 6lowpan neighbour. Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
99860208 |
|
10-Jun-2016 |
Florian Westphal <fw@strlen.de> |
sched: remove NET_XMIT_POLICED sch_atm returns this when TC_ACT_SHOT classification occurs. But all other schedulers that use tc_classify (htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS in this case so just do that in atm. BATMAN uses it as an intermediate return value to signal forwarding vs. buffering, but it did not return POLICED to callers outside of BATMAN. Reviewed-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a70b506e |
|
10-Jun-2016 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: enforce recursion limit on redirects Respect the stack's xmit_recursion limit for calls into dev_queue_xmit(). Currently, they are not handeled by the limiter when attached to clsact's egress parent, for example, and a buggy program redirecting it to the same device again could run into stack overflow eventually. It would be good if we could notify an admin to give him a chance to react. We reuse xmit_recursion instead of having one private to eBPF, so that the stack's current recursion depth will be taken into account as well. Follow-up to commit 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper") and 27b29f63058d ("bpf: add bpf_redirect() helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3fff6c4 |
|
09-Jun-2016 |
Eric Dumazet <edumazet@google.com> |
net: add netdev_lockdep_set_classes() helper It is time to add netdev_lockdep_set_classes() helper so that lockdep annotations per device type are easier to manage. This removes a lot of copies and missing annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
123b3652 |
|
08-Jun-2016 |
Eric Dumazet <edumazet@google.com> |
net: sched: fix missing doc annotations "make htmldocs" complains otherwise: .//net/core/gen_stats.c:168: warning: No description found for parameter 'running' .//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key' Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f9eb8aea |
|
06-Jun-2016 |
Eric Dumazet <edumazet@google.com> |
net_sched: transform qdisc running bit into a seqcount Instead of using a single bit (__QDISC___STATE_RUNNING) in sch->__state, use a seqcount. This adds lockdep support, but more importantly it will allow us to sample qdisc/class statistics without having to grab qdisc root lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
90017acc |
|
02-Jun-2016 |
Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> |
sctp: Add GSO support SCTP has this pecualiarity that its packets cannot be just segmented to (P)MTU. Its chunks must be contained in IP segments, padding respected. So we can't just generate a big skb, set gso_size to the fragmentation point and deliver it to IP layer. This patch takes a different approach. SCTP will now build a skb as it would be if it was received using GRO. That is, there will be a cover skb with protocol headers and children ones containing the actual segments, already segmented to a way that respects SCTP RFCs. With that, we can tell skb_segment() to just split based on frag_list, trusting its sizes are already in accordance. This way SCTP can benefit from GSO and instead of passing several packets through the stack, it can pass a single large packet. v2: - Added support for receiving GSO frames, as requested by Dave Miller. - Clear skb->cb if packet is GSO (otherwise it's not used by SCTP) - Added heuristics similar to what we have in TCP for not generating single GSO packets that fills cwnd. v3: - consider sctphdr size in skb_gso_transport_seglen() - rebased due to 5c7cdf339af5 ("gso: Remove arbitrary checks for unsupported GSO") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7e13318d |
|
18-May-2016 |
Tom Herbert <tom@herbertland.com> |
net: define gso types for IPx over IPv4 and IPv6 This patch defines two new GSO definitions SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and NETIF_F_GSO_IPXIP6. These are used to described IP in IP tunnel and what the outer protocol is. The inner protocol can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT are removed (these are both instances of SKB_GSO_IPXIP4). SKB_GSO_IPXIP6 will be used when support for GSO with IP encapsulation over IPv6 is added. Signed-off-by: Tom Herbert <tom@herbertland.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c94987e4 |
|
13-May-2016 |
Daniel Borkmann <daniel@iogearbox.net> |
bpf: move bpf_jit_enable declaration Move the bpf_jit_enable declaration to the filter.h file where most other core code is declared, also since we're going to add a second knob there. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
74b20582 |
|
10-May-2016 |
David Ahern <dsa@cumulusnetworks.com> |
net: l3mdev: Add hook in ip and ipv6 Currently the VRF driver uses the rx_handler to switch the skb device to the VRF device. Switching the dev prior to the ip / ipv6 layer means the VRF driver has to duplicate IP/IPv6 processing which adds overhead and makes features such as retaining the ingress device index more complicated than necessary. This patch moves the hook to the L3 layer just after the first NF_HOOK for PRE_ROUTING. This location makes exposing the original ingress device trivial (next patch) and allows adding other NF_HOOKs to the VRF driver in the future. dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb with the switched device through the packet taps to maintain current behavior (tcpdump can be used on either the vrf device or the enslaved devices). Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
229740c6 |
|
03-May-2016 |
Jarno Rajahalme <jarno@ovn.org> |
udp_offload: Set encapsulation before inner completes. UDP tunnel segmentation code relies on the inner offsets being set for an UDP tunnel GSO packet, but the inner *_complete() functions will set the inner offsets only if 'encapsulation' is set before calling them. Currently, udp_gro_complete() sets 'encapsulation' only after the inner *_complete() functions are done. This causes the inner offsets having invalid values after udp_gro_complete() returns, which in turn will make it impossible to properly segment the packet in case it needs to be forwarded, which would be visible to the user either as invalid packets being sent or as packet loss. This patch fixes this by setting skb's 'encapsulation' in udp_gro_complete() before calling into the inner complete functions, and by making each possible UDP tunnel gro_complete() callback set the inner_mac_header to the beginning of the tunnel payload. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Reviewed-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9b36627a |
|
03-May-2016 |
Florian Westphal <fw@strlen.de> |
net: remove dev->trans_start previous patches removed all direct accesses to dev->trans_start, so change the netif_trans_update helper to update trans_start of netdev queue 0 instead and then remove trans_start from struct net_device. AFAICS a lot of the netif_trans_update() invocations are now useless because they occur in ndo_start_xmit and driver doesn't set LLTX (i.e. stack already took care of the update). As I can't test any of them it seems better to just leave them alone. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ba162f8e |
|
03-May-2016 |
Florian Westphal <fw@strlen.de> |
netdevice: add helper to update trans_start trans_start exists twice: - as member of net_device (legacy) - as member of netdev_queue In order to get rid of the legacy case, add a helper for the dev->trans_update (this patch), then convert spots that do dev->trans_start = jiffies to use this helper (next patch). This would then allow us to change the helper so that it updates the trans_stamp of netdev queue 0 instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c0ef079c |
|
02-May-2016 |
Florian Westphal <fw@strlen.de> |
netdevice: shrink size of struct netdev_queue - trans_timeout is incremented when tx queue timed out (tx watchdog). - tx_maxrate is set via sysfs Moving tx_maxrate to read-mostly part shrinks the struct by 64 bytes. While at it, also move trans_timeout (it is out-of-place in the 'write-mostly' part). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f4b05d27 |
|
28-Apr-2016 |
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> |
net: constify is_skb_forwardable's arguments is_skb_forwardable is not supposed to change anything so constify its arguments Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7b748340 |
|
25-Apr-2016 |
Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> |
net: fix net_gso_ok for new GSO types. Fix casting in net_gso_ok. Otherwise the shift on gso_type << NETIF_F_GSO_SHIFT may hit the 32th bit and make it look like a INT_MIN, which is then promoted from signed to uint64 which is 0xffffffff80000000, resulting in wrong behavior when it is and'ed with the feature itself, as in: This test app: #include <stdio.h> #include <stdint.h> int main(int argc, char **argv) { uint64_t feature1; uint64_t feature2; int gso_type = 1 << 15; feature1 = gso_type << 16; feature2 = (uint64_t)gso_type << 16; printf("%lx %lx\n", feature1, feature2); return 0; } Gives: ffffffff80000000 80000000 So that this: return (features & feature) == feature; Actually works on more bits than expected and invalid ones. Fix is to promote it earlier. Issue noted while rebasing SCTP GSO patch but posting separetely as someone else may experience this meanwhile. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
501e7ef5 |
|
26-Apr-2016 |
Eric Dumazet <edumazet@google.com> |
net-rfs: fix false sharing accessing sd->input_queue_head sd->input_queue_head is incremented for each processed packet in process_backlog(), and read from other cpus performing Out Of Order avoidance in get_rps_cpu() Moving this field in a separate cache line keeps it mostly hot for the cpu in process_backlog(), as other cpus will only read it. In a stress test, process_backlog() was consuming 6.80 % of cpu cycles, and the patch reduced the cost to 0.65 % Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f0cdf76c |
|
24-Apr-2016 |
Florian Westphal <fw@strlen.de> |
net: remove NETDEV_TX_LOCKED support No more users in the tree, remove NETDEV_TX_LOCKED support. Adds another hole in softnet_stats struct, but better than keeping the unused collision counter around. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
681e683f |
|
18-Apr-2016 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
geneve: break dependency with netdev drivers Equivalent to "vxlan: break dependency with netdev drivers", don't autoload geneve module in case the driver is loaded. Instead make the coupling weaker by using netdevice notifiers as proxy. Cc: Jesse Gross <jesse@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b7aade15 |
|
18-Apr-2016 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
vxlan: break dependency with netdev drivers Currently all drivers depend and autoload the vxlan module because how vxlan_get_rx_port is linked into them. Remove this dependency: By using a new event type in the netdevice notifier call chain we proxy the request from the drivers to flush and resetup the vxlan ports not directly via function call but by the already existing netdevice notifier call chain. I added a separate new event type, NETDEV_OFFLOAD_PUSH_VXLAN, to do so. We don't need to save those ids, as the event type field is an unsigned long and using specialized event types for this purpose seemed to be a more elegant way. This also comes in beneficial if in future we want to add offloading knobs for vxlan. Cc: Jesse Gross <jesse@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
802ab55a |
|
10-Apr-2016 |
Alexander Duyck <aduyck@mirantis.com> |
GSO: Support partial segmentation offload This patch adds support for something I am referring to as GSO partial. The basic idea is that we can support a broader range of devices for segmentation if we use fixed outer headers and have the hardware only really deal with segmenting the inner header. The idea behind the naming is due to the fact that everything before csum_start will be fixed headers, and everything after will be the region that is handled by hardware. With the current implementation it allows us to add support for the following GSO types with an inner TSO_MANGLEID or TSO6 offload: NETIF_F_GSO_GRE NETIF_F_GSO_GRE_CSUM NETIF_F_GSO_IPIP NETIF_F_GSO_SIT NETIF_F_UDP_TUNNEL NETIF_F_UDP_TUNNEL_CSUM In the case of hardware that already supports tunneling we may be able to extend this further to support TSO_TCPV4 without TSO_MANGLEID if the hardware can support updating inner IPv4 headers. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1530545e |
|
10-Apr-2016 |
Alexander Duyck <aduyck@mirantis.com> |
GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values This patch does two things. First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As a result we should now be able to aggregate flows that were converted from IPv6 to IPv4. In addition this allows us more flexibility for future implementations of segmentation as we may be able to use a fixed IP ID when segmenting the flow. The second thing this does is that it places limitations on the outer IPv4 ID header in the case of tunneled frames. Specifically it forces the IP ID to be incrementing by 1 unless the DF bit is set in the outer IPv4 header. This way we can avoid creating overlapping series of IP IDs that could possibly be fragmented if the frame goes through GRO and is then resegmented via GSO. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cbc53e08 |
|
10-Apr-2016 |
Alexander Duyck <aduyck@mirantis.com> |
GSO: Add GSO type for fixed IPv4 ID This patch adds support for TSO using IPv4 headers with a fixed IP ID field. This is meant to allow us to do a lossless GRO in the case of TCP flows that use a fixed IP ID such as those that convert IPv6 header to IPv4 headers. In addition I am adding a feature that for now I am referring to TSO with IP ID mangling. Basically when this flag is enabled the device has the option to either output the flow with incrementing IP IDs or with a fixed IP ID regardless of what the original IP ID ordering was. This is useful in cases where the DF bit is set and we do not care if the original IP ID value is maintained. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
743b03a8 |
|
09-Apr-2016 |
Eric Dumazet <edumazet@google.com> |
net: remove netdevice gso_min_segs After introduction of ndo_features_check(), we believe that very specific checks for rare features should not be done in core networking stack. No driver uses gso_min_segs yet, so we revert this feature and save few instructions per tx packet in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f9a7cbbf |
|
08-Apr-2016 |
Denys Vlasenko <dvlasenk@redhat.com> |
net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put Sometimes gcc mysteriously doesn't inline very small functions we expect to be inlined. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122 Arguably, gcc should do better, but gcc people aren't willing to invest time into it, asking to use __always_inline instead. With this .config: http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os, the following functions get deinlined many times. netif_tx_stop_queue: 207 copies, 590 calls: 55 push %rbp 48 89 e5 mov %rsp,%rbp f0 80 8f e0 01 00 00 01 lock orb $0x1,0x1e0(%rdi) 5d pop %rbp c3 retq netif_tx_start_queue: 47 copies, 111 calls 55 push %rbp 48 89 e5 mov %rsp,%rbp f0 80 a7 e0 01 00 00 fe lock andb $0xfe,0x1e0(%rdi) 5d pop %rbp c3 retq sock_hold: 39 copies, 124 calls 55 push %rbp 48 89 e5 mov %rsp,%rbp f0 ff 87 80 00 00 00 lock incl 0x80(%rdi) 5d pop %rbp c3 retq __sock_put: 6 copies, 13 calls 55 push %rbp 48 89 e5 mov %rsp,%rbp f0 ff 8f 80 00 00 00 lock decl 0x80(%rdi) 5d pop %rbp c3 retq This patch fixes this via s/inline/__always_inline/. Code size decrease after the patch is ~2.5k: text data bss dec hex filename 56719876 56364551 36196352 149280779 8e5d80b vmlinux_before 56717440 56364551 36196352 149278343 8e5ce87 vmlinux Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: David S. Miller <davem@davemloft.net> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a0ca153f |
|
05-Apr-2016 |
Alexander Duyck <aduyck@mirantis.com> |
GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
46aa2f30 |
|
05-Apr-2016 |
Tom Herbert <tom@herbertland.com> |
udp: Remove udp_offloads Now that the UDP encapsulation GRO functions have been moved to the UDP socket we not longer need the udp_offload insfrastructure so removing it. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5e82b4b2 |
|
23-Mar-2016 |
Bjorn Helgaas <bhelgaas@google.com> |
net: Fix typos and whitespace. Fix typos. Capitalize CPU, NAPI, RCU consistently. Align structure indentation. No functional change intended; only comment and whitespace changes. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
62d885fe |
|
21-Mar-2016 |
Benjamin Poirier <bpoirier@suse.com> |
net: Add missing kernel-doc for netdev ptype lists .//include/linux/netdevice.h:1826: warning: No description found for parameter 'ptype_all' .//include/linux/netdevice.h:1826: warning: No description found for parameter 'ptype_specific' Introduced by commit 7866a621043f ("dev: add per net_device packet type chains") Cc: Salam Noureddine <noureddine@arista.com> Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6d0e24cd |
|
21-Mar-2016 |
Luis de Bethencourt <luisbg@osg.samsung.com> |
net: add missing descriptions in net_device_priv_flags The flags IFF_XMIT_DST_RELEASE_PERM, IFF_IPVLAN_MASTER and IFF_IPVLAN_SLAVE are missing descriptions for the Documentation. Adding them. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Suggested-by: Benjamin Poirier <benjamin.poirier@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cc8e27cc |
|
11-Mar-2016 |
Eli Cohen <eli@mellanox.com> |
net/core: Add support for configuring VF GUIDs Add two new NLAs to support configuration of Infiniband node or port GUIDs. New applications can choose to use this interface to configure GUIDs with iproute2 with commands such as: ip link set dev ib0 vf 0 node_guid 00:02:c9:03:00:21:6e:70 ip link set dev ib0 vf 0 port_guid 00:02:c9:03:00:21:6e:78 A new ndo, ndo_sef_vf_guid is introduced to notify the net device of the request to change the GUID. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
fac8e0f5 |
|
19-Mar-2016 |
Jesse Gross <jesse@kernel.org> |
tunnels: Don't apply GRO to multiple layers of encapsulation. When drivers express support for TSO of encapsulated packets, they only mean that they can do it for one layer of encapsulation. Supporting additional levels would mean updating, at a minimum, more IP length fields and they are unaware of this. No encapsulation device expresses support for handling offloaded encapsulated packets, so we won't generate these types of frames in the transmit path. However, GRO doesn't have a check for multiple levels of encapsulation and will attempt to build them. UDP tunnel GRO actually does prevent this situation but it only handles multiple UDP tunnels stacked on top of each other. This generalizes that solution to prevent any kind of tunnel stacking that would cause problems. Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack") Signed-off-by: Jesse Gross <jesse@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3c175784 |
|
11-Mar-2016 |
Sabrina Dubroca <sd@queasysnail.net> |
net: add MACsec netdevice priv_flags and helper Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b33f488 |
|
07-Mar-2016 |
Amir Vadai <amir@vadai.me> |
net/flower: Introduce hardware offload support This patch is based on a patch made by John Fastabend. It adds support for offloading cls_flower. when NETIF_F_HW_TC is on: flags = 0 => Rule will be processed twice - by hardware, and if still relevant, by software. flags = SKIP_HW => Rull will be processed by software only If hardware fail/not capabale to apply the rule, operation will NOT fail. Filter will be processed by SW only. Acked-by: Jiri Pirko <jiri@mellanox.com> Suggested-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2793a23a |
|
09-Mar-2016 |
Willem de Bruijn <willemb@google.com> |
net: validate variable length ll headers Netdevice parameter hard_header_len is variously interpreted both as an upper and lower bound on link layer header length. The field is used as upper bound when reserving room at allocation, as lower bound when validating user input in PF_PACKET. Clarify the definition to be maximum header length. For validation of untrusted headers, add an optional validate member to header_ops. Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance for deliberate testing of corrupt input. In this case, pad trailing bytes, as some device drivers expect completely initialized headers. See also http://comments.gmane.org/gmane.linux.network/401064 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
871b642a |
|
26-Feb-2016 |
Paolo Abeni <pabeni@redhat.com> |
netdev: introduce ndo_set_rx_headroom This method allows the controlling device (i.e. the bridge) to specify additional headroom to be allocated for skb head on frame reception. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cfdd28be |
|
17-Feb-2016 |
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> |
net: make netdev_for_each_lower_dev safe for device removal When I used netdev_for_each_lower_dev in commit bad531623253 ("vrf: remove slave queue and private slave struct") I thought that it acts like netdev_for_each_lower_private and can be used to remove the current device from the list while walking, but unfortunately it acts more like netdev_for_each_lower_private_rcu and doesn't allow it. The difference is where the "iter" points to, right now it points to the current element and that makes it impossible to remove it. Change the logic to be similar to netdev_for_each_lower_private and make it point to the "next" element so we can safely delete the current one. VRF is the only such user right now, there's no change for the read-only users. Here's what can happen now: [98423.249858] general protection fault: 0000 [#1] SMP [98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button 9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy virtio_ring virtio scsi_mod [last unloaded: bridge] [98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G O 4.5.0-rc2+ #81 [98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 [98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti: ffff88003428c000 [98423.256123] RIP: 0010:[<ffffffff81514f3e>] [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30 [98423.256534] RSP: 0018:ffff88003428f940 EFLAGS: 00010207 [98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX: 0000000000000000 [98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI: ffff880054ff90c0 [98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09: 0000000000000000 [98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88003428f9e0 [98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15: 0000000000000001 [98423.258055] FS: 00007f3d76881700(0000) GS:ffff88005d000000(0000) knlGS:0000000000000000 [98423.258418] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4: 00000000000406e0 [98423.258902] Stack: [98423.259075] ffff88003428f960 ffffffffa0442636 0002000100000004 ffff880054ff9000 [98423.259647] ffff88003428f9b0 ffffffff81518205 ffff880054ff9000 ffff88003428f978 [98423.260208] ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0 ffff880035b35f00 [98423.260739] Call Trace: [98423.260920] [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf] [98423.261156] [<ffffffff81518205>] rollback_registered_many+0x205/0x390 [98423.261401] [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70 [98423.261641] [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50 [98423.271557] [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0 [98423.271800] [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90 [98423.272049] [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200 [98423.272279] [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10 [98423.272513] [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40 [98423.272755] [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40 [98423.272983] [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0 [98423.273209] [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40 [98423.273476] [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0 [98423.273710] [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610 [98423.273947] [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70 [98423.274175] [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0 [98423.274416] [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140 [98423.274658] [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210 [98423.274894] [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210 [98423.275130] [<ffffffff81269611>] ? __fget_light+0x91/0xb0 [98423.275365] [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80 [98423.275595] [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20 [98423.275827] [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a [98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48 89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 [98423.279639] RIP [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30 [98423.279920] RSP <ffff88003428f940> CC: David Ahern <dsa@cumulusnetworks.com> CC: David S. Miller <davem@davemloft.net> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Vlad Yasevich <vyasevic@redhat.com> Fixes: bad531623253 ("vrf: remove slave queue and private slave struct") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a1b7c5fd |
|
16-Feb-2016 |
John Fastabend <john.fastabend@gmail.com> |
net: sched: add cls_u32 offload hooks for netdevs This patch allows netdev drivers to consume cls_u32 offloads via the ndo_setup_tc ndo op. This works aligns with how network drivers have been doing qdisc offloads for mqprio. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
16e5cc64 |
|
16-Feb-2016 |
John Fastabend <john.fastabend@gmail.com> |
net: rework setup_tc ndo op to consume general tc operand This patch updates setup_tc so we can pass additional parameters into the ndo op in a generic way. To do this we provide structured union and type flag. This lets each classifier and qdisc provide its own set of attributes without having to add new ndo ops or grow the signature of the callback. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e4c6734e |
|
16-Feb-2016 |
John Fastabend <john.fastabend@gmail.com> |
net: rework ndo tc op to consume additional qdisc handle parameter The ndo_setup_tc() op was added to support drivers offloading tx qdiscs however only support for mqprio was ever added. So we only ever added support for passing the number of traffic classes to the driver. This patch generalizes the ndo_setup_tc op so that a handle can be provided to indicate if the offload is for ingress or egress or potentially even child qdiscs. CC: Murali Karicheri <m-karicheri2@ti.com> CC: Shradha Shah <sshah@solarflare.com> CC: Or Gerlitz <ogerlitz@mellanox.com> CC: Ariel Elior <ariel.elior@qlogic.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Bruce Allan <bruce.w.allan@intel.com> CC: Jesse Brandeburg <jesse.brandeburg@intel.com> CC: Don Skidmore <donald.c.skidmore@intel.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4ab4286 |
|
08-Feb-2016 |
Keller, Jacob E <jacob.e.keller@intel.com> |
ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH} Ethernet drivers implementing both {GS}RXFH and {GS}CHANNELS ethtool ops incorrectly allow SCHANNELS when it would conflict with the settings from SRXFH. This occurs because it is not possible for drivers to understand whether their Rx flow indirection table has been configured or is in the default state. In addition, drivers currently behave in various ways when increasing the number of Rx channels. Some drivers will always destroy the Rx flow indirection table when this occurs, whether it has been set by the user or not. Other drivers will attempt to preserve the table even if the user has never modified it from the default driver settings. Neither of these situation is desirable because it leads to unexpected behavior or loss of user configuration. The correct behavior is to simply return -EINVAL when SCHANNELS would conflict with the current Rx flow table settings. However, it should only do so if the current settings were modified by the user. If we required that the new settings never conflict with the current (default) Rx flow settings, we would force users to first reduce their Rx flow settings and then reduce the number of Rx channels. This patch proposes a solution implemented in net/core/ethtool.c which ensures that all drivers behave correctly. It checks whether the RXFH table has been configured to non-default settings, and stores this information in a private netdev flag. When the number of channels is requested to change, it first ensures that the current Rx flow table is not going to assign flows to now disabled channels. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ba905f5e |
|
01-Feb-2016 |
Kim Jones <kim-marie.jones@intel.com> |
ethtool: Declare netdev_rss_key as __read_mostly. netdev_rss_key is written to once and thereafter is read by drivers when they are initialising. The fact that it is mostly read and not written to makes it a candidate for a __read_mostly declaration. Signed-off-by: Kim Jones <kim-marie.jones@intel.com> Signed-off-by: Alan Carey <alan.carey@intel.com> Acked-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e7333d3 |
|
01-Feb-2016 |
Jarod Wilson <jarod@redhat.com> |
net: add rx_nohandler stat counter This adds an rx_nohandler stat counter, along with a sysfs statistics node, and copies the counter out via netlink as well. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@mellanox.com> CC: Daniel Borkmann <daniel@iogearbox.net> CC: Tom Herbert <tom@herbertland.com> CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <gospo@cumulusnetworks.com> CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
facc432f |
|
22-Jan-2016 |
Arnd Bergmann <arnd@arndb.de> |
net: simplify napi_synchronize() to avoid warnings The napi_synchronize() function is defined twice: The definition for SMP builds waits for other CPUs to be done, while the uniprocessor variant just contains a barrier and ignores its argument. In the mvneta driver, this leads to a warning about an unused variable when we lookup the NAPI struct of another CPU and then don't use it: ethernet/marvell/mvneta.c: In function 'mvneta_percpu_notifier': ethernet/marvell/mvneta.c:2910:30: error: unused variable 'other_port' [-Werror=unused-variable] There are no other CPUs on a UP build, so that code never runs, but gcc does not know this. The nicest solution seems to be to turn the napi_synchronize() helper into an inline function for the UP case as well, as that leads gcc to not complain about the argument being unused. Once we do that, we can also combine the two cases into a single function definition and use if(IS_ENABLED()) rather than #ifdef to make it look a bit nicer. The warning first came up in linux-4.4, but I failed to catch it earlier. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: f86428854480 ("net: mvneta: Statically assign queues to CPUs") Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b6a0e72a |
|
11-Jan-2016 |
Tom Herbert <tom@herbertland.com> |
net: Fix typo in netdev_intersect_features Obviously need to 'or in NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM. Fixes: c8cd0989bd151f ("net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM") Reported-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1f211a1b |
|
07-Jan-2016 |
Daniel Borkmann <daniel@iogearbox.net> |
net, sched: add clsact qdisc This work adds a generalization of the ingress qdisc as a qdisc holding only classifiers. The clsact qdisc works on ingress, but also on egress. In both cases, it's execution happens without taking the qdisc lock, and the main difference for the egress part compared to prior version of [1] is that this can be applied with _any_ underlying real egress qdisc (also classless ones). Besides solving the use-case of [1], that is, allowing for more programmability on assigning skb->priority for the mqprio case that is supported by most popular 10G+ NICs, it also opens up a lot more flexibility for other tc applications. The main work on classification can already be done at clsact egress time if the use-case allows and state stored for later retrieval f.e. again in skb->priority with major/minors (which is checked by most classful qdiscs before consulting tc_classify()) and/or in other skb fields like skb->tc_index for some light-weight post-processing to get to the eventual classid in case of a classful qdisc. Another use case is that the clsact egress part allows to have a central egress counterpart to the ingress classifiers, so that classifiers can easily share state (e.g. in cls_bpf via eBPF maps) for ingress and egress. Currently, default setups like mq + pfifo_fast would require for this to use, for example, prio qdisc instead (to get a tc_classify() run) and to duplicate the egress classifier for each queue. With clsact, it allows for leaving the setup as is, it can additionally assign skb->priority to put the skb in one of pfifo_fast's bands and it can share state with maps. Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid) w/o the need to perform a skb_dst_force() to hold on to it any longer. In lwt case, we can also use this facility to setup dst metadata via cls_bpf (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for that (case of IFF_NO_QUEUE devices, for example). The realization can be done without any changes to the scheduler core framework. All it takes is that we have two a-priori defined minors/child classes, where we can mux between ingress and egress classifier list (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to dev->_tx to avoid extra cacheline miss for moderate loads). The egress part is a bit similar modelled to handle_ing() and patched to a noop in case the functionality is not used. Both handlers are now called sch_handle_ingress() and sch_handle_egress(), code sharing among the two doesn't seem practical as there are various minor differences in both paths, so that making them conditional in a single handler would rather slow things down. Full compatibility to ingress qdisc is provided as well. Since both piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist per netdevice, and thus ingress qdisc specific behaviour can be retained for user space. This means, either a user does 'tc qdisc add dev foo ingress' and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact' alternative, where both, ingress and egress classifier can be configured as in the below example. ingress qdisc supports attaching classifier to any minor number whereas clsact has two fixed minors for muxing between the lists, therefore to not break user space setups, they are better done as two separate qdiscs. I decided to extend the sch_ingress module with clsact functionality so that commonly used code can be reused, the module is being aliased with sch_clsact so that it can be auto-loaded properly. Alternative would have been to add a flag when initializing ingress to alter its behaviour plus aliasing to a different name (as it's more than just ingress). However, the first would end up, based on the flag, choosing the new/old behaviour by calling different function implementations to handle each anyway, the latter would require to register ingress qdisc once again under different alias. So, this really begs to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops by its own that share callbacks used by both. Example, adding qdisc: # tc qdisc add dev foo clsact # tc qdisc show dev foo qdisc mq 0: root qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc clsact ffff: parent ffff:fff1 Adding filters (deleting, etc works analogous by specifying ingress/egress): # tc filter add dev foo ingress bpf da obj bar.o sec ingress # tc filter add dev foo egress bpf da obj bar.o sec egress # tc filter show dev foo ingress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action # tc filter show dev foo egress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will show an empty list for clsact. Either using the parent names (ingress/egress) or specifying the full major/minor will then show the related filter lists. Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend. [1] http://patchwork.ozlabs.org/patch/512949/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cdba756f |
|
06-Jan-2016 |
Eric Dumazet <edumazet@google.com> |
net: move ndo_features_check() close to ndo_start_xmit() TX fast path uses ndo_start_xmit(), ndo_features_check() and ndo_select_queue(). Move ndo_features_check() close to ndo_start_xmit() to increase data locality. All "struct net_device_ops" should now be using C99 initializers. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a8170d2b |
|
14-Dec-2015 |
Singhai, Anjali <anjali.singhai@intel.com> |
geneve: Add geneve udp port offload for ethernet devices Add ndo_ops to add/del UDP ports to a device that supports geneve offload. v2: Comment fix. Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6ae23ad3 |
|
14-Dec-2015 |
Tom Herbert <tom@herbertland.com> |
net: Add driver helper functions to determine checksum offloadability Add skb_csum_offload_chk driver helper function to determine if a device with limited checksum offload capabilities is able to offload the checksum for a given packet. This patch includes: - The skb_csum_offload_chk function. Returns true if checksum is offloadable, else false. Optionally, in the case that the checksum is not offloable, the function can call skb_checksum_help to resolve the checksum. skb_csum_offload_chk also returns whether the checksum refers to an encapsulated checksum. - Definition of skb_csum_offl_spec structure that caller uses to indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only, whether encapsulated checksums can be offloaded, whether checksum with IPv6 extension headers can be offloaded). - Ancilary functions called skb_csum_offload_chk_help, skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c8cd0989 |
|
14-Dec-2015 |
Tom Herbert <tom@herbertland.com> |
net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM These netif flags are unnecessary convolutions. It is more straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM directly. This patch also: - Cleans up can_checksum_protocol - Simplifies netdev_intersect_features Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a188222b |
|
14-Dec-2015 |
Tom Herbert <tom@herbertland.com> |
net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the set of features for offloading all checksums. This is a mask of the checksum offload related features bits. It is incorrect to set both NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for features of a device. This patch: - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where NETIF_F_ALL_CSUM is being used as a mask). - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
326fcfa5 |
|
05-Dec-2015 |
Felix Fietkau <nbd@openwrt.org> |
net: remove unnecessary semicolon in netdev_alloc_pcpu_stats() This semicolon causes a build error if the function call is wrapped in parentheses. Fixes: aabc92bbe3cf ("net: add __netdev_alloc_pcpu_stats() to indicate gfp flags") Reported-by: Imre Kaloz <kaloz@openwrt.org> Signed-off-by: Felix Fietkau <nbd@openwrt.org> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b618aaa9 |
|
04-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: constify netif_is_* helpers net_device param As suggested by Eric, these helpers should have const dev param. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fb1b2e3c |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce lower state changed info structure for LAG lowers This is shared info structure for bonding and team. Serves to pass down info about link state and port activity to notification listeners. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
04d48266 |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce change lower state notifier When lower device like bonding slave, team/bridge port, etc changes its state, it is useful for others to notice this change. Currently this is implemented specificly for bonding as NETDEV_BONDING_INFO notifier. This patch aims to replace this specific usage and make this more generic to be used for all upper-lower devices. Introduce NETDEV_CHANGELOWERSTATE netdev notifier type and netdev_lower_state_changed() helper. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
764f5e54 |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add info struct for LAG changeupper This struct will be shared by bonding and team to pass internal information to notifier listeners. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29bf24af |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add possibility to pass information about upper device via notifier Sometimes the drivers and other code would find it handy to know some internal information about upper device being changed. So allow upper-code to pass information down to notifier listeners during linking. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6dffb044 |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: propagate upper priv via netdev_master_upper_dev_link Eliminate netdev_master_upper_dev_link_private and pass priv directly as a parameter of netdev_master_upper_dev_link. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e0ba1414 |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add netif_is_lag_port helper Some code does not mind if a device is bond slave or team port and treats them the same, as generic LAG ports. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7be61833 |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add netif_is_lag_master helper Some code does not mind if the master is bond or team and treats them the same, as generic LAG. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f7f019ee |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add netif_is_team_port helper Similar to other helpers, caller can use this to find out if device is team port. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c981e421 |
|
02-Dec-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add netif_is_team_master helper Similar to other helpers, caller can use this to find out if device is team master. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c0eb4540 |
|
01-Dec-2015 |
KY Srinivasan <kys@microsoft.com> |
hv_netvsc: Don't ask for additional head room in the skb The rndis header is 116 bytes big and can be placed in the default head room that will be available in the skb. Since the netvsc packet is less than 48 bytes, we can use the skb control buffer for the netvsc packet. With these changes we don't need to ask for additional head room. Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
880621c2 |
|
22-Nov-2015 |
Martin Blumenstingl <martin.blumenstingl@googlemail.com> |
packet: Allow packets with only a header (but no payload) Commit 9c7077622dd91 ("packet: make packet_snd fail on len smaller than l2 header") added validation for the packet size in packet_snd. This change enforces that every packet needs a header (with at least hard_header_len bytes) plus a payload with at least one byte. Before this change the payload was optional. This fixes PPPoE connections which do not have a "Service" or "Host-Uniq" configured (which is violating the spec, but is still widely used in real-world setups). Those are currently failing with the following message: "pppd: packet size is too short (24 <= 24)" Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93d05d4a |
|
18-Nov-2015 |
Eric Dumazet <edumazet@google.com> |
net: provide generic busy polling to all NAPI drivers NAPI drivers no longer need to observe a particular protocol to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y) napi_hash_add() and napi_hash_del() are automatically called from core networking stack, respectively from netif_napi_add() and netif_napi_del() This patch depends on free_netdev() and netif_napi_del() being called from process context, which seems to be the norm. Drivers might still prefer to call napi_hash_del() on their own, since they might combine all the rcu grace periods into a single one, knowing their NAPI structures lifetime, while core networking stack has no idea of a possible combining. Once this patch proves to not bring serious regressions, we will cleanup drivers to either remove napi_hash_del() or provide appropriate rcu grace periods combining. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34cbe27e |
|
18-Nov-2015 |
Eric Dumazet <edumazet@google.com> |
net: napi_hash_del() returns a boolean status napi_hash_del() will soon be used from both drivers (if they want) or core networking stack. Callers are responsibles to ensure an RCU grace period is respected before freeing napi structure : napi_hash_del() can signal if this RCU grace period is needed or not. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d64b5e85 |
|
18-Nov-2015 |
Eric Dumazet <edumazet@google.com> |
net: add netif_tx_napi_add() netif_tx_napi_add() is a variant of netif_napi_add() It should be used by drivers that use a napi structure to exclusively poll TX. We do not want to add this kind of napi in napi_hash[] in following patches, adding generic busy polling to all NAPI drivers. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02d62e86 |
|
18-Nov-2015 |
Eric Dumazet <edumazet@google.com> |
net: un-inline sk_busy_loop() There is really little gain from inlining this big function. We'll soon make it even bigger in following patches. This means we no longer need to export napi_by_id() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
28f9ee22 |
|
16-Nov-2015 |
Vlad Yasevich <vyasevich@gmail.com> |
vlan: Do not put vlan headers back on bridge and macvlan ports When a vlan is configured with REORDER_HEADER set to 0, the vlan header is put back into the packet and makes it appear that the vlan header is still there even after it's been processed. This posses a problem for bridge and macvlan ports. The packets passed to those device may be forwarded and at the time of the forward, vlan headers end up being unexpectedly present. With the patch, we make sure that we do not put the vlan header back (when REORDER_HEADER is 0) if a bridge or macvlan has been configured on top of the vlan device. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
aabc92bb |
|
10-Nov-2015 |
Pablo Neira Ayuso <pablo@netfilter.org> |
net: add __netdev_alloc_pcpu_stats() to indicate gfp flags nf_tables may create percpu counters from the packet path through its dynamic set instantiation infrastructure, so we need a way to allocate this through GFP_ATOMIC. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: David S. Miller <davem@davemloft.net>
|
#
8f25348b |
|
04-Nov-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add forgotten IFF_L3MDEV_SLAVE define Fixes: fee6d4c77 ("net: Add netif_is_l3_slave") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c59f419b |
|
25-Oct-2015 |
Helge Deller <deller@gmx.de> |
net/xps: Fix calculation of initial number of xps queues The existing code breaks on architectures where the L1 cache size (L1_CACHE_BYTES) is smaller or equal the size of struct xps_map. The new code ensures that we get at minimum one initial xps queue, or even more as long as it fits into the next multiple of L1_CACHE_SIZE. Signed-off-by: Helge Deller <deller@gmx.de> Acked-by: Alexander Duyck <aduyck@mirantis.com>
|
#
dd461d6a |
|
28-Aug-2015 |
Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> |
if_link: Add control trust VF Add netlink directives and ndo entry to trust VF user. This controls the special permission of VF user. The administrator will dedicatedly trust VF user to use some features which impacts security and/or performance. The administrator never turn it on unless VF user is fully trusted. CC: Sy Jong Choi <sy.jong.choi@intel.com> Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Acked-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
fc4099f1 |
|
22-Oct-2015 |
Pravin B Shelar <pshelar@nicira.com> |
openvswitch: Fix egress tunnel info. While transitioning to netdev based vport we broke OVS feature which allows user to retrieve tunnel packet egress information for lwtunnel devices. Following patch fixes it by introducing ndo operation to get the tunnel egress info. Same ndo operation can be used for lwtunnel devices and compat ovs-tnl-vport devices. So after adding such device operation we can remove similar operation from ovs-vport. Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device"). Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
573c7ba0 |
|
16-Oct-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce pre-change upper device notifier This newly introduced netdevice notifier is called before actual change upper happens. That provides a possibility for notifier handlers to know upper change will happen and react to it, including possibility to forbid the change. That is valuable for drivers which can check if the upper device linkage is supported and forbid that in case it is not. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fee6d4c7 |
|
05-Oct-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Add netif_is_l3_slave IPv6 addrconf keys off of IFF_SLAVE so can not use it for L3 slave. Add a new private flag and add netif_is_l3_slave function for checking it. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9478d12d |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Move netif_index_is_l3_master to l3mdev.h Change CONFIG dependency to CONFIG_NET_L3_MASTER_DEV as well. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93a7e7e8 |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Remove the now unused vrf_ptr Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b69c6d0 |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Introduce L3 Master device abstraction L3 master devices allow users of the abstraction to influence FIB lookups for enslaved devices. Current API provides a means for the master device to return a specific FIB table for an enslaved device, to return an rtable/custom dst and influence the OIF used for fib lookups. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
007979ea |
|
29-Sep-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the netif_is_vrf and netif_index_is_vrf macros. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2d8bff126 |
|
23-Sep-2015 |
Neil Horman <nhorman@redhat.com> |
netpoll: Close race condition between poll_one_napi and napi_disable Drivers might call napi_disable while not holding the napi instance poll_lock. In those instances, its possible for a race condition to exist between poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such the following may happen: CPU0 CPU1 ndo_tx_timeout napi_poll_dev napi_disable poll_one_napi test_and_set_bit (ret 0) test_bit (ret 1) reset adapter napi_poll_routine If the adapter gets a tx timeout without a napi instance scheduled, its possible for the adapter to think it has exclusive access to the hardware (as the napi instance is now scheduled via the napi_disable call), while the netpoll code thinks there is simply work to do. The result is parallel hardware access leading to corrupt data structures in the driver, and a crash. Additionaly, there is another, more critical race between netpoll and napi_disable. The disabled napi state is actually identical to the scheduled state for a given napi instance. The implication being that, if a napi instance is disabled, a netconsole instance would see the napi state of the device as having been scheduled, and poll it, likely while the driver was dong something requiring exclusive access. In the case above, its fairly clear that not having the rings in a state ready to be polled will cause any number of crashes. The fix should be pretty easy. netpoll uses its own bit to indicate that that the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC). We can just gate disabling on that bit as well as the sched bit. That should prevent netpoll from conducting a napi poll if we convert its set bit to a test_and_set_bit operation to provide mutual exclusion Change notes: V2) Remove a trailing whtiespace Resubmit with proper subject prefix V3) Clean up spacing nits Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: "David S. Miller" <davem@davemloft.net> CC: jmaxwell@redhat.com Tested-by: jmaxwell@redhat.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c4b51f0 |
|
15-Sep-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
netfilter: Pass net into okfn This is immediately motivated by the bridge code that chains functions that call into netfilter. Without passing net into the okfns the bridge code would need to guess about the best expression for the network namespace to process packets in. As net is frequently one of the first things computed in continuation functions after netfilter has done it's job passing in the desired network namespace is in many cases a code simplification. To support this change the function dst_output_okfn is introduced to simplify passing dst_output as an okfn. For the moment dst_output_okfn just silently drops the struct net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
04eb4489 |
|
15-Sep-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
bridge: Add br_netif_receive_skb remove netif_receive_skb_sk netif_receive_skb_sk is only called once in the bridge code, replace it with a bridge specific function that calls netif_receive_skb. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2b4aa3ce |
|
15-Sep-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Remove dev_queue_xmit_sk A function with weird arguments that it will never use to accomdate a netfilter callback prototype is absolutely in the core of the networking stack. Frankly it does not make sense and it causes a lot of confusion as to why arguments that are never used are being passed to the function. As I am preparing to make a second change to arguments to the okfn even the names stops making sense. As I have removed the two callers of this function remove this confusion from the networking stack. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
816dd19b |
|
30-Jul-2015 |
Matan Barak <matanb@mellanox.com> |
net: Add info for NETDEV_CHANGEUPPER event Some consumers of NETDEV_CHANGEUPPER event would like to know which upper device was linked/unlinked and what operation was carried. Add information in the notifier info block for that purpose. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
0dc1549b |
|
27-Aug-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: kill long time unused bonding private flags We don't use them for years, just kill them now. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
35d4e172 |
|
27-Aug-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add netif_is_ovs_master helper with IFF_OPENVSWITCH private flag Add this helper so code can easily figure out if netdev is openswitch. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0894ae3f |
|
27-Aug-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: add netif_is_bridge_master helper Add this helper so code can easily figure out if netdev is a bridge. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0e4ead9d |
|
27-Aug-2015 |
Jiri Pirko <jiri@mellanox.com> |
net: introduce change upper device notifier change info Add info that is passed along with NETDEV_CHANGEUPPER event. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b7fe10e5 |
|
19-Aug-2015 |
Tom Herbert <tom@herbertland.com> |
gro: Fix remcsum offload to deal with frags in GRO The remote checksum offload GRO did not consider the case that frag0 might be in use. This patch fixes that by accessing headers using the skb_gro functions and not saving offsets relative to skb->head. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
808d28c4 |
|
16-Aug-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Fix docbook warning for IFF_VRF_MASTER enum kbuild test robot reported: tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master head: d52736e24fe2e927c26817256f8d1a3c8b5d51a0 commit: 4e3c89920cd3a6cfce22c6f537690747c26128dd [751/762] net: Introduce VRF related flags and helpers reproduce: make htmldocs >> Warning(include/linux/netdevice.h:1293): Enum value 'IFF_VRF_MASTER' not described in enum 'netdev_priv_flags' Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f52bdcf |
|
16-Aug-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Updates to netif_index_is_vrf As Eric noted netif_index_is_vrf is not called with rcu_read_lock held, so wrap the dev_get_by_index_rcu in rcu_read_lock and unlock. If VRF is not enabled or oif is 0 skip the device lookup. In both cases index cannot be the VRF master. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fa8187c9 |
|
13-Aug-2015 |
Phil Sutter <phil@nwl.cc> |
net: declare new net_device priv_flag IFF_NO_QUEUE This private net_device flag can be set by drivers to inform that a device runs fine without a qdisc attached. This was formerly done by setting tx_queue_len to zero. Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4e3c8992 |
|
13-Aug-2015 |
David Ahern <dsa@cumulusnetworks.com> |
net: Introduce VRF related flags and helpers Add a VRF_MASTER flag for interfaces and helper functions for determining if a device is a VRF_MASTER. Add link attribute for passing VRF_TABLE id. Add vrf_ptr to netdevice. Add various macros for determining if a device is a VRF device, the index of the master VRF device and table associated with VRF device. Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d754f98b |
|
18-Jul-2015 |
Scott Feldman <sfeldma@gmail.com> |
net: add phys ID compare helper to test if two IDs are the same Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c4f691f |
|
18-Jul-2015 |
Scott Feldman <sfeldma@gmail.com> |
net: don't reforward packets already forwarded by offload device Just before queuing skb for xmit on port, check if skb has been marked by switchdev port driver as already fordwarded by device. If so, drop skb. A non-zero skb->offload_fwd_mark field is set by the switchdev port driver/device on ingress to indicate the skb has already been forwarded by the device to egress ports with matching dev->skb_mark. The switchdev port driver would assign a non-zero dev->offload_skb_mark for each device port netdev during registration, for example. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d746d707 |
|
14-Jul-2015 |
Anuradha Karuppiah <anuradhak@cumulusnetworks.com> |
net core: Add protodown support. This patch introduces the proto_down flag that can be used by user space applications to notify switch drivers that errors have been detected on the device. The switch driver can react to protodown notification by doing a phys down on the associated switch port. Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b766cd8 |
|
15-Jun-2015 |
Eran Ben Elisha <eranbe@mellanox.com> |
net/core: Add reading VF statistics through the PF netdevice Add ndo_get_vf_stats where the PF retrieves and fills the VFs traffic statistics. We encode the VF stats in a nested manner to allow for future extensions. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bdef7de4 |
|
01-Jun-2015 |
David S. Miller <davem@davemloft.net> |
net: Add priority to packet_offload objects. When we scan a packet for GRO processing, we want to see the most common packet types in the front of the offload_base list. So add a priority field so we can handle this properly. IPv4/IPv6 get the highest priority with the implicit zero priority field. Next comes ethernet with a priority of 10, and then we have the MPLS types with a priority of 15. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e687ad60 |
|
13-May-2015 |
Pablo Neira <pablo@netfilter.org> |
netfilter: add netfilter ingress hook after handle_ing() under unique static key This patch adds the Netfilter ingress hook just after the existing tc ingress hook, that seems to be the consensus solution for this. Note that the Netfilter hook resides under the global static key that enables ingress filtering. Nonetheless, Netfilter still also has its own static key for minimal impact on the existing handle_ing(). * Without this patch: Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags) 16086246pps 7721Mb/sec (7721398080bps) errors: 100000000 42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch: Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags) 16090536pps 7723Mb/sec (7723457280bps) errors: 100000000 41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb * Without this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags) 10788648pps 5178Mb/sec (5178551040bps) errors: 100000000 40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.77% kpktgend_0 [cls_u32] [k] u32_classify 5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker 3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify 2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags) 10743194pps 5156Mb/sec (5156733120bps) errors: 100000000 42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.70% kpktgend_0 [cls_u32] [k] u32_classify 5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker 2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify 1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk Note that the results are very similar before and after. I can see gcc gets the code under the ingress static key out of the hot path. Then, on that cold branch, it generates the code to accomodate the netfilter ingress static key. My explanation for this is that this reduces the pressure on the instruction cache for non-users as the new code is out of the hot path, and it comes with minimal impact for tc ingress users. Using gcc version 4.8.4 on: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 [...] L1d cache: 16K L1i cache: 64K L2 cache: 2048K L3 cache: 8192K Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5605c762 |
|
12-May-2015 |
Jiri Pirko <jiri@resnulli.us> |
net: move __skb_tx_hash to dev.c __skb_tx_hash function has no relation to flow_dissect so just move it to dev.c Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
01d460dd |
|
12-May-2015 |
David Ahern <dsahern@gmail.com> |
net: Remove remaining remnants of pm_qos from netdevice.h Commit e2c6544829f removed pm_qos from struct net_device but left the comment and header file. Remove those. Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a2029240 |
|
11-May-2015 |
Denys Vlasenko <dvlasenk@redhat.com> |
net: deinline netif_tx_stop_all_queues(), remove WARN_ON in netif_tx_stop_queue() These functions compile to 60 bytes of machine code each. With this .config: http://busybox.net/~vda/kernel_config there are 617 calls of netif_tx_stop_queue() and 49 calls of netif_tx_stop_all_queues() in vmlinux. To fix this, remove WARN_ON in netif_tx_stop_queue() as suggested by davem, and deinline netif_tx_stop_all_queues(). Change in code size is about 20k: text data bss dec hex filename 82426986 22255416 20627456 125309858 77813a2 vmlinux.before 82406248 22255416 20627456 125289120 777c2a0 vmlinux gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue sometimes: $ nm --size-sort vmlinux | grep netif_tx_stop_queue | wc -l 190 ffffffff81b558a8 <netif_tx_stop_queue>: ffffffff81b558a8: 55 push %rbp ffffffff81b558a9: 48 89 e5 mov %rsp,%rbp ffffffff81b558ac: f0 80 8f e0 01 00 00 lock orb $0x1,0x1e0(%rdi) ffffffff81b558b3: 01 ffffffff81b558b4: 5d pop %rbp ffffffff81b558b5: c3 retq This needs additional fixing. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Alexei Starovoitov <alexei.starovoitov@gmail.com> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Joe Perches <joe@perches.com> CC: David S. Miller <davem@davemloft.net> CC: Jiri Pirko <jpirko@redhat.com> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9d47c0a2 |
|
10-May-2015 |
Jiri Pirko <jiri@resnulli.us> |
switchdev: s/swdev_/switchdev_/ Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4cda01e8 |
|
11-May-2015 |
Daniel Borkmann <daniel@iogearbox.net> |
net: sched: fix typo in net_device ifdef This should have been #ifdef not #if. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: d2788d34885d ("net: sched: further simplify handle_ing") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d2788d34 |
|
09-May-2015 |
Daniel Borkmann <daniel@iogearbox.net> |
net: sched: further simplify handle_ing Ingress qdisc has no other purpose than calling into tc_classify() that executes attached classifier(s) and action(s). It has a 1:1 relationship to dev->ingress_queue. After having commit 087c1a601ad7 ("net: sched: run ingress qdisc without locks") removed the central ingress lock, one major contention point is gone. The extra indirection layers however, are not necessary for calling into ingress qdisc. pktgen calling locally into netif_receive_skb() with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps. We can redirect the private classifier list to the netdev directly, without changing any classifier API bits (!) and execute on that from handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed, ingress qdisc doesn't have a queue and thus dev_deactivate_queue() is also not applicable, ingress_cl_list provides similar behaviour. In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc. One next possible step is the removal of the dev's ingress (dummy) netdev_queue, and to only have the list member in the netdevice itself. Note, the filter chain is RCU protected and individual filter elements are being kfree'd by sched subsystem after RCU grace period. RCU read lock is being held by __netif_receive_skb_core(). Joint work with Alexei Starovoitov. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
46c264da |
|
28-Apr-2015 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
bridge/nl: remove wrong use of NLM_F_MULTI NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact, it is sent only at the end of a dump. Libraries like libnl will wait forever for NLMSG_DONE. Fixes: e5a55a898720 ("net: create generic bridge ops") Fixes: 815cccbf10b2 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf") CC: John Fastabend <john.r.fastabend@intel.com> CC: Sathya Perla <sathya.perla@emulex.com> CC: Subbu Seetharaman <subbu.seetharaman@emulex.com> CC: Ajit Khaparde <ajit.khaparde@emulex.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: intel-wired-lan@lists.osuosl.org CC: Jiri Pirko <jiri@resnulli.us> CC: Scott Feldman <sfeldma@gmail.com> CC: Stephen Hemminger <stephen@networkplumber.org> CC: bridge@lists.linux-foundation.org Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ec65aafb |
|
22-Apr-2015 |
Johannes Berg <johannes.berg@intel.com> |
netdev_alloc_pcpu_stats: use less common iterator variable With the CPU iteration variable called 'i', it's relatively easy to have variable shadowing which sparse will warn about. Avoid that by renaming the variable to __cpu which is less likely to be used in the surrounding context. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
03c57747 |
|
22-Apr-2015 |
Robert Shearman <rshearma@brocade.com> |
mpls: Per-device MPLS state Add per-device MPLS state to supported interfaces. Use the presence of this state in mpls_route_add to determine that this is a supported interface. Use the presence of mpls_dev to drop packets that arrived on an unsupported interface - previously they were allowed through. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Robert Shearman <rshearma@brocade.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8b86a61d |
|
17-Apr-2015 |
Johannes Berg <johannes.berg@intel.com> |
net: remove unused 'dev' argument from netif_needs_gso() In commit 04ffcb255f22 ("net: Add ndo_gso_check") Tom originally added the 'dev' argument to be able to call ndo_gso_check(). Then later, when generalizing this in commit 5f35227ea34b ("net: Generalize ndo_gso_check to ndo_features_check") Jesse removed the call to ndo_gso_check() in netif_needs_gso() by calling the new ndo_features_check() in a different place. This made the 'dev' argument unused. Remove the unused argument and go back to the code as before. Cc: Tom Herbert <therbert@google.com> Cc: Jesse Gross <jesse@nicira.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
14ffbbb8 |
|
10-Apr-2015 |
Thomas Graf <tgraf@suug.ch> |
net_device: Reorder members to fill holes Some trivial reorders while preserving the RX/TX cache lines split to fill a couple of holes. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e2c65448 |
|
10-Apr-2015 |
Thomas Graf <tgraf@suug.ch> |
e1000e: Move pm_qos_req to e1000e adapter e1000e is the only driver requiring pm_qos_req, instead of causing every device to waste up to 240 bytes. Allocate it for the specific driver. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
01a3d796 |
|
30-Mar-2015 |
Vlad Zolotarov <vladz@cloudius-systems.com> |
if_link: Add an additional parameter to ifla_vf_info for RSS querying Add configuration setting for drivers to allow/block an RSS Redirection Table and a Hash Key querying for discrete VFs. On some devices VF share the mentioned above information with PF and querying it may adduce a theoretical security risk. We want to let a system administrator to decide if he/she wants to take this risk or not. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
388069d3 |
|
06-Apr-2015 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
netdevice.h: remove iflink description Also move 'group' description to match the order of the net_device structure. Fixes: 7a66bbc96ce9 ("net: remove iflink field from struct net_device") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7026b1dd |
|
05-Apr-2015 |
David Miller <davem@davemloft.net> |
netfilter: Pass socket pointer down through okfn(). On the output paths in particular, we have to sometimes deal with two socket contexts. First, and usually skb->sk, is the local socket that generated the frame. And second, is potentially the socket used to control a tunneling socket, such as one the encapsulates using UDP. We do not want to disassociate skb->sk when encapsulating in order to fix this, because that would break socket memory accounting. The most extreme case where this can cause huge problems is an AF_PACKET socket transmitting over a vxlan device. We hit code paths doing checks that assume they are dealing with an ipv4 socket, but are actually operating upon the AF_PACKET one. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f60e5990 |
|
01-Apr-2015 |
hannes@stressinduktion.org <hannes@stressinduktion.org> |
ipv6: protect skb->sk accesses from recursive dereference inside the stack We should not consult skb->sk for output decisions in xmit recursion levels > 0 in the stack. Otherwise local socket settings could influence the result of e.g. tunnel encapsulation process. ipv6 does not conform with this in three places: 1) ip6_fragment: we do consult ipv6_npinfo for frag_size 2) sk_mc_loop in ipv6 uses skb->sk and checks if we should loop the packet back to the local socket 3) ip6_skb_dst_mtu could query the settings from the user socket and force a wrong MTU Furthermore: In sk_mc_loop we could potentially land in WARN_ON(1) if we use a PF_PACKET socket ontop of an IPv6-backed vxlan device. Reuse xmit_recursion as we are currently only interested in protecting tunnel devices. Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e79d8429 |
|
03-Apr-2015 |
Rusty Russell <rusty@rustcorp.com.au> |
netdevice: document NETDEV_TX_BUSY deprecation. This paraphrases DaveM (and steals some of his words) explaining why a device shouldn't return NETDEV_TX_BUSY, even though it looks so inviting to driver authors. See http://www.spinics.net/lists/netdev/msg322350.html Inspired-by: David Miller <davem@davemloft.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a66bbc9 |
|
02-Apr-2015 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
net: remove iflink field from struct net_device Now that all users of iflink have the ndo_get_iflink handler available, it's possible to remove this field. By default, dev_get_iflink() returns the ifindex of the interface. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a54acb3a |
|
02-Apr-2015 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
dev: introduce dev_get_iflink() The goal of this patch is to prepare the removal of the iflink field. It introduces a new ndo function, which will be implemented by virtual interfaces. There is no functional change into this patch. All readers of iflink field now call dev_get_iflink(). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e38f3025 |
|
26-Mar-2015 |
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> |
net: Introduce passthru_features_check As there are a number of (especially virtual) devices that don't need the multiple vlan check, introduce passthru_features_check() for convenience. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0117ec19 |
|
23-Mar-2015 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
net: remove never used forwarding_accel_ops pointer from net_device Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
99c4a26a |
|
18-Mar-2015 |
David S. Miller <davem@davemloft.net> |
net: Fix high overhead of vlan sub-device teardown. When a networking device is taken down that has a non-trivial number of VLAN devices configured under it, we eat a full synchronize_net() for every such VLAN device. This is because of the call chain: NETDEV_DOWN notifier --> vlan_device_event() --> dev_change_flags() --> __dev_change_flags() --> __dev_close() --> __dev_close_many() --> dev_deactivate_many() --> synchronize_net() This is kind of rediculous because we already have infrastructure for batching doing operation X to a list of net devices so that we only incur one sync. So make use of that by exporting dev_close_many() and adjusting it's interfaace so that the caller can fully manage the batch list. Use this in vlan_device_event() and all the overhead goes away. Reported-by: Salam Noureddine <noureddine@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
db24a904 |
|
17-Mar-2015 |
David Ahern <dsahern@gmail.com> |
net: add support for phys_port_name Similar to port id allow netdevices to specify port names and export the name via sysfs. Drivers can implement the netdevice operation to assist udev in having sane default names for the devices using the rule: $ cat /etc/udev/rules.d/80-net-setup-link.rules SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}!="", NAME="$attr{phys_port_name}" Use of phys_name versus phys_id was suggested-by Jiri Pirko. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
822b3b2e |
|
18-Mar-2015 |
John Fastabend <john.r.fastabend@intel.com> |
net: Add max rate tx queue attribute This adds a tx_maxrate attribute to the tx queue sysfs entry allowing for max-rate limiting. Along with DCB-ETS and BQL this provides another knob to tune queue performance. The limit units are Mbps. By default it is disabled. To disable the rate limitation after it has been set for a queue, it should be set to zero. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ad41faa8 |
|
17-Mar-2015 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
netdevice.h: fix ndo_bridge_* comments The argument 'flags' was missing in ndo_bridge_setlink(). ndo_bridge_dellink() was missing. Fixes: 407af3299ef1 ("bridge: Add netlink interface to configure vlans on bridge ports") Fixes: add511b38266 ("bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellink") CC: Vlad Yasevich <vyasevic@redhat.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
812a1c3f |
|
15-Mar-2015 |
Scott Feldman <sfeldma@gmail.com> |
netdev: remove ndo ops for switchdev Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4170604f |
|
15-Mar-2015 |
Scott Feldman <sfeldma@gmail.com> |
switchdev: add swdev ops As discussed at netconf, introduce swdev_ops as first step to move switchdev ops from ndo to swdev. This will keep switchdev from cluttering up ndo ops space. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c5c9fb5 |
|
11-Mar-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Introduce possible_net_t Having to say > #ifdef CONFIG_NET_NS > struct net *net; > #endif in structures is a little bit wordy and a little bit error prone. Instead it is possible to say: > typedef struct { > #ifdef CONFIG_NET_NS > struct net *net; > #endif > } possible_net_t; And then in a header say: > possible_net_t net; Which is cleaner and easier to use and easier to test, as the possible_net_t is always there no matter what the compile options. Further this allows read_pnet and write_pnet to be functions in all cases which is better at catching typos. This change adds possible_net_t, updates the definitions of read_pnet and write_pnet, updates optional struct net * variables that write_pnet uses on to have the type possible_net_t, and finally fixes up the b0rked users of read_pnet and write_pnet. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
efd7ef1c |
|
11-Mar-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Kill hold_net release_net hold_net and release_net were an idea that turned out to be useless. The code has been disabled since 2008. Kill the code it is long past due. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f8f21471 |
|
09-Mar-2015 |
Scott Feldman <sfeldma@gmail.com> |
switchdev: add netlink flags to IPv4 FIB add op Pass in the netlink flags (NLM_F_*) into switchdev driver for IPv4 FIB add op to allow driver to 1) optimize hardware updates, 2) handle ip route prepend and append commands correctly. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4586f1bb |
|
05-Mar-2015 |
Scott Feldman <sfeldma@gmail.com> |
netdevice: add IPv4 fib add/del ops Add two new ndo ops for IPv4 fib offload support, add and del. Add uses modifiy semantics if fib entry already offloaded. Drivers implementing the new ndo ops will return err<0 if programming device fails, for example if device's tables are full. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d476059e |
|
01-Mar-2015 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Kill dev_rebuild_header Now that there are no more users kill dev_rebuild_header and all of it's implementations. This is long overdue. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
846cd667 |
|
18-Feb-2015 |
Geert Uytterhoeven <geert@linux-m68k.org> |
net: Initialize all members in skb_gro_remcsum_init() skb_gro_remcsum_init() initializes the gro_remcsum.delta member only, leading to compiler warnings about a possibly uninitialized gro_remcsum.offset member: drivers/net/vxlan.c: In function ‘vxlan_gro_receive’: drivers/net/vxlan.c:602: warning: ‘grc.offset’ may be used uninitialized in this function net/ipv4/fou.c: In function ‘gue_gro_receive’: net/ipv4/fou.c:262: warning: ‘grc.offset’ may be used uninitialized in this function While these are harmless for now: - skb_gro_remcsum_process() sets offset before changing delta, - skb_gro_remcsum_cleanup() checks if delta is non-zero before accessing offset, it's safer to let the initialization function initialize all members. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
15e2396d |
|
10-Feb-2015 |
Tom Herbert <therbert@google.com> |
net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload This patch adds infrastructure so that remote checksum offload can set CHECKSUM_PARTIAL instead of calling csum_partial and writing the modfied checksum field. Add skb_remcsum_adjust_partial function to set an skb for using CHECKSUM_PARTIAL with remote checksum offload. Changed skb_remcsum_process and skb_gro_remcsum_process to take a boolean argument to indicate if checksum partial can be set or the checksum needs to be modified using the normal algorithm. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
baa32ff4 |
|
10-Feb-2015 |
Tom Herbert <therbert@google.com> |
net: Use more bit fields in napi_gro_cb This patch moves the free and same_flow fields to be bit fields (2 and 1 bit sized respectively). This frees up some space for u16's. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6edec0e6 |
|
10-Feb-2015 |
Tom Herbert <therbert@google.com> |
net: Clarify meaning of CHECKSUM_PARTIAL for receive path The current meaning of CHECKSUM_PARTIAL for validating checksums is that _all_ checksums in the packet are considered valid. However, in the manner that CHECKSUM_PARTIAL is set only the checksum at csum_start+csum_offset and any preceding checksums may be considered valid. If there are checksums in the packet after csum_offset it is possible they have not been verfied. This patch changes CHECKSUM_PARTIAL logic in skb_csum_unnecessary and __skb_gro_checksum_validate_needed to only considered checksums referring to csum_start and any preceding checksums (with starting offset before csum_start) to be verified. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
26c4f7da |
|
10-Feb-2015 |
Tom Herbert <therbert@google.com> |
net: Fix remcsum in GRO path to not change packet Remote checksum offload processing is currently the same for both the GRO and non-GRO path. When the remote checksum offload option is encountered, the checksum field referred to is modified in the packet. So in the GRO case, the packet is modified in the GRO path and then the operation is skipped when the packet goes through the normal path based on skb->remcsum_offload. There is a problem in that the packet may be modified in the GRO path, but then forwarded off host still containing the remote checksum option. A remote host will again perform RCO but now the checksum verification will fail since GRO RCO already modified the checksum. To fix this, we ensure that GRO restores a packet to it's original state before returning. In this model, when GRO processes a remote checksum option it still changes the checksum per the algorithm but on return from lower layer processing the checksum is restored to its original value. In this patch we add define gro_remcsum structure which is passed to skb_gro_remcsum_process to save offset and delta for the checksum being changed. After lower layer processing, skb_gro_remcsum_cleanup is called to restore the checksum before returning from GRO. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93c1af6c |
|
08-Feb-2015 |
Eric Dumazet <edumazet@google.com> |
net:rfs: adjust table size checking Make sure root user does not try something stupid. Also make sure mask field in struct rps_sock_flow_table does not share a cache line with the potentially often dirtied flow table. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 567e4b79731c ("net: rfs: add hash collision detection") Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
567e4b79 |
|
06-Feb-2015 |
Eric Dumazet <edumazet@google.com> |
net: rfs: add hash collision detection Receive Flow Steering is a nice solution but suffers from hash collisions when a mix of connected and unconnected traffic is received on the host, when flow hash table is populated. Also, clearing flow in inet_release() makes RFS not very good for short lived flows, as many packets can follow close(). (FIN , ACK packets, ...) This patch extends the information stored into global hash table to not only include cpu number, but upper part of the hash value. I use a 32bit value, and dynamically split it in two parts. For host with less than 64 possible cpus, this gives 6 bits for the cpu number, and 26 (32-6) bits for the upper part of the hash. Since hash bucket selection use low order bits of the hash, we have a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big enough. If the hash found in flow table does not match, we fallback to RPS (if it is enabled for the rxqueue). This means that a packet for an non connected flow can avoid the IPI through a unrelated/victim CPU. This also means we no longer have to clear the table at socket close time, and this helps short lived flows performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61bd3857 |
|
03-Feb-2015 |
Moni Shoua <monis@mellanox.com> |
net/core: Add event for a change in slave state Add event which provides an indication on a change in the state of a bonding slave. The event handler should cast the pointer to the appropriate type (struct netdev_bonding_info) in order to get the full info about the slave. Signed-off-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dcdc8994 |
|
02-Feb-2015 |
Tom Herbert <therbert@google.com> |
net: add skb functions to process remote checksum offload This patch adds skb_remcsum_process and skb_gro_remcsum_process to perform the appropriate adjustments to the skb when receiving remote checksum offload. Updated vxlan and gue to use these functions. Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did not see any change in performance. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
add511b3 |
|
29-Jan-2015 |
Roopa Prabhu <roopa@cumulusnetworks.com> |
bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellink bridge flags are needed inside ndo_bridge_setlink/dellink handlers to avoid another call to parse IFLA_AF_SPEC inside these handlers This is used later in this series Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7866a621 |
|
27-Jan-2015 |
Salam Noureddine <noureddine@arista.com> |
dev: add per net_device packet type chains When many pf_packet listeners are created on a lot of interfaces the current implementation using global packet type lists scales poorly. This patch adds per net_device packet type lists to fix this problem. The patch was originally written by Eric Biederman for linux-2.6.29. Tested on linux-3.16. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Salam Noureddine <noureddine@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4ccce02e |
|
14-Jan-2015 |
Benjamin Poirier <bpoirier@suse.de> |
netdevice: Add missing parentheses in macro For example, one could conceivably call for_each_netdev_in_bond_rcu(condition ? bond1 : bond2, slave) and get an unexpected result. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a2b12f3c |
|
12-Jan-2015 |
Tom Herbert <therbert@google.com> |
udp: pass udp_offload struct to UDP gro callbacks This patch introduces udp_offload_callbacks which has the same GRO functions (but not a GSO function) as offload_callbacks, except there is an argument to a udp_offload struct passed to gro_receive and gro_complete functions. This additional argument can be used to retrieve the per port structure of the encapsulation for use in gro processing (mostly by doing container_of on the structure). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d632cb7 |
|
12-Jan-2015 |
B Viswanath <marichika4@gmail.com> |
net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations Signed-off-by: B Viswanath <marichika4@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5f35227e |
|
23-Dec-2014 |
Jesse Gross <jesse@nicira.com> |
net: Generalize ndo_gso_check to ndo_features_check GSO isn't the only offload feature with restrictions that potentially can't be expressed with the current features mechanism. Checksum is another although it's a general issue that could in theory apply to anything. Even if it may be possible to implement these restrictions in other ways, it can result in duplicate code or inefficient per-packet behavior. This generalizes ndo_gso_check so that drivers can remove any features that don't make sense for a given packet, similar to netif_skb_features(). It also converts existing driver restrictions to the new format, completing the work that was done to support tunnel protocols since the issues apply to checksums as well. By actually removing features from the set that are used to do offloading, it solves another problem with the existing interface. In these cases, GSO would run with the original set of features and not do anything because it appears that segmentation is not required. CC: Tom Herbert <therbert@google.com> CC: Joe Stringer <joestringer@nicira.com> CC: Eric Dumazet <edumazet@google.com> CC: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Tom Herbert <therbert@google.com> Fixes: 04ffcb255f22 ("net: Add ndo_gso_check") Tested-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5933fea7 |
|
06-Dec-2014 |
Mahesh Bandewar <maheshb@google.com> |
ipvlan: move the device check function into netdevice.h Move the port check [ipvlan_dev_master()] and device check [ipvlan_dev_slave()] functions to netdevice.h and rename them netif_is_ipvlan_port() and netif_is_ipvlan() resp. to be consistent with macvlan api naming. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f33e7d5 |
|
06-Dec-2014 |
Mahesh Bandewar <maheshb@google.com> |
netdevice: Add a function to check macvlan port Similar to a check for macvlan device, netif_is_macvlan(), add another function to check if a device is used as macvlan port. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
38dcf357 |
|
28-Nov-2014 |
Scott Feldman <sfeldma@gmail.com> |
bridge: call netdev_sw_port_stp_update when bridge port STP status changes To notify switch driver of change in STP state of bridge port, add new .ndo op and provide switchdev wrapper func to call ndo op. Use it in bridge code then. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
007f790c |
|
28-Nov-2014 |
Jiri Pirko <jiri@resnulli.us> |
net: introduce generic switch devices support The goal of this is to provide a possibility to support various switch chips. Drivers should implement relevant ndos to do so. Now there is only one ndo defined: - for getting physical switch id is in place. Note that user can use random port netdevice to access the switch. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Thomas Graf <tgraf@suug.ch> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02637fce |
|
28-Nov-2014 |
Jiri Pirko <jiri@resnulli.us> |
net: rename netdev_phys_port_id to more generic name So this can be reused for identification of other "items" as well. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Thomas Graf <tgraf@suug.ch> Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6f6424b |
|
28-Nov-2014 |
Jiri Pirko <jiri@resnulli.us> |
net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del Do the work of parsing NDA_VLAN directly in rtnetlink code, pass simple u16 vid to drivers from there. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2ad7bf36 |
|
24-Nov-2014 |
Mahesh Bandewar <maheshb@google.com> |
ipvlan: Initial check-in of the IPVLAN driver. This driver is very similar to the macvlan driver except that it uses L3 on the frame to determine the logical interface while functioning as packet dispatcher. It inherits L2 of the master device hence the packets on wire will have the same L2 for all the packets originating from all virtual devices off of the same master device. This driver was developed keeping the namespace use-case in mind. Hence most of the examples given here take that as the base setup where main-device belongs to the default-ns and virtual devices are assigned to the additional namespaces. The device operates in two different modes and the difference in these two modes in primarily in the TX side. (a) L2 mode : In this mode, the device behaves as a L2 device. TX processing upto L2 happens on the stack of the virtual device associated with (namespace). Packets are switched after that into the main device (default-ns) and queued for xmit. RX processing is simple and all multicast, broadcast (if applicable), and unicast belonging to the address(es) are delivered to the virtual devices. (b) L3 mode : In this mode, the device behaves like a L3 device. TX processing upto L3 happens on the stack of the virtual device associated with (namespace). Packets are switched to the main-device (default-ns) for the L2 processing. Hence the routing table of the default-ns will be used in this mode. RX processins is somewhat similar to the L2 mode except that in this mode only Unicast packets are delivered to the virtual device while main-dev will handle all other packets. The devices can be added using the "ip" command from the iproute2 package - ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ] Signed-off-by: Mahesh Bandewar <maheshb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Laurent Chavey <chavey@google.com> Cc: Tim Hockin <thockin@google.com> Cc: Brandon Philips <brandon.philips@coreos.com> Cc: Pavel Emelianov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
960fb622 |
|
16-Nov-2014 |
Eric Dumazet <edumazet@google.com> |
net: provide a per host RSS key generic infrastructure RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes RSS key. Some drivers use a constant (and well known key), some drivers use a random key per port, making bonding setups hard to tune. Well known keys increase attack surface, considering that number of queues is usually a power of two. This patch provides infrastructure to help drivers doing the right thing. netdev_rss_key_fill() should be used by drivers to initialize their RSS key, even if they provide ethtool -X support to let user redefine the key later. A new /proc/sys/net/core/netdev_rss_key file can be used to get the host RSS key even for drivers not providing ethtool -x support, in case some applications want to precisely setup flows to match some RX queues. Tested: myhost:~# cat /proc/sys/net/core/netdev_rss_key 11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72 myhost:~# ethtool -x eth0 RX flow hash indirection table for eth0 with 8 RX ring(s): 0: 0 1 2 3 4 5 6 7 RSS hash key: 11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
09626e9d |
|
11-Nov-2014 |
WANG Cong <xiyou.wangcong@gmail.com> |
net: kill netif_copy_real_num_queues() vlan was the only user of netif_copy_real_num_queues(), but it no longer calls it after commit 4af429d29b341bb1735f04c2fb960178 ("vlan: lockless transmit path"). So we can just remove it. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b47d303 |
|
06-Nov-2014 |
Eric Dumazet <edumazet@google.com> |
net: gro: add a per device gro flush timer Tuning coalescing parameters on NIC can be really hard. Servers can handle both bulk and RPC like traffic, with conflicting goals : bulk flows want as big GRO packets as possible, RPC want minimal latencies. To reach big GRO packets on 10Gbe NIC, one can use : ethtool -C eth0 rx-usecs 4 rx-frames 44 But this penalizes rpc sessions, with an increase of latencies, up to 50% in some cases, as NICs generally do not force an interrupt when a packet with TCP Push flag is received. Some NICs do not have an absolute timer, only a timer rearmed for every incoming packet. This patch uses a different strategy : Let GRO stack decides what do do, based on traffic pattern. Packets with Push flag wont be delayed. Packets without Push flag might be held in GRO engine, if we keep receiving data. This new mechanism is off by default, and shall be enabled by setting /sys/class/net/ethX/gro_flush_timeout to a value in nanosecond. To fully enable this mechanism, drivers should use napi_complete_done() instead of napi_complete(). Tested: Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues) Without this feature, we send back about 305,000 ACK per second. GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet) Setting a timer of 2000 nsec is enough to increase GRO packet sizes and reduce number of ACK packets. (811/19.2 = 42) Receiver performs less calls to upper stacks, less wakes up. This also reduces cpu usage on the sender, as it receives less ACK packets. Note that reducing number of wakes up increases cpu efficiency, but can decrease QPS, as applications wont have the chance to warmup cpu caches doing a partial read of RPC requests/answers if they fit in one skb. B:~# sar -n DEV 1 10 | grep eth0 | tail -1 Average: eth0 811269.80 305732.30 1199462.57 19705.72 0.00 0.00 0.50 B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout B:~# sar -n DEV 1 10 | grep eth0 | tail -1 Average: eth0 811577.30 19230.80 1199916.51 1239.80 0.00 0.00 0.50 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
59b93b41 |
|
05-Nov-2014 |
Pravin B Shelar <pshelar@nicira.com> |
net: Remove MPLS GSO feature. Device can export MPLS GSO support in dev->mpls_features same way it export vlan features in dev->vlan_features. So it is safe to remove NETIF_F_GSO_MPLS redundant flag. Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
|
#
e585f236 |
|
04-Nov-2014 |
Tom Herbert <therbert@google.com> |
udp: Changes to udp_offload to support remote checksum offload Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote checksum offload being done (in this case inner checksum must not be offloaded to the NIC). Added logic in __skb_udp_tunnel_segment to handle remote checksum offload case. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
98a18b6f |
|
01-Nov-2014 |
Alexander Aring <alex.aring@gmail.com> |
netdevice: add ieee802154_ptr to net_device This patch adds an ieee802154_ptr to the net_device structure. Furthermore the 802.15.4 subsystem will introduce a nl802154 framework which is similar like the nl80211 framework and a wpan_dev structure. The wpan_dev structure will hold additional net_device attributes like address options which are 802.15.4 specific. In the upcoming nl802154 implementation we will introduce a NL802154_FLAG_NEED_WPAN_DEV like NL80211_FLAG_NEED_WDEV. For this flag an ieee802154_ptr in net_device is needed. Additional we can access the wpan_dev attributes in upper layers like IEEE 802.15.4 6LoWPAN easily. Current solution is a complicated callback interface and getting these values over subif data structure in mac802154. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
#
4cdb1e2e |
|
02-Nov-2014 |
Eric Dumazet <edumazet@google.com> |
net: shrink struct softnet_data flow_limit in struct softnet_data is only read from local cpu and can be moved to fill a hole, reducing softnet_data size by 64 bytes on x86_64 While we are at it, move output_queue, output_queue_tailp and completion_queue, so that rx / tx paths touch a single cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bc9ad166 |
|
28-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
net: introduce napi_schedule_irqoff() napi_schedule() can be called from any context and has to mask hard irqs. Add a variant that can only be called from hard interrupts handlers or when irqs are already masked. Many NIC drivers can use it from their hard IRQ handler instead of generic variant. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
04ffcb25 |
|
14-Oct-2014 |
Tom Herbert <therbert@google.com> |
net: Add ndo_gso_check Add ndo_gso_check which a device can define to indicate whether is is capable of doing GSO on a packet. This funciton would be called from the stack to determine whether software GSO is needed to be done. A driver should populate this function if it advertises GSO types for which there are combinations that it wouldn't be able to handle. For instance a device that performs UDP tunneling might only implement support for transparent Ethernet bridging type of inner packets or might have limitations on lengths of inner headers. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
53511453 |
|
08-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers Add two helpers so that drivers do not have to care of BQL being available or not. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jim Davis <jim.epost@gmail.com> Fixes: 29d40c903247 ("net/mlx4_en: Use prefetch in tx path") Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02875878 |
|
05-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
net: better IFF_XMIT_DST_RELEASE support Testing xmit_more support with netperf and connected UDP sockets, I found strange dst refcount false sharing. Current handling of IFF_XMIT_DST_RELEASE is not optimal. Dropping dst in validate_xmit_skb() is certainly too late in case packet was queued by cpu X but dequeued by cpu Y The logical point to take care of drop/force is in __dev_queue_xmit() before even taking qdisc lock. As Julian Anastasov pointed out, need for skb_dst() might come from some packet schedulers or classifiers. This patch adds new helper to cleanly express needs of various drivers or qdiscs/classifiers. Drivers that need skb_dst() in their ndo_start_xmit() should call following helper in their setup instead of the prior : dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; -> netif_keep_dst(dev); Instead of using a single bit, we use two bits, one being eventually rebuilt in bonding/team drivers. The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being rebuilt in bonding/team. Eventually, we could add something smarter later. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fcbeb976 |
|
05-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
net: introduce netdevice gso_min_segs attribute Some TSO engines might have a too heavy setup cost, that impacts performance on hosts sending small bursts (2 MSS per packet). This patch adds a device gso_min_segs, allowing drivers to set a minimum segment size for TSO packets, according to the NIC performance. Tested on a mlx4 NIC, this allows to get a ~110% increase of throughput when sending 2 MSS per packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
efc98d08 |
|
03-Oct-2014 |
Tom Herbert <therbert@google.com> |
fou: eliminate IPv4,v6 specific GRO functions This patch removes fou[46]_gro_receive and fou[46]_gro_complete functions. The v4 or v6 variants were chosen for the UDP offloads based on the address family of the socket this is not necessary or correct. Alternatively, this patch adds is_ipv6 to napi_gro_skb. This is set in udp6_gro_receive and unset in udp4_gro_receive. In fou_gro_receive the value is used to select the correct inet_offloads for the protocol of the outer IP header. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
55a93b3e |
|
03-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
qdisc: validate skb without holding lock Validation of skb can be pretty expensive : GSO segmentation and/or checksum computations. We can do this without holding qdisc lock, so that other cpus can queue additional packets. Trick is that requeued packets were already validated, so we carry a boolean so that sch_direct_xmit() can validate a fresh skb list, or directly use an old one. Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads host. Turning TSO on or off had no effect on throughput, only few more cpu cycles. Lock contention on qdisc lock disappeared. Same if disabling TX checksum offload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6ea754eb |
|
22-Sep-2014 |
Joe Perches <joe@perches.com> |
net: Change netdev_<level> logging functions to return void No caller or macro uses the return value so make all the functions return void. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
53e50398 |
|
20-Sep-2014 |
Tom Herbert <therbert@google.com> |
net: Remove gso_send_check as an offload callback The send_check logic was only interesting in cases of TCP offload and UDP UFO where the checksum needed to be initialized to the pseudo header checksum. Now we've moved that logic into the related gso_segment functions so gso_send_check is no longer needed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
afe93325 |
|
17-Sep-2014 |
Tom Herbert <therbert@google.com> |
fou: Add GRO support Implement fou_gro_receive and fou_gro_complete, and populate these in the correponsing udp_offloads for the socket. Added ipproto to udp_offloads and pass this from UDP to the fou GRO routine in proto field of napi_gro_cb structure. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5075314e |
|
15-Sep-2014 |
Alexander Duyck <alexander.h.duyck@intel.com> |
dsa: Split ops up, and avoid assigning tag_protocol and receive separately This change addresses several issues. First, it was possible to set tag_protocol without setting the ops pointer. To correct that I have reordered things so that rcv is now populated before we set tag_protocol. Second, it didn't make much sense to keep setting the device ops each time a new slave was registered. So by moving the receive portion out into root switch initialization that issue should be addressed. Third, I wanted to avoid sending tags if the rcv pointer was not registered so I changed the tag check to verify if the rcv function pointer is set on the root tree. If it is then we start sending DSA tagged frames. Finally I split the device ops pointer in the structures into two spots. I placed the rcv function pointer in the root switch since this makes it easiest to access from there, and I placed the xmit function pointer in the slave for the same reason. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3fc88677 |
|
13-Sep-2014 |
Alexander Duyck <alexander.h.duyck@intel.com> |
netdevice: Support DSA tagging when DSA is built as a module This change corrects an error seen when DSA tagging is built as a module. Without this change it is not possible to get XDSA tagged frames as the test for tagging is stripped by the #ifdef check. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6c555490 |
|
11-Sep-2014 |
WANG Cong <xiyou.wangcong@gmail.com> |
ipv6: drop useless rcu_read_lock() in anycast These code is now protected by rtnl lock, rcu read lock is useless now. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
46e5da40a |
|
12-Sep-2014 |
John Fastabend <john.fastabend@gmail.com> |
net: qdisc: use rcu prefix and silence sparse warnings Add __rcu notation to qdisc handling by doing this we can make smatch output more legible. And anyways some of the cases should be using rcu_dereference() see qdisc_all_tx_empty(), qdisc_tx_chainging(), and so on. Also *wake_queue() API is commonly called from driver timer routines without rcu lock or rtnl lock. So I added rcu_read_lock() blocks around netif_wake_subqueue and netif_tx_wake_queue. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b26b0d1e |
|
12-Sep-2014 |
John Fastabend <john.fastabend@gmail.com> |
net: qdisc: use rcu prefix and silence sparse warnings Add __rcu notation to qdisc handling by doing this we can make smatch output more legible. And anyways some of the cases should be using rcu_dereference() see qdisc_all_tx_empty(), qdisc_tx_chainging(), and so on. Also *wake_queue() API is commonly called from driver timer routines without rcu lock or rtnl lock. So I added rcu_read_lock() blocks around netif_wake_subqueue and netif_tx_wake_queue. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e793c0f7 |
|
04-Sep-2014 |
Masanari Iida <standby24x7@gmail.com> |
net: treewide: Fix typo found in DocBook/networking.xml This patch fix spelling typo found in DocBook/networking.xml. It is because the neworking.xml is generated from comments in the source, I have to fix typo in comments within the source. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d96535a1 |
|
31-Aug-2014 |
Tom Herbert <therbert@google.com> |
net: Infrastructure for checksum unnecessary conversions For normal path, added skb_checksum_try_convert which is called to attempt to convert CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE. The primary condition to allow this is that ip_summed is CHECKSUM_NONE and csum_valid is true, which will be the state after consuming a CHECKSUM_UNNECESSARY. For GRO path, added skb_gro_checksum_try_convert which is the GRO analogue of skb_checksum_try_convert. The primary condition to allow this is that NAPI_GRO_CB(skb)->csum_cnt == 0 and NAPI_GRO_CB(skb)->csum_valid is set. This implies that we have consumed all available CHECKSUM_UNNECESSARY checksums in the GRO path. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a212329 |
|
31-Aug-2014 |
Tom Herbert <therbert@google.com> |
net: Support for csum_bad in skbuff This flag indicates that an invalid checksum was detected in the packet. __skb_mark_checksum_bad helper function was added to set this. Checksums can be marked bad from a driver or the GRO path (the latter is implemented in this patch). csum_bad is checked in __skb_checksum_validate_complete (i.e. calling that when ip_summed == CHECKSUM_NONE). csum_bad works in conjunction with ip_summed value. In the case that ip_summed is CHECKSUM_NONE and csum_bad is set, this implies that the first (or next) checksum encountered in the packet is bad. When ip_summed is CHECKSUM_UNNECESSARY, the first checksum after the last one validated is bad. For example, if ip_summed == CHECKSUM_UNNECESSARY, csum_level == 1, and csum_bad is set-- then the third checksum in the packet is bad. In the normal path, the packet will be dropped when processing the protocol layer of the bad checksum: __skb_decr_checksum_unnecessary called twice for the good checksums changing ip_summed to CHECKSUM_NONE so that __skb_checksum_validate_complete is called to validate the third checksum and that will fail since csum_bad is set. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce93718f |
|
30-Aug-2014 |
David S. Miller <davem@davemloft.net> |
net: Don't keep around original SKB when we software segment GSO frames. Just maintain the list properly by returning the head of the remaining SKB list from dev_hard_start_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
50cbe9ab |
|
30-Aug-2014 |
David S. Miller <davem@davemloft.net> |
net: Validate xmit SKBs right when we pull them out of the qdisc. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fa2dbdc2 |
|
29-Aug-2014 |
David S. Miller <davem@davemloft.net> |
net: Pass a "more" indication down into netdev_start_xmit() code paths. For now it will always be false. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
10b3ad8c |
|
29-Aug-2014 |
David S. Miller <davem@davemloft.net> |
net: Do txq_trans_update() in netdev_start_xmit() That way we don't have to audit every call site to make sure it is doing this properly. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
662880f4 |
|
27-Aug-2014 |
Tom Herbert <therbert@google.com> |
net: Allow GRO to use and set levels of checksum unnecessary Allow GRO path to "consume" checksums provided in CHECKSUM_UNNECESSARY and to report new checksums verfied for use in fallback to normal path. Change GRO checksum path to track csum_level using a csum_cnt field in NAPI_GRO_CB. On GRO initialization, if ip_summed is CHECKSUM_UNNECESSARY set NAPI_GRO_CB(skb)->csum_cnt to skb->csum_level + 1. For each checksum verified, decrement NAPI_GRO_CB(skb)->csum_cnt while its greater than zero. If a checksum is verfied and NAPI_GRO_CB(skb)->csum_cnt == 0, we have verified a deeper checksum than originally indicated in skbuf so increment csum_level (or initialize to CHECKSUM_UNNECESSARY if ip_summed is CHECKSUM_NONE or CHECKSUM_COMPLETE). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
10c51b5623 |
|
27-Aug-2014 |
Daniel Borkmann <daniel@iogearbox.net> |
net: add skb_get_tx_queue() helper Replace occurences of skb_get_queue_mapping() and follow-up netdev_get_tx_queue() with an actual helper function. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5aed85ce |
|
27-Aug-2014 |
Florian Fainelli <f.fainelli@gmail.com> |
net: dsa: allow switches to work without tagging In case switch port tagging is disabled (voluntarily, or the switch just does not support it), allow us to continue using the defined set of dsa_device_ops in net/dsa/slave.c. We introduce dsa_protocol_is_tagged() to check whether we need to override skb->protocol and go through the DSA-specifif packet_type function, or if we just go on and receive the SKB through the normal path. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3e8a72d1 |
|
27-Aug-2014 |
Florian Fainelli <f.fainelli@gmail.com> |
net: dsa: reduce number of protocol hooks DSA is currently registering one packet_type function per EtherType it needs to intercept in the receive path of a DSA-enabled Ethernet device. Right now we have three of them: trailer, DSA and eDSA, and there might be more in the future, this will not scale to the addition of new protocols. This patch proceeds with adding a new layer of abstraction and two new functions: dsa_switch_rcv() which will dispatch into the tag-protocol specific receive function implemented by net/dsa/tag_*.c dsa_slave_xmit() which will dispatch into the tag-protocol specific transmit function implemented by net/dsa/tag_*.c When we do create the per-port slave network devices, we iterate over the switch protocol to assign the DSA-specific receive and transmit operations. A new fake ethertype value is used: ETH_P_XDSA to illustrate the fact that this is no longer going to look like ETH_P_DSA or ETH_P_TRAILER like it used to be. This allows us to greatly simplify the check in eth_type_trans() and always override the skb->protocol with ETH_P_XDSA for Ethernet switches tagged protocol, while also reducing the number repetitive slave netdevice_ops assignments. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0b725a2c |
|
25-Aug-2014 |
David S. Miller <davem@davemloft.net> |
net: Remove ndo_xmit_flush netdev operation, use signalling instead. As reported by Jesper Dangaard Brouer, for high packet rates the overhead of having another indirect call in the TX path is non-trivial. There is the indirect call itself, and then there is all of the reloading of the state to refetch the tail pointer value and then write the device register. Move to a more passive scheme, which requires very light modifications to the device drivers. The signal is a new skb->xmit_more value, if it is non-zero it means that more SKBs are pending to be transmitted on the same queue as the current SKB. And therefore, the driver may elide the tail pointer update. Right now skb->xmit_more is always zero. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4798248e |
|
22-Aug-2014 |
David S. Miller <davem@davemloft.net> |
net: Add ops->ndo_xmit_flush() Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
573e8fca |
|
22-Aug-2014 |
Tom Herbert <therbert@google.com> |
net: skb_gro_checksum_* functions Add skb_gro_checksum_validate, skb_gro_checksum_validate_zero_check, and skb_gro_checksum_simple_validate, and __skb_gro_checksum_complete. These are the cognates of the normal checksum functions but are used in the gro_receive path and operate on GRO related fields in sk_buffs. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d4261e56 |
|
19-Aug-2014 |
Jiri Pirko <jiri@resnulli.us> |
bonding: create netlink event when bonding option is changed Userspace needs to be notified if one changes some option. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Veaceslav Falico <vfalico@gmail.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
536721b1 |
|
30-Jul-2014 |
Karoly Kemeny <karoly.kemeny@gmail.com> |
net: kernel-doc compliant documentation for net_device Net_device is a vast and important structure, but it has no kernel-doc compliant documentation. This patch extracts the comments from the structure to clean it up, and let the scripts extract documentation from it. I know that the patch is big, but it's just reordering of comments into the appropriate form, and adding a few more, for the missing members. Signed-off-by: Karoly Kemeny <karoly.kemeny@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ccc7f496 |
|
17-Jul-2014 |
Veaceslav Falico <vfalico@gmail.com> |
net: print net_device reg_state in netdev_* unless it's registered This way we'll always know in what status the device is, unless it's running normally (i.e. NETDEV_REGISTERED). Also, emit a warning once in case of a bad reg_state. CC: "David S. Miller" <davem@davemloft.net> CC: Jason Baron <jbaron@akamai.com> CC: Eric Dumazet <edumazet@google.com> CC: Vlad Yasevich <vyasevic@redhat.com> CC: stephen hemminger <stephen@networkplumber.org> CC: Jerry Chu <hkchu@google.com> CC: Ben Hutchings <bhutchings@solarflare.com> CC: Joe Perches <joe@perches.com> Signed-off-by: Veaceslav Falico <vfalico@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6f854d5 |
|
17-Jul-2014 |
Veaceslav Falico <vfalico@gmail.com> |
net: use dev->name in netdev_pr* when it's available netdev_name() returns dev->name only when the net_device is in NETREG_REGISTERED state. However, dev->name is always populated on creation, so we can easily use it. There are two cases when there's no real name - when it's an empty string or when the name is in form of "eth%d", then netdev_name() returns "unnamed net_device". CC: "David S. Miller" <davem@davemloft.net> CC: Tom Gundersen <teg@jklm.no> Signed-off-by: Veaceslav Falico <vfalico@gmail.com> Acked-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c835a677 |
|
14-Jul-2014 |
Tom Gundersen <teg@jklm.no> |
net: set name_assign_type in alloc_netdev() Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert all users to pass NET_NAME_UNKNOWN. Coccinelle patch: @@ expression sizeof_priv, name, setup, txqs, rxqs, count; @@ ( -alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs) +alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs) | -alloc_netdev_mq(sizeof_priv, name, setup, count) +alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count) | -alloc_netdev(sizeof_priv, name, setup) +alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup) ) v9: move comments here from the wrong commit Signed-off-by: Tom Gundersen <teg@jklm.no> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
685343fc |
|
14-Jul-2014 |
Tom Gundersen <teg@jklm.no> |
net: add name_assign_type netdev attribute Based on a patch by David Herrmann. The name_assign_type attribute gives hints where the interface name of a given net-device comes from. These values are currently defined: NET_NAME_ENUM: The ifname is provided by the kernel with an enumerated suffix, typically based on order of discovery. Names may be reused and unpredictable. NET_NAME_PREDICTABLE: The ifname has been assigned by the kernel in a predictable way that is guaranteed to avoid reuse and always be the same for a given device. Examples include statically created devices like the loopback device and names deduced from hardware properties (including being given explicitly by the firmware). Names depending on the order of discovery, or in any other way on the existence of other devices, must not be marked as PREDICTABLE. NET_NAME_USER: The ifname was provided by user-space during net-device setup. NET_NAME_RENAMED: The net-device has been renamed from userspace. Once this type is set, it cannot change again. NET_NAME_UNKNOWN: This is an internal placeholder to indicate that we yet haven't yet categorized the name. It will not be exposed to userspace, rather -EINVAL is returned. The aim of these patches is to improve user-space renaming of interfaces. As a general rule, userspace must rename interfaces to guarantee that names stay the same every time a given piece of hardware appears (at boot, or when attaching it). However, there are several situations where userspace should not perform the renaming, and that depends on both the policy of the local admin, but crucially also on the nature of the current interface name. If an interface was created in repsonse to a userspace request, and userspace already provided a name, we most probably want to leave that name alone. The main instance of this is wifi-P2P devices created over nl80211, which currently have a long-standing bug where they are getting renamed by udev. We label such names NET_NAME_USER. If an interface, unbeknown to us, has already been renamed from userspace, we most probably want to leave also that alone. This will typically happen when third-party plugins (for instance to udev, but the interface is generic so could be from anywhere) renames the interface without informing udev about it. A typical situation is when you switch root from an installer or an initrd to the real system and the new instance of udev does not know what happened before the switch. These types of problems have caused repeated issues in the past. To solve this, once an interface has been renamed, its name is labelled NET_NAME_RENAMED. In many cases, the kernel is actually able to name interfaces in such a way that there is no need for userspace to rename them. This is the case when the enumeration order of devices, or in fact any other (non-parent) device on the system, can not influence the name of the interface. Examples include statically created devices, or any naming schemes based on hardware properties of the interface. In this case the admin may prefer to use the kernel-provided names, and to make that possible we label such names NET_NAME_PREDICTABLE. We want the kernel to have tho possibilty of performing predictable interface naming itself (and exposing to userspace that it has), as the information necessary for a proper naming scheme for a certain class of devices may not be exposed to userspace. The case where renaming is almost certainly desired, is when the kernel has given the interface a name using global device enumeration based on order of discovery (ethX, wlanY, etc). These naming schemes are labelled NET_NAME_ENUM. Lastly, a fallback is left as NET_NAME_UNKNOWN, to indicate that a driver has not yet been ported. This is mostly useful as a transitionary measure, allowing us to label the various naming schemes bit by bit. v8: minor documentation fixes v9: move comment to the right commit Signed-off-by: Tom Gundersen <teg@jklm.no> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Reviewed-by: Kay Sievers <kay@vrfy.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d5eacb3 |
|
10-Jul-2014 |
Jamal Hadi Salim <jhs@mojatatu.com> |
bridge: fdb dumping takes a filter device Dumping a bridge fdb dumps every fdb entry held. With this change we are going to filter on selected bridge port. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0e001614 |
|
01-Jul-2014 |
Tom Herbert <therbert@google.com> |
net: Call skb_get_hash in get_xps_queue and __skb_tx_hash Call standard function to get a packet hash instead of taking this from skb->sk->sk_hash or only using skb->protocol. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4b28252c |
|
15-Jun-2014 |
Tom Herbert <therbert@google.com> |
net: Fix GSO constants to match NETIF flags Joseph Gasparakis reported that VXLAN GSO offload stopped working with i40e device after recent UDP changes. The problem is that the SKB_GSO_* bits are out of sync with the corresponding NETIF flags. This patch fixes that. Also, we add BUILD_BUG_ONs in net_gso_ok for several GSO constants that were missing to avoid the problem in the future. Reported-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
670e5b8e |
|
28-May-2014 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add support for device specific address syncing This change provides a function to be used in order to break the ndo_set_rx_mode call into a set of address add and remove calls. The code is based on the implementation of dev_uc_sync/dev_mc_sync. Since they essentially do the same thing but with only one dev I simply named my functions __dev_uc_sync/__dev_mc_sync. I also implemented an unsync version of the functions as well to allow for cleanup on close. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ed616689 |
|
22-May-2014 |
Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> |
net-next:v4: Add support to configure SR-IOV VF minimum and maximum Tx rate through ip tool. o min_tx_rate puts lower limit on the VF bandwidth. VF is guaranteed to have a bandwidth of at least this value. max_tx_rate puts cap on the VF bandwidth. VF can have a bandwidth of up to this value. o A new handler set_vf_rate for attr IFLA_VF_RATE has been introduced which takes 4 arguments: netdev, VF number, min_tx_rate, max_tx_rate o ndo_set_vf_rate replaces ndo_set_vf_tx_rate handler. o Drivers that currently implement ndo_set_vf_tx_rate should now call ndo_set_vf_rate instead and reject attempt to set a minimum bandwidth greater than 0 for IFLA_VF_TX_RATE when IFLA_VF_RATE is not yet implemented by driver. o If user enters only one of either min_tx_rate or max_tx_rate, then, userland should read back the other value from driver and set both for IFLA_VF_RATE. Drivers that have not yet implemented IFLA_VF_RATE should always return min_tx_rate as 0 when read from ip tool. o If both IFLA_VF_TX_RATE and IFLA_VF_RATE options are specified, then IFLA_VF_RATE should override. o Idea is to have consistent display of rate values to user. o Usage example: - ./ip link set p4p1 vf 0 rate 900 ./ip link show p4p1 32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000 link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 900 (Mbps), max_tx_rate 900Mbps vf 1 MAC f6:c6:7c:3f:3d:6c vf 2 MAC 56:32:43:98:d7:71 vf 3 MAC d6:be:c3:b5:85:ff vf 4 MAC ee:a9:9a:1e:19:14 vf 5 MAC 4a:d0:4c:07:52:18 vf 6 MAC 3a:76:44:93:62:f9 vf 7 MAC 82:e9:e7:e3:15:1a ./ip link set p4p1 vf 0 max_tx_rate 300 min_tx_rate 200 ./ip link show p4p1 32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000 link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 300 (Mbps), max_tx_rate 300Mbps, min_tx_rate 200Mbps vf 1 MAC f6:c6:7c:3f:3d:6c vf 2 MAC 56:32:43:98:d7:71 vf 3 MAC d6:be:c3:b5:85:ff vf 4 MAC ee:a9:9a:1e:19:14 vf 5 MAC 4a:d0:4c:07:52:18 vf 6 MAC 3a:76:44:93:62:f9 vf 7 MAC 82:e9:e7:e3:15:1a ./ip link set p4p1 vf 0 max_tx_rate 600 rate 300 ./ip link show p4p1 32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000 link/ether 00:0e:1e:08:b0:f brd ff:ff:ff:ff:ff:ff vf 0 MAC 3e:a0:ca:bd:ae:5, tx rate 600 (Mbps), max_tx_rate 600Mbps, min_tx_rate 200Mbps vf 1 MAC f6:c6:7c:3f:3d:6c vf 2 MAC 56:32:43:98:d7:71 vf 3 MAC d6:be:c3:b5:85:ff vf 4 MAC ee:a9:9a:1e:19:14 vf 5 MAC 4a:d0:4c:07:52:18 vf 6 MAC 3a:76:44:93:62:f9 vf 7 MAC 82:e9:e7:e3:15:1a Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
da08143b |
|
20-May-2014 |
Michal Kubeček <mkubecek@suse.cz> |
vlan: more careful checksum features handling When combining real_dev's features and vlan_features, simple bitwise AND is used. This doesn't work well for checksum offloading features as if one set has NETIF_F_HW_CSUM and the other NETIF_F_IP_CSUM and/or NETIF_F_IPV6_CSUM, we end up with no checksum offloading. However, from the logical point of view (how can_checksum_protocol() works), NETIF_F_HW_CSUM contains the functionality of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM so that the result should be IP/IPV6. Add helper function netdev_intersect_features() implementing this logic and use it in vlan_dev_fix_features(). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
44a40855 |
|
16-May-2014 |
Vlad Yasevich <vyasevic@redhat.com> |
bonding: Fix stacked device detection in arp monitoring Prior to commit fbd929f2dce460456807a51e18d623db3db9f077 bonding: support QinQ for bond arp interval the arp monitoring code allowed for proper detection of devices stacked on top of vlans. Since the above commit, the code can still detect a device stacked on top of single vlan, but not a device stacked on top of Q-in-Q configuration. The search will only set the inner vlan tag if the route device is the vlan device. However, this is not always the case, as it is possible to extend the stacked configuration. With this patch it is possible to provision devices on top Q-in-Q vlan configuration that should be used as a source of ARP monitoring information. For example: ip link add link bond0 vlan10 type vlan proto 802.1q id 10 ip link add link vlan10 vlan100 type vlan proto 802.1q id 100 ip link add link vlan100 type macvlan Note: This patch limites the number of stacked VLANs to 2, just like before. The original, however had another issue in that if we had more then 2 levels of VLANs, we would end up generating incorrectly tagged traffic. This is no longer possible. Fixes: fbd929f2dce460456807a51e18d623db3db9f077 (bonding: support QinQ for bond arp interval) CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@redhat.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: Ding Tianhong <dingtianhong@huawei.com> CC: Patric McHardy <kaber@trash.net> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
25175ba5 |
|
16-May-2014 |
Vlad Yasevich <vyasevic@redhat.com> |
net: Allow for more then a single subclass for netif_addr_lock Currently netif_addr_lock_nested assumes that there can be only a single nesting level between 2 devices. However, if we have multiple devices of the same type stacked, this fails. For example: eth0 <-- vlan0.10 <-- vlan0.10.20 A more complicated configuration may stack more then one type of device in different order. Ex: eth0 <-- vlan0.10 <-- macvlan0 <-- vlan1.10.20 <-- macvlan1 This patch adds an ndo_* function that allows each stackable device to report its nesting level. If the device doesn't provide this function default subclass of 1 is used. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4085ebe8 |
|
16-May-2014 |
Vlad Yasevich <vyasevic@redhat.com> |
net: Find the nesting level of a given device by type. Multiple devices in the kernel can be stacked/nested and they need to know their nesting level for the purposes of lockdep. This patch provides a generic function that determines a nesting level of a particular device by its type (ex: vlan, macvlan, etc). We only care about nesting of the same type of devices. For example: eth0 <- vlan0.10 <- macvlan0 <- vlan1.20 The nesting level of vlan1.20 would be 1, since there is another vlan in the stack under it. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7ad24ea4 |
|
10-May-2014 |
Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de> |
net: get rid of SET_ETHTOOL_OPS net: get rid of SET_ETHTOOL_OPS Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone. This does that. Mostly done via coccinelle script: @@ struct ethtool_ops *ops; struct net_device *dev; @@ - SET_ETHTOOL_OPS(dev, ops); + dev->ethtool_ops = ops; Compile tested only, but I'd seriously wonder if this broke anything. Suggested-by: Dave Miller <davem@davemloft.net> Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de> Acked-by: Felipe Balbi <balbi@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c1e756bf |
|
05-May-2014 |
Florian Westphal <fw@strlen.de> |
Revert "net: core: introduce netif_skb_dev_features" This reverts commit d206940319c41df4299db75ed56142177bb2e5f6, there are no more callers. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a0265d28 |
|
16-Apr-2014 |
Herbert Xu <herbert@gondor.apana.org.au> |
net: Add __dev_forward_skb This patch adds the helper __dev_forward_skb which is identical to dev_forward_skb except that it doesn't actually inject the skb into the stack. This is useful where we wish to have finer control over how the packet is injected, e.g., via netif_rx_ni or netif_receive_skb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4e857c58 |
|
17-Mar-2014 |
Peter Zijlstra <peterz@infradead.org> |
arch: Mass conversion of smp_mb__*() Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
8e2f1a63 |
|
02-Apr-2014 |
Daniel Borkmann <daniel@iogearbox.net> |
packet: fix packet_direct_xmit for BQL enabled drivers Currently, in packet_direct_xmit() we test the assigned netdevice queue for netif_xmit_frozen_or_stopped() before doing an ndo_start_xmit(). This can have the side-effect that BQL enabled drivers which make use of netdev_tx_sent_queue() internally, set __QUEUE_STATE_STACK_XOFF from within the stack and would not fully fill the device's TX ring from packet sockets with PACKET_QDISC_BYPASS enabled. Instead, use a test without BQL bit so that bursts can be absorbed into the NICs TX ring. Fix and code suggested by Eric Dumazet, thanks! Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a50e233c |
|
29-Mar-2014 |
Eric Dumazet <edumazet@google.com> |
net-gro: restore frag0 optimization Main difference between napi_frags_skb() and napi_gro_receive() is that the later is called while ethernet header was already pulled by the NIC driver (eth_type_trans() was called before napi_gro_receive()) Jerry Chu in commit 299603e8370a ("net-gro: Prepare GRO stack for the upcoming tunneling support") tried to remove this difference by calling eth_type_trans() from napi_frags_skb() instead of doing this later from napi_frags_finish() Goal was that napi_gro_complete() could call ptype->callbacks.gro_complete(skb, 0) (offset of first network header = 0) Also, xxx_gro_receive() handlers all use off = skb_gro_offset(skb) to point to their own header, for the current skb and ones held in gro_list Problem is this cleanup work defeated the frag0 optimization: It turns out the consecutive pskb_may_pull() calls are too expensive. This patch brings back the frag0 stuff in napi_frags_skb(). As all skb have their mac header in skb head, we no longer need skb_gro_mac_header() Reported-by: Michal Schmidt <mschmidt@redhat.com> Fixes: 299603e8370a ("net-gro: Prepare GRO stack for the upcoming tunneling support") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2d3b479d |
|
29-Mar-2014 |
david decotigny <decot@googlers.com> |
net-sysfs: expose number of carrier on/off changes This allows to monitor carrier on/off transitions and detect link flapping issues: - new /sys/class/net/X/carrier_changes - new rtnetlink IFLA_CARRIER_CHANGES (getlink) Tested: - grep . /sys/class/net/*/carrier_changes + ip link set dev X down/up + plug/unplug cable - updated iproute2: prints IFLA_CARRIER_CHANGES - iproute2 20121211-2 (debian): unchanged behavior Signed-off-by: David Decotigny <decot@googlers.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
339e0223 |
|
28-Mar-2014 |
Florian Fainelli <f.fainelli@gmail.com> |
net: export NET_ADDR_* values to user-space API NET_ADDR_* values are exported in the /sys/class/net/<iface>/addr_assign_type sysfs attributes, and as such constitutes an user-space ABI. Move the NET_ADDR_* definitions from include/linux/netdevice.h to include/uapi/linux/netdevice.h Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ee481fb |
|
27-Mar-2014 |
Vlad Yasevich <vyasevic@redhat.com> |
net: Allow modules to use is_skb_forwardable Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5efeac44 |
|
27-Mar-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
netpoll: Respect NETIF_F_LLTX Stop taking the transmit lock when a network device has specified NETIF_F_LLTX. If no locks needed to trasnmit a packet this is the ideal scenario for netpoll as all packets can be trasnmitted immediately. Even if some locks are needed in ndo_start_xmit skipping any unnecessary serialization is desirable for netpoll as it makes it more likely a debugging packet may be trasnmitted immediately instead of being deferred until later. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a8779ec1 |
|
27-Mar-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
netpoll: Remove gfp parameter from __netpoll_setup The gfp parameter was added in: commit 47be03a28cc6c80e3aa2b3e8ed6d960ff0c5c0af Author: Amerigo Wang <amwang@redhat.com> Date: Fri Aug 10 01:24:37 2012 +0000 netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup() slave_enable_netpoll() and __netpoll_setup() may be called with read_lock() held, so should use GFP_ATOMIC to allocate memory. Eric suggested to pass gfp flags to __netpoll_setup(). Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> The reason for the gfp parameter was removed in: commit c4cdef9b7183159c23c7302aaf270d64c549f557 Author: dingtianhong <dingtianhong@huawei.com> Date: Tue Jul 23 15:25:27 2013 +0800 bonding: don't call slave_xxx_netpoll under spinlocks The slave_xxx_netpoll will call synchronize_rcu_bh(), so the function may schedule and sleep, it should't be called under spinlocks. bond_netpoll_setup() and bond_netpoll_cleanup() are always protected by rtnl lock, it is no need to take the read lock, as the slave list couldn't be changed outside rtnl lock. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Cc: Jay Vosburgh <fubar@us.ibm.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net> Nothing else that calls __netpoll_setup or ndo_netpoll_setup requires a gfp paramter, so remove the gfp parameter from both of these functions making the code clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
53d6471c |
|
27-Mar-2014 |
Vlad Yasevich <vyasevic@redhat.com> |
net: Account for all vlan headers in skb_mac_gso_segment skb_network_protocol() already accounts for multiple vlan headers that may be present in the skb. However, skb_mac_gso_segment() doesn't know anything about it and assumes that skb->mac_len is set correctly to skip all mac headers. That may not always be the case. If we are simply forwarding the packet (via bridge or macvtap), all vlan headers may not be accounted for. A simple solution is to allow skb_network_protocol to return the vlan depth it has calculated. This way skb_mac_gso_segment will correctly skip all mac headers. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
015f0688 |
|
27-Mar-2014 |
Eric Dumazet <edumazet@google.com> |
net: net: add a core netdev->tx_dropped counter Dropping packets in __dev_queue_xmit() when transmit queue is stopped (NIC TX ring buffer full or BQL limit reached) currently outputs a syslog message. It would be better to get a precise count of such events available in netdevice stats so that monitoring tools can have a clue. This extends the work done in caf586e5f23ce ("net: add a core netdev->rx_dropped counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9c62a68d |
|
14-Mar-2014 |
Eric W. Biederman <ebiederm@xmission.com> |
netpoll: Remove dead packet receive code (CONFIG_NETPOLL_TRAP) The netpoll packet receive code only becomes active if the netpoll rx_skb_hook is implemented, and there is not a single implementation of the netpoll rx_skb_hook in the kernel. All of the out of tree implementations I have found all call netpoll_poll which was removed from the kernel in 2011, so this change should not add any additional breakage. There are problems with the netpoll packet receive code. __netpoll_rx does not call dev_kfree_skb_irq or dev_kfree_skb_any in hard irq context. netpoll_neigh_reply leaks every skb it receives. Reception of packets does not work successfully on stacked devices (aka bonding, team, bridge, and vlans). Given that the netpoll packet receive code is buggy, there are no out of tree users that will be merged soon, and the code has not been used for in tree for a decade let's just remove it. Reverting this commit can server as a starting point for anyone who wants to resurrect netpoll packet reception support. Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
693350c2 |
|
10-Mar-2014 |
stephen hemminger <stephen@networkplumber.org> |
netdev: set __percpu attribute on netdev_alloc_pcpu_stats This patch fixes sparse warnings in vlan driver. It propagates the sparse __percpu attribute from alloc_percpu into netdev_alloc_pcpu_stats. I expect it may trigger additional sparse warnings from other drivers that are missing the __percpu attribute. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7aa98047 |
|
25-Feb-2014 |
Luis R. Rodriguez <mcgrof@suse.com> |
net: move net_device priv_flags out from UAPI These are private to userspace, and they're unstable anyway and can be shuffled at will (see 080e4130b1fb) so any userspace application relying on them is on crack. Test compiled with allyesconfig. mcgrof@drvbp1 /pub/mem/mcgrof/net-next (git::master)$ make allyesconfig mcgrof@drvbp1 /pub/mem/mcgrof/net-next (git::master)$ time make -j 20 ... BUILD arch/x86/boot/bzImage Setup is 16992 bytes (padded to 17408 bytes). System is 56153 kB CRC 721d2751 Kernel: arch/x86/boot/bzImage is ready (#1) real 19m35.744s user 280m37.984s sys 27m54.104s Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3f85944f |
|
25-Feb-2014 |
Amir Vadai <amirv@mellanox.com> |
net: Add sysfs file for port number Add a sysfs file to enable user space to query the device port number used by a netdevice instance. This is needed for devices that have multiple ports on the same PCI function. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f8ff080d |
|
17-Feb-2014 |
Veaceslav Falico <vfalico@redhat.com> |
bonding: remove useless updating of slave->dev->last_rx Now that all the logic is handled via last_arp_rx, we don't need to use last_rx. CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b9507bda |
|
16-Feb-2014 |
Daniel Borkmann <daniel@iogearbox.net> |
netdevice: move netdev_cap_txqueue for shared usage to header In order to allow users to invoke netdev_cap_txqueue, it needs to be moved into netdevice.h header file. While at it, also add kernel doc header to document the API. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
99932d4f |
|
16-Feb-2014 |
Daniel Borkmann <daniel@iogearbox.net> |
netdevice: add queue selection fallback handler for ndo_select_queue Add a new argument for ndo_select_queue() callback that passes a fallback handler. This gets invoked through netdev_pick_tx(); fallback handler is currently __netdev_pick_tx() as most drivers invoke this function within their customized implementation in case for skbs that don't need any special handling. This fallback handler can then be replaced on other call-sites with different queue selection methods (e.g. in packet sockets, pktgen etc). This also has the nice side-effect that __netdev_pick_tx() is then only invoked from netdev_pick_tx() and export of that function to modules can be undone. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1c213bd2 |
|
13-Feb-2014 |
WANG Cong <xiyou.wangcong@gmail.com> |
net: introduce netdev_alloc_pcpu_stats() for drivers There are many drivers calling alloc_percpu() to allocate pcpu stats and then initializing ->syncp. So just introduce a helper function for them. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d2069403 |
|
13-Feb-2014 |
Florian Westphal <fw@strlen.de> |
net: core: introduce netif_skb_dev_features Will be used by upcoming ipv4 forward path change that needs to determine feature mask using skb->dst->dev instead of skb->dev. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a9517d0f |
|
22-Jan-2014 |
Jiri Pirko <jiri@resnulli.us> |
rtnetlink: remove ndo_get_slave No longer used API bond-specific can be removed now. This is now handled in a generic way in rtnl_link_ops. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b582ef09 |
|
20-Jan-2014 |
Or Gerlitz <ogerlitz@mellanox.com> |
net: Add GRO support for UDP encapsulating protocols Add GRO handlers for protocols that do UDP encapsulation, with the intent of being able to coalesce packets which encapsulate packets belonging to the same TCP session. For GRO purposes, the destination UDP port takes the role of the ether type field in the ethernet header or the next protocol in the IP header. The UDP GRO handler will only attempt to coalesce packets whose destination port is registered to have gro handler. Use a mark on the skb GRO CB data to disallow (flush) running the udp gro receive code twice on a packet. This solves the problem of udp encapsulated packets whose inner VM packet is udp and happen to carry a port which has registered offloads. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d3ee88a |
|
16-Jan-2014 |
sfeldma@cumulusnetworks.com <sfeldma@cumulusnetworks.com> |
bonding: add netlink attributes to slave link dev If link is IFF_SLAVE, extend link dev netlink attributes to include slave attributes with new IFLA_SLAVE nest. Add netlink notification (RTM_NEWLINK) when slave status changes from backup to active, or visa-versa. Adds new ndo_get_slave op to net_device_ops to fill skb with IFLA_SLAVE attributes. Currently only used by bonding driver, but could be used by other aggregating devices with slaves. Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a953be53 |
|
16-Jan-2014 |
Michael Dalton <mwdalton@google.com> |
net-sysfs: add support for device-specific rx queue sysfs attributes Extend existing support for netdevice receive queue sysfs attributes to permit a device-specific attribute group. Initial use case for this support will be to allow the virtio-net device to export per-receive queue mergeable receive buffer size. Signed-off-by: Michael Dalton <mwdalton@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d486bfb |
|
15-Jan-2014 |
Veaceslav Falico <vfalico@redhat.com> |
net: add NETDEV_PRECHANGEMTU to notify before mtu change happens Currently, if a device changes its mtu, first the change happens (invloving all the side effects), and after that the NETDEV_CHANGEMTU is sent so that other devices can catch up with the new mtu. However, if they return NOTIFY_BAD, then the change is reverted and error returned. This is a really long and costy operation (sometimes). To fix this, add NETDEV_PRECHANGEMTU notification which is called prior to any change actually happening, and if any callee returns NOTIFY_BAD - the change is aborted. This way we're skipping all the playing with apply/revert the mtu. CC: "David S. Miller" <davem@davemloft.net> CC: Jiri Pirko <jiri@resnulli.us> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5bb025fa |
|
14-Jan-2014 |
Veaceslav Falico <vfalico@redhat.com> |
net: rename sysfs symlinks on device name change Currently, we don't rename the upper/lower_ifc symlinks in /sys/class/net/*/ , which might result stale/duplicate links/names. Fix this by adding netdev_adjacent_rename_links(dev, oldname) which renames all the upper/lower interface's links to dev from the upper/lower_oldname to the new name. We don't need a rollback because only we control these symlinks and if we fail to rename them - sysfs will anyway complain. Reported-by: Ding Tianhong <dingtianhong@huawei.com> CC: Ding Tianhong <dingtianhong@huawei.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f663dd9a |
|
10-Jan-2014 |
Jason Wang <jasowang@redhat.com> |
net: core: explicitly select a txq before doing l2 forwarding Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The will cause several issues: - NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan instead of lower device which misses the necessary txq synchronization for lower device such as txq stopping or frozen required by dev watchdog or control path. - dev_hard_start_xmit() was called with NULL txq which bypasses the net device watchdog. - dev_hard_start_xmit() does not check txq everywhere which will lead a crash when tso is disabled for lower device. Fix this by explicitly introducing a new param for .ndo_select_queue() for just selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also extended to accept this parameter and dev_queue_xmit_accel() was used to do l2 forwarding transmission. With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of dev_queue_xmit() to do the transmission. In the future, it was also required for macvtap l2 forwarding support since it provides a necessary synchronization method. Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: e1000-devel@lists.sourceforge.net Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bf5a755f |
|
07-Jan-2014 |
Jerry Chu <hkchu@google.com> |
net-gre-gro: Add GRE support to the GRO stack This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO stack. It also serves as an example for supporting other encapsulation protocols in the GRO stack in the future. The patch supports version 0 and all the flags (key, csum, seq#) but will flush any pkt with the S (seq#) flag. This is because the S flag is not support by GSO, and a GRO pkt may end up in the forwarding path, thus requiring GSO support to break it up correctly. Currently the "packet_offload" structure only contains L3 (ETH_P_IP/ ETH_P_IPV6) GRO offload support so the encapped pkts are limited to IP pkts (i.e., w/o L2 hdr). But support for other protocol type can be easily added, so is the support for GRE variations like NVGRE. The patch also support csum offload. Specifically if the csum flag is on and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE), the code will take advantage of the csum computed by the h/w when validating the GRE csum. Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre: add GRO capability" already introduces GRO capability to IPv4 GRE tunnels, using the gro_cells infrastructure. But GRO is done after GRE hdr has been removed (i.e., decapped). The following patch applies GRO when pkts first come in (before hitting the GRE tunnel code). There is some performance advantage for applying GRO as early as possible. Also this approach is transparent to other subsystem like Open vSwitch where GRE decap is handled outside of the IP stack hence making it harder for the gro_cells stuff to apply. On the other hand, some NICs are still not capable of hashing on the inner hdr of a GRE pkt (RSS). In that case the GRO processing of pkts from the same remote host will all happen on the same CPU and the performance may be suboptimal. I'm including some rough preliminary performance numbers below. Note that the performance will be highly dependent on traffic load, mix as usual. Moreover it also depends on NIC offload features hence the following is by no means a comprehesive study. Local testing and tuning will be needed to decide the best setting. All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs. (super_netperf 50 -H 192.168.1.18 -l 30) An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1 mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123) is configured. The GRO support for pkts AFTER decap are controlled through the device feature of the GRE device (e.g., ethtool -K gre1 gro on/off). 1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off thruput: 9.16Gbps CPU utilization: 19% 1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off thruput: 5.9Gbps CPU utilization: 15% 1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 12-13% 1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 10% The following tests were performed on a different NIC that is capable of csum offload. I.e., the h/w is capable of computing IP payload csum (CHECKSUM_COMPLETE). 2.1 ethtool -K gre1 gro on (hence will use gro_cells) 2.1.1 ethtool -K eth0 gro off; csum offload disabled thruput: 8.53Gbps CPU utilization: 9% 2.1.2 ethtool -K eth0 gro off; csum offload enabled thruput: 8.97Gbps CPU utilization: 7-8% 2.1.3 ethtool -K eth0 gro on; csum offload disabled thruput: 8.83Gbps CPU utilization: 5-6% 2.1.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.98Gbps CPU utilization: 5% 2.2 ethtool -K gre1 gro off 2.2.1 ethtool -K eth0 gro off; csum offload disabled thruput: 5.93Gbps CPU utilization: 9% 2.2.2 ethtool -K eth0 gro off; csum offload enabled thruput: 5.62Gbps CPU utilization: 8% 2.2.3 ethtool -K eth0 gro on; csum offload disabled thruput: 7.69Gbps CPU utilization: 8% 2.2.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.96Gbps CPU utilization: 5-6% Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8f84985f |
|
03-Jan-2014 |
Li RongQing <roy.qing.li@gmail.com> |
net: unify the pcpu_tstats and br_cpu_netstats as one They are same, so unify them as one, pcpu_sw_netstats. Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats from if_tunnel and remove br_cpu_netstats from br_private.h Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
86f8515f |
|
29-Dec-2013 |
Daniel Borkmann <daniel@iogearbox.net> |
net: netprio: rename config to be more consistent with cgroup configs While we're at it and introduced CGROUP_NET_CLASSID, lets also make NETPRIO_CGROUP more consistent with the rest of cgroups and rename it into CONFIG_CGROUP_NET_PRIO so that for networking, we now have CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG option consistent among networking cgroups, but also among cgroups CONFIG conventions in general as the vast majority has a prefix of CONFIG_CGROUP_<SUBSYS>. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Zefan Li <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
#
7a7ffbab |
|
26-Dec-2013 |
Wei-Chun Chao <weichunc@plumgrid.com> |
ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC VM to VM GSO traffic is broken if it goes through VXLAN or GRE tunnel and the physical NIC on the host supports hardware VXLAN/GRE GSO offload (e.g. bnx2x and next-gen mlx4). Two issues - (VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header integrity check fails in udp4_ufo_fragment if inner protocol is TCP. Also gso_segs is calculated incorrectly using skb->len that includes tunnel header. Fix: robust check should only be applied to the inner packet. (VXLAN & GRE) Once GSO header integrity check passes, NULL segs is returned and the original skb is sent to hardware. However the tunnel header is already pulled. Fix: tunnel header needs to be restored so that hardware can perform GSO properly on the original packet. Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d143d9f |
|
29-Dec-2013 |
stephen hemminger <stephen@networkplumber.org> |
net: core functions cleanup The following functions are not used outside of net/core/dev.c and should be declared static. call_netdevice_notifiers_info __dev_remove_offload netdev_has_any_upper_dev __netdev_adjacent_dev_remove __netdev_adjacent_dev_link_lists __netdev_adjacent_dev_unlink_lists __netdev_adjacent_dev_unlink __netdev_adjacent_dev_link_neighbour __netdev_adjacent_dev_unlink_neighbour And the following are never used and should be deleted netdev_lower_dev_get_private_rcu __netdev_find_adj_rcu Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2205369a |
|
31-Dec-2013 |
David S. Miller <davem@davemloft.net> |
vlan: Fix header ops passthru when doing TX VLAN offload. When the vlan code detects that the real device can do TX VLAN offloads in hardware, it tries to arrange for the real device's header_ops to be invoked directly. But it does so illegally, by simply hooking the real device's header_ops up to the VLAN device. This doesn't work because we will end up invoking a set of header_ops routines which expect a device type which matches the real device, but will see a VLAN device instead. Fix this by providing a pass-thru set of header_ops which will arrange to pass the proper real device instead. To facilitate this add a dev_rebuild_header(). There are implementations which provide a ->cache and ->create but not a ->rebuild (f.e. PLIP). So we need a helper function just like dev_hard_header() to avoid crashes. Use this helper in the one existing place where the header_ops->rebuild was being invoked, the neighbour code. With lots of help from Florian Westphal. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
477bb933 |
|
13-Dec-2013 |
stephen hemminger <shemming@brocade.com> |
net: remove dead code for add/del multiple These function to manipulate multiple addresses are not used anywhere in current net-next tree. Some out of tree code maybe using these but too bad; they should submit their code upstream.. Also, make __hw_addr_flush local since only used by dev_addr_lists.c Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e001bfad |
|
12-Dec-2013 |
dingtianhong <dingtianhong@huawei.com> |
bonding: create bond_first_slave_rcu() The bond_first_slave_rcu() will be used to instead of bond_first_slave() in rcu_read_lock(). According to the Jay Vosburgh's suggestion, the struct netdev_adjacent should hide from users who wanted to use it directly. so I package a new function to get the first slave of the bond. Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com> Suggested-by: Jay Vosburgh <fubar@us.ibm.com> Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a0a9663d |
|
12-Dec-2013 |
Sebastian Siewior <bigeasy@linutronix.de> |
net: make neigh_priv_len in struct net_device 16bit instead of 8bit neigh_priv_len is defined as u8. With all debug enabled struct ipoib_neigh has 200 bytes. The largest part is sk_buff_head with 96 bytes and here the spinlock with 72 bytes. The size value still fits in this u8 leaving some room for more. On -RT struct ipoib_neigh put on weight and has 392 bytes. The main reason is sk_buff_head with 288 and the fatty here is spinlock with 192 bytes. This does no longer fit into into neigh_priv_len and gcc complains. This patch changes neigh_priv_len from being 8bit to 16bit. Since the following element (dev_id) is 16bit followed by a spinlock which is aligned, the struct remains with a total size of 3200 (allmodconfig) / 2048 (with as much debug off as possible) bytes on x86-64. On x86-32 the struct is 1856 (allmodconfig) / 1216 (with as much debug off as possible) bytes long. The numbers were gained with and without the patch to prove that this change does not increase the size of the struct. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
299603e8 |
|
11-Dec-2013 |
Jerry Chu <hkchu@google.com> |
net-gro: Prepare GRO stack for the upcoming tunneling support This patch modifies the GRO stack to avoid the use of "network_header" and associated macros like ip_hdr() and ipv6_hdr() in order to allow an arbitary number of IP hdrs (v4 or v6) to be used in the encapsulation chain. This lays the foundation for various IP tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later. With this patch, the GRO stack traversing now is mostly based on skb_gro_offset rather than special hdr offsets saved in skb (e.g., skb->network_header). As a result all but the top layer (i.e., the the transport layer) must have hdrs of the same length in order for a pkt to be considered for aggregation. Therefore when adding a new encap layer (e.g., for tunneling), one must check and skip flows (e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a different hdr length. Note that unlike the network header, the transport header can and will continue to be set by the GRO code since there will be at most one "transport layer" in the encap chain. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37cb0620 |
|
10-Dec-2013 |
Ying Xue <ying.xue@windriver.com> |
tipc: remove TIPC usage of field af_packet_priv in struct net_device TIPC is currently using the field 'af_packet_priv' in struct net_device as a handle to find the bearer instance associated to the given network device. But, by doing so it is blocking other networking cleanups, such as the one discussed here: http://patchwork.ozlabs.org/patch/178044/ This commit removes this usage from TIPC. Instead, we introduce a new field, 'tipc_ptr', to the net_device structure, to serve this purpose. When TIPC bearer is enabled, the bearer object is associated to 'tipc_ptr'. When a TIPC packet arrives in the recv_msg() upcall from a networking device, the bearer object can now be obtained from 'tipc_ptr'. When a bearer is disabled, the bearer object is detached from its underlying network device by setting 'tipc_ptr' to NULL. Additionally, an RCU lock is used to protect the new pointer. Henceforth, the existing tipc_net_lock is used in write mode to serialize write accesses to this pointer, while the new RCU lock is applied on the read side to ensure that the pointer is 100% valid within its wrapped area for all readers. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Patrick McHardy <kaber@trash.net> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e6247027 |
|
05-Dec-2013 |
Eric Dumazet <edumazet@google.com> |
net: introduce dev_consume_skb_any() Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq() helpers to free skbs, both for dropped packets and TX completed ones. We need to separate the two causes to get better diagnostics given by dropwatch or "perf record -e skb:kfree_skb" This patch provides two new helpers, dev_consume_skb_any() and dev_consume_skb_irq() to be used for consumed skbs. __dev_kfree_skb_irq() is slightly optimized to remove one atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a6cc0cfa |
|
06-Nov-2013 |
John Fastabend <john.r.fastabend@intel.com> |
net: Add layer 2 hardware acceleration operations for macvlan devices Add a operations structure that allows a network interface to export the fact that it supports package forwarding in hardware between physical interfaces and other mac layer devices assigned to it (such as macvlans). This operaions structure can be used by virtual mac devices to bypass software switching so that forwarding can be done in hardware more efficiently. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df42153c |
|
07-Nov-2013 |
Dan Carpenter <dan.carpenter@oracle.com> |
net: make ndev->irq signed for error handling There is a bug in cpsw_probe() where we do: ndev->irq = platform_get_irq(pdev, 0); if (ndev->irq < 0) { The problem is that "ndev->irq" is unsigned so the error handling doesn't work. I have changed it to a regular int. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
74d332c1 |
|
30-Oct-2013 |
Eric Dumazet <edumazet@google.com> |
net: extend net_device allocation to vmalloc() Joby Poriyath provided a xen-netback patch to reduce the size of xenvif structure as some netdev allocation could fail under memory pressure/fragmentation. This patch is handling the problem at the core level, allowing any netdev structures to use vmalloc() if kmalloc() failed. As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Joby Poriyath <joby.poriyath@citrix.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
80c33ddd |
|
20-Sep-2013 |
Jacob Keller <jacob.e.keller@intel.com> |
net: add might_sleep() call to napi_disable napi_disable uses an msleep() call to wait for outstanding napi work to be finished after setting the disable bit. It does not always sleep incase there was no outstanding work. This resulted in a rare bug in ixgbe_down operation where a napi_disable call took place inside of a local_bh_disable()d context. In order to enable easier detection of future sleep while atomic BUGs, this patch adds a might_sleep() call, so that every use of napi_disable during atomic context will be visible. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: Alexander Duyck <alexander.duyck@intel.com> Cc: Hyong-Youb Kim <hykim@myri.com> Cc: Amir Vadai <amirv@mellanox.com> Cc: Dmitry Kravkov <dmitry@broadcom.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
7cc7c5e5 |
|
14-Oct-2013 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Delete trailing semi-colon from definition of netdev_WARN() Macro definitions should not normally end with a semi-colon, as this makes it dangerous to use them an if...else statement. Happily this has not happened yet. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3f79410c |
|
11-Oct-2013 |
Maxime Jayat <maxime@artisandeveloppeur.fr> |
treewide: Fix common typo in "identify" Correct common misspelling of "identify" as "indentify" throughout the kernel Signed-off-by: Maxime Jayat <maxime@artisandeveloppeur.fr> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
5cde2829 |
|
05-Oct-2013 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Separate the close_list and the unreg_list v2 Separate the unreg_list and the close_list in dev_close_many preventing dev_close_many from permuting the unreg_list. The permutations of the unreg_list have resulted in cases where the loopback device is accessed it has been freed in code such as dst_ifdown. Resulting in subtle memory corruption. This is the second bug from sharing the storage between the close_list and the unreg_list. The issues that crop up with sharing are apparently too subtle to show up in normal testing or usage, so let's forget about being clever and use two separate lists. v2: Make all callers pass in a close_list to dev_close_many Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3573540c |
|
02-Oct-2013 |
Michael S. Tsirkin <mst@redhat.com> |
netif_set_xps_queue: make cpu mask const virtio wants to pass in cpumask_of(cpu), make parameter const to avoid build warnings. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a528c219 |
|
24-Sep-2013 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
dev: update __dev_notify_flags() to send rtnl msg This patch only prepares the next one, there is no functional change. Now, __dev_notify_flags() can also be used to notify flags changes via rtnetlink. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f629d208 |
|
26-Sep-2013 |
Joe Perches <joe@perches.com> |
[networking]device.h: Remove extern from function prototypes There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com>
|
#
58292cbe |
|
11-Sep-2013 |
Tejun Heo <tj@kernel.org> |
sysfs: make attr namespace interface less convoluted sysfs ns (namespace) implementation became more convoluted than necessary while trying to hide ns information from visible interface. The relatively recent attr ns support is a good example. * attr ns tag is determined by sysfs_ops->namespace() callback while dir tag is determined by kobj_type->namespace(). The placement is arbitrary. * Instead of performing operations with explicit ns tag, the namespace callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(), class_attr_namespace(), class_attr->namespace(). It's not simpler in any sense. The only thing this convolution does is traversing the whole stack backwards. The namespace callbacks are unncessary because the operations involved are inherently synchronous. The information can be provided in in straight-forward top-down direction and reversing that direction is unnecessary and against basic design principles. This backward interface is unnecessarily convoluted and hinders properly separating out sysfs from driver model / kobject for proper layering. This patch updates attr ns support such that * sysfs_ops->namespace() and class_attr->namespace() are dropped. * sysfs_{create|remove}_file_ns(), which take explicit @ns param, are added and sysfs_{create|remove}_file() are now simple wrappers around the ns aware functions. * ns handling is dropped from sysfs_chmod_file(). Nobody uses it at this point. sysfs_chmod_file_ns() can be added later if necessary. * Explicit @ns is propagated through class_{create|remove}_file_ns() and netdev_class_{create|remove}_file_ns(). * driver/net/bonding which is currently the only user of attr namespace is updated to use netdev_class_{create|remove}_file_ns() with @bh->net as the ns tag instead of using the namespace callback. This patch should be an equivalent conversion without any functional difference. It makes the code easier to follow, reduces lines of code a bit and helps proper separation and layering. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
b6ccba4c |
|
25-Sep-2013 |
Veaceslav Falico <vfalico@redhat.com> |
net: add a possibility to get private from netdev_adjacent->list It will be useful to get first/last element. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
31088a11 |
|
25-Sep-2013 |
Veaceslav Falico <vfalico@redhat.com> |
net: add for_each iterators through neighbour lower link's private Add a possibility to iterate through netdev_adjacent's private, currently only for lower neighbours. Add both RCU and RTNL/other locking variants of iterators, and make the non-rcu variant to be safe from removal. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
402dae96 |
|
25-Sep-2013 |
Veaceslav Falico <vfalico@redhat.com> |
net: add netdev_adjacent->private and allow to use it Currently, even though we can access any linked device, we can't attach anything to it, which is vital to properly manage them. To fix this, add a new void *private to netdev_adjacent and functions setting/getting it (per link), so that we can save, per example, bonding's slave structures there, per slave device. netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to upper dev and populates the neighbour link only with private. netdev_lower_dev_get_private{,_rcu}() returns the private, if found. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f268f12 |
|
25-Sep-2013 |
Veaceslav Falico <vfalico@redhat.com> |
net: add adj_list to save only neighbours Currently, we distinguish neighbours (first-level linked devices) from non-neighbours by the neighbour bool in the netdev_adjacent. This could be quite time-consuming in case we would like to traverse *only* through neighbours - cause we'd have to traverse through all devices and check for this flag, and in a (quite common) scenario where we have lots of vlans on top of bridge, which is on top of a bond - the bonding would have to go through all those vlans to get its upper neighbour linked devices. This situation is really unpleasant, cause there are already a lot of cases when a device with slaves needs to go through them in hot path. To fix this, introduce a new upper/lower device lists structure - adj_list, which contains only the neighbours. It works always in pair with the all_adj_list structure (renamed from upper/lower_dev_list), i.e. both of them contain the same links, only that all_adj_list contains also non-neighbour device links. It's really a small change visible, currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't change the main linked logic at all. Also, add some comments a fix a name collision in netdev_for_each_upper_dev_rcu() and rework the naming by the following rules: netdev_(all_)(upper|lower)_* If "all_" is present, then we work with the whole list of upper/lower devices, otherwise - only with direct neighbours. Uninline functions - to get better stack traces. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
35e42379 |
|
13-Sep-2013 |
Joseph Gasparakis <joseph.gasparakis@intel.com> |
vxlan: Fix sparse warnings This patch fixes sparse warnings when incorrectly handling the port number and using int instead of unsigned int iterating through &vn->sock_list[]. Keeping the port as __be16 also makes things clearer wrt endianess. Also, it was pointed out that vxlan_get_rx_port() had unnecessary checks which got removed. Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0042d0c8 |
|
06-Sep-2013 |
Florian Fainelli <f.fainelli@gmail.com> |
net: add documentation for BQL helpers Provide a kernel-doc comment documentation for the BQL helpers: - netdev_sent_queue - netdev_completed_queue - netdev_reset_queue Similarly to how it is done for the other functions, the documentation only covers the function operating on struct net_device and not struct netdev_queue. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
53cf5275 |
|
04-Sep-2013 |
Joseph Gasparakis <joseph.gasparakis@intel.com> |
vxlan: Notify drivers for listening UDP port changes This patch adds two more ndo ops: ndo_add_rx_vxlan_port() and ndo_del_rx_vxlan_port(). Drivers can get notifications through the above functions about changes of the UDP listening port of VXLAN. Also, when physical ports come up, now they can call vxlan_get_rx_port() in order to obtain the port number(s) of the existing VXLAN interface in case they already up before them. This information about the listening UDP port would be used for VXLAN related offloads. A big thank you to John Fastabend (john.r.fastabend@intel.com) for his input and his suggestions on this patch set. CC: John Fastabend <john.r.fastabend@intel.com> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8b5be856 |
|
28-Aug-2013 |
Veaceslav Falico <vfalico@redhat.com> |
net: add netdev_for_each_upper_dev_rcu() The new macro netdev_for_each_upper_dev_rcu(dev, upper, iter) iterates through the dev->upper_dev_list starting from the first element, using the netdev_upper_get_next_dev_rcu(dev, &iter). Must be called under RCU read lock. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d261913 |
|
28-Aug-2013 |
Veaceslav Falico <vfalico@redhat.com> |
net: add lower_dev_list to net_device and make a full mesh This patch adds lower_dev_list list_head to net_device, which is the same as upper_dev_list, only for lower devices, and begins to use it in the same way as the upper list. It also changes the way the whole adjacent device lists work - now they contain *all* of upper/lower devices, not only the first level. The first level devices are distinguished by the bool neighbour field in netdev_adjacent, also added by this patch. There are cases when a device can be added several times to the adjacent list, the simplest would be: /---- eth0.10 ---\ eth0- --- bond0 \---- eth0.20 ---/ where both bond0 and eth0 'see' each other in the adjacent lists two times. To avoid duplication of netdev_adjacent structures ref_nr is being kept as the number of times the device was added to the list. The 'full view' is achieved by adding, on link creation, all of the upper_dev's upper_dev_list devices as upper devices to all of the lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice versa. On unlink they are removed using the same logic. I've tested it with thousands vlans/bonds/bridges, everything works ok and no observable lags even on a huge number of interfaces. Memory footprint for 128 devices interconnected with each other via both upper and lower (which is impossible, but for the comparison) lists would be: 128*128*2*sizeof(netdev_adjacent) = 1.5MB but in the real world we usualy have at most several devices with slaves and a lot of vlans, so the footprint will be much lower. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e0d1095a |
|
31-Jul-2013 |
Cong Wang <amwang@redhat.com> |
net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL Eliezer renames several *ll_poll to *busy_poll, but forgets CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too. Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
66b52b0d |
|
29-Jul-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: add ndo to get id of physical port of the device This patch adds a ndo for getting physical port of the device. Driver which is aware of being virtual function of some physical port should implement this ndo. This is applicable not only for IOV, but for other solutions (NPAR, multichannel) as well. Basically if there is possible to have multiple netdevs on the single hw port. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
18afa4b0 |
|
23-Jul-2013 |
Thomas Gleixner <tglx@linutronix.de> |
net: Make devnet_rename_seq static No users outside net/core/dev.c. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4aa5dee4 |
|
19-Jul-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: convert resend IGMP to notifier event Until now, bond_resend_igmp_join_requests() looks for vlans attached to bonding device, bridge where bonding act as port manually. It does not care of other scenarios, like stacked bonds or team device above. Make this more generic and use netdev notifier to propagate the event to upper devices and to actually call ip_mc_rejoin_groups(). Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8b80cda5 |
|
10-Jul-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: rename ll methods to busy-poll Rename ndo_ll_poll to ndo_busy_poll. Rename sk_mark_ll to sk_mark_napi_id. Rename skb_mark_ll to skb_mark_napi_id. Correct all useres of these functions. Update comments and defines in include/net/busy_poll.h Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5dbe7c17 |
|
26-Jun-2013 |
Nicolas Schichan <nschichan@freebox.fr> |
net: fix kernel deadlock with interface rename and netdev name retrieval. When the kernel (compiled with CONFIG_PREEMPT=n) is performing the rename of a network interface, it can end up waiting for a workqueue to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to the fact that read_secklock_begin() will spin forever waiting for the writer process (the one doing the interface rename) to update the devnet_rename_seq sequence. This patch fixes the problem by adding a helper (netdev_get_name()) and using it in the code handling the SIOCGIFNAME ioctl and SO_BINDTODEVICE setsockopt. The netdev_get_name() helper uses raw_seqcount_begin() to avoid spinning forever, waiting for devnet_rename_seq->sequence to become even. cond_resched() is used in the contended case, before retrying the access to give the writer process a chance to finish. The use of raw_seqcount_begin() will incur some unneeded work in the reader process in the contended case, but this is better than deadlocking the system. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d8faf48 |
|
13-Jun-2013 |
Rony Efraim <ronye@mellanox.com> |
net/core: Add VF link state control Add netlink directives and ndo entry to allow for controling VF link, which can be in one of three states: Auto - VF link state reflects the PF link state (default) Up - VF link state is up, traffic from VF to VF works even if the actual PF link is down Down - VF link state is down, no traffic from/to this VF, can be of use while configuring the VF Signed-off-by: Rony Efraim <ronye@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5f121b9a |
|
13-Jun-2013 |
Willem de Bruijn <willemb@google.com> |
net-rps: fixes for rps flow limit Caught by sparse: - __rcu: missing annotation to sd->flow_limit - __user: direct access in cpumask_scnprintf Also - add endline character when printing bitmap if room in buffer - avoid bucket overflow by reducing FLOW_LIMIT_HISTORY The last item warrants some explanation. The hashtable buckets are subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal to bucket size, since all packets may end up in a single bucket. The current (rather arbitrary) history value of 256 happens to match the buffer size (u8). As a result, with a single flow, the first 128 packets are accepted (correct), the second 128 packets dropped (correct) and then the history[] array has filled, so that each subsequent new packet causes an increment in the bucket for new_flow plus a decrement for old_flow: a steady state. This is fine if packets are dropped, as the steady state goes away as soon as a mix of traffic reappears. But, because the 256th packet overflowed the bucket to 0: no packets are dropped. Instead of explicitly adding an overflow check, this patch changes FLOW_LIMIT_HISTORY to never be able to overflow a single bucket. Reported-by: Fengguang Wu <fengguang.wu@intel.com> (first item) Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dffebd2c |
|
10-Jun-2013 |
Narendra K <narendra_k@dell.com> |
doc:networking: Update comment for dev_id field in netdevice.h This patch updates the comment for 'dev_id' field in 'include/linux/netdevice.h' to reflect the intended usage of 'dev_id'. References: http://marc.info/?l=linux-netdev&m=136992115300526&w=2 References: http://marc.info/?l=linux-netdev&m=137062569014612&w=2 Signed-off-by: Narendra K <narendra_k@dell.com> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
06021292 |
|
10-Jun-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: add low latency socket poll Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
af12fa6e |
|
10-Jun-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: add napi_id and hash Adds a napi_id and a hashing mechanism to lookup a napi by id. This will be used by subsequent patches to implement low latency Ethernet device polling. Based on a code sample by Eric Dumazet. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
75538c2b |
|
28-May-2013 |
Cong Wang <amwang@redhat.com> |
net: always pass struct netdev_notifier_info to netdevice notifiers commit 351638e7deeed2ec8ce451b53d3 (net: pass info struct via netdevice notifier) breaks booting of my KVM guest, this is due to we still forget to pass struct netdev_notifier_info in several places. This patch completes it. Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
be9efd36 |
|
27-May-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: pass changed flags along with NETDEV_CHANGE event Use new netdevice notifier infrastructure to pass along changed flags. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: shortened notifier_info struct name Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
351638e7 |
|
27-May-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: pass info struct via netdevice notifier So far, only net_device * could be passed along with netdevice notifier event. This patch provides a possibility to pass custom structure able to provide info that event listener needs to know. Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: fix typo on simeth shortened dev_getter shortened notifier_info struct name v1->v2: fix notifier_call parameter in call_netdevice_notifier() Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0d89d203 |
|
23-May-2013 |
Simon Horman <horms@verge.net.au> |
MPLS: Add limited GSO support In the case where a non-MPLS packet is received and an MPLS stack is added it may well be the case that the original skb is GSO but the NIC used for transmit does not support GSO of MPLS packets. The aim of this code is to provide GSO in software for MPLS packets whose skbs are GSO. SKB Usage: When an implementation adds an MPLS stack to a non-MPLS packet it should do the following to skb metadata: * Set skb->inner_protocol to the old non-MPLS ethertype of the packet. skb->inner_protocol is added by this patch. * Set skb->protocol to the new MPLS ethertype of the packet. * Set skb->network_header to correspond to the end of the L3 header, including the MPLS label stack. I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to kernel" which adds MPLS support to the kernel datapath of Open vSwtich. That patch sets the above requirements in datapath/actions.c:push_mpls() and was used to exercise this code. The datapath patch is against the Open vSwtich tree but it is intended that it be added to the Open vSwtich code present in the mainline Linux kernel at some point. Features: I believe that the approach that I have taken is at least partially consistent with the handling of other protocols. Jesse, I understand that you have some ideas here. I am more than happy to change my implementation. This patch adds dev->mpls_features which may be used by devices to advertise features supported for MPLS packets. A new NETIF_F_MPLS_GSO feature is added for devices which support hardware MPLS GSO offload. Currently no devices support this and MPLS GSO always falls back to software. Alternate Implementation: One possible alternate implementation is to teach netif_skb_features() and skb_network_protocol() about MPLS, in a similar way to their understanding of VLANs. I believe this would avoid the need for net/mpls/mpls_gso.c and in particular the calls to __skb_push() and __skb_push() in mpls_gso_segment(). I have decided on the implementation in this patch as it should not introduce any overhead in the case where mpls_gso is not compiled into the kernel or inserted as a module. MPLS GSO suggested by Jesse Gross. Based in part on "v4 GRE: Add TCP segmentation offload for GRE" by Pravin B Shelar. Cc: Jesse Gross <jesse@nicira.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
42e52bf9 |
|
24-May-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: add netnotifier event for upper device change Now when upper device is changed, event is not propagated via RT Netlink to userspace. Userspace might never now about the change. Fix this by adding upper-device-change notifier event. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
99bbc707 |
|
19-May-2013 |
Willem de Bruijn <willemb@google.com> |
rps: selective flow shedding during softnet overflow A cpu executing the network receive path sheds packets when its input queue grows to netdev_max_backlog. A single high rate flow (such as a spoofed source DoS) can exceed a single cpu processing rate and will degrade throughput of other flows hashed onto the same cpu. This patch adds a more fine grained hashtable. If the netdev backlog is above a threshold, IRQ cpus track the ratio of total traffic of each flow (using 4096 buckets, configurable). The ratio is measured by counting the number of packets per flow over the last 256 packets from the source cpu. Any flow that occupies a large fraction of this (set at 50%) will see packet drop while above the threshold. Tested: Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0, kernel receive (RPS) on cpu0 and application threads on cpus 2--7 each handling 20k req/s. Throughput halves when hit with a 400 kpps antagonist storm. With this patch applied, antagonist overload is dropped and the server processes its complete load. The patch is effective when kernel receive processing is the bottleneck. The above RPS scenario is a extreme, but the same is reached with RFS and sufficient kernel processing (iptables, packet socket tap, ..). Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b0ce3508 |
|
16-May-2013 |
Eric Dumazet <edumazet@google.com> |
bonding: allow TSO being set on bonding master In some situations, we need to disable TSO on bonding slaves. bonding device automatically unset TSO in bond_fix_features(), and performance is not good because : 1) We consume more cpu cycles. 2) GSO segmentation has some bugs leading to out of order TCP packets if this segmentation is done before virtual device. This particular problem will be addressed in a separate patch. This patch allows TSO being set/unset on the bonding master, so that GSO segmentation is done after bonding layer. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Michał Mirosław <mirqus@gmail.com> Cc: Jay Vosburgh <fubar@us.ibm.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
243198d0 |
|
05-May-2013 |
Al Viro <viro@ZenIV.linux.org.uk> |
rps_dev_flow_table_release(): no need to delay vfree() The same story as with fib_trie patch - vfree() from RCU callbacks is legitimate now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
80d5c368 |
|
18-Apr-2013 |
Patrick McHardy <kaber@trash.net> |
net: vlan: prepare for 802.1ad VLAN filtering offload Change the rx_{add,kill}_vid callbacks to take a protocol argument in preparation of 802.1ad support. The protocol argument used so far is always htons(ETH_P_8021Q). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f646968f |
|
18-Apr-2013 |
Patrick McHardy <kaber@trash.net> |
net: vlan: rename NETIF_F_HW_VLAN_* feature flags to NETIF_F_HW_VLAN_CTAG_* Rename the hardware VLAN acceleration features to include "CTAG" to indicate that they only support CTAGs. Follow up patches will introduce 802.1ad server provider tagging (STAGs) and require the distinction for hardware not supporting acclerating both. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4cd729b0 |
|
15-Apr-2013 |
Vlad Yasevich <vyasevic@redhat.com> |
net: add dev_uc_sync_multiple() and dev_mc_sync_multiple() api The current implementation of dev_uc_sync/unsync() assumes that there is a strict 1-to-1 relationship between the source and destination of the sync. In other words, once an address has been synced to a destination device, it will not be synced to any other device through the sync API. However, there are some virtual devices that aggreate a number of lower devices and need to sync addresses to all of them. The current API falls short there. This patch introduces a new dev_uc_sync_multiple() api that can be called in the above circumstances and allows sync to work for every invocation. CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4543fbef |
|
02-Apr-2013 |
Vlad Yasevich <vyasevic@redhat.com> |
net: count hw_addr syncs so that unsync works properly. A few drivers use dev_uc_sync/unsync to synchronize the address lists from master down to slave/lower devices. In some cases (bond/team) a single address list is synched down to multiple devices. At the time of unsync, we have a leak in these lower devices, because "synced" is treated as a boolean and the address will not be unsynced for anything after the first device/call. Treat "synced" as a count (same as refcount) and allow all unsync calls to work. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4c3d5e7b |
|
30-Mar-2013 |
Eric Dumazet <edumazet@google.com> |
net: reorder some fields of net_device As time passed, some fields were added in net_device, and not at sensible offsets. Lets reorder some fields to reduce number of cache lines in RX path. Fields not used in data path should be moved out of this critical cache line. In particular, move broadcast[] to the end of the rx section, as it is less used, and ethernet uses only the beginning of the 32bytes field. Before patch : offsetof(struct net_device,dev_addr)=0x258 offsetof(struct net_device,rx_handler)=0x2b8 offsetof(struct net_device,ingress_queue)=0x2c8 offsetof(struct net_device,broadcast)=0x278 After : offsetof(struct net_device,dev_addr)=0x280 offsetof(struct net_device,rx_handler)=0x298 offsetof(struct net_device,ingress_queue)=0x2a8 offsetof(struct net_device,broadcast)=0x2b0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3d40392 |
|
27-Mar-2013 |
Paul Bolle <pebolle@tiscali.nl> |
tokenring: delete last holdout of CONFIG_TR Tokenring support was deleted in v3.5. One last holdout of the macro CONFIG_TR escaped that fate. Until now. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24f11a5c |
|
27-Mar-2013 |
Dmitry Kravkov <dmitry@broadcom.com> |
line up comment for ndo_bridge_getlink Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
764444f5 |
|
13-Mar-2013 |
Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp> |
net: clean leftover of COMPAT_NET_DEV_OPS removal COMPAT_NET_DEV_OPS was removed a while back and with it the definition of netdev_resync_ops() went away. Let's finish the clean-up. Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8a7fbfab |
|
11-Mar-2013 |
nikolay@redhat.com <nikolay@redhat.com> |
netxen: write IP address to firmware when using bonding This patch allows LRO aggregation on bonded devices that contain an NX3031 device. It also adds a for_each_netdev_in_bond_rcu(bond, slave) macro which executes for each slave that has bond as master. V3: After testing and discussing this with Rajesh, I decided to keep the vlan ip cache and just rename it to ip_cache since it will store bond ip addresses too. A new master flag has been added to the ip cache to denote that the address has been added because of a master device. I've taken care of the enslave/release cases by checking for various combinations of events and flags (e.g. netxen has a master, it's a bond master and it's not marked as a slave means it is being enslaved and is dev_open()ed in bond_enslave). I've changed netxen_free_ip_list() to have a "master" parameter which causes all IP addresses marked as master to be deleted (used when a netxen is being released). I've made the patch use the new upper device API as well. The following cases were tested: - bond -> netxen - vlan -> netxen - vlan -> bond -> netxen V2: Remove local ip caching, retrieve addresses dynamically and restore them if necessary. Note: Tested with NX3031 adapter. Tested-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: Andy Gospodarek <agospoda@redhat.com> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ec5f0615 |
|
07-Mar-2013 |
Pravin B Shelar <pshelar@nicira.com> |
net: Kill link between CSUM and SG features. Earlier SG was unset if CSUM was not available for given device to force skb copy to avoid sending inconsistent csum. Commit c9af6db4c11c (net: Fix possible wrong checksum generation) added explicit flag to force copy to fix this issue. Therefore there is no need to link SG and CSUM, following patch kills this link between there two features. This patch is also required following patch in series. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
82dc3c63c |
|
05-Mar-2013 |
Eric Dumazet <edumazet@google.com> |
net: introduce NAPI_POLL_WEIGHT Some drivers use a too big NAPI poll weight. This patch adds a NAPI_POLL_WEIGHT default value and issues an error message if a driver attempts to use a bigger weight. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd061574 |
|
18-Feb-2013 |
Cong Wang <xiyou.wangcong@gmail.com> |
net: fix a build failure when !CONFIG_PROC_FS When !CONFIG_PROC_FS dev_mcast_init() is not defined, actually we can just merge dev_mcast_init() into dev_proc_init(). Reported-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
900ff8c6 |
|
18-Feb-2013 |
Cong Wang <xiyou.wangcong@gmail.com> |
net: move procfs code to net/core/net-procfs.c Similar to net/core/net-sysfs.c, group procfs code to a single unit. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
05e8ef4a |
|
14-Feb-2013 |
Pravin B Shelar <pshelar@nicira.com> |
net: factor out skb_mac_gso_segment() from skb_gso_segment() This function will be used in next GRE_GSO patch. This patch does not change any functionality. It only exports skb_mac_gso_segment() function. [ Use skb_reset_mac_len() -DaveM ] Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1690be63 |
|
12-Feb-2013 |
Vlad Yasevich <vyasevic@redhat.com> |
bridge: Add vlan support to static neighbors When a user adds bridge neighbors, allow him to specify VLAN id. If the VLAN id is not specified, the neighbor will be added for VLANs currently in the ports filter list. If no VLANs are configured on the port, we use vlan 0 and only add 1 entry. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6cbdceeb |
|
12-Feb-2013 |
Vlad Yasevich <vyasevic@redhat.com> |
bridge: Dump vlan information from a bridge port Using the RTM_GETLINK dump the vlan filter list of a given bridge port. The information depends on setting the filter flag similar to how nic VF info is dumped. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
407af329 |
|
12-Feb-2013 |
Vlad Yasevich <vyasevic@redhat.com> |
bridge: Add netlink interface to configure vlans on bridge ports Add a netlink interface to add and remove vlan configuration on bridge port. The interface uses the RTM_SETLINK message and encodes the vlan configuration inside the IFLA_AF_SPEC. It is possble to include multiple vlans to either add or remove in a single message. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
febf018d |
|
08-Feb-2013 |
David Ward <david.ward@ll.mit.edu> |
net/802: Implement Multiple Registration Protocol (MRP) Initial implementation of the Multiple Registration Protocol (MRP) from IEEE 802.1Q-2011, based on the existing implementation of the Generic Attribute Registration Protocol (GARP). Signed-off-by: David Ward <david.ward@ll.mit.edu> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
12b0004d |
|
05-Feb-2013 |
Cong Wang <amwang@redhat.com> |
net: adjust skb_gso_segment() for calling in rx path skb_gso_segment() is almost always called in tx path, except for openvswitch. It calls this function when it receives the packet and tries to queue it to user-space. In this special case, the ->ip_summed check inside skb_gso_segment() is no longer true, as ->ip_summed value has different meanings on rx path. This patch adjusts skb_gso_segment() so that we can at least avoid such warnings on checksum. Cc: Jesse Gross <jesse@nicira.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5fbee843 |
|
22-Jan-2013 |
Cong Wang <amwang@redhat.com> |
netpoll: add RCU annotation to npinfo field dev->npinfo is protected by RCU. This fixes the following sparse warnings: net/core/netpoll.c:177:48: error: incompatible types in comparison expression (different address spaces) net/core/netpoll.c:200:35: error: incompatible types in comparison expression (different address spaces) net/core/netpoll.c:221:35: error: incompatible types in comparison expression (different address spaces) net/core/netpoll.c:327:18: error: incompatible types in comparison expression (different address spaces) Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d07d7507 |
|
10-Jan-2013 |
Stanislaw Gruszka <sgruszka@redhat.com> |
net, wireless: overwrite default_ethtool_ops Since: commit 2c60db037034d27f8c636403355d52872da92f81 Author: Eric Dumazet <edumazet@google.com> Date: Sun Sep 16 09:17:26 2012 +0000 net: provide a default dev->ethtool_ops wireless core does not correctly assign ethtool_ops. After alloc_netdev*() call, some cfg80211 drivers provide they own ethtool_ops, but some do not. For them, wireless core provide generic cfg80211_ethtool_ops, which is assigned in NETDEV_REGISTER notify call: if (!dev->ethtool_ops) dev->ethtool_ops = &cfg80211_ethtool_ops; But after Eric's commit, dev->ethtool_ops is no longer NULL (on cfg80211 drivers without custom ethtool_ops), but points to &default_ethtool_ops. In order to fix the problem, provide function which will overwrite default_ethtool_ops and use it by wireless core. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
024e9679 |
|
10-Jan-2013 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add support for XPS without sysfs being defined This patch makes it so that we can support transmit packet steering without sysfs needing to be enabled. The reason for making this change is to make it so that a driver can make use of the XPS even while the sysfs portion of the interface is not present. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
537c00de |
|
10-Jan-2013 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add functions netif_reset_xps_queue and netif_set_xps_queue This patch adds two functions, netif_reset_xps_queue and netif_set_xps_queue. The main idea behind these two functions is to provide a mechanism through which drivers can update their defaults in regards to XPS. Currently no such mechanism exists and as a result we cannot use XPS for things such as ATR which would require a basic configuration to start in which the Tx queues are mapped to CPUs via a 1:1 mapping. With this change I am making it possible for drivers such as ixgbe to be able to use the XPS feature by controlling the default configuration. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
416186fb |
|
10-Jan-2013 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Split core bits of netdev_pick_tx into __netdev_pick_tx This change splits the core bits of netdev_pick_tx into a separate function. The main idea behind this is to make this code accessible to select queue functions when they decide to process the standard path instead of their own custom path in their select queue routine. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
85464ef2 |
|
03-Jan-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: kill dev->master Nobody uses this now. Remove it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8b98a70c |
|
03-Jan-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: remove no longer used netdev_set_bond_master() and netdev_set_master() Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9ff162a8 |
|
03-Jan-2013 |
Jiri Pirko <jiri@resnulli.us> |
net: introduce upper device lists This lists are supposed to serve for storing pointers to all upper devices. Eventually it will replace dev->master pointer which is used for bonding, bridge, team but it cannot be used for vlan, macvlan where there might be multiple upper present. In case the upper link is replacement for dev->master, it is marked with "master" flag. New upper device list resolves this limitation. Also, the information stored in lists is used for preventing looping setups like "bond->somethingelse->samebond" Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fbdeca2d |
|
31-Dec-2012 |
Jiri Pirko <jiri@resnulli.us> |
net: add address assign type "SET" This is the way to indicate that mac address of a device has been set by dev_set_mac_address() Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2681128f |
|
29-Dec-2012 |
Eric Dumazet <edumazet@google.com> |
veth: reduce stat overhead veth stats are a bit bloated. There is no need to account transmit and receive stats, since they are absolutely symmetric. Also use a per device atomic64_t for the dropped counter, as it should never be used in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4bf84c35 |
|
27-Dec-2012 |
Jiri Pirko <jiri@resnulli.us> |
net: add change_carrier netdev op This allows a driver to register change_carrier callback which will be called whenever user will like to change carrier state. This is useful for devices like dummy, gre, team and so on. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
30e6c9fa |
|
20-Dec-2012 |
Eric Dumazet <edumazet@google.com> |
net: devnet_rename_seq should be a seqcount Using a seqlock for devnet_rename_seq is not a good idea, as device_rename() can sleep. As we hold RTNL, we dont need a protection for writers, and only need a seqcount so that readers can catch a change done by a writer. Bug added in commit c91f6df2db4972d3 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name) Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6a674e9c |
|
07-Dec-2012 |
Joseph Gasparakis <joseph.gasparakis@intel.com> |
net: Add support for hardware-offloaded encapsulation This patch adds support in the kernel for offloading in the NIC Tx and Rx checksumming for encapsulated packets (such as VXLAN and IP GRE). For Tx encapsulation offload, the driver will need to set the right bits in netdev->hw_enc_features. The protocol driver will have to set the skb->encapsulation bit and populate the inner headers, so the NIC driver will use those inner headers to calculate the csum in hardware. For Rx encapsulation offload, the driver will need to set again the skb->encapsulation flag and the skb->ip_csum to CHECKSUM_UNNECESSARY. In that case the protocol driver should push the decapsulated packet up to the stack, again with CHECKSUM_UNNECESSARY. In ether case, the protocol driver should set the skb->encapsulation flag back to zero. Finally the protocol driver should have NETIF_F_RXCSUM flag set in its features. Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c3c7c254 |
|
06-Dec-2012 |
Eric Dumazet <edumazet@google.com> |
net: gro: fix possible panic in skb_gro_receive() commit 2e71a6f8084e (net: gro: selective flush of packets) added a bug for skbs using frag_list. This part of the GRO stack is rarely used, as it needs skb not using a page fragment for their skb->head. Most drivers do use a page fragment, but some of them use GFP_KERNEL allocations for the initial fill of their RX ring buffer. napi_gro_flush() overwrite skb->prev that was used for these skb to point to the last skb in frag_list. Fix this using a separate field in struct napi_gro_cb to point to the last fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb728820 |
|
28-Nov-2012 |
Rami Rosen <ramirose@gmail.com> |
core: make GRO methods static. This patch changes three methods to be static and removes their EXPORT_SYMBOLs in core/dev.c and their external declaration in netdevice.h. The methods, dev_gro_receive(), napi_frags_finish() and napi_skb_finish(), which are in the GRO rx path, are not used outside core/dev.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c91f6df2 |
|
25-Nov-2012 |
Brian Haley <brian.haley@hp.com> |
sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name Instead of having the getsockopt() of SO_BINDTODEVICE return an index, which will then require another call like if_indextoname() to get the actual interface name, have it return the name directly. This also matches the existing man page description on socket(7) which mentions the argument being an interface name. If the value has not been set, zero is returned and optlen will be set to zero to indicate there is no interface name present. Added a seqlock to protect this code path, and dev_ifname(), from someone changing the device name via dev_change_name(). v2: Added seqlock protection while copying device name. v3: Fixed word wrap in patch. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d93cf068 |
|
19-Sep-2012 |
Adam Buchbinder <adam.buchbinder@gmail.com> |
various: Fix spelling of "registered" in comments. Some comments misspell "registered"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
f191a1d1 |
|
15-Nov-2012 |
Vlad Yasevich <vyasevic@redhat.com> |
net: Remove code duplication between offload structures Move the offload callbacks into its own structure. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
22061d80 |
|
15-Nov-2012 |
Vlad Yasevich <vyasevic@redhat.com> |
net: Switch to using the new packet offload infrustructure Convert to using the new GSO/GRO registration mechanism and new packet offload structure. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
62532da9 |
|
15-Nov-2012 |
Vlad Yasevich <vyasevic@redhat.com> |
net: Add generic packet offload infrastructure. Create a new data structure to contain the GRO/GSO callbacks and add a new registration mechanism. Singed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e5a55a89 |
|
24-Oct-2012 |
John Fastabend <john.r.fastabend@intel.com> |
net: create generic bridge ops The PF_BRIDGE:RTM_{GET|SET}LINK nlmsg family and type are currently embedded in the ./net/bridge module. This prohibits them from being used by other bridging devices. One example of this being hardware that has embedded bridging components. In order to use these nlmsg types more generically this patch adds two net_device_ops hooks. One to set link bridge attributes and another to dump the current bride attributes. ndo_bridge_setlink() ndo_bridge_getlink() CC: Lennert Buytenhek <buytenh@wantstofly.org> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
607ca46e |
|
13-Oct-2012 |
David Howells <dhowells@redhat.com> |
UAPI: (Scripted) Disintegrate include/linux Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Dave Jones <davej@redhat.com>
|
#
86347245 |
|
08-Oct-2012 |
Eric Dumazet <edumazet@google.com> |
ipv6: gro: fix PV6_GRO_CB(skb)->proto problem It seems IPV6_GRO_CB(skb)->proto can be destroyed in skb_gro_receive() if a new skb is allocated (to serve as an anchor for frag_list) We copy NAPI_GRO_CB() only (not the IPV6 specific part) in : *NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p); So we leave IPV6_GRO_CB(nskb)->proto to 0 (fresh skb allocation) instead of IPPROTO_TCP (6) ipv6_gro_complete() isnt able to call ops->gro_complete() [ tcp6_gro_complete() ] Fix this by moving proto in NAPI_GRO_CB() and getting rid of IPV6_GRO_CB Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2e71a6f8 |
|
06-Oct-2012 |
Eric Dumazet <edumazet@google.com> |
net: gro: selective flush of packets Current GRO can hold packets in gro_list for almost unlimited time, in case napi->poll() handler consumes its budget over and over. In this case, napi_complete()/napi_gro_flush() are not called. Another problem is that gro_list is flushed in non friendly way : We scan the list and complete packets in the reverse order. (youngest packets first, oldest packets last) This defeats priorities that sender could have cooked. Since GRO currently only store TCP packets, we dont really notice the bug because of retransmits, but this behavior can add unexpected latencies, particularly on mice flows clamped by elephant flows. This patch makes sure no packet can stay more than 1 ms in queue, and only in stress situations. It also complete packets in the right order to minimize latencies. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jesse Gross <jesse@nicira.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ca07e43e |
|
06-Oct-2012 |
Eric Dumazet <edumazet@google.com> |
net: gro: fix a potential crash in skb_gro_reset_offset Before accessing skb first fragment, better make sure there is one. This is probably not needed for old kernels, since an ethernet frame cannot contain only an ethernet header, but the recent GRO addition to tunnels makes this patch needed. Also skb_gro_reset_offset() can be static, it actually allows compiler to inline it. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
edc7d573 |
|
30-Sep-2012 |
stephen hemminger <shemminger@vyatta.com> |
netlink: add attributes to fdb interface Later changes need to be able to refer to neighbour attributes when doing fdb_add. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
404f7c9e |
|
26-Sep-2012 |
Eric Dumazet <edumazet@google.com> |
net: struct napi_struct fields reordering Remove two holes on 64bit arches, and put dev_list at the end of napi_struct since its not used in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8c4c49df |
|
17-Sep-2012 |
Amerigo Wang <amwang@redhat.com> |
netpoll: call ->ndo_select_queue() in tx path In netpoll tx path, we miss the chance of calling ->ndo_select_queue(), thus could cause problems when bonding is involved. This patch makes dev_pick_tx() extern (and rename it to netdev_pick_tx()) to let netpoll call it in netpoll_send_skb_on_dev(). Reported-by: Sylvain Munaut <s.munaut@whatever-company.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <amwang@redhat.com> Tested-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b6e2725 |
|
17-Sep-2012 |
stephen hemminger <shemminger@vyatta.com> |
netdev: make address const in device address management The internal functions for add/deleting addresses don't change their argument. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b004ff49 |
|
12-Sep-2012 |
Joe Perches <joe@perches.com> |
netdev_printk/dynamic_netdev_dbg: Directly call printk_emit A lot of stack is used in recursive printks with %pV. Using multiple levels of %pV (a logging function with %pV that calls another logging function with %pV) can consume more stack than necessary. Avoid excessive stack use by not calling dev_printk from netdev_printk and dynamic_netdev_dbg. Duplicate the logic and form of dev_printk instead. Make __netdev_printk static. Remove EXPORT_SYMBOL(__netdev_printk) Whitespace and brace style neatening. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: David S. Miller <davem@davemloft.net> Tested-by: Jim Cromie <jim.cromie@gmail.com> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
23d3b8bf |
|
04-Sep-2012 |
Eric Dumazet <edumazet@google.com> |
net: qdisc busylock needs lockdep annotations It seems we need to provide ability for stacked devices to use specific lock_class_key for sch->busylock We could instead default l2tpeth tx_queue_len to 0 (no qdisc), but a user might use a qdisc anyway. (So same fixes are probably needed on non LLTX stacked drivers) Noticed while stressing L2TPV3 setup : ====================================================== [ INFO: possible circular locking dependency detected ] 3.6.0-rc3+ #788 Not tainted ------------------------------------------------------- netperf/4660 is trying to acquire lock: (l2tpsock){+.-...}, at: [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core] but task is already holding lock: (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&sch->busylock)->rlock){+.-...}: [<ffffffff810a5df0>] lock_acquire+0x90/0x200 [<ffffffff817499fc>] _raw_spin_lock_irqsave+0x4c/0x60 [<ffffffff81074872>] __wake_up+0x32/0x70 [<ffffffff8136d39e>] tty_wakeup+0x3e/0x80 [<ffffffff81378fb3>] pty_write+0x73/0x80 [<ffffffff8136cb4c>] tty_put_char+0x3c/0x40 [<ffffffff813722b2>] process_echoes+0x142/0x330 [<ffffffff813742ab>] n_tty_receive_buf+0x8fb/0x1230 [<ffffffff813777b2>] flush_to_ldisc+0x142/0x1c0 [<ffffffff81062818>] process_one_work+0x198/0x760 [<ffffffff81063236>] worker_thread+0x186/0x4b0 [<ffffffff810694d3>] kthread+0x93/0xa0 [<ffffffff81753e24>] kernel_thread_helper+0x4/0x10 -> #0 (l2tpsock){+.-...}: [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10 [<ffffffff810a5df0>] lock_acquire+0x90/0x200 [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50 [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth] [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70 [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290 [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00 [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890 [<ffffffff815db019>] ip_output+0x59/0xf0 [<ffffffff815da36d>] ip_local_out+0x2d/0xa0 [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680 [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60 [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30 [<ffffffff815f5300>] tcp_push_one+0x30/0x40 [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040 [<ffffffff81614495>] inet_sendmsg+0x125/0x230 [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0 [<ffffffff81579ece>] sys_sendto+0xfe/0x130 [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&sch->busylock)->rlock); lock(l2tpsock); lock(&(&sch->busylock)->rlock); lock(l2tpsock); *** DEADLOCK *** 5 locks held by netperf/4660: #0: (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e581c>] tcp_sendmsg+0x2c/0x1040 #1: (rcu_read_lock){.+.+..}, at: [<ffffffff815da3e0>] ip_queue_xmit+0x0/0x680 #2: (rcu_read_lock_bh){.+....}, at: [<ffffffff815d9ac5>] ip_finish_output+0x135/0x890 #3: (rcu_read_lock_bh){.+....}, at: [<ffffffff81595820>] dev_queue_xmit+0x0/0xe00 #4: (&(&sch->busylock)->rlock){+.-...}, at: [<ffffffff81596595>] dev_queue_xmit+0xd75/0xe00 stack backtrace: Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788 Call Trace: [<ffffffff8173dbf8>] print_circular_bug+0x1fb/0x20c [<ffffffff810a5288>] __lock_acquire+0x1628/0x1b10 [<ffffffff810a334b>] ? check_usage+0x9b/0x4d0 [<ffffffff810a3f44>] ? __lock_acquire+0x2e4/0x1b10 [<ffffffff810a5df0>] lock_acquire+0x90/0x200 [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffff817498c1>] _raw_spin_lock+0x41/0x50 [<ffffffffa0208db2>] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffffa0208db2>] l2tp_xmit_skb+0x172/0xa50 [l2tp_core] [<ffffffffa021a802>] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth] [<ffffffff815952b2>] dev_hard_start_xmit+0x502/0xa70 [<ffffffff81594e0e>] ? dev_hard_start_xmit+0x5e/0xa70 [<ffffffff81595961>] ? dev_queue_xmit+0x141/0xe00 [<ffffffff815b63ce>] sch_direct_xmit+0xfe/0x290 [<ffffffff81595a05>] dev_queue_xmit+0x1e5/0xe00 [<ffffffff81595820>] ? dev_hard_start_xmit+0xa70/0xa70 [<ffffffff815d9d60>] ip_finish_output+0x3d0/0x890 [<ffffffff815d9ac5>] ? ip_finish_output+0x135/0x890 [<ffffffff815db019>] ip_output+0x59/0xf0 [<ffffffff815da36d>] ip_local_out+0x2d/0xa0 [<ffffffff815da5a3>] ip_queue_xmit+0x1c3/0x680 [<ffffffff815da3e0>] ? ip_local_out+0xa0/0xa0 [<ffffffff815f4192>] tcp_transmit_skb+0x402/0xa60 [<ffffffff815fa25e>] ? tcp_md5_do_lookup+0x18e/0x1a0 [<ffffffff815f4a94>] tcp_write_xmit+0x1f4/0xa30 [<ffffffff815f5300>] tcp_push_one+0x30/0x40 [<ffffffff815e6672>] tcp_sendmsg+0xe82/0x1040 [<ffffffff81614495>] inet_sendmsg+0x125/0x230 [<ffffffff81614370>] ? inet_create+0x6b0/0x6b0 [<ffffffff8157e6e2>] ? sock_update_classid+0xc2/0x3b0 [<ffffffff8157e750>] ? sock_update_classid+0x130/0x3b0 [<ffffffff81576cdc>] sock_sendmsg+0xdc/0xf0 [<ffffffff81162579>] ? fget_light+0x3f9/0x4f0 [<ffffffff81579ece>] sys_sendto+0xfe/0x130 [<ffffffff810a69ad>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8174a0b0>] ? _raw_spin_unlock_irq+0x30/0x50 [<ffffffff810757e3>] ? finish_task_switch+0x83/0xf0 [<ffffffff810757a6>] ? finish_task_switch+0x46/0xf0 [<ffffffff81752cb7>] ? sysret_check+0x1b/0x56 [<ffffffff81752c92>] system_call_fastpath+0x16/0x1b Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8f4cccbb |
|
20-Aug-2012 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Set device operstate at registration time The operstate of a device is initially IF_OPER_UNKNOWN and is updated asynchronously by linkwatch after each change of carrier state reported by the driver. The default carrier state of a net device is on, and this will never be changed on drivers that do not support carrier detection, thus the operstate remains IF_OPER_UNKNOWN. For devices that do support carrier detection, the driver must set the carrier state to off initially, then poll the hardware state when the device is opened. However, we must not activate linkwatch for a unregistered device, and commit b473001 ('net: Do not fire linkwatch events until the device is registered.') ensured that we don't. But this means that the operstate for many devices that support carrier detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN. The same issue exists with the dormant state. The proper initialisation sequence, avoiding a race with opening of the device, is: rtnl_lock(); rc = register_netdevice(dev); if (rc) goto out_unlock; netif_carrier_off(dev); /* or netif_dormant_on(dev) */ rtnl_unlock(); but it seems silly that this should have to be repeated in so many drivers. Further, the operstate seen immediately after opening the device may still be IF_OPER_UNKNOWN due to the asynchronous nature of linkwatch. Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted to fix this by setting the operstate synchronously, but it was reverted as it could lead to deadlock. This initialises the operstate synchronously at registration time only. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0115e8e3 |
|
22-Aug-2012 |
Eric Dumazet <edumazet@google.com> |
net: remove delay at device dismantle I noticed extra one second delay in device dismantle, tracked down to a call to dst_dev_event() while some call_rcu() are still in RCU queues. These call_rcu() were posted by rt_free(struct rtable *rt) calls. We then wait a little (but one second) in netdev_wait_allrefs() before kicking again NETDEV_UNREGISTER. As the call_rcu() are now completed, dst_dev_event() can do the needed device swap on busy dst. To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called after a rcu_barrier(), but outside of RTNL lock. Use NETDEV_UNREGISTER_FINAL with care ! Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after IP cache removal. With help from Gao feng Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c0de08d0 |
|
16-Aug-2012 |
Eric Leblond <eric@regit.org> |
af_packet: don't emit packet on orig fanout group If a packet is emitted on one socket in one group of fanout sockets, it is transmitted again. It is thus read again on one of the sockets of the fanout group. This result in a loop for software which generate packets when receiving one. This retransmission is not the intended behavior: a fanout group must behave like a single socket. The packet should not be transmitted on a socket if it originates from a socket belonging to the same fanout group. This patch fixes the issue by changing the transmission check to take fanout group info account. Reported-by: Aleksandr Kotov <a1k@mail.ru> Signed-off-by: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
47be03a2 |
|
09-Aug-2012 |
Amerigo Wang <amwang@redhat.com> |
netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup() slave_enable_netpoll() and __netpoll_setup() may be called with read_lock() held, so should use GFP_ATOMIC to allocate memory. Eric suggested to pass gfp flags to __netpoll_setup(). Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b7bc2a5b |
|
09-Aug-2012 |
Amerigo Wang <amwang@redhat.com> |
net: remove netdev_bonding_change() I don't see any benifits to use netdev_bonding_change() than using call_netdevice_notifiers() directly. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee89bab1 |
|
09-Aug-2012 |
Amerigo Wang <amwang@redhat.com> |
net: move and rename netif_notify_peers() I believe net/core/dev.c is a better place for netif_notify_peers(), because other net event notify functions also stay in this file. And rename it to netdev_notify_peers(). Cc: David S. Miller <davem@davemloft.net> Cc: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
30b678d8 |
|
30-Jul-2012 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Allow driver to limit number of GSO segments per skb A peer (or local user) may cause TCP to use a nominal MSS of as little as 88 (actual MSS of 76 with timestamps). Given that we have a sufficiently prodigious local sender and the peer ACKs quickly enough, it is nevertheless possible to grow the window for such a connection to the point that we will try to send just under 64K at once. This results in a single skb that expands to 861 segments. In some drivers with TSO support, such an skb will require hundreds of DMA descriptors; a substantial fraction of a TX ring or even more than a full ring. The TX queue selected for the skb may stall and trigger the TX watchdog repeatedly (since the problem skb will be retried after the TX reset). This particularly affects sfc, for which the issue is designated as CVE-2012-3412. Therefore: 1. Add the field net_device::gso_max_segs holding the device-specific limit. 2. In netif_skb_features(), if the number of segments is too high then mask out GSO features to force fall back to software GSO. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee6ae1a1 |
|
19-Jul-2012 |
Jiri Pirko <jiri@resnulli.us> |
net: honour netif_set_real_num_tx_queues() retval In netif_copy_real_num_queues() the return value of netif_set_real_num_tx_queues() should be checked. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
16917b87 |
|
30-Jun-2012 |
Yuval Mintz <yuvalmin@broadcom.com> |
net-next: Add netif_get_num_default_rss_queues Most multi-queue networking driver consider the number of online cpus when configuring RSS queues. This patch adds a wrapper to the number of cpus, setting an upper limit on the number of cpus a driver should consider (by default) when allocating resources for his queues. Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95603e22 |
|
12-Jun-2012 |
Michel Machado <michel@digirati.com.br> |
net-next: add dev_loopback_xmit() to avoid duplicate code Add dev_loopback_xmit() in order to deduplicate functions ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c). I was about to reinvent the wheel when I noticed that ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I need and are not IP-only functions, but they were not available to reuse elsewhere. ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I understand that this is harmless, and should be in dev_loopback_xmit(). Signed-off-by: Michel Machado <michel@digirati.com.br> CC: "David S. Miller" <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jpirko@redhat.com> CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9136461a |
|
11-Jun-2012 |
Eric Dumazet <edumazet@google.com> |
net: keep name_hlist close to name __dev_get_by_name() is slow because pm_qos_req has been inserted between name[] and name_hlist, adding cache misses. pm_qos_req has nothing to do at the beginning of struct net_device Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0053ea9c |
|
30-May-2012 |
Joe Perches <joe@perches.com> |
netdevice: Update netif_dbg for CONFIG_DYNAMIC_DEBUG Make netif_dbg use dynamic debugging whenever CONFIG_DYNAMIC_DEBUG is enabled. commit b558c96ffa53 ("dynamic_debug: make dynamic-debug supersede DEBUG ccflag") missed updating the netif_dbg variant. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4adb9c4a |
|
18-May-2012 |
Eric Dumazet <edumazet@google.com> |
net: napi_frags_skb() is static No need to export napi_frags_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
28e85100 |
|
17-May-2012 |
David S. Miller <davem@davemloft.net> |
net: Remove netdevice ec_ptr, no longer used. ECONET is gone, thus this can be deleted as well. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
59b9997b |
|
10-May-2012 |
David S. Miller <davem@davemloft.net> |
Revert "net: maintain namespace isolation between vlan and real device" This reverts commit 8a83a00b0735190384a348156837918271034144. It causes regressions for S390 devices, because it does an unconditional DST drop on SKBs for vlans and the QETH device needs the neighbour entry hung off the DST for certain things on transmit. Arnd can't remember exactly why he even needed this change. Conflicts: drivers/net/macvlan.c net/8021q/vlan_dev.c net/core/dev.c Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d7e8883c |
|
30-Apr-2012 |
Eric Dumazet <edumazet@google.com> |
net: make GRO aware of skb->head_frag GRO can check if skb to be merged has its skb->head mapped to a page fragment, instead of a kmalloc() area. We 'upgrade' skb->head as a fragment in itself This avoids the frag_list fallback, and permits to build true GRO skb (one sk_buff and up to 16 fragments), using less memory. This reduces number of cache misses when user makes its copy, since a single sk_buff is fetched. This is a followup of patch "net: allow skb->head to be a page fragment" Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
12a94634 |
|
15-Apr-2012 |
John Fastabend <john.r.fastabend@intel.com> |
net: addr_list: add exclusive dev_uc_add and dev_mc_add This adds a dev_uc_add_excl() and dev_mc_add_excl() calls similar to the original dev_{uc|mc}_add() except it sets the global bit and returns -EEXIST for duplicat entires. This is useful for drivers that support SR-IOV, macvlan devices and any other devices that need to manage the unicast and multicast lists. v2: fix typo UNICAST should be MULTICAST in dev_mc_add_excl() CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77162022 |
|
15-Apr-2012 |
John Fastabend <john.r.fastabend@intel.com> |
net: add generic PF_BRIDGE:RTM_ FDB hooks This adds two new flags NTF_MASTER and NTF_SELF that can now be used to specify where PF_BRIDGE netlink commands should be sent. NTF_MASTER sends the commands to the 'dev->master' device for parsing. Typically this will be the linux net/bridge, or open-vswitch devices. Also without any flags set the command will be handled by the master device as well so that current user space tools continue to work as expected. The NTF_SELF flag will push the PF_BRIDGE commands to the device. In the basic example below the commands are then parsed and programmed in the embedded bridge. Note if both NTF_SELF and NTF_MASTER bits are set then the command will be sent to both 'dev->master' and 'dev' this allows user space to easily keep the embedded bridge and software bridge in sync. There is a slight complication in the case with both flags set when an error occurs. To resolve this the rtnl handler clears the NTF_ flag in the netlink ack to indicate which sets completed successfully. The add/del handlers will abort as soon as any error occurs. To support this new net device ops were added to call into the device and the existing bridging code was refactored to use these. There should be no required changes in user space to support the current bridge behavior. A basic setup with a SR-IOV enabled NIC looks like this, veth0 veth2 | | ------------ | bridge0 | <---- software bridging ------------ / / ethx.y ethx VF PF \ \ <---- propagate FDB entries to HW \ \ -------------------- | Embedded Bridge | <---- hardware offloaded switching -------------------- In this case the embedded bridge must be managed to allow 'veth0' to communicate with 'ethx.y' correctly. At present drivers managing the embedded bridge either send frames onto the network which then get dropped by the switch OR the embedded bridge will flood these frames. With this patch we have a mechanism to manage the embedded bridge correctly from user space. This example is specific to SR-IOV but replacing the VF with another PF or dropping this into the DSA framework generates similar management issues. Examples session using the 'br'[1] tool to add, dump and then delete a mac address with a new "embedded" option and enabled ixgbe driver: # br fdb add 22:35:19:ac:60:59 dev eth3 # br fdb port mac addr flags veth0 22:35:19:ac:60:58 static veth0 9a:5f:81:f7:f6:ec local eth3 00:1b:21:55:23:59 local eth3 22:35:19:ac:60:59 static veth0 22:35:19:ac:60:57 static #br fdb add 22:35:19:ac:60:59 embedded dev eth3 #br fdb port mac addr flags veth0 22:35:19:ac:60:58 static veth0 9a:5f:81:f7:f6:ec local eth3 00:1b:21:55:23:59 local eth3 22:35:19:ac:60:59 static veth0 22:35:19:ac:60:57 static eth3 22:35:19:ac:60:59 local embedded #br fdb del 22:35:19:ac:60:59 embedded dev eth3 I added a couple lines to 'br' to set the flags correctly is all. It is my opinion that the merit of this patch is now embedded and SW bridges can both be modeled correctly in user space using very nearly the same message passing. [1] 'br' tool was published as an RFC here and will be renamed 'bridge' http://patchwork.ozlabs.org/patch/117664/ Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for valuable feedback, suggestions, and review. v2: fixed api descriptions and error case with both NTF_SELF and NTF_MASTER set plus updated patch description. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95c96174 |
|
14-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2def16ae |
|
02-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: fix /proc/net/dev regression Commit f04565ddf52 (dev: use name hash for dev_seq_ops) added a second regression, as some devices are missing from /proc/net/dev if many devices are defined. When seq_file buffer is filled, the last ->next/show() method is canceled (pos value is reverted to value prior ->next() call) Problem is after above commit, we dont restart the lookup at right position in ->start() method. Fix this by removing the internal 'pos' pointer added in commit, since we need to use the 'loff_t *pos' provided by seq_file layer. This also reverts commit 5cac98dd0 (net: Fix corruption in /proc/*/net/dev_mcast), since its not needed anymore. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Mihai Maruseac <mmaruseac@ixiacom.com> Tested-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cdbee74c |
|
16-Mar-2012 |
Yi Zou <yi.zou@intel.com> |
net: do not do gso for CHECKSUM_UNNECESSARY in netif_needs_gso This is related to fixing the bug of dropping FCoE frames when disabling tx ip checksum by 'ethtool -K ethx tx off'. The FCoE protocol stack driver would use CHECKSUM_UNNECESSARY on tx path instead of CHECKSUM_PARTIAL (as indicated in the 2/2 of this series). To do so, netif_needs_gso() has to be changed here to not do gso for both CHECKSUM_PARTIAL and CHECKSUM_UNNECESSARY. Ref. to original discussion thread: http://patchwork.ozlabs.org/patch/146567/ Signed-off-by: Yi Zou <yi.zou@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0646819c |
|
15-Mar-2012 |
Yi Zou <yi.zou@intel.com> |
net: do not do gso for CHECKSUM_UNNECESSARY in netif_needs_gso This is related to fixing the bug of dropping FCoE frames when disabling tx ip checksum by 'ethtool -K ethx tx off'. The FCoE protocol stack driver would use CHECKSUM_UNNECESSARY on tx path instead of CHECKSUM_PARTIAL (as indicated in the 2/2 of this series). To do so, netif_needs_gso() has to be changed here to not do gso for both CHECKSUM_PARTIAL and CHECKSUM_UNNECESSARY. Ref. to original discussion thread: http://patchwork.ozlabs.org/patch/146567/ Signed-off-by: Yi Zou <yi.zou@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
313162d0 |
|
30-Jan-2012 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
device.h: audit and cleanup users in main include dir The <linux/device.h> header includes a lot of stuff, and it in turn gets a lot of use just for the basic "struct device" which appears so often. Clean up the users as follows: 1) For those headers only needing "struct device" as a pointer in fcn args, replace the include with exactly that. 2) For headers not really using anything from device.h, simply delete the include altogether. 3) For headers relying on getting device.h implicitly before being included themselves, now explicitly include device.h 4) For files in which doing #1 or #2 uncovers an implicit dependency on some other header, fix by explicitly adding the required header(s). Any C files that were implicitly relying on device.h to be present have already been dealt with in advance. Total removals from #1 and #2: 51. Total additions coming from #3: 9. Total other implicit dependencies from #4: 7. As of 3.3-rc1, there were 110, so a net removal of 42 gives about a 38% reduction in device.h presence in include/* Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
b37c0fbe |
|
06-Feb-2012 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Add memory barriers to prevent possible race in byte queue limits This change adds a memory barrier to the byte queue limit code to address a possible race as has been seen in the past with the netif_stop_queue/netif_wake_queue logic. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
5c490354 |
|
06-Feb-2012 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: Fix issue with netdev_tx_reset_queue not resetting queue from XOFF state We are seeing dev_watchdog hangs on several drivers. I suspect this is due to the __QUEUE_STATE_STACK_XOFF bit being set prior to a reset for link change, and then not being cleared by netdev_tx_reset_queue. This change corrects that. In addition we were seeing dev_watchdog hangs on igb after running the ethtool tests. We found this to be due to the fact that the ethtool test runs the same logic as ndo_start_xmit, but we were never clearing the XOFF flag since the loopback test in ethtool does not do byte queue accounting. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
4d29515f |
|
07-Mar-2012 |
David S. Miller <davem@davemloft.net> |
net: Use bool in netdevice.h helpers. Specifically use it in napi_disable_pending(), napi_schedule_prep(), napi_reschedule(), netif_tx_queue_stopped(), netif_queue_stopped(), netif_xmit_stopped(), netif_xmit_frozen_or_stopped(), netif_running(), __netif_subqueue_stopped(), netif_subqueue_stopped(), netif_is_multiquue(), netif_carrier_ok(), netif_dormant(), netif_oper_up(), netif_device_present(), __netif_tx_trylock(), net_gso_ok(), skb_gso_ok(), netif_needs_gso(), and netif_is_bond_slave(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95f050bf |
|
06-Mar-2012 |
David S. Miller <davem@davemloft.net> |
net: Use bool for return value of dev_valid_name(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
77a1abf5 |
|
04-Mar-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: export netdev_stats_to_stats64 Some drivers use internal netdev stats member to store part of their stats, yet advertize ndo_get_stats64() to implement some 64bit fields. Allow them to use netdev_stats_to_stats64() helper to make the copy of netdev stats before they compute their 64bit counters. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
187f1882 |
|
23-Nov-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
BUG: headers with BUG/BUG_ON etc. need linux/bug.h If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any other BUG variant in a static inline (i.e. not in a #define) then that header really should be including <linux/bug.h> and not just expecting it to be implicitly present. We can make this change risk-free, since if the files using these headers didn't have exposure to linux/bug.h already, they would have been causing compile failures/warnings. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
3bdc0eba |
|
11-Feb-2012 |
Ben Greear <greearb@candelatech.com> |
net: Add framework to allow sending packets with customized CRC. This is useful for testing RX handling of frames with bad CRCs. Requires driver support to actually put the packet on the wire properly. Signed-off-by: Ben Greear <greearb@candelatech.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
c5905afb |
|
24-Feb-2012 |
Ingo Molnar <mingo@elte.hu> |
static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
#
b558c96f |
|
19-Dec-2011 |
Jim Cromie <jim.cromie@gmail.com> |
dynamic_debug: make dynamic-debug supersede DEBUG ccflag If CONFIG_DYNAMIC_DEBUG is defined, honor it over DEBUG, so that pr_debug()s are controllable, instead of always-on. When DEBUG is also defined, change _DPRINTK_FLAGS_DEFAULT to enable printing by default. Also adding _DPRINTK_FLAGS_INCL_MODNAME would be nice, but there are numerous cases of pr_debug(NAME ": ...), which would result in double printing of module-name. So defer this until things settle. Cc: David Miller <davem@davemloft.net> Cc: Joe Perches <joe@perches.com> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
2429f7ac |
|
08-Jan-2012 |
Jiri Pirko <jpirko@redhat.com> |
net: introduce netif_addr_lock_nested() and call if when appropriate dev_uc_sync() and dev_mc_sync() are acquiring netif_addr_lock for destination device of synchronization. Since netif_addr_lock is already held at the time for source device, this triggers lockdep deadlock warning. There's no way this deadlock can happen so use spin_lock_nested() to silence the warning. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
68bad94e |
|
04-Jan-2012 |
Neerav Parikh <Neerav.Parikh@intel.com> |
netdev: FCoE: Add new ndo_get_fcoe_hbainfo() call This adds a new ndo_get_fcoe_hbainfo() call in net_device_ops for FCoE protocol stack. If supported by the underlying device, the FCoE protocol stack will call this to get device specific information from the underlying device. This information will then be utilized by the FCoE protocol stack to register Fiber Channel HBA attributes with the Fiber Channel Management Service via Fabric Device Management Interface (FDMI) as per the T11 FC-GS specification. Changes in v2: - As per comments from David Miller aligning the parameters of the ndo_get_fcoe_hbainfo() Signed-off-by: Neerav Parikh <Neerav.Parikh@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
60b778ce |
|
23-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
rfs: better sizing of dev_flow_table Aim of this patch is to provide full range of rps_flow_cnt on 64bit arches. Theorical limit on number of flows is 2^32 Fix some buggy RPS/RFS macros as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> CC: Xi Wang <xi.wang@gmail.com> CC: Laurent Chavey <chavey@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
933393f5 |
|
22-Dec-2011 |
Christoph Lameter <cl@linux.com> |
percpu: Remove irqsafe_cpu_xxx variants We simply say that regular this_cpu use must be safe regardless of preemption and interrupt state. That has no material change for x86 and s390 implementations of this_cpu operations. However, arches that do not provide their own implementation for this_cpu operations will now get code generated that disables interrupts instead of preemption. -tj: This is part of on-going percpu API cleanup. For detailed discussion of the subject, please refer to the following thread. http://thread.gmane.org/gmane.linux.kernel/1222078 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <alpine.DEB.2.00.1112221154380.11787@router.home>
|
#
447f2191 |
|
19-Dec-2011 |
David S. Miller <davem@davemloft.net> |
Revert "net: Remove unused neighbour layer ops." This reverts commit 5c3ddec73d01a1fae9409c197078cb02c42238c3. S390 qeth driver actually still uses the setup ops. Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c3ddec7 |
|
13-Dec-2011 |
David S. Miller <davem@davemloft.net> |
net: Remove unused neighbour layer ops. It's simpler to just keep these things out until there is a real user of them, so we can see what the needs actually are, rather than keep these things around as useless overhead. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b9ea6e0 |
|
07-Dec-2011 |
Jiri Pirko <jpirko@redhat.com> |
vlan: introduce vid list with reference counting This allows to keep track of vids needed to be in rx vlan filters of devices even if they are used in bond/team etc. vlan_info as well as vlan_group previously was, is allocated when first vid is added and dealocated whan last vid is deleted. vlan_group definition is moved to private header. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8e586137 |
|
08-Dec-2011 |
Jiri Pirko <jpirko@redhat.com> |
net: make vlan ndo_vlan_rx_[add/kill]_vid return error value Let caller know the result of adding/removing vlan id to/from vlan filter. In some drivers I make those functions to just return 0. But in those where there is able to see if hw setup went correctly, return value is set appropriately. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
65698610 |
|
01-Dec-2011 |
David S. Miller <davem@davemloft.net> |
net: Make ndo_neigh_destroy return void. The return value isn't used. Suggested by Ben Hucthings. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
da6a8fa0 |
|
24-Jul-2011 |
David Miller <davem@davemloft.net> |
neigh: Add device constructor/destructor capability. If the neigh entry has device private state, it will need constructor/destructor ops. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
596b9b68 |
|
24-Jul-2011 |
David Miller <davem@davemloft.net> |
neigh: Add infrastructure for allocating device neigh privates. netdev->neigh_priv_len records the private area length. This will trigger for neigh_table objects which set tbl->entry_size to zero, and the first instances of this will be forthcoming. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
114cf580 |
|
28-Nov-2011 |
Tom Herbert <therbert@google.com> |
bql: Byte queue limits Networking stack support for byte queue limits, uses dynamic queue limits library. Byte queue limits are maintained per transmit queue, and a dql structure has been added to netdev_queue structure for this purpose. Configuration of bql is in the tx-<n> sysfs directory for the queue under the byte_queue_limits directory. Configuration includes: limit_min, bql minimum limit limit_max, bql maximum limit hold_time, bql slack hold time Also under the directory are: limit, current byte limit inflight, current number of bytes on the queue Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c5d67bd7 |
|
28-Nov-2011 |
Tom Herbert <therbert@google.com> |
net: Add netdev interfaces for recording sends/comp Add interfaces for drivers to call for recording number of packets and bytes at send time and transmit completion. Also, added a function to "reset" a queue. These will be used by Byte Queue Limits. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
73466498 |
|
28-Nov-2011 |
Tom Herbert <therbert@google.com> |
net: Add queue state xoff flag for stack Create separate queue state flags so that either the stack or drivers can turn on XOFF. Added a set of functions used in the stack to determine if a queue is really stopped (either by stack or driver) Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5cac98dd |
|
27-Nov-2011 |
Anton Blanchard <anton@samba.org> |
net: Fix corruption in /proc/*/net/dev_mcast I just hit this during my testing. Isn't there another bug lurking? BUG kmalloc-8: Redzone overwritten INFO: 0xc0000000de9dec48-0xc0000000de9dec4b. First byte 0x0 instead of 0xcc INFO: Allocated in .__seq_open_private+0x30/0xa0 age=0 cpu=5 pid=3896 .__kmalloc+0x1e0/0x2d0 .__seq_open_private+0x30/0xa0 .seq_open_net+0x60/0xe0 .dev_mc_seq_open+0x4c/0x70 .proc_reg_open+0xd8/0x260 .__dentry_open.clone.11+0x2b8/0x400 .do_last+0xf4/0x950 .path_openat+0xf8/0x480 .do_filp_open+0x48/0xc0 .do_sys_open+0x140/0x250 syscall_exit+0x0/0x40 dev_mc_seq_ops uses dev_seq_start/next/stop but only allocates sizeof(struct seq_net_private) of private data, whereas it expects sizeof(struct dev_iter_state): struct dev_iter_state { struct seq_net_private p; unsigned int pos; /* bucket << BUCKET_SPACE + offset */ }; Create dev_seq_open_ops and use it so we don't have to expose struct dev_iter_state. [ Problem added by commit f04565ddf52e4 (dev: use name hash for dev_seq_ops) -Eric ] Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d11ead75 |
|
25-Nov-2011 |
Ben Hutchings <ben@decadent.org.uk> |
net: Use IS_ENABLED() in netdevice.h as appropriate Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34a430d7 |
|
25-Nov-2011 |
Ben Hutchings <ben@decadent.org.uk> |
dsa: Allow core and drivers to be built as modules Change the kconfig types to tristate and adjust the condition for declaring net_device::dsa_ptr to allow for this. Adjust the makefile so that if NET_DSA_MV88E6123_61_65=y and NET_DSA_MV88E6131=m or vice versa then both drivers are built-in. We could leave these options as bool and make NET_DSA_MV88E6XXX a user-selected option, but that would break existing configurations. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf50dcc2 |
|
25-Nov-2011 |
Ben Hutchings <ben@decadent.org.uk> |
dsa: Change dsa_uses_{dsa, trailer}_tags() into inline functions eth_type_trans() will use these functions if DSA is enabled, which blocks building DSA as a module. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5bc1421e |
|
21-Nov-2011 |
Neil Horman <nhorman@tuxdriver.com> |
net: add network priority cgroup infrastructure (v4) This patch adds in the infrastructure code to create the network priority cgroup. The cgroup, in addition to the standard processes file creates two control files: 1) prioidx - This is a read-only file that exports the index of this cgroup. This is a value that is both arbitrary and unique to a cgroup in this subsystem, and is used to index the per-device priority map 2) priomap - This is a writeable file. On read it reports a table of 2-tuples <name:priority> where name is the name of a network interface and priority is indicates the priority assigned to frames egresessing on the named interface and originating from a pid in this cgroup This cgroup allows for skb priority to be set prior to a root qdisc getting selected. This is benenficial for DCB enabled systems, in that it allows for any application to use dcb configured priorities so without application modification Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> CC: Robert Love <robert.w.love@intel.com> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
56c978f1 |
|
17-Nov-2011 |
Herbert Xu <herbert@gondor.apana.org.au> |
net: Remove LL_ALLOCATED_SPACE net: Remove LL_ALLOCATED_SPACE The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the alignment to the sum of needed_headroom and needed_tailroom. As the amount that is then reserved for head room is needed_headroom with alignment, this means that the tail room left may be too small. Now that all uses of LL_ALLOCATED_SPACE have been removed, this patch finally removes the macro itself. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
adc9300e |
|
16-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: use jump_label to shortcut RPS if not setup Most machines dont use RPS/RFS, and pay a fair amount of instructions in netif_receive_skb() / netif_rx() / get_rps_cpu() just to discover RPS/RFS is not setup. Add a jump_label named rps_needed. If no device rps_map or global rps_sock_flow_table is setup, netif_receive_skb() / netif_rx() do a single instruction instead of many ones, including conditional jumps. jmp +0 (if CONFIG_JUMP_LABEL=y) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ccf5ff69 |
|
15-Nov-2011 |
david decotigny <david.decotigny@google.com> |
net: new counter for tx_timeout errors in sysfs This adds the /sys/class/net/DEV/queues/Q/tx_timeout attribute containing the total number of timeout events on the given queue. It is always available with CONFIG_SYSFS, independently of CONFIG_RPS/XPS. Credits to Stephen Hemminger for a preliminary version of this patch. Tested: without CONFIG_SYSFS (compilation only) with sysfs and without CONFIG_RPS & CONFIG_XPS with sysfs and without CONFIG_RPS with sysfs and without CONFIG_XPS with defaults Signed-off-by: David Decotigny <david.decotigny@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0345e186 |
|
16-Nov-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: verify GSO flag bits against netdev features Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c8f44aff |
|
15-Nov-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: introduce and use netdev_features_t for device features sets v2: add couple missing conversions in drivers split unexporting netdev_fix_features() implemented %pNF convert sock::sk_route_(no?)caps Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a59e2ecb |
|
15-Nov-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: split netdev features to separate header Move features definitions to separate header so that linux/skbuff.h won't need to include linux/netdevice.h after netdev_features_t is introduced. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bc5787c6 |
|
15-Nov-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: remove legacy ethtool ops As all drivers are converted, we may now remove discrete offload setting callback handling. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b9075fa9 |
|
31-Oct-2011 |
Joe Perches <joe@perches.com> |
treewide: use __printf not __attribute__((format(printf,...))) Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=*.[ch] -w "__attribute__" * | \ grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5f8444a3 |
|
07-Oct-2011 |
Greg Rose <gregory.v.rose@intel.com> |
if_link: Add additional parameter to IFLA_VF_INFO for spoof checking Add configuration setting for drivers to turn spoof checking on or off for discrete VFs. v2 - Fix indentation problem, wrap the ifla_vf_info structure in #ifdef __KERNEL__ to prevent user space from accessing and change function paramater for the spoof check setting netdev op from u8 to bool. v3 - Preset spoof check setting to -1 so that user space tools such as ip can detect that the driver didn't report a spoofcheck setting. Prevents incorrect display of spoof check settings for drivers that don't report it. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
4bc71cb9 |
|
02-Sep-2011 |
Jiri Pirko <jpirko@redhat.com> |
net: consolidate and fix ethtool_ops->get_settings calling This patch does several things: - introduces __ethtool_get_settings which is called from ethtool code and from drivers as well. Put ASSERT_RTNL there. - dev_ethtool_get_settings() is replaced by __ethtool_get_settings() - changes calling in drivers so rtnl locking is respected. In iboe_get_rate was previously ->get_settings() called unlocked. This fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same problem. Also fixed by calling __dev_get_by_index() instead of dev_get_by_index() and holding rtnl_lock for both calls. - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create() so bnx2fc_if_create() and fcoe_if_create() are called locked as they are from other places. - use __ethtool_get_settings() in bonding code Signed-off-by: Jiri Pirko <jpirko@redhat.com> v2->v3: -removed dev_ethtool_get_settings() -added ASSERT_RTNL into __ethtool_get_settings() -prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock around it and __ethtool_get_settings() call v1->v2: add missing export_symbol Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits] Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3c9c36bc |
|
26-Aug-2011 |
Bhanu Prakash Gollapudi <bprakash@broadcom.com> |
net: Define NETDEV_FCOE_WWNN, NETDEV_FCOE_WWPN only when CONFIG_LIBFCOE is enabled bnx2fc driver calls netdev->netdev_ops->ndo_fcoe_get_wwn() and it may not be defined with the current Kconfig dependencies. ndo_fcoe_get_wwn is dependent on CONFIG_FCOE, but bnx2fc does not select CONFIG_FCOE, as it does not depend on fcoe driver. Since both fcoe and bnx2fc drivers select CONFIG_LIBFCOE, define NETDEV_FCOE_WWNN and NETDEV_FCOE_WWPN when CONFIG_LIBFCOE is defined. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Yi Zou <yi.zou@intel.com> Signed-off-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Reviewed-by: Yi Zou <yi.zou@intel.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cc749986 |
|
25-Aug-2011 |
Jean Pihet <j-pihet@ti.com> |
PM QoS: Minor clean-ups - Misc fixes to improve code readability: * rename struct pm_qos_request_list to struct pm_qos_request, * rename pm_qos_req parameter to req in internal code, consistenly use req in the API parameters, * update the in-kernel API callers to the new parameters names, * rename of fields names (requests, list, node, constraints) Signed-off-by: Jean Pihet <j-pihet@ti.com> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|
#
e8db0be1 |
|
25-Aug-2011 |
Jean Pihet <j-pihet@ti.com> |
PM QoS: Move and rename the implementation files The PM QoS implementation files are better named kernel/power/qos.c and include/linux/pm_qos.h. The PM QoS support is compiled under the CONFIG_PM option. Signed-off-by: Jean Pihet <j-pihet@ti.com> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|
#
b5fb0a03 |
|
11-Aug-2011 |
Jason Baron <jbaron@redhat.com> |
dynamic_debug: make netif_dbg() call __netdev_printk() Previously, netif_dbg() was using dynamic_dev_dbg() to perform the underlying printk. Fix it to use __netdev_printk(), instead. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
ffa10cb4 |
|
11-Aug-2011 |
Jason Baron <jbaron@redhat.com> |
dynamic_debug: make netdev_dbg() call __netdev_printk() Previously, if dynamic debug was enabled netdev_dbg() was using dynamic_dev_dbg() to print out the underlying msg. Fix this by making sure netdev_dbg() uses __netdev_printk(). Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
b81693d9 |
|
16-Aug-2011 |
Jiri Pirko <jpirko@redhat.com> |
net: remove ndo_set_multicast_list callback Remove no longer used operation. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
01789349 |
|
16-Aug-2011 |
Jiri Pirko <jpirko@redhat.com> |
net: introduce IFF_UNICAST_FLT private flag Use IFF_UNICAST_FTL to find out if driver handles unicast address filtering. In case it does not, promisc mode is entered. Patch also fixes following drivers: stmmac, niu: support uc filtering and yet it propagated ndo_set_multicast_list bna, benet, pxa168_eth, ks8851, ks8851_mll, ksz884x : has set ndo_set_rx_mode but do not support uc filtering Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
17dd759c |
|
27-Jul-2011 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Only reset frag0 when skb can be pulled Currently skb_gro_header_slow unconditionally resets frag0 and frag0_len. However, when we can't pull on the skb this leaves the GRO fields in an inconsistent state. This patch fixes this by only resetting those fields after the pskb_may_pull test. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
60063497 |
|
26-Jul-2011 |
Arun Sharma <asharma@fb.com> |
atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dcfe1421 |
|
25-Jul-2011 |
Amerigo Wang <amwang@redhat.com> |
notifiers: net: move netdevice notifiers into netdevice.h It is not necessary to share the same notifier.h. Signed-off-by: WANG Cong <amwang@redhat.com> Acked-by: David Miller <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2d348d1f |
|
25-Jul-2011 |
Joe Perches <joe@perches.com> |
net: Convert struct net_device uc_promisc to bool No need to use int, its uses are boolean. May save a few bytes one day. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3dbc46c |
|
20-Jul-2011 |
Choi, Jong-Hwan <jhbird.choi@samsung.com> |
net: Kobj and queues_kset should be used when CONFIG_XPS is enabled Kobj and queues_kset are used with CONFIG_XPS=y. Signed-off-by: Choi, Jong-Hwan <jhbird.choi@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7890a5b9 |
|
19-Jul-2011 |
Jiri Pirko <jpirko@redhat.com> |
vlan: kill ndo_vlan_rx_register has no users so remove it Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b23b5455 |
|
16-Jul-2011 |
David S. Miller <davem@davemloft.net> |
neigh: Kill hh_cache->hh_output It's just taking on one of two possible values, either neigh_ops->output or dev_queue_xmit(). And this is purely depending upon whether nud_state has NUD_CONNECTED set or not. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
62f2a3a4 |
|
13-Jul-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: remove NETIF_F_ALL_TX_OFFLOADS There is no software fallback implemented for SCTP or FCoE checksumming, and so it should not be passed on by software devices like bridge or bonding. For VLAN devices, this is different. First, the driver for underlying device should be prepared to get offloaded packets even when the feature is disabled (especially if it advertises it in vlan_features). Second, devices under VLANs do not get replaced without tearing down the VLAN first. This fixes a mess I accidentally introduced while converting bonding to ndo_fix_features. NETIF_F_SOFT_FEATURES are removed from BOND_VLAN_FEATURES because they are unused as of commit 712ae51afd. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fec30c33 |
|
13-Jul-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: unexport netdev_fix_features() It is not used anywhere except net/core/dev.c now. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6b72b62 |
|
14-Jul-2011 |
David S. Miller <davem@davemloft.net> |
net: Embed hh_cache inside of struct neighbour. Now that there is a one-to-one correspondance between neighbour and hh_cache entries, we no longer need: 1) dynamic allocation 2) attachment to dst->hh 3) refcounting Initialization of the hh_cache entry is indicated by hh_len being non-zero, and such initialization is always done with the neighbour's lock held as a writer. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5c25f686 |
|
13-Jul-2011 |
David S. Miller <davem@davemloft.net> |
net: Kill support for multiple hh_cache entries per neighbour This never, ever, happens. Neighbour entries are always tied to one address family, and therefore one set of dst_ops, and therefore one dst_ops->protocol "hh_type" value. This capability was blindly imported by Alexey Kuznetsov when he wrote the neighbour layer. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e69dd336 |
|
13-Jul-2011 |
David S. Miller <davem@davemloft.net> |
net: Push protocol type directly down to header_ops->cache() Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e2270ea6 |
|
12-Jul-2011 |
David S. Miller <davem@davemloft.net> |
netdevice: Kill 'feature' test macros. Almost all of these have long outstayed their welcome. And for every one of these macros, there are 10 features for which we didn't add macros. Let's just delete them all, and get out of habit of doing things this way. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
|
#
b7f080cf |
|
16-Jun-2011 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: remove mm.h inclusion from netdevice.h Remove linux/mm.h inclusion from netdevice.h -- it's unused (I've checked manually). To prevent mm.h inclusion via other channels also extract "enum dma_data_direction" definition into separate header. This tiny piece is what gluing netdevice.h with mm.h via "netdevice.h => dmaengine.h => dma-mapping.h => scatterlist.h => mm.h". Removal of mm.h from scatterlist.h was tried and was found not feasible on most archs, so the link was cutoff earlier. Hope people are OK with tiny include file. Note, that mm_types.h is still dragged in, but it is a separate story. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bff55273 |
|
07-Jun-2011 |
Alexander Duyck <alexander.h.duyck@intel.com> |
v2 ethtool: remove support for ETHTOOL_GRXNTUPLE This change is meant to remove all support for displaying an ntuple as strings via ETHTOOL_GRXNTUPLE. The reason for this change is due to the fact that multiple issues have been found including: - Multiple buffer overruns for strings being displayed. - Incorrect filters displayed, cleared filters with ring of -2 are displayed - Setting get_rx_ntuple displays no rules if defined. - Endianess wrong on displayed values. - Hard limit of 1024 filters makes display functionality extremely limited The only driver that had supported this interface was ixgbe. Since it no longer uses the interface and due to the issues mentioned above I am submitting this patch to remove it. v2: Updated based on comments from Ben Hutchings - Left ETH_SS_NTUPLE_FILTERS in code but commented on it being deprecated - Removed ethtool_rx_ntuple_list and ethtool_rx_ntuple_flow_spec_container - Left ETHTOOL_GRXNTUPLE but commented it as deprecated Also cleaned up set_rx_ntuple since there is no flow spec container to maintain we can drop all the code for the alloc and free of it and just return ops->set_rx_ntuple(). Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a6b7a407 |
|
06-Jun-2011 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: remove interrupt.h inclusion from netdevice.h * remove interrupt.g inclusion from netdevice.h -- not needed * fixup fallout, add interrupt.h and hardirq.h back where needed. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3019de12 |
|
06-Jun-2011 |
David S. Miller <davem@davemloft.net> |
net: Rework netdev_drivername() to avoid warning. This interface uses a temporary buffer, but for no real reason. And now can generate warnings like: net/sched/sch_generic.c: In function dev_watchdog net/sched/sch_generic.c:254:10: warning: unused variable drivername Just return driver->name directly or "". Reported-by: Connor Hansen <cmdkhh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
449f4544 |
|
18-May-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
macvlan: remove one synchronize_rcu() call When one macvlan device is dismantled, we can avoid one synchronize_rcu() call done after deletion from hash list, since caller will perform a synchronize_net() call after its ndo_stop() call. Add a new netdev->dismantle field to signal this dismantle intent. Reduces RTNL hold time. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> CC: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bdc220da |
|
09-May-2011 |
Joe Perches <joe@perches.com> |
netdevice.h: Align struct net_device members Save a bit of space. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
afe12cc8 |
|
06-May-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: introduce netdev_change_features() It will be needed by bonding and other drivers changing vlan_features after ndo_init callback. As a bonus, this includes kernel-doc for netdev_update_features(). Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f267051 |
|
09-May-2011 |
Yi Zou <yi.zou@intel.com> |
net: group FCoE related feature flags Michał Mirosław's patch (http://patchwork.ozlabs.org/patch/94421/) fixes the issue (http://patchwork.ozlabs.org/patch/94188/) about not populating FCoE related flags correctly on vlan devices. However, only NETIF_F_FCOE_CRC is part of the NETIF_F_ALL_TX_OFFLOADS right now, where weed NETIF_F_FCOE_MTU and NETIF_F_FSO as well. Therefore, add NETIF_F_ALL_FCOE to indicate feature flags used by FCoE TX offloads. These include NETIF_F_FCOE_CRC, NETIF_F_FCOE_MTU, and NETIF_F_FSO and add them to be part of NETIF_F_ALL_TX_OFFLOADS. This would eventually make sure all FCoE needed flags are populated properly to vlan devices. Signed-off-by: Yi Zou <yi.zou@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eed2a12f |
|
04-May-2011 |
Mahesh Bandewar <maheshb@google.com> |
net: Allow ethtool to set interface in loopback mode. This patch enables ethtool to set the loopback mode on a given interface. By configuring the interface in loopback mode in conjunction with a policy route / rule, a userland application can stress the egress / ingress path exposing the flows of the change in progress and potentially help developer(s) understand the impact of those changes without even sending a packet out on the network. Following set of commands illustrates one such example - a) ip -4 addr add 192.168.1.1/24 dev eth1 b) ip -4 rule add from all iif eth1 lookup 250 c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250 d) arp -Ds 192.168.1.100 eth1 e) arp -Ds 192.168.1.200 eth1 f) sysctl -w net.ipv4.ip_nonlocal_bind=1 g) sysctl -w net.ipv4.conf.all.accept_local=1 # Assuming that the machine has 8 cores h) taskset 000f netserver -L 192.168.1.200 i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30 Signed-off-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8ae6daca |
|
27-Apr-2011 |
David Decotigny <decot@google.com> |
ethtool: Call ethtool's get/set_settings callbacks with cleaned data This makes sure that when a driver calls the ethtool's get/set_settings() callback of another driver, the data passed to it is clean. This guarantees that speed_hi will be zeroed correctly if the called callback doesn't explicitely set it: we are sure we don't get a corrupted speed from the underlying driver. We also take care of setting the cmd field appropriately (ETHTOOL_GSET/SSET). This applies to dev_ethtool_get_settings(), which now makes sure it sets up that ethtool command parameter correctly before passing it to drivers. This also means that whoever calls dev_ethtool_get_settings() does not have to clean the ethtool command parameter. This function also becomes an exported symbol instead of an inline. All drivers visible to make allyesconfig under x86_64 have been updated. Signed-off-by: David Decotigny <decot@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fa2bd7ff |
|
22-Apr-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: allow user to change NETIF_F_HIGHDMA NETIF_F_HIGHDMA is like any other TX offloads, so allow user to toggle it. This is needed later for bridge and bonding convertsion to hw_features. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1742f183 |
|
22-Apr-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: fix netdev_increment_features() Simplify and fix netdev_increment_features() to conform to what is stated in netdevice.h comments about NETIF_F_ONE_FOR_ALL. Include FCoE segmentation and VLAN-challedged flags in computation. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a14842f |
|
20-Apr-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: filter: Just In Time compiler for x86-64 In order to speedup packet filtering, here is an implementation of a JIT compiler for x86_64 It is disabled by default, and must be enabled by the admin. echo 1 >/proc/sys/net/core/bpf_jit_enable It uses module_alloc() and module_free() to get memory in the 2GB text kernel range since we call helpers functions from the generated code. EAX : BPF A accumulator EBX : BPF X accumulator RDI : pointer to skb (first argument given to JIT function) RBP : frame pointer (even if CONFIG_FRAME_POINTER=n) r9d : skb->len - skb->data_len (headlen) r8 : skb->data To get a trace of generated code, use : echo 2 >/proc/sys/net/core/bpf_jit_enable Example of generated code : # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24 flen=18 proglen=147 pass=3 image=ffffffffa00b5000 JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60 JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00 JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00 JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0 JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00 JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00 JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24 JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31 JIT code: ffffffffa00b5090: c0 c9 c3 BPF program is 144 bytes long, so native program is almost same size ;) (000) ldh [12] (001) jeq #0x800 jt 2 jf 8 (002) ld [26] (003) and #0xffffff00 (004) jeq #0xc0a81400 jt 16 jf 5 (005) ld [30] (006) and #0xffffff00 (007) jeq #0xc0a81400 jt 16 jf 17 (008) jeq #0x806 jt 10 jf 9 (009) jeq #0x8035 jt 10 jf 17 (010) ld [28] (011) and #0xffffff00 (012) jeq #0xc0a81400 jt 16 jf 13 (013) ld [38] (014) and #0xffffff00 (015) jeq #0xc0a81400 jt 16 jf 17 (016) ret #65535 (017) ret #0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1aac6267 |
|
11-Apr-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: vlan_features comment clarification Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6e1a0d1 |
|
04-Apr-2011 |
Tom Herbert <therbert@google.com> |
net: Allow no-cache copy from user on transmit This patch uses __copy_from_user_nocache on transmit to bypass data cache for a performance improvement. skb_add_data_nocache and skb_copy_to_page_nocache can be called by sendmsg functions to use this feature, initial support is in tcp_sendmsg. This functionality is configurable per device using ethtool. Presumably, this feature would only be useful when the driver does not touch the data. The feature is turned on by default if a device indicates that it does some form of checksum offload; it is off by default for devices that do no checksum offload or indicate no checksum is necessary. For the former case copy-checksum is probably done anyway, in the latter case the device is likely loopback in which case the no cache copy is probably not beneficial. This patch was tested using 200 instances of netperf TCP_RR with 1400 byte request and one byte reply. Platform is 16 core AMD x86. No-cache copy disabled: 672703 tps, 97.13% utilization 50/90/99% latency:244.31 484.205 1028.41 No-cache copy enabled: 702113 tps, 96.16% utilization, 50/90/99% latency 238.56 467.56 956.955 Using 14000 byte request and response sizes demonstrate the effects more dramatically: No-cache copy disabled: 79571 tps, 34.34 %utlization 50/90/95% latency 1584.46 2319.59 5001.76 No-cache copy enabled: 83856 tps, 34.81% utilization 50/90/95% latency 2508.42 2622.62 2735.88 Note especially the effect on latency tail (95th percentile). This seems to provide a nice performance improvement and is consistent in the tests I ran. Presumably, this would provide the greatest benfits in the presence of an application workload stressing the cache and a lot of transmit data happening. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6cb6a27c |
|
02-Apr-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: Call netdev_features_change() from netdev_update_features() Issue FEAT_CHANGE notification when features are changed by netdev_update_features(). This will allow changes made by extra constraints on e.g. MTU change to be properly propagated like changes via ethtool. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4dd5ffe4 |
|
30-Mar-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: Fix dev dev_ethtool_get_rx_csum() for forced NETIF_F_RXCSUM dev_ethtool_get_rx_csum() won't report rx checksumming when it's not changeable and driver is converted to hw_features and friends. Fix this. (dev->hw_features & NETIF_F_RXCSUM) check is dropped - if the ethtool_ops->get_rx_csum is set, then driver is not coverted, yet. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8a4eb573 |
|
11-Mar-2011 |
Jiri Pirko <jpirko@redhat.com> |
net: introduce rx_handler results and logic around that This patch allows rx_handlers to better signalize what to do next to it's caller. That makes skb->deliver_no_wcard no longer needed. kernel-doc for rx_handler_result is taken from Nicolas' patch. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e9bce845 |
|
09-Mar-2011 |
Yi Zou <yi.zou@intel.com> |
net: add proper documentation for previously added net_device_ops for FCoE Add proper documentation for previously added net_device_ops ops for FCoE. Signed-off-by: Yi Zou <yi.zou@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
8909c9ad |
|
01-Mar-2011 |
Vasiliy Kulikov <segoon@openwall.com> |
net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules Since a8f80e8ff94ecba629542d9b4b5f5a8ee3eb565c any process with CAP_NET_ADMIN may load any module from /lib/modules/. This doesn't mean that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are limited to /lib/modules/**. However, CAP_NET_ADMIN capability shouldn't allow anybody load any module not related to networking. This patch restricts an ability of autoloading modules to netdev modules with explicit aliases. This fixes CVE-2011-1019. Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior of loading netdev modules by name (without any prefix) for processes with CAP_SYS_MODULE to maintain the compatibility with network scripts that use autoloading netdev modules by aliases like "eth0", "wlan0". Currently there are only three users of the feature in the upstream kernel: ipip, ip_gre and sit. root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) -- root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: fffffff800001000 CapEff: fffffff800001000 CapBnd: fffffff800001000 root@albatros:~# modprobe xfs FATAL: Error inserting xfs (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted root@albatros:~# lsmod | grep xfs root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit sit: error fetching interface information: Device not found root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit0 sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 root@albatros:~# lsmod | grep sit sit 10457 0 tunnel4 2957 1 sit For CAP_SYS_MODULE module loading is still relaxed: root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: ffffffffffffffff CapEff: ffffffffffffffff CapBnd: ffffffffffffffff root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs xfs 745319 0 Reference: https://lkml.org/lkml/2011/2/24/203 Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>
|
#
6247e086 |
|
01-Feb-2011 |
Yi Zou <yi.zou@intel.com> |
net: add ndo_fcoe_ddp_target() to support FCoE DDP in target mode The Fiber Channel over Ethernet (FCoE) Direct Data Placement (DDP) can also be used for FCoE target, where the DDP used for read I/O on an initiator can be used on an FCoE target to speed up the write I/O to the target from the initiator. The added ndo_fcoe_ddp_target() works in the similar way as the existing ndo_fcoe_ddp_setup() to allow the underlying hardware set up the DDP context accordingly when it gets called from the FCoE target implementation on top the existing Open-FCoE fcoe/libfc protocol stack so without losing the ability to provide DDP for read I/O as an initiator, it can also provide DDP offload to the write I/O coming from the initiator as a target. Signed-off-by: Yi Zou <yi.zou@intel.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
#
256ee435 |
|
01-Mar-2011 |
Joe Perches <joe@perches.com> |
netdevice: Convert printk to pr_info in netif_tx_stop_queue This allows any caller to be prefaced by any specific pr_fmt to better identify which device driver is using this function inappropriately. Add terminating newline. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
23b41168 |
|
26-Feb-2011 |
Vlad Dogaru <ddvlad@rosedu.org> |
netdevice: make initial group visible to userspace INIT_NETDEV_GROUP is needed by userspace, move it outside __KERNEL__ guards. Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e83d360d |
|
15-Feb-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: introduce NETIF_F_RXCSUM Introduce NETIF_F_RXCSUM to replace device-private flags for RX checksum offload. Integrate it with ndo_fix_features. ethtool_op_get_rx_csum() is removed altogether as nothing in-tree uses it. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5455c699 |
|
15-Feb-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: Introduce new feature setting ops This introduces a new framework to handle device features setting. It consists of: - new fields in struct net_device: + hw_features - features that hw/driver supports toggling + wanted_features - features that user wants enabled, when possible - new netdev_ops: + feat = ndo_fix_features(dev, feat) - API checking constraints for enabling features or their combinations + ndo_set_features(dev) - API updating hardware state to match changed dev->features - new ethtool commands: + ETHTOOL_GFEATURES/ETHTOOL_SFEATURES: get/set dev->wanted_features and trigger device reconfiguration if resulting dev->features changed + ETHTOOL_GSTRINGS(ETH_SS_FEATURES): get feature bits names (meaning) Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a417704 |
|
15-Feb-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
ethtool: factorize get/set_one_feature This allows to enable GRO even if RX csum is disabled. GRO will not be used for packets without hardware checksum anyway. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
212b573f |
|
15-Feb-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
ethtool: enable GSO and GRO by default Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fbaec0ea |
|
13-Feb-2011 |
Jiri Pirko <jpirko@redhat.com> |
rtnetlink: implement setting of master device This patch allows userspace to enslave/release slave devices via netlink interface using IFLA_MASTER. This introduces generic way to add/remove underling devices. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1765a575 |
|
11-Feb-2011 |
Jiri Pirko <jpirko@redhat.com> |
net: make dev->master general dev->master is now tightly connected to bonding driver. This patch makes this pointer more general and ready to be used by others. - netdev_set_master() - bond specifics moved to new function netdev_set_bond_master() - introduced netif_is_bond_slave() to check if device is a bonding slave Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d59cfde2 |
|
11-Feb-2011 |
Jiri Pirko <jpirko@redhat.com> |
net: remove the unnecessary dance around skb_bond_should_drop No need to check (master) twice and to drive in and out the header file. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ccf43438 |
|
26-Jan-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: fix dev_seq_next() Commit c6d14c84566d (net: Introduce for_each_netdev_rcu() iterator) added a race in dev_seq_next(). The rcu_dereference() call should be done _before_ testing the end of list, or we might return a wrong net_device if a concurrent thread changes net_device list under us. Note : discovered thanks to a sparse warning : net/core/dev.c:3919:9: error: incompatible types in comparison expression (different address spaces) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
acd1130e |
|
24-Jan-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: reduce and unify printk level in netdev_fix_features() Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will be used by ethtool and might spam dmesg unnecessarily. This converts the function to use netdev_info() instead of plain printk(). As a side effect, bonding and bridge devices will now log dropped features on every slave device change. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
04ed3e74 |
|
24-Jan-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: change netdev->features to u32 Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c445477d |
|
19-Jan-2011 |
Ben Hutchings <bhutchings@solarflare.com> |
net: RPS: Enable hardware acceleration of RFS Allow drivers for multiqueue hardware with flow filter tables to accelerate RFS. The driver must: 1. Set net_device::rx_cpu_rmap to a cpu_rmap of the RX completion IRQs (in queue order). This will provide a mapping from CPUs to the queues for which completions are handled nearest to them. 2. Implement net_device_ops::ndo_rx_flow_steer. This operation adds or replaces a filter steering the given flow to the given RX queue, if possible. 3. Periodically remove filters for which rps_may_expire_flow() returns true. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4f57c087 |
|
17-Jan-2011 |
John Fastabend <john.r.fastabend@intel.com> |
net: implement mechanism for HW based QOS This patch provides a mechanism for lower layer devices to steer traffic using skb->priority to tx queues. This allows for hardware based QOS schemes to use the default qdisc without incurring the penalties related to global state and the qdisc lock. While reliably receiving skbs on the correct tx ring to avoid head of line blocking resulting from shuffling in the LLD. Finally, all the goodness from txq caching and xps/rps can still be leveraged. Many drivers and hardware exist with the ability to implement QOS schemes in the hardware but currently these drivers tend to rely on firmware to reroute specific traffic, a driver specific select_queue or the queue_mapping action in the qdisc. By using select_queue for this drivers need to be updated for each and every traffic type and we lose the goodness of much of the upstream work. Firmware solutions are inherently inflexible. And finally if admins are expected to build a qdisc and filter rules to steer traffic this requires knowledge of how the hardware is currently configured. The number of tx queues and the queue offsets may change depending on resources. Also this approach incurs all the overhead of a qdisc with filters. With the mechanism in this patch users can set skb priority using expected methods ie setsockopt() or the stack can set the priority directly. Then the skb will be steered to the correct tx queues aligned with hardware QOS traffic classes. In the normal case with single traffic class and all queues in this class everything works as is until the LLD enables multiple tcs. To steer the skb we mask out the lower 4 bits of the priority and allow the hardware to configure upto 15 distinct classes of traffic. This is expected to be sufficient for most applications at any rate it is more then the 8021Q spec designates and is equal to the number of prio bands currently implemented in the default qdisc. This in conjunction with a userspace application such as lldpad can be used to implement 8021Q transmission selection algorithms one of these algorithms being the extended transmission selection algorithm currently being used for DCB. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cbda10fa |
|
13-Jan-2011 |
Vlad Dogaru <ddvlad@rosedu.org> |
net_device: add support for network device groups Net devices can now be grouped, enabling simpler manipulation from userspace. This patch adds a group field to the net_device structure, as well as rtnetlink support to query and modify it. Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ac9ad13 |
|
11-Jan-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: remove dev_txq_stats_fold() After recent changes, (percpu stats on vlan/tunnels...), we dont need anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters. Only remaining users are ixgbe, sch_teql, gianfar & macvlan : 1) ixgbe can be converted to use existing tx_ring counters. 2) macvlan incremented txq->tx_dropped, it can use the dev->stats.tx_dropped counter. 3) sch_teql : almost revert ab35cd4b8f42 (Use net_device internal stats) Now we have ndo_get_stats64(), use it, even for "unsigned long" fields (No need to bring back a struct net_device_stats) 4) gianfar adds a stats structure per tx queue to hold tx_bytes/tx_packets This removes a lockdep warning (and possible lockup) in rndis gadget, calling dev_get_stats() from hard IRQ context. Ref: http://www.spinics.net/lists/netdev/msg149202.html Reported-by: Neil Jones <neiljay@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Sandeep Gopalpet <sandeep.kumar@freescale.com> CC: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
36909ea4 |
|
09-Jan-2011 |
Tom Herbert <therbert@google.com> |
net: Add alloc_netdev_mqs function Added alloc_netdev_mqs function which allows the number of transmit and receive queues to be specified independenty. alloc_netdev_mq was changed to a macro to call the new function. Also added alloc_etherdev_mqs with same purpose. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fc741216 |
|
08-Jan-2011 |
Jesse Gross <jesse@nicira.com> |
net offloading: Pass features into netif_needs_gso(). Now that there is a single function that can compute the device features relevant to a packet, we don't want to run it for each offload. This converts netif_needs_gso() to take the features of the device, rather than computing them itself. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f01a5236 |
|
08-Jan-2011 |
Jesse Gross <jesse@nicira.com> |
net offloading: Generalize netif_get_vlan_features(). netif_get_vlan_features() is currently only used by netif_needs_gso(), so it only concerns itself with GSO features. However, several other places also should take into account the contents of the packet when deciding whether to offload to hardware. This generalizes the function to return features about all of the various forms of offloading. Since offloads tend to be linked together, this avoids duplicating the logic in each location (i.e. the scatter/gather code also needs the checksum logic). Suggested-by: Michał Mirosław <mirqus@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
68763c89 |
|
02-Jan-2011 |
Michal Simek <monstr@monstr.eu> |
trivial: Fix typo fault in netdevice.h Signed-off-by: Michal Simek <monstr@monstr.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b236da69 |
|
13-Dec-2010 |
Changli Gao <xiaosuo@gmail.com> |
net: use NUMA_NO_NODE instead of the magic number -1 Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a3d22a68 |
|
12-Dec-2010 |
Vladislav Zolotarov <vladz@broadcom.com> |
bnx2x: Take the distribution range definition out of skb_tx_hash() Move the calcualation of the Tx hash for a given hash range into a separate function and define the skb_tx_hash(), which calculates a Tx hash for a [0; dev->real_num_tx_queues - 1] hash values range, using this function (__skb_tx_hash()). Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
941666c2 |
|
04-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: RCU conversion of dev_getbyhwaddr() and arp_ioctl() Le dimanche 05 décembre 2010 à 09:19 +0100, Eric Dumazet a écrit : > Hmm.. > > If somebody can explain why RTNL is held in arp_ioctl() (and therefore > in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so > that your patch can be applied. > > Right now it is not good, because RTNL wont be necessarly held when you > are going to call arp_invalidate() ? While doing this analysis, I found a refcount bug in llc, I'll send a patch for net-2.6 Meanwhile, here is the patch for net-next-2.6 Your patch then can be applied after mine. Thanks [PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl() dev_getbyhwaddr() was called under RTNL. Rename it to dev_getbyhwaddr_rcu() and change all its caller to now use RCU locking instead of RTNL. Change arp_ioctl() to use RCU instead of RTNL locking. Note: this fix a dev refcount bug in llc Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f2cd2d3e |
|
29-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net sched: use xps information for qdisc NUMA affinity Allocate qdisc memory according to NUMA properties of cpus included in xps map. To be effective, qdisc should be (re)setup after changes of /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus I added a numa_node field in struct netdev_queue, containing NUMA node if all cpus included in xps_cpus share same node, else -1. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4177869 |
|
28-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
xps: add __rcu annotations Avoid sparse warnings : add __rcu annotations and use rcu_dereference_protected() where necessary. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bf264145 |
|
26-Nov-2010 |
Tom Herbert <therbert@google.com> |
xps: Add CONFIG_XPS This patch adds XPS_CONFIG option to enable and disable XPS. This is done in the same manner as RPS_CONFIG. This is also fixes build failure in XPS code when SMP is not enabled. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a0d2268 |
|
23-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add netif_tx_queue_frozen_or_stopped When testing struct netdev_queue state against FROZEN bit, we also test XOFF bit. We can test both bits at once and save some cycles. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1d24eb48 |
|
21-Nov-2010 |
Tom Herbert <therbert@google.com> |
xps: Transmit Packet Steering This patch implements transmit packet steering (XPS) for multiqueue devices. XPS selects a transmit queue during packet transmission based on configuration. This is done by mapping the CPU transmitting the packet to a queue. This is the transmit side analogue to RPS-- where RPS is selecting a CPU based on receive queue, XPS selects a queue based on the CPU (previously there was an XPS patch from Eric Dumazet, but that might more appropriately be called transmit completion steering). Each transmit queue can be associated with a number of CPUs which will use the queue to send packets. This is configured as a CPU mask on a per queue basis in: /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus The mappings are stored per device in an inverted data structure that maps CPUs to queues. In the netdevice structure this is an array of num_possible_cpu structures where each structure holds and array of queue_indexes for queues which that CPU can use. The benefits of XPS are improved locality in the per queue data structures. Also, transmit completions are more likely to be done nearer to the sending thread, so this should promote locality back to the socket on free (e.g. UDP). The benefits of XPS are dependent on cache hierarchy, application load, and other factors. XPS would nominally be configured so that a queue would only be shared by CPUs which are sharing a cache, the degenerative configuration woud be that each CPU has it's own queue. Below are some benchmark results which show the potential benfit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. bnx2x on 16 core AMD XPS (16 queues, 1 TX queue per CPU) 1234K at 100% CPU No XPS (16 queues) 996K at 100% CPU Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61391cde |
|
14-Nov-2010 |
stephen hemminger <shemminger@vyatta.com> |
netdev: add rcu annotations to receive handler hook Suggested by Eric's bridge RCU changes. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe822240 |
|
09-Nov-2010 |
Tom Herbert <therbert@google.com> |
net: Simplify RX queue allocation This patch move RX queue allocation to alloc_netdev_mq and freeing of the queues to free_netdev (symmetric to TX queue allocation). Each kobject RX queue takes a reference to the queue's device so that the device can't be freed before all the kobjects have been released-- this obviates the need for reference counts specific to RX queues. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
58e998c6 |
|
28-Oct-2010 |
Jesse Gross <jesse@nicira.com> |
offloading: Force software GSO for multiple vlan tags. We currently use vlan_features to check for TSO support if there is a vlan tag. However, it's quite likely that the NIC is not able to do TSO when there is an arbitrary number of tags. Therefore if there is more than one tag (in-band or out-of-band), fall back to software emulation. Signed-off-by: Jesse Gross <jesse@nicira.com> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fc766e4c |
|
28-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
decnet: RCU conversion and get rid of dev_base_lock While tracking dev_base_lock users, I found decnet used it in dnet_select_source(), but for a wrong purpose: Writers only hold RTNL, not dev_base_lock, so readers must use RCU if they cannot use RTNL. Adds an rcu_head in struct dn_ifaddr and handle proper RCU management. Adds __rcu annotation in dn_route as well. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
18543a64 |
|
06-Nov-2010 |
Guillaume Chazarain <guichaz@gmail.com> |
net: Detect and ignore netif_stop_queue() calls before register_netdev() After e6484930d7c73d324bccda7d43d131088da697b9: net: allocate tx queues in register_netdevice These calls make net drivers oops at load time, so let's avoid people git-bisect'ing known problems. Signed-off-by: Guillaume Chazarain <guichaz@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e3f7faf |
|
24-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
rps: add __rcu annotations Add __rcu annotations to : (struct netdev_rx_queue)->rps_map (struct netdev_rx_queue)->rps_flow_table struct rps_sock_flow_table *rps_sock_flow_table; And use appropriate rcu primitives. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3cc77ec7 |
|
24-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net/802: add __rcu annotations (struct net_device)->garp_port is rcu protected : (struct garp_port)->applicants is rcu protected : add __rcu annotation and proper rcu primitives. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
198caeca |
|
24-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv6: ip6_ptr rcu annotations (struct net_device)->ip6_ptr is rcu protected : add __rcu annotation and proper rcu primitives. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b616b09a |
|
24-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
vlan: rcu annotations (struct net_device)->vlgrp is rcu protected : add __rcu annotation and proper rcu primitives. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d0c2b0d2 |
|
19-Oct-2010 |
stephen hemminger <shemminger@vyatta.com> |
napi: unexport napi_reuse_skb The function napi_reuse_skb is only used inside core. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3701e513 |
|
20-Oct-2010 |
Jesse Gross <jesse@nicira.com> |
vlan: Centralize handling of hardware acceleration. Currently each driver that is capable of vlan hardware acceleration must be aware of the vlan groups that are configured and then pass the stripped tag to a specialized receive function. This is different from other types of hardware offload in that it places a significant amount of knowledge in the driver itself rather keeping it in the networking core. This makes vlan offloading function more similarly to other forms of offloading (such as checksum offloading or TSO) by doing the following: * On receive, stripped vlans are passed directly to the network core, without attempting to check for vlan groups or reconstructing the header if no group * vlans are made less special by folding the logic into the main receive routines * On transmit, the device layer will add the vlan header in software if the hardware doesn't support it, instead of spreading that logic out in upper layers, such as bonding. There are a number of advantages to this: * Fixes all bugs with drivers incorrectly dropping vlan headers at once. * Avoids having to disable VLAN acceleration when in promiscuous mode (good for bridging since it always puts devices in promiscuous mode). * Keeps VLAN tag separate until given to ultimate consumer, which avoids needing to do header reconstruction as in tg3 unless absolutely necessary. * Consolidates common code in core networking. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
65ac6a5f |
|
20-Oct-2010 |
Jesse Gross <jesse@nicira.com> |
vlan: Avoid hash table lookup to find group. A struct net_device always maps to zero or one vlan groups and we always know the device when we are looking up a group. We currently do a hash table lookup on the device to find the group but it is much simpler to just store a pointer. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7b9c6090 |
|
20-Oct-2010 |
Jesse Gross <jesse@nicira.com> |
vlan: Enable software emulation for vlan accleration. Currently users of hardware vlan accleration need to know whether the device supports it before generating packets. However, vlan acceleration will soon be available in a more flexible manner so knowing ahead of time becomes much more difficult. This adds a software fallback path for vlan packets on devices without the necessary offloading support, similar to other types of hardware accleration. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e6484930 |
|
18-Oct-2010 |
Tom Herbert <therbert@google.com> |
net: allocate tx queues in register_netdevice This patch introduces netif_alloc_netdev_queues which is called from register_device instead of alloc_netdev_mq. This makes TX queue allocation symmetric with RX allocation. Also, queue locks allocation is done in netdev_init_one_queue. Change set_real_num_tx_queues to fail if requested number < 1 or greater than number of allocated queues. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29b4433d |
|
11-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: percpu net_device refcount We tried very hard to remove all possible dev_hold()/dev_put() pairs in network stack, using RCU conversions. There is still an unavoidable device refcount change for every dst we create/destroy, and this can slow down some workloads (routers or some app servers, mmap af_packet) We can switch to a percpu refcount implementation, now dynamic per_cpu infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes per device. On x86, dev_hold(dev) code : before lock incl 0x280(%ebx) after: movl 0x260(%ebx),%eax incl fs:(%eax) Stress bench : (Sending 160.000.000 UDP frames, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_TRIE) Before: real 1m1.662s user 0m14.373s sys 12m55.960s After: real 0m51.179s user 0m15.329s sys 10m15.942s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34d101dd |
|
11-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
neigh: speedup neigh_hh_init() When a new dst is used to send a frame, neigh_resolve_output() tries to associate an struct hh_cache to this dst, calling neigh_hh_init() with the neigh rwlock write locked. Most of the time, hh_cache is already known and linked into neighbour, so we find it and increment its refcount. This patch changes the logic so that we call neigh_hh_init() with neighbour lock read locked only, so that fast path can be run in parallel by concurrent cpus. This brings part of the speedup we got with commit c7d4426a98a5f (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones, removing one of the contention point that routers hit on multiqueue enabled machines. Further improvements would need to use a seqlock instead of an rwlock to protect neigh->ha[], to not dirty neigh too often and remove two atomic ops. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
caf586e5 |
|
30-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add a core netdev->rx_dropped counter In various situations, a device provides a packet to our stack and we drop it before it enters protocol stack : - softnet backlog full (accounted in /proc/net/softnet_stat) - bad vlan tag (not accounted) - unknown/unregistered protocol (not accounted) We can handle a per-device counter of such dropped frames at core level, and automatically adds it to the device provided stats (rx_dropped), so that standard tools can be used (ifconfig, ip link, cat /proc/net/dev) This is a generalization of commit 8990f468a (net: rx_dropped accounting), thus reverting it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24824a09 |
|
02-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: dynamic ingress_queue allocation ingress being not used very much, and net_device->ingress_queue being quite a big object (128 or 256 bytes), use a dynamic allocation if needed (tc qdisc add dev eth0 ingress ...) dev_ingress_queue(dev) helper should be used only with RTNL taken. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bfa5ae63 |
|
27-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: rename netdev rx_queue to ingress_queue There is some confusion with rx_queue name after RPS, and net drivers private rx_queue fields. I suggest to rename "struct net_device"->rx_queue to ingress_queue. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6d81f41c |
|
27-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
dummy: percpu stats and lockless xmit Converts dummy network device driver to : - percpu stats - 64bit stats - lockless xmit (NETIF_F_LLTX) - performance features added (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3171d026 |
|
27-Sep-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Add netif_copy_real_num_queues() for use by virtual net drivers This sets the active numbers of queues on a net device to match another. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
62fe0b40 |
|
27-Sep-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Allow changing number of RX queues after device allocation For RPS, we create a kobject for each RX queue based on the number of queues passed to alloc_netdev_mq(). However, drivers generally do not determine the numbers of hardware queues to use until much later, so this usually represents the maximum number the driver may use and not the actual number in use. For TX queues, drivers can update the actual number using netif_set_real_num_tx_queues(). Add a corresponding function for RX queues, netif_set_real_num_rx_queues(). Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
290b895e |
|
26-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tunnels: prepare percpu accounting Tunnels are going to use percpu for their accounting. They are going to use a new tstats field in net_device. skb_tunnel_rx() is changed to be a wrapper around __skb_tunnel_rx() IPTUNNEL_XMIT() is changed to be a wrapper around __IPTUNNEL_XMIT() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a7855c78 |
|
23-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: loopback driver cleanup loopback driver uses dev->ml_priv to store its percpu stats pointer. It uses ugly casts "(void __percpu __force *)" to shut up sparse complains. Define an union to better document we use ml_priv in loopback driver and define a lstats field with appropriate types. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a02cec21 |
|
22-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: return operator cleanup Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8f8f103d |
|
19-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: reorder struct netdev_hw_addr Move 'synced' and 'global_use' fields before 'refcount', to shrinks struct netdev_hw_addr by 8 bytes (on 64bit arches). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd13539b |
|
15-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: shrinks struct net_device commit ab95bfe01 (net: replace hooks in __netif_receive_skb) added rx_handler at wrong place, between two cache line aligned objects, creating a big hole (a full cache line) Move rx_handler and rx_handler_data before rx_queue, filling existing hole. Move master field in the cache line(s) used in receive path. This saves 64 bytes (or L1_CACHE_BYTES), and avoids two possible cache misses in receive path. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
95ae6b22 |
|
14-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv4: ip_ptr cleanups dev->ip_ptr is protected by rtnl and rcu. Yet some places dont use appropriate primitives and/or locking rules. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
86cac58b |
|
31-Aug-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
skge: add GRO support - napi_gro_flush() is exported from net/core/dev.c, to avoid an irq_save/irq_restore in the packet receive path. - use napi_gro_receive() instead of netif_receive_skb() - use napi_gro_flush() before calling __napi_complete() - turn on NETIF_F_GRO by default - Tested on a Marvell 88E8001 Gigabit NIC Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4dc89133 |
|
31-Aug-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add a comment on netdev->last_rx As some driver authors seem to reintroduce dev->last_rx use, add a comment to strongly discourage this. Since commit 6cf3f41e6c0 (bonding, net: Move last_rx update into bonding recv logic), network drivers dont need to update last_rx themselves, unless they use this field to implement a timeout. Not updating last_rx helps not dirtying a cache line, improving performance in SMP. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1726442e |
|
23-Aug-2010 |
Simon Horman <horms@verge.net.au> |
net: increase the size of priv_flags and add IFF_OVS_DATAPATH IFF_OVS_DATAPATH is a place-holder for the Open vSwitch datapath which I am preparing to submit for merging. As all 16 bits of priv_flags are already assigned flags, also increase the size of priv_flags to 32 bits. Unfortunately, by my calculations this increases the size of struct net_device by 4 bytes on 32bit architectures and 8 bytes on 64 bit architectures. I couldn't see an obvious way to avoid that. Cc: Jesse Gross <jesse@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
21dc3301 |
|
23-Aug-2010 |
David S. Miller <davem@davemloft.net> |
net: Rename skb_has_frags to skb_has_frag_list SKBs can be "fragmented" in two ways, via a page array (called skb_shinfo(skb)->frags[]) and via a list of SKBs (called skb_shinfo(skb)->frag_list). Since skb_has_frags() tests the latter, it's name is confusing since it sounds more like it's testing the former. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
04600794 |
|
05-Aug-2010 |
Johannes Berg <johannes.berg@intel.com> |
cfg80211: support sysfs namespaces Enable using network namespaces with wireless devices even when sysfs is enabled using the same infrastructure that was built for netdevs. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
|
#
c1f79426 |
|
21-Jul-2010 |
Stefan Assmann <sassmann@redhat.com> |
sysfs: add attribute to indicate hw address assignment type Add addr_assign_type to struct net_device and expose it via sysfs. This new attribute has the purpose of giving user-space the ability to distinguish between different assignment types of MAC addresses. For example user-space can treat NICs with randomly generated MAC addresses differently than NICs that have permanent (locally assigned) MAC addresses. For the former udev could write a persistent net rule by matching the device path instead of the MAC address. There's also the case of devices that 'steal' MAC addresses from slave devices. In which it is also be beneficial for user-space to be aware of the fact. This patch also introduces a helper function to assist adoption of drivers that generate MAC addresses randomly. Signed-off-by: Stefan Assmann <sassmann@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bd27290a |
|
19-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: 64bit stats for netdev_queue Since struct netdev_queue tx_bytes/tx_packets/tx_dropped are already protected by _xmit_lock, its easy to convert these fields to u64 instead of unsigned long. This completes 64bit stats for devices using them (vlan, macvlan, ...) Strictly, we could avoid the locking in dev_txq_stats_fold() on 64bit arches, but its slow path and we prefer keep it simple. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c1f19b51 |
|
17-Jul-2010 |
Richard Cochran <richardcochran@gmail.com> |
net: support time stamping in phy devices. This patch adds a new networking option to allow hardware time stamps from PHY devices. When enabled, likely candidates among incoming and outgoing network packets are offered to the PHY driver for possible time stamping. When accepted by the PHY driver, incoming packets are deferred for later delivery by the driver. The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl and callbacks for transmit and receive time stamping. Drivers may optionally implement these functions. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
82f68251 |
|
05-Jul-2010 |
James Bottomley <James.Bottomley@suse.de> |
pm_qos: Get rid of the allocation in pm_qos_add_request() All current users of pm_qos_add_request() have the ability to supply the memory required by the pm_qos routines, so make them do this and eliminate the kmalloc() with pm_qos_add_request(). This has the double benefit of making the call never fail and allowing it to be called from atomic context. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|
#
d7753516 |
|
09-Jul-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Document that dev_get_stats() returns the given pointer Document that dev_get_stats() returns the same stats pointer it was given. Remove const qualification from the returned pointer since the caller may do what it likes with that structure. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3cfde79c |
|
09-Jul-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Get rid of rtnl_link_stats64 / net_device_stats union In commit be1f3c2c027cc5ad735df6a45a542ed1db7ec48b "net: Enable 64-bit net device statistics on 32-bit architectures" I redefined struct net_device_stats so that it could be used in a union with struct rtnl_link_stats64, avoiding the need for explicit copying or conversion between the two. However, this is unsafe because there is no locking required and no lock consistently held around calls to dev_get_stats() and use of the statistics structure it returns. In commit 28172739f0a276eb8d6ca917b3974c2edb036da3 "net: fix 64 bit counters on 32 bit arches" Eric Dumazet dealt with that problem by requiring callers of dev_get_stats() to provide storage for the result. This means that the net_device::stats64 field and the padding in struct net_device_stats are now redundant, so remove them. Update the comment on net_device_ops::ndo_get_stats64 to reflect its new usage. Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since that is what all its callers are really using and it is no longer going to be compatible with struct net_device_stats. Eric Dumazet suggested the separate function for the structure conversion. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
28172739 |
|
07-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: fix 64 bit counters on 32 bit arches There is a small possibility that a reader gets incorrect values on 32 bit arches. SNMP applications could catch incorrect counters when a 32bit high part is changed by another stats consumer/provider. One way to solve this is to add a rtnl_link_stats64 param to all ndo_get_stats64() methods, and also add such a parameter to dev_get_stats(). Rule is that we are not allowed to use dev->stats64 as a temporary storage for 64bit stats, but a caller provided area (usually on stack) Old drivers (only providing get_stats() method) need no changes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bcfcc450 |
|
02-Jul-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Fix definition of netif_vdbg() when VERBOSE_DEBUG is defined netif_vdbg() was originally defined as entirely equivalent to netdev_vdbg(), but I assume that it was intended to take the same parameters as netif_dbg() etc. (Currently it is only used by the sfc driver, in which I worked on that assumption.) In commit a4ed89c I changed the definition used when VERBOSE_DEBUG is not defined, but I failed to notice that the definition used when VERBOSE_DEBUG is defined was also not as I expected. Change that to match netif_dbg() as well. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f45f4321 |
|
26-Jun-2010 |
Joe Perches <joe@perches.com> |
netdevice.h: Change netif_<level> macros to call netdev_<level> functions Reduces text ~300 bytes of text (woohoo!) in an x86 defconfig $ size vmlinux* text data bss dec hex filename 7198526 720112 1366288 9284926 8dad3e vmlinux 7198862 720112 1366288 9285262 8dae8e vmlinux.netdev Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
256df2f3 |
|
26-Jun-2010 |
Joe Perches <joe@perches.com> |
netdevice.h net/core/dev.c: Convert netdev_<level> logging macros to functions Reduces an x86 defconfig text and data ~2k. text is smaller, data is larger. $ size vmlinux* text data bss dec hex filename 7198862 720112 1366288 9285262 8dae8e vmlinux 7205273 716016 1366288 9287577 8db799 vmlinux.device_h Uses %pV and struct va_format Format arguments are verified before printk Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f0796d5c |
|
01-Jul-2010 |
John Fastabend <john.r.fastabend@intel.com> |
net: decreasing real_num_tx_queues needs to flush qdisc Reducing real_num_queues needs to flush the qdisc otherwise skbs with queue_mappings greater then real_num_tx_queues can be sent to the underlying driver. The flow for this is, dev_queue_xmit() dev_pick_tx() skb_tx_hash() => hash using real_num_tx_queues skb_set_queue_mapping() ... qdisc_enqueue_root() => enqueue skb on txq from hash ... dev->real_num_tx_queues -= n ... sch_direct_xmit() dev_hard_start_xmit() ndo_start_xmit(skb,dev) => skb queue set with old hash skbs are enqueued on the qdisc with skb->queue_mapping set 0 < queue_mappings < real_num_tx_queues. When the driver decreases real_num_tx_queues skb's may be dequeued from the qdisc with a queue_mapping greater then real_num_tx_queues. This fixes a case in ixgbe where this was occurring with DCB and FCoE. Because the driver is using queue_mapping to map skbs to tx descriptor rings we can potentially map skbs to rings that no longer exist. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d29c0c5c |
|
14-Jun-2010 |
Herbert Xu <herbert@gondor.apana.org.au> |
udp: Add UFO to NETIF_F_SOFTWARE_GSO This patch adds UFO to the list of GSO features with a software fallback. This allows UFO to be used even if the hardware does not support it. In particular, this allows us to test the UFO fallback, as it has been reported to not work in some cases. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
82695d9b |
|
15-Jun-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Fix error in comment on net_device_ops::ndo_get_stats ndo_get_stats still returns struct net_device_stats *; there is no struct net_device_stats64. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f350a0a8 |
|
15-Jun-2010 |
Jiri Pirko <jpirko@redhat.com> |
bridge: use rx_handler_data pointer to store net_bridge_port pointer Register net_bridge_port pointer as rx_handler data pointer. As br_port is removed from struct net_device, another netdev priv_flag is added to indicate the device serves as a bridge port. Also rcuized pointers are now correctly dereferenced in br_fdb.c and in netfilter parts. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a35e2c1b |
|
14-Jun-2010 |
Jiri Pirko <jpirko@redhat.com> |
macvlan: use rx_handler_data pointer to store macvlan_port pointer V2 Register macvlan_port pointer as rx_handler data pointer. As macvlan_port is removed from struct net_device, another netdev priv_flag is added to indicate the device serves as a macvlan port. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
93e2c32b |
|
09-Jun-2010 |
Jiri Pirko <jpirko@redhat.com> |
net: add rx_handler data pointer Add possibility to register rx_handler data pointer along with a rx_handler. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4247e161 |
|
10-Jun-2010 |
Herbert Xu <herbert@gondor.apana.org.au> |
netpoll: Add ndo_netpoll_setup This patch adds ndo_netpoll_setup as the initialisation primitive to complement ndo_netpoll_cleanup. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
be1f3c2c |
|
08-Jun-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Enable 64-bit net device statistics on 32-bit architectures Use struct rtnl_link_stats64 as the statistics structure. On 32-bit architectures, insert 32 bits of padding after/before each field of struct net_device_stats to make its layout compatible with struct rtnl_link_stats64. Add an anonymous union in net_device; move stats into the union and add struct rtnl_link_stats64 stats64. Add net_device_ops::ndo_get_stats64, implementations of which will return a pointer to struct rtnl_link_stats64. Drivers that implement this operation must not update the structure asynchronously. Change dev_get_stats() to call ndo_get_stats64 if available, and to return a pointer to struct rtnl_link_stats64. Change callers of dev_get_stats() accordingly. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb69ae04 |
|
07-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
anycast: Some RCU conversions - dev_get_by_flags() changed to dev_get_by_flags_rcu() - ipv6_sock_ac_join() dont touch dev & idev refcounts - ipv6_sock_ac_drop() dont touch dev & idev refcounts - ipv6_sock_ac_close() dont touch dev & idev refcounts - ipv6_dev_ac_dec() dount touch idev refcount - ipv6_chk_acast_addr() dont touch idev refcount Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
724df615 |
|
26-May-2010 |
Justin P. Mattock <justinmattock@gmail.com> |
fix comment typo in netdevice.h Fix missing "of" in comment. Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
#
ab95bfe0 |
|
01-Jun-2010 |
Jiri Pirko <jpirko@redhat.com> |
net: replace hooks in __netif_receive_skb V5 What this patch does is it removes two receive frame hooks (for bridge and for macvlan) from __netif_receive_skb. These are replaced them with a single hook for both. It only supports one hook per device because it makes no sense to do bridging and macvlan on the same device. Then a network driver (of virtual netdev like macvlan or bridge) can register an rx_handler for needed net device. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c2d9ba9b |
|
01-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: CONFIG_NET_NS reduction Use read_pnet() and write_pnet() to reduce number of ifdef CONFIG_NET_NS Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
06c4648d |
|
25-May-2010 |
Ian Campbell <Ian.Campbell@citrix.com> |
arp_notify: allow drivers to explicitly request a notification event. Currently such notifications are only generated when the device comes up or the address changes. However one use case for these notifications is to enable faster network recovery after a virtual machine migration (by causing switches to relearn their MAC tables). A migration appears to the network stack as a temporary loss of carrier and therefore does not trigger either of the current conditions. Rather than adding carrier up as a trigger (which can cause issues when interfaces a flapping) simply add an interface which the driver can use to explicitly trigger the notification. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Stephen Hemminger <shemminger@linux-foundation.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David S. Miller <davem@davemloft.net> Cc: netdev@vger.kernel.org Cc: stable@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4ed89cb |
|
18-May-2010 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Fix definition of netif_vdbg() when VERBOSE_DEBUG is not defined Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
76cc8b13 |
|
20-May-2010 |
Tom Herbert <therbert@google.com> |
net: fix problem in dequeuing from input_pkt_queue Fix some issues introduced in batch skb dequeuing for input_pkt_queue. The primary issue it that the queue head must be incremented only after a packet has been processed, that is only after __netif_receive_skb has been called. This is needed for the mechanism to prevent OOO packet in RFS. Also when flushing the input_pkt_queue and process_queue, the process queue should be done first to prevent OOO packets. Because the input_pkt_queue has been effectively split into two queues, the calculation of the tail ptr is no longer correct. The correct value would be head+input_pkt_queue->len+process_queue->len. To avoid this calculation we added an explict input_queue_tail in softnet_data. The tail value is simply incremented when queuing to input_pkt_queue. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
57b61080 |
|
17-May-2010 |
Scott Feldman <scofeldm@cisco.com> |
net: Add netlink support for virtual port management (was iovnl) Add new netdev ops ndo_{set|get}_vf_port to allow setting of port-profile on a netdev interface. Extends netlink socket RTM_SETLINK/ RTM_GETLINK with two new sub msgs called IFLA_VF_PORTS and IFLA_PORT_SELF (added to end of IFLA_cmd list). These are both nested atrtibutes using this layout: [IFLA_NUM_VF] [IFLA_VF_PORTS] [IFLA_VF_PORT] [IFLA_PORT_*], ... [IFLA_VF_PORT] [IFLA_PORT_*], ... ... [IFLA_PORT_SELF] [IFLA_PORT_*], ... These attributes are design to be set and get symmetrically. VF_PORTS is a list of VF_PORTs, one for each VF, when dealing with an SR-IOV device. PORT_SELF is for the PF of the SR-IOV device, in case it wants to also have a port-profile, or for the case where the VF==PF, like in enic patch 2/2 of this patch set. A port-profile is used to configure/enable the external switch virtual port backing the netdev interface, not to configure the host-facing side of the netdev. A port-profile is an identifier known to the switch. How port- profiles are installed on the switch or how available port-profiles are made know to the host is outside the scope of this patch. There are two types of port-profiles specs in the netlink msg. The first spec is for 802.1Qbg (pre-)standard, VDP protocol. The second spec is for devices that run a similar protocol as VDP but in firmware, thus hiding the protocol details. In either case, the specs have much in common and makes sense to define the netlink msg as the union of the two specs. For example, both specs have a notition of associating/deassociating a port-profile. And both specs require some information from the hypervisor manager, such as client port instance ID. The general flow is the port-profile is applied to a host netdev interface using RTM_SETLINK, the receiver of the RTM_SETLINK msg communicates with the switch, and the switch virtual port backing the host netdev interface is configured/enabled based on the settings defined by the port-profile. What those settings comprise, and how those settings are managed is again outside the scope of this patch, since this patch only deals with the first step in the flow. Signed-off-by: Scott Feldman <scofeldm@cisco.com> Signed-off-by: Roopa Prabhu <roprabhu@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b098e2d |
|
16-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Consistent skb timestamping With RPS inclusion, skb timestamping is not consistent in RX path. If netif_receive_skb() is used, its deferred after RPS dispatch. If netif_rx() is used, its done before RPS dispatch. This can give strange tcpdump timestamps results. I think timestamping should be done as soon as possible in the receive path, to get meaningful values (ie timestamps taken at the time packet was delivered by NIC driver to our stack), even if NAPI already can defer timestamping a bit (RPS can help to reduce the gap) Tom Herbert prefer to sample timestamps after RPS dispatch. In case sampling is expensive (HPET/acpi_pm on x86), this makes sense. Let admins switch from one mode to another, using a new sysctl, /proc/sys/net/core/netdev_tstamp_prequeue Its default value (1), means timestamps are taken as soon as possible, before backlog queueing, giving accurate timestamps. Setting a 0 value permits to sample timestamps when processing backlog, after RPS dispatch, to lower the load of the pre-RPS cpu. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ed77134b |
|
05-May-2010 |
Mark Gross <mgross@linux.intel.com> |
PM QOS update This patch changes the string based list management to a handle base implementation to help with the hot path use of pm-qos, it also renames much of the API to use "request" as opposed to "requirement" that was used in the initial implementation. I did this because request more accurately represents what it actually does. Also, I added a string based ABI for users wanting to use a string interface. So if the user writes 0xDDDDDDDD formatted hex it will be accepted by the interface. (someone asked me for it and I don't think it hurts anything.) This patch updates some documentation input I got from Randy. Signed-off-by: markgross <mgross@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
|
#
0e34e931 |
|
06-May-2010 |
WANG Cong <amwang@redhat.com> |
netpoll: add generic support for bridge and bonding devices This whole patchset is for adding netpoll support to bridge and bonding devices. I already tested it for bridge, bonding, bridge over bonding, and bonding over bridge. It looks fine now. To make bridge and bonding support netpoll, we need to adjust some netpoll generic code. This patch does the following things: 1) introduce two new priv_flags for struct net_device: IFF_IN_NETPOLL which identifies we are processing a netpoll; IFF_DISABLE_NETPOLL is used to disable netpoll support for a device at run-time; 2) introduce one new method for netdev_ops: ->ndo_netpoll_cleanup() is used to clean up netpoll when a device is removed. 3) introduce netpoll_poll_dev() which takes a struct net_device * parameter; export netpoll_send_skb() and netpoll_poll_dev() which will be used later; 4) hide a pointer to struct netpoll in struct netpoll_info, ditto. 5) introduce ->real_dev for struct netpoll. 6) introduce a new status NETDEV_BONDING_DESLAE, which is used to disable netconsole before releasing a slave, to avoid deadlocks. Cc: David Miller <davem@davemloft.net> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd7b5396 |
|
02-May-2010 |
David S. Miller <davem@davemloft.net> |
net: Use explicit "unsigned int" instead of plain "unsigned" in netdevice.h Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dee42870 |
|
01-May-2010 |
Changli Gao <xiaosuo@gmail.com> |
net: fix softnet_stat Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context without any protection. And enqueue_to_backlog should update the netdev_rx_stat of the target CPU. This patch renames softnet_data.total to softnet_data.processed: the number of packets processed in uppper levels(IP stacks). softnet_stat data is moved into softnet_data. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/linux/netdevice.h | 17 +++++++---------- net/core/dev.c | 26 ++++++++++++-------------- net/sched/sch_generic.c | 2 +- 3 files changed, 20 insertions(+), 25 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6e7676c1 |
|
27-Apr-2010 |
Changli Gao <xiaosuo@gmail.com> |
net: batch skb dequeueing from softnet input_pkt_queue batch skb dequeueing from softnet input_pkt_queue to reduce potential lock contention when RPS is enabled. Note: in the worst case, the number of packets in a softnet_data may be double of netdev_max_backlog. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a9cbd588 |
|
26-Apr-2010 |
Changli Gao <xiaosuo@gmail.com> |
net: reimplement softnet_data.output_queue as a FIFO queue reimplement softnet_data.output_queue as a FIFO queue to keep the fairness among the qdiscs rescheduled. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ---- include/linux/netdevice.h | 1 + net/core/dev.c | 22 ++++++++++++---------- 2 files changed, 13 insertions(+), 10 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e36fa2f7 |
|
19-Apr-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
rps: cleanups struct softnet_data holds many queues, so consistent use "sd" name instead of "queue" is better. Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog() Adds a _and_irq_disable suffix to net_rps_action() name, as David suggested. incr_input_queue_head() becomes input_queue_head_incr() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
88751275 |
|
18-Apr-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
rps: shortcut net_rps_action() net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if RPS is not active. Tom Herbert used two bitmasks to hold information needed to send IPI, but a single LIFO list seems more appropriate. Move all RPS logic into net_rps_action() to cleanup net_rx_action() code (remove two ifdefs) Move rps_remote_softirq_cpus into softnet_data to share its first cache line, filling an existing hole. In a future patch, we could call net_rps_action() from process_backlog() to make sure we send IPI before handling this cpu backlog. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fec5e652 |
|
16-Apr-2010 |
Tom Herbert <therbert@google.com> |
rfs: Receive Flow Steering This patch implements receive flow steering (RFS). RFS steers received packets for layer 3 and 4 processing to the CPU where the application for the corresponding flow is running. RFS is an extension of Receive Packet Steering (RPS). The basic idea of RFS is that when an application calls recvmsg (or sendmsg) the application's running CPU is stored in a hash table that is indexed by the connection's rxhash which is stored in the socket structure. The rxhash is passed in skb's received on the connection from netif_receive_skb. For each received packet, the associated rxhash is used to look up the CPU in the hash table, if a valid CPU is set then the packet is steered to that CPU using the RPS mechanisms. The convolution of the simple approach is that it would potentially allow OOO packets. If threads are thrashing around CPUs or multiple threads are trying to read from the same sockets, a quickly changing CPU value in the hash table could cause rampant OOO packets-- we consider this a non-starter. To avoid OOO packets, this solution implements two types of hash tables: rps_sock_flow_table and rps_dev_flow_table. rps_sock_table is a global hash table. Each entry is just a CPU number and it is populated in recvmsg and sendmsg as described above. This table contains the "desired" CPUs for flows. rps_dev_flow_table is specific to each device queue. Each entry contains a CPU and a tail queue counter. The CPU is the "current" CPU for a matching flow. The tail queue counter holds the value of a tail queue counter for the associated CPU's backlog queue at the time of last enqueue for a flow matching the entry. Each backlog queue has a queue head counter which is incremented on dequeue, and so a queue tail counter is computed as queue head count + queue length. When a packet is enqueued on a backlog queue, the current value of the queue tail counter is saved in the hash entry of the rps_dev_flow_table. And now the trick: when selecting the CPU for RPS (get_rps_cpu) the rps_sock_flow table and the rps_dev_flow table for the RX queue are consulted. When the desired CPU for the flow (found in the rps_sock_flow table) does not match the current CPU (found in the rps_dev_flow table), the current CPU is changed to the desired CPU if one of the following is true: - The current CPU is unset (equal to RPS_NO_CPU) - Current CPU is offline - The current CPU's queue head counter >= queue tail counter in the rps_dev_flow table. This checks if the queue tail has advanced beyond the last packet that was enqueued using this table entry. This guarantees that all packets queued using this entry have been dequeued, thus preserving in order delivery. Making each queue have its own rps_dev_flow table has two advantages: 1) the tail queue counters will be written on each receive, so keeping the table local to interrupting CPU s good for locality. 2) this allows lockless access to the table-- the CPU number and queue tail counter need to be accessed together under mutual exclusion from netif_receive_skb, we assume that this is only called from device napi_poll which is non-reentrant. This patch implements RFS for TCP and connected UDP sockets. It should be usable for other flow oriented protocols. There are two configuration parameters for RFS. The "rps_flow_entries" kernel init parameter sets the number of entries in the rps_sock_flow_table, the per rxqueue sysfs entry "rps_flow_cnt" contains the number of entries in the rps_dev_flow table for the rxqueue. Both are rounded to power of two. The obvious benefit of RFS (over just RPS) is that it achieves CPU locality between the receive processing for a flow and the applications processing; this can result in increased performance (higher pps, lower latency). The benefits of RFS are dependent on cache hierarchy, application load, and other factors. On simple benchmarks, we don't necessarily see improvement and sometimes see degradation. However, for more complex benchmarks and for applications where cache pressure is much higher this technique seems to perform very well. Below are some benchmark results which show the potential benfit of this patch. The netperf test has 500 instances of netperf TCP_RR test with 1 byte req. and resp. The RPC test is an request/response test similar in structure to netperf RR test ith 100 threads on each host, but does more work in userspace that netperf. e1000e on 8 core Intel No RFS or RPS 104K tps at 30% CPU No RFS (best RPS config): 290K tps at 63% CPU RFS 303K tps at 61% CPU RPC test tps CPU% 50/90/99% usec latency Latency StdDev No RFS/RPS 103K 48% 757/900/3185 4472.35 RPS only: 174K 73% 415/993/2468 491.66 RFS 223K 73% 379/651/1382 315.61 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd793d89 |
|
15-Apr-2010 |
Changli Gao <xiaosuo@gmail.com> |
net: CONFIG_SMP should be CONFIG_RPS Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
acbbc071 |
|
11-Apr-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: uninline skb_bond_should_drop() skb_bond_should_drop() is too big to be inlined. This patch reduces kernel text size, and its compilation time as well (shrinking include/linux/netdevice.h) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
18e225f2 |
|
07-Apr-2010 |
Pavel Roskin <proski@gnu.org> |
net: fix definition of netdev_for_each_mc_addr() The first argument should be called ha, not mclist. All callers use the name "ha", but if they used a different name, there would be a compile error. Signed-off-by: Pavel Roskin <proski@gnu.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
22bedad3 |
|
01-Apr-2010 |
Jiri Pirko <jpirko@redhat.com> |
net: convert multicast list to list_head Converts the list and the core manipulating with it to be the same as uc_list. +uses two functions for adding/removing mc address (normal and "global" variant) instead of a function parameter. +removes dev_mcast.c completely. +exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for manipulation with lists on a sandbox (used in bonding and 80211 drivers) Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a748ee24 |
|
01-Apr-2010 |
Jiri Pirko <jpirko@redhat.com> |
net: move address list functions to a separate file +little renaming of unicast functions to be smooth with multicast ones Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b00fabb4 |
|
29-Mar-2010 |
stephen hemminger <shemminger@vyatta.com> |
netdev: ethtool RXHASH flag This adds ethtool and device feature flag to allow control of receive hashing offload. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df334545 |
|
24-Mar-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
rps: add CONFIG_RPS RPS currently depends on SMP and SYSFS Adding a CONFIG_RPS makes sense in case this requirement changes in the future. This patch saves about 1500 bytes of kernel text in case SMP is on but SYSFS is off. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32a806c1 |
|
18-Mar-2010 |
Jiri Pirko <jpirko@redhat.com> |
bonding: flush unicast and multicast lists when changing type After the type change, addresses in unicast and multicast lists wouldn't make sense, not to mention possible different lenghts. So flush both lists here. Note "dev_addr_discard" will be very soon replaced by "dev_mc_flush" (once mc_list conversion will be done). Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0641e4fb |
|
18-Mar-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Potential null skb->dev dereference When doing "ifenslave -d bond0 eth0", there is chance to get NULL dereference in netif_receive_skb(), because dev->master suddenly becomes NULL after we tested it. We should use ACCESS_ONCE() to avoid this (or rcu_dereference()) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3ca5b404 |
|
10-Mar-2010 |
Jiri Pirko <jpirko@redhat.com> |
bonding: check return value of nofitier when changing type This patch adds the possibility to refuse the bonding type change for other subsystems (such as for example bridge, vlan, etc.) Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1e94d72f |
|
18-Mar-2010 |
Tom Herbert <therbert@google.com> |
rps: Fixed build with CONFIG_SMP not enabled. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a9627f2 |
|
16-Mar-2010 |
Tom Herbert <therbert@google.com> |
rps: Receive Packet Steering This patch implements software receive side packet steering (RPS). RPS distributes the load of received packet processing across multiple CPUs. Problem statement: Protocol processing done in the NAPI context for received packets is serialized per device queue and becomes a bottleneck under high packet load. This substantially limits pps that can be achieved on a single queue NIC and provides no scaling with multiple cores. This solution queues packets early on in the receive path on the backlog queues of other CPUs. This allows protocol processing (e.g. IP and TCP) to be performed on packets in parallel. For each device (or each receive queue in a multi-queue device) a mask of CPUs is set to indicate the CPUs that can process packets. A CPU is selected on a per packet basis by hashing contents of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index into the CPU mask. The IPI mechanism is used to raise networking receive softirqs between CPUs. This effectively emulates in software what a multi-queue NIC can provide, but is generic requiring no device support. Many devices now provide a hash over the 4-tuple on a per packet basis (e.g. the Toeplitz hash). This patch allow drivers to set the HW reported hash in an skb field, and that value in turn is used to index into the RPS maps. Using the HW generated hash can avoid cache misses on the packet when steering it to a remote CPU. The CPU mask is set on a per device and per queue basis in the sysfs variable /sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical bit maps for receive queues in the device (numbered by <n>). If a device does not support multi-queue, a single variable is used for the device (rx-0). Generally, we have found this technique increases pps capabilities of a single queue device with good CPU utilization. Optimal settings for the CPU mask seem to depend on architectures and cache hierarcy. Below are some results running 500 instances of netperf TCP_RR test with 1 byte req. and resp. Results show cumulative transaction rate and system CPU utilization. e1000e on 8 core Intel Without RPS: 108K tps at 33% CPU With RPS: 311K tps at 64% CPU forcedeth on 16 core AMD Without RPS: 156K tps at 15% CPU With RPS: 404K tps at 49% CPU bnx2x on 16 core AMD Without RPS 567K tps at 61% CPU (4 HW RX queues) Without RPS 738K tps at 96% CPU (8 HW RX queues) With RPS: 854K tps at 76% CPU (4 HW RX queues) Caveats: - The benefits of this patch are dependent on architecture and cache hierarchy. Tuning the masks to get best performance is probably necessary. - This patch adds overhead in the path for processing a single packet. In a lightly loaded server this overhead may eliminate the advantages of increased parallelism, and possibly cause some relative performance degradation. We have found that masks that are cache aware (share same caches with the interrupting CPU) mitigate much of this. - The RPS masks can be changed dynamically, however whenever the mask is changed this introduces the possibility of generating out of order packets. It's probably best not change the masks too frequently. Signed-off-by: Tom Herbert <therbert@google.com> include/linux/netdevice.h | 32 ++++- include/linux/skbuff.h | 3 + net/core/dev.c | 335 +++++++++++++++++++++++++++++++++++++-------- net/core/net-sysfs.c | 225 ++++++++++++++++++++++++++++++- net/core/skbuff.c | 2 + 5 files changed, 538 insertions(+), 59 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bd380811 |
|
25-Feb-2010 |
Patrick McHardy <kaber@trash.net> |
dev: support deferring device flag change notifications Split dev_change_flags() into two functions: __dev_change_flags() to perform the actual changes and __dev_notify_flags() to invoke netdevice notifiers. This will be used by rtnl_link to defer netlink notifications until the device has been fully configured. This changes ordering of some operations, in particular: - netlink notifications are sent after all changes have been performed. As a side effect this surpresses one unnecessary netlink message when the IFF_UP and other flags are changed simultaneously. - The NETDEV_UP/NETDEV_DOWN and NETDEV_CHANGE notifiers are invoked after all changes have been performed. Their relative is unchanged. - net_dmaengine_put() is invoked before the NETDEV_DOWN notifier instead of afterwards. This should not make any difference since both RX and TX are already shut down at this point. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a2835763 |
|
25-Feb-2010 |
Patrick McHardy <kaber@trash.net> |
rtnetlink: handle rtnl_link netlink notifications manually In order to support specifying device flags during device creation, we must be able to roll back device registration in case setting the flags fails without sending any notifications related to the device to userspace. This patch changes rollback_registered_many() and register_netdevice() to manually send netlink notifications for devices not handled by rtnl_link and allows to defer notifications for devices handled by rtnl_link until setup is complete. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
caf66e58 |
|
24-Feb-2010 |
John W. Linville <linville@tuxdriver.com> |
netdevice.h: check for CONFIG_WLAN instead of CONFIG_WLAN_80211 In "wireless: remove WLAN_80211 and WLAN_PRE80211 from Kconfig" I inadvertantly missed a line in include/linux/netdevice.h. I thereby effectively reverted "net: Set LL_MAX_HEADER properly for wireless." by accident. :-( Now we should check there for CONFIG_WLAN instead. Signed-off-by: John W. Linville <linville@tuxdriver.com> Reported-by: Christoph Egger <siccegge@stud.informatik.uni-erlangen.de> Cc: stable@kernel.org
|
#
95c26df8 |
|
09-Feb-2010 |
Williams, Mitch A <mitch.a.williams@intel.com> |
net: Add netdev ops for SR-IOV configuration Add netdev ops for configuring SR-IOV VF devices through the PF driver. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b3d95c5c |
|
09-Feb-2010 |
Joe Perches <joe@perches.com> |
include/linux/netdevice.h: Add netif_printk helpers Add macros to test a private structure for msg_enable bits and the netif_msg_##bit to test and call netdev_printk if set Simplifies logic in callers and adds message logging consistency Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
571ba423 |
|
09-Feb-2010 |
Joe Perches <joe@perches.com> |
netdevice.h: Add netdev_printk helpers like dev_printk These netdev_printk routines take a struct net_device * and emit dev_printk logging messages adding "%s: " ... netdev->dev.parent to the dev_printk format and arguments. This can create some uniformity in the output message log. These helpers should not be used until a successful alloc_netdev. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
15682bc4 |
|
10-Feb-2010 |
Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> |
ethtool: Introduce n-tuple filter programming support This patchset enables the ethtool layer to program n-tuple filters to an underlying device. The idea is to allow capable hardware to have static rules applied that can assist steering flows into appropriate queues. Hardware that is known to support these types of filters today are ixgbe and niu. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6683ece3 |
|
04-Feb-2010 |
Jiri Pirko <jpirko@redhat.com> |
net: use helpers to access mc list V2 This patch introduces the similar helpers as those already done for uc list. However multicast lists are no list_head lists but "mademanually". The three macros added by this patch will make the transition of mc_list to list_head smooth in two steps: 1) convert all drivers to use these macros (with the original iterator of type "struct dev_mc_list") 2) once all drivers are converted, convert list type and iterators to "struct netdev_hw_addr" in one patch. >From now on, drivers can (and should) use "netdev_for_each_mc_addr" to iterate over the addresses with iterator of type "struct netdev_hw_addr". Also macros "netdev_mc_count" and "netdev_mc_empty" to read list's length. This is the state which should be reached in all drivers. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8a83a00b |
|
29-Jan-2010 |
Arnd Bergmann <arnd@arndb.de> |
net: maintain namespace isolation between vlan and real device In the vlan and macvlan drivers, the start_xmit function forwards data to the dev_queue_xmit function for another device, which may potentially belong to a different namespace. To make sure that classification stays within a single namespace, this resets the potentially critical fields. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32e7bfc4 |
|
25-Jan-2010 |
Jiri Pirko <jpirko@redhat.com> |
net: use helpers to access uc list V2 This patch introduces three macros to work with uc list from net drivers. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a271623f |
|
22-Jan-2010 |
Alexey Dobriyan <adobriyan@gmail.com> |
netdev: remove certain HAVE_ macros After netdev_ops compat code HAVE_* macros aren't needed, in fact they _will_ result in compile breakage for out of tree drivers. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
11380a4b |
|
19-Jan-2010 |
David S. Miller <davem@davemloft.net> |
net: Unexport napi_gro_flush(). Nothing outside of net/core/dev.c uses it. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fc4a7489 |
|
03-Dec-2009 |
Patrick Mullaney <pmullaney@novell.com> |
netdevice: provide common routine for macvlan and vlan operstate management Provide common routine for the transition of operational state for a leaf device during a root device transition. Signed-off-by: Patrick Mullaney <pmullaney@novell.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dcbccbd4 |
|
29-Nov-2009 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Implement for_each_netdev_reverse. I will need this shortly to implement network namespace shutdown batching. For sanity sake network devices should be removed in the reverse order they were created in. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
44540960 |
|
25-Nov-2009 |
Arnd Bergmann <arnd@arndb.de> |
veth: move loopback logic to common location The veth driver contains code to forward an skb from the start_xmit function of one network device into the receive path of another device. Moving that code into a common location lets us reuse the code for direct forwarding of data between macvlan ports, and possibly in other drivers. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e014debe |
|
16-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
linkwatch: linkwatch_forget_dev() to speedup device dismantle Herbert Xu a écrit : > On Tue, Nov 17, 2009 at 04:26:04AM -0800, David Miller wrote: >> Really, the link watch stuff is just due for a redesign. I don't >> think a simple hack is going to cut it this time, sorry Eric :-) > > I have no objections against any redesigns, but since the only > caller of linkwatch_forget_dev runs in process context with the > RTNL, it could also legally emit those events. Thanks guys, here an updated version then, before linkwatch surgery ? In this version, I force the event to be sent synchronously. [PATCH net-next-2.6] linkwatch: linkwatch_forget_dev() to speedup device dismantle time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105 real 0m0.266s user 0m0.000s sys 0m0.001s real 0m0.770s user 0m0.000s sys 0m0.000s real 0m1.022s user 0m0.000s sys 0m0.000s One problem of current schem in vlan dismantle phase is the holding of device done by following chain : vlan_dev_stop() -> netif_carrier_off(dev) -> linkwatch_fire_event(dev) -> dev_hold() ... And __linkwatch_run_queue() runs up to one second later... A generic fix to this problem is to add a linkwatch_forget_dev() method to unlink the device from the list of watched devices. dev->link_watch_next becomes dev->link_watch_list (and use a bit more memory), to be able to unlink device in O(1). After patch : time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105 real 0m0.024s user 0m0.000s sys 0m0.000s real 0m0.032s user 0m0.000s sys 0m0.001s real 0m0.033s user 0m0.000s sys 0m0.000s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d83345ad |
|
15-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add dev_txq_stats_fold() helper Some drivers ndo_get_stats() method need to perform txqueue stats folding. Move folding from dev_get_stats() to a new dev_txq_stats_fold() function Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9a1654ba |
|
15-Nov-2009 |
Jarek Poplawski <jarkao2@gmail.com> |
net: Optimize hard_start_xmit() return checking Recent changes in the TX error propagation require additional checking and masking of values returned from hard_start_xmit(), mainly to separate cases where skb was consumed. This aim can be simplified by changing the order of NETDEV_TX and NET_XMIT codes, because the latter are treated similarly to negative (ERRNO) values. After this change much simpler dev_xmit_complete() is also used in sch_direct_xmit(), so it is moved to netdevice.h. Additionally NET_RX definitions in netdevice.h are moved up from between TX codes to avoid confusion while reading the TX comment. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce81b76a |
|
11-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv6: use RCU to walk list of network devices No longer need read_lock(&dev_base_lock), use RCU instead. We also can avoid taking references on inet6_dev structs. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
572a9d7b |
|
09-Nov-2009 |
Patrick McHardy <kaber@trash.net> |
net: allow to propagate errors through ->ndo_hard_start_xmit() Currently the ->ndo_hard_start_xmit() callbacks are only permitted to return one of the NETDEV_TX codes. This prevents any kind of error propagation for virtual devices, like queue congestion of the underlying device in case of layered devices, or unreachability in case of tunnels. This patches changes the NET_XMIT codes to avoid clashes with the NETDEV_TX codes and changes the two callers of dev_hard_start_xmit() to expect either errno codes, NET_XMIT codes or NETDEV_TX codes as return value. In case of qdisc_restart(), all non NETDEV_TX codes are mapped to NETDEV_TX_OK since no error propagation is possible when using qdiscs. In case of dev_queue_xmit(), the error is propagated upwards. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
254245d2 |
|
10-Nov-2009 |
stephen hemminger <shemminger@vyatta.com> |
netdev: add netdev_continue_rcu This adds an RCU macro for continuing search, useful for some network devices like vlan. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d94d9fee |
|
04-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup include/linux This cleanup patch puts struct/union/enum opening braces, in first line to ease grep games. struct something { becomes : struct something { Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c6d14c84 |
|
04-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Introduce for_each_netdev_rcu() iterator Adds RCU management to the list of netdevices. Convert some for_each_netdev() users to RCU version, if it can avoid read_lock-ing dev_base_lock Ie: read_lock(&dev_base_loack); for_each_netdev(net, dev) some_action(); read_unlock(&dev_base_lock); becomes : rcu_read_lock(); for_each_netdev_rcu(net, dev) some_action(); rcu_read_unlock(); Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
72c9528b |
|
30-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Introduce dev_get_by_name_rcu() Some workloads hit dev_base_lock rwlock pretty hard. We can use RCU lookups to avoid touching this rwlock (and avoid touching netdevice refcount) netdevices are already freed after a RCU grace period, so this patch adds no penalty at device dismantle time. However, it adds a synchronize_rcu() call in dev_change_name() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c509a6c |
|
29-Oct-2009 |
Eric W. Biederman <ebiederm@aristanetworks.com> |
net: Allow devices to specify a device specific sysfs group. This isn't beautifully abstracted, but it is simple, simplifies uses and so far is only needed for the bonding driver. Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c7c4b3b6 |
|
29-Oct-2009 |
Ben Hutchings <bhutchings@solarflare.com> |
gro: Change all receive functions to return GRO result codes This will allow drivers to adjust their receive path dynamically based on whether GRO is being applied successfully. Currently all in-tree callers ignore the return values of these functions and do not need to be changed. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b252f0c |
|
29-Oct-2009 |
Ben Hutchings <bhutchings@solarflare.com> |
gro: Name the GRO result enumeration type This clarifies which return and parameter types are GRO result codes and not RX result codes. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fb699dfd |
|
19-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Introduce dev_get_by_index_rcu() Some workloads hit dev_base_lock rwlock pretty hard. We can use RCU lookups to avoid touching this rwlock. netdevices are already freed after a RCU grace period, so this patch adds no penalty at device dismantle time. dev_ifname() converted to dev_get_by_index_rcu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
df5c7945 |
|
28-Oct-2009 |
Yi Zou <yi.zou@intel.com> |
net: Add ndo_fcoe_get_wwn to net_device_ops Add ndo_fcoe_get_wwn so Fiber Channel over Ethernet (FCoE) can make use of the provided World Wide Port Name (WWPN) and World Wide Node Name (WWNN) from the underlying network interface driver. Signed-off-by: Yi Zou <yi.zou@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9b5e383c |
|
27-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Introduce unregister_netdevice_many() Introduce rollback_registered_many() and unregister_netdevice_many() rollback_registered_many() is able to perform necessary steps at device dismantle time, factorizing two expensive synchronize_net() calls. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
44a0873d |
|
27-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Introduce unregister_netdevice_queue() This patchs adds an unreg_list anchor to struct net_device, and introduces an unregister_netdevice_queue() function, able to queue a net_device to a list instead of immediately unregister it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d308e38f |
|
07-Oct-2009 |
Wolfram Sang <wsa@kernel.org> |
include/linux/netdevice.h: fix nanodoc mismatch nanodoc was missing an ndo_-prefix. Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
32953543 |
|
05-Oct-2009 |
Stephen Hemminger <shemminger@vyatta.com> |
dcb: data center bridging ops should be r/o The data center bridging ops structure can be const Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4dbd674 |
|
24-Jun-2009 |
David Brownell <dbrownell@users.sourceforge.net> |
driver model: constify attribute groups Let attribute group vectors be declared "const". We'd like to let most attribute metadata live in read-only sections... this is a start. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
75c78500 |
|
15-Sep-2009 |
Moni Shoua <monis@voltaire.com> |
bonding: remap muticast addresses without using dev_close() and dev_open() This patch fixes commit e36b9d16c6a6d0f59803b3ef04ff3c22c3844c10. The approach there is to call dev_close()/dev_open() whenever the device type is changed in order to remap the device IP multicast addresses to HW multicast addresses. This approach suffers from 2 drawbacks: *. It assumes tha the device is UP when calling dev_close(), or otherwise dev_close() has no affect. It is worth to mention that initscripts (Redhat) and sysconfig (Suse) doesn't act the same in this matter. *. dev_close() has other side affects, like deleting entries from the routing table, which might be unnecessary. The fix here is to directly remap the IP multicast addresses to HW multicast addresses for a bonding device that changes its type, and nothing else. Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Moni Shoua <monis@voltaire.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
384912ed |
|
31-Aug-2009 |
Marcel Holtmann <marcel@holtmann.org> |
net: Add DEVTYPE support for Ethernet based devices The Ethernet framing is used for a lot of devices these days. Most prominent are WiFi and WiMAX based devices. However for userspace application it is important to classify these devices correctly and not only see them as Ethernet devices. The daemons like HAL, DeviceKit or even NetworkManager with udev support tries to do the classification in userspace with a lot trickery and extra system calls. This is not good and actually reaches its limitations. Especially since the kernel does know the type of the Ethernet device it is pretty stupid. To solve this problem the underlying device type needs to be set and then the value will be exported as DEVTYPE via uevents and available within udev. # cat /sys/class/net/wlan0/uevent DEVTYPE=wlan INTERFACE=wlan0 IFINDEX=5 This is similar to subsystems like USB and SCSI that distinguish between hosts, devices, disks, partitions etc. The new SET_NETDEV_DEVTYPE() is a convenience helper to set the actual device type. All device types are free form, but for convenience the same strings as used with RFKILL are choosen. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
af356afa |
|
04-Sep-2009 |
Patrick McHardy <kaber@trash.net> |
net_sched: reintroduce dev->qdisc for use by sch_api Currently the multiqueue integration with the qdisc API suffers from a few problems: - with multiple queues, all root qdiscs use the same handle. This means they can't be exposed to userspace in a backwards compatible fashion. - all API operations always refer to queue number 0. Newly created qdiscs are automatically shared between all queues, its not possible to address individual queues or restore multiqueue behaviour once a shared qdisc has been attached. - Dumps only contain the root qdisc of queue 0, in case of non-shared qdiscs this means the statistics are incomplete. This patch reintroduces dev->qdisc, which points to the (single) root qdisc from userspace's point of view. Currently it either points to the first (non-shared) default qdisc, or a qdisc shared between all queues. The following patches will introduce a classful dummy qdisc, which will be used as root qdisc and contain the per-queue qdiscs as children. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cb454399 |
|
30-Aug-2009 |
Yi Zou <yi.zou@intel.com> |
net: Add ndo_fcoe_enable/ndo_fcoe_disable to net_device_ops Add ndo_fcoe_enable/_disable to net_device_ops so the corresponding HW can initialize itself for FCoE traffic or clean up after FCoE traffic is done. This is expected to be called by the kernel FCoE stack upon receiving a request for creating an FCoE instance on the corresponding netdev interface. When implemented by the actual HW, the HW driver check the op code to perform corresponding initialization or clean up for FCoE. The initialization normally includes allocating extra queues for FCoE, setting corresponding HW registers for FCoE, indicating FCoE offload features via netdev, etc. The clean-up would include releasing the resources allocated for FCoE. Signed-off-by: Yi Zou <yi.zou@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dc1f8bf6 |
|
31-Aug-2009 |
Stephen Hemminger <shemminger@vyatta.com> |
netdev: change transmit to limited range type The transmit function should only return one of three possible values, some drivers got confused and returned errno's or other values. This changes the definition so that this can be caught at compile time. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7b3d3e4f |
|
29-Aug-2009 |
Krishna Kumar <krkumar2@in.ibm.com> |
netdevice: Consolidate to use existing macros where available. Patch compiled and 32 simultaneous netperf testing ran fine. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb2af4f5 |
|
13-Aug-2009 |
Yi Zou <yi.zou@intel.com> |
net: Add NETIF_F_FCOE_MTU to indicate support for a different MTU for FCoE Add NETIF_F_FCOE_MTU to indicate that the NIC can support a secondary MTU for converged traffic of LAN and Fiber Channel over Ethernet (FCoE). The MTU for FCoE is 2158 = 14 (FCoE header) + 24 (FC header) + 2112 (FC max payload) + 4 (FC CRC) + 4 (FCoE trailer). Signed-off-by: Yi Zou <yi.zou@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0e8635a8 |
|
19-Jun-2009 |
Florian Westphal <fw@strlen.de> |
net: remove NET_RX_BAD and NET_RX_CN* defines almost no users in the tree; and the few that use them treat them like NET_RX_DROP. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
31278e71 |
|
16-Jun-2009 |
Jiri Pirko <jpirko@redhat.com> |
net: group address list and its count This patch is inspired by patch recently posted by Johannes Berg. Basically what my patch does is to group list and a count of addresses into newly introduced structure netdev_hw_addr_list. This brings us two benefits: 1) struct net_device becames a bit nicer. 2) in the future there will be a possibility to operate with lists independently on netdevices (with exporting right functions). I wanted to introduce this patch before I'll post a multicast lists conversion. Signed-off-by: Jiri Pirko <jpirko@redhat.com> drivers/net/bnx2.c | 4 +- drivers/net/e1000/e1000_main.c | 4 +- drivers/net/ixgbe/ixgbe_main.c | 6 +- drivers/net/mv643xx_eth.c | 2 +- drivers/net/niu.c | 4 +- drivers/net/virtio_net.c | 10 ++-- drivers/s390/net/qeth_l2_main.c | 2 +- include/linux/netdevice.h | 17 +++-- net/core/dev.c | 130 ++++++++++++++++++-------------------- 9 files changed, 89 insertions(+), 90 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a5bd8a13 |
|
09-Jun-2009 |
David S. Miller <davem@davemloft.net> |
netdevice.h: Use frag list abstraction interfaces. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
278b2513 |
|
03-Jun-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gso: Stop fraglists from escaping As it stands skb fraglists can get past the check in dev_queue_xmit if the skb is marked as GSO. In particular, if the packet doesn't have the proper checksums for GSO, but can otherwise be handled by the underlying device, we will not perform the fraglist check on it at all. If the underlying device cannot handle fraglists, then this will break. The fix is as simple as moving the fraglist check from the device check into skb_gso_ok. This has caused crashes with Xen when used together with GRO which can generate GSO packets with fraglists. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ccffad25 |
|
22-May-2009 |
Jiri Pirko <jpirko@redhat.com> |
net: convert unicast addr list This patch converts unicast address list to standard list_head using previously introduced struct netdev_hw_addr. It also relaxes the locking. Original spinlock (still used for multicast addresses) is not needed and is no longer used for a protection of this list. All reading and writing takes place under rtnl (with no changes). I also removed a possibility to specify the length of the address while adding or deleting unicast address. It's always dev->addr_len. The convertion touched especially e1000 and ixgbe codes when the change is not so trivial. Signed-off-by: Jiri Pirko <jpirko@redhat.com> drivers/net/bnx2.c | 13 +-- drivers/net/e1000/e1000_main.c | 24 +++-- drivers/net/ixgbe/ixgbe_common.c | 14 ++-- drivers/net/ixgbe/ixgbe_common.h | 4 +- drivers/net/ixgbe/ixgbe_main.c | 6 +- drivers/net/ixgbe/ixgbe_type.h | 4 +- drivers/net/macvlan.c | 11 +- drivers/net/mv643xx_eth.c | 11 +- drivers/net/niu.c | 7 +- drivers/net/virtio_net.c | 7 +- drivers/s390/net/qeth_l2_main.c | 6 +- drivers/scsi/fcoe/fcoe.c | 16 ++-- include/linux/netdevice.h | 18 ++-- net/8021q/vlan.c | 4 +- net/8021q/vlan_dev.c | 10 +- net/core/dev.c | 195 +++++++++++++++++++++++++++----------- net/dsa/slave.c | 10 +- net/packet/af_packet.c | 4 +- 18 files changed, 227 insertions(+), 137 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d4e039b |
|
27-May-2009 |
Jiri Pirko <jpirko@redhat.com> |
bonding: allow bond in mode balance-alb to work properly in bridge -try4.3 [PATCH net-next] bonding: allow bond in mode balance-alb to work properly in bridge -try4.3 (updated) changes v4.2 -> v4.3 - memcpy the address always, not just in case it differs from master->dev_addr - compare_ether_addr_64bits() is not used so there is no direct need to make new header file (I think it would be good to have bond stuff in separate file anyway). changes v4.1 -> v4.2 - use skb->pkt_type == PACKET_HOST compare rather then comparing skb dest addr against skb->dev->dev_addr The problem is described in following bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=487763 Basically here's what's going on. In every mode, bonding interface uses the same mac address for all enslaved devices (except fail_over_mac). Only balance-alb will simultaneously use multiple MAC addresses across different slaves. When you put this kind of bond device into a bridge it will only add one of mac adresses into a hash list of mac addresses, say X. This mac address is marked as local. But this bonding interface also has mac address Y. Now then packet arrives with destination address Y, this address is not marked as local and the packed looks like it needs to be forwarded. This packet is then lost which is wrong. Notice that interfaces can be added and removed from bond while it is in bridge. *** When the multiple addresses for bridge port approach failed to solve this issue due to STP I started to think other way to solve this. I returned to previous solution but tweaked one. This patch solves the situation in the bonding without touching bridge code. For every incoming frame to bonding the destination address is compared to current address of the slave device from which tha packet came. If these two match destination address is replaced by mac address of the master. This address is known by bridge so it is delivered properly. Note that the comparsion is not made directly, it's used skb->pkt_type == PACKET_HOST instead. This is "set" previously in eth_type_trans(). I experimentally tried that this works as good as searching through the slave list (v4 of this patch). Jirka Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
385a154c |
|
27-May-2009 |
Jiri Pirko <jpirko@redhat.com> |
net: correct a comment for the final #endif Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1ce8e7b5 |
|
26-May-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: ALIGN/PTR_ALIGN cleanup in alloc_netdev_mq()/netdev_priv() Use ALIGN() and PTR_ALIGN() macros instead of handcoding them. Get rid of NETDEV_ALIGN_CONST ugly define Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a5b1cf28 |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Avoid unnecessary comparison after skb_gro_header For the overwhelming majority of cases, skb_gro_header's return value cannot be NULL. Yet we must check it because of its current form. This patch splits it up into multiple functions in order to avoid this. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7489594c |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Optimise length comparison in skb_gro_header By caching frag0_len, we can avoid checking both frag0 and the length separately in skb_gro_header. This helps as skb_gro_header is called four times per packet which amounts to a few million times at 10Gb/s. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
78d3fd0b |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Only use skb_gro_header for completely non-linear packets Currently skb_gro_header is used for packets which put the hardware header in skb->data with the rest in frags. Since the drivers that need this optimisation all provide completely non-linear packets, we can gain extra optimisations by only performing the frag0 optimisation for completely non-linear packets. In particular, we can simply test frag0 (instead of skb_headlen) to see whether the optimisation is in force. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
78a478d0 |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Inline skb_gro_header and cache frag0 virtual address The function skb_gro_header is called four times per packet which quickly adds up at 10Gb/s. This patch inlines it to allow better optimisations. Some architectures perform multiplication for page_address, which is done by each skb_gro_header invocation. This patch caches that value in skb->cb to avoid the unnecessary multiplications. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
08baf561 |
|
25-May-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: txq_trans_update() helper We would like to get rid of netdev->trans_start = jiffies; that about all net drivers have to use in their start_xmit() function, and use txq->trans_start instead. This can be done generically in core network, as suggested by David. Some devices, (particularly loopback) dont need trans_start update, because they dont have transmit watchdog. We could add a new device flag, or rely on fact that txq->tran_start can be updated is txq->xmit_lock_owner is different than -1. Use a helper function to hide our choice. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3804cbe |
|
25-May-2009 |
Alexander Beregalov <a.beregalov@gmail.com> |
net: remove COMPAT_NET_DEV_OPS All drivers are already converted to new net_device_ops API and nobody uses old API anymore. Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7004bf25 |
|
17-May-2009 |
Eric Dumazet <dada1@cosmosbay.com> |
net: add tx_packets/tx_bytes/tx_dropped counters in struct netdev_queue offsetof(struct net_device, features)=0x44 offsetof(struct net_device, stats.tx_packets)=0x54 offsetof(struct net_device, stats.tx_bytes)=0x5c offsetof(struct net_device, stats.tx_dropped)=0x6c Network drivers that touch dev->stats.tx_packets/stats.tx_bytes in their tx path can slow down SMP operations, since they dirty a cache line that should stay shared (dev->features is needed in rx and tx paths) We could move away stats field in net_device but it wont help that much. (Two cache lines dirtied in tx path, we can do one only) Better solution is to add tx_packets/tx_bytes/tx_dropped in struct netdev_queue because this structure is already touched in tx path and counters updates will then be free (no increase in size) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9d21493b |
|
17-May-2009 |
Eric Dumazet <dada1@cosmosbay.com> |
net: tx scalability works : trans_start struct net_device trans_start field is a hot spot on SMP and high performance devices, particularly multi queues ones, because every transmitter dirties it. Is main use is tx watchdog and bonding alive checks. But as most devices dont use NETIF_F_LLTX, we have to lock a netdev_queue before calling their ndo_start_xmit(). So it makes sense to move trans_start from net_device to netdev_queue. Its update will occur on a already present (and in exclusive state) cache line, for free. We can do this transition smoothly. An old driver continue to update dev->trans_start, while an updated one updates txq->trans_start. Further patches could also put tx_bytes/tx_packets counters in netdev_queue to avoid dirtying dev->stats (vlan device comes to mind) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4d5b78c0 |
|
06-May-2009 |
David S. Miller <davem@davemloft.net> |
net: Add missing rculist.h include to netdevice.h Otherwise list_for_each_entry_rcu() et al. aren't visible and we get build failures in some configurations. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f001fde5 |
|
04-May-2009 |
Jiri Pirko <jpirko@redhat.com> |
net: introduce a list of device addresses dev_addr_list (v6) v5 -> v6 (current): -removed so far unused static functions -corrected dev_addr_del_multiple to call del instead of add v4 -> v5: -added device address type (suggested by davem) -removed refcounting (better to have simplier code then safe potentially few bytes) v3 -> v4: -changed kzalloc to kmalloc in __hw_addr_add_ii() -ASSERT_RTNL() avoided in dev_addr_flush() and dev_addr_init() v2 -> v3: -removed unnecessary rcu read locking -moved dev_addr_flush() calling to ensure no null dereference of dev_addr v1 -> v2: -added forgotten ASSERT_RTNL to dev_addr_init and dev_addr_flush -removed unnecessary rcu_read locking in dev_addr_init -use compare_ether_addr_64bits instead of compare_ether_addr -use L1_CACHE_BYTES as size for allocating struct netdev_hw_addr -use call_rcu instead of rcu_synchronize -moved is_etherdev_addr into __KERNEL__ ifdef This patch introduces a new list in struct net_device and brings a set of functions to handle the work with device address list. The list is a replacement for the original dev_addr field and because in some situations there is need to carry several device addresses with the net device. To be backward compatible, dev_addr is made to point to the first member of the list so original drivers sees no difference. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6a321cb3 |
|
28-Apr-2009 |
Eric Dumazet <dada1@cosmosbay.com> |
net: netif_tx_queue_stopped too expensive netif_tx_queue_stopped(txq) is most of the time false. Yet its cost is very expensive on SMP. static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue) { return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state); } I saw this on oprofile hunting and bnx2 driver bnx2_tx_int(). We probably should split "struct netdev_queue" in two parts, one being read mostly. __netif_tx_lock() touches _xmit_lock & xmit_lock_owner, these deserve a separate cache line. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8dc92f7e |
|
27-Apr-2009 |
Jesse Brandeburg <jesse.brandeburg@intel.com> |
sctp: add feature bit for SCTP offload in hardware this is the sctp code to enable hardware crc32c offload for adapters that support it. Originally by: Vlad Yasevich <vladislav.yasevich@hp.com> modified by Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37b607c5 |
|
27-Apr-2009 |
Mike Rapoport <rppt@kernel.org> |
net: Fix typo in net_device_ops description. Signed-off-by: Mike Rapoport <mike@compulab.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
36e7b1b8 |
|
27-Apr-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Fix COMPLETE checksum handling On a brand new GRO skb, we cannot call ip_hdr since the header may lie in the non-linear area. This patch adds the helper skb_gro_network_header to handle this. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
edbd9e30 |
|
27-Apr-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Fix handling of headers that extend over the tail The skb_gro_* code fails to handle the case where a header starts in the linear area but ends in the frags area. Since the goal of skb_gro_* is to optimise the case of completely non-linear packets, we can simply bail out if we have anything in the linear area. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c759a6b4 |
|
27-Apr-2009 |
Adrian Bunk <bunk@kernel.org> |
net: Fix LL_MAX_HEADER for CONFIG_TR_MODULE Unless I miss anything this should fix a bug. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b1b67dd4 |
|
19-Apr-2009 |
Patrick McHardy <kaber@trash.net> |
net: factor out ethtool invocation of vlan/macvlan drivers Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
76620aaf |
|
16-Apr-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: New frags interface to avoid copying shinfo It turns out that copying a 16-byte area at ~800k times a second can be really expensive :) This patch redesigns the frags GRO interface to avoid copying that area twice. The two disciples of the frags interface have been converted. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cc0be322 |
|
27-Mar-2009 |
Dmitri Vorobiev <dmitri.vorobiev@movial.com> |
net: Add missing include into include/linux/netdevice.h The inline function skb_gro_mac_header defined in include/linux/netdevice.h makes use of page_address(). Depending on configuration options, the latter is either defined as a macro or is declared as a function in another header file, namely include/linux/mm.h. However, include/linux/netdevice.h does not include include/linux/mm.h. On MIPS, this has produced the following build error: CC kernel/sysctl_check.o In file included from include/linux/icmpv6.h:173, from include/linux/ipv6.h:208, from include/net/ip_vs.h:26, from kernel/sysctl_check.c:6: include/linux/netdevice.h: In function 'skb_gro_mac_header': include/linux/netdevice.h:1132: error: implicit declaration of function 'page_address' include/linux/netdevice.h:1133: warning: pointer/integer type mismatch in conditional expression make[1]: *** [kernel/sysctl_check.o] Error 1 make: *** [kernel] Error 2 The patch adds the missing include and fixes the build error. Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d1c76af9 |
|
16-Mar-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
GRO: Move netpoll checks to correct location As my netpoll fix for net doesn't really work for net-next, we need this update to move the checks into the right place. As it stands we may pass freed skbs to netpoll_receive_skb. This patch also introduces a netpoll_rx_on function to avoid GRO completely if we're invoked through netpoll. This might seem paranoid but as netpoll may have an external receive hook it's better to be safe than sorry. I don't think we need this for 2.6.29 though since there's nothing immediately broken by it. This patch also moves the GRO_* return values to netdevice.h since VLAN needs them too (I tried to avoid this originally but alas this seems to be the easiest way out). This fixes a bug in VLAN where it continued to use the old return value 2 instead of the correct GRO_DROP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4d288d57 |
|
27-Feb-2009 |
Yi Zou <yi.zou@intel.com> |
[SCSI] net: add FCoE offload support through net_device This adds support to provide Fiber Channel over Ethernet (FCoE) offload through net_device's net_device_ops struct. The offload through net_device for FCoE is enabled in kernel as built-in or module driver. Signed-off-by: Yi Zou <yi.zou@intel.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
|
#
01d5b2fc |
|
27-Feb-2009 |
Chris Leech <christopher.leech@intel.com> |
[SCSI] net: define feature flags for FCoE offloads Define feature flags for FCoE offloads. Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: Yi Zou <yi.zou@intel.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
|
#
43eb99c5 |
|
27-Feb-2009 |
Chris Leech <christopher.leech@intel.com> |
[SCSI] net: reclaim 8 upper bits of the netdev->features from GSO Reclaim 8 upper bits of netdev->features from GSO. Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: Yi Zou <yi.zou@intel.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
|
#
9d40bbda |
|
05-Mar-2009 |
David S. Miller <davem@davemloft.net> |
vlan: Fix vlan-in-vlan crashes. As analyzed by Patrick McHardy, vlan needs to reset it's netdev_ops pointer in it's ->init() function but this leaves the compat method pointers stale. Add a netdev_resync_ops() and call it from the vlan code. Any other driver which changes ->netdev_ops after register_netdevice() will need to call this new function after doing so too. With help from Patrick McHardy. Tested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3a7c66b |
|
14-Feb-2009 |
Harvey Harrison <harvey.harrison@gmail.com> |
net: replace __constant_{endian} uses in net headers Base versions handle constant folding now. For headers exposed to userspace, we must only expose the __ prefixed versions. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
aa4b9f53 |
|
08-Feb-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Optimise Ethernet header comparison This patch optimises the Ethernet header comparison to use 2-byte and 4-byte xors instead of memcmp. In order to facilitate this, the actual comparison is now carried out by the callers of the shared dev_gro_receive function. This has a significant impact when receiving 1500B packets through 10GbE. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4ae5544f |
|
08-Feb-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Remember number of held packets instead of counting every time This patch prepares for the move of the same_flow checks out of dev_gro_receive. As such we need to remember the number of held packets since doing a loop just to count them every time is silly. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fe2918b0 |
|
05-Feb-2009 |
Graf Yang <graf.yang@analog.com> |
net: fix some trailing whitespaces Signed-off-by: Graf Yang <graf.yang@analog.com> Signed-off-by: Bryan Wu <cooloney@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
86911732 |
|
29-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Avoid copying headers of unmerged packets Unfortunately simplicity isn't always the best. The fraginfo interface turned out to be suboptimal. The problem was quite obvious. For every packet, we have to copy the headers from the frags structure into skb->head, even though for 99% of the packets this part is immediately thrown away after the merge. LRO didn't have this problem because it directly read the headers from the frags structure. This patch attempts to address this by creating an interface that allows GRO to access the headers in the first frag without having to copy it. Because all drivers that use frags place the headers in the first frag this optimisation should be enough. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5d0d9be8 |
|
29-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Move common completion code into helpers Currently VLAN still has a bit of common code handling the aftermath of GRO that's shared with the common path. This patch moves them into shared helpers to reduce code duplication. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
288379f0 |
|
19-Jan-2009 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Remove redundant NAPI functions Following the removal of the unused struct net_device * parameter from the NAPI functions named *netif_rx_* in commit 908a7a1, they are exactly equivalent to the corresponding *napi_* functions and are therefore redundant. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
937f1ba5 |
|
14-Jan-2009 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
net: Add init_dummy_netdev() and fix EMAC driver using it This adds an init_dummy_netdev() function that gets a network device structure (allocation and lifetime entirely under caller's control) and initialize the minimum amount of fields so it can be used to schedule NAPI polls without registering a full blown interface. This is to be used by drivers that need to tie several hardware interfaces to a single NAPI poll scheduler due to HW limitations. It also updates the ibm_newemac driver to use that, this fixing the oops on 2.6.29 due to passing NULL as "dev" to netif_napi_add() Symbol is exported GPL only a I don't think we want binary drivers doing that sort of acrobatics (if we want them at all). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
985ebdb5 |
|
12-Jan-2009 |
Krzysztof Hałasa <khc@pm.waw.pl> |
net: Fix a comment in include/linux/netdevice.h. Fix a comment in include/linux/netdevice.h. Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
96e93eab |
|
06-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Add internal interfaces for VLAN Previously GRO's only entry point from the outside is through napi_gro_receive and napi_gro_frags. These interfaces are for device drivers. This patch rearranges things to provide a new set of interfaces for VLANs. These interfaces are for internal use only. The VLAN code itself can then provide a set of entry points for device drivers. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f67b4599 |
|
06-Jan-2009 |
Dan Williams <dan.j.williams@intel.com> |
net_dma: convert to dma_find_channel Use the general-purpose channel allocation provided by dmaengine. Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
5d38a079 |
|
04-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Add page frag support This patch allows GRO to merge page frags (skb_shinfo(skb)->frags) in one skb, rather than using the less efficient frag_list. It also adds a new interface, napi_gro_frags to allow drivers to inject page frags directly into the stack without allocating an skb. This is intended to be the GRO equivalent for LRO's lro_receive_frags interface. The existing GSO interface can already handle page frags with or without an appended frag_list so nothing needs to be changed there. The merging itself is rather simple. We store any new frag entries after the last existing entry, without checking whether the first new entry can be merged with the last existing entry. Making this check would actually be easy but since no existing driver can produce contiguous frags anyway it would just be mental masturbation. If the total number of entries would exceed the capacity of a single skb, we simply resort to using frag_list as we do now. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
908a7a16 |
|
22-Dec-2008 |
Neil Horman <nhorman@tuxdriver.com> |
net: Remove unused netdev arg from some NAPI interfaces. When the napi api was changed to separate its 1:1 binding to the net_device struct, the netif_rx_[prep|schedule|complete] api failed to remove the now vestigual net_device structure parameter. This patch cleans up that api by properly removing it.. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d565b0a1 |
|
16-Dec-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
net: Add Generic Receive Offload infrastructure This patch adds the top-level GRO (Generic Receive Offload) infrastructure. This is pretty similar to LRO except that this is protocol-independent. Instead of holding packets in an lro_mgr structure, they're now held in napi_struct. For drivers that intend to use this, they can set the NETIF_F_GRO bit and call napi_gro_receive instead of netif_receive_skb or just call netif_rx. The latter will call napi_receive_skb automatically. When napi_gro_receive is used, the driver must either call napi_complete/napi_rx_complete, or call napi_gro_flush in softirq context if the driver uses the primitives __napi_complete/__napi_rx_complete. Protocols will set the gro_receive and gro_complete function pointers in order to participate in this scheme. In addition to the packet, gro_receive will get a list of currently held packets. Each packet in the list has a same_flow field which is non-zero if it is a potential match for the new packet. For each packet that may match, they also have a flush field which is non-zero if the held packet must not be merged with the new packet. Once gro_receive has determined that the new skb matches a held packet, the held packet may be processed immediately if the new skb cannot be merged with it. In this case gro_receive should return the pointer to the existing skb in gro_list. Otherwise the new skb should be merged into the existing packet and NULL should be returned, unless the new skb makes it impossible for any further merges to be made (e.g., FIN packet) where the merged skb should be returned. Whenever the skb is merged into an existing entry, the gro_receive function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb merely matches an existing entry but can't be merged with it, then this shouldn't be set. If gro_receive finds it pointless to hold the new skb for future merging, it should set NAPI_GRO_CB(skb)->flush. Held packets will be flushed by napi_gro_flush which is called by napi_complete and napi_rx_complete. Currently held packets are stored in a singly liked list just like LRO. The list is limited to a maximum of 8 entries. In future, this may be expanded to use a hash table to allow more flows to be held for merging. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1a881f27 |
|
16-Dec-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
net: Add frag_list support to GSO This patch allows GSO to handle frag_list in a limited way for the purposes of allowing packets merged by GRO to be refragmented on output. Most hardware won't (and aren't expected to) support handling GRO frag_list packets directly. Therefore we will perform GSO in software for those cases. However, for drivers that can support it (such as virtual NICs) we may not have to segment the packets at all. Whether the added overhead of GRO/GSO is worthwhile for bridges and routers when weighed against the benefit of potentially increasing the MTU within the host is still an open question. However, for the case of host nodes this is undoubtedly a win. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7b363e44 |
|
10-Dec-2008 |
Neil Horman <nhorman@tuxdriver.com> |
netpoll: fix race on poll_list resulting in garbage entry A few months back a race was discused between the netpoll napi service path, and the fast path through net_rx_action: http://kerneltrap.org/mailarchive/linux-netdev/2007/10/16/345470 A patch was submitted for that bug, but I think we missed a case. Consider the following scenario: INITIAL STATE CPU0 has one napi_struct A on its poll_list CPU1 is calling netpoll_send_skb and needs to call poll_napi on the same napi_struct A that CPU0 has on its list CPU0 CPU1 net_rx_action poll_napi !list_empty (returns true) locks poll_lock for A poll_one_napi napi->poll netif_rx_complete __napi_complete (removes A from poll_list) list_entry(list->next) In the above scenario, net_rx_action assumes that the per-cpu poll_list is exclusive to that cpu. netpoll of course violates that, and because the netpoll path can dequeue from the poll list, its possible for CPU0 to detect a non-empty list at the top of the while loop in net_rx_action, but have it become empty by the time it calls list_entry. Since the poll_list isn't surrounded by any other structure, the returned data from that list_entry call in this situation is garbage, and any number of crashes can result based on what exactly that garbage is. Given that its not fasible for performance reasons to place exclusive locks arround each cpus poll list to provide that mutal exclusion, I think the best solution is modify the netpoll path in such a way that we continue to guarantee that the poll_list for a cpu is in fact exclusive to that cpu. To do this I've implemented the patch below. It adds an additional bit to the state field in the napi_struct. When executing napi->poll from the netpoll_path, this bit will be set. When a driver calls netif_rx_complete, if that bit is set, it will not remove the napi_struct from the poll_list. That work will be saved for the next iteration of net_rx_action. I've tested this and it seems to work well. About the biggest drawback I can see to it is the fact that it might result in an extra loop through net_rx_action in the event that the device is actually contended for (i.e. the netpoll path actually preforms all the needed work no the device, and the call to net_rx_action winds up doing nothing, except removing the napi_struct from the poll_list. However I think this is probably a small price to pay, given that the alternative is a crash. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b74ca3a8 |
|
08-Dec-2008 |
Wang Chen <wangchen@cn.fujitsu.com> |
netdevice: Kill netdev->priv This is the last shoot of this series. After I removing all directly reference of netdev->priv, I am killing "priv" of "struct net_device" and fixing relative comments/docs. Anyone will not be allowed to reference netdev->priv directly. If you want to reference the memory of private data, use netdev_priv() instead. If the private data is not allocted when alloc_netdev(), use netdev->ml_priv to point that memory after you creating that private data. Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7a6b6f51 |
|
25-Nov-2008 |
Jeff Kirsher <jeffrey.t.kirsher@intel.com> |
DCB: fix kconfig option Since the netlink option for DCB is necessary to actually be useful, simplified the Kconfig option. In addition, added useful help text for the Kconfig option. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
47fd5b83 |
|
25-Nov-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
netdev: add HAVE_NET_DEVICE_OPS As a concession to vendors who have to deal with one source for different kernel versions, add a HAVE_NET_DEVICE_OPS so they don't end up hard coding ifdef against kernel version. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2f90b865 |
|
20-Nov-2008 |
Alexander Duyck <alexander.h.duyck@intel.com> |
ixgbe: this patch adds support for DCB to the kernel and ixgbe driver This adds support for Data Center Bridging (DCB) features in the ixgbe driver and adds an rtnetlink interface for configuring DCB to the kernel. The DCB feature support included are Priority Grouping (PG) - which allows bandwidth guarantees to be allocated to groups to traffic based on the 802.1q priority, and Priority Based Flow Control (PFC) - which introduces a new MAC control PAUSE frame which works at granularity of the 802.1p priority instead of the link (IEEE 802.3x). Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
00829823 |
|
20-Nov-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
netdev: add more functions to netdevice ops This patch moves neigh_setup and hard_start_xmit into the network device ops structure. For bisection, fix all the previously converted drivers as well. Bonding driver took the biggest hit on this. Added a prefetch of the hard_start_xmit in the fast path to try and reduce any impact this would have. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eeda3fd6 |
|
19-Nov-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
netdev: introduce dev_get_stats() In order for the network device ops get_stats call to be immutable, the handling of the default internal network device stats block has to be changed. Add a new helper function which replaces the old use of internal_get_stats. Note: change return code to make it clear that the caller should not go changing the returned statistics. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d314774c |
|
19-Nov-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
netdev: network device operations infrastructure This patch changes the network device internal API to move adminstrative operations out of the network device structure and into a separate structure. This patch involves some hackery to maintain compatablity between the new and old model, so all 300+ drivers don't have to be changed at once. For drivers that aren't converted yet, the netdevice_ops virt function list still resides in the net_device structure. For old protocols, the new net_device_ops are copied out to the old net_device pointers. After the transistion is completed the nag message can be changed to an WARN_ON, and the compatiablity code can be made configurable. Some function pointers aren't moved: * destructor can't be in net_device_ops because it may need to be referenced after the module is unloaded. * neighbor setup is manipulated in a couple of places that need special consideration * hard_start_xmit is in the fast path for transmit. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
505d4f73 |
|
07-Nov-2008 |
Eric W. Biederman <ebiederm@maxwell.aristanetworks.com> |
net: Guaranetee the proper ordering of the loopback device. v2 I was recently hunting a bug that occurred in network namespace cleanup. In looking at the code it became apparrent that we have and will continue to have cases where if we have anything going on in a network namespace there will be assumptions that the loopback device is present. Things like sending igmp unsubscribe messages when we bring down network devices invokes the routing code which assumes that at least the loopback driver is present. Therefore to avoid magic initcall ordering hackery that is hard to follow and hard to get right insert a call to register the loopback device directly from net_dev_init(). This guarantes that the loopback device is the first device registered and the last network device to go away. But do it carefully so we register the loopback device after we clear dev_boot_phase. Signed-off-by: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d8160b1 |
|
07-Nov-2008 |
David S. Miller <davem@davemloft.net> |
Revert "net: Guaranetee the proper ordering of the loopback device." This reverts commit ae33bc40c0d96d02f51a996482ea7e41c5152695.
|
#
ae33bc40 |
|
05-Nov-2008 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Guaranetee the proper ordering of the loopback device. I was recently hunting a bug that occurred in network namespace cleanup. In looking at the code it became apparrent that we have and will continue to have cases where if we have anything going on in a network namespace there will be assumptions that the loopback device is present. Things like sending igmp unsubscribe messages when we bring down network devices invokes the routing code which assumes that at least the loopback driver is present. Therefore to avoid magic initcall ordering hackery that is hard to follow and hard to get right insert a call to register the loopback device directly from net_dev_init(). This guarantes that the loopback device is the first device registered and the last network device to go away. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6cf3f41e |
|
03-Nov-2008 |
Jay Vosburgh <fubar@us.ibm.com> |
bonding, net: Move last_rx update into bonding recv logic The only user of the net_device->last_rx field is bonding. This patch adds a conditional update of last_rx to the bonding special logic in skb_bond_should_drop, causing last_rx to only be updated when the ARP monitor is running. This frees network device drivers from the necessity of updating last_rx, which can have cache line thrash issues. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ad1d967c |
|
31-Oct-2008 |
Randy Dunlap <randy.dunlap@oracle.com> |
net: delete excess kernel-doc notation Remove excess kernel-doc function parameters from networking header & driver files: Warning(include/net/sock.h:946): Excess function parameter or struct member 'sk' description in 'sk_filter_release' Warning(include/linux/netdevice.h:1545): Excess function parameter or struct member 'cpu' description in 'netif_tx_lock' Warning(drivers/net/wan/z85230.c:712): Excess function parameter or struct member 'regs' description in 'z8530_interrupt' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b63365a2 |
|
23-Oct-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
net: Fix disjunct computation of netdev features My change commit e2a6b85247aacc52d6ba0d9b37a99b8d1a3e0d83 net: Enable TSO if supported by at least one device didn't do what was intended because the netdev_compute_features function was designed for conjunctions. So what happened was that it would simply take the TSO status of the last constituent device. This patch extends it to support both conjunctions and disjunctions under the new name of netdev_increment_features. It also adds a new function netdev_fix_features which does the sanity checking that usually occurs upon registration. This ensures that the computation doesn't result in an illegal combination since this checking is absent when the change is initiated via ethtool. The two users of netdev_compute_features have been converted. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
113aa838 |
|
13-Oct-2008 |
Alan Cox <alan@redhat.com> |
net: Rationalise email address: Network Specific Parts Clean up the various different email addresses of mine listed in the code to a single current and valid address. As Dave says his network merges for 2.6.28 are now done this seems a good point to send them in where they won't risk disrupting real changes. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
396138f0 |
|
07-Oct-2008 |
Lennert Buytenhek <buytenh@wantstofly.org> |
dsa: add support for Trailer tagging format This adds support for the Trailer switch tagging format. This is another tagging that doesn't explicitly mark tagged packets with a distinct ethertype, so that we need to add a similar hack in the receive path as for the Original DSA tagging format. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Tested-by: Byron Bradley <byron.bbradley@gmail.com> Tested-by: Tim Ellis <tim.ellis@mac.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf85d08f |
|
07-Oct-2008 |
Lennert Buytenhek <buytenh@wantstofly.org> |
dsa: add support for original DSA tagging format Most of the DSA switches currently in the field do not support the Ethertype DSA tagging format that one of the previous patches added support for, but only the original DSA tagging format. The original DSA tagging format carries the same information as the Ethertype DSA tagging format, but with the difference that it does not have an ethertype field. In other words, when receiving a packet that is tagged with an original DSA tag, there is no way of telling in eth_type_trans() that this packet is in fact a DSA-tagged packet. This patch adds a hook into eth_type_trans() which is only compiled in if support for a switch chip that doesn't support Ethertype DSA is selected, and which checks whether there is a DSA switch driver instance attached to this network device which uses the old tag format. If so, it sets the protocol field to ETH_P_DSA without looking at the packet, so that the packet ends up in the right place. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Tested-by: Nicolas Pitre <nico@marvell.com> Tested-by: Peter van Valderen <linux@ddcrew.com> Tested-by: Dirk Teurlings <dirk@upexia.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
91da11f8 |
|
07-Oct-2008 |
Lennert Buytenhek <buytenh@wantstofly.org> |
net: Distributed Switch Architecture protocol support Distributed Switch Architecture is a protocol for managing hardware switch chips. It consists of a set of MII management registers and commands to configure the switch, and an ethernet header format to signal which of the ports of the switch a packet was received from or is intended to be sent to. The switches that this driver supports are typically embedded in access points and routers, and a typical setup with a DSA switch looks something like this: +-----------+ +-----------+ | | RGMII | | | +-------+ +------ 1000baseT MDI ("WAN") | | | 6-port +------ 1000baseT MDI ("LAN1") | CPU | | ethernet +------ 1000baseT MDI ("LAN2") | |MIImgmt| switch +------ 1000baseT MDI ("LAN3") | +-------+ w/5 PHYs +------ 1000baseT MDI ("LAN4") | | | | +-----------+ +-----------+ The switch driver presents each port on the switch as a separate network interface to Linux, polls the switch to maintain software link state of those ports, forwards MII management interface accesses to those network interfaces (e.g. as done by ethtool) to the switch, and exposes the switch's hardware statistics counters via the appropriate Linux kernel interfaces. This initial patch supports the MII management interface register layout of the Marvell 88E6123, 88E6161 and 88E6165 switch chips, and supports the "Ethertype DSA" packet tagging format. (There is no officially registered ethertype for the Ethertype DSA packet format, so we just grab a random one. The ethertype to use is programmed into the switch, and the switch driver uses the value of ETH_P_EDSA for this, so this define can be changed at any time in the future if the one we chose is allocated to another protocol or if Ethertype DSA gets its own officially registered ethertype, and everything will continue to work.) Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Tested-by: Nicolas Pitre <nico@marvell.com> Tested-by: Byron Bradley <byron.bbradley@gmail.com> Tested-by: Tim Ellis <tim.ellis@mac.com> Tested-by: Peter van Valderen <linux@ddcrew.com> Tested-by: Dirk Teurlings <dirk@upexia.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf04a4c7 |
|
30-Sep-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
netdev: use const for some name functions dev_change_name and netdev_drivername should use const char on parameters that are read-only input values. The strcpy to newname is not needed since newname is not used later in function. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0b815a1a |
|
22-Sep-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
net: network device name ifalias support This patch add support for keeping an additional character alias associated with an network interface. This is useful for maintaining the SNMP ifAlias value which is a user defined value. Routers use this to hold information like which circuit or line it is connected to. It is just an arbitrary text label on the network device. There are two exposed interfaces with this patch, the value can be read/written either via netlink or sysfs. This could be maintained just by the snmp daemon, but it is more generally useful for other management tools, and the kernel is good place to act as an agreed upon interface to store it. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cc6533e9 |
|
05-Aug-2008 |
David S. Miller <davem@davemloft.net> |
net: Kill plain NET_XMIT_BYPASS. dst_input() was doing something completely absurd, looping on skb->dst->input() if NET_XMIT_BYPASS was seen, but these functions never return such an error. And as a result plain ole' NET_XMIT_BYPASS has no more references and can be completely killed off. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
378a2f09 |
|
04-Aug-2008 |
Jarek Poplawski <jarkao2@gmail.com> |
net_sched: Add qdisc __NET_XMIT_STOLEN flag Patrick McHardy <kaber@trash.net> noticed: "The other problem that affects all qdiscs supporting actions is TC_ACT_QUEUED/TC_ACT_STOLEN getting mapped to NET_XMIT_SUCCESS even though the packet is not queued, corrupting upper qdiscs' qlen counters." and later explained: "The reason why it translates it at all seems to be to not increase the drops counter. Within a single qdisc this could be avoided by other means easily, upper qdiscs would still increase the counter when we return anything besides NET_XMIT_SUCCESS though. This means we need a new NET_XMIT return value to indicate this to the upper qdiscs. So I'd suggest to introduce NET_XMIT_STOLEN, return that to upper qdiscs and translate it to NET_XMIT_SUCCESS in dev_queue_xmit, similar to NET_XMIT_BYPASS." David Miller <davem@davemloft.net> noticed: "Maybe these NET_XMIT_* values being passed around should be a set of bits. They could be composed of base meanings, combined with specific attributes. So you could say "NET_XMIT_DROP | __NET_XMIT_NO_DROP_COUNT" The attributes get masked out by the top-level ->enqueue() caller, such that the base meanings are the only thing that make their way up into the stack. If it's only about communication within the qdisc tree, let's simply code it that way." This patch is trying to realize these ideas. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c3f26a26 |
|
31-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Fix lockdep warnings in multiqueue configurations. When support for multiple TX queues were added, the netif_tx_lock() routines we converted to iterate over all TX queues and grab each queue's spinlock. This causes heartburn for lockdep and it's not a healthy thing to do with lots of TX queues anyways. So modify this to use a top-level lock and a "frozen" state for the individual TX queues. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d29f749e |
|
22-Jul-2008 |
Dave Jones <davej@redhat.com> |
net: Fix build failure with 'make mandocs'. The function header comments have to go with the functions they are documenting, or things go horribly wrong when we try to process them with the docbook tools. Warning(include/linux/netdevice.h:1006): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1033): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1067): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1093): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1474): No description found for parameter 'txq' Error(net/core/dev.c:1674): cannot understand prototype: 'u32 simple_tx_hashrnd; ' Signed-off-by: Dave Jones <davej@redhat.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6579e57b |
|
21-Jul-2008 |
Arjan van de Ven <arjan@linux.intel.com> |
net: Print the module name as part of the watchdog message As suggested by Dave: This patch adds a function to get the driver name from a struct net_device, and consequently uses this in the watchdog timeout handler to print as part of the message. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
30723673 |
|
18-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Manage qdisc list inside of root qdisc. Idea is from Patrick McHardy. Instead of managing the list of qdiscs on the device level, manage it in the root qdisc of a netdev_queue. This solves all kinds of visibility issues during qdisc destruction. The way to iterate over all qdiscs of a netdev_queue is to visit the netdev_queue->qdisc, and then traverse it's list. The only special case is to ignore builting qdiscs at the root when dumping or doing a qdisc_lookup(). That was not needed previously because builtin qdiscs were not added to the device's qdisc_list. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
83874000 |
|
17-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Kill netdev_queue lock. We can simply use the qdisc->q.lock for all of the qdisc tree synchronization. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ead81cc5 |
|
17-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdevice: Move qdisc_list back into net_device proper. And give it it's own lock. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37437bb2 |
|
16-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Schedule qdiscs instead of netdev_queue. When we have shared qdiscs, packets come out of the qdiscs for multiple transmit queues. Therefore it doesn't make any sense to schedule the transmit queue when logically we cannot know ahead of time the TX queue of the SKB that the qdisc->dequeue() will give us. Just for sanity I added a BUG check to make sure we never get into a state where the noop_qdisc is scheduled. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e2627c8c |
|
16-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Make QDISC_RUNNING a qdisc state. Currently it is associated with a netdev_queue, but when we have qdisc sharing that no longer makes any sense. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3b753db |
|
15-Jul-2008 |
David S. Miller <davem@davemloft.net> |
pkt_sched: Move gso_skb into Qdisc. We liberate any dangling gso_skb during qdisc destruction. It really only matters for the root qdisc. But when qdiscs can be shared by multiple netdev_queue objects, we can't have the gso_skb in the netdev_queue any more. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
92831bc3 |
|
15-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Kill plain netif_schedule() No more users. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eae792b7 |
|
15-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Add netdev->select_queue() method. Devices or device layers can set this to control the queue selection performed by dev_pick_tx(). This function runs under RCU protection, which allows overriding functions to have some way of synchronizing with things like dynamic ->real_num_tx_queues adjustments. This makes the spinlock prefetch in dev_queue_xmit() a little bit less effective, but that's the price right now for correctness. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e3c50d5d |
|
15-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: netdev_priv() can now be sane again. The private area of a netdev is now at a fixed offset once more. Unfortunately, some assumptions that netdev_priv() == netdev->priv crept back into the tree. In particular this happened in the loopback driver. Make it use netdev->ml_priv. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6b0fb126 |
|
15-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Kill struct net_device_subqueue and netdev->egress_subqueue* No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fd2ea0a7 |
|
17-Jul-2008 |
David S. Miller <davem@davemloft.net> |
net: Use queue aware tests throughout. This effectively "flips the switch" by making the core networking and multiqueue-aware drivers use the new TX multiqueue structures. Non-multiqueue drivers need no changes. The interfaces they use such as netif_stop_queue() degenerate into an operation on TX queue zero. So everything "just works" for them. Code that really wants to do "X" to all TX queues now invokes a routine that does so, such as netif_tx_wake_all_queues(), netif_tx_stop_all_queues(), etc. pktgen and netpoll required a little bit more surgery than the others. In particular the pktgen changes, whilst functional, could be largely improved. The initial check in pktgen_xmit() will sometimes check the wrong queue, which is mostly harmless. The thing to do is probably to invoke fill_packet() earlier. The bulk of the netpoll changes is to make the code operate solely on the TX queue indicated by by the SKB queue mapping. Setting of the SKB queue mapping is entirely confined inside of net/core/dev.c:dev_pick_tx(). If we end up needing any kind of special semantics (drops, for example) it will be implemented here. Finally, we now have a "real_num_tx_queues" which is where the driver indicates how many TX queues are actually active. With IGB changes from Jeff Kirsher. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
09e83b5d |
|
17-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Kill NETIF_F_MULTI_QUEUE. There is no need for a feature bit for something that can be tested by simply checking the TX queue count. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e8a0464c |
|
17-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Allocate multiple queues for TX. alloc_netdev_mq() now allocates an array of netdev_queue structures for TX, based upon the queue_count argument. Furthermore, all accesses to the TX queues are now vectored through the netdev_get_tx_queue() and netdev_for_each_tx_queue() interfaces. This makes it easy to grep the tree for all things that want to get to a TX queue of a net device. Problem spots which are not really multiqueue aware yet, and only work with one queue, can easily be spotted by grepping for all netdev_get_tx_queue() calls that pass in a zero index. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e308a5d8 |
|
15-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Add netdev->addr_list_lock protection. Add netif_addr_{lock,unlock}{,_bh}() helpers. Use them to protect operations that operate on or read the network device unicast and multicast address lists. Also use them in cases where the code simply wants to block calls into the driver's ->set_rx_mode() and ->set_multicast_list() methods. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f1f28aa3 |
|
15-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Add addr_list_lock to struct net_device. This will be used to protect the per-device unicast and multicast address lists, as well as the callbacks into the drivers which configure such state such as ->set_rx_mode() and ->set_multicast_list(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bc1d0411 |
|
14-Jul-2008 |
Patrick McHardy <kaber@trash.net> |
vlan: deliver packets received with VLAN acceleration to network taps When VLAN header stripping is used, packets currently bypass packet sockets (and other network taps) completely. For locally existing VLANs, they appear directly on the VLAN device, for unknown VLANs they are silently dropped. Add a new function netif_nit_deliver() to deliver incoming packets to all network interface taps and use it in __vlan_hwaccel_rx() to make VLAN packets visible on the underlying device. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d8156534 |
|
08-Jul-2008 |
Alexander Duyck <alexander.h.duyck@intel.com> |
net: add netif_napi_del function to allow for removal of napistructs Adds netif_napi_del function which is used to remove the napi struct from the netdev napi_list in cases where CONFIG_NETPOLL was enabled. The motivation for adding this is to handle the case in which the number of queues on a device changes due to a configuration change. Previously the napi structs for each queue would be left in the list until the netdev was freed. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
|
#
79d16385 |
|
09-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Move atomic queue state bits into netdev_queue. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b19fa1fa |
|
09-Jul-2008 |
David S. Miller <davem@davemloft.net> |
net: Delete NETDEVICES_MULTIQUEUE kconfig option. Multiple TX queue support is a core networking feature. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c773e847 |
|
09-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Move _xmit_lock and xmit_lock_owner into netdev_queue. Accesses are mostly structured such that when there are multiple TX queues the code transformations will be a little bit simpler. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
86d804e1 |
|
09-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Make netif_schedule() routines work with netdev_queue objects. Only plain netif_schedule() remains taking a net_device, mostly as a compatability item while we transition the rest of these interfaces. Everything else calls netif_schedule_queue() or __netif_schedule(), both of which take a netdev_queue pointer. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
970565bb |
|
09-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Move gso_skb into netdev_queue. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ee609cb3 |
|
08-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Move next_sched into struct netdev_queue. We schedule queues, not the device, for output queue processing in BH. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
816f3258 |
|
08-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Kill qdisc_ingress, use netdev->rx_queue.qdisc instead. Now that our qdisc management is bi-directional, per-queue, and fully orthogonal, there is no reason to have a special ingress qdisc pointer in struct net_device. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b0e1e646 |
|
08-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Move rest of qdisc state into struct netdev_queue Now qdisc, qdisc_sleeping, and qdisc_list also live there. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
555353cf |
|
08-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: The ingress_lock member is no longer needed. Every qdisc is assosciated with a queue, and in the case of ingress qdiscs that will now be netdev->rx_queue so using that queue's lock is the thing to do. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dc2b4847 |
|
08-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Move queue_lock into struct netdev_queue. The lock is now an attribute of the device queue. One thing to notice is that "suspicious" places emerge which will need specific training about multiple queue handling. They are so marked with explicit "netdev->rx_queue" and "netdev->tx_queue" references. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bb949fbd |
|
08-Jul-2008 |
David S. Miller <davem@davemloft.net> |
netdev: Create netdev_queue abstraction. A netdev_queue is an entity managed by a qdisc. Currently there is one RX and one TX queue, and a netdev_queue merely contains a backpointer to the net_device. The Qdisc struct is augmented with a netdev_queue pointer as well. Eventually the 'dev' Qdisc member will go away and we will have the resulting hierarchy: net_device --> netdev_queue --> Qdisc Also, qdisc_alloc() and qdisc_create_dflt() now take a netdev_queue pointer argument. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
eca9ebac |
|
05-Jul-2008 |
Patrick McHardy <kaber@trash.net> |
net: Add GARP applicant-only participant Add an implementation of the GARP (Generic Attribute Registration Protocol) applicant-only participant. This will be used by the following patch to add GVRP support to the VLAN code. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c88e6f51 |
|
27-Jun-2008 |
Adrian Bunk <bunk@kernel.org> |
include/linux/netdevice.h: don't export MAX_HEADER to userspace Due to the CONFIG_'s the value is anyway not correct in userspace. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0187bdfb |
|
19-Jun-2008 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Disable LRO on devices that are forwarding Large Receive Offload (LRO) is only appropriate for packets that are destined for the host, and should be disabled if received packets may be forwarded. It can also confuse the GSO on output. Add dev_disable_lro() function which uses the appropriate ethtool ops to disable LRO if enabled. Add calls to dev_disable_lro() in br_add_if() and functions that enable IPv4 and IPv6 forwarding. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
dad9b335 |
|
18-Jun-2008 |
Wang Chen <wangchen@cn.fujitsu.com> |
netdevice: Fix promiscuity and allmulti overflow Max of promiscuity and allmulti plus positive @inc can cause overflow. Fox example: when allmulti=0xFFFFFFFF, any caller give dev_set_allmulti() a positive @inc will cause allmulti be off. This is not what we want, though it's rare case. The fix is that only negative @inc will cause allmulti or promiscuity be off and when any caller makes the counters touch the roof, we return error. Change of v2: Change void function dev_set_promiscuity/allmulti to return int. So callers can get the overflow error. Caller's fix will be done later. Change of v3: 1. Since we return error to caller, we don't need to print KERN_ERROR, KERN_WARNING is enough. 2. In dev_set_promiscuity(), if __dev_set_promiscuity() failed, we return at once. Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9d45abe1 |
|
17-Jun-2008 |
Wang Chen <wangchen@cn.fujitsu.com> |
netdevice: change net_device->promiscuity/allmulti to unsigned int The comments of dev_set_allmulti/promiscuity() is that "While the count in the device remains above zero...". So negative count is useless. Fix the type of the counter from "int" to "unsigned int". Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b8a9787e |
|
13-Jun-2008 |
Jay Vosburgh <fubar@us.ibm.com> |
bonding: Allow setting max_bonds to zero Permit bonding to function rationally if max_bonds is set to zero. This will load the module, but create no master devices (which can be created via sysfs). Requires some change to bond_create_sysfs; currently, the netdev sysfs directory is determined from the first bonding device created, but this is no longer possible. Instead, an interface from net/core is created to create and destroy files in net_class. Based on a patch submitted by Phil Oester <kernel@linuxaces.com>. Modified by Jay Vosburgh to fix the sysfs issue mentioned above and to update the documentation. Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
|
#
c1da4ac7 |
|
13-Jun-2008 |
Or Gerlitz <ogerlitz@voltaire.com> |
net/core: add NETDEV_BONDING_FAILOVER event Add NETDEV_BONDING_FAILOVER event to be used in a successive patch by bonding to announce fail-over for the active-backup mode through the netdev events notifier chain mechanism. Such an event can be of use for the RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER) always be aligned with the IP stack, in the sense that they use the same ports/links as the stack does. More usages can be done to allow monitoring tools based on netlink events being aware to bonding fail-over. Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
|
#
289c79a4 |
|
23-May-2008 |
Patrick McHardy <kaber@trash.net> |
vlan: Use bitmask of feature flags instead of seperate feature bits Herbert Xu points out that the use of seperate feature bits for features to be propagated to VLAN devices is going to get messy real soon. Replace the VLAN feature bits by a bitmask of feature flags to be propagated and restore the old GSO_SHIFT/MASK values. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5fb13570 |
|
20-May-2008 |
Patrick McHardy <kaber@trash.net> |
[VLAN]: Propagate selected feature bits to VLAN devices Propagate feature bits from the NETDEV_FEAT_CHANGE notifier. For now only TSO is propagated for devices that announce their ability to support TSO in combination with VLAN accel by setting the NETIF_F_VLAN_TSO flag. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5184d26 |
|
12-May-2008 |
Johannes Berg <johannes@sipsolutions.net> |
net: Allow netdevices to specify needed head/tailroom This patch adds needed_headroom/needed_tailroom members to struct net_device and updates many places that allocate sbks to use them. Not all of them can be converted though, and I'm sure I missed some (I mostly grepped for LL_RESERVED_SPACE) Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8388e3da |
|
12-May-2008 |
David S. Miller <davem@davemloft.net> |
net: Set LL_MAX_HEADER properly for wireless. Wireless networking, particularly with MESH enabled, has quite strong requirements for link-layer header space. Based upon some numbers and descriptions from Johannes Berg we use 96 (same as AX25) for plain wireless, and with mesh enabled we use 128. In the process, simplify the cpp conditional logic here by ordering the cases by those needing the most space down to those needing the least case. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4951704b |
|
12-May-2008 |
David S. Miller <davem@davemloft.net> |
syncppp: Fix crashes. The syncppp layer wants a mid-level netdev private pointer. It was using netdev->priv but that only worked by accident, and thus this scheme was broken when the device private allocation strategy changed. Add a proper mid-layer private pointer for uses like this, update syncppp and all users, and remove the HDLC_PPP broken tag from drivers/net/wan/Kconfig Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f3005d7f |
|
16-Apr-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Add netns refcnt debug for network devices. dev_set_net is called for - just allocated devices - devices moving from one namespace to another release_net has proper check inside to distinguish these cases. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a4aa834a |
|
03-Apr-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Declare init_net even without CONFIG_NET defined. This does not look good, but there is no other choice. The compilation without CONFIG_NET is broken and can not be fixed with ease. After that there is no need for the following commits: 1567ca7eec7664b8be3b07755ac59dc1b1ec76cb 3edf8fa5ccf10688a9280b5cbca8ed3947c42866 2d38f9a4f8d2ebdc799f03eecf82345825495711 Revert them. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3edf8fa5 |
|
31-Mar-2008 |
David S. Miller <davem@davemloft.net> |
[NET]: Fix allnoconfig build on powerpc and avr32 As reported by Haavard Skinnemoen and Stephen Rothwell: > allnoconfig fails with > > include/linux/netdevice.h:843: error: implicit declaration of function 'dev_net' > > which seems to be because the definition of dev_net is inside #ifdef > CONFIG_NET, while next_net_device, which calls it, is not. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0ef47309 |
|
28-Mar-2008 |
Matti Linnanvuori <mattilinnanvuori@yahoo.com> |
net: Comment dev_kfree_skb_irq and dev_kfree_skb_any better Comment dev_kfree_skb_irq and dev_kfree_skb_any better. Signed-off-by: Matti Linnanvuori <mattilinnanvuori@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1567ca7e |
|
28-Mar-2008 |
David S. Miller <davem@davemloft.net> |
[NET]: Protect device namespace inlines with CONFIG_NET Include sites should not be bothered by whether CONFIG_NET is set or not when trying to include benign files like linux/etherdevice.h et al. From a report by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
50fd4407 |
|
27-Mar-2008 |
David S. Miller <davem@davemloft.net> |
[NET]: Use local_irq_{save,restore}() in napi_complete(). Based upon a lockdep report. Since ->poll() can be invoked from netpoll with interrupts disabled, we must not unconditionally enable interrupts in napi_complete(). Instead we must use local_irq_{save,restore}(). Noticed by Peter Zijlstra: <irqs disabled> netpoll_poll() poll_napi() spin_trylock(&napi->poll_lock) poll_one_napi() napi->poll() := sky2_poll() napi_complete() local_irq_disable() local_irq_enable() <--- *BUG* <irq> irq_exit() do_softirq() net_rx_action() spin_lock(&napi->poll_lock) <--- Deadlock! Because we still hold the lock.... Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5aa23fd |
|
26-Mar-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Compilation warnings under CONFIG_NET_NS. Recent commits from YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> have been introduced a several compilation warnings 'assignment discards qualifiers from pointer target type' due to extra const modifier in the inline call parameters of {dev|sock|twsk}_net_set. Drop it. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
9c2f5746 |
|
26-Mar-2008 |
Denis V. Lunev <den@openvz.org> |
[NETNS]: Compilation fix for include/linux/netdevice.h. Commit commit c346dca10840a874240c78efe3f39acf4312a1f2 ([NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS) breaks compilation with CONFIG_NET_NS set. Fix the typo. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c346dca1 |
|
25-Mar-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS. Introduce per-net_device inlines: dev_net(), dev_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
#
82cc1a7a |
|
21-Mar-2008 |
Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> |
[NET]: Add per-connection option to set max TSO frame size Update: My mailer ate one of Jarek's feedback mails... Fixed the parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the whitespace issue due to a patch import botch. Changed the types from u32 to unsigned int to be more consistent with other variables in the area. Also brought the patch up to the latest net-2.6.26 tree. Update: Made gso_max_size container 32 bits, not 16. Moved the location of gso_max_size within netdev to be less hotpath. Made more consistent names between the sock and netdev layers, and added a define for the max GSO size. Update: Respun for net-2.6.26 tree. Update: changed max_gso_frame_size and sk_gso_max_size from signed to unsigned - thanks Stephen! This patch adds the ability for device drivers to control the size of the TSO frames being sent to them, per TCP connection. By setting the netdevice's gso_max_size value, the socket layer will set the GSO frame size based on that value. This will propogate into the TCP layer, and send TSO's of that size to the hardware. This can be desirable to help tune the bursty nature of TSO on a per-adapter basis, where one may have 1 GbE and 10 GbE devices coexisting in a system, one running multiqueue and the other not, etc. This can also be desirable for devices that cannot support full 64 KB TSO's, but still want to benefit from some level of segmentation offloading. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b3c97528 |
|
13-Feb-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
include/linux: Remove all users of FASTCALL() macro FASTCALL() is always expanded to empty, remove it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d5bd0146 |
|
13-Feb-2008 |
Neil Turton <nturton@solarflare.com> |
[NET]: Improve cache line coherency of ingress qdisc Move the ingress qdisc members of struct net_device from the transmit cache line to the receive cache line to avoid cache line ping-pong. These members are only used on the receive path. Signed-off-by: Neil Turton <nturton@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e83a2ea8 |
|
31-Jan-2008 |
Chris Leech <christopher.leech@intel.com> |
[VLAN]: set_rx_mode support for unicast address list Reuse the existing logic for multicast list synchronization for the unicast address list. The core of dev_mc_sync/unsync are split out as __dev_addr_sync/unsync and moved from dev_mcast.c to dev.c. These are then used to implement dev_unicast_sync/unsync as well. I'm working on cleaning up Intel's FCoE stack, which generates new MAC addresses from the fibre channel device id assigned by the fabric as per the current draft specification in T11. When using such a protocol in a VLAN environment it would be nice to not always be forced into promiscuous mode, assuming the underlying Ethernet driver supports multiple unicast addresses as well. Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
|
#
a0a46196 |
|
07-Jan-2008 |
David S. Miller <davem@davemloft.net> |
[NET]: Add NAPI_STATE_DISABLE. Create a bit to signal that a napi_disable() is in progress. This sets up infrastructure such that net_rx_action() can generically break out of the ->poll() loop on a NAPI context that has a pending napi_disable() yet is being bombed with packets (and thus would otherwise poll endlessly and not allow the napi_disable() to finish). Now, what napi_disable() does is first set the NAPI_STATE_DISABLE bit (to indicate that a disable is pending), then it polls for the NAPI_STATE_SCHED bit, and once the NAPI_STATE_SCHED bit is acquired the NAPI_STATE_DISABLE bit is cleared. Here, the test_and_set_bit() provides the necessary memory barrier between the various bitops. napi_schedule_prep() now tests for a pending disable as it's first action and won't try to obtain the NAPI_STATE_SCHED bit if a disable is pending. As a result, we can remove the netif_running() check in netif_rx_schedule_prep() because the NAPI disable pending state serves this purpose. And, it does so in a NAPI centric manner which is what we really want. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bdb95b17 |
|
07-Jan-2008 |
David S. Miller <davem@davemloft.net> |
[NET]: Do not grab device reference when scheduling a NAPI poll. It is pointless, because everything that can make a device go away will do a napi_disable() first. The main impetus behind this is that now we can legally do a NAPI completion in generic code like net_rx_action() which a following changeset needs to do. net_rx_action() can only perform actions in NAPI centric ways, because there may be a one to many mapping between NAPI contexts and network devices (SKY2 is one example). We also want to get rid of this because it's an extra atomic in the NAPI paths, and also because it is one of the last instances where the NAPI interfaces care about net devices. The one remaining netdev detail the NAPI stuff cares about is the netif_running() check which will be killed off in a subsequent changeset. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3b582cc1 |
|
01-Nov-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: docbook fixes for netif_ functions Documentation updates for network interfaces. 1. Add doc for netif_napi_add 2. Remove doc for unused returns from netif_rx 3. Add doc for netif_receive_skb [ Incorporated minor mods from Randy Dunlap -DaveM ] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
43cc7380 |
|
26-Oct-2007 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
[NET] napi: use non-interruptible sleep in napi_disable The current napi_disable() uses msleep_interruptible() but doesn't (and can't) exit in case there's a signal, thus ending up doing a hot spin without a cpu_relax. Use uninterruptible sleep instead. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Jeff Garzik <jeff@garzik.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bada339b |
|
23-Oct-2007 |
Jeff Garzik <jgarzik@redhat.com> |
[NET]: Validate device addr prior to interface-up Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f1ecfd5d |
|
22-Oct-2007 |
Ursula Braun <braunu@de.ibm.com> |
remove header_ops bug in qeth driver Remove qeth bug caused by commit: [NET]: Move hardware header operations out of netdevice. This is the second part of the qeth header_ops patch, since first patch sent 10/19 has been insufficient. Nevertheless first patch is still valid and should be kept. Signed-off-by: Ursula Braun <braunu@de.ibm.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
|
#
668f895a |
|
21-Oct-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Hide the queue_mapping field inside netif_subqueue_stopped Many places get the queue_mapping field from skb to pass it to the netif_subqueue_stopped() which will be 0 in any case. Make the helper that works with sk_buff Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1b83336b |
|
18-Oct-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: Fix OOPS due to missing check in dev_parse_header(). [ This is kernel bugzilla 9174 "linux-2.6.23-git11 kernel panic" ] The device in question is an IPv6-over-IPv4 tunnel, which doesn't have any header_ops, so the crash happens in dev_parse_header when dereferencing them. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c264c3de |
|
17-Oct-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
napi_synchronize: waiting for NAPI Some drivers with shared NAPI need a synchronization barrier. Also suggested by Benjamin Herrenschmidt for EMAC. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Jeff Garzik <jeff@garzik.org>
|
#
c4ea43c5 |
|
12-Oct-2007 |
Randy Dunlap <randy.dunlap@oracle.com> |
net core: fix kernel-doc for new function parameters Fix networking code kernel-doc for newly added parameters. Warning(linux-2.6.23-git2//net/core/sock.c:879): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:570): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:594): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:617): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:641): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:667): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:722): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:959): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:1195): No description found for parameter 'dev' Warning(linux-2.6.23-git2//net/core/dev.c:2105): No description found for parameter 'n' Warning(linux-2.6.23-git2//net/core/dev.c:3272): No description found for parameter 'net' Warning(linux-2.6.23-git2//net/core/dev.c:3445): No description found for parameter 'net' Warning(linux-2.6.23-git2//include/linux/netdevice.h:1301): No description found for parameter 'cpu' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
9ef4429b |
|
10-Oct-2007 |
Benjamin Thery <benjamin.thery@bull.net> |
[NET]: Fix dev_put() and dev_hold() comments Trivial fix: Swap comments for dev_put() and dev_hold() to get them at the right place. Typo introduced by 4fa57c9ea9f36f9ca852f3a88ca5d2f1aebbc960. Signed-of-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bfe13f54 |
|
09-Oct-2007 |
Roland Dreier <rdreier@cisco.com> |
ibm_emac: Convert to use napi_struct independent of struct net_device Commit da3dedd9 ("[NET]: Make NAPI polling independent of struct net_device objects.") changed the interface to NAPI polling. Fix up the ibm_emac driver so that it works with this new interface. This is actually a nice cleanup because ibm_emac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Tested with the internal MAC of a PowerPC 440SPe SoC with an AMCC 'Yucca' evaluation board. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
|
#
3b04ddde |
|
09-Oct-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: Move hardware header operations out of netdevice. Since hardware header operations are part of the protocol class not the device instance, make them into a separate object and save memory. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b95cce35 |
|
26-Sep-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: Wrap hard_header_parse Wrap the hard_header_parse function to simplify next step of header_ops conversion. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0c4e8581 |
|
09-Oct-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: Wrap netdevice hardware header creation. Add inline for common usage of hardware header creation, and fix bug in IPV6 mcast where the assumption about negative return is an errno. Negative return from hard_header means not enough space was available,(ie -N bytes). Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2774c7ab |
|
26-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Make the loopback device per network namespace. This patch makes loopback_dev per network namespace. Adding code to create a different loopback device for each network namespace and adding the code to free a loopback device when a network namespace exits. This patch modifies all users the loopback_dev so they access it as init_net.loopback_dev, keeping all of the code compiling and working. A later pass will be needed to update the users to use something other than the initial network namespace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e24eb521 |
|
25-Sep-2007 |
Christian Borntraeger <borntraeger@de.ibm.com> |
[NET]: note that NETIF_F_LLTX is deprecated Am Freitag, 21. September 2007 schrieb Herbert Xu: > Please don't use LLTX in new drivers. We're trying to get rid > of it since it's > > 1) unnecessary; > 2) causes problems with AF_PACKET seeing things twice. I suggest to document that LLTX is deprecated. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
de3cb747 |
|
25-Sep-2007 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NET]: Dynamically allocate the loopback device, part 1. This patch replaces all occurences to the static variable loopback_dev to a pointer loopback_dev. That provides the mindless, trivial, uninteressting change part for the dynamic allocation for the loopback. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Acked-By: Kirill Korotaev <dev@sw.ru> Acked-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ad7379d4 |
|
16-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Fix the prototype of call_netdevice_notifiers. This replaces the void * parameter with a struct net_device * which is what is actually required. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
22dd7495 |
|
16-Sep-2007 |
Jamal Hadi Salim <hadi@cyberus.ca> |
[NET]: migrate HARD_TX_LOCK to header file HARD_TX_LOCK micro is a nice aggregation that could be used in other spots. move it to netdevice.h Also makes sure the previously superflous cpu arguement is used. Thanks to DaveM for the suggestions. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
10d024c1 |
|
17-Sep-2007 |
Ralf Baechle <ralf@linux-mips.org> |
[NET]: Nuke SET_MODULE_OWNER macro. It's been a useless no-op for long enough in 2.6 so I figured it's time to remove it. The number of people that could object because they're maintaining unified 2.4 and 2.6 drivers is probably rather small. [ Handled drivers added by netdev tree and some missed IRDA cases... -DaveM ] Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Jeff Garzik <jeff@garzik.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a050c33f |
|
12-Sep-2007 |
Daniel Lezcano <dlezcano@fr.ibm.com> |
[NETNS]: Fix bad macro definition. The macro definition is bad. When calling next_net_device with parameter name "dev", the resulting code is: struct net_device *dev = dev and that leads to an unexpected behavior. Especially when llc_core is compiled in, the kernel panics at boot time. The patchset change macro definition with static inline functions as they were defined before. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ce286d32 |
|
12-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Implement network device movement between namespaces This patch introduces NETIF_F_NETNS_LOCAL a flag to indicate a network device is local to a single network namespace and should never be moved. Useful for pseudo devices that we need an instance in each network namespace (like the loopback device) and for any device we find that cannot handle multiple network namespaces so we may trap them in the initial network namespace. This patch introduces the function dev_change_net_namespace a function used to move a network device from one network namespace to another. To the network device nothing special appears to happen, to the components of the network stack it appears as if the network device was unregistered in the network namespace it is in, and a new device was registered in the network namespace the device was moved to. This patch sets up a namespace device destructor that upon the exit of a network namespace moves all of the movable network devices to the initial network namespace so they are not lost. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
881d966b |
|
17-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Make the device list and device lookups per namespace. This patch makes most of the generic device layer network namespace safe. This patch makes dev_base_head a network namespace variable, and then it picks up a few associated variables. The functions: dev_getbyhwaddr dev_getfirsthwbytype dev_get_by_flags dev_get_by_name __dev_get_by_name dev_get_by_index __dev_get_by_index dev_ioctl dev_ethtool dev_load wireless_process_ioctl were modified to take a network namespace argument, and deal with it. vlan_ioctl_set and brioctl_set were modified so their hooks will receive a network namespace argument. So basically anthing in the core of the network stack that was affected to by the change of dev_base was modified to handle multiple network namespaces. The rest of the network stack was simply modified to explicitly use &init_net the initial network namespace. This can be fixed when those components of the network stack are modified to handle multiple network namespaces. For now the ifindex generator is left global. Fundametally ifindex numbers are per namespace, or else we will have corner case problems with migration when we get that far. At the same time there are assumptions in the network stack that the ifindex of a network device won't change. Making the ifindex number global seems a good compromise until the network stack can cope with ifindex changes when you change namespaces, and the like. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4a1c5371 |
|
12-Sep-2007 |
Eric W. Biederman <ebiederm@xmission.com> |
[NET]: Add a network namespace tag to struct net_device Please note that network devices do not increase the count count on the network namespace. The are inside the network namespace and so the network namespace tag is in the nature of a back pointer and so getting and putting the network namespace is unnecessary. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3ae7c0b2 |
|
15-Aug-2007 |
Jeff Garzik <jeff@garzik.org> |
[ETHTOOL]: Add ETHTOOL_[GS]FLAGS sub-ioctls Signed-off-by: Jeff Garzik <jeff@garzik.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bea3348e |
|
03-Oct-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: Make NAPI polling independent of struct net_device objects. Several devices have multiple independant RX queues per net device, and some have a single interrupt doorbell for several queues. In either case, it's easier to support layouts like that if the structure representing the poll is independant from the net device itself. The signature of the ->poll() call back goes from: int foo_poll(struct net_device *dev, int *budget) to int foo_poll(struct napi_struct *napi, int budget) The caller is returned the number of RX packets processed (or the number of "NAPI credits" consumed if you want to get abstract). The callee no longer messes around bumping dev->quota, *budget, etc. because that is all handled in the caller upon return. The napi_struct is to be embedded in the device driver private data structures. Furthermore, it is the driver's responsibility to disable all NAPI instances in it's ->stop() device close handler. Since the napi_struct is privatized into the driver's private data structures, only the driver knows how to get at all of the napi_struct instances it may have per-device. With lots of help and suggestions from Rusty Russell, Roland Dreier, Michael Chan, Jeff Garzik, and Jamal Hadi Salim. Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra, Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan. [ Ported to current tree and all drivers converted. Integrated Stephen's follow-on kerneldoc additions, and restored poll_list handling to the old style to fix mutual exclusion issues. -DaveM ] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7f353bf2 |
|
10-Aug-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Share correct feature code between bridging and bonding http://bugzilla.kernel.org/show_bug.cgi?id=8797 shows that the bonding driver may produce bogus combinations of the checksum flags and SG/TSO. For example, if you bond devices with NETIF_F_HW_CSUM and NETIF_F_IP_CSUM you'll end up with a bonding device that has neither flag set. If both have TSO then this produces an illegal combination. The bridge device on the other hand has the correct code to deal with this. In fact, the same code can be used for both. So this patch moves that logic into net/core/dev.c and uses it for both bonding and bridging. In the process I've made small adjustments such as only setting GSO_ROBUST if at least one constituent device supports it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
31ce72a6 |
|
20-Jul-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: Fix loopback crashes when multiqueue is enabled. From: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
456ad75c |
|
18-Jul-2007 |
Denis Cheng <crquan@gmail.com> |
[NET]: move dev_mc_discard from dev_mcast.c to dev.c Because this function is only called by unregister_netdevice, this moving could make this non-global function static, and also remove its declaration in netdevice.h; Any further, function __dev_addr_discard is also just called by dev_mc_discard and dev_unicast_discard, keeping this two functions both in one c file could make __dev_addr_discard also static and remove its declaration in netdevice.h; Futhermore, the sequential call to dev_unicast_discard and then dev_mc_discard in unregister_netdevice have a similar mechanism that: (netif_tx_lock_bh / __dev_addr_discard / netif_tx_unlock_bh), they should merged into one to eliminate duplicates in acquiring and releasing the dev->_xmit_lock, this would be done in my following patch. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
2e27afb3 |
|
16-Jul-2007 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Revert "[NET]: Fix races in net_rx_action vs netpoll." This reverts commit 29578624e354f56143d92510fff33a8b2aaa2c03. Ingo Molnar reports complete breakage with his e1000 card (no networking, card reports transmit timeouts), and bisected it down to this commit. Let's figure out what went wrong, but not keep breaking machines until we do. Cc: Ingo Molnar <mingo@elte.hu> Cc: Olaf Kirch <olaf.kirch@oracle.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
b863ceb7 |
|
14-Jul-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: Add macvlan driver Add macvlan driver, which allows to create virtual ethernet devices based on MAC address. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a0a400d7 |
|
14-Jul-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: dev_mcast: add multicast list synchronization helpers The method drivers currently use to synchronize multicast lists is not very pretty: - walk the multicast list - search each entry on a copy of the previous list - if new add to lower device - walk the copy of the previous list - search each entry on the current list - if removed delete from lower device - copy entire list This patch adds a new field to struct dev_addr_list to store the synchronization state and adds two helper functions for synchronization and cleanup. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
24023451 |
|
14-Jul-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: Add net_device change_rx_mode callback Currently the set_multicast_list (and set_rx_mode) callbacks are responsible for configuring the device according to the IFF_PROMISC, IFF_MULTICAST and IFF_ALLMULTI flags and the mc_list (and uc_list in case of set_rx_mode). These callbacks can be invoked from BH context without the rtnl_mutex by dev_mc_add/dev_mc_delete, which makes reading the device flags and promiscous/allmulti count racy. For real hardware drivers that just commit all changes to the hardware this is not a real problem since the stack guarantees to call them for every change, so at least the final call will not race and commit the correct configuration to the hardware. For software devices that want to synchronize promiscous and multicast state to an underlying device however this can cause corruption of the underlying device's flags or promisc/allmulti counts. When the software device is concurrently put in promiscous or allmulti mode while set_multicast_list is invoked from bottem half context, the device might synchronize the change to the underlying device without holding the rtnl_mutex, which races with concurrent changes to the underlying device. Add a dev->change_rx_flags hook that is invoked when any of the flags that affect rx filtering change (under the rtnl_mutex), which allows drivers to perform synchronization immediately and only synchronize the address lists in set_multicast_list/set_rx_mode. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
29578624 |
|
11-Jul-2007 |
Olaf Kirch <olaf.kirch@oracle.com> |
[NET]: Fix races in net_rx_action vs netpoll. Keep netpoll/poll_napi from messing with the poll_list. Only net_rx_action is allowed to manipulate the list. Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
61cbc2fc |
|
30-Jun-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: Fix secondary unicast/multicast address count maintenance When a reference to an existing address is increased or decreased without hitting zero, the address count is incorrectly adjusted. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f25f4e44 |
|
06-Jul-2007 |
Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> |
[CORE] Stack changes to add multiqueue hardware support API Add the multiqueue hardware device support API to the core network stack. Allow drivers to allocate multiple queues and manage them at the netdev level if they choose to do so. Added a new field to sk_buff, namely queue_mapping, for drivers to know which tx_ring to select based on OS classification of the flow. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4417da66 |
|
27-Jun-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: dev: secondary unicast address support Add support for configuring secondary unicast addresses on network devices. To support this devices capable of filtering multiple unicast addresses need to change their set_multicast_list function to configure unicast filters as well and assign it to dev->set_rx_mode instead of dev->set_multicast_list. Other devices are put into promiscous mode when secondary unicast addresses are present. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3fba5a8b |
|
27-Jun-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: dev_mcast: switch to generic net_device address lists Use generic net_device address lists for multicast list handling. Some defines are used to keep drivers working. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bf742482 |
|
27-Jun-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: dev: introduce generic net_device address lists Introduce struct dev_addr_list and list maintenance functions based on dev_mc_list and the related functions. This will be used by follow-up patches for both multicast and secondary unicast addresses. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d212f87b |
|
27-Jun-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: IPV6 checksum offloading in network devices The existing model for checksum offload does not correctly handle devices that can offload IPV4 and IPV6 only. The NETIF_F_HW_CSUM flag implies device can do any arbitrary protocol. This patch: * adds NETIF_F_IPV6_CSUM for those devices * fixes bnx2 and tg3 devices that need it * add NETIF_F_IPV6_CSUM to ipv6 output (incl GSO) * fixes assumptions about NETIF_F_ALL_CSUM in nat * adjusts bridge union of checksumming computation Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
38f7b870 |
|
13-Jun-2007 |
Patrick McHardy <kaber@trash.net> |
[RTNETLINK]: Link creation API Add rtnetlink API for creating, changing and deleting software devices. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
6472ce60 |
|
13-Jun-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: Mark struct net_device * argument to netdev_priv const Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b0ba6667 |
|
29-May-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET] napi: Call __netif_rx_complete in netif_rx_complete This patch kills a little bit of code duplication between the two variants of netif_rx_complete. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
572a103d |
|
08-May-2007 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET] link_watch: Move link watch list into net_device These days the link watch mechanism is an integral part of the network subsystem as it manages the carrier status. So it now makes sense to allocate some memory for it in net_device rather than allocating it on demand. In fact, this is necessary because we can't tolerate a memory allocation failure since that means we'd have to potentially throw a link up event away. It also simplifies the code greatly. In doing so I discovered a subtle race condition in the use of singleevent. This race condition still exists (and is somewhat magnified) without singleevent but it's now plugged thanks to an smp_mb__before_clear_bit. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7562f876 |
|
03-May-2007 |
Pavel Emelianov <xemul@openvz.org> |
[NET]: Rework dev_base via list_head (v3) Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev->next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Kirill Korotaev <dev@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
4e9cac2b |
|
03-May-2007 |
Patrick McHardy <kaber@trash.net> |
[NET]: Add __dev_getfirstbyhwtype Add __dev_getfirstbyhwtype for callers that don't want a reference but some data from the device and thus need to take the rtnl anyway. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5a1b5898 |
|
28-Apr-2007 |
Rusty Russell <rusty@rustcorp.com.au> |
[NET]: Remove NETIF_F_INTERNAL_STATS, default to internal stats. Herbert Xu conviced me that a new flag was overkill; every driver currently overrides get_stats, so we might as well make the internal one the default. If someone did fail to set get_stats, they would now get all 0 stats instead of "No statistics available". Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5f286e11 |
|
28-Apr-2007 |
Sergei Shtylyov <sshtylyov@ru.mvista.com> |
[NETPOLL]: Fix TX queue overflow in trapped mode. CONFIG_NETPOLL_TRAP causes the TX queue controls to be completely bypassed in the netpoll's "trapped" mode which easily causes overflows in the drivers with short TX queues (most notably, in 8139too with its 4-deep queue). So, make this option more sensible by making it only bypass the TX softirq wakeup. Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Acked-by: Jeff Garzik <jgarzik@pobox.com> Acked-by: Tom Rini <trini@kernel.crashing.org> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b86e0280 |
|
26-Apr-2007 |
Johannes Berg <johannes@sipsolutions.net> |
[WEXT] net_device: Don't include wext bits if not required. This patch makes the wext bits in struct net_device depend on CONFIG_WIRELESS_EXT. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
704232c2 |
|
23-Apr-2007 |
Johannes Berg <johannes@sipsolutions.net> |
[WIRELESS] cfg80211: New wireless config infrastructure. This patch creates the core cfg80211 code along with some sysfs bits. This is a stripped down version to allow mac80211 to function, but doesn't include any configuration yet except for creating and removing virtual interfaces. This patch includes the nl80211 header file but it only contains the interface types which the cfg80211 interface for creating virtual interfaces relies on. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c45d286e |
|
28-Mar-2007 |
Rusty Russell <rusty@rustcorp.com.au> |
[NET]: Inline net_device_stats Network drivers which keep stats allocate their own stats structure then write a get_stats() function to return them. It would be nice if this were done by default. 1) Add a new "stats" field to "struct net_device". 2) Add a new feature field to say "this driver uses the internal one" 3) Have a default "get_stats" which returns NULL if that feature not set. 4) Change callers to check result of get_stats call for NULL, not if ->get_stats is set. This should not break backwards compatibility with older drivers, yet allow modern drivers to shed some boilerplate code. Lightly tested: works for a modified lguest network driver. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
22f8cde5 |
|
07-Feb-2007 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: unregister_netdevice as void There was no real useful information from the unregister_netdevice() return code, the only error occurred in a situation that was a driver bug. So change it to a void function. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
43cb76d9 |
|
09-Apr-2002 |
Greg Kroah-Hartman <gregkh@suse.de> |
Network: convert network devices to use struct device instead of class_device This lets the network core have the ability to handle suspend/resume issues, if it wants to. Thanks to Frederik Deweerdt <frederik.deweerdt@gmail.com> for the arm driver fixes. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
#
e44c39bd |
|
28-Dec-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[NET]: Add memory barrrier to netif_poll_enable() When a driver writer calls this, they generally expect that all previous stores and modifications they've made will be visible before netif_poll_enable() executes, so ensure this. Noticed by Ben H. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f0490980 |
|
08-Dec-2006 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: Force a cache line split in hh_cache in SMP. hh_lock was converted from rwlock to seqlock by Stephen. To have a 100% benefit of this change, I suggest to place read mostly fields of hh_cache in a separate cache line, because hh_refcnt may be changed quite frequently on some busy machines. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3644f0ce |
|
07-Dec-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: Convert hh_lock to seqlock. The hard header cache is in the main output path, so using seqlock instead of reader/writer lock should reduce overhead. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d7fe0f24 |
|
03-Dec-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[PATCH] severing skbuff.h -> mm.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
d5c42c0e |
|
27-Nov-2006 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[NET]: Pack struct hh_cache [acme@newtoy net-2.6.20]$ pahole net/ipv4/tcp.o hh_cache /* /pub/scm/linux/kernel/git/acme/net-2.6.20/include/linux/netdevice.h:190 */ struct hh_cache { struct hh_cache * hh_next; /* 0 4 */ atomic_t hh_refcnt; /* 4 4 */ __be16 hh_type; /* 8 2 */ /* XXX 2 bytes hole, try to pack */ int hh_len; /* 12 4 */ int (*hh_output)(); /* 16 4 */ rwlock_t hh_lock; /* 20 36 */ long unsigned int hh_data[24]; /* 56 96 */ }; /* size: 152, sum members: 150, holes: 1, sum holes: 2 */ [acme@newtoy net-2.6.20]$ find net -name "*.[ch]" | xargs grep 'hh_len.\+=' | sort -u net/atm/br2684.c: hh->hh_len = PADLEN + ETH_HLEN; net/ethernet/eth.c: hh->hh_len = ETH_HLEN; net/ipv4/ipconfig.c: int hh_len = LL_RESERVED_SPACE(dev); net/ipv4/ip_output.c: hh_len = LL_RESERVED_SPACE(rt->u.dst.dev); net/ipv4/ip_output.c: int hh_len = LL_RESERVED_SPACE(dev); net/ipv4/netfilter.c: hh_len = (*pskb)->dst->dev->hard_header_len; net/ipv4/raw.c: hh_len = LL_RESERVED_SPACE(rt->u.dst.dev); net/ipv6/ip6_output.c: hh_len = LL_RESERVED_SPACE(rt->u.dst.dev); net/ipv6/netfilter/ip6t_REJECT.c: hh_len = (dst->dev->hard_header_len + 15)&~15; net/ipv6/raw.c: hh_len = LL_RESERVED_SPACE(rt->u.dst.dev); [acme@newtoy net-2.6.20]$ [acme@newtoy net-2.6.20]$ find include -name "*.h" | xargs grep 'define ETH_HLEN' include/linux/if_ether.h:#define ETH_HLEN 14 /* Total octets in header. */ (((dev)->hard_header_len&~(HH_DATA_MOD - 1)) + HH_DATA_MOD) [acme@newtoy net-2.6.20]$ pahole net/ipv4/tcp.o net_device | grep hard_header_len short unsigned int hard_header_len; /* 106 2 */ [acme@newtoy net-2.6.20]$ So I think we're safe in turning hh_len an u16, end result: [acme@newtoy net-2.6.20]$ codiff -sV /tmp/tcp.o.before net/ipv4/tcp.o /pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv4/tcp.c: struct hh_cache | -4 hh_len; from: int /* 12(0) 4(0) */ to: u16 /* 10(0) 2(0) */ 1 struct changed [acme@newtoy net-2.6.20]$ Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
#
b9df3cb8 |
|
14-Nov-2006 |
Gerrit Renker <gerrit@erg.abdn.ac.uk> |
[TCP/DCCP]: Introduce net_xmit_eval Throughout the TCP/DCCP (and tunnelling) code, it often happens that the return code of a transmit function needs to be tested against NET_XMIT_CN which is a value that does not indicate a strict error condition. This patch uses a macro for these recurring situations which is consistent with the already existing macro net_xmit_errno, saving on duplicated code. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
#
90833aa4 |
|
13-Nov-2006 |
Adrian Bunk <bunk@stusta.de> |
[NET]: The scheduled removal of the frame diverter. This patch contains the scheduled removal of the frame diverter. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e81c7359 |
|
28-Nov-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[NET]: Fix MAX_HEADER setting. MAX_HEADER is either set to LL_MAX_HEADER or LL_MAX_HEADER + 48, and this is controlled by a set of CONFIG_* ifdef tests. It is trying to use LL_MAX_HEADER + 48 when any of the tunnels are enabled which set hard_header_len like this: dev->hard_header_len = LL_MAX_HEADER + sizeof(struct xxx); The correct set of tunnel drivers which do this are: ipip ip_gre ip6_tunnel sit so make the ifdef test match. Noticed by Patrick McHardy and with help from Herbert Xu. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d77072ec |
|
28-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[NET]: Annotate dst_ops protocol Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f5b2b966 |
|
22-Sep-2006 |
Jay Vosburgh <fubar@us.ibm.com> |
[PATCH] bonding: Validate probe replies in ARP monitor Add logic to check ARP request / reply packets used for ARP monitor link integrity checking. The current method simply examines the slave device to see if it has sent and received traffic; this can be fooled by extraneous traffic. For example, if multiple hosts running bonding are behind a common switch, the probe traffic from the multiple instances of bonding will update the tx/rx times on each other's slave devices. Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
|
#
baef1865 |
|
08-Sep-2006 |
John W. Linville <linville@tuxdriver.com> |
[PATCH] WE-21 support (core API) This is version 21 of the Wireless Extensions. Changelog : o finishes migrating the ESSID API (remove the +1) o netdev->get_wireless_stats is no more o long/short retry This is a redacted version of a patch originally submitted by Jean Tourrilhes. I removed most of the additions, in order to minimize future support requirements for nl80211 (or other WE successor). CC: Jean Tourrilhes <jt@hpl.hp.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
|
#
84fa7933 |
|
29-Aug-2006 |
Patrick McHardy <kaber@trash.net> |
[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose checksum still needs to be completed) and CHECKSUM_COMPLETE (for incoming packets, device supplied full checksum). Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
76fd8593 |
|
08-Sep-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[PATCH] ethtool: allow const ethtool_ops The ethtool_ops structure is immutable, it expected to be setup by the driver and is never changed. This patch allows drivers to declare there ethtool_ops structure read-only. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Jeff Garzik <jeff@garzik.org>
|
#
78eb8877 |
|
17-Aug-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[BRIDGE]: Disable SG/GSO if TX checksum is off When the bridge recomputes features, it does not maintain the constraint that SG/GSO must be off if TX checksum is off. This patch adds that constraint. On a completely unrelated note, I've also added TSO6 and TSO_ECN feature bits if GSO is enabled on the underlying device through the new NETIF_F_GSO_SOFTWARE macro. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7ea49ed7 |
|
14-Aug-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[VLAN]: Make sure bonding packet drop checks get done in hwaccel RX path. Since __vlan_hwaccel_rx() is essentially bypassing the netif_receive_skb() call that would have occurred if we did the VLAN decapsulation in software, we are missing the skb_bond() call and the assosciated checks it does. Export those checks via an inline function, skb_bond_should_drop(), and use this in __vlan_hwaccel_rx(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
53c4b2cc7 |
|
21-Jul-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Fix reversed error test in netif_tx_trylock A non-zero return value indicates success from spin_trylock, not error. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
a430a43d |
|
08-Jul-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET] gso: Fix up GSO packets with broken checksums Certain subsystems in the stack (e.g., netfilter) can break the partial checksum on GSO packets. Until they're fixed, this patch allows this to work by recomputing the partial checksums through the GSO mechanism. Once they've all been converted to update the partial checksum instead of clearing it, this workaround can be removed. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
89114afd |
|
08-Jul-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET] gso: Add skb_is_gso This patch adds the wrapper function skb_is_gso which can be used instead of directly testing skb_shinfo(skb)->gso_size. This makes things a little nicer and allows us to change the primary key for indicating whether an skb is GSO (if we ever want to do that). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f83ef8c0 |
|
30-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPV6]: Added GSO support for TCPv6 This patch adds GSO support for IPv6 and TCPv6. This is based on a patch by Ananda Raju <Ananda.Raju@neterion.com>. His original description is: This patch enables TSO over IPv6. Currently Linux network stacks restricts TSO over IPv6 by clearing of the NETIF_F_TSO bit from "dev->features". This patch will remove this restriction. This patch will introduce a new flag NETIF_F_TSO6 which will be used to check whether device supports TSO over IPv6. If device support TSO over IPv6 then we don't clear of NETIF_F_TSO and which will make the TCP layer to create TSO packets. Any device supporting TSO over IPv6 will set NETIF_F_TSO6 flag in "dev->features" along with NETIF_F_TSO. In case when user disables TSO using ethtool, NETIF_F_TSO will get cleared from "dev->features". So even if we have NETIF_F_TSO6 we don't get TSO packets created by TCP layer. SKB_GSO_TCPV4 renamed to SKB_GSO_TCP to make it generic GSO packet. SKB_GSO_UDPV4 renamed to SKB_GSO_UDP as UFO is not a IPv4 feature. UFO is supported over IPv6 also The following table shows there is significant improvement in throughput with normal frames and CPU usage for both normal and jumbo. -------------------------------------------------- | | 1500 | 9600 | | ------------------|-------------------| | | thru CPU | thru CPU | -------------------------------------------------- | TSO OFF | 2.00 5.5% id | 5.66 20.0% id | -------------------------------------------------- | TSO ON | 2.63 78.0 id | 5.67 39.0% id | -------------------------------------------------- Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
bcd76111 |
|
30-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Generalise TSO-specific bits from skb_setup_caps This patch generalises the TSO-specific bits from sk_setup_caps by adding the sk_gso_type member to struct sock. This makes sk_setup_caps generic so that it can be used by TCPv6 or UFO. The only catch is that whoever uses this must provide a GSO implementation for their protocol which I think is a fair deal :) For now UFO continues to live without a GSO implementation which is OK since it doesn't use the sock caps field at the moment. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b0da8537 |
|
29-Jun-2006 |
Michael Chan <mchan@broadcom.com> |
[NET]: Add ECN support for TSO In the current TSO implementation, NETIF_F_TSO and ECN cannot be turned on together in a TCP connection. The problem is that most hardware that supports TSO does not handle CWR correctly if it is set in the TSO packet. Correct handling requires CWR to be set in the first packet only if it is set in the TSO header. This patch adds the ability to turn on NETIF_F_TSO and ECN using GSO if necessary to handle TSO packets with CWR set. Hardware that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev-> features flag. All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set. If the output device does not have the NETIF_F_TSO_ECN feature set, GSO will split the packet up correctly with CWR only set in the first segment. With help from Herbert Xu <herbert@gondor.apana.org.au>. Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock flag is completely removed. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d6b4991a |
|
29-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Fix logical error in skb_gso_ok The test in skb_gso_ok is backwards. Noticed by Michael Chan <mchan@broadcom.com>. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
576a30eb |
|
27-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Added GSO header verification When GSO packets come from an untrusted source (e.g., a Xen guest domain), we need to verify the header integrity before passing it to the hardware. Since the first step in GSO is to verify the header, we can reuse that code by adding a new bit to gso_type: SKB_GSO_DODGY. Packets with this bit set can only be fed directly to devices with the corresponding bit NETIF_F_GSO_ROBUST. If the device doesn't have that bit, then the skb is fed to the GSO engine which will allow the packet to be sent to the hardware if it passes the header check. This patch changes the sg flag to a full features flag. The same method can be used to implement TSO ECN support. We simply have to mark packets with CWR set with SKB_GSO_ECN so that only hardware with a corresponding NETIF_F_TSO_ECN can accept them. The GSO engine can either fully segment the packet, or segment the first MTU and pass the rest to the hardware for further segmentation. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
60481264 |
|
26-Jun-2006 |
Adrian Bunk <bunk@stusta.de> |
[NET]: make net/core/dev.c:netdev_nit static netdev_nit can now become static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37c3185a |
|
22-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Added GSO toggle This patch adds a generic segmentation offload toggle that can be turned on/off for each net device. For now it only supports in TCPv4. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f6a78bfc |
|
22-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Add generic segmentation offload This patch adds the infrastructure for generic segmentation offload. The idea is to tap into the potential savings of TSO without hardware support by postponing the allocation of segmented skb's until just before the entry point into the NIC driver. The same structure can be used to support software IPv6 TSO, as well as UFO and segmentation offload for other relevant protocols, e.g., DCCP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
7967168c |
|
22-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Merge TSO/UFO fields in sk_buff Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not going to scale if we add any more segmentation methods (e.g., DCCP). So let's merge them. They were used to tell the protocol of a packet. This function has been subsumed by the new gso_type field. This is essentially a set of netdev feature bits (shifted by 16 bits) that are required to process a specific skb. As such it's easy to tell whether a given device can process a GSO skb: you just have to and the gso_type field and the netdev's features field. I've made gso_type a conjunction. The idea is that you have a base type (e.g., SKB_GSO_TCPV4) that can be modified further to support new features. For example, if we add a hardware TSO type that supports ECN, they would declare NETIF_F_TSO | NETIF_F_TSO_ECN. All TSO packets with CWR set would have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO packets would be SKB_GSO_TCPV4. This means that only the CWR packets need to be emulated in software. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
48d83325 |
|
20-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Prevent multiple qdisc runs Having two or more qdisc_run's contend against each other is bad because it can induce packet reordering if the packets have to be requeued. It appears that this is an unintended consequence of relinquinshing the queue lock while transmitting. That in turn is needed for devices that spend a lot of time in their transmit routine. There are no advantages to be had as devices with queues are inherently single-threaded (the loopback device is not but then it doesn't have a queue). Even if you were to add a queue to a parallel virtual device (e.g., bolt a tbf filter in front of an ipip tunnel device), you would still want to process the queue in sequence to ensure that the packets are ordered correctly. The solution here is to steal a bit from net_device to prevent this. BTW, as qdisc_restart is no longer used by anyone as a module inside the kernel (IIRC it used to with netif_wake_queue), I have not exported the new __qdisc_run function. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
8648b305 |
|
17-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Add NETIF_F_GEN_CSUM and NETIF_F_ALL_CSUM The current stack treats NETIF_F_HW_CSUM and NETIF_F_NO_CSUM identically so we test for them in quite a few places. For the sake of brevity, I'm adding the macro NETIF_F_GEN_CSUM for these two. We also test the disjunct of NETIF_F_IP_CSUM and the other two in various places, for that purpose I've added NETIF_F_ALL_CSUM. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
932ff279 |
|
09-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Add netif_tx_lock Various drivers use xmit_lock internally to synchronise with their transmission routines. They do so without setting xmit_lock_owner. This is fine as long as netpoll is not in use. With netpoll it is possible for deadlocks to occur if xmit_lock_owner isn't set. This is because if a printk occurs while xmit_lock is held and xmit_lock_owner is not set can cause netpoll to attempt to take xmit_lock recursively. While it is possible to resolve this by getting netpoll to use trylock, it is suboptimal because netpoll's sole objective is to maximise the chance of getting the printk out on the wire. So delaying or dropping the message is to be avoided as much as possible. So the only alternative is to always set xmit_lock_owner. The following patch does this by introducing the netif_tx_lock family of functions that take care of setting/unsetting xmit_lock_owner. I renamed xmit_lock to _xmit_lock to indicate that it should not be used directly. I didn't provide irq versions of the netif_tx_lock functions since xmit_lock is meant to be a BH-disabling lock. This is pretty much a straight text substitution except for a small bug fix in winbond. It currently uses netif_stop_queue/spin_unlock_wait to stop transmission. This is unsafe as an IRQ can potentially wake up the queue. So it is safer to use netif_tx_disable. The hamradio bits used spin_lock_irq but it is unnecessary as xmit_lock must never be taken in an IRQ handler. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
db217334 |
|
17-Jun-2006 |
Chris Leech <christopher.leech@intel.com> |
[I/OAT]: Setup the networking subsystem as a DMA client Attempts to allocate per-CPU DMA channels Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b17a7c17 |
|
10-May-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: Do sysfs registration as part of register_netdevice. The last step of netdevice registration was being done by a delayed call, but because it was delayed, it was impossible to return any error code if the class_device registration failed. Side effects: * one state in registration process is unnecessary. * register_netdevice can sleep inside class_device registration/hotplug * code in netdev_run_todo only does unregistration so it is simpler. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d3240312 |
|
08-May-2006 |
Stephen Hemminger <shemminger@osdl.org> |
sky2: backout NAPI reschedule This is a backout of earlier patch. The whole rescheduling hack was a bad idea. It doesn't really solve the problem and it makes the code more complicated for no good reason. Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
|
#
fe9925b5 |
|
06-May-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: Create netdev attribute_groups with class_device_add Atomically create attributes when class device is added. This avoids the race between registering class_device (which generates hotplug event), and the creation of attribute groups. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
62c4f0a2 |
|
25-Apr-2006 |
David Woodhouse <dwmw2@infradead.org> |
Don't include linux/config.h from anywhere else in include/ Signed-off-by: David Woodhouse <dwmw2@infradead.org>
|
#
734cbc36 |
|
25-Apr-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[PATCH] sky2: reschedule if irq still pending This is a workaround for the case edge-triggered irq's. Several users seem to have broken configurations sharing edge-triggered irq's. To avoid losing IRQ's, reshedule if more work arrives. The changes to netdevice.h are to extract the part that puts device back in list into separate inline. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Jeff Garzik <jeff@garzik.org>
|
#
56079431 |
|
29-Mar-2006 |
Denis Vlasenko <vda@ilport.com.ua> |
[NET]: Deinline some larger functions from netdevice.h On a allyesconfig'ured kernel: Size Uses Wasted Name and definition ===== ==== ====== ================================================ 95 162 12075 netif_wake_queue include/linux/netdevice.h 129 86 9265 dev_kfree_skb_any include/linux/netdevice.h 127 56 5885 netif_device_attach include/linux/netdevice.h 73 86 4505 dev_kfree_skb_irq include/linux/netdevice.h 46 60 1534 netif_device_detach include/linux/netdevice.h 119 16 1485 __netif_rx_schedule include/linux/netdevice.h 143 5 492 netif_rx_schedule include/linux/netdevice.h 81 7 366 netif_schedule include/linux/netdevice.h netif_wake_queue is big because __netif_schedule is a big inline: static inline void __netif_schedule(struct net_device *dev) { if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) { unsigned long flags; struct softnet_data *sd; local_irq_save(flags); sd = &__get_cpu_var(softnet_data); dev->next_sched = sd->output_queue; sd->output_queue = dev; raise_softirq_irqoff(NET_TX_SOFTIRQ); local_irq_restore(flags); } } static inline void netif_wake_queue(struct net_device *dev) { #ifdef CONFIG_NETPOLL_TRAP if (netpoll_trap()) return; #endif if (test_and_clear_bit(__LINK_STATE_XOFF, &dev->state)) __netif_schedule(dev); } By de-inlining __netif_schedule we are saving a lot of text at each callsite of netif_wake_queue and netif_schedule. __netif_rx_schedule is also big, and it makes more sense to keep both of them out of line. Patch also deinlines dev_kfree_skb_any. We can deinline dev_kfree_skb_irq instead... oh well. netif_device_attach/detach are not hot paths, we can deinline them too. Signed-off-by: Denis Vlasenko <vda@ilport.com.ua> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
15333061 |
|
20-Mar-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: dev_put/dev_hold cleanup Get rid of the old __dev_put macro that is just a hold over from pre 2.6 kernel. And turn dev_hold into an inline instead of a macro. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
b00055aa |
|
20-Mar-2006 |
Stefan Rompf <stefan@loplof.de> |
[NET] core: add RFC2863 operstate this patch adds a dormant flag to network devices, RFC2863 operstate derived from these flags and possibility for userspace interaction. It allows drivers to signal that a device is unusable for user traffic without disabling queueing (and therefore the possibility for protocol establishment traffic to flow) and a userspace supplicant (WPA, 802.1X) to mark a device unusable without changes to the driver. It is the result of our long discussion. However I must admit that it represents what Jamal and I agreed on with compromises towards Krzysztof, but Thomas and Krzysztof still disagree with some parts. Anyway I think it should be applied. Signed-off-by: Stefan Rompf <stefan@loplof.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a122576 |
|
30-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[PATCH] sky2: change netif_rx_schedule_test to __netif_schedule_prep I didn't like the name netif_rx_schedule_test(), in earlier patches and changed to __netif_rx_schedule_prep to be more consistent. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
|
#
c2373ee9 |
|
09-Nov-2005 |
Mitch Williams <mitch.a.williams@intel.com> |
[PATCH] net: make dev_valid_name public dev_valid_name() is a useful function. Make it public. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Acked-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
|
#
fb286bb2 |
|
10-Nov-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Detect hardware rx checksum faults correctly Here is the patch that introduces the generic skb_checksum_complete which also checks for hardware RX checksum faults. If that happens, it'll call netdev_rx_csum_fault which currently prints out a stack trace with the device name. In future it can turn off RX checksum. I've converted every spot under net/ that does RX checksum checks to use skb_checksum_complete or __skb_checksum_complete with the exceptions of: * Those places where checksums are done bit by bit. These will call netdev_rx_csum_fault directly. * The following have not been completely checked/converted: ipmr ip_vs netfilter dccp This patch is based on patches and suggestions from Stephen Hemminger and David S. Miller. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
e89e9cf5 |
|
18-Oct-2005 |
Ananda Raju <ananda.raju@neterion.com> |
[IPv4/IPv6]: UFO Scatter-gather approach Attached is kernel patch for UDP Fragmentation Offload (UFO) feature. 1. This patch incorporate the review comments by Jeff Garzik. 2. Renamed USO as UFO (UDP Fragmentation Offload) 3. udp sendfile support with UFO This patches uses scatter-gather feature of skb to generate large UDP datagram. Below is a "how-to" on changes required in network device driver to use the UFO interface. UDP Fragmentation Offload (UFO) Interface: ------------------------------------------- UFO is a feature wherein the Linux kernel network stack will offload the IP fragmentation functionality of large UDP datagram to hardware. This will reduce the overhead of stack in fragmenting the large UDP datagram to MTU sized packets 1) Drivers indicate their capability of UFO using dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG NETIF_F_HW_CSUM is required for UFO over ipv6. 2) UFO packet will be submitted for transmission using driver xmit routine. UFO packet will have a non-zero value for "skb_shinfo(skb)->ufo_size" skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP fragment going out of the adapter after IP fragmentation by hardware. skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[] contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW indicating that hardware has to do checksum calculation. Hardware should compute the UDP checksum of complete datagram and also ip header checksum of each fragmented IP packet. For IPV6 the UFO provides the fragment identification-id in skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating IPv6 fragments. Signed-off-by: Ananda Raju <ananda.raju@neterion.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded) Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
#
9356b8fc |
|
27-Sep-2005 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: Reorder some hot fields of struct net_device Place them on separate cache lines in SMP to lower memory bouncing between multiple CPU accessing the device. - One part is mostly used on receive path (including eth_type_trans()) (poll_list, poll, quota, weight, last_rx, dev_addr, broadcast) - One part is mostly used on queue transmit path (qdisc) (queue_lock, qdisc, qdisc_sleeping, qdisc_list, tx_queue_len) - One part is mostly used on xmit path (device) (xmit_lock, xmit_lock_owner, priv, hard_start_xmit, trans_start) 'features' is placed outside of these hot points, in a location that may be shared by all cpus (because mostly read) name_hlist is moved close to name[IFNAMSIZ] to speedup __dev_get_by_name() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
793b883e |
|
14-Sep-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[PATCH] sky2: driver update. Here is revised patch against netdev sky2 branch. It includes whitespace fixes, all the changes from the previous review as well as some optimizations and timing fixes to solve some of the hangs. The stall problem is better but not perfect. It appears that under stress the chip can't keep up with the bus and sends a pause frame, then hangs. This version is for testing, and hopefully other eyes might see the root cause of the problem. I don't want to reinvent the ugly watchdog code in the syskonnect version of sk98lin. If you read it you will see, the original driver writer and the hardware developer obviously didn't understand each other. Dual port support is included, but not tested yet. It did require small change to NAPI since both ports share same IRQ. Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
|
#
3173c890 |
|
11-Sep-2005 |
Nishanth Aravamudan <nacc@us.ibm.com> |
[PATCH] drivers/net: fix-up schedule_timeout() usage Use schedule_timeout_interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
|
#
a6f9a705 |
|
20-Aug-2005 |
Jon Wetzel <jon_wetzel@dell.com> |
[NET]: Add support for getting the permanent hardware address. This patch adds a new field to net device to hold the permanent hardware address, and adds a new generic ethtool_op function to get that address. Signed-off-by: Jon Wetzel <jon_wetzel@dell.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
20380731 |
|
15-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[NET]: Fix sparse warnings Of this type, mostly: CHECK net/ipv6/netfilter.c net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static? net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static? Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
0a242efc |
|
11-Aug-2005 |
Denis Vlasenko <vda@ilport.com.ua> |
[NET]: Deinline netif_carrier_{on,off}(). # grep -r 'netif_carrier_o[nf]' linux-2.6.12 | wc -l 246 # size vmlinux.org vmlinux.carrier text data bss dec hex filename 4339634 1054414 259296 5653344 564360 vmlinux.org 4337710 1054414 259296 5651420 563bdc vmlinux.carrier And this ain't an allyesconfig kernel! Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
f2ccd8fa |
|
09-Aug-2005 |
David S. Miller <davem@davemloft.net> |
[NET]: Kill skb->real_dev Bonding just wants the device before the skb_bond() decapsulation occurs, so simply pass that original device into packet_type->func() as an argument. It remains to be seen whether we can use this same exact thing to get rid of skb->input_dev as well. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
31aa02c5 |
|
23-Jun-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: Eliminate netif_rx massive packet drops. Eliminate the throttling behaviour when the netif receive queue fills because it behaves badly when using high speed networks under load. The throttling cause multiple packet drops that cause TCP to go into slow start mode. The same effective patch has been part of BIC TCP and H-TCP as well as part of Web100. The existing code drops 100's of packets when the queue fills; this changes it to individual packet drop-tail. Signed-off-by: Stephen Hemmminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
34008d8c |
|
23-Jun-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: Remove obsolete netif_rx congestion sensing mechanism. Remove the congestion sensing mechanism from netif_rx, and always return either full or empty. Almost no driver checks the return value from netif_rx, and those that do only use it for debug messages. The original design of netif_rx was to do flow control based on the receive queue, but NAPI has supplanted this and no driver uses the feedback. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
c1ebcdb8 |
|
23-Jun-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[NET]: Remove obsolete fastroute stats. Remove last vestiages of fastroute code that is no longer used. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
543537bd |
|
23-Jun-2005 |
Paulo Marques <pmarques@grupopie.com> |
[PATCH] create a kstrdup library function This patch creates a new kstrdup library function and changes the "local" implementations in several places to use this function. Most of the changes come from the sound and net subsystems. The sound part had already been acknowledged by Takashi Iwai and the net part by David S. Miller. I left UML alone for now because I would need more time to read the code carefully before making changes there. Signed-off-by: Paulo Marques <pmarques@grupopie.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
115c1d6e |
|
22-Jun-2005 |
Jeff Moyer <jmoyer@redhat.com> |
[NETPOLL]: Introduce a netpoll_info struct This patch introduces a netpoll_info structure, which the struct net_device will now point to instead of pointing to a struct netpoll. The reason for this is two-fold: 1) fields such as the rx_flags, poll_owner, and poll_lock should be maintained per net_device, not per netpoll; and 2) this is a first step in providing support for multiple netpoll clients to register against the same net_device. The struct netpoll is now pointed to by the netpoll_info structure. As such, the previous behaviour of the code is preserved. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5ba0eac6 |
|
02-Jun-2005 |
Jiri Benc <jbenc@suse.cz> |
[NET]: Fix HH_DATA_OFF. When the hardware header size is a multiple of HH_DATA_MOD, HH_DATA_OFF() incorrectly returns HH_DATA_MOD (instead of 0). This affects ieee80211 layer as 802.11 header is 32 bytes long. Signed-off-by: Jiri Benc <jbenc@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
81e81575 |
|
29-May-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[BRIDGE]: make dev->features unsigned The features field in netdevice is really a bitmask, and bitmask's should be unsigned. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d8a33ac4 |
|
29-May-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[BRIDGE]: features change notification Resend of earlier patch (no changes) from Catalin used to provide device feature change notification. Signed-off-by: Catalin BOIE <catab at umbrella.ro> Acked-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
205902ec |
|
23-May-2005 |
Alexey Dobriyan <adobriyan@gmail.com> |
[NET]: netdevice.h: be'ify packet_type Everybody does struct packet_type foo_packet_type = { .type = __constant_htons(ETH_P_FOO); }; 5 introduced warnings will be properly fixed later. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
02c30a84 |
|
05-May-2005 |
Jesper Juhl <juhl-lkml@dif.dk> |
[PATCH] update Ross Biro bouncing email address Ross moved. Remove the bad email address so people will find the correct one in ./CREDITS. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|