History log of /linux-master/net/mpls/af_mpls.c
Revision Date Author Comments
# 22e36ea9 22-Feb-2024 Eric Dumazet <edumazet@google.com>

inet: allow ip_valid_fib_dump_req() to be called with RTNL or RCU

Add a new field into struct fib_dump_filter, to let callers
tell if they use RTNL locking or RCU.

This is used in the following patch, when inet_dump_fib()
no longer holds RTNL.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c899710f 08-Aug-2023 Joel Granados <joel.granados@gmail.com>

networking: Update to register_net_sysctl_sz

Move from register_net_sysctl to register_net_sysctl_sz for all the
networking related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.

We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.

An additional size function was added to the following files in order to
calculate the size of an array that is defined in another file:
include/net/ipv6.h
net/ipv6/icmp.c
net/ipv6/route.c
net/ipv6/sysctl_net_ipv6.c

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>


# d457a0e3 08-Jun-2023 Eric Dumazet <edumazet@google.com>

net: move gso declarations and functions to their own files

Move declarations into include/net/gso.h and code into net/core/gso.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# fda6c89f 13-Feb-2023 Jakub Kicinski <kuba@kernel.org>

net: mpls: fix stale pointer if allocation fails during device rename

lianhui reports that when MPLS fails to register the sysctl table
under new location (during device rename) the old pointers won't
get overwritten and may be freed again (double free).

Handle this gracefully. The best option would be unregistering
the MPLS from the device completely on failure, but unfortunately
mpls_ifdown() can fail. So failing fully is also unreliable.

Another option is to register the new table first then only
remove old one if the new one succeeds. That requires more
code, changes order of notifications and two tables may be
visible at the same time.

sysctl point is not used in the rest of the code - set to NULL
on failures and skip unregister if already NULL.

Reported-by: lianhui tang <bluetlh@gmail.com>
Fixes: 0fae3bf018d9 ("mpls: handle device renames for per-device sysctls")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d120d1a6 26-Oct-2022 Thomas Gleixner <tglx@linutronix.de>

net: Remove the obsolte u64_stats_fetch_*_irq() users (net).

Now that the 32bit UP oddity is gone and 32bit uses always a sequence
count, there is no need for the fetch_irq() variants anymore.

Convert to the regular interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 278d3ba6 25-Aug-2022 Sebastian Andrzej Siewior <bigeasy@linutronix.de>

net: Use u64_stats_fetch_begin_irq() for stats fetch.

On 32bit-UP u64_stats_fetch_begin() disables only preemption. If the
reader is in preemptible context and the writer side
(u64_stats_update_begin*()) runs in an interrupt context (IRQ or
softirq) then the writer can update the stats during the read operation.
This update remains undetected.

Use u64_stats_fetch_begin_irq() to ensure the stats fetch on 32bit-UP
are not interrupted by a writer. 32bit-SMP remains unaffected by this
change.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Catherine Sullivan <csully@google.com>
Cc: David Awogbemila <awogbemila@google.com>
Cc: Dimitris Michailidis <dmichail@fungible.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Hans Ulli Kroll <ulli.kroll@googlemail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jeroen de Borst <jeroendb@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <simon.horman@corigine.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: oss-drivers@corigine.com
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27a5a568 06-Apr-2022 GONG, Ruiqi <gongruiqi1@huawei.com>

net: mpls: fix memdup.cocci warning

Simply use kmemdup instead of explicitly allocating and copying memory.

Generated by: scripts/coccinelle/api/memdup.cocci

Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Link: https://lore.kernel.org/r/20220406114629.182833-1-gongruiqi1@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# c4416f5c 09-Feb-2022 Victor Erminpour <victor.erminpour@oracle.com>

net: mpls: Fix GCC 12 warning

When building with automatic stack variable initialization, GCC 12
complains about variables defined outside of switch case statements.
Move the variable outside the switch, which silences the warning:

./net/mpls/af_mpls.c:1624:21: error: statement will never be executed [-Werror=switch-unreachable]
1624 | int err;
| ^~~

Signed-off-by: Victor Erminpour <victor.erminpour@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f05b0b97 28-Nov-2021 Benjamin Poirier <bpoirier@nvidia.com>

net: mpls: Make for_nexthops iterator const

There are separate for_nexthops and change_nexthops iterators. The
for_nexthops variant should use const.

Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 18916818 28-Nov-2021 Benjamin Poirier <bpoirier@nvidia.com>

net: mpls: Remove rcu protection from nh_dev

Following the previous commit, nh_dev can no longer be accessed and
modified concurrently.

Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7d4741ea 28-Nov-2021 Benjamin Poirier <bpoirier@nvidia.com>

net: mpls: Fix notifications when deleting a device

There are various problems related to netlink notifications for mpls route
changes in response to interfaces being deleted:
* delete interface of only nexthop
DELROUTE notification is missing RTA_OIF attribute
* delete interface of non-last nexthop
NEWROUTE notification is missing entirely
* delete interface of last nexthop
DELROUTE notification is missing nexthop

All of these problems stem from the fact that existing routes are modified
in-place before sending a notification. Restructure mpls_ifdown() to avoid
changing the route in the DELROUTE cases and to create a copy in the
NEWROUTE case.

Fixes: f8efb73c97e2 ("mpls: multipath route support")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6a6b83ca 22-Jul-2021 Kangmin Park <l4stpr0gr4m@gmail.com>

mpls: defer ttl decrement in mpls_forward()

Defer ttl decrement to optimize in tx_err case. There is no need
to decrease ttl in the case of goto tx_err.

Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ad542fb7 27-Apr-2021 Jiapeng Chong <jiapeng.chong@linux.alibaba.com>

mpls: Remove redundant assignment to err

Variable err is set to -ENOMEM but this value is never read as it is
overwritten with a new value later on, hence it is a redundant
assignment and can be removed.

Cleans up the following clang-analyzer warning:

net/mpls/af_mpls.c:1022:2: warning: Value stored to 'err' is never read
[clang-analyzer-deadcode.DeadStores].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0992d67b 30-Oct-2020 Guillaume Nault <gnault@redhat.com>

mpls: drop skb's dst in mpls_forward()

Commit 394de110a733 ("net: Added pointer check for
dst->ops->neigh_lookup in dst_neigh_lookup_skb") added a test in
dst_neigh_lookup_skb() to avoid a NULL pointer dereference. The root
cause was the MPLS forwarding code, which doesn't call skb_dst_drop()
on incoming packets. That is, if the packet is received from a
collect_md device, it has a metadata_dst attached to it that doesn't
implement any dst_ops function.

To align the MPLS behaviour with IPv4 and IPv6, let's drop the dst in
mpls_forward(). This way, dst_neigh_lookup_skb() doesn't need to test
->neigh_lookup any more. Let's keep a WARN condition though, to
document the precondition and to ease detection of such problems in the
future.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/f8c2784c13faa54469a2aac339470b1049ca6b63.1604102750.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# df561f66 23-Aug-2020 Gustavo A. R. Silva <gustavoars@kernel.org>

treewide: Use fallthrough pseudo-keyword

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>


# 350e7ab9 27-Jul-2020 Martin Varghese <martin.varghese@nokia.com>

net: Removed the device type check to add mpls support for devices

MPLS has no dependency with the device type of underlying devices.
Hence the device type check to add mpls support for devices can be
avoided.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1515aa70 20-May-2020 Vadim Fedorenko <vfedorenko@novek.ru>

mpls: Add support for IPv6 tunnels

Add support for IPv6 tunnel devices in AF_MPLS.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 32927393 24-Apr-2020 Christoph Hellwig <hch@lst.de>

sysctl: pass kernel pointers to ->proc_handler

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6c8991f4 04-Dec-2019 Sabrina Dubroca <sd@queasysnail.net>

net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup

ipv6_stub uses the ip6_dst_lookup function to allow other modules to
perform IPv6 lookups. However, this function skips the XFRM layer
entirely.

All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the
ip_route_output_key and ip_route_output helpers) for their IPv4 lookups,
which calls xfrm_lookup_route(). This patch fixes this inconsistent
behavior by switching the stub to ip6_dst_lookup_flow, which also calls
xfrm_lookup_route().

This requires some changes in all the callers, as these two functions
take different arguments and have different return types.

Fixes: 5f81bd2e5d80 ("ipv6: export a stub for IPv6 symbols used by vxlan")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eec4844f 18-Jul-2019 Matteo Croce <mcroce@redhat.com>

proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 09c434b8 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for more missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have MODULE_LICENCE("GPL*") inside which was used in the initial
scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 8cb08174 26-Apr-2019 Johannes Berg <johannes.berg@intel.com>

netlink: make validation more configurable for future strictness

We currently have two levels of strict validation:

1) liberal (default)
- undefined (type >= max) & NLA_UNSPEC attributes accepted
- attribute length >= expected accepted
- garbage at end of message accepted
2) strict (opt-in)
- NLA_UNSPEC attributes accepted
- attribute length >= expected accepted

Split out parsing strictness into four different options:
* TRAILING - check that there's no trailing data after parsing
attributes (in message or nested)
* MAXTYPE - reject attrs > max known type
* UNSPEC - reject attributes with NLA_UNSPEC policy entries
* STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)

@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae0be8de 26-Apr-2019 Michal Kubecek <mkubecek@suse.cz>

netlink: make nla_nest_start() add NLA_F_NESTED flag

Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3c618c1d 20-Apr-2019 David Ahern <dsahern@gmail.com>

net: Rename net/nexthop.h net/rtnh.h

The header contains rtnh_ macros so rename the file accordingly.
Allows a later patch to use the nexthop.h name for the new
nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3616d08b 22-Mar-2019 David Ahern <dsahern@gmail.com>

ipv6: Move ipv6 stubs to a separate header file

The number of stubs is growing and has nothing to do with addrconf.
Move the definition of the stubs to a separate header file and update
users. In the move, drop the vxlan specific comment before ipv6_stub.

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# be48220e 26-Feb-2019 David Ahern <dsahern@gmail.com>

mpls: Return error for RTA_GATEWAY attribute

MPLS does not support nexthops with an MPLS address family.
Specifically, it does not handle RTA_GATEWAY attribute. Make it
clear by returning an error.

Fixes: 03c0566542f4c ("mpls: Netlink commands to add, remove, and dump routes")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0c4056ee 18-Jan-2019 Jakub Kicinski <kuba@kernel.org>

net: mpls: netconf: perform strict checks also for doit handlers

Make RTM_GETNETCONF's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d77851bf 18-Jan-2019 Jakub Kicinski <kuba@kernel.org>

net: mpls: route: perform strict checks also for doit handlers

Make RTM_GETROUTE's doit handler use strict checks when
NETLINK_F_STRICT_CHK is set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 196cfebf 15-Oct-2018 David Ahern <dsahern@gmail.com>

net/mpls: Handle kernel side filtering of route dumps

Update the dump request parsing in MPLS for the non-INET case to
enable kernel side filtering. If INET is disabled the only filters
that make sense for MPLS are protocol and nexthop device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# effe6792 15-Oct-2018 David Ahern <dsahern@gmail.com>

net: Enable kernel side filtering of route dumps

Update parsing of route dump request to enable kernel side filtering.
Allow filtering results by protocol (e.g., which routing daemon installed
the route), route type (e.g., unicast), table id and nexthop device. These
amount to the low hanging fruit, yet a huge improvement, for dumping
routes.

ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
be used to look up the device index without taking a reference. From
there filter->dev is only used during dump loops with the lock still held.

Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
have been filtered should no entries be returned.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bae9a78b 15-Oct-2018 David Ahern <dsahern@gmail.com>

net/mpls: Plumb support for filtering route dumps

Implement kernel side filtering of routes by egress device index and
protocol. MPLS uses only a single table and route type.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4724676d 15-Oct-2018 David Ahern <dsahern@gmail.com>

net: Add struct for fib dump filter

Add struct fib_dump_filter for options on limiting which routes are
returned in a dump request. The current list is table id, protocol,
route type, rtm_flags and nexthop device index. struct net is needed
to lookup the net_device from the index.

Declare the filter for each route dump handler and plumb the new
arguments from dump handlers to ip_valid_fib_dump_req.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8a66aa2 09-Oct-2018 David Ahern <dsahern@gmail.com>

net/mpls: Implement handler for strict data checking on dumps

Without CONFIG_INET enabled compiles fail with:

net/mpls/af_mpls.o: In function `mpls_dump_routes':
af_mpls.c:(.text+0xed0): undefined reference to `ip_valid_fib_dump_req'

The preference is for MPLS to use the same handler as ipv4 and ipv6
to allow consistency when doing a dump for AF_UNSPEC which walks
all address families invoking the route dump handler. If INET is
disabled then fallback to an MPLS version which can be tighter on
the data checks.

Fixes: e8ba330ac0c5 ("rtnetlink: Update fib dumps for strict data checking")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# addd383f 07-Oct-2018 David Ahern <dsahern@gmail.com>

net: Update netconf dump handlers for strict data checking

Update inet_netconf_dump_devconf, inet6_netconf_dump_devconf, and
mpls_netconf_dump_devconf for strict data checking. If the flag is set,
the dump request is expected to have an netconfmsg struct as the header.
The struct only has the family member and no attributes can be appended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e8ba330a 07-Oct-2018 David Ahern <dsahern@gmail.com>

rtnetlink: Update fib dumps for strict data checking

Add helper to check netlink message for route dumps. If the strict flag
is set the dump request is expected to have an rtmsg struct as the header.
All elements of the struct are expected to be 0 with the exception of
rtm_flags (which is used by both ipv4 and ipv6 dumps) and no attributes
can be appended. rtm_flags can only have RTM_F_CLONED and RTM_F_PREFIX
set.

Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute,
and ip6mr_rtm_dumproute to call this helper if strict data checking is
enabled.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# dac9c979 07-Oct-2018 David Ahern <dsahern@gmail.com>

net: Add extack to nlmsg_parse

Make sure extack is passed to nlmsg_parse where easy to do so.
Most of these are dump handlers and leveraging the extack in
the netlink_callback.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d8e2262a 21-Sep-2018 Saif Hasan <has@fb.com>

mpls: allow routes on ip6gre devices

Summary:

This appears to be necessary and sufficient change to enable `MPLS` on
`ip6gre` tunnels (RFC4023).

This diff allows IP6GRE devices to be recognized by MPLS kernel module
and hence user can configure interface to accept packets with mpls
headers as well setup mpls routes on them.

Test Plan:

Test plan consists of multiple containers connected via GRE-V6 tunnel.
Then carrying out testing steps as below.

- Carry out necessary sysctl settings on all containers

```
sysctl -w net.mpls.platform_labels=65536
sysctl -w net.mpls.ip_ttl_propagate=1
sysctl -w net.mpls.conf.lo.input=1
```

- Establish IP6GRE tunnels

```
ip -6 tunnel add name if_1_2_1 mode ip6gre \
local 2401:db00:21:6048:feed:0::1 \
remote 2401:db00:21:6048:feed:0::2 key 1
ip link set dev if_1_2_1 up
sysctl -w net.mpls.conf.if_1_2_1.input=1
ip -4 addr add 169.254.0.2/31 dev if_1_2_1 scope link

ip -6 tunnel add name if_1_3_1 mode ip6gre \
local 2401:db00:21:6048:feed:0::1 \
remote 2401:db00:21:6048:feed:0::3 key 1
ip link set dev if_1_3_1 up
sysctl -w net.mpls.conf.if_1_3_1.input=1
ip -4 addr add 169.254.0.4/31 dev if_1_3_1 scope link
```

- Install MPLS encap rules on node-1 towards node-2

```
ip route add 192.168.0.11/32 nexthop encap mpls 32/64 \
via inet 169.254.0.3 dev if_1_2_1
```

- Install MPLS forwarding rules on node-2 and node-3
```
// node2
ip -f mpls route add 32 via inet 169.254.0.7 dev if_2_4_1

// node3
ip -f mpls route add 64 via inet 169.254.0.12 dev if_4_3_1
```

- Ping 192.168.0.11 (node4) from 192.168.0.1 (node1) (where routing
towards 192.168.0.1 is via IP route directly towards node1 from node4)
```
ping 192.168.0.11
```

- tcpdump on interface to capture ping packets wrapped within MPLS
header which inturn wrapped within IP6GRE header

```
16:43:41.121073 IP6
2401:db00:21:6048:feed::1 > 2401:db00:21:6048:feed::2:
DSTOPT GREv0, key=0x1, length 100:
MPLS (label 32, exp 0, ttl 255) (label 64, exp 0, [S], ttl 255)
IP 192.168.0.1 > 192.168.0.11:
ICMP echo request, id 1208, seq 45, length 64

0x0000: 6000 2cdb 006c 3c3f 2401 db00 0021 6048 `.,..l<?$....!`H
0x0010: feed 0000 0000 0001 2401 db00 0021 6048 ........$....!`H
0x0020: feed 0000 0000 0002 2f00 0401 0401 0100 ......../.......
0x0030: 2000 8847 0000 0001 0002 00ff 0004 01ff ...G............
0x0040: 4500 0054 3280 4000 ff01 c7cb c0a8 0001 E..T2.@.........
0x0050: c0a8 000b 0800 a8d7 04b8 002d 2d3c a05b ...........--<.[
0x0060: 0000 0000 bcd8 0100 0000 0000 1011 1213 ................
0x0070: 1415 1617 1819 1a1b 1c1d 1e1f 2021 2223 .............!"#
0x0080: 2425 2627 2829 2a2b 2c2d 2e2f 3031 3233 $%&'()*+,-./0123
0x0090: 3435 3637 4567
```

Signed-off-by: Saif Hasan <has@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 2f635cee 27-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Drop pernet_operations::async

Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8cec2f49 14-Mar-2018 Kirill Tkhai <ktkhai@virtuozzo.com>

net: Convert mpls_net_ops

These pernet_operations register and unregister sysctl table.
Exit methods frees platform_labels from net::mpls::platform_label.
Everything is per-net, and they looks safe to be marked async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 779b7931 28-Feb-2018 Daniel Axtens <dja@axtens.net>

net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len

If you take a GSO skb, and split it into packets, will the network
length (L3 headers + L4 headers + payload) of those packets be small
enough to fit within a given MTU?

skb_gso_validate_mtu gives you the answer to that question. However,
we recently added to add a way to validate the MAC length of a split GSO
skb (L2+L3+L4+payload), and the names get confusing, so rename
skb_gso_validate_mtu to skb_gso_validate_network_len

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3968523f 07-Feb-2018 Dan Williams <dan.j.williams@intel.com>

mpls, nospec: Sanitize array index in mpls_label_ok()

mpls_label_ok() validates that the 'platform_label' array index from a
userspace netlink message payload is valid. Under speculation the
mpls_label_ok() result may not resolve in the CPU pipeline until after
the index is used to access an array element. Sanitize the index to zero
to prevent userspace-controlled arbitrary out-of-bounds speculation, a
precursor for a speculative execution side channel vulnerability.

Cc: <stable@vger.kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c1c502b5 02-Dec-2017 Florian Westphal <fw@strlen.de>

net: use rtnl_register_module where needed

all of these can be compiled as a module, so use new
_module version to make sure module can no longer be removed
while callback/dump is in use.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14c68c43 11-Oct-2017 Colin Ian King <colin.king@canonical.com>

net: mpls: make function ipgre_mpls_encap_hlen static

The function ipgre_mpls_encap_hlen is local to the source and
does not need to be in global scope, so make it static.

Cleans up sparse warning:
symbol 'ipgre_mpls_encap_hlen' was not declared. Should it be static?

Fixes: bdc476413dcdb ("ip_tunnel: add mpls over gre support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bdc47641 04-Oct-2017 Amine Kherbouche <amine.kherbouche@6wind.com>

ip_tunnel: add mpls over gre support

This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
API by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new
tunnel type TUNNEL_ENCAP_MPLS.

Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b97bac64 09-Aug-2017 Florian Westphal <fw@strlen.de>

rtnetlink: make rtnl_register accept a flags parameter

This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.

This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a906c1aa 07-Jul-2017 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: fix uninitialized in_label var warning in mpls_getroute

Fix the below warning generated by static checker:
net/mpls/af_mpls.c:2111 mpls_getroute()
error: uninitialized symbol 'in_label'."

Fixes: 397fc9e5cefe ("mpls: route get support")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ca4a1cd9 04-Jul-2017 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: fix rtm policy in mpls_getroute

fix rtm policy name typo in mpls_getroute and also remove
export of rtm_ipv4_policy

Fixes: 397fc9e5cefe ("mpls: route get support")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 397fc9e5 03-Jul-2017 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: route get support

This patch adds RTM_GETROUTE doit handler for mpls routes.

Input:
RTA_DST - input label
RTA_NEWDST - labels in packet for multipath selection

By default the getroute handler returns matched
nexthop label, via and oif

With RTM_F_FIB_MATCH flag, full matched route is
returned.

example (with patched iproute2):
$ip -f mpls route show
101
nexthop as to 102/103 via inet 172.16.2.2 dev virt1-2
nexthop as to 302/303 via inet 172.16.12.2 dev virt1-12
201
nexthop as to 202/203 via inet6 2001:db8:2::2 dev virt1-2
nexthop as to 402/403 via inet6 2001:db8:12::2 dev virt1-12

$ip -f mpls route get 103
RTNETLINK answers: Network is unreachable

$ip -f mpls route get 101
101 as to 102/103 via inet 172.16.2.2 dev virt1-2

$ip -f mpls route get as to 302/303 101
101 as to 302/303 via inet 172.16.12.2 dev virt1-12

$ip -f mpls route get fibmatch 103
RTNETLINK answers: Network is unreachable

$ip -f mpls route get fibmatch 101
101
nexthop as to 102/103 via inet 172.16.2.2 dev virt1-2
nexthop as to 302/303 via inet 172.16.12.2 dev virt1-12

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c2e8471d 31-May-2017 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: fix clearing of dead nh_flags on link up

recent fixes to use WRITE_ONCE for nh_flags on link up,
accidently ended up leaving the deadflags on a nh. This patch
fixes the WRITE_ONCE to use freshly evaluated nh_flags.

Fixes: 39eb8cd17588 ("net: mpls: rt_nhn_alive and nh_flags should be accessed using READ_ONCE")
Reported-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e1af005b 27-May-2017 David Ahern <dsahern@gmail.com>

net: mpls: remove unnecessary initialization of err

err is initialized to EINVAL and not used before it is set again.
Remove the unnecessary initialization.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d4e72560 27-May-2017 David Ahern <dsahern@gmail.com>

net: mpls: Make nla_get_via in af_mpls.c

nla_get_via is only used in af_mpls.c. Remove declaration from internal.h
and move up in af_mpls.c before first use. Code move only; no
functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 074350e2 27-May-2017 David Ahern <dsahern@gmail.com>

net: mpls: Add extack messages for route add and delete failures

Add error messages for failures in adding and deleting mpls routes.
This covers most of the annoying EINVAL errors.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b7b386f4 27-May-2017 David Ahern <dsahern@gmail.com>

net: mpls: Pull common label check into helper

mpls_route_add and mpls_route_del have the same checks on the label.
Move to a helper. Avoid duplicate extack messages in the next patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a1f10abe 27-May-2017 David Ahern <dsahern@gmail.com>

net: Fill in extack for mpls lwt encap

Fill in extack for errors in build_state for mpls lwt encap including
passing extack to nla_get_labels and adding error messages for failures
in it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 752ade68 08-May-2017 Michal Hocko <mhocko@suse.com>

treewide: use kv[mz]alloc* rather than opencoded variants

There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator. E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation. This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously. There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c21ef3e3 16-Apr-2017 David Ahern <dsa@cumulusnetworks.com>

net: rtnetlink: plumb extended ack to doit function

Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.

This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# fceb6435 12-Apr-2017 Johannes Berg <johannes.berg@intel.com>

netlink: pass extended ACK struct to parsing functions

Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1511009c 31-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Increase max number of labels for lwt encap

Alow users to push down more labels per MPLS encap. Similar to LSR case,
move label array to the end of mpls_iptunnel_encap and allocate based on
the number of labels for the route.

For consistency with the LSR case, re-use the same maximum number of
labels.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a4ac8c98 31-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: bump maximum number of labels

Allow users to push down more labels per MPLS route. With the previous
patches, no memory allocations are based on MAX_NEW_LABELS; the limit
is only used to keep userspace in check.

At this point MAX_NEW_LABELS is only used for mpls_route_config (copying
route data from userspace) and processing nexthops looking for the max
number of labels across the route spec.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# df1c6316 31-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Limit memory allocation for mpls_route

Limit memory allocation size for mpls_route to 4096.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 59b20966 31-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: change mpls_route layout

Move labels to the end of mpls_nh as a 0-sized array and within mpls_route
move the via for a nexthop after the mpls_nh. The new layout becomes:

+----------------------+
| mpls_route |
+----------------------+
| mpls_nh 0 |
+----------------------+
| alignment padding | 4 bytes for odd number of labels; 0 for even
+----------------------+
| via[rt_max_alen] 0 |
+----------------------+
| alignment padding | via's aligned on sizeof(unsigned long)
+----------------------+
| ... |
+----------------------+
| mpls_nh n-1 |
+----------------------+
| via[rt_max_alen] n-1 |
+----------------------+

Memory allocated for nexthop + via is constant across all nexthops and
their via. It is based on the maximum number of labels across all nexthops
and the maximum via length. The size is saved in the mpls_route as
rt_nh_size. Accessing a nexthop becomes rt->rt_nh + index * rt->rt_nh_size.

The offset of the via address from a nexthop is saved as rt_via_offset
so that given an mpls_nh pointer the via for that hop is simply
nh + rt->rt_via_offset.

With prior code, memory allocated per mpls_route with 1 nexthop:
via is an ethernet address - 64 bytes
via is an ipv4 address - 64
via is an ipv6 address - 72

With this patch set, memory allocated per mpls_route with 1 nexthop and
1 or 2 labels:
via is an ethernet address - 56 bytes
via is an ipv4 address - 56
via is an ipv6 address - 64

The 8-byte reduction is due to the previous patch; the change introduced
by this patch has no impact on the size of allocations for 1 or 2 labels.

Performance impact of this change was examined using network namespaces
with veth pairs connecting namespaces. ns0 inserts the packet to the
label-switched path using an lwt route with encap mpls. ns1 adds 1 or 2
labels depending on test, ns2 (and ns3 for 2-label test) pops the label
and forwards. ns3 (or ns4) for a 2-label is the destination. Similar
series of namespaces used for 2-nexthop test.

Intent is to measure changes to latency (overhead in manipulating the
packet) in the forwarding path. Tests used netperf with UDP_RR.

IPv4: current patches
1 label, 1 nexthop 29908 30115
2 label, 1 nexthop 29071 29612
1 label, 2 nexthop 29582 29776
2 label, 2 nexthop 29086 29149

IPv6: current patches
1 label, 1 nexthop 24502 24960
2 label, 1 nexthop 24041 24407
1 label, 2 nexthop 23795 23899
2 label, 2 nexthop 23074 22959

In short, the change has no effect to a modest increase in performance.
This is expected since this patch does not really have an impact on routes
with 1 or 2 labels (the current limit) and 1 or 2 nexthops.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 77ef013a 31-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Convert number of nexthops to u8

Number of nexthops and number of alive nexthops are tracked using an
unsigned int. A route should never have more than 255 nexthops so
convert both to u8. Update all references and intermediate variables
to consistently use u8 as well.

Shrinks the size of mpls_route from 32 bytes to 24 bytes with a 2-byte
hole before the nexthops.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 39eb8cd1 31-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: rt_nhn_alive and nh_flags should be accessed using READ_ONCE

The number of alive nexthops for a route (rt->rt_nhn_alive) and the
flags for a next hop (nh->nh_flags) are modified by netdev event
handlers. The event handlers run with rtnl_lock held so updates are
always done with the lock held. The packet path accesses the fields
under the rcu lock. Since those fields can change at any moment in
the packet path, both fields should be accessed using READ_ONCE. Updates
to both fields should use WRITE_ONCE.

Update mpls_select_multipath (packet path) and mpls_ifdown and mpls_ifup
(event handlers) accordingly.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e944e97a 28-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Update lfib_nlmsg_size to skip deleted nexthops

A recent commit skips nexthops in a route if the device has been
deleted. Update lfib_nlmsg_size accordingly.

Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1182e4d0 28-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Send netconf messages on device register and unregister

Send netconf notifications for MPLS when the device registers and
unregisters.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 823566ae 28-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net:mpls: Refactor mpls_netconf_notify_devconf to take event

Refactor mpls_netconf_notify_devconf to take the event as an input arg.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4ea8efad 24-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Delete route when all nexthops have been deleted

When all devices for all nexthops in a route have been deleted, the
route is effectively dead, so remove it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c00e51dd 24-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Don't show nexthop if device has been deleted

If the device for a nexthop in a multipath route is deleted, the nexthop
is effectively removed from the route. Currently, a route dump still
returns the nexhop though without the device set:

$ ip -f mpls ro ls
100
nexthopvia inet 10.11.1.2 dev br0
nexthopvia inet 10.100.3.1 dev eth3
$ ip li del br0
$ ip -f mpls ro ls
100
nexthopvia inet 10.11.1.2 dev * dead linkdown
nexthopvia inet 10.100.3.1 dev eth3

Since the nexthop is effectively deleted, drop the hop from the route
dump.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6a18c312 23-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Fix setting ttl_propagate for rt2

Fix copy and paste error setting rt_ttl_propagate.

Fixes: 5b441ac8784c1 ("mpls: allow TTL propagation to IP packets to be configured")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 61733c91 13-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Fix nexthop alive tracking on down events

Alive tracking of nexthops can account for a link twice if the carrier
goes down followed by an admin down of the same link rendering multipath
routes useless. This is similar to 79099aab38c8 for UNREGISTER events and
DOWN events.

Fix by tracking number of alive nexthops in mpls_ifdown similar to the
logic in mpls_ifup. Checking the flags per nexthop once after all events
have been processed is simpler than trying to maintian a running count
through all event combinations.

Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive
per a comment from checkpatch:
WARNING: Prefer WRITE_ONCE(<FOO>, <BAR>) over ACCESS_ONCE(<FOO>) = <BAR>

Fixes: c89359a42e2a4 ("mpls: support for dead routes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a59166e4 10-Mar-2017 Robert Shearman <rshearma@brocade.com>

mpls: allow TTL propagation from IP packets to be configured

Allow TTL propagation from IP packets to MPLS packets to be
configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
allows the TTL to be set in the resulting MPLS packet, with the value
of 0 having the semantics of enabling propagation of the TTL from the
IP header (i.e. non-zero values disable propagation).

Also allow the configuration to be overridden globally by reusing the
same sysctl to control whether the TTL is propagated from IP packets
into the MPLS header. If the per-LWT attribute is set then it
overrides the global configuration. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl,
"net.mpls.default_ttl". This is kept separate from the configuration
of whether IP TTL propagation is enabled as it can be used in the
future when non-IP payloads are supported (i.e. where there is no
payload TTL that can be propagated).

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5b441ac8 10-Mar-2017 Robert Shearman <rshearma@brocade.com>

mpls: allow TTL propagation to IP packets to be configured

Provide the ability to control on a per-route basis whether the TTL
value from an MPLS packet is propagated to an IPv4/IPv6 packet when
the last label is popped as per the theoretical model in RFC 3443
through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
mean disable propagation and 1 to mean enable propagation.

In order to provide the ability to change the behaviour for packets
arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
way for a user to change the behaviour for all existing routes without
having to reprogram them, a global knob is provided. This is done
through the addition of a new per-namespace sysctl,
"net.mpls.ip_ttl_propagate", which defaults to enabled. If the
per-route attribute is set (either enabled or disabled) then it
overrides the global configuration.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 79099aab 10-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

mpls: Do not decrement alive counter for unregister events

Multipath routes can be rendered usesless when a device in one of the
paths is deleted. For example:

$ ip -f mpls ro ls
100
nexthop as to 200 via inet 172.16.2.2 dev virt12
nexthop as to 300 via inet 172.16.3.2 dev br0
101
nexthop as to 201 via inet6 2000:2::2 dev virt12
nexthop as to 301 via inet6 2000:3::2 dev br0

$ ip li del br0

When br0 is deleted the other hop is not considered in
mpls_select_multipath because of the alive check -- rt_nhn_alive
is 0.

rt_nhn_alive is decremented once in mpls_ifdown when the device is taken
down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For
a 2 hop route, deleting one device drops the alive count to 0. Since
devices are taken down before unregistering, the decrement on
NETDEV_UNREGISTER is redundant.

Fixes: c89359a42e2a4 ("mpls: support for dead routes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e37791ec 10-Mar-2017 David Ahern <dsa@cumulusnetworks.com>

mpls: Send route delete notifications when router module is unloaded

When the mpls_router module is unloaded, mpls routes are deleted but
notifications are not sent to userspace leaving userspace caches
out of sync. Add the call to mpls_notify_route in mpls_net_exit as
routes are freed.

Fixes: 0189197f44160 ("mpls: Basic routing support")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 24045a03 20-Feb-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Add support for netconf

Add netconf support to MPLS. Allows userpsace to learn and be notified
of changes to 'input' enable setting per interface.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 9f427a0e 20-Jan-2017 David Ahern <dsa@cumulusnetworks.com>

net: mpls: Fix multipath selection for LSR use case

MPLS multipath for LSR is broken -- always selecting the first nexthop
in the one label case. For example:

$ ip -f mpls ro ls
100
nexthop as to 200 via inet 172.16.2.2 dev virt12
nexthop as to 300 via inet 172.16.3.2 dev virt13
101
nexthop as to 201 via inet6 2000:2::2 dev virt12
nexthop as to 301 via inet6 2000:3::2 dev virt13

In this example incoming packets have a single MPLS labels which means
BOS bit is set. The BOS bit is passed from mpls_forward down to
mpls_multipath_hash which never processes the hash loop because BOS is 1.

Update mpls_multipath_hash to process the entire label stack. mpls_hdr_len
tracks the total mpls header length on each pass (on pass N mpls_hdr_len
is N * sizeof(mpls_shim_hdr)). When the label is found with the BOS set
it verifies the skb has sufficient header for ipv4 or ipv6, and find the
IPv4 and IPv6 header by using the last mpls_hdr pointer and adding 1 to
advance past it.

With these changes I have verified the code correctly sees the label,
BOS, IPv4 and IPv6 addresses in the network header and icmp/tcp/udp
traffic for ipv4 and ipv6 are distributed across the nexthops.

Fixes: 1c78efa8319ca ("mpls: flow-based multipath selection")
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 27d69105 16-Jan-2017 Robert Shearman <rshearma@brocade.com>

mpls: Packet stats

Having MPLS packet stats is useful for observing network operation and
for diagnosing network problems. In the absence of anything better,
RFC2863 and RFC3813 are used for guidance for which stats to expose
and the semantics of them. In particular rx_noroutes maps to in
unknown protos in RFC2863. The stats are exposed to userspace via
AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
RTM_GETSTATS messages.

All the introduced fields are 64-bit, even error ones, to ensure no
overflow with long uptimes. Per-CPU counters are used to avoid
cache-line contention on the commonly used fields. The other fields
have also been made per-CPU for code to avoid performance problems in
error conditions on the assumption that on some platforms the cost of
atomic operations could be more expensive than sending the packet
(which is what would be done in the success case). If that's not the
case, we could instead not use per-CPU counters for these fields.

Only unicast and non-fragment are exposed at the moment, but other
counters can be exposed in the future either by adding to the end of
struct mpls_link_stats or by additional netlink attributes in the
AF_MPLS IFLA_STATS_AF_SPEC nested attribute.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 14dd3e1b 03-Dec-2016 Suraj Deshmukh <surajssd009005@gmail.com>

net: af_mpls.c add space before open parenthesis

Adding space after switch keyword before open
parenthesis for readability purpose.

This patch fixes the checkpatch.pl warning:
space required before the open parenthesis '('

Signed-off-by: Suraj Deshmukh <surajssd009005@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ce927bf1 01-Sep-2016 stephen hemminger <stephen@networkplumber.org>

mpls: get rid of trivial returns

return at end of function is useless.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 407f31be 06-Jul-2016 Simon Horman <simon.horman@netronome.com>

mpls: allow routes on ipip and sit devices

Allow MPLS routes on IPIP and SIT devices now that they
support forwarding MPLS packets.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0d227a86 16-Jun-2016 Simon Horman <simon.horman@netronome.com>

mpls: allow routes on ipgre devices

This appears to be necessary and sufficient to provide
MPLS in GRE (RFC4023) support.

This can be used by establishing an ipgre tunnel device
and then routing MPLS over it.

The following example will forward MPLS frames received with an outermost
MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the
outermost MPLS LSE removed and two new LSEs added with labels 200
(outermost) and 300 (next).

ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225
ip link set up dev tun1
ip addr add 10.0.98.192/24 dev tun1
ip route sh

echo 1 > /proc/sys/net/mpls/conf/eth0/input
echo 101 > /proc/sys/net/mpls/platform_labels
ip -f mpls route add 100 as 200/300 via inet 10.0.98.193
ip -f mpls route sh

Also remove unnecessary braces.

Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ae7ef81e 02-Jun-2016 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

skbuff: introduce skb_gso_validate_mtu

skb_gso_network_seglen is not enough for checking fragment sizes if
skb is using GSO_BY_FRAGS as we have to check frag per frag.

This patch introduces skb_gso_validate_mtu, based on the former, which
will wrap the use case inside it as all calls to skb_gso_network_seglen
were to validate if it fits on a given TMU, and improve the check.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 94a57f1f 07-Apr-2016 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: find_outdev: check for err ptr in addition to NULL check

find_outdev calls inet{,6}_fib_lookup_dev() or dev_get_by_index() to
find the output device. In case of an error, inet{,6}_fib_lookup_dev()
returns error pointer and dev_get_by_index() returns NULL. But the function
only checks for NULL and thus can end up calling dev_put on an ERR_PTR.
This patch adds an additional check for err ptr after the NULL check.

Before: Trying to add an mpls route with no oif from user, no available
path to 10.1.1.8 and no default route:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
[ 822.337195] BUG: unable to handle kernel NULL pointer dereference at
00000000000003a3
[ 822.340033] IP: [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] PGD 1db38067 PUD 1de9e067 PMD 0
[ 822.340033] Oops: 0000 [#1] SMP
[ 822.340033] Modules linked in:
[ 822.340033] CPU: 0 PID: 11148 Comm: ip Not tainted 4.5.0-rc7+ #54
[ 822.340033] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org
04/01/2014
[ 822.340033] task: ffff88001db82580 ti: ffff88001dad4000 task.ti:
ffff88001dad4000
[ 822.340033] RIP: 0010:[<ffffffff8148781e>] [<ffffffff8148781e>]
mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] RSP: 0018:ffff88001dad7a88 EFLAGS: 00010282
[ 822.340033] RAX: ffffffffffffff9b RBX: ffffffffffffff9b RCX:
0000000000000002
[ 822.340033] RDX: 00000000ffffff9b RSI: 0000000000000008 RDI:
0000000000000000
[ 822.340033] RBP: ffff88001ddc9ea0 R08: ffff88001e9f1768 R09:
0000000000000000
[ 822.340033] R10: ffff88001d9c1100 R11: ffff88001e3c89f0 R12:
ffffffff8187e0c0
[ 822.340033] R13: ffffffff8187e0c0 R14: ffff88001ddc9e80 R15:
0000000000000004
[ 822.340033] FS: 00007ff9ed798700(0000) GS:ffff88001fc00000(0000)
knlGS:0000000000000000
[ 822.340033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 822.340033] CR2: 00000000000003a3 CR3: 000000001de89000 CR4:
00000000000006f0
[ 822.340033] Stack:
[ 822.340033] 0000000000000000 0000000100000000 0000000000000000
0000000000000000
[ 822.340033] 0000000000000000 0801010a00000000 0000000000000000
0000000000000000
[ 822.340033] 0000000000000004 ffffffff8148749b ffffffff8187e0c0
000000000000001c
[ 822.340033] Call Trace:
[ 822.340033] [<ffffffff8148749b>] ? mpls_rt_alloc+0x2b/0x3e
[ 822.340033] [<ffffffff81488e66>] ? mpls_rtm_newroute+0x358/0x3e2
[ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa
[ 822.340033] [<ffffffff813b7d94>] ? rtnetlink_rcv_msg+0x17e/0x191
[ 822.340033] [<ffffffff8111794e>] ? __kmalloc_track_caller+0x8c/0x9e
[ 822.340033] [<ffffffff813c9393>] ?
rht_key_hashfn.isra.20.constprop.57+0x14/0x1f
[ 822.340033] [<ffffffff813b7c16>] ? __rtnl_unlock+0xc/0xc
[ 822.340033] [<ffffffff813cb794>] ? netlink_rcv_skb+0x36/0x82
[ 822.340033] [<ffffffff813b4507>] ? rtnetlink_rcv+0x1f/0x28
[ 822.340033] [<ffffffff813cb2b1>] ? netlink_unicast+0x106/0x189
[ 822.340033] [<ffffffff813cb5b3>] ? netlink_sendmsg+0x27f/0x2c8
[ 822.340033] [<ffffffff81392ede>] ? sock_sendmsg_nosec+0x10/0x1b
[ 822.340033] [<ffffffff81393df1>] ? ___sys_sendmsg+0x182/0x1e3
[ 822.340033] [<ffffffff810e4f35>] ?
__alloc_pages_nodemask+0x11c/0x1e4
[ 822.340033] [<ffffffff8110619c>] ? PageAnon+0x5/0xd
[ 822.340033] [<ffffffff811062fe>] ? __page_set_anon_rmap+0x45/0x52
[ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa
[ 822.340033] [<ffffffff810e85ab>] ? __lru_cache_add+0x1a/0x3a
[ 822.340033] [<ffffffff81087ea9>] ? current_kernel_time64+0x9/0x30
[ 822.340033] [<ffffffff813940c4>] ? __sys_sendmsg+0x3c/0x5a
[ 822.340033] [<ffffffff8148f597>] ?
entry_SYSCALL_64_fastpath+0x12/0x6a
[ 822.340033] Code: 83 08 04 00 00 65 ff 00 48 8b 3c 24 e8 40 7c f2 ff
eb 13 48 c7 c3 9f ff ff ff eb 0f 89 ce e8 f1 ae f1 ff 48 89 c3 48 85 db
74 15 <48> 8b 83 08 04 00 00 65 ff 08 48 81 fb 00 f0 ff ff 76 0d eb 07
[ 822.340033] RIP [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] RSP <ffff88001dad7a88>
[ 822.340033] CR2: 00000000000003a3
[ 822.435363] ---[ end trace 98cc65e6f6b8bf11 ]---

After patch:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
RTNETLINK answers: Network is unreachable

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f20367df 10-Dec-2015 Robert Shearman <rshearma@brocade.com>

mpls: make via address optional for multipath routes

The via address is optional for a single path route, yet is mandatory
when the multipath attribute is used:

# ip -f mpls route add 100 dev lo
# ip -f mpls route add 101 nexthop dev lo
RTNETLINK answers: Invalid argument

Make them consistent by making the via address optional when the
RTA_MULTIPATH attribute is being parsed so that both forms of
specifying the route work.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# eb7809f0 10-Dec-2015 Robert Shearman <rshearma@brocade.com>

mpls: fix out-of-bounds access when via address not specified

When a via address isn't specified, the via table is left initialised
to 0 (NEIGH_ARP_TABLE), and the via address length also left
initialised to 0. This results in a via address array of length 0
being allocated (contiguous with route and nexthop array), meaning
that when a packet is sent using neigh_xmit the neighbour lookup and
creation will cause an out-of-bounds access when accessing the 4 bytes
of the IPv4 address it assumes it has been given a pointer to.

This could be fixed by allocating the 4 bytes of via address necessary
and leaving it as all zeroes. However, it seems wrong to me to use an
ipv4 nexthop (including possibly ARPing for 0.0.0.0) when the user
didn't specify to do so.

Instead, set the via address table to NEIGH_NR_TABLES to signify it
hasn't been specified and use this at forwarding time to signify a
neigh_xmit using an L2 address consisting of the device address. This
mechanism is the same as that used for both ARP and ND for loopback
interfaces and those flagged as no-arp, which are all we can really
support in this case.

Fixes: cf4b24f0024f ("mpls: reduce memory usage of routes")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 72dcac96 10-Dec-2015 Robert Shearman <rshearma@brocade.com>

mpls: don't dump RTA_VIA attribute if not specified

The problem seen is that when adding a route with a nexthop with no
via address specified, iproute2 generates bogus output:

# ip -f mpls route add 100 dev lo
# ip -f mpls route list
100 via inet 0.0.8.0 dev lo

The reason for this is that the kernel generates an RTA_VIA attribute
with the family set to AF_INET, but the via address data having zero
length. The cause of family being AF_INET is that on route insert
cfg->rc_via_table is left set to 0, which just happens to be
NEIGH_ARP_TABLE which is then translated into AF_INET.

iproute2 doesn't validate the length prior to printing and so prints
garbage. Although it could be fixed to do the validation, I would
argue that AF_INET addresses should always be exactly 4 bytes so the
kernel is really giving userspace bogus data.

Therefore, avoid generating the RTA_VIA attribute when dumping the
route if the via address wasn't specified on add/modify. This is
indicated by NEIGH_ARP_TABLE and a zero via address length - if the
user specified a via address the address length would have been
validated such that it was 4 bytes. Although this is a change in
behaviour that is visible to userspace, I believe that what was
generated before was invalid and as such userspace wouldn't be
expecting it.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a3e948e8 10-Dec-2015 Robert Shearman <rshearma@brocade.com>

mpls: validate L2 via address length

If an L2 via address for an mpls nexthop is specified, the length of
the L2 address must match that expected by the output device,
otherwise it could access memory beyond the end of the via address
buffer in the route.

This check was present prior to commit f8efb73c97e2 ("mpls: multipath
route support"), but got lost in the refactoring, so add it back,
applying it to all nexthops in multipath routes.

Fixes: f8efb73c97e2 ("mpls: multipath route support")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c89359a4 01-Dec-2015 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: support for dead routes

Adds support for RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls
routes due to link events. Also adds code to ignore dead
routes during route selection.

Unlike ip routes, mpls routes are not deleted when the route goes
dead. This is current mpls behaviour and this patch does not change
that. With this patch however, routes will be marked dead.
dead routes are not notified to userspace (this is consistent with ipv4
routes).

dead routes:
-----------
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1
nexthop as to 700 via inet 10.1.1.6 dev swp2

$ip link set dev swp1 down

$ip link show dev swp1
4: swp1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode
DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff

$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1 dead linkdown
nexthop as to 700 via inet 10.1.1.6 dev swp2

linkdown routes:
----------------
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1
nexthop as to 700 via inet 10.1.1.6 dev swp2

$ip link show dev swp1
4: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff

/* carrier goes down */
$ip link show dev swp1
4: swp1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
state DOWN mode DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff

$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1 linkdown
nexthop as to 700 via inet 10.1.1.6 dev swp2

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# cf4b24f0 26-Oct-2015 Robert Shearman <rshearma@brocade.com>

mpls: reduce memory usage of routes

Nexthops for MPLS routes have a via address field sized for the
largest via address that is expected, which is 32 bytes. This means
that in the most common case of having ipv4 via addresses, 28 bytes of
memory more than required are used per nexthop. In the other common
case of an ipv6 nexthop then 16 bytes more than required are
used. With large numbers of MPLS routes this extra memory usage could
start to become significant.

To avoid allocating memory for a maximum length via address when not
all of it is required and to allow for ease of iterating over
nexthops, then the via addresses are changed to be stored in the same
memory block as the route and nexthops, but in an array after the end
of the array of nexthops. New accessors are provided to retrieve a
pointer to the via address.

To allow for O(1) access without having to store a pointer or offset
per nh, the via address for each nexthop is sized according to the
maximum via address for any nexthop in the route, which is stored in a
new route field, rt_max_alen, but this is in an existing hole in
struct mpls_route so it doesn't increase the size of the
structure. Each via address is ensured to be aligned to VIA_ALEN_ALIGN
to account for architectures that don't allow unaligned accesses.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b4e04fc7 26-Oct-2015 Robert Shearman <rshearma@brocade.com>

mpls: fix forwarding using v4/v6 explicit null

Fill in the via address length for the predefined IPv4 and IPv6
explicit-null label routes.

Fixes: f8efb73c97e2 ("mpls: multipath route support")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1c78efa8 23-Oct-2015 Robert Shearman <rshearma@brocade.com>

mpls: flow-based multipath selection

Change the selection of a multipath route to use a flow-based
hash. This more suitable for traffic sensitive to reordering within a
flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution
of traffic given enough flows.

Selection of the path for a multipath route is done using a hash of:
1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and
including entropy label, whichever is first.
2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS
payload, if present.

Naturally, a 5-tuple hash using L4 information in addition would be
possible and be better in some scenarios, but there is a tradeoff
between looking deeper into the packet to achieve good distribution,
and packet forwarding performance, and I have erred on the side of the
latter as the default.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f8efb73c 23-Oct-2015 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: multipath route support

This patch adds support for MPLS multipath routes.

Includes following changes to support multipath:
- splits struct mpls_route into 'struct mpls_route + struct mpls_nh'

- 'struct mpls_nh' represents a mpls nexthop label forwarding entry

- moves mpls route and nexthop structures into internal.h

- A mpls_route can point to multiple mpls_nh structs

- the nexthops are maintained as a array (similar to ipv4 fib)

- In the process of restructuring, this patch also consistently changes
all labels to u8

- Adds support to parse/fill RTA_MULTIPATH netlink attribute for
multipath routes similar to ipv4/v6 fib

- In this patch, the multipath route nexthop selection algorithm
simply returns the first nexthop. It is replaced by a
hash based algorithm from Robert Shearman in the next patch

- mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
mpls_route_update though implemented to update based on dev, it was
never used that way. And the dev handling gets tricky with multiple
nexthops. Cannot match against any single nexthops dev. So, this patch
removes the unused 'dev' handling in mpls_route_update.

- dead route/path handling will be implemented in a subsequent patch

Example:

$ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
nexthop as 700 via inet 10.1.1.6 dev swp2 \
nexthop as 800 via inet 40.1.1.2 dev swp3

$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1
nexthop as to 700 via inet 10.1.1.6 dev swp2
nexthop as to 800 via inet 40.1.1.2 dev swp3

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 6ea3c9d5 31-Aug-2015 Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

mpls: fix mpls_net_init memory leak

Fix a memory leak in the mpls netns init function in case of failure. If
register_net_sysctl fails then we need to free the ctl_table.

Fixes: 7720c01f3f59 ("mpls: Add a sysctl to control the size of the mpls label table")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 118d5234 06-Aug-2015 Robert Shearman <rshearma@brocade.com>

mpls: Enforce payload type of traffic sent using explicit NULL

RFC 4182 s2 states that if an IPv4 Explicit NULL label is the only
label on the stack, then after popping the resulting packet must be
treated as a IPv4 packet and forwarded based on the IPv4 header. The
same is true for IPv6 Explicit NULL with an IPv6 packet following.

Therefore, when installing the IPv4/IPv6 Explicit NULL label routes,
add an attribute that specifies the expected payload type for use at
forwarding time for determining the type of the encapsulated packet
instead of inspecting the first nibble of the packet.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3dcb615e 04-Aug-2015 Roopa Prabhu <roopa@cumulusnetworks.com>

af_mpls: add null dev check in find_outdev

This patch adds null dev check for the 'cfg->rc_via_table ==
NEIGH_LINK_TABLE or dev_get_by_index() failed' case

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a9348b5 04-Aug-2015 Dan Carpenter <dan.carpenter@oracle.com>

mpls: small cleanup in inet/inet6_fib_lookup_dev()

We recently changed this code from returning NULL to returning ERR_PTR.
There are some left over NULL assignments which we can remove. We can
preserve the error code from ip_route_output() instead of always
returning -ENODEV. Also these functions use a mix of gotos and direct
returns. There is no cleanup necessary so I changed the gotos to
direct returns.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a6affd24 03-Aug-2015 Robert Shearman <rshearma@brocade.com>

mpls: Use definition for reserved label checks

In multiple locations there are checks for whether the label in hand
is a reserved label or not using the arbritray value of 16. Factor
this out into a #define for better maintainability and for
documentation.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# bf21563a 30-Jul-2015 Roopa Prabhu <roopa@cumulusnetworks.com>

af_mpls: fix undefined reference to ip6_route_output

Undefined reference to ip6_route_output and ip_route_output
was reported with CONFIG_INET=n and CONFIG_IPV6=n.

This patch uses ipv6_stub_impl.ipv6_dst_lookup instead of
ip6_route_output. And wraps affected code under
IS_ENABLED(CONFIG_INET) and IS_ENABLED(CONFIG_IPV6).

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 01faef2c 21-Jul-2015 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: make RTA_OIF optional

If user did not specify an oif, try and get it from the via address.
If failed to get device, return with -ENODEV.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# face0188 21-Jul-2015 Roopa Prabhu <roopa@cumulusnetworks.com>

mpls: export mpls functions for use by mpls iptunnels

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0fae3bf0 11-Jun-2015 Robert Shearman <rshearma@brocade.com>

mpls: handle device renames for per-device sysctls

If a device is renamed and the original name is subsequently reused
for a new device, the following warning is generated:

sysctl duplicate entry: /net/mpls/conf/veth0//input
CPU: 3 PID: 1379 Comm: ip Not tainted 4.1.0-rc4+ #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
0000000000000000 0000000000000000 ffffffff81566aaf 0000000000000000
ffffffff81236279 ffff88002f7d7f00 0000000000000000 ffff88000db336d8
ffff88000db33698 0000000000000005 ffff88002e046000 ffff8800168c9280
Call Trace:
[<ffffffff81566aaf>] ? dump_stack+0x40/0x50
[<ffffffff81236279>] ? __register_sysctl_table+0x289/0x5a0
[<ffffffffa051a24f>] ? mpls_dev_notify+0x1ff/0x300 [mpls_router]
[<ffffffff8108db7f>] ? notifier_call_chain+0x4f/0x70
[<ffffffff81470e72>] ? register_netdevice+0x2b2/0x480
[<ffffffffa0524748>] ? veth_newlink+0x178/0x2d3 [veth]
[<ffffffff8147f84c>] ? rtnl_newlink+0x73c/0x8e0
[<ffffffff8147f27a>] ? rtnl_newlink+0x16a/0x8e0
[<ffffffff81459ff2>] ? __kmalloc_reserve.isra.30+0x32/0x90
[<ffffffff8147ccfd>] ? rtnetlink_rcv_msg+0x8d/0x250
[<ffffffff8145b027>] ? __alloc_skb+0x47/0x1f0
[<ffffffff8149badb>] ? __netlink_lookup+0xab/0xe0
[<ffffffff8147cc70>] ? rtnetlink_rcv+0x30/0x30
[<ffffffff8149e7a0>] ? netlink_rcv_skb+0xb0/0xd0
[<ffffffff8147cc64>] ? rtnetlink_rcv+0x24/0x30
[<ffffffff8149df17>] ? netlink_unicast+0x107/0x1a0
[<ffffffff8149e4be>] ? netlink_sendmsg+0x50e/0x630
[<ffffffff8145209c>] ? sock_sendmsg+0x3c/0x50
[<ffffffff81452beb>] ? ___sys_sendmsg+0x27b/0x290
[<ffffffff811bd258>] ? mem_cgroup_try_charge+0x88/0x110
[<ffffffff811bd5b6>] ? mem_cgroup_commit_charge+0x56/0xa0
[<ffffffff811d7700>] ? do_filp_open+0x30/0xa0
[<ffffffff8145336e>] ? __sys_sendmsg+0x3e/0x80
[<ffffffff8156c3f2>] ? system_call_fastpath+0x16/0x75

Fix this by unregistering the previous sysctl table (registered for
the path containing the original device name) and re-registering the
table for the path containing the new device name.

Fixes: 37bde79979c3 ("mpls: Per-device enabling of packet input")
Reported-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 25cc8f07 05-Jun-2015 Robert Shearman <rshearma@brocade.com>

mpls: fix possible use after free of device

The mpls device is used in an RCU read context without a lock being
held. As the memory is freed without waiting for the RCU grace period
to elapse, the freed memory could still be in use.

Address this by using kfree_rcu to free the memory for the mpls device
after the RCU grace period has elapsed.

Fixes: 03c57747a702 ("mpls: Per-device MPLS state")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 78f5b899 07-May-2015 Tom Herbert <tom@herbertland.com>

mpls: Change reserved label names to be consistent with netbsd

Since these are now visible to userspace it is nice to be consistent
with BSD (sys/netmpls/mpls.h in netBSD).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c967a087 05-May-2015 Tom Herbert <tom@herbertland.com>

mpls: Move reserved label definitions

Move to include/uapi/linux/mpls.h to be externally visibile.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 5a9ab017 22-Apr-2015 Robert Shearman <rshearma@brocade.com>

mpls: Prevent use of implicit NULL label as outgoing label

The reserved implicit-NULL label isn't allowed to appear in the label
stack for packets, so make it an error for the control plane to
specify it as an outgoing label.

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 37bde799 22-Apr-2015 Robert Shearman <rshearma@brocade.com>

mpls: Per-device enabling of packet input

An MPLS network is a single trust domain where the edges must be in
control of what labels make their way into the core. The simplest way
of ensuring this is for the edge device to always impose the labels,
and not allow forward labeled traffic from untrusted neighbours. This
is achieved by allowing a per-device configuration of whether MPLS
traffic input from that interface should be processed or not.

To be secure by default, the default state is changed to MPLS being
disabled on all interfaces unless explicitly enabled and no global
option is provided to change the default. Whilst this differs from
other protocols (e.g. IPv6), network operators are used to explicitly
enabling MPLS forwarding on interfaces, and with the number of links
to the MPLS core typically fairly low this doesn't present too much of
a burden on operators.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 03c57747 22-Apr-2015 Robert Shearman <rshearma@brocade.com>

mpls: Per-device MPLS state

Add per-device MPLS state to supported interfaces. Use the presence of
this state in mpls_route_add to determine that this is a supported
interface.

Use the presence of mpls_dev to drop packets that arrived on an
unsupported interface - previously they were allowed through.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 76fecd82 12-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: In mpls_egress verify the packet length.

Reobert Shearman noticed that mpls_egress is failing to verify that
the bytes to be examined are in fact present in the packet before
mpls_egress reads those bytes.

As suggested by David Miller reduce this to a single pskb_may_pull
call so that we don't do unnecessary work in the fast path.

Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# b79bda3d 07-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

neigh: Use neigh table index for neigh_packet_xmit

Remove a little bit of unnecessary work when transmitting a packet with
neigh_packet_xmit. Use the neighbour table index not the address family
as a parameter.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# aa7da937 07-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Correct the ttl decrement.

According to RFC3032 section 2.4.2 packets with an outgoing
ttl of 0 MUST NOT be forwarded. According to section 2.4.1
an outgoing TTL of 0 comes from an incomming TTL <= 1.

Therefore any packets that is received with a ttl <= 1 should
not have it's ttl decremented and forwarded.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0f7bbd58 07-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Better error code for unsupported option.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 19d0c341 07-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Cleanup the rcu usage in the code.

Sparse was generating a lot of warnings mostly from missing annotations
in the code. Add missing annotations and in a few cases tweak the code
for performance by moving work before loops.

This also fixes a problematic ommision of rcu_assign_pointer and
rcu_dereference.

Hopefully with complete rcu annotations any new rcu errors will stick
out like a sore thumb.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d865616e 07-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Fix the kzalloc argument order in mpls_rt_alloc

*Blink* I got the argument order wrong to kzalloc and the
code was working properly when tested. *Blink*

Fix that.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f8d54afc 06-Mar-2015 Robert Shearman <rshearma@brocade.com>

mpls: Properly validate RTA_VIA payload length

If the nla length is less than 2 then the nla data could be accessed
beyond the accessible bounds. So ensure that the nla is big enough to
at least read the via_family before doing so. Replace magic value of
2.

Fixes: 03c0566542f4 ("mpls: Basic support for adding and removing routes")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 4b5edb2f 04-Mar-2015 Stephen Rothwell <sfr@canb.auug.org.au>

mpls: using vzalloc requires including vmalloc.h

Fixes this build error:

net/mpls/af_mpls.c: In function 'resize_platform_label_table':
net/mpls/af_mpls.c:767:4: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
labels = vzalloc(size);
^

Fixes: 7720c01f3f59 ("mpls: Add a sysctl to control the size of the mpls label table")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f0126539 04-Mar-2015 Wu Fengguang <fengguang.wu@intel.com>

mpls: rtm_mpls_policy[] can be static

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8de147dc 03-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Multicast route table change notifications

Unlike IPv4 this code notifies on all cases where mpls routes
are added or removed and it never automatically removes routes.
Avoiding both the userspace confusion that is caused by omitting
route updates and the possibility of a flood of netlink traffic
when an interface goes doew.

For now reserved labels are handled automatically and userspace
is not notified.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 03c05665 03-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Netlink commands to add, remove, and dump routes

This change adds two new netlink routing attributes:
RTA_VIA and RTA_NEWDST.

RTA_VIA specifies the specifies the next machine to send a packet to
like RTA_GATEWAY. RTA_VIA differs from RTA_GATEWAY in that it
includes the address family of the address of the next machine to send
a packet to. Currently the MPLS code supports addresses in AF_INET,
AF_INET6 and AF_PACKET. For AF_INET and AF_INET6 the destination mac
address is acquired from the neighbour table. For AF_PACKET the
destination mac_address is specified in the netlink configuration.

I think raw destination mac address support with the family AF_PACKET
will prove useful. There is MPLS-TP which is defined to operate
on machines that do not support internet packets of any flavor. Further
seem to be corner cases where it can be useful. At this point
I don't care much either way.

RTA_NEWDST specifies the destination address to forward the packet
with. MPLS typically changes it's destination address at every hop.
For a swap operation RTA_NEWDST is specified with a length of one label.
For a push operation RTA_NEWDST is specified with two or more labels.
For a pop operation RTA_NEWDST is not specified or equivalently an emtpy
RTAN_NEWDST is specified.

Those new netlink attributes are used to implement handling of rt-netlink
RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE messages, to maintain the
MPLS label table.

rtm_to_route_config parses a netlink RTM_NEWROUTE or RTM_DELROUTE message,
verify no unhandled attributes or unhandled values are present and sets
up the data structures for mpls_route_add and mpls_route_del.

I did my best to match up with the existing conventions with the caveats
that MPLS addresses are all destination-specific-addresses, and so
don't properly have a scope.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 966bae33 03-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Functions for reading and wrinting mpls labels over netlink

Reading and writing addresses in network byte order in netlink is
traditional and I see no reason to change that. MPLS is interesting
as effectively it has variabely length addresses (the MPLS label
stack). To represent these variable length addresses in netlink
I use a valid MPLS label stack (complete with stop bit).

This achieves two things: a well defined existing format is used,
and the data can be interpreted without looking at it's length.

Not needed to look at the length to decode the variable length
network representation allows existing userspace functions
such as inet_ntop to be used without needed to change their
prototype.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# a2519929 03-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Basic support for adding and removing routes

mpls_route_add and mpls_route_del implement the basic logic for adding
and removing Next Hop Label Forwarding Entries from the MPLS input
label map. The addition and subtraction is done in a way that is
consistent with how the existing routing table in Linux are
maintained. Thus all of the work to deal with NLM_F_APPEND,
NLM_F_EXCL, NLM_F_REPLACE, and NLM_F_CREATE.

Cases that are not clearly defined such as changing the interpretation
of the mpls reserved labels is not allowed.

Because it seems like the right thing to do adding an MPLS route without
specifying an input label and allowing the kernel to pick a free label
table entry is supported. The implementation is currently less than optimal
but that can be changed.

As I don't have anything else to test with only ethernet and the loopback
device are the only two device types currently supported for forwarding
MPLS over.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 7720c01f 03-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Add a sysctl to control the size of the mpls label table

This sysctl gives two benefits. By defaulting the table size to 0
mpls even when compiled in and enabled defaults to not forwarding
any packets. This prevents unpleasant surprises for users.

The other benefit is that as mpls labels are allocated locally a dense
table a small dense label table may be used which saves memory and
is extremely simple and efficient to implement.

This sysctl allows userspace to choose the restrictions on the label
table size userspace applications need to cope with.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0189197f 03-Mar-2015 Eric W. Biederman <ebiederm@xmission.com>

mpls: Basic routing support

This change adds a new Kconfig option MPLS_ROUTING.

The core of this change is the code to look at an mpls packet received
from another machine. Look that packet up in a routing table and
forward the packet on.

Support of MPLS over ATM is not considered or attempted here. This
implemntation follows RFC3032 and implements the MPLS shim header that
can pass over essentially any network.

What RFC3021 refers to as the as the Incoming Label Map (ILM) I call
net->mpls.platform_label[]. What RFC3031 refers to as the Next Label
Hop Forwarding Entry (NHLFE) I call mpls_route. Though calling it the
label fordwarding information base (lfib) might also be valid.

Further the implemntation forwards packets as described in RFC3032.
There is no need and given the original motivation for MPLS a strong
discincentive to have a flexible label forwarding path. In essence
the logic is the topmost label is read, looked up, removed, and
replaced by 0 or more new lables and the sent out the specified
interface to it's next hop.

Quite a few optional features are not implemented here. Among them
are generation of ICMP errors when the TTL is exceeded or the packet
is larger than the next hop MTU (those conditions are detected and the
packets are dropped instead of generating an icmp error). The traffic
class field is always set to 0. The implementation focuses on IP over
MPLS and does not handle egress of other kinds of protocols.

Instead of implementing coordination with the neighbour table and
sorting out how to input next hops in a different address family (for
which there is value). I was lazy and implemented a next hop mac
address instead. The code is simpler and there are flavor of MPLS
such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is
appropriate so a next hop by mac address would need to be implemented
at some point.

Two new definitions AF_MPLS and PF_MPLS are exposed to userspace.

Decoding the mpls header must be done by first byeswapping a 32bit bit
endian word into the local cpu endian and then bit shifting to extract
the pieces. There is no C bit-field that can represent a wire format
mpls header on a little endian machine as the low bits of the 20bit
label wind up in the wrong half of third byte. Therefore internally
everything is deal with in cpu native byte order except when writing
to and reading from a packet.

For management simplicity if a label is configured to forward out
an interface that is down the packet is dropped early. Similarly
if an network interface is removed rt_dev is updated to NULL
(so no reference is preserved) and any packets for that label
are dropped. Keeping the label entries in the kernel allows
the kernel label table to function as the definitive source
of which labels are allocated and which are not.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>