History log of /freebsd-10.1-release/sys/netinet/if_ether.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 272461 02-Oct-2014 gjb

Copy stable/10@r272459 to releng/10.1 as part of
the 10.1-RELEASE process.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 267175 06-Jun-2014 asomers

MFC r263779

Correct ARP update handling when the routes for network interfaces are
restricted to a single FIB in a multifib system.

Restricting an interface's routes to the FIB to which it is assigned (by
setting net.add_addr_allfibs=0) causes ARP updates to fail with "arpresolve:
can't allocate llinfo for x.x.x.x". This is due to the ARP update code hard
coding it's lookup for existing routing entries to FIB 0.

sys/netinet/in.c:
When dealing with RTM_ADD (add route) requests for an interface, use
the interface's assigned FIB instead of the default (FIB 0).

sys/netinet/if_ether.c:
In arpresolve(), enhance error message generated when an
lla_lookup() fails so that the interface causing the error is
visible in logs.

tests/sys/netinet/fibs_test.sh
Clear ATF expected error.


# 260504 10-Jan-2014 ae

MFC r260151 (by adrian):
Use an RLOCK here instead of an RWLOCK - matching all the other calls
to lla_lookup().

This drastically reduces the very high lock contention when doing parallel
TCP throughput tests (> 1024 sockets) with IPv6.

MFC r260187:
lla_lookup() does modification only when LLE_CREATE is specified.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.

MFC r260217:
Add IF_AFDATA_WLOCK_ASSERT() in case lla_lookup() is called with
LLE_CREATE flag.


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 254523 19-Aug-2013 andre

Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with: trociny, glebius


# 253084 09-Jul-2013 ae

Migrate structs arpstat, icmpstat, mrtstat, pimstat and udpstat to PCPU
counters.


# 252577 03-Jul-2013 np

Catch up with r238990. LLE_DELETED does not clobber everything else in
la_flags since said revision.


# 250504 11-May-2013 glebius

Rate limit the number of remotely triggered ARP log messages
to 1 log message per second.


# 249925 26-Apr-2013 glebius

Add const qualifier to the dst parameter of the ifnet if_output method.


# 249903 25-Apr-2013 glebius

Fix couple of mbuf leaks in incoming ARP processing.


# 244183 13-Dec-2012 glebius

Fix problem in r238990. The LLE_LINKED flag should be tested prior to
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.

Reported by: Ian FREISLICH <ianf clue.co.za>


# 243882 05-Dec-2012 glebius

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 240073 03-Sep-2012 glebius

Provide a sysctl switch that allows to install ARP entries
with multicast bit set. FreeBSD refuses to install such
entries since 9.0, and this broke installations running
Microsoft NLB, which are violating standards.

Tested by: Tarasov Oleg <oleg_tarasov sg-tea.com>


# 238990 02-Aug-2012 glebius

Fix races between in_lltable_prefix_free(), lla_lookup(),
llentry_free() and arptimer():

o Use callout_init_rw() for lle timeout, this allows us safely
disestablish them.
- This allows us to simplify the arptimer() and make it
race safe.
o Consistently use ifp->if_afdata_lock to lock access to
linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
is attached to the hash.
- Use LLE_LINKED to avoid double unlinking via consequent
calls to llentry_free().
- Mark lle with LLE_DELETED via |= operation istead of =,
so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
consistent and provide more informative KASSERTs.

The patch is a collaborative work of all submitters and myself.

PR: kern/165863
Submitted by: Andrey Zonov <andrey zonov.org>
Submitted by: Ryan Stone <rysto32 gmail.com>
Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>


# 238967 01-Aug-2012 glebius

Some more whitespace cleanup.


# 238945 31-Jul-2012 glebius

Some style(9) and whitespace changes.

Together with: Andrey Zonov <andrey zonov.org>


# 237263 19-Jun-2012 np

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


# 230442 22-Jan-2012 bz

Clean up some #endif comments removing from short sections. Add #endif
comments to longer, also refining strange ones.

Properly use #ifdef rather than #if defined() where possible. Four
#if defined(PCBGROUP) occurances (netinet and netinet6) were ignored to
avoid conflicts with eventually upcoming changes for RSS.

Reported by: bde (most)
Reviewed by: bde
MFC after: 3 days


# 230387 20-Jan-2012 bz

Remove a superfluous INET6 check (no opt_inet6.h included anyway).

MFC after: 3 days


# 229816 08-Jan-2012 glebius

Make it possible to use alternative source hardware address
in the ARP datagram generated by arprequest(). If caller doesn't
supply the address, then it is either picked from CARP or hardware
address of the interface is taken.

While here, make several minor fixes:

- Hold IF_ADDR_RLOCK(ifp) while traversing address list.
- Remove not true comment.
- Access internet address and mask via in_ifaddr fields,
rather than ifaddr.


# 229810 08-Jan-2012 glebius

Move arprequest() declaration to if_ether.h.


# 229621 05-Jan-2012 jhb

Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by: bz
MFC after: 2 weeks


# 228959 29-Dec-2011 glebius

Don't fallback to a CARP address in BACKUP state.


# 228571 16-Dec-2011 glebius

A major overhaul of the CARP implementation. The ip_carp.c was started
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.

The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.

ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.

To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]

The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.

Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!

PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]


# 227790 21-Nov-2011 glebius

Be more informative for "unknown hardware address format" message.

Submitted by: Andrzej Tobola <ato iem.pw.edu.pl>


# 227785 21-Nov-2011 glebius

- Reduce severity for all ARP events, that can be triggered from remote
machine to LOG_NOTICE. Exception left to "using my IP address".
- Fix multicast ARP warning: add newline and also log the bad MAC address.

Tested by: Alexander Wittig <wittigal msu.edu>


# 227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 223840 07-Jul-2011 ae

Add again the checking for log_arp_permanent_modify that was by accident
removed in the r186119.

PR: kern/154831
MFC after: 1 week


# 223753 04-Jul-2011 ae

ARP code reuses mbuf from ARP request to make a reply, but it does not
reset rcvif to NULL. Since rcvif is not NULL, ipfw(4) supposes that ARP
replies were received on specified interface.
Reset rcvif to NULL for ARP replies to fix this issue.

PR: kern/131817
Reviewed by: glebius
MFC after: 1 month


# 223261 18-Jun-2011 bz

Remove a these days incorrect comment left from before new-arp.

MFC after: 1 week


# 219819 21-Mar-2011 jeff

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


# 218909 21-Feb-2011 brucec

Fix typos - remove duplicate "the".

PR: bin/154928
Submitted by: Eitan Adler <lists at eitanadler.com>
MFC after: 3 days


# 217829 25-Jan-2011 thompsa

When matching an incoming ARP against a bridge, ensure both interfaces belong
to the same bridge.

Submitted by: Alexander Zagrebin


# 217333 12-Jan-2011 csjp

Un-break the build: use the correct format specifier for sizeof()


# 217315 12-Jan-2011 gnn

Fix several bugs in the ARP code related to improperly formatted
packets.

*) Reject requests with a protocol length not equal to 4. This is IPv4
and there is no reason to accept anything else.

*) Reject packets that have a multicast source hardware address.

*) Drop requests where the hardware address length is not equal
to the hardware address length of the interface.

Pointed out by: Rozhuk Ivan
MFC after: 1 week


# 217121 07-Jan-2011 gnn

Fix a memory leak in ARP queues.

Pointed out by: jhb@
MFC after: 2 weeks


# 217113 07-Jan-2011 gnn

Adjust ARP hold queue locking.

Submitted by: Rozhuk Ivan, jhb
MFC after: 2 weeks


# 216075 30-Nov-2010 glebius

Use time_uptime instead of non-monotonic time_second to drive ARP
timeouts.

Suggested by: bde


# 215701 22-Nov-2010 dim

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 215677 22-Nov-2010 zec

Remove an apparently redundant CURVNET_SET() / CURVNET_RESTORE() pair.

MFC after: 3 days


# 215317 14-Nov-2010 dim

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# 215207 12-Nov-2010 gnn

Add a queue to hold packets while we await an ARP reply.

When a fast machine first brings up some non TCP networking program
it is quite possible that we will drop packets due to the fact that
only one packet can be held per ARP entry. This leads to packets
being missed when a program starts or restarts if the ARP data is
not currently in the ARP cache.

This code adds a new sysctl, net.link.ether.inet.maxhold, which defines
a system wide maximum number of packets to be held in each ARP entry.
Up to maxhold packets are queued until an ARP reply is received or
the ARP times out. The default setting is the old value of 1
which has been part of the BSD networking code since time
immemorial.

Expose the time we hold an incomplete ARP entry by adding
the sysctl net.link.ether.inet.wait, which defaults to 20
seconds, the value used when the new ARP code was added..

Reviewed by: bz, rpaulo
MFC after: 3 weeks


# 214675 02-Nov-2010 jhb

Don't leak the LLE lock if the arptimer callout is pending or inactive.

Reported by: David Rhodus
MFC after: 1 month


# 214509 29-Oct-2010 glebius

Remove meaningless XXXXX, that is a remain of comment, removed in r186200.


# 214508 29-Oct-2010 glebius

Revert a small part of the r198301, that is entirely unrelated to the
r198301 itself. It also broke the logic of not sending more than one
ARP request per second, that consequently lead to a potential problem
of flooding network with broadcast packets.

MFC after: 1 week


# 211193 11-Aug-2010 will

Unbreak LINT by moving all carp hooks to net/if.c / netinet/ip_carp.h, with
the appropriate ifdefs.

Reviewed by: bz
Approved by: ken (mentor)


# 211157 10-Aug-2010 will

Allow carp(4) to be loaded as a kernel module. Follow precedent set by
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.

Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default. Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.

This commit requires IP6PROTOSPACER, introduced in r211115.

Reviewed by: bz, simon
Approved by: ken (mentor)
MFC after: 2 weeks


# 210703 31-Jul-2010 bz

Document the mandatory argument to the arptimer() and
nd6_llinfo_timer() functions with a KASSERT().
Note: there is no need to return after panic.

In the legacy IP case, only assign the arg after the check,
in the IPv6 case, remove the extra checks for the table and
interface as they have to be there unless we freed and forgot
to cancel the timer. It doesn't matter anyway as we would
panic on the NULL pointer deref immediately and the bug is
elsewhere.
This unifies the code of both address families to some extend.

Reviewed by: rwatson
MFC after: 6 days


# 207369 29-Apr-2010 bz

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


# 206481 11-Apr-2010 bz

Plug reference leaks in the link-layer code ("new-arp") that previously
prevented the link-layer entry from being freed.

In both in.c and in6.c (though that code path seems to be basically dead)
plug a reference leak in case of a pending callout being drained.

In if_ether.c consistently add a reference before resetting the callout
and in case we canceled a pending one remove the reference for that.
In the final case in arptimer, before freeing the expired entry, remove
the reference again and explicitly call callout_stop() to clear the active
flag.

In nd6.c:nd6_free() we are only ever called from the callout function and
thus need to remove the reference there as well before calling into
llentry_free().

In if_llatbl.c when freeing entire tables make sure that in case we cancel
a pending callout to remove the reference as well.

Reviewed by: qingli (earlier version)
MFC after: 10 days
Problem observed, patch tested by: simon on ipv6gw.f.o,
Christian Kratzer (ck cksoft.de),
Evgenii Davidov (dado korolev-net.ru)
PR: kern/144564
Configurations still affected: with options FLOWTABLE


# 201544 04-Jan-2010 qingli

An existing incomplete ARP entry would expire a subsequent
statically configured entry of the same host. This bug was
due to the expiration timer was not cancelled when installing
the static entry. Since there exist a potential race condition
with respect to timer cancellation, simply check for the
LLE_STATIC bit inside the expiration function instead of
cancelling the active timer.

MFC after: 5 days


# 201416 03-Jan-2010 np

Avoid NULL dereference in arpresolve.


# 198418 23-Oct-2009 qingli

Use the correct option name in the preprocessor command to enable
or disable diagnostic messages.

Reviewed by: ru
MFC after: 3 days


# 198301 20-Oct-2009 qingli

In the ARP callout timer expiration function, the current time_second
is compared against the entry expiration time value (that was set based
on time_second) to check if the current time is larger than the set
expiration time. Due to the +/- timer granularity value, the comparison
returns false, causing the alternative code to be executed. The
alternative code path freed the memory without removing that entry
from the table list, causing a use-after-free bug.

Reviewed by: discussed with kmacy
MFC after: immediately
Verified by: rnoland, yongari


# 198111 15-Oct-2009 qingli

This patch fixes the following issues in the ARP operation:

1. There is a regression issue in the ARP code. The incomplete
ARP entry was timing out too quickly (1 second timeout), as
such, a new entry is created each time arpresolve() is called.
Therefore the maximum attempts made is always 1. Consequently
the error code returned to the application is always 0.
2. Set the expiration of each incomplete entry to a 20-second
lifetime.
3. Return "incomplete" entries to the application.

Reviewed by: kmacy
MFC after: 3 days


# 197225 15-Sep-2009 qingli

This patch enables the node to respond to ARP requests for
configured proxy ARP entries.

Reviewed by: bz
MFC after: immediately


# 196995 08-Sep-2009 np

Add arp_update_event. This replaces route_arp_update_event, which
has not worked since the arp-v2 rewrite.

The event handler will be called with the llentry write-locked and
can examine la_flags to determine whether the entry is being added
or removed.

Reviewed by: gnn, kmacy
Approved by: gnn (mentor)
MFC after: 1 month


# 196797 03-Sep-2009 gnn

Add ARP statistics to the kernel and netstat.

New counters now exist for:
requests sent
replies sent
requests received
replies received
packets received
total packets dropped due to no ARP entry
entrys timed out
Duplicate IPs seen

The new statistics are seen in the netstat command
when it is given the -s command line switch.

MFC after: 2 weeks
In collaboration with: bz


# 196738 01-Sep-2009 bz

In case an upper layer protocol tries to send a packet but the
L2 code does not have the ethernet address for the destination
within the broadcast domain in the table, we remember the
original mbuf in `la_hold' in arpresolve() and send out a
different packet with an arp request.
In case there will be more upper layer packets to send we will
free an earlier one held in `la_hold' and queue the new one.

Once we get a packet in, with which we can perfect our arp table
entry we send out the original 'on hold' packet, should there
be any.
Rather than continuing to process the packet that we received,
we returned without freeing the packet that came in, which
basically means that we leaked an mbuf for every arp request
we sent.

Rather than freeing the received packet and returning, continue
to process the incoming arp packet as well.
This should (a) improve some setups, also proxy-arp, in case it was an
incoming arp request and (b) resembles the behaviour FreeBSD had
from day 1, which alignes with RFC826 "Packet reception" (merge case).

Rename 'm0' to 'hold' to make the code more understandable as
well as diffable to earlier versions more easily.

Handle the link-layer entry 'la' lock comepletely in the block
where needed and release it as early as possible, rather than
holding it longer, down to the end of the function.

Found by: pointyhat, ns1
Bug hunting session with: erwin, simon, rwatson
Tested by: simon on cluster machines
Reviewed by: ratson, kmacy, julian
MFC after: 3 days


# 196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


# 195914 27-Jul-2009 qingli

This patch does the following:

- Allow loopback route to be installed for address assigned to
interface of IFF_POINTOPOINT type.
- Install loopback route for an IPv4 interface addreess when the
"useloopback" sysctl variable is enabled. Similarly, install
loopback route for an IPv6 interface address when the sysctl variable
"nd6_useloopback" is enabled. Deleting loopback routes for interface
addresses is unconditional in case these sysctl variables were
disabled after an interface address has been assigned.

Reviewed by: bz
Approved by: re


# 195727 16-Jul-2009 rwatson

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


# 195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


# 194951 25-Jun-2009 rwatson

Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the
in_ifaddrhead and INADDR_HASH address lists.

Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).

For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support. As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done. This means that one
class of reader-writer races still exists.

MFC after: 6 weeks
Reviewed by: bz


# 194837 24-Jun-2009 rwatson

Add missing unlock of if_addr_mtx when an unmatched ARP packet is received.

Reported by: lstewart
MFC after: 6 weeks


# 194820 24-Jun-2009 rwatson

In ARP input, more consistently acquire and release ifaddr references.

MFC after: 6 weeks


# 194739 23-Jun-2009 bz

After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.


# 193744 08-Jun-2009 bz

After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 193219 01-Jun-2009 rwatson

Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processings its own
workstream, or set of per-protocol queues. Threads may be bound
to specific CPUs, or allowed to migrate, based on a global policy.

In the future it would be desirable to support topology-centric
policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
currently be one of:

NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
an implicit or explicit source (such as an interface or socket).

NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
as well as allowing protocols to provide a flow generation function
for mbufs without flow identifers (m2flow). Falls back on
NETISR_POLICY_SOURCE if now flow ID is available.

NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
being used, as well as a mapping function from workstream to CPU ID,
which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
get and clear drop counters, which query data or apply changes across
all workstreams.

- Add a more extensible netisr registration interface, in which
protocols declare 'struct netisr_handler' structures for each
registered NETISR_ type. These include name, handler function,
optional mbuf to flow ID function, optional mbuf to CPU ID function,
queue limit, and ordering policy. Padding is present to allow these
to be expanded in the future. If no queue limit is declared, then
a default is used.

- Queue limits are now per-workstream, and raised from the previous
IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
with the exception of netnatm, default queue limits. Most protocols
register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
NETISR_POLICY_FLOW, and will therefore take advantage of driver-
generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
the netisr, rather than having polling pretend to be two protocols.
Provide two explicit hooks in the netisr worker for start and end
events for runs: netisr_poll() and netisr_pollmore(), as well as a
function, netisr_sched_poll(), to allow the polling code to schedule
netisr execution. DEVICE_POLLING still embeds single-netisr
assumptions in its implementation, so for now if it is compiled into
the kernel, a single and un-bound netisr thread is enforced
regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking. It will be enabled once further rmlock
optimization has taken place. However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by: bz


# 191816 05-May-2009 zec

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# 191148 16-Apr-2009 kmacy

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


# 191120 15-Apr-2009 kmacy

make LLTABLE visible to netinet


# 190909 11-Apr-2009 zec

Introduce vnet module registration / initialization framework with
dependency tracking and ordering enforcement.

With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized. In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.

The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw). Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.

The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module. The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.

For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.

A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on. Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first. The framework will panic if it detects any unresolved
dependencies before completing system initialization. Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.

Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run. In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container. IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.

The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.

Reviewed by: bz
Approved by: julian (mentor)


# 190787 06-Apr-2009 zec

First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.

Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).

While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.

Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.

Approved by: julian (mentor)


# 189592 09-Mar-2009 bms

Merge IGMPv3 and Source-Specific Multicast (SSM) to the FreeBSD
IPv4 stack.

Diffs are minimized against p4.
PCS has been used for some protocol verification, more widespread
testing of recorded sources in Group-and-Source queries is needed.
sizeof(struct igmpstat) has changed.

__FreeBSD_version is bumped to 800070.


# 189106 27-Feb-2009 bz

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


# 186474 24-Dec-2008 kmacy

Fix missed unlock and reference drop of lle

Found by: pho


# 186411 23-Dec-2008 qingli

Don't create a bogus ARP entry for 0.0.0.0.


# 186317 19-Dec-2008 qingli

The proxy-arp code was broken and responds to ARP
requests for addresses that are not proxied locally.


# 186200 16-Dec-2008 kmacy

default to doing lla_lookup with shared afdata lock and returning a
shared lock on the lle - thus restoring parallel performance to
pre-arpv2 level


# 186161 16-Dec-2008 kmacy

don't unlock lle if it is NULL


# 186119 15-Dec-2008 qingli

This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion


# 185713 06-Dec-2008 csjp

in_rtalloc1(9) returns a locked route, so make sure that we use
RTFREE_LOCKED() here. This macro makes sure the reference count
on the route is being managed properly. This elimates another
case which results in the following message being printed to the
console:

rtfree: 0xc841ee88 has 1 refs

Reviewed by: bz
MFC after: 2 weeks


# 185571 02-Dec-2008 bz

Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


# 185348 26-Nov-2008 zec

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 185088 19-Nov-2008 zec

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183954 16-Oct-2008 zec

Remove a useless global static variable.

Approved by: bz (ad-hoc mentor)


# 183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


# 183014 14-Sep-2008 julian

oops commit the version that compiles


# 183013 14-Sep-2008 julian

Revert a part of the MRT commit that proved un-needed.
rt_check() in its original form proved to be sufficient and
rt_check_fib() can go away (as can its evil twin in_rt_check()).

I believe this does NOT address the crashes people have been seeing
in rt_check.

MFC after: 1 week


# 182775 04-Sep-2008 keramida

Slightly reword comment and remove typos.


# 182563 31-Aug-2008 julian

fix tiny nti in comment


# 181824 18-Aug-2008 philip

Fix ARP in bridging scenarios where the bridge shares its
MAC address with one of its members (see my r180140).

Pointy hat to: philip
Submitted by: Eygene Ryabinkin <rea-fbsd@codelabs.ru>
MFC after: 3 days


# 181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


# 180239 03-Jul-2008 rwatson

Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred). Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired. This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acqusition of Giant and any calling
context.

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by: bz
MFC after: 1 month
X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
but the rest can go back.


# 178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 175025 31-Dec-2007 julian

Don't duplicate the whole of arpresolve to arpresolve 2 for the sake
of two compares against 0. The negative effect of cache flushing
is probably more than the gain by not doing the two compares (the
value is almost certainly in register or at worst, cache).
Note that the uses of m_freem() are in error cases and m_freem()
handles NULL anyhow. So fast-path really isn't changed much at all.


# 174703 17-Dec-2007 kmacy

widen the routing event interface (arp update, redirect, and eventually pmtu change)
into separate functions

revert previous commit's changes to arpresolve and add a new interface
arpresolve2 which does arp resolution without an mbuf


# 174699 17-Dec-2007 kmacy

Don't panic in arpresolve if we're given a null mbuf. We could
insist that the caller just pass in an initialized mbuf even
if didn't have any data - but that seems rather contrived.


# 174648 16-Dec-2007 kmacy

Move arp update upcall to always be called for ARP replies - previous invocation
would not always get called at the appropriate times


# 174559 12-Dec-2007 kmacy

add interface for allowing consumers to register for ARP updates,
redirects, and path MTU changes

Reviewed by: silby


# 174256 04-Dec-2007 yar

For the sake of convenience, print the name of the network interface
IPv4 address duplication was detected on.

Idea by: marck


# 173095 28-Oct-2007 rwatson

Move towards more explicit support for various network protocol stacks
in the TrustedBSD MAC Framework:

- Add mac_atalk.c and add explicit entry point mac_netatalk_aarp_send()
for AARP packet labeling, rather than using a generic link layer
entry point.

- Add mac_inet6.c and add explicit entry point mac_netinet6_nd6_send()
for ND6 packet labeling, rather than using a generic link layer entry
point.

- Add expliict entry point mac_netinet_arp_send() for ARP packet
labeling, and mac_netinet_igmp_send() for IGMP packet labeling,
rather than using a generic link layer entry point.

- Remove previous genering link layer entry point,
mac_mbuf_create_linklayer() as it is no longer used.

- Add implementations of new entry points to various policies, largely
by replicating the existing link layer entry point for them; remove
old link layer entry point implementation.

- Make MAC_IFNET_LOCK(), MAC_IFNET_UNLOCK(), and mac_ifnet_mtx global
to the MAC Framework rather than static to mac_net.c as it is now
needed outside of mac_net.c.

Obtained from: TrustedBSD Project


# 173018 26-Oct-2007 rwatson

Rename 'mac_mbuf_create_from_firewall' to 'mac_netinet_firewall_send' as
we move towards netinet as a pseudo-object for the MAC Framework.

Rename 'mac_create_mbuf_linklayer' to 'mac_mbuf_create_linklayer' to
reflect general object-first ordering preference.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 172467 07-Oct-2007 silby

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


# 169454 10-May-2007 rwatson

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


# 167796 22-Mar-2007 glebius

Remove global list of all llinfo_arp entries and use a callout per
instance expiry of the ARP entries. Since we no longer abuse the IPv4
radix head lock, we can now enter arp_rtrequest() with a lock held on
an arbitrary rt_entry.

Reviewed by: bms


# 166436 02-Feb-2007 bms

Comply with RFC 3927, by forcing ARP replies which contain a source
address within the link-local IPv4 prefix 169.254.0.0/16, to be
broadcast at link layer.

Reviewed by: fenner
MFC after: 2 weeks


# 166010 14-Jan-2007 maxim

o Increment requests counter right before send out an ARP query actually.
Otherwise the code could lead to the spurious EHOSTDOWN errors.

PR: kern/107807
Submitted by: Dmitrij Tejblum
MFC after: 1 month


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 163006 04-Oct-2006 glebius

Save space on stack moving token ring stuff to its own hack block.


# 163005 04-Oct-2006 glebius

Style rev. 1.152.


# 162586 23-Sep-2006 jmg

now that we don't automagicly increase the MTU of host routes, when we copy
the loopback interface, copy it's mtu also.. This means that we again have
large mtu support for local ip addresses...


# 160038 29-Jun-2006 yar

There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures. Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.

Suggested by: glebius, jhb, ru


# 159448 08-Jun-2006 thompsa

Enable proxy ARP answers on any of the bridged interfaces if proxy record
belongs to another interface within the bridge group.

PR: kern/94408
Submitted by: Eygene A. Ryabinkin
MFC after: 1 month


# 156409 07-Mar-2006 thompsa

Further refine the bridge hack in the arp code. Only do the special arp
handling for interfaces which are actually in the bridge group, ignore all
others.

MFC after: 3 days


# 155145 31-Jan-2006 thompsa

Now that the bridge also processes Ethernet frames as itself, two arp replies
will be sent if there is an address on the bridge. Exclude the bridge from the
special arp handling.

This has been tested with all combinations of addresses on the bridge and members.

Pointed out by: Michal Mertl


# 155018 29-Jan-2006 thompsa

Back out of r1.148, it causes two arp replies to be sent with different mac
addresses. One for the bridged interface with the IP address assigned but then
another with the mac for the bridge itself.


# 154271 12-Jan-2006 thompsa

Include the bridge interface itself in the special arp handling.

PR: 90973
MFC after: 1 week


# 153513 18-Dec-2005 glebius

Add a knob to suppress logging of attempts to modify
permanent ARP entries.

Submitted by: Andrew Alcheyev <buddy telenet.ru>


# 153478 16-Dec-2005 emaste

Add descriptions for sysctl -d.

Approved by: glebius
Silence from: rwatson (mentor)


# 152188 08-Nov-2005 glebius

Rework ARP retransmission algorythm so that ARP requests are
retransmitted without suppression, while there is demand for
such ARP entry. As before, retransmission is rate limited to
one packet per second. Details:
- Remove net.link.ether.inet.host_down_time
- Do not set/clear RTF_REJECT flag on route, to
avoid rt_check() returning error. We will generate error
ourselves.
- Return EWOULDBLOCK on first arp_maxtries failed
requests , and return EHOSTDOWN/EHOSTUNREACH
on further requests.
- Retransmit ARP request always, independently from return
code. Ratelimit to 1 pps.


# 150942 04-Oct-2005 thompsa

When bridging is enabled and an ARP request is recieved on a member interface,
the arp code will search all local interfaces for a match. This triggers a
kernel log if the bridge has been assigned an address.

arp: ac:de:48:18:83:3d is using my IP address 192.168.0.142!

bridge0: flags=8041<UP,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.142 netmask 0xffffff00
ether ac:de:48:18:83:3d

Silence this warning for 6.0 to stop unnecessary bug reports, the code will need
to be reworked.

Approved by: mlaier (mentor)
MFC after: 3 days


# 150636 27-Sep-2005 mlaier

Remove bridge(4) from the tree. if_bridge(4) is a full functional
replacement and has additional features which make it superior.

Discussed on: -arch
Reviewed by: thompsa
X-MFC-after: never (RELENG_6 as transition period)


# 150351 19-Sep-2005 andre

Use monotonic 'time_uptime' instead of 'time_second' as timebase
for rt->rt_rmx.rmx_expire.


# 149909 09-Sep-2005 glebius

- Do not hold route entry lock, when calling arprequest(). One such
call was introduced by me in 1.139, the other one was present before.
- Do all manipulations with rtentry and la before dropping the lock.
- Copy interface address from route into local variable before dropping
the lock. Supply this copy as argument to arprequest()

LORs fixed:
http://sources.zabbadoz.net/freebsd/lor/003.html
http://sources.zabbadoz.net/freebsd/lor/037.html
http://sources.zabbadoz.net/freebsd/lor/061.html
http://sources.zabbadoz.net/freebsd/lor/062.html
http://sources.zabbadoz.net/freebsd/lor/064.html
http://sources.zabbadoz.net/freebsd/lor/068.html
http://sources.zabbadoz.net/freebsd/lor/071.html
http://sources.zabbadoz.net/freebsd/lor/074.html
http://sources.zabbadoz.net/freebsd/lor/077.html
http://sources.zabbadoz.net/freebsd/lor/093.html
http://sources.zabbadoz.net/freebsd/lor/135.html
http://sources.zabbadoz.net/freebsd/lor/140.html
http://sources.zabbadoz.net/freebsd/lor/142.html
http://sources.zabbadoz.net/freebsd/lor/145.html
http://sources.zabbadoz.net/freebsd/lor/152.html
http://sources.zabbadoz.net/freebsd/lor/158.html


# 149451 25-Aug-2005 glebius

When we have a published ARP entry for some IP address, do reply on
ARP requests only on the network where this IP address belong, to.

Before this change we did replied on all interfaces. This could
lead to an IP address conflict with host we are doing ARP proxy
for.

PR: kern/75634
Reviewed by: andre


# 148955 11-Aug-2005 glebius

o Fix a race between three threads: output path,
incoming ARP packet and route request adding/removing
ARP entries. The root of the problem is that
struct llinfo_arp was accessed without any locks.
To close race we will use locking provided by
rtentry, that references this llinfo_arp:
- Make arplookup() return a locked rtentry.
- In arpresolve() hold the lock provided by
rt_check()/arplookup() until the end of function,
covering all accesses to the rtentry itself and
llinfo_arp it refers to.
- In in_arpinput() do not drop lock provided by
arplookup() during first part of the function.
- Simplify logic in the first part of in_arpinput(),
removing one level of indentation.
- In the second part of in_arpinput() hold rtentry
lock while copying address.

o Fix a condition when route entry is destroyed, while
another thread is contested on its lock:
- When storing a pointer to rtentry in llinfo_arp list,
always add a reference to this rtentry, to prevent
rtentry being destroyed via RTM_DELETE request.
- Remove this reference when removing entry from
llinfo_arp list.

o Further cleanup of arptimer():
- Inline arptfree() into arptimer().
- Use official queue(3) way to pass LIST.
- Hold rtentry lock while reading its structure.
- Do not check that sdl_family is AF_LINK, but
assert this.

Reviewed by: sam
Stress test: http://www.holm.cc/stress/log/cons141.html
Stress test: http://people.freebsd.org/~pho/stress/log/cons144.html


# 148883 09-Aug-2005 glebius

In preparation for fixing races in ARP (and probably in other
L2/L3 mappings) make rt_check() return a locked rtentry.


# 146986 05-Jun-2005 thompsa

Add hooks into the networking layer to support if_bridge. This changes struct
ifnet so a buildworld is necessary.

Approved by: mlaier (mentor)
Obtained from: NetBSD


# 143491 13-Mar-2005 glebius

Embrace with #ifdef DEV_CARP carp-related code.


# 143314 09-Mar-2005 glebius

Make ARP do not complain about wrong interface if correct interface
is a carp one and address matched it.

Reviewed by: brooks


# 142215 22-Feb-2005 glebius

Add CARP (Common Address Redundancy Protocol), which allows multiple
hosts to share an IP address, providing high availability and load
balancing.

Original work on CARP done by Michael Shalayeff, with many
additions by Marco Pfatschbacher and Ryan McBride.

FreeBSD port done solely by Max Laier.

Patch by: mlaier
Obtained from: OpenBSD (mickey, mcbride)


# 139823 06-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 138615 08-Dec-2004 mlaier

More fixing of multiple addresses in the same prefix. This time do not try
to arp resolve "secondary" local addresses.

Found and submitted by: ru
With additions from: OpenBSD (rev. 1.47)
Reviewed by: ru


# 136960 26-Oct-2004 bms

Check that rt_mask(rt) is non-NULL before dereferencing it, in the
RTM_ADD case, thus avoiding a panic.

Submitted by: Iasen Kostov


# 136441 12-Oct-2004 rwatson

Modify the thrilling "%D is using my IP address %s!" message so that
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.


# 134991 09-Sep-2004 glebius

Check flag do_bridge always, even if kernel was compiled without
BRIDGE support. This makes dynamic bridge.ko working.

Reviewed by: sam
Approved by: julian (mentor)
MFC after: 1 week


# 130407 13-Jun-2004 dfr

Add a new driver to support IP over firewire. This driver is intended to
conform to the rfc2734 and rfc3146 standard for IP over firewire and
should eventually supercede the fwe driver. Right now the broadcast
channel number is hardwired and we don't support MCAP for multicast
channel allocation - more infrastructure is required in the firewire
code itself to fix these problems.


# 128645 25-Apr-2004 luigi

Another small set of changes to reduce diffs with the new arp code.


# 128642 25-Apr-2004 luigi

remove a stale comment on the behaviour of arpresolve


# 128641 25-Apr-2004 luigi

Start the arp timer at init time.
It runs so rarely that it makes no sense to wait until the first request.


# 128636 25-Apr-2004 luigi

This commit does two things:

1. rt_check() cleanup:
rt_check() is only necessary for some address families to gain access
to the corresponding arp entry, so call it only in/near the *resolve()
routines where it is actually used -- at the moment this is
arpresolve(), nd6_storelladdr() (the call is embedded here),
and atmresolve() (the call is just before atmresolve to reduce
the number of changes).
This change will make it a lot easier to decouple the arp table
from the routing table.

There is an extra call to rt_check() in if_iso88025subr.c to
determine the routing info length. I have left it alone for
the time being.

The interface of arpresolve() and nd6_storelladdr() now changes slightly:
+ the 'rtentry' parameter (really a hint from the upper level layer)
is now passed unchanged from *_output(), so it becomes the route
to the final destination and not to the gateway.
+ the routines will return 0 if resolution is possible, non-zero
otherwise.
+ arpresolve() returns EWOULDBLOCK in case the mbuf is being held
waiting for an arp reply -- in this case the error code is masked
in the caller so the upper layer protocol will not see a failure.

2. arpcom untangling
Where possible, use 'struct ifnet' instead of 'struct arpcom' variables,
and use the IFP2AC macro to access arpcom fields.
This mostly affects the netatalk code.

=== Detailed changes: ===
net/if_arcsubr.c
rt_check() cleanup, remove a useless variable

net/if_atmsubr.c
rt_check() cleanup

net/if_ethersubr.c
rt_check() cleanup, arpcom untangling

net/if_fddisubr.c
rt_check() cleanup, arpcom untangling

net/if_iso88025subr.c
rt_check() cleanup

netatalk/aarp.c
arpcom untangling, remove a block of duplicated code

netatalk/at_extern.h
arpcom untangling

netinet/if_ether.c
rt_check() cleanup (change arpresolve)

netinet6/nd6.c
rt_check() cleanup (change nd6_storelladdr)


# 128398 18-Apr-2004 luigi

Replace Bcopy with 'the real thing' as in the rest of the file.


# 128019 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 127828 04-Apr-2004 luigi

+ arpresolve(): remove an unused argument
+ struct ifnet: remove unused fields, move ipv6-related field close
to each other, add a pointer to l3<->l2 translation tables (arp,nd6,
etc.) for future use.

+ struct route: remove an unused field, move close to each
other some fields that might likely go away in the future


# 127277 21-Mar-2004 mdodd

- Fix indentation lost by 'diff -b'.
- Un-wrap short line.


# 127261 21-Mar-2004 mdodd

Remove interface type specific code from arprequest(), and in_arpinput().

The AF_ARP case in the (*if_output)() routine will handle the interface type
specific bits.

Obtained from: NetBSD


# 126936 13-Mar-2004 mdodd

De-register.


# 123768 23-Dec-2003 ru

I didn't notice it right away, but check the right length too.


# 123765 23-Dec-2003 ru

Fix a problem introduced in revision 1.84: m_pullup() does not
necessarily return the same mbuf chain so we need to recompute
mtod() consumers after pulling up.


# 122334 08-Nov-2003 sam

replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS

Supported by: FreeBSD Foundation


# 122320 08-Nov-2003 sam

o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by: FreeBSD Foundation


# 121816 31-Oct-2003 brooks

Replace the if_name and if_unit members of struct ifnet with new members
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.

This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.

Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)


# 121770 30-Oct-2003 sam

Overhaul routing table entry cleanup by introducing a new rtexpunge
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).

Supported by: FreeBSD Foundation


# 120727 04-Oct-2003 sam

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


# 120699 03-Oct-2003 bms

Shorten 'bad gateway' AF_LINK message.

Submitted by: green


# 120698 03-Oct-2003 bms

Make arp_rtrequest()'s 'bad gateway' messages slightly more informative,
to aid me in tracking down LLINFO inconsistencies in the routing table.

Discussed with: fenner


# 120685 03-Oct-2003 bms

Only delete the route if arplookup() tried to create it. Do not delete
RTF_STATIC routes. Do not check for RTF_HOST so as to avoid being DoSed
when an RTF_GENMASK route exists in the table.

Add a more verbose comment about exactly what this code does.

Submitted by: ru


# 120626 01-Oct-2003 ru

By popular demand, added the "static ARP" per-interface option.


# 120418 24-Sep-2003 bms

Fix a logic error in the check to see if arplookup() should free the route.

Noticed by: Mike Hogsett
Reviewed by: ru


# 120383 23-Sep-2003 bms

Fix a bug in arplookup(), whereby a hostile party on a locally
attached network could exhaust kernel memory, and cause a system
panic, by sending a flood of spoofed ARP requests.

Approved by: jake (mentor)
Reported by: Apple Product Security <product-security@apple.com>


# 111888 04-Mar-2003 jlemon

Update netisr handling; Each SWI now registers its queue, and all queue
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.

Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110544 08-Feb-2003 orion

Avoid multiply for preemptive arp calculation since it hits every
ethernet packet sent.

Prompted by: Jeffrey Hsu <hsu@FreeBSD.org>


# 110308 04-Feb-2003 orion

MFS 1.64.2.22: Re-enable non pre-emptive ARP requests.

Submitted by: "Diomidis Spinellis" <dds@aueb.gr>
PR: kern/46116


# 109996 28-Jan-2003 hsu

Avoid lock order reversal by expanding the scope of the
AF_INET radix tree lock to cover the ARP data structures.


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109409 17-Jan-2003 hsu

SMP locking for ARP.


# 109035 09-Jan-2003 tmm

Clear the target hardware address field when generating an ARP request.

Reviewed by: nectar
MFC after: 1 week


# 101090 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

When generating an ARP query, invoke a MAC entry point to permit the
MAC framework to label its mbuf appropriately for the interface.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 98459 19-Jun-2002 peter

Solve the 'unregistered netisr 18' information notice with a sledgehammer.
Register the ISR early, but do not actually kick off the timer until we
see some activity. This still saves us from running the arp timers on
a system with no network cards.


# 96624 14-May-2002 kbyanc

Reset token-ring source routing control field on receipt of ethernet frame
without source routing information. This restores the behaviour in this
scenario to that of prior to my last commit.


# 96184 07-May-2002 kbyanc

Move ISO88025 source routing information into sockaddr_dl's sdl_data
field. This returns the sdl_data field to a variable-length field. More
importantly, this prevents a easily-reproduceable data-corruption bug when
the interface name plus the hardware address exceed the sdl_data field's
original 12 byte limit. However, token-ring interfaces may still overflow
the new sdl_data field's 46 byte limit if the interface name exceeds 6
characters (since 6 characters for interface name plus 6 for hardware
address plus 34 for source routing = the size of sdl_data). Further
refinements could overcome this limitation but would break binary
compatibility; this commit only addresses fixing the bug for
commonly-occuring cases without breaking binary compatibility with the
intention that the functionality can be MFC'ed to -stable.

See message ID's (both send to -arch):
20020421013332.F87395-100000@gateway.posi.net
20020430181359.G11009-300000@gateway.posi.net
for a more thorough description of the bug addressed and how to
reproduce it.

Approved by: silence on -arch and -net
Sponsored by: NTT Multimedia Communications Labs
MFC after: 1 week


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 92802 20-Mar-2002 orion

Send periodic ARP requests when ARP entries for hosts we are sending
to are about to expire. This prevents high packet rate flows from
experiencing packet drops at the sender following ARP cache entry
timeout.

PR: kern/25517
Reviewed by: luigi
MFC after: 7 days


# 92723 19-Mar-2002 alfred

Remove __P.


# 87776 13-Dec-2001 jlemon

Minor style fix.


# 87410 05-Dec-2001 ru

Fixed remotely exploitable DoS in arpresolve().

Easily exploitable by flood pinging the target
host over an interface with the IFF_NOARP flag
set (all you need to know is the target host's
MAC address).

MFC after: 0 days


# 85466 25-Oct-2001 jlemon

If we are bridging, fall back to using any inet address in the system,
irrespective of receive interface, as a last resort.

Submitted by: ru


# 85223 20-Oct-2001 jlemon

Only examine inet addresses of the interface. This was broken in r1.83,
with the result that the system would reply to an ARP request of 0.0.0.0


# 85074 17-Oct-2001 ru

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


# 84931 14-Oct-2001 fjoe

bring in ARP support for variable length link level addresses

Reviewed by: jdp
Approved by: jdp
Obtained from: NetBSD
MFC after: 6 weeks


# 84102 29-Sep-2001 jlemon

Add a hash table that contains the list of internet addresses, and use
this in place of the in_ifaddr list when appropriate. This improves
performance on hosts which have a large number of IP aliases.


# 82966 04-Sep-2001 alfred

Fix sysctl comment field, s/the the/then the

Pointed out by: ru


# 82893 03-Sep-2001 alfred

Allow disabling of "arp moved" messages.

Submitted by: Stephen Hurd <deuce@lordlegacy.org>


# 78295 15-Jun-2001 jlemon

Do not perform arp send/resolve on an interface marked NOARP.

PR: 25006
MFC after: 2 weeks


# 74851 27-Mar-2001 yar

Add a missing m_pullup() before a mtod() in in_arpinput().

PR: kern/22177
Reviewed by: wollman


# 72270 09-Feb-2001 luigi

Sync with the bridge/dummynet/ipfw code already tested in stable.

In ip_fw.[ch] change a couple of variable and field names to
avoid having types, variables and fields with the same name.


# 72056 05-Feb-2001 julian

Fix bad patch from a few days ago. It broke some bridging.


# 71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 71963 03-Feb-2001 julian

Make the code act the same in the case of BRIDGE being defined, but not
turned on, and the case of it not being defined at all.
i.e. Disabling bridging re-enables some of the checks it disables.

Submitted by: "Rogier R. Mulhuijzen" <drwilco@drwilco.net>


# 70699 05-Jan-2001 alfred

provide a sysctl 'net.link.ether.inet.log_arp_wrong_iface' to allow one
to supress logging when ARP replies arrive on the wrong interface:
"/kernel: arp: 1.2.3.4 is on dc0 but got reply from 00:00:c5:79:d0:0c on dc1"

the default is to log just to give notice about possibly incorrectly
configured networks.


# 69152 25-Nov-2000 jlemon

Lock down the network interface queues. The queue mutex must be obtained
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.

IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.

Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.


# 63080 13-Jul-2000 dwmalone

Extra sanity check when arp proxyall is enabled. Don't send an arp
reply if the requesting machine isn't on the interface we believe
it should be. Prevents arp wars when you plug cables in the wrong
way around.

PR: 9848
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
Not objected to by: wollman


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 59143 11-Apr-2000 wes

PR: kern/17872
Submitted by: csg@waterspout.com (C. Stephen Gunn)


# 58770 29-Mar-2000 joerg

Peter Johnson found another log() call without a trailing newline.
All three of them have been introduced in rev 1.64, so i guess i've
got all of them now. :)

Submitted by: Peter Johnson <locke@mcs.net>


# 58758 28-Mar-2000 joerg

Added two missing newlines in calls to log(9).

Reported in Usenet by: locke@mcs.net (Peter Johnson)

While i was at it, prepended a 0x to the %D output, to make it clear that
the printed value is in hex (i assume %D has been chosen over %#x to
obey network byte order).


# 58499 23-Mar-2000 dillon

Fix parens in m_pullup() line in arp handling code. The code was
improperly doing the equivalent of (m = (function() == NULL)) instead
of ((m = function()) == NULL).

This fixes a NULL pointer dereference panic with runt arp packets.


# 58313 19-Mar-2000 lile

o Replace most magic numbers related to token ring with #defines
from iso88025.h.

o Add minimal llc support to iso88025_input.

o Clean up most of the source routing code.

* Submitted by: Nikolai Saoukh <nms@otdel-1.org>


# 57900 10-Mar-2000 rwatson

The function arpintr() incorrectly checks m->m_len to detect incomplete
ARP packets. This can incorrectly reject complete frames since the frame
could be stored in more than one mbuf.

The following patches fix the length comparisson, and add several
diagnostic log messages to the interrupt handler for out-of-the-norm ARP
packets. This should make ARP problems easier to detect, diagnose and
fix.

Submitted by: C. Stephen Gunn <csg@waterspout.com>
Approved by: jkh
Reviewed by: rwatson


# 57178 13-Feb-2000 peter

Clean up some loose ends in the network code, including the X.25 and ISO
#ifdefs. Clean out unused netisr's and leftover netisr linker set gunk.
Tested on x86 and alpha, including world.

Approved by: jkh


# 52377 18-Oct-1999 sheldonh

Append missing newline to log() message for permanent ARP modification
attempt warning, which was added in rev 1.48 .

PR: 14371
Submitted by: sec@pi.musin.de (Stefan `Sec` Zehl)


# 51320 15-Sep-1999 lile

Re-arrange the arp code so that fddi arps work properly.


# 50512 28-Aug-1999 lile

It is much easier to arp if you don't truncate your arp-reply's.
[affects token-ring only]


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 46568 06-May-1999 peter

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


# 45705 15-Apr-1999 eivind

Better handling for ARP/source routing on Token Ring

Submitted by: Larry Lile <lile@stdio.com>


# 44627 10-Mar-1999 julian

Submitted by: Larry Lile
Move the Olicom token ring driver to the officially sanctionned location of
/sys/contrib. Also fix some brokenness in the generic token ring support.

Be warned that if_dl.h has been changed and SOME programs might
like recompilation.


# 44456 04-Mar-1999 wpaul

arprequest() allocates an mbuf with m_gethdr() but does not initialize
m->m_pkthdr.rcvif to NULL. Bad arprequest(). No biscuit.


# 44165 20-Feb-1999 julian

World, I'd like you to meet the first FreeBSD token Ring driver.
This is for various Olicom cards. An IBM driver is following.
This patch also adds support to tcpdump to decode packets on tokenring.
Congratulations to the proud father.. (below)

Submitted by: Larry Lile <lile@stdio.com>


# 44078 16-Feb-1999 dfr

* Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)


# 42866 19-Jan-1999 fenner

Fix bug in last commit (la was used uninitialized if no route was passed in).


# 42775 17-Jan-1999 fenner

If arpresolve() gets passed a route with a null llinfo, call
arplookup() to try again. This gets rid of at least one user's
"arpresolve: can't allocate llinfo" errors, and arplookup() gives
better error messages to help track down the problem if there really
is a problem with the routing table.


# 42486 10-Jan-1999 luigi

Remove check from where arp replies are coming from -- when doing bridging,
interfaces are used in clusters so the check does not apply.


# 41793 14-Dec-1998 luigi

Last bits (i think) of dummynet for -current.


# 39389 16-Sep-1998 fenner

Prevent modification of permanent ARP entries (PR kern/7649)
Ignore ARP replies from the wrong interface (discussion on mailing list)
Add interface name to a couple of error messages


# 36908 12-Jun-1998 julian

Go through the loopback code with a broom..
Remove lots'o'hacks.
looutput is now static.

Other callers who want to use loopback to allow shortcutting
should call the special entrypoint for this, if_simloop(), which is
specifically designed for this purpose. Using looutput for this purpose
was problematic, particularly with bpf and trying to keep track
of whether one should be using the charateristics of the loopback interface
or the interface (e.g. if_ethersubr.c) that was requesting the loopback.
There was a whole class of errors due to this mis-use each of which had
hacks to cover them up.

Consists largly of hack removal :-)


# 36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 36308 23-May-1998 phk

Get more details on the "arpresolve: can't allocate llinfo" bogon.

PR: 2570
Reviewed by: phk
Submitted by: fenner


# 34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 32350 08-Jan-1998 eivind

Make INET a proper option.

This will not make any of object files that LINT create change; there
might be differences with INET disabled, but hardly anything compiled
before without INET anyway. Now the 'obvious' things will give a
proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The
only thing that _should_ work (but can't be made to compile reasonably
easily) is sppp :-(

This commit move struct arpcom from <netinet/if_ether.h> to
<net/if_arp.h>.


# 31884 19-Dec-1997 bde

Fixed gratuitous ANSIisms.


# 30813 28-Oct-1997 bde

Removed unused #includes.


# 25822 14-May-1997 tegge

Don't send arp request for the ip address 0.0.0.0.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 20525 15-Dec-1996 bde

Attempt to complete the fix in the previous revision. This version
fixes the problem reported by max.


# 20448 14-Dec-1996 dyson

Missing TAILQ mod.


# 19794 15-Nov-1996 fenner

Reword two messages:

duplicate ip address 204.162.228.7! sent from ethernet address: 08:00:20:09:7b:1d
changed to
arp: 08:00:20:09:7b:1d is using my IP address 204.162.228.7!

and

arp info overwritten for 204.162.228.2 by 08:00:20:09:7b:1d
changed to
arp: 204.162.228.2 moved from 08:00:20:07:b6:a0 to 08:00:20:09:7b:1d

I think the new wordings are more clear and could save some support
questions.


# 18892 12-Oct-1996 bde

Removed nested include if <sys/socket.h> from <net/if.h> and
<net/if_arp.h> and fixed the things that depended on it. The nested
include just allowed unportable programs to compile and made my
simple #include checking program report that networking code doesn't
need to include <sys/socket.h>.


# 16576 21-Jun-1996 peter

Set the rmx.rmx_expire to 0 when creating fake ethernet addresses for the
broadcast and multicast routes, otherwise they will be expired by
arptimeout after a few minutes, reverting to " (incomplete)". This makes
the work done by rev 1.27 stay around until the route itself is deleted.
This is mainly cosmetic for 'arp' and 'netstat -r'.


# 16557 20-Jun-1996 fenner

Use the route that's guaranteed to exist when picking a source address
for ARP requests.

The NetBSD version of this patch (see NetBSD PR kern/2381) has this change
already. This should close our PR kern/1140 .

Although it's not quite what he submitted, I got the idea from him so
Submitted by: Jin Guojun <jin@george.lbl.gov>


# 16341 13-Jun-1996 dg

Keep ether_type in network order for BPF to be consistent with other
systems.

Submitted by: Ted Lemon, Matt Thomas, and others. Retrofitted for
-current by me.


# 16206 08-Jun-1996 bde

Changed some memcpy()'s back to bcopy()'s.

gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.


# 14761 22-Mar-1996 fenner

Send ARP's for aliased subnets with the proper source address.
Get rid of ac->ac_ipaddr and arpwhohas() since they assume that
an interface has only one address.

Obtained from: BSD/OS 2.1, via Rich Stevens <rstevens@noao.edu>


# 14163 20-Feb-1996 fenner

Make the "arpresolve: can't allocate llinfo" error message
more useful by printing out the IP address it was trying to
resolve, since we're seeing so many complaints about this
error.


# 13926 05-Feb-1996 wollman

Fill in the corresponding ether address of multicast and broadcast
pseudo-``ARP entries'' so arp(8) doesn't show them as `unresolved'.


# 13619 24-Jan-1996 phk

Use new printf features rather than local kludges.


# 12942 20-Dec-1995 wollman

in_proto.c: spell ``Internet'' right and put whitespace after commas.

others: start to populate the link-layer branch of the net mib, by
moving ARP to its proper place. (ARP is not a protocol family, it's an
interface layer between a medium-access layer and a protocol family.)
sysctl(8) needs to be taught about the structure of this branch, unless
Poul-Henning implements dynamic MIB exploration soon.


# 12877 15-Dec-1995 bde

Added a prototype.


# 12820 14-Dec-1995 phk

Another mega commit to staticize things.


# 12693 09-Dec-1995 phk

Remove old ballast, clean up a little bit, staticize.
Add five sysctl variables that you should probably never tweak.
net.arp.t_prune: 300
net.arp.t_keep: 1200
net.arp.t_down: 20
net.arp.maxtries: 5
net.arp.useloopback: 1
net.arp.proxyall: 0

(It's net.arp because arp isn't limited to inet, though our present
implementation surely is).


# 12579 02-Dec-1995 bde

Completed function declarations and/or added prototypes.


# 11680 22-Oct-1995 phk

Remove the last trace of arptnew()


# 11225 05-Oct-1995 wollman

Convert ARP to use queue.h macros rather than insque/remque. While
we're at it, eliminate obsolete exposure of `struct llinfo_arp' to
the world. (This dates back to when ARP entries were not stored in
the routing table, and there was no other way for the `arp' program
to read the whole table than to grovel around in /dev/kmem.)


# 9339 27-Jun-1995 wollman

Delete obsolete #if 0 block.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8426 10-May-1995 wollman

Make networking domains drop-ins, through the magic of GNU ld. (Some day,
there may even be LKMs.) Also, change the internal name of `unixdomain'
to `localdomain' since AF_LOCAL is now the preferred name of this family.
Declare netisr correctly and in the right place.


# 8384 09-May-1995 dg

Replaced some bcopy()'s with memcpy()'s so that gcc while inline/optimize.


# 8090 26-Apr-1995 pst

Cleanup loopback interface support.
Reviewed by: wollman


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 7088 16-Mar-1995 wollman

Add inet_ntoa() and replace ARP's private routine with same.


# 6568 20-Feb-1995 dg

Added missing newlines to calls to log().


# 5196 22-Dec-1994 wollman

Make arp_rtrequest() static since nobody needs to referene it any more.


# 5195 22-Dec-1994 wollman

Move ARP interface initialization into if_ether.c:arp_ifinit().


# 5101 13-Dec-1994 wollman

Update calls to rtalloc1(). Also merge rt_prflags with rt_flags.


# 4069 01-Nov-1994 wollman

Clean up ARP error messages: format IP addresses, explain arplookup()
failures in English.


# 3514 11-Oct-1994 wollman

Fix a bug which caused panics when attempting to change just the flags of
a route. (This still doesn't work, but it doesn't panic now.) It looks
like there may be a number of incipient bugs in this code.

Also, get ready for the time when all IP gateway routes are cloning, which
is necessary to keep proper TCP statistics.


# 3311 02-Oct-1994 phk

GCC cleanup.
Reviewed by:
Submitted by:
Obtained from:


# 3282 01-Oct-1994 wollman

Implement full proxy ARP, gated on option ARP_PROXYALL. This allows
a FreeBSD box to do proxy ARP as easily as most commercial routers do,
without messing around with (potentially variable) Ethernet addresses.
This code is really quite simple; I'm not at all sure why it wasn't
implemented in 4.4.

It might be worth stealing an interface flag (maybe IFF_LINK1) to use for
finer-grained control over which interfaces get proxy treatment. For the
moment, it's all or nothing.


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1817 02-Aug-1994 dg

Added $Id$


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources