#
30cf0fbf |
|
01-May-2024 |
Richard Scheffenegger <rscheff@FreeBSD.org> |
in_pcb: don't leak credential refcounts on error In the error path during allocating an in_pcb, the credentials associated with the new struct get their reference count increased early on, but not decremented when the allocation fails. Reported by: cmiller_netapp.com MFC after: 3 days Reviewed by: jhb, tuexen Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D45033
|
#
1a8d1764 |
|
29-Mar-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: fully retire inp_ppcb pointer Before a protocol specific control block started to embed inpcb in self (see 0aa120d52f3c, e68b3792440c, 483fe96511ec) this pointer used to point at it. Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>. The exp-run detected a minimal use of the field in ports: * sysutils/lsof - patched upstream * net-mgmt/netdata - patch accepted upstream * emulators/qemu-user-static - upstream master branch seems not using the field anymore We can keep the field around for some time, but eventually it may be reused for something else. PR: 277659 (exp-run) Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D44491
|
#
d62c4607 |
|
18-Mar-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: remove unused KPIs to manipulate sockets These KPIs were added in dd0e6c383a9f0 and through 15 years had zero use. They slightly remind what IfAPI does for struct ifnet. But IfAPI does that for the sake of large collection of NIC drivers not being aware of struct ifnet. For the sockets it is unclear what could be a large collection of externally written kernel modules that need extensively use sockets and not be aware of their internals at the same time. This isolation of a structure knowledge requires a lot of work, and just throwing in a few KPIs isn't helpful. Reviewed by: kib, olce, markj Differential Revision: https://reviews.freebsd.org/D44311
|
#
027fda80 |
|
18-Mar-2024 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: remove unused KPIs to manipulate inpcbs These KPIs were added in 9d29c635daa69 and through 15 years had zero use. They slightly remind what IfAPI does for struct ifnet. But IfAPI does that for the sake of large collection of NIC drivers not being aware of struct ifnet. For the inpcb it is unclear what could be a large collection of externally written kernel modules that need extensively use inpcb and not be aware of its internals at the same time. This isolation of a structure knowledge requires a lot of work, and just throwing in a few KPIs isn't helpful. Reviewed by: kib, bz, markj Differential Revision: https://reviews.freebsd.org/D44310
|
#
13720136 |
|
10-Jan-2024 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcpsso: fix when used without -i option Since fdb987bebddf it is not possible anymore to use inp_next iterator for bound, but unconnected sockets. This applies to TCP listening sockets. Therefore the metioned commit broke tcpsso on listening sockets if the -i option was not used. Fix this by iterating through all endpoints instead of only through the bound, but unconnected ones. Reviewed by: markj Fixes: fdb987bebddf ("inpcb: Split PCB hash tables") Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D43353
|
#
4a0c6403 |
|
27-Dec-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: poison several inpcb pointer in in_pcbfree() There are few subsystems that reference inpcb and allow it to outlive in_pcbfree(). There are no known bugs with them to unreference the options pointers for a freed inpcb. Enforce this so that such bugs don't appear in the future. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D43134
|
#
a13039e2 |
|
27-Dec-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: reoder inpcb destruction First, merge in_pcbdetach() with in_pcbfree(). The comment for in_pcbdetach() was no longer correct. Then, make sure we remove the inpcb from the hash before we commit any destructive actions on it. There are couple functions that rely on the hash lock skipping SMR + inpcb lock to lookup an inpcb. Although there are no known functions that similarly rely on the global inpcb list lock, also do list removal before destructive actions. PR: 273890 Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D43122
|
#
bed7633b |
|
08-Dec-2023 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: tcp: allow SOL_SOCKET-level socket options via sysctl interface When using the sysctl interface for setting a SOL_SOCKET-level socket option, the TCP handler refers to the IP handler, which only handles SO_SETFIB and SO_MAX_PACING_RATE. So call sosetopt(), which handles all SOL_SOCKET-level options. Then you can use tcpsso with SOL_SOCKET-level socket options as expected. Reported by: rscheff Reviewed by: glebius, rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D42985
|
#
0fac350c |
|
30-Nov-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
sockets: don't malloc/free sockaddr memory on getpeername/getsockname Just like it was done for accept(2) in cfb1e92912b4, use same approach for two simplier syscalls that return socket addresses. Although, these two syscalls aren't performance critical, this change generalizes some code between 3 syscalls trimming code size. Following example of accept(2), provide VNET-aware and INVARIANT-checking wrappers sopeeraddr() and sosockaddr() around protosw methods. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D42694
|
#
29363fb4 |
|
23-Nov-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
|
#
bbbd7aab |
|
20-Nov-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: garbage collect in_pcbnotifyall()
|
#
685dc743 |
|
16-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
|
#
e3ba0d6a |
|
26-Jul-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: do not copy so_options into inp_flags2 Since f71cb9f74808 socket stays connnected with inpcb through latter's lifetime and there is no reason to complicate things and copy these flags. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D41198
|
#
a43e7a96 |
|
26-Jul-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use internal flag to mark pcbs that are inserted into lbgroup Using INP_REUSEPORT_LB is unsafe, as it is basically a copy of socket's SO_REUSEPORT_LB flag, which can be cleared by userland after bind(). Reviewed by: markj Reported by: syzbot+e7d2e451f89fb444319b@syzkaller.appspotmail.com Differential Revision: https://reviews.freebsd.org/D41197
|
#
a306ed50 |
|
30-May-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Restore missing validation of local addresses for jailed sockets When looking up a listening socket, the SMR-protected lookup routine may return a jailed socket with no local address. This happens when using classic jails with more than one IP address; in a single-IP classic jail, a bound socket's local address is always rewritten to be that of the jail. After commit 7b92493ab1d4, the lookup path failed to check whether the jail corresponding to a matched wildcard socket actually owns the address, and would return the match regardless. Restore the omitted checks. Fixes: 7b92493ab1d4 ("inpcb: Avoid inp_cred dereferences in SMR-protected lookup") Reported by: peter Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D40268
|
#
c2a69e84 |
|
25-Apr-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: move HPTS related fields from inpcb to tcpcb This makes inpcb lighter and allows future cache line optimizations of tcpcb. The reason why HPTS originally used inpcb is the compressed TIME-WAIT state (see 0d7445193ab), that used to free a tcpcb, while the associated connection is still on the HPTS ring. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D39697
|
#
93998d16 |
|
23-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Fix some bugs in _in_pcbinshash_wild() - In _in_pcbinshash_wild(), we should avoid returning v6 sockets unless no other matches are available. This preserves pre-existing semantics. - Fix an inverted test: when inserting a non-jailed PCB, we want to search for the first non-jailed PCB in the hash chain. - Test the right PCB when searching for a non-jailed PCB. While here, add a required locking assertion. Fixes: 7b92493ab1d4 ("inpcb: Avoid inp_cred dereferences in SMR-protected lookup")
|
#
5fd1a67e |
|
20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Release the inpcb cred reference before freeing the structure Now that the inp_cred pointer is accessed only while the inpcb lock is held, we can avoid deferring a crfree() call when freeing an inpcb. This fixes a problem introduced when inpcb hash tables started being synchronized with SMR: the credential reference previously could not be released until all lockless readers have drained, and there is no mechanism to explicitly purge cached, freed UMA items. Thus, ucred references could linger indefinitely, and since ucreds hold a jail reference, the jail would linger indefinitely as well. This manifests as jails getting stuck in the DYING state. Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38573
|
#
7b92493a |
|
20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Avoid inp_cred dereferences in SMR-protected lookup The SMR-protected inpcb lookup algorithm currently has to check whether a matching inpcb belongs to a jail, in order to prioritize jailed bound sockets. To do this it has to maintain a ucred reference, and for this to be safe, the reference can't be released until the UMA destructor is called, and this will not happen within any bounded time period. Changing SMR to periodically recycle garbage is not trivial. Instead, let's implement SMR-synchronized lookup without needing to dereference inp_cred. This will allow the inpcb code to free the inp_cred reference immediately when a PCB is freed, ensuring that ucred (and thus jail) references are released promptly. Commit 220d89212943 ("inpcb: immediately return matching pcb on lookup") gets us part of the way there. This patch goes further to handle lookups of unconnected sockets. Here, the strategy is to maintain a well-defined order of items within a hash chain so that a wild lookup can simply return the first match and preserve existing semantics. This makes insertion of listening sockets more complicated in order to make lookup simpler, which seems like the right tradeoff anyway given that bind() is already a fairly expensive operation and lookups are more common. In particular, when inserting an unconnected socket, in_pcbinhash() now keeps the following ordering: - jailed sockets before non-jailed sockets, - specified local addresses before unspecified local addresses. Most of the change adds a separate SMR-based lookup path for inpcb hash lookups. When a match is found, we try to lock the inpcb and re-validate its connection info. In the common case, this works well and we can simply return the inpcb. If this fails, typically because something is concurrently modifying the inpcb, we go to the slow path, which performs a serialized lookup. Note, I did not touch lbgroup lookup, since there the credential reference is formally synchronized by net_epoch, not SMR. In particular, lbgroups are rarely allocated or freed. I think it is possible to simplify in_pcblookup_hash_wild_locked() now, but I didn't do it in this patch. Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38572
|
#
3e98dcb3 |
|
20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Move inpcb matching logic into separate functions These functions will get some additional callers in future revisions. No functional change intended. Discussed with: glebius Tested by: glebius Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38571
|
#
fdb987be |
|
20-Apr-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Split PCB hash tables Currently we use a single hash table per PCB database for connected and bound PCBs. Since we started using net_epoch to synchronize hash table lookups, there's been a bug, noted in a comment above in_pcbrehash(): connecting a socket can cause an inpcb to move between hash chains, and this can cause a concurrent lookup to follow the wrong linkage pointers. I believe this could cause rare, spurious ECONNREFUSED errors in the worse case. Address the problem by introducing a second hash table and adding more linkage pointers to struct inpcb. Now the database has one table each for connected and unconnected sockets. When inserting an inpcb into the hash table, in_pcbinhash() now looks at the foreign address of the inpcb to figure out which table to use. This ensures that queue linkage pointers are stable until the socket is disconnected, so the problem described above goes away. There is also a small benefit in that in_pcblookup_*() can now search just one of the two possible hash buckets. I also made the "rehash" parameter of in(6)_pcbconnect() unused. This parameter seems confusing and it is simpler to let the inpcb code figure out what to do using the existing INP_INHASHLIST flag. UDP sockets pose a special problem since they can be connected and disconnected multiple times during their lifecycle. To handle this, the patch plugs a hole in the inpcb structure and uses it to store an SMR sequence number. When an inpcb is disconnected - an operation which requires the global PCB database hash lock - the write sequence number is advanced, and in order to reconnect, the connecting thread must wait for readers to drain before reusing the inpcb's hash chain linkage pointers. raw_ip (ab)uses the hash table without using the corresponding accessors. Since there are now two hash tables, it arbitrarily uses the "connected" table for all of its PCBs. This will be addressed in some way in the future. inp interators which specify a hash bucket will only visit connected PCBs. This is not really correct, but nothing in the tree uses that functionality except raw_ip, which as mentioned above places all of its PCBs in the "connected" table and so is unaffected. Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38569
|
#
713264f6 |
|
06-Mar-2023 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Tighten checks for unspecified source addresses The assertions added in commit b0ccf53f2455 ("inpcb: Assert against wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol layers may pass the unspecified address to in_pcblookup(). Add some checks to filter out such packets before we attempt an inpcb lookup: - Disallow the use of an unspecified source address in in_pcbladdr() and in6_pcbladdr(). - Disallow IP packets with an unspecified destination address. - Disallow TCP packets with an unspecified source address, and add an assertion to verify the comment claiming that the case of an unspecified destination address is handled by the IP layer. Reported by: syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com Reported by: syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com Reported by: syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com Reviewed by: glebius, melifaro MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38570
|
#
317fa516 |
|
28-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Remove the IP(V6)_RSS_LISTEN_BUCKET socket option It has no effect, and an exp-run revealed that it is not in use. PR: 261398 (exp-run) Reviewed by: mjg, glebius Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38822
|
#
3aff4ccd |
|
27-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
netinet: Remove IP(V6)_BINDMULTI This option was added in commit 0a100a6f1ee5 but was never completed. In particular, there is no logic to map flowids to different listening sockets, so it accomplishes basically the same thing as SO_REUSEPORT. Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to balance among listening sockets using a hash of the 4-tuple and some optional NUMA policy. The option was never documented or completed, and an exp-run revealed nothing using it in the ports tree. Moreover, it complicates the already very complicated in_pcbbind_setup(), and the checking in in_pcbbind_check_bindmulti() is insufficient. So, let's remove it. PR: 261398 (exp-run) Reviewed by: glebius Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38574
|
#
96871af0 |
|
15-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use family specific sockaddr argument for bind functions Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_bind method and from there on go down the call stack with family specific argument. Reviewed by: zlei, melifaro, markj Differential Revision: https://reviews.freebsd.org/D38601
|
#
c7ea65ec |
|
13-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: refcount_release() returns a bool No functional change intended. MFC after: 1 week Sponsored by: Klara, Inc.
|
#
4130ea61 |
|
09-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Split in_pcblookup_hash_locked() and clean up a bit Split the in_pcblookup_hash_locked() function into several independent subroutine calls, each of which does some kind of hash table lookup. This refactoring makes it easier to introduce variants of the lookup algorithm that behave differently depending on whether they are synchronized by SMR or the PCB database hash lock. While here, do some related cleanup: - Remove an unused ifnet parameter from internal functions. Keep it in external functions so that it can be used in the future to derive a v6 scopeid. - Reorder the parameters to in_pcblookup_lbgroup() to be consistent with the other lookup functions. - Remove an always-true check from in_pcblookup_lbgroup(): we can assume that we're performing a wildcard match. No functional change intended. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D38364
|
#
220d8921 |
|
07-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: immediately return matching pcb on lookup This saves a lot of CPU cycles if you got large connection table. The code removed originates from 413628a7e3d, a very large changeset. Discussed that with Bjoern, Jamie we can't recover why would we ever have identical 4-tuples in the hash, even in the presence of jails. Bjoern did a test that confirms that it is impossible to allocate an identical connection from a jail to a host. Code review also confirms that system shouldn't allow for such connections to exist. With a lack of proper test suite we decided to take a risk and go forward with removing that code. Reviewed by: gallatin, bz, markj Differential Revision: https://reviews.freebsd.org/D38015
|
#
9e46ff4d |
|
03-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
netinet: don't return conflicting inpcb in in_pcbconnect_setup() Last time this inpcb was actually used was in tcp_connect() before c94c54e4df9a.
|
#
a9d22cce |
|
03-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use family specific sockaddr argument for connect functions Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_connect method and from there on go down the call stack with family specific argument. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38356
|
#
2589ec0f |
|
03-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
pcb: Move an assignment into in_pcbdisconnect() All callers of in_pcbdisconnect() clear the local address, so let's just do that in the function itself. Note that the inp's local address is not a parameter to the inp hash functions. No functional change intended. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38362
|
#
b0ccf53f |
|
03-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Assert against wildcard addrs in in_pcblookup_hash_locked() No functional change intended. Reviewed by: glebius MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38361
|
#
675e2618 |
|
03-Feb-2023 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Deduplicate some assertions It makes more sense to check lookupflags in the function which actually uses SMR. No functional change intended. Reviewed by: glebius MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38359
|
#
5ebea466 |
|
01-Feb-2023 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: add myself to the copyright notice for the SMR synchronization in late 2021 and following cleanups
|
#
3d0d5b21 |
|
23-Jan-2023 |
Justin Hibbits <jhibbits@FreeBSD.org> |
IfAPI: Explicitly include <net/if_private.h> in netstack Summary: In preparation of making if_t completely opaque outside of the netstack, explicitly include the header. <net/if_var.h> will stop including the header in the future. Sponsored by: Juniper Networks, Inc. Reviewed by: glebius, melifaro Differential Revision: https://reviews.freebsd.org/D38200
|
#
c4a4b263 |
|
14-Dec-2022 |
Andrew Gallatin <gallatin@FreeBSD.org> |
allocate inpcb aligned to cachelines The inpcb struct is one of the most heavily utilized in the kernel on a busy network server. By aligning it to a cacheline boundary, we can ensure that closely related fields in the inpcb and tcbcb can be predictably located on the same cacheline. rrs has already done a lot of this work to put related fields on the same line for the tcbcb. In combination with a forthcoming patch to align the start of the tcpcb, we see a roughly 3% reduction in CPU use on a busy web server serving traffic over roughly 50,000 TCP connections. Reviewed by: glebius, markj, tuexen Differential Revision: https://reviews.freebsd.org/D37687 Sponsored by: Netflix
|
#
0aa120d5 |
|
02-Dec-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: allow to provide protocol specific pcb size The protocol specific structure shall start with inpcb. Differential revision: https://reviews.freebsd.org/D37126
|
#
73bebcc5 |
|
08-Nov-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: remove TCP includes, all TCP specific code was moved
|
#
ab0ef945 |
|
08-Nov-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
hpts: move inp initialization from the generic inpcb code to TCP Differential revision: https://reviews.freebsd.org/D37124
|
#
f567d55f |
|
08-Nov-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: don't return INP_DROPPED entries from pcb lookups The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a pcb from hash list and mark it as dropped. The comment suggests that such pcb won't be returned by lookups. Indeed, every call to in_pcblookup*() is accompanied by a check for INP_DROPPED. Do what comment suggests: never return such pcbs and remove unnecessary checks. Reviewed by: tuexen Differential revision: https://reviews.freebsd.org/D37061
|
#
d93ec8cb |
|
02-Nov-2022 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Allow SO_REUSEPORT_LB to be used in jails Currently SO_REUSEPORT_LB silently does nothing when set by a jailed process. It is trivial to support this option in VNET jails, but it's also useful in traditional jails. This patch enables LB groups in jails with the following semantics: - all PCBs in a group must belong to the same jail, - PCB lookup prefers jailed groups to non-jailed groups This is a straightforward extension of the semantics used for individual listening sockets. One pre-existing quirk of the lbgroup implementation is that non-jailed lbgroups are searched before jailed listening sockets; that is preserved with this change. Discussed with: glebius MFC after: 1 month Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37029
|
#
a152dd86 |
|
02-Nov-2022 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Remove a PCB from its LB group upon a subsequent error If a memory allocation failure causes bind to fail, we should take the inpcb back out of its LB group since it's not prepared to handle connections. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37027
|
#
ac1750dd |
|
02-Nov-2022 |
Mark Johnston <markj@FreeBSD.org> |
inpcb: Remove NULL checks of credential references Some auditing of the code shows that "cred" is never non-NULL in these functions, either because all callers pass a non-NULL reference or because they unconditionally dereference "cred". So, let's simplify the code a bit and remove NULL checks. No functional change intended. Reviewed by: glebius MFC after: 1 week Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37025
|
#
19acc506 |
|
31-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: retire suppresion of randomization of ephemeral ports The suppresion was added in 5f311da2ccb with no explanation in the commit message of the exact problem that was fixed. In the BSDCan 2006 talk [1], slides 12 to 14, we can find that it seems that there was some problem with the TIME_WAIT state not properly being handled on the remote side (also FreeBSD!), and this switching off the suppression had hidden the problem. The rationale of the change was that other stacks may also be buggy wrt the TIME_WAIT. I did not find the actual problem in TIME_WAIT that the suppression has hidden, neither a commit that would fix it. However, since that time we started to handle SYNs with RFC5961 instead of RFC793, see 3220a2121cc. We also now have the tcp-testsuite [2], that has full coverage of all possible scenarios of receiving SYN in TIME_WAIT. This effectively reverts 5f311da2ccb6c216b79049172be840af4778129a and 6ee79c59d2c081f837a11cc78908be54a6dbe09d. [1] https://www.bsdcan.org/2006/papers/ImprovingTCPIP.pdf [2] https://github.com/freebsd-net/tcp-testsuite Reviewed by: rscheff Discussed with: rscheff, rrs, tuexen Differential revision: https://reviews.freebsd.org/D37042
|
#
24cf7a8d |
|
19-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: provide pcbinfo pointer argument to inp_apply_all() Allows to clear inpcb layer of TCP knowledge.
|
#
b6a816f1 |
|
19-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: garbage collect so_sototcpcb() It had very little use and required inpcb layer to know tcpcb.
|
#
3ba34b07 |
|
13-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: provide in_pcbremhash() to reduce copy-paste
|
#
53af6903 |
|
06-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove INP_TIMEWAIT flag Mechanically cleanup INP_TIMEWAIT from the kernel sources. After 0d7445193ab, this commit shall not cause any functional changes. Note: this flag was very often checked together with INP_DROPPED. If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we will be able to remove most of this checks and turn them to assertions. Some of them can be turned into assertions right now, but that should be carefully done on a case by case basis. Differential revision: https://reviews.freebsd.org/D36400
|
#
0d744519 |
|
06-Oct-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove tcptw, the compressed timewait state structure The memory savings the tcptw brought back in 2003 (see 340c35de6a2) no longer justify the complexity required to maintain it. For longer explanation please check out the email [1]. Surpisingly through almost 20 years the TCP stack functionality of handling the TIME_WAIT state with a normal tcpcb did not bitrot. The existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state, which is confirmed by the packetdrill tcp-testsuite [2]. This change just removes tcptw and leaves INP_TIMEWAIT. The flag will be removed in a separate commit. This makes it easier to review and possibly debug the changes. [1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html [2] https://github.com/freebsd-net/tcp-testsuite Differential revision: https://reviews.freebsd.org/D36398
|
#
c7a62c92 |
|
10-Aug-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: gather v4/v6 handling code into in_pcballoc() from protocols Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D36062
|
#
637f317c |
|
29-Jul-2022 |
Mike Karels <karels@FreeBSD.org> |
IPv6: fix problem with duplicate port assignment with v4-mapped addrs In in_pcb_lport_dest(), if an IPv6 socket does not match any other IPv6 socket using in6_pcblookup_local(), and if the socket can also connect to IPv4 (the INP_IPV4 vflag is set), check for IPv4 matches as well. Otherwise, we can allocate a port that is used by an IPv4 socket (possibly one created from IPv6 via the same procedure), and then connect() can fail with EADDRINUSE, when it could have succeeded if the bound port was not in use. PR: 265064 Submitted by: firk at cantconnect.ru (with modifications) Reviewed by: bz, melifaro Differential Revision: https://reviews.freebsd.org/D36012
|
#
fe5324ac |
|
13-Apr-2022 |
John Baldwin <jhb@FreeBSD.org> |
in_pcballoc: error is only used for IPSEC or MAC.
|
#
bab34d63 |
|
12-Apr-2022 |
John Baldwin <jhb@FreeBSD.org> |
in_pcboutput_txrtlmt: Remove unused variable.
|
#
942e8cab |
|
02-Apr-2022 |
Gordon Bergling <gbe@FreeBSD.org> |
netinet: Fix a typo in a source code comment - s/exisitng/existing/ MFC after: 3 days
|
#
a0aeb1ce |
|
09-Feb-2022 |
Michael Tuexen <tuexen@FreeBSD.org> |
in_pcb.c: fix compilation of an IPv4 only configuration While there, remove a duplicate inclusion of sysctl.h. Reported by: Gary Jennejohn Fixes: a35bdd4489b9 - main - tcp: add sysctl interface for setting socket options Sponsored by: Netflix, Inc.
|
#
a35bdd44 |
|
08-Feb-2022 |
Michael Tuexen <tuexen@FreeBSD.org> |
tcp: add sysctl interface for setting socket options This interface allows to set a socket option on a TCP endpoint, which is specified by its inp_gencnt. This interface will be used in an upcoming command line tool tcpsso. Reviewed by: glebius, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D34138
|
#
fec8a8c7 |
|
03-Jan-2022 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use global UMA zones for protocols Provide structure inpcbstorage, that holds zones and lock names for a protocol. Initialize it with global protocol init using macro INPCBSTORAGE_DEFINE(). Then, at VNET protocol init supply it as the main argument to the in_pcbinfo_init(). Each VNET pcbinfo uses its private hash, but they all use same zone to allocate and SMR section to synchronize. Note: there is kern.ipc.maxsockets sysctl, which controls UMA limit on the socket zone, which was always global. Historically same maxsockets value is applied also to every PCB zone. Important fact: you can't create a pcb without a socket! A pcb may outlive its socket, however. Given that there are multiple protocols, and only one socket zone, the per pcb zone limits seem to have little value. Under very special conditions it may trigger a little bit earlier than socket zone limit, but in most setups the socket zone limit will be triggered earlier. When VIMAGE was added to the kernel PCB zones became per-VNET. This magnified existing disbalance further: now we have multiple pcb zones in multiple vnets limited to maxsockets, but every pcb requires a socket allocated from the global zone also limited by maxsockets. IMHO, this per pcb zone limit doesn't bring any value, so this patch drops it. If anybody explains value of this limit, it can be restored very easy - just 2 lines change to in_pcbstorage_init(). Differential revision: https://reviews.freebsd.org/D33542
|
#
430df2ab |
|
01-Jan-2022 |
Michael Tuexen <tuexen@FreeBSD.org> |
in_pcb: improve inp_next() If there is no inp to check, exit the loop iterating through them. Reported by: syzbot+403406a9cbf082b36ea4@syzkaller.appspotmail.com Reviewed by: glebius Sponsored by: Netflix, Inc.
|
#
a0577692 |
|
26-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
in_pcb: use jenkins hash over the entire IPv6 (or IPv4) address The intent is to provide more entropy than can be provided by just the 32-bits of the IPv6 address which overlaps with 6to4 tunnels. This is needed to mitigate potential algorithmic complexity attacks from attackers who can control large numbers of IPv6 addresses. Together with: gallatin Reviewed by: dwmalone, rscheff Differential revision: https://reviews.freebsd.org/D33254
|
#
a370832b |
|
26-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp: remove delayed drop KPI No longer needed after tcp_output() can ask caller to drop. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33371
|
#
d8b45c8e |
|
16-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: don't leak the port zone in in_pcbinfo_destroy()
|
#
185e659c |
|
14-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: use locked variant of prison_check_ip*() The pcb lookup always happens in the network epoch and in SMR section. We can't block on a mutex due to the latter. Right now this patch opens up a race. But soon that will be addressed by D33339. Reviewed by: markj, jamie Differential revision: https://reviews.freebsd.org/D33340 Fixes: de2d47842e8
|
#
eb93b99d |
|
05-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
in_pcb: delay crfree() down into UMA dtor inpcb lookups, which check inp_cred, work with pcbs that potentially went through in_pcbfree(). So inp_cred should stay valid until SMR guarantees its invisibility to lookups. While here, put the whole inpcb destruction sequence of in_pcbfree(), inpcb_dtor() and inpcb_fini() sequentially. Submitted by: markj Differential revision: https://reviews.freebsd.org/D33273
|
#
4c018b5a |
|
03-Dec-2021 |
Peter Lei <peterlei@netflix.com> |
in_pcb: limit the effect of wraparound in TCP random port allocation check The check to see if TCP port allocation should change from random to sequential port allocation mode may incorrectly cause a false positive due to negative wraparound. Example: V_ipport_tcpallocs = 2147483585 (0x7fffffc1) V_ipport_tcplastcount = 2147483553 (0x7fffffa1) V_ipport_randomcps = 100 The original code would compare (2147483585 <= -2147483643) and thus incorrectly move to sequential allocation mode. Compute the delta first before comparing against the desired limit to limit the wraparound effect (since tcplastcount is always a snapshot of a previous tcpallocs).
|
#
13e3f3349 |
|
03-Dec-2021 |
Peter Lei <peterlei@netflix.com> |
in_pcb: fix TCP local ephemeral port accounting Fix logic error causing UDP(-Lite) local ephemeral port bindings to count against the TCP allocation counter, potentially causing TCP to go from random to sequential port allocation mode prematurely.
|
#
db0ac6de |
|
02-Dec-2021 |
Cy Schubert <cy@FreeBSD.org> |
Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit 266f97b5e9a7958e365e78288616a459b40d924a, reversing changes made to a10253cffea84c0c980a36ba6776b00ed96c3e3b. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.
|
#
f971e791 |
|
02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_hpts: rename input queue to drop queue and trim dead code The HPTS input queue is in reality used only for "delayed drops". When a TCP stack decides to drop a connection on the output path it can't do that due to locking protocol between main tcp_output() and stacks. So, rack/bbr utilize HPTS to drop the connection in a different context. In the past the queue could also process input packets in context of HPTS thread, but now no stack uses this, so remove this functionality. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33025
|
#
de2d4784 |
|
02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
SMR protection for inpcbs With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector. Fairly new feature of uma(9) - Safe Memory Reclamation allows to safely free memory in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts stricter requirement on the access to the protected memory, needing the critical(9) section to access it. Details: - The database is already build on CK lists, thanks to epoch(9). - For write access nothing is changed. - For a lookup in the database SMR section is now required. Once the desired inpcb is found we need to transition from SMR section to r/w lock on the inpcb itself, with a check that inpcb isn't yet freed. This requires some compexity, since SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock(). - For a inpcb list traversal (a pcblist sysctl, or broadcast notification) also a new KPI is provided, that hides internals of the database - inp_next(struct inp_iterator *). Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33022
|
#
565655f4 |
|
02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
inpcb: reduce some aliased functions after removal of PCBGROUP. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33021
|
#
93c67567 |
|
02-Dec-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Remove "options PCBGROUP" With upcoming changes to the inpcb synchronisation it is going to be broken. Even its current status after the move of PCB synchronization to the network epoch is very questionable. This experimental feature was sponsored by Juniper but ended never to be used in Juniper and doesn't exist in their source tree [sjg@, stevek@, jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix [gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@]. I'm up to resurrecting it back if there is any interest from anybody. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33020
|
#
27c4abc7 |
|
30-Nov-2021 |
Gordon Bergling <gbe@FreeBSD.org> |
inet(3): Fix two typos in sysctl descriptions - s/sequental/sequential/ MFC after: 3 days
|
#
5c534010 |
|
20-Oct-2021 |
Roy Marples <roy@marples.name> |
net: Allow binding of unspecified address without address existance Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind did the same for empty V_in6_ifaddrhead (no IPv6 addresses). An equivalent test has existed since 4.4-Lite. It was presumably done to avoid extra work (assuming the address isn't going to be found later). In normal system operation *_ifaddrhead will not be empty: they will at least have the loopback address(es). In practice no work will be avoided. Further, this case caused net/dhcpd to fail when run early in boot before assignment of any addresses. It should be possible to bind the unspecified address even if no addresses have been configured yet, so just remove the tests. The now-removed "XXX broken" comments were added in 59562606b9d3, which converted the ifaddr lists to TAILQs. As far as I (emaste) can tell the brokenness is the issue described above, not some aspect of the TAILQ conversion. PR: 253166 Reviewed by: ae, bz, donner, emaste, glebius MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D32563
|
#
0f617ae4 |
|
18-Oct-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c.
|
#
744a64bd |
|
27-Apr-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
in_pcb: garbage collect in_pcbrele()
|
#
5a78df20 |
|
27-Apr-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
in_pcb: garbage collect unused structure in_pcblist
|
#
2144431c |
|
08-Oct-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Remove in_ifaddr_lock acquisiton to access in_ifaddrhead. An IPv4 address is embedded into an ifaddr which is freed via epoch. And the in_ifaddrhead is already a CK list. Use the network epoch to protect against use after free. Next step would be to CK-ify the in_addr hash and get rid of the... Reviewed by: melifaro Differential Revision: https://reviews.freebsd.org/D32434
|
#
c782ea8b |
|
14-Sep-2021 |
John Baldwin <jhb@FreeBSD.org> |
Add a switch structure for send tags. Move the type and function pointers for operations on existing send tags (modify, query, next, free) out of 'struct ifnet' and into a new 'struct if_snd_tag_sw'. A pointer to this structure is added to the generic part of send tags and is initialized by m_snd_tag_init() (which now accepts a switch structure as a new argument in place of the type). Previously, device driver ifnet methods switched on the type to call type-specific functions. Now, those type-specific functions are saved in the switch structure and invoked directly. In addition, this more gracefully permits multiple implementations of the same tag within a driver. In particular, NIC TLS for future Chelsio adapters will use a different implementation than the existing NIC TLS support for T6 adapters. Reviewed by: gallatin, hselasky, kib (older version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D31572
|
#
f161d294 |
|
02-May-2021 |
Mark Johnston <markj@FreeBSD.org> |
Add missing sockaddr length and family validation to various protocols Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate. Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519
|
#
1db08fbe |
|
16-Apr-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_input: always request read-locking of PCB for any pure SYN segment. This is further rework of 08d9c920275. Now we carry the knowledge of lock type all the way through tcp_input() and also into tcp_twcheck(). Ideally the rlocking for pure SYNs should propagate all the way into the alternative TCP stacks, but not yet today. This should close a race when socket is bind(2)-ed but not yet listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered recently by pho@.
|
#
08d9c920 |
|
18-Mar-2021 |
Gleb Smirnoff <glebius@FreeBSD.org> |
tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets When packet is a SYN packet, we don't need to modify any existing PCB. Normally SYN arrives on a listening socket, we either create a syncache entry or generate syncookie, but we don't modify anything with the listening socket or associated PCB. Thus create a new PCB lookup mode - rlock if listening. This removes the primary contention point under SYN flood - the listening socket PCB. Sidenote: when SYN arrives on a synchronized connection, we still don't need write access to PCB to send a challenge ACK or just to drop. There is only one exclusion - tcptw recycling. However, existing entanglement of tcp_input + stacks doesn't allow to make this change small. Consider this patch as first approach to the problem. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D29576
|
#
1a714ff2 |
|
26-Jan-2021 |
Randall Stewart <rrs@FreeBSD.org> |
This pulls over all the changes that are in the netflix tree that fix the ratelimit code. There were several bugs in tcp_ratelimit itself and we needed further work to support the multiple tag format coming for the joint TLS and Ratelimit dances. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28357
|
#
093e7231 |
|
26-Jan-2021 |
Hans Petter Selasky <hselasky@FreeBSD.org> |
Add missing decrement of active ratelimit connections. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking
|
#
85d8d30f |
|
26-Jan-2021 |
Hans Petter Selasky <hselasky@FreeBSD.org> |
Don't allow allocating a new send tag on an INP which is being torn down. This fixes a potential send tag leak. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking
|
#
a034518a |
|
19-Dec-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
|
#
36e0a362 |
|
29-Oct-2020 |
John Baldwin <jhb@FreeBSD.org> |
Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc(). This gives a more uniform API for send tag life cycle management. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27000
|
#
98d7a8d9 |
|
29-Oct-2020 |
John Baldwin <jhb@FreeBSD.org> |
Call m_snd_tag_rele() to free send tags. Send tags are refcounted and if_snd_tag_free() is called by m_snd_tag_rele() when the last reference is dropped on a send tag. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26995
|
#
aebfdc1f |
|
29-Oct-2020 |
John Baldwin <jhb@FreeBSD.org> |
Store the new send tag in the right place. r350501 added the 'st' parameter, but did not pass it down to if_snd_tag_alloc(). Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26997
|
#
0c325f53 |
|
18-Oct-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Implement flowid calculation for outbound connections to balance connections over multiple paths. Multipath routing relies on mbuf flowid data for both transit and outbound traffic. Current code fills mbuf flowid from inp_flowid for connection-oriented sockets. However, inp_flowid is currently not calculated for outbound connections. This change creates simple hashing functions and starts calculating hashes for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the system. Reviewed by: glebius (previous version),ae Differential Revision: https://reviews.freebsd.org/D26523
|
#
662c1305 |
|
01-Sep-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
net: clean up empty lines in .c and .h files
|
#
2fdbcbea |
|
18-May-2020 |
Mike Karels <karels@FreeBSD.org> |
Fix NULL-pointer bug from r361228. Note that in_pcb_lport and in_pcb_lport_dest can be called with a NULL local address for IPv6 sockets; handle it. Found by syzkaller. Reported by: cem MFC after: 1 month
|
#
25102351 |
|
18-May-2020 |
Mike Karels <karels@FreeBSD.org> |
Allow TCP to reuse local port with different destinations Previously, tcp_connect() would bind a local port before connecting, forcing the local port to be unique across all outgoing TCP connections for the address family. Instead, choose a local port after selecting the destination and the local address, requiring only that the tuple is unique and does not match a wildcard binding. Reviewed by: tuexen (rscheff, rrs previous version) MFC after: 1 month Sponsored by: Forcepoint LLC Differential Revision: https://reviews.freebsd.org/D24781
|
#
9ac7c6cf |
|
14-Apr-2020 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to the new routing KPI. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24245
|
#
98085bae |
|
09-Mar-2020 |
Andrew Gallatin <gallatin@FreeBSD.org> |
make lacp's use_numa hashing aware of send tags When I did the use_numa support, I missed the fact that there is a separate hash function for send tag nic selection. So when use_numa is enabled, ktls offload does not work properly, as it does not reliably allocate a send tag on the proper egress nic since different egress nics are selected for send-tag allocation and packet transmit. To fix this, this change: - refectors lacp_select_tx_port_by_hash() and lacp_select_tx_port() to make lacp_select_tx_port_by_hash() always called by lacp_select_tx_port() - pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash() - adds a numa_domain field to if_snd_tag_alloc_params - plumbs the numa domain into places where we allocate send tags In testing with NIC TLS setup on a NUMA machine, I see thousands of output errors before the change when enabling kern.ipc.tls.ifnet.permitted=1. After the change, I see no errors, and I see the NIC sysctl counters showing active TLS offload sessions. Reviewed by: rrs, hselasky, jhb Sponsored by: Netflix
|
#
7029da5c |
|
26-Feb-2020 |
Pawel Biernacki <kaktus@FreeBSD.org> |
Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
|
#
481be5de |
|
12-Feb-2020 |
Randall Stewart <rrs@FreeBSD.org> |
White space cleanup -- remove trailing tab's or spaces from any line. Sponsored by: Netflix Inc.
|
#
c1604fe4 |
|
21-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make in_pcbladdr() require network epoch entered by its callers. Together with this widen network epoch coverage up to tcp_connect() and udp_connect(). Revisions from r356974 and up to this revision cover D23187. Differential Revision: https://reviews.freebsd.org/D23187
|
#
5c722e2a |
|
21-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
The network epoch changes in the TCP stack combined with old r286227, actually make removal of a PCB not needing ipi_lock in any form. The ipi_list_lock is sufficient.
|
#
6a2954a1 |
|
21-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Relax locking requirements for in_pcballoc(). All pcbinfo fields modified by this function are protected by the PCB list lock that is acquired inside the function. This could have been done even before epoch changes, after r286227.
|
#
2a4bd982 |
|
14-Jan-2020 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Introduce NET_EPOCH_CALL() macro and use it everywhere where we free data based on the network epoch. The macro reverses the argument order of epoch_call(9) - first function, then its argument. NFC
|
#
fe1274ee |
|
12-Jan-2020 |
Michael Tuexen <tuexen@FreeBSD.org> |
Fix race when accepting TCP connections. When expanding a SYN-cache entry to a socket/inp a two step approach was taken: 1) The local address was filled in, then the inp was added to the hash table. 2) The remote address was filled in and the inp was relocated in the hash table. Before the epoch changes, a write lock was held when this happens and the code looking up entries was holding a corresponding read lock. Since the read lock is gone away after the introduction of the epochs, the half populated inp was found during lookup. This resulted in processing TCP segments in the context of the wrong TCP connection. This patch changes the above procedure in a way that the inp is fully populated before inserted into the hash table. Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@ mailing list and for testing the patch! Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22971
|
#
d797164a |
|
07-Nov-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Since r353292 on input path we are always in network epoch, when we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197
|
#
1a496125 |
|
06-Nov-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). Remove few outdated comments and extraneous assertions. No functional change here.
|
#
903c4ee6 |
|
02-Aug-2019 |
Xin LI <delphij@FreeBSD.org> |
Fix !INET build.
|
#
20abea66 |
|
01-Aug-2019 |
Randall Stewart <rrs@FreeBSD.org> |
This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit api in this commit. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953
|
#
59854ecf |
|
25-Jun-2019 |
Hans Petter Selasky <hselasky@FreeBSD.org> |
Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m | grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies
|
#
fb3bc596 |
|
24-May-2019 |
John Baldwin <jhb@FreeBSD.org> |
Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117
|
#
50575ce1 |
|
25-Apr-2019 |
Andrew Gallatin <gallatin@FreeBSD.org> |
Track TCP connection's NUMA domain in the inpcb Drivers can now pass up numa domain information via the mbuf numa domain field. This information is then used by TCP syncache_socket() to associate that information with the inpcb. The domain information is then fed back into transmitted mbufs in ip{6}_output(). This mechanism is nearly identical to what is done to track RSS hash values in the inp_flowid. Follow on changes will use this information for lacp egress port selection, binding TCP pacers to the appropriate NUMA domain, etc. Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20028
|
#
43b65e3c |
|
29-Mar-2019 |
John Baldwin <jhb@FreeBSD.org> |
Don't check the inp socket pointer in in_pcboutput_eagain. Reviewed by: hps (by saying it was ok to be removed) MFC after: 1 month Sponsored by: Netflix
|
#
c7ee62fc |
|
13-Feb-2019 |
Andrey V. Elsukov <ae@FreeBSD.org> |
In r335015 PCB destroing was made deferred using epoch_call(). But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables, and thus it needs VNET context, that is missing during gtaskqueue executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred(). PR: 235684 MFC after: 1 week
|
#
a68cc388 |
|
08-Jan-2019 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros a quite unsafe, e.g. will produce a buggy code if same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin
|
#
cc426dd3 |
|
11-Dec-2018 |
Mateusz Guzik <mjg@FreeBSD.org> |
Remove unused argument to priv_check_cred. Patch mostly generated with cocinnelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation
|
#
9d2877fc |
|
05-Dec-2018 |
Mark Johnston <markj@FreeBSD.org> |
Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains. Memory beyond that limit was previously unused, wasting roughly 1MB per 8GB of RAM. Also retire INP_PCBLBGROUP_PORTHASH, which was identical to INP_PCBPORTHASH. Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17803
|
#
79db6fe7 |
|
22-Nov-2018 |
Mark Johnston <markj@FreeBSD.org> |
Plug some networking sysctl leaks. Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301
|
#
d9ff5789 |
|
01-Nov-2018 |
Mark Johnston <markj@FreeBSD.org> |
Remove redundant checks for a NULL lbgroup table. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17108
|
#
79ee680b |
|
01-Nov-2018 |
Mark Johnston <markj@FreeBSD.org> |
Improve style in in_pcbinslbgrouphash() and related subroutines. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17107
|
#
54af3d0d |
|
10-Sep-2018 |
Mark Johnston <markj@FreeBSD.org> |
Fix synchronization of LB group access. Lookups are protected by an epoch section, so the LB group linkage must be a CK_LIST rather than a plain LIST. Furthermore, we were not deferring LB group frees, so in_pcbremlbgrouphash() could race with readers and cause a use-after-free. Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com> Tested by: gallatin Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17031
|
#
a7026c7f |
|
07-Sep-2018 |
Mark Johnston <markj@FreeBSD.org> |
Use ratecheck(9) in in_pcbinslbgrouphash(). Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com> Approved by: re (kib) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17065
|
#
8be02ee4 |
|
05-Sep-2018 |
Mark Johnston <markj@FreeBSD.org> |
Fix style bugs in in_pcblookup_lbgroup(). No functional change intended. Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com> Approved by: re (rgrimes) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17030
|
#
73ad0b6a |
|
03-Sep-2018 |
Mark Johnston <markj@FreeBSD.org> |
Use the correct malloc type in in_pcblbgroup_free(). Approved by: re (kib) Sponsored by: The FreeBSD Foundation
|
#
f9be0386 |
|
15-Aug-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Fix in6_multi double free This is actually several different bugs: - The code is not designed to handle inpcb deletion after interface deletion - add reference for inpcb membership - The multicast address has to be removed from interface lists when the refcount goes to zero OR when the interface goes away - decouple list disconnect from refcount (v6 only for now) - ifmultiaddr can exist past being on interface lists - add flag for tracking whether or not it's enqueued - deferring freeing moptions makes the incpb cleanup code simpler but opens the door wider still to races - call inp_gcmoptions synchronously after dropping the the inpcb lock Fundamentally multicast needs a rewrite - but keep applying band-aids for now. Tested by: kp Reported by: novel, kp, lwhsu
|
#
5f901c92 |
|
24-Jul-2018 |
Andrew Turner <andrew@FreeBSD.org> |
Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147
|
#
3a20f06a |
|
10-Jul-2018 |
Brooks Davis <brooks@FreeBSD.org> |
Use uintptr_t alone when assigning to kvaddr_t variables. Suggested by: jhb
|
#
7524b4c1 |
|
06-Jul-2018 |
Brooks Davis <brooks@FreeBSD.org> |
Correct breakage on 32-bit platforms from r335979.
|
#
f38b68ae |
|
05-Jul-2018 |
Brooks Davis <brooks@FreeBSD.org> |
Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique idenitifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386
|
#
6573d758 |
|
03-Jul-2018 |
Matt Macy <mmacy@FreeBSD.org> |
epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
|
#
45fc0718 |
|
24-Jun-2018 |
Sean Bruno <sbruno@FreeBSD.org> |
Reap unused variable and assignment that had no effect. Noted by cross compiling with gcc on mips. Reviewed by: mmacy
|
#
3d348772 |
|
21-Jun-2018 |
Matt Macy <mmacy@FreeBSD.org> |
in_pcblookup_hash: validate inp before return Post r335356 it is possible to have an inpcb on the hash lists that is partially torn down. Validate before using. Also as a side effect of this change the lock ordering issue between hash lock and inpcb no longer exists allowing some simplification. Reported by: pho@
|
#
feeef850 |
|
13-Jun-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Fix PCBGROUPS build post CK conversion of pcbinfo
|
#
483305b9 |
|
12-Jun-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Handle INP_FREED when looking up an inpcb When hash table lookups are not serialized with in_pcbfree it will be possible for callers to find an inpcb that has been marked free. We need to check for this and return NULL.
|
#
700e893c |
|
12-Jun-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Defer inpcbport free in in_pcbremlists as well
|
#
f09ee4fc |
|
12-Jun-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Defer inpcbport free until after a grace period has elapsed This is a dependency for inpcbinfo rlock conversion to epoch
|
#
b872626d |
|
12-Jun-2018 |
Matt Macy <mmacy@FreeBSD.org> |
mechanical CK macro conversion of inpcbinfo lists This is a dependency for converting the inpcbinfo hash and info rlocks to epoch.
|
#
addf2b20 |
|
12-Jun-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Defer inpcb deletion until after a grace period has elapsed Deferring the actual free of the inpcb until after a grace period has elapsed will allow us to convert the inpcbinfo info and hash read locks to epoch. Reviewed by: gallatin, jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15510
|
#
1a43cff9 |
|
06-Jun-2018 |
Sean Bruno <sbruno@FreeBSD.org> |
Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003
|
#
62d733a1 |
|
27-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
in_pcbladdr: remove debug code that snuck in with ifa epoch conversion r334118
|
#
4f6c66cc |
|
23-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409
|
#
f42a83f2 |
|
21-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
inpcb: revert deferred inpcb free pending further review
|
#
1a3d880c |
|
20-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
in(s)_moptions: free before tearing down inpcb
|
#
47d2a585 |
|
19-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
inpcb: defer destruction of inpcb until after a grace period has elapsed in_pcbfree will remove the incpb from the list and release the rtentry while the vnet is set, but the actual destruction will be deferred until any threads in a (not yet used) epoch section, no longer potentially have references.
|
#
ddece765 |
|
19-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
in_pcb: add helper for deferring inpcb rele calls from list functions
|
#
cb6bb230 |
|
19-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
ip(6)_freemoptions: defer imo destruction to epoch callback task Avoid the ugly unlock / lock of the inpcbinfo where we need to figure out what kind of lock we hold by simply deferring the operation to another context. (Also a small dependency for converting the pcbinfo read lock to epoch)
|
#
d7c5a620 |
|
18-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366
|
#
28c00100 |
|
05-May-2018 |
Matt Macy <mmacy@FreeBSD.org> |
Currently in_pcbfree will unconditionally wunlock the pcbinfo lock to avoid a LOR on the multicast list lock in the freemoptions routines. As it turns out, tcp_usr_detach can acquire the tcbinfo lock readonly. Trying to wunlock the pcbinfo lock in that context has caused a number of reported crashes. This change unclutters in_pcbfree and moves the handling of wunlock vs runlock of pcbinfo to the freemoptions routine. Reported by: mjg@, bde@, o.hartmann at walstatt.org Approved by: sbruno
|
#
f3e1324b |
|
02-May-2018 |
Stephen Hurd <shurd@FreeBSD.org> |
Separate list manipulation locking from state change in multicast Multicast incorrectly calls in to drivers with a mutex held causing drivers to have to go through all manner of contortions to use a non sleepable lock. Serialize multicast updates instead. Submitted by: mmacy <mmacy@mattmacy.io> Reviewed by: shurd, sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14969
|
#
7875017c |
|
24-Apr-2018 |
Sean Bruno <sbruno@FreeBSD.org> |
Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks
|
#
7b7796ee |
|
23-Apr-2018 |
Sean Bruno <sbruno@FreeBSD.org> |
Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003
|
#
3ee9c3c4 |
|
19-Apr-2018 |
Randall Stewart <rrs@FreeBSD.org> |
This commit brings in the TCP high precision timer system (tcp_hpts). It is the forerunner/foundational work of bringing in both Rack and BBR which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must assure that the base component is compile in the kernel in which they are loaded. MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020
|
#
45a48d07 |
|
06-Apr-2018 |
Jonathan T. Looney <jtl@FreeBSD.org> |
Check that in_pcbfree() is only called once for each PCB. If that assumption is violated, "bad things" could follow. I believe such an assert would have detected some of the problems jch@ was chasing in PR 203175 (see r307551). We also use it in our internal TCP development efforts. And, in case a bug does slip through to released code, this change silently ignores subsequent calls to in_pcbfree(). Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D14990
|
#
b9a9e8e9 |
|
29-Mar-2018 |
Navdeep Parhar <np@FreeBSD.org> |
Fix RSS build (broken in r331309). Sponsored by: Chelsio Communications
|
#
7fb2986f |
|
21-Mar-2018 |
Jonathan T. Looney <jtl@FreeBSD.org> |
If the INP lock is uncontested, avoid taking a reference and jumping through the lock-switching hoops. A few of the INP lookup operations that lock INPs after the lookup do so using this mechanism (to maintain lock ordering): 1. Lock lookup structure. 2. Find INP. 3. Acquire reference on INP. 4. Drop lock on lookup structure. 5. Acquire INP lock. 6. Drop reference on INP. This change provides a slightly shorter path for cases where the INP lock is uncontested: 1. Lock lookup structure. 2. Find INP. 3. Try to acquire the INP lock. 4. If successful, drop lock on lookup structure. Of course, if the INP lock is contested, the functions will need to revert to the previous way of switching locks safely. This saves a few atomic operations when the INP lock is uncontested. Discussed with: gallatin, rrs, rwatson MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D12911
|
#
fc21c53f |
|
22-Jan-2018 |
Ryan Stone <rstone@FreeBSD.org> |
Reduce code duplication for inpcb route caching Add a new macro to clear both the L3 and L2 route caches, to hopefully prevent future instances where only the L3 cache was cleared when both should have been. MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13989 Reviewed by: karels
|
#
51369649 |
|
20-Nov-2017 |
Pedro F. Giffuni <pfg@FreeBSD.org> |
sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
|
#
95ed5015 |
|
06-Sep-2017 |
Hans Petter Selasky <hselasky@FreeBSD.org> |
Add support for generic backpressure indicator for ratelimited transmit queues aswell as non-ratelimited ones. Add the required structure bits in order to support a backpressure indication with ratelimited connections aswell as non-ratelimited ones. The backpressure indicator is a value between zero and 65535 inclusivly, indicating if the destination transmit queue is empty or full respectivly. Applications can use this value as a decision point for when to stop transmitting data to avoid endless ENOBUFS error codes upon transmitting an mbuf. This indicator is also useful to reduce the latency for ratelimited queues. Reviewed by: gallatin, kib, gnn Differential Revision: https://reviews.freebsd.org/D11518 Sponsored by: Mellanox Technologies
|
#
6a6cefac |
|
24-May-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
o Rearrange struct inpcb fields to optimize the TCP output code path considering cache line hits and misses. Put the lock and hash list glue into the first cache line, put inp_refcount inp_flags inp_socket into the second cache line. o On allocation zero out entire structure except the lock and list entries, including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were introduced, they were added below inp_zero_size, resulting on not being cleared after free/alloc. This definitely was a source of bugs with route caching. Could be that r315956 has just fixed one of them. The inp_gencnt is reinitialized on every alloc, so it is safe to clear it. This has been proved to improve TCP performance at Netflix. Obtained from: rrs Differential Revision: D10686
|
#
cc487c16 |
|
15-May-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Reduce in_pcbinfo_init() by two params. No users supply any flags to this function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away. The zone_fini parameter also goes away. Previously no protocols (except divert) supplied zone_fini function, so inpcb locks were leaked with slabs. This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now this is a leak. Fix that by suppling inpcb_fini() function as fini method for all inpcb zones.
|
#
8c1960d5 |
|
25-Mar-2017 |
Mike Karels <karels@FreeBSD.org> |
Fix reference count leak with L2 caching. ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache entry because they used their own routes on the stack, not in_pcb routes. The original model for route caching was callers that provided a route structure to ip{,6}input() would keep the route, and this model was used for L2 caching as well. Instead, change L2 caching to be done by default only when using a route structure in the in_pcb; the pcb deallocation code frees L2 as well as L3 cacches. A separate change will add route caching to TCP/IPv6. Another suggestion was to have the transport protocols indicate willingness to use L2 caching, but this approach keeps the changes in the network level Reviewed by: ae gnn MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D10059
|
#
cc65eb4e |
|
21-Mar-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Hide struct inpcb, struct tcpcb from the userland. This is a painful change, but it is needed. On the one hand, we avoid modifying them, and this slows down some ideas, on the other hand we still eventually modify them and tools like netstat(1) never work on next version of FreeBSD. We maintain a ton of spares in them, and we already got some ifdef hell at the end of tcpcb. Details: - Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO. - Make struct xinpcb, struct xtcpcb pure API structures, not including kernel structures inpcb and tcpcb inside. Export into these structures the fields from inpcb and tcpcb that are known to be used, and put there a ton of spare space. - Make kernel and userland utilities compilable after these changes. - Bump __FreeBSD_version. Reviewed by: rrs, gnn Differential Revision: D10018
|
#
c75e2666 |
|
08-Mar-2017 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Make inp_lock_assert() depend on INVARIANT_SUPPORT, not INVARIANTS. This will make INVARIANT-enabled modules, that use this function to load successfully on a kernel that has INVARIANT_SUPPORT only.
|
#
dce33a45 |
|
05-Mar-2017 |
Ermal Luçi <eri@FreeBSD.org> |
The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Reviewed by: adrian, aw Approved by: ae (mentor) Sponsored by: rsync.net Differential Revision: D9235
|
#
fbbd9655 |
|
28-Feb-2017 |
Warner Losh <imp@FreeBSD.org> |
Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
|
#
7a60a910 |
|
14-Feb-2017 |
Andrey V. Elsukov <ae@FreeBSD.org> |
Add missing check to fix the build with IPSEC_SUPPORT and without MAC. Submitted by: netchild
|
#
c10c5b1e |
|
11-Feb-2017 |
Ermal Luçi <eri@FreeBSD.org> |
Committed without approval from mentor. Reported by: gnn
|
#
4616026f |
|
09-Feb-2017 |
Ermal Luçi <eri@FreeBSD.org> |
Revert r313527 Heh svn is not git
|
#
c0fadfdb |
|
09-Feb-2017 |
Ermal Luçi <eri@FreeBSD.org> |
Correct missed variable name. Reported-by: ohartmann@walstatt.org
|
#
ed55edce |
|
09-Feb-2017 |
Ermal Luçi <eri@FreeBSD.org> |
The patch provides the same socket option as Linux IP_ORIGDSTADDR. Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD. The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application. This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets. Sponsored-by: rsync.net Differential Revision: D9235 Reviewed-by: adrian
|
#
fcf59617 |
|
06-Feb-2017 |
Andrey V. Elsukov <ae@FreeBSD.org> |
Merge projects/ipsec into head/. Small summary ------------- o Almost all IPsec releated code was moved into sys/netipsec. o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel option IPSEC_SUPPORT added. It enables support for loading and unloading of ipsec.ko and tcpmd5.ko kernel modules. o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type support was removed. Added TCP/UDP checksum handling for inbound packets that were decapsulated by transport mode SAs. setkey(8) modified to show run-time NAT-T configuration of SA. o New network pseudo interface if_ipsec(4) added. For now it is build as part of ipsec.ko module (or with IPSEC kernel). It implements IPsec virtual tunnels to create route-based VPNs. o The network stack now invokes IPsec functions using special methods. The only one header file <netipsec/ipsec_support.h> should be included to declare all the needed things to work with IPsec. o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed. Now these protocols are handled directly via IPsec methods. o TCP_SIGNATURE support was reworked to be more close to RFC. o PF_KEY SADB was reworked: - now all security associations stored in the single SPI namespace, and all SAs MUST have unique SPI. - several hash tables added to speed up lookups in SADB. - SADB now uses rmlock to protect access, and concurrent threads can do SA lookups in the same time. - many PF_KEY message handlers were reworked to reflect changes in SADB. - SADB_UPDATE message was extended to support new PF_KEY headers: SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They can be used by IKE daemon to change SA addresses. o ipsecrequest and secpolicy structures were cardinally changed to avoid locking protection for ipsecrequest. Now we support only limited number (4) of bundled SAs, but they are supported for both INET and INET6. o INPCB security policy cache was introduced. Each PCB now caches used security policies to avoid SP lookup for each packet. o For inbound security policies added the mode, when the kernel does check for full history of applied IPsec transforms. o References counting rules for security policies and security associations were changed. The proper SA locking added into xform code. o xform code was also changed. Now it is possible to unregister xforms. tdb_xxx structures were changed and renamed to reflect changes in SADB/SPDB, and changed rules for locking and refcounting. Reviewed by: gnn, wblock Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D9352
|
#
f3e7afe2 |
|
18-Jan-2017 |
Hans Petter Selasky <hselasky@FreeBSD.org> |
Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API support rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon next ip_output() a new "snd_tag" will be tried allocated. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months
|
#
cc94f0c2 |
|
13-Oct-2016 |
Gleb Smirnoff <glebius@FreeBSD.org> |
- Revert r300854, r303657 which tried to fix regression from r297225. - Fix the regression proper way using RO_RTFREE(). Submitted by: ae
|
#
6d768226 |
|
02-Jun-2016 |
George V. Neville-Neil <gnn@FreeBSD.org> |
This change re-adds L2 caching for TCP and UDP, as originally added in D4306 but removed due to other changes in the system. Restore the llentry pointer to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as appropriate. Submitted by: Mike Karels Differential Revision: https://reviews.freebsd.org/D6262
|
#
84cc0778 |
|
24-Mar-2016 |
George V. Neville-Neil <gnn@FreeBSD.org> |
FreeBSD previously provided route caching for TCP (and UDP). Re-add route caching for TCP, with some improvements. In particular, invalidate the route cache if a new route is added, which might be a better match. The cache is automatically invalidated if the old route is deleted. Submitted by: Mike Karels Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4306
|
#
ea8d1492 |
|
09-Jan-2016 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Remove sys/eventhandler.h from net/route.h Reviewed by: ae
|
#
edd0e0b0 |
|
25-Nov-2015 |
Fabien Thomas <fabient@FreeBSD.org> |
The r241129 description was wrong that the scenario is possible only for read locks on pcbs. The same race can happen with write lock semantics as well. The race scenario: - Two threads (1 and 2) locate pcb with writer semantics (INPLOOKUP_WLOCKPCB) and do in_pcbref() on it. - 1 and 2 both drop the inp hash lock. - Another thread (3) grabs the inp hash lock. Then it runs in_pcbfree(), which wlocks the pcb. They must happen faster than 1 or 2 come INP_WLOCK()! - 1 and 2 congest in INP_WLOCK(). - 3 does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which doesn't free the pcb due to two references on it. Then it unlocks the pcb. - 1 (or 2) gets wlock on the pcb, runs in_pcbrele_wlocked(), which doesn't report inp as freed, due to 2 (or 1) still helding extra reference on it. The thread tries to do smth with a disconnected pcb and crashes. Submitted by: emeric.poupon@stormshield.eu Reviewed by: gleb@ MFC after: 1 week Sponsored by: Stormshield Tested by: Cassiano Peixoto, Stormshield
|
#
079672cb |
|
08-Aug-2015 |
Julien Charbon <jch@FreeBSD.org> |
Fix a kernel assertion issue introduced with r286227: Avoid too strict INP_INFO_RLOCK_ASSERT checks due to tcp_notify() being called from in6_pcbnotify(). Reported by: Larry Rosenman <ler@lerctr.org> Submitted by: markj, jch
|
#
ff9b006d |
|
02-Aug-2015 |
Julien Charbon <jch@FreeBSD.org> |
Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability: - The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()). - A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb global counters. It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input()) and INP_INFO_WLOCK only in occasional operations that walk all connections. PR: 183659 Differential Revision: https://reviews.freebsd.org/D2599 Reviewed by: jhb, adrian Tested by: adrian, nitroboost-gmail.com Sponsored by: Verisign, Inc.
|
#
cc0a3c8c |
|
29-Jul-2015 |
Andrey V. Elsukov <ae@FreeBSD.org> |
Convert in_ifaddr_lock and in6_ifaddr_lock to rmlock. Both are used to protect access to IP addresses lists and they can be acquired for reading several times per packet. To reduce lock contention it is better to use rmlock here. Reviewed by: gnn (previous version) Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D3149
|
#
fd90e2ed |
|
22-May-2015 |
Jung-uk Kim <jkim@FreeBSD.org> |
CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten years for head. However, it is continuously misused as the mpsafe argument for callout_init(9). Deprecate the flag and clean up callout_init() calls to make them more consistent. Differential Revision: https://reviews.freebsd.org/D2613 Reviewed by: jhb MFC after: 2 weeks
|
#
b2bdc62a |
|
18-Jan-2015 |
Adrian Chadd <adrian@FreeBSD.org> |
Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific bits. The motivation here is to eventually teach netisr and potentially other networking subsystems a bit more about how RSS work queues / buckets are configured so things have a hope of auto-configuring in the future. * net/rss_config.[ch] takes care of the generic bits for doing configuration, hash function selection, etc; * topelitz.[ch] is now in net/ rather than netinet/; * (and would be in libkern if it didn't directly include RSS_KEYSIZE; that's a later thing to fix up.) * netinet/in_rss.[ch] now just contains the IPv4 specific methods; * and netinet/in6_rss.[ch] now just contains the IPv6 specific methods. This should have no functional impact on anyone currently using the RSS support. Differential Revision: D1383 Reviewed by: gnn, jfv (intel driver bits)
|
#
d7ebebfb |
|
22-Nov-2014 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Fix non-debug build after r274855.
|
#
8f465f66 |
|
22-Nov-2014 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Convert &in_ifaddr_lock to dual-locking model: use rwlock accessible via external functions (IN_IFADDR_CFG_* -> in_ifaddr_cfg_*()) for all control plane tasks use rmlock (IN_IFADDR_RUN_*) for fast-path lookups.
|
#
603eaf79 |
|
09-Nov-2014 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Renove faith(4) and faithd(8) from base. It looks like industry have chosen different (and more traditional) stateless/statuful NAT64 as translation mechanism. Last non-trivial commits to both faith(4) and faithd(8) happened more than 12 years ago, so I assume it is time to drop RFC3142 in FreeBSD. No objections from: net@
|
#
6df8a710 |
|
07-Nov-2014 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed. Sponsored by: Nginx, Inc.
|
#
9f65116c |
|
25-Oct-2014 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
* Increase nh_flags to be u16 thus reducing nhop payload to be 48 bytes * Use NHF_ namespace for all nhop flags * Rename nhop_data -> nhop_prepend * Rename fib4_lookup_nh_extended -> fib4_lookup_nh_ext * Add "flags" argument to fib4_lookup_nh_ext() to specify whether we want returned nh_ext structure to be refcounted or not.
|
#
f5070664 |
|
23-Oct-2014 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Add new fib4_lookup_nh_extended() which fills in nhop4_extended structure without doinf L2 resolve. It also requires freeing references by calling fib4_free_nh_ext(). Convert in_pcbladdr() to use it. Convert tcp_maxmtu() to use it.
|
#
2bb83c79 |
|
23-Oct-2014 |
Alexander V. Chernikov <melifaro@FreeBSD.org> |
Rename ip_sendmbuf to fib4_sendmbuf() and move it to rt_nhops api. Convert IPv4 SAS to use new routing api.
|
#
58a39d8c |
|
16-Sep-2014 |
Alan Somers <asomers@FreeBSD.org> |
Fix source address selection on unbound sockets in the presence of multiple fibs. Use the mbuf's or the socket's fib instead of RT_ALL_FIBS. Fixes PR 187553. Also fixes netperf's UDP_STREAM test on a nondefault fib. sys/netinet/ip_output.c In ip_output, lookup the source address using the mbuf's fib instead of RT_ALL_FIBS. sys/netinet/in_pcb.c in in_pcbladdr, lookup the source address using the socket's fib, because we don't seem to have the mbuf fib. They should be the same, though. tests/sys/net/fibs_test.sh Clear the expected failure on udp_dontroute. PR: 187553 CR: https://reviews.freebsd.org/D772 MFC after: 3 weeks Sponsored by: Spectra Logic
|
#
4f8585e0 |
|
11-Sep-2014 |
Alan Somers <asomers@FreeBSD.org> |
Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and ifa_ifwithdstaddr. For the sake of backwards compatibility, the new arguments were added to new functions named ifa_ifwithnet_fib and ifa_ifwithdstaddr_fib, while the old functions became wrappers around the new ones that passed RT_ALL_FIBS for the fib argument. However, the backwards compatibility is not desired for FreeBSD 11, because there are numerous other incompatible changes to the ifnet(9) API. We therefore decided to remove it from head but leave it in place for stable/9 and stable/10. In addition, this commit adds the fib argument to ifa_ifwithbroadaddr for consistency's sake. sys/sys/param.h Increment __FreeBSD_version sys/net/if.c sys/net/if_var.h sys/net/route.c Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute. sys/net/route.c sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_options.c sys/netinet/ip_output.c sys/netinet6/nd6.c Fixup calls of modified functions. share/man/man9/ifnet.9 Document changed API. CR: https://reviews.freebsd.org/D458 MFC after: Never Sponsored by: Spectra Logic
|
#
1b44e5ff |
|
09-Sep-2014 |
Andrey V. Elsukov <ae@FreeBSD.org> |
Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of IPv6 address as hash key in all places. Obtained from: Yandex LLC
|
#
39c8c62e |
|
29-Jul-2014 |
Hiren Panchasara <hiren@FreeBSD.org> |
Add a comment and while there, fix trailing whitespace.
|
#
d5bb8bd3 |
|
11-Jul-2014 |
Adrian Chadd <adrian@FreeBSD.org> |
Expose in_pcbbind_check_bindmulti() so the upcoming IPv6 RSS changes can be made to use it.
|
#
0a100a6f |
|
09-Jul-2014 |
Adrian Chadd <adrian@FreeBSD.org> |
Implement the first stage of multi-bind listen sockets and RSS socket awareness. * Introduce IP_BINDMULTI - indicating that it's okay to bind multiple sockets on the same bind details. Although the PCB code has been taught about this (see below) this patch doesn't introduce the rest of the PCB changes necessary to distribute lookups among multiple PCB entries in the global wildcard table. * Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the given RSS bucket (and thus a single PCBGROUP hash.) * Modify the PCB add path to be aware of IP_BINDMULTI: + Only allow further PCB entries to be added if the owner credentials and IP_BINDMULTI has been specified. Ie, only allow further IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI. * Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries. Instead of using the wildcard logic and hashing, these sockets are simply placed into the PCBGROUP and _not_ in the wildcard hash. * When doing a PCBGROUP lookup, also do a wildcard match as well. This allows for an RSS bucket PCB entry to appear in a PCBGROUP rather than having to exist in the wildcard list. Tested: * TCP IPv4 server testing with igb(4) * TCP IPv4 server testing with ix(4) TODO: * The pcbgroup lookup code duplicated the wildcard and wildcard-PCB logic. This could be refactored into a single function. * This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't yet know about this); nor does it yet fully work for UDP.
|
#
2f308a34 |
|
29-May-2014 |
Alan Somers <asomers@FreeBSD.org> |
Fix unintended KBI change from r264905. Add _fib versions of ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the _fib() versions with RT_ALL_FIBS, preserving legacy behavior. sys/net/if_var.h sys/net/if.c Add legacy-compatible functions as described above. Ensure legacy behavior when RT_ALL_FIBS is passed as fibnum. sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/net/route.c sys/net/rtsock.c sys/netinet6/nd6.c Call with _fib() functions if we must use a specific fib, or the legacy functions otherwise. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c Improve the udp_dontroute test. The bug that this test exercises is that ifa_ifwithnet() will return the wrong address, if multiple interfaces have addresses on the same subnet but with different fibs. The previous version of the test only considered one possible failure mode: that ifa_ifwithnet_fib() might fail to find any suitable address at all. The new version also checks whether ifa_ifwithnet_fib() finds the correct address by checking where the ARP request goes. Reported by: bz, hrs Reviewed by: hrs MFC after: 1 week X-MFC-with: 264905 Sponsored by: Spectra Logic
|
#
0cfee0c2 |
|
24-Apr-2014 |
Alan Somers <asomers@FreeBSD.org> |
Fix subnet and default routes on different FIBs on the same subnet. These two bugs are closely related. The root cause is that ifa_ifwithnet does not consider FIBs when searching for an interface address. sys/net/if_var.h sys/net/if.c Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those functions will only return an address whose interface fib equals the argument. sys/net/route.c Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib arguments. sys/netinet/in.c Update in_addprefix to consider the interface fib when adding prefixes. This will prevent it from not adding a subnet route when one already exists on a different fib. sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/netinet6/nd6.c Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet. In some cases it there wasn't a clear specific fib number to use. In others, I was unable to test those functions so I chose RT_DEFAULT_FIB to minimize divergence from current behavior. I will fix some of the latter changes along with PR kern/187553. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c tests/sys/netinet/Makefile Revert r263738. The udp_dontroute test was right all along. However, bugs kern/187550 and kern/187553 cancelled each other out when it came to this test. Because of kern/187553, ifa_ifwithnet searched the default fib instead of the requested one, but because of kern/187550, there was an applicable subnet route on the default fib. The new test added in r263738 doesn't work right, however. I can verify with dtrace that ifa_ifwithnet returned the wrong address before I applied this commit, but route(8) miraculously found the correct interface to use anyway. I don't know how. Clear expected failure messages for kern/187550 and kern/187552. PR: kern/187550 PR: kern/187552 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic
|
#
ae190832 |
|
23-Apr-2014 |
Steven Hartland <smh@FreeBSD.org> |
Fix jailed raw sockets not setting the correct source address by calling in_pcbladdr instead of prison_get_ip4 MFC after: 1 month
|
#
e06e816f |
|
06-Apr-2014 |
Kevin Lo <kevlo@FreeBSD.org> |
Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks. Tested with vlc and a test suite [1]. [1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz Reviewed by: jhb, glebius, adrian
|
#
7527624e |
|
14-Mar-2014 |
Robert Watson <rwatson@FreeBSD.org> |
Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)
|
#
96d11124 |
|
07-Feb-2014 |
Gleb Smirnoff <glebius@FreeBSD.org> |
Catch up on r261590.
|
#
76039bc8 |
|
26-Oct-2013 |
Gleb Smirnoff <glebius@FreeBSD.org> |
The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare to this event, adding if_var.h to files that do need it. Also, include all includes that now are included due to implicit pollution via if_var.h Sponsored by: Netflix Sponsored by: Nginx, Inc.
|
#
f122b319 |
|
12-Jul-2013 |
Mikolaj Golub <trociny@FreeBSD.org> |
A complete duplication of binding should be allowed if on both new and duplicated sockets a multicast address is bound and either SO_REUSEPORT or SO_REUSEADDR is set. But actually it works for the following combinations: * SO_REUSEPORT is set for the fist socket and SO_REUSEPORT for the new; * SO_REUSEADDR is set for the fist socket and SO_REUSEADDR for the new; * SO_REUSEPORT is set for the fist socket and SO_REUSEADDR for the new; and fails for this: * SO_REUSEADDR is set for the fist socket and SO_REUSEPORT for the new. Fix the last case. PR: 179901 MFC after: 1 month
|
#
efdf104b |
|
04-Jul-2013 |
Mikolaj Golub <trociny@FreeBSD.org> |
In r227207, to fix the issue with possible NULL inp_socket pointer dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR for multicast), INP_REUSEPORT flag was introduced to cache the socket option. It was decided then that one flag would be enough to cache both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR setsockopt(2), it was checked if it was called for a multicast address and INP_REUSEPORT was set accordingly. Unfortunately that approach does not work when setsockopt(2) is called before binding to a multicast address: the multicast check fails and INP_REUSEPORT is not set. Fix this by adding INP_REUSEADDR flag to unconditionally cache SO_REUSEADDR. PR: 179901 Submitted by: Michael Gmelin freebsd grem.de (initial version) Reviewed by: rwatson MFC after: 1 week
|
#
5cd3dcaa |
|
25-Jan-2013 |
Navdeep Parhar <np@FreeBSD.org> |
Remove redundant test, we know inp_lport is 0. MFC after: 1 week
|
#
6acd596e |
|
07-Dec-2012 |
Pawel Jakub Dawidek <pjd@FreeBSD.org> |
More warnings for zones that depend on the kern.ipc.maxsockets limit. Obtained from: WHEEL Systems
|
#
df4e91d3 |
|
01-Oct-2012 |
Gleb Smirnoff <glebius@FreeBSD.org> |
There is a complex race in in_pcblookup_hash() and in_pcblookup_group(). Both functions need to obtain lock on the found PCB, and they can't do classic inter-lock with the PCB hash lock, due to lock order reversal. To keep the PCB stable, these functions put a reference on it and after PCB lock is acquired drop it. If the reference was the last one, this means we've raced with in_pcbfree() and the PCB is no longer valid. This approach works okay only if we are acquiring writer-lock on the PCB. In case of reader-lock, the following scenario can happen: - 2 threads locate pcb, and do in_pcbref() on it. - These 2 threads drop the inp hash lock. - Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock, does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which doesn't free the pcb due to two references on it. Then it unlocks the pcb. - 2 aforementioned threads acquire reader lock on the pcb and run in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues, second gets 0 and considers pcb freed, returns. - The thread that got 1 continutes working with detached pcb, which later leads to panic in the underlying protocol level. To plumb that problem an additional INPCB flag introduced - INP_FREED. We check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend that that was the last reference. Discussed with: rwatson, jhb Reported by: Vladimir Medvedkin <medved rambler-co.ru>
|
#
3cca425b |
|
12-Jun-2012 |
Michael Tuexen <tuexen@FreeBSD.org> |
Add a IP_RECVTOS socket option to receive for received UDP/IPv4 packets a cmsg of type IP_RECVTOS which contains the TOS byte. Much like IP_RECVTTL does for TTL. This allows to implement a protocol on top of UDP and implementing ECN. MFC after: 3 days
|
#
83e521ec |
|
21-Jan-2012 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Clean up some #endif comments removing from short sections. Add #endif comments to longer, also refining strange ones. Properly use #ifdef rather than #if defined() where possible. Four #if defined(PCBGROUP) occurances (netinet and netinet6) were ignored to avoid conflicts with eventually upcoming changes for RSS. Reported by: bde (most) Reviewed by: bde MFC after: 3 days
|
#
137f91e8 |
|
05-Jan-2012 |
John Baldwin <jhb@FreeBSD.org> |
Convert all users of IF_ADDR_LOCK to use new locking macros that specify either a read lock or write lock. Reviewed by: bz MFC after: 2 weeks
|
#
6472ac3d |
|
07-Nov-2011 |
Ed Schouten <ed@FreeBSD.org> |
Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
|
#
fc06cd42 |
|
06-Nov-2011 |
Mikolaj Golub <trociny@FreeBSD.org> |
Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid inp_socket->so_options dereference when we may not acquire the lock on the inpcb. This fixes the crash due to NULL pointer dereference in in_pcbbind_setup() when inp_socket->so_options in a pcb returned by in_pcblookup_local() was checked. Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com> Suggested by: rwatson Glanced by: rwatson Tested by: dave jones <s.dave.jones@gmail.com>
|
#
ec95b709 |
|
06-Nov-2011 |
Mikolaj Golub <trociny@FreeBSD.org> |
Fix the typo made in r157474. MFC after: 3 days
|
#
52cd27cb |
|
05-Jun-2011 |
Robert Watson <rwatson@FreeBSD.org> |
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
#
d3c1f003 |
|
04-Jun-2011 |
Robert Watson <rwatson@FreeBSD.org> |
Add _mbuf() variants of various inpcb-related interfaces, including lookup, hash install, etc. For now, these are arguments are unused, but as we add RSS support, we will want to use hashes extracted from mbufs, rather than manually calculated hashes of header fields, due to the expensive of the software version of Toeplitz (and similar hashes). Add notes that it would be nice to be able to pass mbufs into lookup routines in pf(4), optimising firewall lookup in the same way, but the code structure there doesn't facilitate that currently. (In principle there is no reason this couldn't be MFCed -- the change extends rather than modifies the KBI. However, it won't be useful without other previous possibly less MFCable changes.) Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
#
d2025bd0 |
|
30-May-2011 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Unbreak NOINET kernels after r222488. Reviewed by: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems! Pointy hat: to myself for missing this during review?
|
#
fa046d87 |
|
30-May-2011 |
Robert Watson <rwatson@FreeBSD.org> |
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
#
61401ec2 |
|
24-May-2011 |
Robert Watson <rwatson@FreeBSD.org> |
An inpcb lock is no longer required in in_pcbref() since the move to refcount(9). MFC after: 3 weeks Sponsored by: Juniper Networks, Inc.
|
#
79bdc6e5 |
|
23-May-2011 |
Robert Watson <rwatson@FreeBSD.org> |
Continue to refine inpcb reference counting and locking, in preparation for reworking of inpcbinfo locking: (1) Convert inpcb reference counting from manually manipulated integers to the refcount(9) KPI. This allows the refcount to be managed atomically with an inpcb read lock rather than write lock, or even with no inpcb lock at all. As a result, in_pcbref() also no longer requires an inpcb lock, so can be performed solely using the lock used to look up an inpcb. (2) Shift more inpcb freeing activity from the in_pcbrele() context (via in_pcbfree_internal) to the explicit in_pcbfree() context. This means that the inpcb refcount is increasingly used only to maintain memory stability, not actually defer the clean up of inpcb protocol parts. This is desirable as many of those protocol parts required the pcbinfo lock, which we'd like not to acquire in in_pcbrele() contexts. Document this in comments better. (3) Introduce new read-locked and write-locked in_pcbrele() variations, in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to be properly unlocked as needed. in_pcbrele() is a wrapper around the latter, and should probably go away at some point. This makes it easier to use this weak reference model when holding only a read lock, as will happen in the future. This may well be safe to MFC, but some more KBI analysis is required. Reviewed by: bz MFC after: 3 weeks Sponsored by: Juniper Networks, Inc.
|
#
68e0d7e0 |
|
23-May-2011 |
Robert Watson <rwatson@FreeBSD.org> |
Move from passing a wildcard boolean to a general set up lookup flags into in_pcb_lport(), in_pcblookup_local(), and in_pcblookup_hash(), and similarly for IPv6 functions. In the future, we would like to support other flags relating to locking strategy. This change doesn't appear to modify the KBI in practice, as callers already passed in INPLOOKUP_WILDCARD rather than a simple boolean. MFC after: 3 weeks Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
#
67107f45 |
|
30-Apr-2011 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Make the PCB code compile without INET support by adding #ifdef INETs and correcting few #includes. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days
|
#
aae49dd3 |
|
20-Apr-2011 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
MFp4 CH=191470: Move the ipport_tick_callout and related functions from ip_input.c to in_pcb.c. The random source port allocation code has been merged and is now local to in_pcb.c only. Use a SYSINIT to get the callout started and no longer depend on initialization from the inet code, which would not work in an IPv6 only setup. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days
|
#
4d457387 |
|
19-Mar-2011 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Properly check for an IPv4 socket after r219579. In some cases as udp6_connect() without an earlier bind(2) to an address, v4-mapped scokets allowed and a non mapped destination address, we can end up here with both v4 and v6 indicated: inp_vflag = (INP_IPV4|INP_IPV6|INP_IPV6PROTO) In that case however laddrp is NULL as the IPv6 path does not pass in a copy currently. Reported by: Pawel Worach (pawel.worach gmail.com) Tested by: Pawel Worach (pawel.worach gmail.com) MFC after: 6 days X-MFC with: r219579
|
#
efc76f72 |
|
12-Mar-2011 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Merge the two identical implementations for local port selections from in_pcbbind_setup() and in6_pcbsetport() in a single in_pcb_lport(). MFC after: 2 weeks
|
#
e691be70 |
|
26-Jan-2011 |
Daniel Eischen <deischen@FreeBSD.org> |
Prison check addresses set with multicast interface options. Reviewed by: bz MFC after: 1 week
|
#
d79fdd98 |
|
08-Jan-2011 |
Daniel Eischen <deischen@FreeBSD.org> |
Make sure to always do source address selection on an unbound socket, regardless of any multicast options. If an address is specified via a multicast option, then let it override normal the source address selection. This fixes a bug where source address selection was not being performed when multicast options were present but without an interface being specified. Reviewed by: bz MFC after: 1 day
|
#
eab54f6a |
|
27-Dec-2010 |
Robert Watson <rwatson@FreeBSD.org> |
Remove comment bemoaning the lack of an INP_INHASHLIST above in_pcbdrop(); I fixed this in r189657 in early 2009, so the comment is OBE. Reviewed by: bz MFC after: 3 days
|
#
3e288e62 |
|
22-Nov-2010 |
Dimitry Andric <dim@FreeBSD.org> |
After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
|
#
31c6a003 |
|
14-Nov-2010 |
Dimitry Andric <dim@FreeBSD.org> |
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
|
#
04215ed2 |
|
10-Nov-2010 |
Randall Stewart <rrs@FreeBSD.org> |
Fix so that a multicast packet can be sent even if there is no route out to that mcast address. The code in in_pcb inadvertantly would error (no route) even though the user may have specified the address with the proper socket option (to specify the egress interface). Thanks bz for reminding me I forgot to commit this ;-) Reviewed by: bz MFC after: 1 week
|
#
a7d5f7eb |
|
19-Oct-2010 |
Jamie Gritton <jamie@FreeBSD.org> |
A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
|
#
dd62f5c0 |
|
25-Jun-2010 |
Qing Li <qingli@FreeBSD.org> |
MFC r208553 This patch fixes the problem where proxy ARP entries cannot be added over the if_ng interface. Approved by: re (bz)
|
#
0ed6142b |
|
25-May-2010 |
Qing Li <qingli@FreeBSD.org> |
This patch fixes the problem where proxy ARP entries cannot be added over the if_ng interface. MFC after: 3 days
|
#
9bcd427b |
|
14-Mar-2010 |
Robert Watson <rwatson@FreeBSD.org> |
Abstract out initialization of most aspects of struct inpcbinfo from their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy() to do this work in a central spot. As inpcbinfo becomes more complex due to ongoing work to add connection groups, this will reduce code duplication. MFC after: 1 month Reviewed by: bz Sponsored by: Juniper Networks
|
#
3bcceea4 |
|
23-Jan-2010 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
MFC r202468: Add ip4.saddrsel/ip4.nosaddrsel (and equivalent for ip6) to control whether to use source address selection (default) or the primary jail address for unbound outgoing connections. This is intended to be used by people upgrading from single-IP jails to multi-IP jails but not having to change firewall rules, application ACLs, ... but to force their connections (unless otherwise changed) to the primry jail IP they had been used for years, as well as for people prefering to implement similar policies. Note that for IPv6, if configured incorrectly, this might lead to scope violations, which single-IPv6 jails could as well, as by the design of jails. [1] Reviewed by: jamie, hrs (ipv6 part) Pointed out by: hrs [1]
|
#
592bcae8 |
|
16-Jan-2010 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Add ip4.saddrsel/ip4.nosaddrsel (and equivalent for ip6) to control whether to use source address selection (default) or the primary jail address for unbound outgoing connections. This is intended to be used by people upgrading from single-IP jails to multi-IP jails but not having to change firewall rules, application ACLs, ... but to force their connections (unless otherwise changed) to the primry jail IP they had been used for years, as well as for people prefering to implement similar policies. Note that for IPv6, if configured incorrectly, this might lead to scope violations, which single-IPv6 jails could as well, as by the design of jails. [1] Reviewed by: jamie, hrs (ipv6 part) Pointed out by: hrs [1] MFC After: 2 weeks Asked for by: Jase Thew (bazerka beardz.net)
|
#
599f45c5 |
|
15-Sep-2009 |
Qing Li <qingli@FreeBSD.org> |
MFC r197203 Previously local end of point-to-point interface is not reachable within the system that owns the interface. Packets destined to the local end point leak to the wire towards the default gateway if one exists. This behavior is changed as part of the L2/L3 rewrite efforts. The local end point is now reachable within the system. The inpcb code needs to consider this fact during the address selection process. Reviewed by: bz Approved by: re
|
#
f0bb05fc |
|
14-Sep-2009 |
Qing Li <qingli@FreeBSD.org> |
Previously local end of point-to-point interface is not reachable within the system that owns the interface. Packets destined to the local end point leak to the wire towards the default gateway if one exists. This behavior is changed as part of the L2/L3 rewrite efforts. The local end point is now reachable within the system. The inpcb code needs to consider this fact during the address selection process. Reviewed by: bz MFC after: immediately
|
#
530c0060 |
|
01-Aug-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
|
#
5ee847d3 |
|
19-Jul-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Reimplement and/or implement vnet list locking by replacing a mostly unused custom mutex/condvar-based sleep locks with two locks: an rwlock (for non-sleeping use) and sxlock (for sleeping use). Either acquired for read is sufficient to stabilize the vnet list, but both must be acquired for write to modify the list. Replace previous no-op read locking macros, used in various places in the stack, with actual locking to prevent race conditions. Callers must declare when they may perform unbounded sleeps or not when selecting how to lock. Refactor vnet sysinits so that the vnet list and locks are initialized before kernel modules are linked, as the kernel linker will use them for modules loaded by the boot loader. Update various consumers of these KPIs based on whether they may sleep or not. Reviewed by: bz Approved by: re (kib)
|
#
1e77c105 |
|
16-Jul-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
|
#
eddfbb76 |
|
14-Jul-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing the in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of a the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
|
#
2d9cfaba |
|
25-Jun-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the in_ifaddrhead and INADDR_HASH address lists. Previously, these lists were used unsynchronized as they were effectively never changed in steady state, but we've seen increasing reports of writer-writer races on very busy VPN servers as core count has gone up (and similar configurations where address lists change frequently and concurrently). For the time being, use rwlocks rather than rmlocks in order to take advantage of their better lock debugging support. As a result, we don't enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion is complete and a performance analysis has been done. This means that one class of reader-writer races still exists. MFC after: 6 weeks Reviewed by: bz
|
#
8c0fec80 |
|
23-Jun-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)
|
#
8896f83a |
|
22-Jun-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Add a new function, ifa_ifwithaddr_check(), which rather than returning a pointer to an ifaddr matching the passed socket address, returns a boolean indicating whether one was present. In the (near) future, ifa_ifwithaddr() will return a referenced ifaddr rather than a raw ifaddr pointer, and the new wrapper will allow callers that care only about the boolean condition to avoid having to free that reference. MFC after: 3 weeks
|
#
173de0f9 |
|
22-Jun-2009 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Remove a hack from r186086 so that IPsec via loopback routes continued working. It was targeted for stable/7 compatibility and actually never did anything in HEAD. Reminded by: rwatson X-MFC after: never
|
#
bcf11e8d |
|
05-Jun-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
|
#
3de40469 |
|
03-Jun-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Continue work to optimize performance of "options MAC" when no MAC policy modules are loaded by avoiding mbuf label lookups when policies aren't loaded, pushing further socket locking into MAC policy modules, and avoiding locking MAC ifnet locks when no policies are loaded: - Check mac_policies_count before looking for mbuf MAC label m_tags in MAC Framework entry points. We will still pay label lookup costs if MAC policies are present but don't require labels (typically a single mbuf header field read, but perhaps further indirection if IPSEC or other m_tag consumers are in use). - Further push socket locking for socket-related access control checks and events into MAC policies from the MAC Framework, so that sockets are only locked if a policy specifically requires a lock to protect a label. This resolves lock order issues during sonewconn() and also in local domain socket cross-connect where multiple socket locks could not be held at once for the purposes of propagatig MAC labels across multiple sockets. Eliminate mac_policy_count check in some entry points where it no longer avoids locking. - Add mac_policy_count checking in some entry points relating to network interfaces that otherwise lock a global MAC ifnet lock used to protect ifnet labels. Obtained from: TrustedBSD Project
|
#
f44270e7 |
|
01-Jun-2009 |
Pawel Jakub Dawidek <pjd@FreeBSD.org> |
- Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option as there is no more space for more SOL_SOCKET options, but this option also fits better as an IP socket option, it seems. - Implement this functionality also for IPv6 and RAW IP sockets. - Always compile it in (don't use additional kernel options). - Remove sysctl to turn this functionality on and off. - Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this functionality (currently only unjail root can use it). Discussed with: julian, adrian, jhb, rwatson, kmacy
|
#
0304c731 |
|
27-May-2009 |
Jamie Gritton <jamie@FreeBSD.org> |
Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
|
#
6d888973 |
|
14-May-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Staticize two functions not used outside of in_pcb.c: in_pcbremlists() and db_print_inpcb(). MFC after: 1 month
|
#
f6dfe47a |
|
30-Apr-2009 |
Marko Zec <zec@FreeBSD.org> |
Permit buiding kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_* macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_* family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)
|
#
1096332a |
|
29-Apr-2009 |
Bruce M Simpson <bms@FreeBSD.org> |
Do not assume that ip6_moptions is always set, it is a lazy-allocated structure.
|
#
9317b04e |
|
19-Apr-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Lock interface address lists in in_pcbladdr() when searching for a source address for a connection and there's no route or now interface for the route. MFC after: 2 weeks
|
#
ad71fe3c |
|
15-Mar-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Correct a number of evolved problems with inp_vflag and inp_flags: certain flags that should have been in inp_flags ended up in inp_vflag, meaning that they were inconsistently locked, and in one case, interpreted. Move the following flags from inp_vflag to gaps in the inp_flags space (and clean up the inp_flags constants to make gaps more obvious to future takers): INP_TIMEWAIT INP_SOCKREF INP_ONESBCAST INP_DROPPED Some aspects of this change have no effect on kernel ABI at all, as these are UDP/TCP/IP-internal uses; however, netstat and sockstat detect INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this into account. MFC after: 1 week (or after dependencies are MFC'd) Reviewed by: bz
|
#
111d57a6 |
|
10-Mar-2009 |
Robert Watson <rwatson@FreeBSD.org> |
Add INP_INHASHLIST flag for inpcb->inp_flags to indicate whether or not the inpcb is currenty on various hash lookup lists, rather than using (lport != 0) to detect this. This means that the full 4-tuple of a connection can be retained after close, which should lead to more sensible netstat output in the window between TCP close and socket close. MFC after: 2 weeks
|
#
7c2f3cb9 |
|
05-Feb-2009 |
Jamie Gritton <jamie@FreeBSD.org> |
Remove redundant calls of prison_local_ip4 in in_pcbbind_setup, and of prison_local_ip6 in in6_pcbbind. Approved by: bz (mentor)
|
#
b89e82dd |
|
05-Feb-2009 |
Jamie Gritton <jamie@FreeBSD.org> |
Standardize the various prison_foo_ip[46] functions and prison_if to return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor)
|
#
1cecba0f |
|
25-Jan-2009 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
For consistency with prison_{local,remote,check}_ipN rename prison_getipN to prison_get_ipN. Submitted by: jamie (as part of a larger patch) MFC after: 1 week
|
#
8696873d |
|
09-Jan-2009 |
Adrian Chadd <adrian@FreeBSD.org> |
Fix fat-fingered comment. Noticed-by: julian
|
#
4209e01a |
|
09-Jan-2009 |
Adrian Chadd <adrian@FreeBSD.org> |
Comment some potentially confusing logic. Nitpicking by: mlaier MFC after: 2 weeks
|
#
be9347e3 |
|
09-Jan-2009 |
Adrian Chadd <adrian@FreeBSD.org> |
Implement a new IP option (not compiled/enabled by default) to allow applications to specify a non-local IP address when bind()'ing a socket to a local endpoint. This allows applications to spoof the client IP address of connections if (obviously!) they somehow are able to receive the traffic normally destined to said clients. This patch doesn't include any changes to ipfw or the bridging code to redirect the client traffic through the PCB checks so TCP gets a shot at it. The normal behaviour is that packets with a non-local destination IP address are not handled locally. This can be dealth with some IPFW hackery; modifications to IPFW to make this less hacky will occur in subsequent commmits. Thanks to Julian Elischer and others at Ironport. This work was approved and donated before Cisco acquired them. Obtained from: Julian Elischer and others MFC after: 2 weeks
|
#
dcdb4371 |
|
16-Dec-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Use inc_flags instead of the inc_isipv6 alias which so far had been the only flag with random usage patterns. Switch inc_flags to be used as a real bit field by using INC_ISIPV6 with bitops to check for the 'isipv6' condition. While here fix a place or two where in case of v4 inc_flags were not properly initialized before.[1] Found by: rwatson during review [1] Discussed with: rwatson Reviewed by: rwatson MFC after: 4 weeks
|
#
6e6b3f7c |
|
14-Dec-2008 |
Qing Li <qingli@FreeBSD.org> |
This main goals of this project are: 1. separating L2 tables (ARP, NDP) from the L3 routing tables 2. removing as much locking dependencies among these layers as possible to allow for some parallelism in the search operations 3. simplify the logic in the routing code, The most notable end result is the obsolescent of the route cloning (RTF_CLONING) concept, which translated into code reduction in both IPv4 ARP and IPv6 NDP related modules, and size reduction in struct rtentry{}. The change in design obsoletes the semantics of RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland applications such as "arp" and "ndp" have been modified to reflect those changes. The output from "netstat -r" shows only the routing entries. Quite a few developers have contributed to this project in the past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and Andre Oppermann. And most recently: - Kip Macy revised the locking code completely, thus completing the last piece of the puzzle, Kip has also been conducting active functional testing - Sam Leffler has helped me improving/refactoring the code, and provided valuable reviews - Julian Elischer setup the perforce tree for me and has helped me maintaining that branch before the svn conversion
|
#
03d8b6fd |
|
14-Dec-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Add a check, that is currently under discussion for 8 but that we need to keep for 7-STABLE when MFCing in_pcbladdr() to not change the behaviour there. With this a destination route via a loopback interface is treated as a valid and reachable thing for IPv4 source address selection, even though nothing of that network is ever directly reachable, but it is more like a blackhole route. With this the source address will be selected and IPsec can grab the packets before we would discard them at a later point, encapsulate them and send them out from a different tunnel endpoint IP. Discussed on: net Reported by: Frank Behrens <frank@harz.behrens.de> Tested by: Frank Behrens <frank@harz.behrens.de> MFC after: 4 weeks (just so that I get the mail)
|
#
cd416355 |
|
10-Dec-2008 |
Robert Watson <rwatson@FreeBSD.org> |
Remove inconsistent white space from in_pcballoc(). MFC after: pretty soon
|
#
28696211 |
|
08-Dec-2008 |
Robert Watson <rwatson@FreeBSD.org> |
Add a reference count to struct inpcb, which may be explicitly incremented using in_pcbref(), and decremented using in_pcbfree() or inpcbrele(). Protocols using only current in_pcballoc() and in_pcbfree() calls will see the same semantics, but it is now possible for TCP to call in_pcbref() and in_pcbrele() to prevent an inpcb from being freed when both tcbinfo and per-inpcb locks are released. This makes it possible to safely transition from holding only the inpcb lock to both tcbinfo and inpcb lock without re-looking up a connection in the input path, timer path, etc. Notice that in_pcbrele() does not unlock the connection after decrementing the refcount, if the connection remains, so that the caller can continue to use it; in_pcbrele() returns a flag indicating whether or not the inpcb pointer is still valid, and in_pcbfee() is now a simple wrapper around in_pcbrele(). MFC after: 1 month Discussed with: bz, kmacy Reviewed by: bz, gnn, kmacy Tested by: kmacy
|
#
4b79449e |
|
02-Dec-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Rather than using hidden includes (with cicular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
|
#
413628a7 |
|
29-Nov-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
|
#
5cd54324 |
|
27-Nov-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Replace most INP_CHECK_SOCKAF() uses checking if it is an IPv6 socket by comparing a constant inp vflag. This is expected to help to reduce extra locking. Suggested by: rwatson Reviewed by: rwatson MFC after: 6 weeks
|
#
6aee2fc5 |
|
26-Nov-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Merge in6_pcbfree() into in_pcbfree() which after the previous IPsec change in r185366 only differed in two additonal IPv6 lines. Rather than splattering conditional code everywhere add the v6 check centrally at this single place. Reviewed by: rwatson (as part of a larger changset) MFC after: 6 weeks (*) (*) possibly need to leave a stub wrapper in 7 to keep the symbol.
|
#
6974bd9e |
|
27-Nov-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Unify ipsec[46]_delete_pcbpolicy in ipsec_delete_pcbpolicy. Ignoring different names because of macros (in6pcb, in6p_sp) and inp vs. in6p variable name both functions were entirely identical. Reviewed by: rwatson (as part of a larger changeset) MFC after: 6 weeks (*) (*) possibly need to leave a stub wrappers in 7 to keep the symbols.
|
#
97021c24 |
|
26-Nov-2008 |
Marko Zec <zec@FreeBSD.org> |
Merge more of currently non-functional (i.e. resolving to whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
|
#
a7df09e8 |
|
25-Nov-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Unify the v4 and v6 versions of pcbdetach and pcbfree as good as possible so that they are easily diffable. No functional changes. Reviewed by: rwatson MFC after: 6 weeks
|
#
44e33a07 |
|
19-Nov-2008 |
Marko Zec <zec@FreeBSD.org> |
Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
|
#
1ede983c |
|
23-Oct-2008 |
Dag-Erling Smørgrav <des@FreeBSD.org> |
Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months
|
#
7e1bc272 |
|
20-Oct-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Update a comment which to my reading had been misplaced in rev. 1.12 already (but probably had been way above as the code was there twice) and describe what was last changed in rev. 1.199 there (which now is in sync with in6_src.c r184096). Pointed at by: mlaier MFC after: 2 mmonths
|
#
d7f03759 |
|
19-Oct-2008 |
Ulf Lilleengen <lulf@FreeBSD.org> |
- Import the HEAD csup code which is the basis for the cvsmode work.
|
#
86d02c5c |
|
04-Oct-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Cache so_cred as inp_cred in the inpcb. This means that inp_cred is always there, even after the socket has gone away. It also means that it is constant for the lifetime of the inp. Both facts lead to simpler code and possibly less locking. Suggested by: rwatson Reviewed by: rwatson MFC after: 6 weeks X-MFC Note: use a inp_pspare for inp_cred
|
#
0895aec3 |
|
02-Oct-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Implement IPv4 source address selection for unbound sockets. For the jail case we are already looping over the interface addresses before falling back to the only IP address of a jail in case of no match. This is in preparation for the upcoming multi-IPv4/v6/no-IP jail patch this change was developed with initially. This also changes the semantics of selecting the IP for processes within a jail as it now uses the same logic as outside the jail (with additional checks) but no longer is on a mutually exclusive code path. Benchmarks had shown no difference at 95.0% confidence for neither the plain nor the jail case (even with the additional overhead). See: http://lists.freebsd.org/pipermail/freebsd-net/2008-September/019531.html Inpsired by a patch from: Yahoo! (partially) Tested by: latest multi-IP jail patch users (implictly) Discussed with: rwatson (general things around this) Reviewed by: mostly silence (feedback from bms) Help with benchmarking from: kris MFC after: 2 months
|
#
8b615593 |
|
02-Oct-2008 |
Marko Zec <zec@FreeBSD.org> |
Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparision between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
|
#
c0a211c5 |
|
29-Sep-2008 |
Robert Watson <rwatson@FreeBSD.org> |
Expand comments relating various detach/free/drop inpcb routines. MFC after: 3 days
|
#
603724d3 |
|
17-Aug-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
|
#
5cb2685a |
|
07-Aug-2008 |
Robert Watson <rwatson@FreeBSD.org> |
Minor white space tweaks. MFC after: 1 week
|
#
72bed082 |
|
07-Aug-2008 |
Robert Watson <rwatson@FreeBSD.org> |
Correct comment typo. MFC after: 1 week (after inpcb rwlocking)
|
#
df9cf830 |
|
21-Jul-2008 |
Tai-hwa Liang <avatar@FreeBSD.org> |
Trying to fix compilation bustage: - removing 'const' qualifier from an input parameter to conform to the type required by rw_assert(); - using in_addr->s_addr to retrive 32 bits address value. Observed by: tinderbox
|
#
9d29c635 |
|
21-Jul-2008 |
Kip Macy <kmacy@FreeBSD.org> |
make new accessor functions consistent with existing style
|
#
dd0e6c38 |
|
20-Jul-2008 |
Kip Macy <kmacy@FreeBSD.org> |
Add accessor functions for socket fields. MFC after: 1 week
|
#
9378e437 |
|
20-Jul-2008 |
Kip Macy <kmacy@FreeBSD.org> |
add inpcb accessor functions for fields needed by TOE devices
|
#
8699ea08 |
|
19-Jul-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
ia is a pointer thus use NULL rather then 0 for initialization and in comparisons to make this more obvious. MFC after: 5 days
|
#
078b7042 |
|
10-Jul-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Pass the ucred along into in{,6}_pcblookup_local for upcoming prison checks. Reviewed by: rwatson
|
#
cdcb11b9 |
|
10-Jul-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
For consistency take lport as u_short in in{,6}_pcblookup_local. All callers either pass in an u_short or u_int16_t. Reviewed by: rwatson
|
#
e5cf427b |
|
09-Jul-2008 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
For consistency with the rest of the function use the locally cached pointer pcbinfo rather than inp->inp_pcbinfo. MFC after: 3 weeks
|
#
8b07e49a |
|
09-May-2008 |
Julian Elischer <julian@FreeBSD.org> |
Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extent that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being reponded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect withthe proces that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routes accordingly. ipfw has grown 2 new keywords: setfib N ip from anay to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibilty of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the desination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
|
#
a69042a5 |
|
19-Apr-2008 |
Robert Watson <rwatson@FreeBSD.org> |
When querying the local or foreign address from an IP socket, acquire only a read lock on the inpcb. When an external module requests a read lock, acquire only a read lock. MFC after: 3 months
|
#
8501a69c |
|
17-Apr-2008 |
Robert Watson <rwatson@FreeBSD.org> |
Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock main exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committered patch)
|
#
f457d580 |
|
06-Apr-2008 |
Robert Watson <rwatson@FreeBSD.org> |
In in_pcbnotifyall() and in6_pcbnotify(), use LIST_FOREACH_SAFE() and eliminate unnecessary local variable caching of the list head pointer, making the code a bit easier to read. MFC after: 3 weeks
|
#
e79dd20d |
|
24-Mar-2008 |
Kip Macy <kmacy@FreeBSD.org> |
change inp_wlock_assert to inp_lock_assert
|
#
3d585327 |
|
23-Mar-2008 |
Kip Macy <kmacy@FreeBSD.org> |
Insulate inpcb consumers outside the stack from the lock type and offset within the pcb by adding accessor functions. Reviewed by: rwatson MFC after: 3 weeks
|
#
c2877015 |
|
17-Mar-2008 |
Robert Watson <rwatson@FreeBSD.org> |
Fix indentation for a closing brace in in_pcballoc(). MFC after: 3 days
|
#
1cf6e4f5 |
|
04-Mar-2008 |
Rui Paulo <rpaulo@FreeBSD.org> |
Change the default port range for outgoing connections by introducing IPPORT_EPHEMERALFIRST and IPPORT_EPHEMERALLAST with values 10000 and 65535 respectively. The rationale behind is that it makes the attacker's life more difficult if he/she wants to guess the ephemeral port range and also lowers the probability of a port colision (described in draft-ietf-tsvwg-port-randomization-01.txt). While there, remove code duplication in in_pcbbind_setup(). Submitted by: Fernando Gont <fernando at gont.com.ar> Approved by: njl (mentor) Reviewed by: silby, bms Discussed on: freebsd-net
|
#
0bffde27 |
|
22-Dec-2007 |
Robert Watson <rwatson@FreeBSD.org> |
When IPSEC fails to allocate policy state for an inpcb, and MAC is in use, free the MAC label on the inpcb before freeing the inpcb. MFC after: 3 days Submitted by: tanyong <tanyong at ercist dot iscas dot ac dot cn>, zhouzhouyi
|
#
30d239bc |
|
24-Oct-2007 |
Robert Watson <rwatson@FreeBSD.org> |
Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
|
#
4b421e2d |
|
07-Oct-2007 |
Mike Silbersack <silby@FreeBSD.org> |
Add FBSDID to all files in netinet so that people can more easily include file version information in bug reports. Approved by: re (kensmith)
|
#
b2630c29 |
|
02-Jul-2007 |
George V. Neville-Neil <gnn@FreeBSD.org> |
Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC option is now deprecated, as well as the KAME IPsec code. What was FAST_IPSEC is now IPSEC. Approved by: re Sponsored by: Secure Computing
|
#
2cb64cb2 |
|
01-Jul-2007 |
George V. Neville-Neil <gnn@FreeBSD.org> |
Commit IPv6 support for FAST_IPSEC to the tree. This commit includes only the kernel files, the rest of the files will follow in a second commit. Reviewed by: bz Approved by: re Supported by: Secure Computing
|
#
71498f30 |
|
12-Jun-2007 |
Bruce M Simpson <bms@FreeBSD.org> |
Import rewrite of IPv4 socket multicast layer to support source-specific and protocol-independent host mode multicast. The code is written to accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work. This change only pertains to FreeBSD's use as a multicast end-station and does not concern multicast routing; for an IGMPv3/MLDv2 router implementation, consider the XORP project. The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6, which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html Summary * IPv4 multicast socket processing is now moved out of ip_output.c into a new module, in_mcast.c. * The in_mcast.c module implements the IPv4 legacy any-source API in terms of the protocol-independent source-specific API. * Source filters are lazy allocated as the common case does not use them. They are part of per inpcb state and are covered by the inpcb lock. * struct ip_mreqn is now supported to allow applications to specify multicast joins by interface index in the legacy IPv4 any-source API. * In UDP, an incoming multicast datagram only requires that the source port matches the 4-tuple if the socket was already bound by source port. An unbound socket SHOULD be able to receive multicasts sent from an ephemeral source port. * The UDP socket multicast filter mode defaults to exclusive, that is, sources present in the per-socket list will be blocked from delivery. * The RFC 3678 userland functions have been added to libc: setsourcefilter, getsourcefilter, setipv4sourcefilter, getipv4sourcefilter. * Definitions for IGMPv3 are merged but not yet used. * struct sockaddr_storage is now referenced from <netinet/in.h>. It is therefore defined there if not already declared in the same way as for the C99 types. * The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF which are then interpreted as interface indexes) is now deprecated. * A patch for the Rhyolite.com routed in the FreeBSD base system is available in the -net archives. This only affects individuals running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces. * Make IPv6 detach path similar to IPv4's in code flow; functionally same. * Bump __FreeBSD_version to 700048; see UPDATING. This work was financially supported by another FreeBSD committer. Obtained from: p4://bms_netdev Submitted by: Wilbert de Graaf (original work) Reviewed by: rwatson (locking), silence from fenner, net@ (but with encouragement)
|
#
32f9753c |
|
11-Jun-2007 |
Robert Watson <rwatson@FreeBSD.org> |
Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present. Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c. We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h. Reviewed by: csjp Obtained from: TrustedBSD Project
|
#
54d642bb |
|
11-May-2007 |
Robert Watson <rwatson@FreeBSD.org> |
Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr protocol entry points using functions named proto_getsockaddr and proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr. While it's true that sockaddrs are allocated and set, the net effect is to retrieve (get) the socket address or peer address from a socket, not set it, so align names to that intent.
|
#
84ca8aa6 |
|
01-May-2007 |
Robert Watson <rwatson@FreeBSD.org> |
Remove unused pcbinfo arguments to in_setsockaddr() and in_setpeeraddr().
|
#
712fc218 |
|
30-Apr-2007 |
Robert Watson <rwatson@FreeBSD.org> |
Rename some fields of struct inpcbinfo to have the ipi_ prefix, consistent with the naming of other structure field members, and reducing improper grep matches. Clean up and comment structure fields in structure definition.
|
#
6493245d |
|
10-Apr-2007 |
Robert Watson <rwatson@FreeBSD.org> |
Add a new privilege, PRIV_NETINET_REUSEPORT, which will replace superuser checks to see whether bind() can reuse a port/address combination while it's already in use (for some definition of use).
|
#
03dc38a4 |
|
18-Feb-2007 |
Robert Watson <rwatson@FreeBSD.org> |
#ifdef INET6 printing of inpcb IPv6 addresses in DDB. Patch committed with minor adjustments. Submitted by: Florian C. Smeets <flo at kasimir dot com>
|
#
497057ee |
|
17-Feb-2007 |
Robert Watson <rwatson@FreeBSD.org> |
Add "show inpcb", "show tcpcb" DDB commands, which should come in handy for debugging sblock and other network panics.
|
#
08651e1f |
|
29-Dec-2006 |
John Baldwin <jhb@FreeBSD.org> |
Some whitespace nits and remove a few casts.
|
#
e3fd5ffd |
|
30-Nov-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Consistently use #ifdef INET6 rather than mixing and matching with #if defined(INET6). Don't comment the end of short #ifdef blocks. Comment cleanup. Line wrap.
|
#
acd3428b |
|
06-Nov-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
|
#
aed55708 |
|
22-Oct-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
|
#
2c857a9b |
|
06-Sep-2006 |
Gleb Smirnoff <glebius@FreeBSD.org> |
o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely bad under high load. For example with 40k sockets and 25k tcptw entries, connect() syscall can run for seconds. Debugging showed that it iterates the cycle millions times and purges thousands of tcptw entries at a time. Besides practical unusability this change is architecturally wrong. First, in_pcblookup_local() is used in connect() and bind() syscalls. No stale entries purging shouldn't be done here. Second, it is a layering violation. o Return back the tcptw purging cycle to tcp_timer_2msl_tw(), that was removed in rev. 1.78 by rwatson. The commit log of this revision tells nothing about the reason cycle was removed. Now we need this cycle, since major cleaner of stale tcptw structures is removed. o Disable probably necessary, but now unused tcp_twrecycleable() function. Reviewed by: ru
|
#
d915b280 |
|
18-Jul-2006 |
Stephan Uphoff <ups@FreeBSD.org> |
Fix race conditions on enumerating pcb lists by moving the initialization ( and where appropriate the destruction) of the pcb mutex to the init/finit functions of the pcb zones. This allows locking of the pcb entries and race condition free comparison of the generation count. Rearrange locking a bit to avoid extra locking operation to update the generation count in in_pcballoc(). (in_pcballoc now returns the pcb locked) I am planning to convert pcb list handling from a type safe to a reference count model soon. ( As this allows really freeing the PCBs) Reviewed by: rwatson@, mohans@ MFC after: 1 week
|
#
421d8aa6 |
|
29-Jun-2006 |
Bjoern A. Zeeb <bz@FreeBSD.org> |
Use INPLOOKUP_WILDCARD instead of just 1 more consistently. OKed by: rwatson (some weeks ago)
|
#
835d4b89 |
|
27-Jun-2006 |
Pawel Jakub Dawidek <pjd@FreeBSD.org> |
- Use suser_cred(9) instead of directly checking cr_uid. - Change the order of conditions to first verify that we actually need to check for privileges and then eventually check them. Reviewed by: rwatson
|
#
ad3a630f |
|
02-Jun-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Minor restyling and cleanup around ipport_tick(). MFC after: 1 month
|
#
7c5a8ab2 |
|
25-Apr-2006 |
Marcel Moolenaar <marcel@FreeBSD.org> |
In in_pcbdrop(), fix !INVARIANTS build.
|
#
10702a28 |
|
25-Apr-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP, into in_pcbdrop(). Expand logic to detach the inpcb from its bound address/port so that dropping a TCP connection releases the inpcb resource reservation, which since the introduction of socket/pcb reference count updates, has been persisting until the socket closed rather than being released implicitly due to prior freeing of the inpcb on TCP drop. MFC after: 3 months
|
#
602cc7f1 |
|
22-Apr-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Assert the inpcb lock when rehashing an inpcb. Improve consistency of style around some current assertions. MFC after: 3 months
|
#
6466b28a |
|
22-Apr-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Remove pcbinfo locking from in_setsockaddr() and in_setpeeraddr(); holding the inpcb lock is sufficient to prevent races in reading the address and port, as both the inpcb lock and pcbinfo lock are required to change the address/port. Improve consistency of spelling in assertions about inp != NULL. MFC after: 3 months
|
#
ae0e7143 |
|
03-Apr-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being NULL. We currently do allow this to happen, but may want to remove that possibility in the future. This case can occur when a socket is left open after TCP wraps up, and the timewait state is recycled. This will be cleaned up in the future. Found by: Kazuaki Oda <kaakun at highway dot ne dot jp> MFC after: 3 months
|
#
afa39e25 |
|
03-Apr-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Change inp_ppcb from caddr_t to void *, fix/remove associated related casts. Consistently use intotw() to cast inp_ppcb pointers to struct tcptw * pointers. Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb * pointers. Don't assign tp to the results to intotcpcb() during variable declation at the top of functions, as that is before the asserts relating to locking have been performed. Do this later in the function after appropriate assertions have run to allow that operation to be conisdered safe. MFC after: 3 months
|
#
4c7c478d |
|
01-Apr-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Break out in_pcbdetach() into two functions: - in_pcbdetach(), which removes the link between an inpcb and its socket. - in_pcbfree(), which frees a detached pcb. Unlike the previous in_pcbdetach(), neither of these functions will attempt to conditionally free the socket, as they are responsible only for managing in_pcb memory. Mirror these changes into in6_pcbdetach() by breaking it into in6_pcbdetach() and in6_pcbfree(). While here, eliminate undesired checks for NULL inpcb pointers in sockets, as we will now have as an invariant that sockets will always have valid so_pcb pointers. MFC after: 3 months
|
#
cf744713 |
|
16-Feb-2006 |
Andre Oppermann <andre@FreeBSD.org> |
In in_pcbconnect_setup() reduce code duplication and use ip_rtaddr() to find the outgoing interface for this connection. Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 2 weeks
|
#
d5e8a67e |
|
04-Feb-2006 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
Never select the PCB that has INP_IPV6 flag and is bound to :: if we have another PCB which is bound to 0.0.0.0. If a PCB has the INP_IPV6 flag, then we set its cost higher than IPv4 only PCBs. Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net> Obtained from: KAME MFC after: 1 week
|
#
136d4f1c |
|
21-Jan-2006 |
Robert Watson <rwatson@FreeBSD.org> |
Convert remaining functions to ANSI C function declarations; remove 'register' where present. MFC after: 1 week
|
#
de35559f |
|
18-Jul-2005 |
Robert Watson <rwatson@FreeBSD.org> |
Remove no-op spl references in in_pcb.c, since in_pcb locking has been basically complete for several years now. Update one spl comment to reference the locking strategy. MFC after: 3 days
|
#
fe6bfc37 |
|
01-Jun-2005 |
Robert Watson <rwatson@FreeBSD.org> |
Commit correct version of previous commit (in_pcb.c:1.164). Use the local variables as currently named. MFC after: 7 days
|
#
6b348152 |
|
01-Jun-2005 |
Robert Watson <rwatson@FreeBSD.org> |
Assert pcbinfo lock in in_pcbdisconnect() and in_pcbdetach(), as the global pcb lists are modified. MFC after: 7 days
|
#
29f2a6ec |
|
08-Apr-2005 |
Maxim Konovalov <maxim@FreeBSD.org> |
o Tweak the comment a bit.
|
#
e99971bf |
|
08-Apr-2005 |
Maxim Konovalov <maxim@FreeBSD.org> |
o Disable random port allocation when ip.portrange.first == ip.portrange.last and there is the only port for that because: a) it is not wise; b) it leads to a panic in the random ip port allocation code. In general we need to disable ip port allocation randomization if the last - first delta is ridiculous small. PR: kern/79342 Spotted by: Anjali Kulkarni Glanced at by: silby MFC after: 2 weeks
|
#
6ee79c59 |
|
23-Mar-2005 |
Maxim Konovalov <maxim@FreeBSD.org> |
o Document net.inet.ip.portrange.random* sysctls. o Correct a comment about random port allocation threshold implementation. Reviewed by: silby, ru MFC after: 3 days
|
#
797127a9 |
|
22-Feb-2005 |
Gleb Smirnoff <glebius@FreeBSD.org> |
We can make code simplier after last change. Noticed by: Andrew Thompson
|
#
914d092f |
|
22-Feb-2005 |
Gleb Smirnoff <glebius@FreeBSD.org> |
In in_pcbconnect_setup() remove a check that route points at loopback interface. Nobody have explained me sense of this check. It breaks connect() system call to a destination address which is loopback routed (e.g. blackholed). Reviewed by: silence on net@ MFC after: 2 weeks
|
#
c398230b |
|
06-Jan-2005 |
Warner Losh <imp@FreeBSD.org> |
/* -> /*- for license, minor formatting changes
|
#
5f311da2 |
|
01-Jan-2005 |
Mike Silbersack <silby@FreeBSD.org> |
Port randomization leads to extremely fast port reuse at high connection rates, which is causing problems for some users. To retain the security advantage of random ports and ensure correct operation for high connection rate users, disable port randomization during periods of high connection rates. Whenever the connection rate exceeds randomcps (10 by default), randomization will be disabled for randomtime (45 by default) seconds. These thresholds may be tuned via sysctl. Many thanks to Igor Sysoev, who proved the necessity of this change and tested many preliminary versions of the patch. MFC After: 20 seconds
|
#
81158452 |
|
18-Oct-2004 |
Robert Watson <rwatson@FreeBSD.org> |
Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>
|
#
48ac555d |
|
28-Sep-2004 |
Robert Watson <rwatson@FreeBSD.org> |
Assign so_pcb to NULL rather than 0 as it's a pointer. Spotted by: dwhite
|
#
4c2bb15a |
|
18-Aug-2004 |
Robert Watson <rwatson@FreeBSD.org> |
In in_pcbrehash(), do assert the inpcb lock as well as the pcbinfo lock.
|
#
27f74fd0 |
|
10-Aug-2004 |
Robert Watson <rwatson@FreeBSD.org> |
Assert the locks of inpcbinfo's and inpcb's passed into in_pcbconnect() and in_pcbconnect_setup(), since these functions frob the port and address state of inpcbs.
|
#
a4eb4405 |
|
28-Jul-2004 |
Yaroslav Tykhiy <ytykhiy@gmail.com> |
Disallow a particular kind of port theft described by the following scenario: Alice is too lazy to write a server application in PF-independent manner. Therefore she knocks up the server using PF_INET6 only and allows the IPv6 socket to accept mapped IPv4 as well. An evil hacker known on IRC as cheshire_cat has an account in the same system. He starts a process listening on the same port as used by Alice's server, but in PF_INET. As a consequence, cheshire_cat will distract all IPv4 traffic supposed to go to Alice's server. Such sort of port theft was initially enabled by copying the code that implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for the implied case of the same owner for both connections. After this change, the above scenario will be impossible. In the same setting, the user who attempts to start his server last will get EADDRINUSE. Of course, using IPv4 mapped to IPv6 leads to security complications in the first place, but there is no reason to make it even more unsafe. This change doesn't apply to KAME since it affects a FreeBSD-specific part of the code. It doesn't modify the out-of-box behaviour of the TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default. MFC after: 1 month
|
#
56f21b9d |
|
26-Jul-2004 |
Colin Percival <cperciva@FreeBSD.org> |
Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb
|
#
ef14c369 |
|
16-Jun-2004 |
Maxim Konovalov <maxim@FreeBSD.org> |
o connect(2): if there is no a route to the destination do not pick up the first local ip address for the source ip address, return ENETUNREACH instead. Submitted by: Gleb Smirnoff Reviewed by: -current (silence)
|
#
310e7ceb |
|
12-Jun-2004 |
Robert Watson <rwatson@FreeBSD.org> |
Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so): - Hold socket lock over calls to MAC entry points reading or manipulating socket labels. - Assert socket lock in MAC entry point implementations. - When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
|
#
395a08c9 |
|
12-Jun-2004 |
Robert Watson <rwatson@FreeBSD.org> |
Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
|
#
4658dc83 |
|
20-May-2004 |
Yaroslav Tykhiy <ytykhiy@gmail.com> |
When checking for possible port theft, skip over a TCP inpcb unless it's in the closed or listening state (remote address == INADDR_ANY). If a TCP inpcb is in any other state, it's impossible to steal its local port or use it for port theft. And if there are both closed/listening and connected TCP inpcbs on the same localIP:port couple, the call to in_pcblookup_local() will find the former due to the design of that function. No objections raised in: -net, -arch MFC after: 1 month
|
#
6b2fc10b |
|
23-Apr-2004 |
Mike Silbersack <silby@FreeBSD.org> |
Wrap two long lines in the previous commit.
|
#
174624e0 |
|
22-Apr-2004 |
Mike Silbersack <silby@FreeBSD.org> |
Take out an unneeded variable I forgot to remove in the last commit, and make two small whitespace fixes so that diffs vs rev 1.142 are minimal.
|
#
6ac48b74 |
|
22-Apr-2004 |
Mike Silbersack <silby@FreeBSD.org> |
Simplify random port allocation, and add net.inet.ip.portrange.randomized, which can be used to turn off randomized port allocation if so desired. Requested by: alfred
|
#
6dd946b3 |
|
20-Apr-2004 |
Mike Silbersack <silby@FreeBSD.org> |
Switch from using sequential to random ephemeral port allocation, implementation taken directly from OpenBSD. I've resisted committing this for quite some time because of concern over TIME_WAIT recycling breakage (sequential allocation ensures that there is a long time before ports are recycled), but recent testing has shown me that my fears were unwarranted.
|
#
f36cfd49 |
|
07-Apr-2004 |
Warner Losh <imp@FreeBSD.org> |
Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
|
#
30a4ab08 |
|
06-Apr-2004 |
Bruce Evans <bde@FreeBSD.org> |
Fixed misspelling of IPPORT_MAX as USHRT_MAX. Don't include <sys/limits.h> to implement this mistake. Fixed some nearby style bugs (initialization in declaration, misformatting of this initialization, missing blank line after the declaration, and comparision of the non-boolean result of the initialization with 0 using "!". In KNF, "!" is not even used to compare booleans with 0).
|
#
b0330ed9 |
|
27-Mar-2004 |
Pawel Jakub Dawidek <pjd@FreeBSD.org> |
Reduce 'td' argument to 'cred' (struct ucred) argument in those functions: - in_pcbbind(), - in_pcbbind_setup(), - in_pcbconnect(), - in_pcbconnect_setup(), - in6_pcbbind(), - in6_pcbconnect(), - in6_pcbsetport(). "It should simplify/clarify things a great deal." --rwatson Requested by: rwatson Reviewed by: rwatson, ume
|
#
6823b823 |
|
27-Mar-2004 |
Pawel Jakub Dawidek <pjd@FreeBSD.org> |
Remove unused argument. Reviewed by: ume
|
#
8da601df |
|
25-Mar-2004 |
Pawel Jakub Dawidek <pjd@FreeBSD.org> |
Remove unused function. It was used in FreeBSD 4.x, but now we're using cr_canseesocket().
|
#
846840ba |
|
09-Mar-2004 |
Robert Watson <rwatson@FreeBSD.org> |
Scrub unused variable zeroin_addr.
|
#
548c676b |
|
13-Jan-2004 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
do not deref freed pointer Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net> Reviewed by: itojun
|
#
0cfbbe3b |
|
26-Nov-2003 |
Andre Oppermann <andre@FreeBSD.org> |
Make sure all uses of stack allocated struct route's are properly zeroed. Doing a bzero on the entire struct route is not more expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst. Reviewed by: sam (mentor) Approved by: re (scottl)
|
#
5bd311a5 |
|
25-Nov-2003 |
Sam Leffler <sam@FreeBSD.org> |
Split the "inp" mutex class into separate classes for each of divert, raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness complaints. Reviewed by: rwatson Approved by: re (rwatson)
|
#
1f831750 |
|
22-Nov-2003 |
Thomas Moestl <tmm@FreeBSD.org> |
bzero() the the sockaddr used for the destination address for rtalloc_ign() in in_pcbconnect_setup() before it is filled out. Otherwise, stack junk would be left in sin_zero, which could cause host routes to be ignored because they failed the comparison in rn_match(). This should fix the wrong source address selection for connect() to 127.0.0.1, among other things. Reviewed by: sam Approved by: re (rwatson)
|
#
97d8d152 |
|
20-Nov-2003 |
Andre Oppermann <andre@FreeBSD.org> |
Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
|
#
a557af22 |
|
17-Nov-2003 |
Robert Watson <rwatson@FreeBSD.org> |
Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer. This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check. For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update. Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy. Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
#
f7bbe2c0 |
|
12-Nov-2003 |
Sam Leffler <sam@FreeBSD.org> |
add missing inpcb lock before call to tcp_twclose (which reclaims the inpcb) Supported by: FreeBSD Foundation
|
#
1b73ca0b |
|
12-Nov-2003 |
Sam Leffler <sam@FreeBSD.org> |
o reorder some locking asserts to reflect the order of the locks o correct a read-lock assert in in_pcblookup_local that should be a write-lock assert (since time wait close cleanups may alter state) Supported by: FreeBSD Foundation
|
#
3ab2096b |
|
10-Nov-2003 |
Ian Dowse <iedowse@FreeBSD.org> |
In in_pcbconnect_setup(), don't use the cached inp->inp_route unless it is marked as RTF_UP. This appears to fix a crash that was sometimes triggered when dhclient(8) tried to send a packet after an interface had been detatched. Reviewed by: sam
|
#
59daba27 |
|
08-Nov-2003 |
Sam Leffler <sam@FreeBSD.org> |
add locking assertions Supported by: FreeBSD Foundation
|
#
0f9ade71 |
|
04-Nov-2003 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
- cleanup SP refcnt issue. - share policy-on-socket for listening socket. - don't copy policy-on-socket at all. secpolicy no longer contain spidx, which saves a lot of memory. - deep-copy pcb policy if it is an ipsec policy. assign ID field to all SPD entries. make it possible for racoon to grab SPD entry on pcb. - fixed the order of searching SA table for packets. - fixed to get a security association header. a mode is always needed to compare them. - fixed that the incorrect time was set to sadb_comb_{hard|soft}_usetime. - disallow port spec for tunnel mode policy (as we don't reassemble). - an user can define a policy-id. - clear enc/auth key before freeing. - fixed that the kernel crashed when key_spdacquire() was called because key_spdacquire() had been implemented imcopletely. - preparation for 64bit sequence number. - maintain ordered list of SA, based on SA id. - cleanup secasvar management; refcnt is key.c responsibility; alloc/free is keydb.c responsibility. - cleanup, avoid double-loop. - use hash for spi-based lookup. - mark persistent SP "persistent". XXX in theory refcnt should do the right thing, however, we have "spdflush" which would touch all SPs. another solution would be to de-register persistent SPs from sptree. - u_short -> u_int16_t - reduce kernel stack usage by auto variable secasindex. - clarify function name confusion. ipsec_*_policy -> ipsec_*_pcbpolicy. - avoid variable name confusion. (struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct secpolicy *) - count number of ipsec encapsulations on ipsec4_output, so that we can tell ip_output() how to handle the packet further. - When the value of the ul_proto is ICMP or ICMPV6, the port field in "src" of the spidx specifies ICMP type, and the port field in "dst" of the spidx specifies ICMP code. - avoid from applying IPsec transport mode to the packets when the kernel forwards the packets. Tested by: nork Obtained from: KAME
|
#
96af9ea5 |
|
01-Nov-2003 |
Mike Silbersack <silby@FreeBSD.org> |
- Add a new function tcp_twrecycleable, which tells us if the ISN which we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled. - Have in_pcblookup_local use tcp_twrecycleable to determine if time_Wait sockets which are hogging local ports can be safely freed. This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.
|
#
9c63e9db |
|
30-Oct-2003 |
Sam Leffler <sam@FreeBSD.org> |
Overhaul routing table entry cleanup by introducing a new rtexpunge routine that takes a locked routing table reference and removes all references to the entry in the various data structures. This eliminates instances of recursive locking and also closes races where the lock on the entry had to be dropped prior to calling rtrequest(RTM_DELETE). This also cleans up confusion where the caller held a reference to an entry that might have been reclaimed (and in some cases used that reference). Supported by: FreeBSD Foundation
|
#
d1dd20be |
|
03-Oct-2003 |
Sam Leffler <sam@FreeBSD.org> |
Locking for updates to routing table entries. Each rtentry gets a mutex that covers updates to the contents. Note this is separate from holding a reference and/or locking the routing table itself. Other/related changes: o rtredirect loses the final parameter by which an rtentry reference may be returned; this was never used and added unwarranted complexity for locking. o minor style cleanups to routing code (e.g. ansi-fy function decls) o remove the logic to bump the refcnt on the parent of cloned routes, we assume the parent will remain as long as the clone; doing this avoids a circularity in locking during delete o convert some timeouts to MPSAFE callouts Notes: 1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level applications cannot/do-no know about mutex's. Doing this requires that the mutex be the last element in the structure. A better solution is to introduce an externalized version of struct rtentry but this is a major task because of the intertwining of rtentry and other data structures that are visible to user applications. 2. There are known LOR's that are expected to go away with forthcoming work to eliminate many held references. If not these will be resolved prior to release. 3. ATM changes are untested. Sponsored by: FreeBSD Foundation Obtained from: BSD/OS (partly)
|
#
8b149b51 |
|
07-Aug-2003 |
John Baldwin <jhb@FreeBSD.org> |
Consistently use the BSD u_int and u_short instead of the SYSV uint and ushort. In most of these files, there was a mixture of both styles and this change just makes them self-consistent. Requested by: bde (kern_ktrace.c)
|
#
104a9b7e |
|
29-Apr-2003 |
Alexander Kabaev <kan@FreeBSD.org> |
Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h> Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
|
#
b0d22693 |
|
20-Feb-2003 |
Crist J. Clark <cjc@FreeBSD.org> |
The ancient and outdated concept of "privileged ports" in UNIX-type OSes has probably caused more problems than it ever solved. Allow the user to retire the old behavior by specifying their own privileged range with, net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1 net.inet.ip.portrange.reservedlo default = 0 Now you can run that webserver without ever needing root at all. Or just imagine, an ftpd that can really drop privileges, rather than just set the euid, and still do PORT data transfers from 20/tcp. Two edge cases to note, # sysctl net.inet.ip.portrange.reservedhigh=0 Opens all ports to everyone, and, # sysctl net.inet.ip.portrange.reservedhigh=65535 Locks all network activity to root only (which could actually have been achieved before with ipfw(8), but is somewhat more complicated). For those who stick to the old religion that 0-1023 belong to root and root alone, don't touch the knobs (or even lock them by raising securelevel(8)), and nothing changes.
|
#
340c35de |
|
19-Feb-2003 |
Jonathan Lemon <jlemon@FreeBSD.org> |
Add a TCP TIMEWAIT state which uses less space than a fullblown TCP control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs
|
#
a163d034 |
|
18-Feb-2003 |
Warner Losh <imp@FreeBSD.org> |
Back out M_* changes, per decision of the TRB. Approved by: trb
|
#
3dc7ebf9 |
|
12-Feb-2003 |
Jeffrey Hsu <hsu@FreeBSD.org> |
in_pcbnotifyall() requires an exclusive protocol lock for notify functions which modify the connection list, namely, tcp_notify().
|
#
28a34902 |
|
29-Jan-2003 |
Sam Leffler <sam@FreeBSD.org> |
remove the restriction on build a kernel with FAST_IPSEC and INET6; you still don't want to use the two together, but it's ok to have them in the same kernel (the problem that initiated this bandaid has long since been fixed)
|
#
44956c98 |
|
21-Jan-2003 |
Alfred Perlstein <alfred@FreeBSD.org> |
Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
|
#
9c0a8ace |
|
08-Nov-2002 |
Sam Leffler <sam@FreeBSD.org> |
temporarily disallow FAST_IPSEC and INET6 to avoid potential panics; will correct this before 5.0 release
|
#
5200e00e |
|
21-Oct-2002 |
Ian Dowse <iedowse@FreeBSD.org> |
Replace in_pcbladdr() with a more generic inner subroutine for in_pcbconnect() called in_pcbconnect_setup(). This version performs all of the functions of in_pcbconnect() except for the final committing of changes to the PCB. In the case of an EADDRINUSE error it can also provide to the caller the PCB of the duplicate connection, avoiding an extra in_pcblookup_hash() lookup in tcp_connect(). This change will allow the "temporary connect" hack in udp_output() to be removed and is part of the preparation for adding the IP_SENDSRCADDR control message. Discussed on: -net Approved by: re
|
#
4b932371 |
|
20-Oct-2002 |
Ian Dowse <iedowse@FreeBSD.org> |
Split out most of the logic from in_pcbbind() into a new function called in_pcbbind_setup() that does everything except commit the changes to the PCB. There should be no functional change here, but in_pcbbind_setup() will be used by the soon-to-appear IP_SENDSRCADDR control message implementation to check or allocate the source address and port. Discussed on: -net Approved by: re
|
#
b9234faf |
|
15-Oct-2002 |
Sam Leffler <sam@FreeBSD.org> |
Tie new "Fast IPsec" code into the build. This involves the usual configuration stuff as well as conditional code in the IPv4 and IPv6 areas. Everything is conditional on FAST_IPSEC which is mutually exclusive with IPSEC (KAME IPsec implmentation). As noted previously, don't use FAST_IPSEC with INET6 at the moment. Reviewed by: KAME, rwatson Approved by: silence Supported by: Vernier Networks
|
#
26ef6ac4 |
|
21-Aug-2002 |
Don Lewis <truckman@FreeBSD.org> |
Create new functions in_sockaddr(), in6_sockaddr(), and in6_v4mapsin6_sockaddr() which allocate the appropriate sockaddr_in* structure and initialize it with the address and port information passed as arguments. Use calls to these new functions to replace code that is replicated multiple times in in_setsockaddr(), in_setpeeraddr(), in6_setsockaddr(), in6_setpeeraddr(), in6_mapped_sockaddr(), and in6_mapped_peeraddr(). Inline COMMON_END in tcp_usr_accept() so that we can call in_sockaddr() with temporary copies of the address and port after the PCB is unlocked. Fix the lock violation in tcp6_usr_accept() (caused by calling MALLOC() inside in6_mapped_peeraddr() while the PCB is locked) by changing the implementation of tcp6_usr_accept() to match tcp_usr_accept(). Reviewed by: suz
|
#
eccb7001 |
|
25-Jul-2002 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
cleanup usage of ip6_mapped_addr_on and ip6_v6only. now, ip6_mapped_addr_on is unified into ip6_v6only. MFC after: 1 week
|
#
3ce144ea |
|
14-Jun-2002 |
Jeffrey Hsu <hsu@FreeBSD.org> |
Notify functions can destroy the pcb, so they have to return an indication of whether this happenned so the calling function knows whether or not to unlock the pcb. Submitted by: Jennifer Yang (yangjihui@yahoo.com) Bug reported by: Sid Carter (sidcarter@symonds.net)
|
#
3cfcc388 |
|
11-Jun-2002 |
Jeffrey Hsu <hsu@FreeBSD.org> |
Fix typo where INP_INFO_RLOCK should be INP_INFO_RUNLOCK. Submitted by: tegge, jlemon Prefer LIST_FOREACH macro. Submitted by: jlemon
|
#
f76fcf6d |
|
10-Jun-2002 |
Jeffrey Hsu <hsu@FreeBSD.org> |
Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>
|
#
4cc20ab1 |
|
31-May-2002 |
Seigo Tanimura <tanimura@FreeBSD.org> |
Back out my lats commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
|
#
243917fe |
|
19-May-2002 |
Seigo Tanimura <tanimura@FreeBSD.org> |
Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
|
#
ad278afd |
|
09-Apr-2002 |
John Baldwin <jhb@FreeBSD.org> |
Change the first argument of prison_xinpcb() to be a thread pointer instead of a proc pointer so that prison_xinpcb() can use td_ucred.
|
#
44731cab |
|
01-Apr-2002 |
John Baldwin <jhb@FreeBSD.org> |
Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
|
#
9e5a5ed4 |
|
21-Mar-2002 |
Mike Silbersack <silby@FreeBSD.org> |
Change the ephemeral port range from 1024-5000 to 49152-65535. This increases the number of concurrent outgoing connections from ~4000 to ~16000. Other OSes (Solaris, OS X, NetBSD) and many other NAT products have already made this change without ill effects, so we should not run into any problems. MFC after: 1 week
|
#
69c2d429 |
|
19-Mar-2002 |
Jeff Roberson <jeff@FreeBSD.org> |
Switch vm_zone.h with uma.h. Change over to uma interfaces.
|
#
4d77a549 |
|
19-Mar-2002 |
Alfred Perlstein <alfred@FreeBSD.org> |
Remove __P.
|
#
a854ed98 |
|
27-Feb-2002 |
John Baldwin <jhb@FreeBSD.org> |
Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
|
#
a4a6e773 |
|
21-Jan-2002 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
- Check the address family of the destination cached in a PCB. - Clear the cached destination before getting another cached route. Otherwise, garbage in the padding space (which might be filled in if it was used for IPv4) could annoy rtalloc. Obtained from: KAME
|
#
eaa6d8ef |
|
12-Dec-2001 |
Jonathan Lemon <jlemon@FreeBSD.org> |
Minor style fixes.
|
#
01137630 |
|
03-Dec-2001 |
Robert Watson <rwatson@FreeBSD.org> |
o Introduce pr_mtx into struct prison, providing protection for the mutable contents of struct prison (hostname, securelevel, refcount, pr_linux, ...) o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/ so as to enforce these protections, in particular, in kern_mib.c protection sysctl access to the hostname and securelevel, as well as kern_prot.c access to the securelevel for access control purposes. o Rewrite linux emulator abstractions for accessing per-jail linux mib entries (osname, osrelease, osversion) so that they don't return a pointer to the text in the struct linux_prison, rather, a copy to an array passed into the calls. Likewise, update linprocfs to use these primitives. o Update in_pcb.c to always use prison_getip() rather than directly accessing struct prison. Reviewed by: jhb
|
#
be2ac88c |
|
21-Nov-2001 |
Jonathan Lemon <jlemon@FreeBSD.org> |
Introduce a syncache, which enables FreeBSD to withstand a SYN flood DoS in an improved fashion over the existing code. Reviewed by: silby (in a previous iteration) Sponsored by: DARPA, NAI Labs
|
#
b1e4abd2 |
|
16-Nov-2001 |
Matthew Dillon <dillon@FreeBSD.org> |
Give struct socket structures a ref counting interface similar to vnodes. This will hopefully serve as a base from which we can expand the MP code. We currently do not attempt to obtain any mutex or SX locks, but the door is open to add them when we nail down exactly how that part of it is going to work.
|
#
83103a73 |
|
05-Nov-2001 |
Andrew R. Reiter <arr@FreeBSD.org> |
- Fixes non-zero'd out sin_zero field problem so that the padding is used as it is supposed to be. Inspired by: PR #31704 Approved by: jdp Reviewed by: jhb, -net@
|
#
8071913d |
|
17-Oct-2001 |
Ruslan Ermilov <ru@FreeBSD.org> |
Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2. Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *'' as the argument. Pass rt_addrinfo all the way down to rtrequest1 and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now ``rt_addrinfo *'' instead of ``sockaddr *'' (almost noone is using it anyways). Benefit: the following command now works. Previously we needed two route(8) invocations, "add" then "change". # route add -inet6 default ::1 -ifp gif0 Remove unsafe typecast in rtrequest(), from ``rtentry *'' to ``sockaddr *''. It was introduced by 4.3BSD-Reno and never corrected. Obtained from: BSD/OS, NetBSD MFC after: 1 month PR: kern/28360
|
#
9a10980e |
|
28-Sep-2001 |
Jonathan Lemon <jlemon@FreeBSD.org> |
Centralize satosin(), sintosa() and ifatoia() macros in <netinet/in.h> Remove local definitions.
|
#
9494d596 |
|
25-Sep-2001 |
Brooks Davis <brooks@FreeBSD.org> |
Make faith loadable, unloadable, and clonable.
|
#
b40ce416 |
|
12-Sep-2001 |
Julian Elischer <julian@FreeBSD.org> |
KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
|
#
e43cc4ae |
|
04-Aug-2001 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
When running aplication joined multicast address, removing network card, and kill aplication. imo_membership[].inm_ifp refer interface pointer after removing interface. When kill aplication, release socket,and imo_membership. imo_membership use already not exist interface pointer. Then, kernel panic. PR: 29345 Submitted by: Inoue Yuichi <inoue@nd.net.fujitsu.co.jp> Obtained from: KAME MFC after: 3 days
|
#
13cf67f3 |
|
26-Jul-2001 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
move ipsec security policy allocation into in_pcballoc, before making pcbs available to the outside world. otherwise, we will see inpcb without ipsec security policy attached (-> panic() in ipsec.c). Obtained from: KAME MFC after: 3 days
|
#
8bf82a92 |
|
28-Jun-2001 |
Ruslan Ermilov <ru@FreeBSD.org> |
Backout CSRG revision 7.22 to this file (if in_losing notices an RTF_DYNAMIC route, it got freed twice). I am not sure what was the actual problem in 1992, but the current behavior is memory leak if PCB holds a reference to a dynamically created/modified routing table entry. (rt_refcnt>0 and we don't call rtfree().) My test bed was: 1. Set net.inet.tcp.msl to a low value (for test purposes), e.g., 5 seconds, to speed up the transition of TCP connection to a "closed" state. 2. Add a network route which causes ICMP redirect from the gateway. 3. ping(8) host H that matches this route; this creates RTF_DYNAMIC RTF_HOST route to H. (I was forced to use ICMP to cause gateway to generate ICMP host redirect, because gateway in question is a 4.2-STABLE system vulnerable to a problem that was fixed later in ip_icmp.c,v 1.39.2.6, and TCP packets with DF bit set were triggering this bug.) 4. telnet(1) to H 5. Block access to H with ipfw(8) 6. Send something in telnet(1) session; this causes EPERM, followed by an in_losing() call in a few seconds. 7. Delete ipfw(8) rule blocking access to H, and wait for TCP connection moving to a CLOSED state; PCB is freed. 8. Delete host route to H. 9. Watch with netstat(1) that `rttrash' increased. 10. Repeat steps 3-9, and watch `rttrash' increases. PR: kern/25421 MFC after: 2 weeks
|
#
33841545 |
|
10-Jun-2001 |
Hajimu UMEMOTO <ume@FreeBSD.org> |
Sync with recent KAME. This work was based on kame-20010528-freebsd43-snap.tgz and some critical problem after the snap was out were fixed. There are many many changes since last KAME merge. TODO: - The definitions of SADB_* in sys/net/pfkeyv2.h are still different from RFC2407/IANA assignment because of binary compatibility issue. It should be fixed under 5-CURRENT. - ip6po_m member of struct ip6_pktopts is no longer used. But, it is still there because of binary compatibility issue. It should be removed under 5-CURRENT. Reviewed by: itojun Obtained from: KAME MFC after: 3 weeks
|
#
ccd6f42d |
|
16-Mar-2001 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Fix a style(9) nit.
|
#
503d3c02 |
|
12-Mar-2001 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Correctly cleanup in case of failure to bind a pcb. PR: 25751 Submitted by: <unicorn@Forest.Od.UA>
|
#
234ff7c4 |
|
04-Mar-2001 |
Bosko Milekic <bmilekic@FreeBSD.org> |
During a flood, we don't call rtfree(), but we remove the entry ourselves. However, if the RTF_DELCLONE and RTF_WASCLONED condition passes, but the ref count is > 1, we won't decrement the count at all. This could lead to route entries never being deleted. Here, we call rtfree() not only if the initial two conditions fail, but also if the ref count is > 1 (and we therefore don't immediately delete the route, but let rtfree() handle it). This is an urgent MFC candidate. Thanks go to Mike Silbersack for the fix, once again. :-) Submitted by: Mike Silbersack <silby@silby.com>
|
#
970680fa |
|
28-Feb-2001 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Fix jails.
|
#
c693a045 |
|
26-Feb-2001 |
Jonathan Lemon <jlemon@FreeBSD.org> |
Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly. For TCP, verify that the sequence number in the ICMP packet falls within the tcp receive window before performing any actions indicated by the icmp packet. Clean up some layering violations (access to tcp internals from in_pcb)
|
#
d1c54148 |
|
22-Feb-2001 |
Jesper Skriver <jesper@FreeBSD.org> |
Redo the security update done in rev 1.54 of src/sys/netinet/tcp_subr.c and 1.84 of src/sys/netinet/udp_usrreq.c The changes broken down: - remove 0 as a wildcard for addresses and port numbers in src/sys/netinet/in_pcb.c:in_pcbnotify() - add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify all sessions with the specific remote address. - change - src/sys/netinet/udp_usrreq.c:udp_ctlinput() - src/sys/netinet/tcp_subr.c:tcp_ctlinput() to use in_pcbnotifyall() to notify multiple sessions, instead of using in_pcbnotify() with 0 as src address and as port numbers. - remove check for src port == 0 in - src/sys/netinet/tcp_subr.c:tcp_ctlinput() - src/sys/netinet/udp_usrreq.c:udp_ctlinput() as they are no longer needed. - move handling of redirects and host dead from in_pcbnotify() to udp_ctlinput() and tcp_ctlinput(), so they will call in_pcbnotifyall() to notify all sessions with the specific remote address. Approved by: jlemon Inspired by: NetBSD
|
#
91421ba2 |
|
20-Feb-2001 |
Robert Watson <rwatson@FreeBSD.org> |
o Move per-process jail pointer (p->pr_prison) to inside of the subject credential structure, ucred (cr->cr_prison). o Allow jail inheritence to be a function of credential inheritence. o Abstract prison structure reference counting behind pr_hold() and pr_free(), invoked by the similarly named credential reference management functions, removing this code from per-ABI fork/exit code. o Modify various jail() functions to use struct ucred arguments instead of struct proc arguments. o Introduce jailed() function to determine if a credential is jailed, rather than directly checking pointers all over the place. o Convert PRISON_CHECK() macro to prison_check() function. o Move jail() function prototypes to jail.h. o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the flag in the process flags field itself. o Eliminate that "const" qualifier from suser/p_can/etc to reflect mutex use. Notes: o Some further cleanup of the linux/jail code is still required. o It's now possible to consider resolving some of the process vs credential based permission checking confusion in the socket code. o Mutex protection of struct prison is still not present, and is required to protect the reference count plus some fields in the structure. Reviewed by: freebsd-arch Obtained from: TrustedBSD Project
|
#
c2221099 |
|
20-Feb-2001 |
Jesper Skriver <jesper@FreeBSD.org> |
Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify Forgotten by phk, when committing fix in kern/23986 PR: kern/23986 Reviewed by: phk Approved by: phk
|
#
37d40066 |
|
04-Feb-2001 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Another round of the <sys/queue.h> FOREACH transmogriffer. Created with: sed(1) Reviewed by: md5(1)
|
#
fc2ffbe6 |
|
04-Feb-2001 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
|
#
550b1518 |
|
23-Jan-2001 |
Wes Peters <wes@FreeBSD.org> |
When attempting to bind to an ephemeral port, if no such port is available, the error return should be EADDRNOTAVAIL rather than EAGAIN. PR: 14181 Submitted by: Dima Dorfman <dima@unixfreak.org> Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
|
#
a3ea6d41 |
|
21-Jan-2001 |
Dag-Erling Smørgrav <des@FreeBSD.org> |
First step towards an MP-safe zone allocator: - have zalloc() and zfree() always lock the vm_zone. - remove zalloci() and zfreei(), which are now redundant. Reviewed by: bmilekic, jasone
|
#
598ce68d |
|
26-Dec-2000 |
Assar Westerlund <assar@FreeBSD.org> |
include tcp header files to get the prototype for tcp_seq_vs_sess
|
#
442fad67 |
|
24-Dec-2000 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and to be configurable with respect to acting only in SYN or in all TCP states. PR: 23665 Submitted by: Jesper Skriver <jesper@skriver.dk>
|
#
7cc0979f |
|
08-Dec-2000 |
David Malone <dwmalone@FreeBSD.org> |
Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
|
#
e4bdf25d |
|
17-Sep-2000 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Properly jail UDP sockets. This is quite a bit more tricky than TCP. This fixes a !root userland panic, and some cases where the wrong interface was chosen for a jailed UDP socket. PR: 20167, 19839, 20946
|
#
e7f32693 |
|
21-Jul-2000 |
Jayanth Vijayaraghavan <jayanth@FreeBSD.org> |
When a connection is being dropped due to a listen queue overflow, delete the cloned route that is associated with the connection. This does not exhaust the routing table memory when the system is under a SYN flood attack. The route entry is not deleted if there is any prior information cached in it. Reviewed by: Peter Wemm,asmodai
|
#
686cdd19 |
|
04-Jul-2000 |
Jun-ichiro itojun Hagino <itojun@FreeBSD.org> |
sync with kame tree as of july00. tons of bug fixes/improvements. API changes: - additional IPv6 ioctls - IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8). (also syntax change)
|
#
77978ab8 |
|
04-Jul-2000 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Previous commit changing SYSCTL_HANDLER_ARGS violated KNF. Pointed out by: bde
|
#
82d9ae4e |
|
03-Jul-2000 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Style police catches up with rev 1.26 of src/sys/sys/sysctl.h: Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our sources: -sysctl_vm_zone SYSCTL_HANDLER_ARGS +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
|
#
ff079ca4 |
|
18-May-2000 |
Peter Wemm <peter@FreeBSD.org> |
Return ECONNRESET instead of EINVAL if the connection has been shot down as a result of a reset. Returning EINVAL in that case makes no sense at all and just confuses people as to what happened. It could be argued that we should save the original address somewhere so that getsockname() etc can tell us what it used to be so we know where the problem connection attempts are coming from.
|
#
75daea93 |
|
01-Apr-2000 |
Paul Saab <ps@FreeBSD.org> |
Try and make the kernel build again without INET6.
|
#
fdaf052e |
|
01-Apr-2000 |
Yoshinobu Inoue <shin@FreeBSD.org> |
Support per socket based IPv4 mapped IPv6 addr enable/disable control. Submitted by: ume
|
#
333aa64d |
|
21-Mar-2000 |
Brian Feldman <green@FreeBSD.org> |
in6_pcb.c: Remove a bogus (redundant, just weird, etc.) key_freeso(so). There are no consumers of it now, nor does it seem there ever will be. in6?_pcb.c: Add an if (inp->in6?p_sp != NULL) before the call to ipsec[46]_delete_pcbpolicy(inp). In low-memory conditions this can cause a crash because in6?_sp can be NULL...
|
#
6a800098 |
|
22-Dec-1999 |
Yoshinobu Inoue <shin@FreeBSD.org> |
IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
|
#
369dc8ce |
|
21-Dec-1999 |
Eivind Eklund <eivind@FreeBSD.org> |
Change incorrect NULLs to 0s
|
#
cfa1ca9d |
|
07-Dec-1999 |
Yoshinobu Inoue <shin@FreeBSD.org> |
udp IPv6 support, IPv6/IPv4 tunneling support in kernel, packet divert at kernel for IPv6/IPv4 translater daemon This includes queue related patch submitted by jburkhol@home.com. Submitted by: queue related patch from jburkhol@home.com Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
|
#
82cd038d |
|
21-Nov-1999 |
Yoshinobu Inoue <shin@FreeBSD.org> |
KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP for IPv6 yet) With this patch, you can assigne IPv6 addr automatically, and can reply to IPv6 ping. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
|
#
76429de4 |
|
05-Nov-1999 |
Yoshinobu Inoue <shin@FreeBSD.org> |
KAME related header files additions and merges. (only those which don't affect c source files so much) Reviewed by: cvs-committers Obtained from: KAME project
|
#
2f9a2132 |
|
18-Sep-1999 |
Brian Feldman <green@FreeBSD.org> |
Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually. Make a sonewconn3() which takes an extra argument (proc) so new sockets created with sonewconn() from a user's system call get the correct credentials, not just the parent's credentials.
|
#
c3aac50f |
|
27-Aug-1999 |
Peter Wemm <peter@FreeBSD.org> |
$Id$ -> $FreeBSD$
|
#
24ad8fe5 |
|
12-Jul-1999 |
Brian Feldman <green@FreeBSD.org> |
Correct a mistake in so_cred changes. In practice, I don't think that it would make a difference. However, my previous diff _did_ change the behavior in some way (not necessarily break it), so I'm fixing it. Found by: bde Submitted by: bde
|
#
5a903f8d |
|
25-Jun-1999 |
Pierre Beyssac <pb@FreeBSD.org> |
In in_pcbconnect(), check the return value from in_pcbbind() and exit on errors. If we don't, in_pcbrehash() is called without a preceeding in_pcbinshash(), causing a crash. There are apparently several conditions that could cause the crash; PR misc/12256 is only one of these. PR: misc/12256
|
#
f29be021 |
|
17-Jun-1999 |
Brian Feldman <green@FreeBSD.org> |
Reviewed by: the cast of thousands This is the change to struct sockets that gets rid of so_uid and replaces it with a much more useful struct pcred *so_cred. This is here to be able to do socket-level credential checks (i.e. IPFW uid/gid support, to be added to HEAD soon). Along with this comes an update to pidentd which greatly simplifies the code necessary to get a uid from a socket. Soon to come: a sysctl() interface to finding individual sockets' credentials.
|
#
75c13541 |
|
28-Apr-1999 |
Poul-Henning Kamp <phk@FreeBSD.org> |
This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no privisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
|
#
f711d546 |
|
27-Apr-1999 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Suser() simplification: 1: s/suser/suser_xxx/ 2: Add new function: suser(struct proc *), prototyped in <sys/proc.h>. 3: s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/ The remaining suser_xxx() calls will be scrutinized and dealt with later. There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce. More changes to the suser() API will come along with the "jail" code.
|
#
831a80b0 |
|
27-Jan-1999 |
Matthew Dillon <dillon@FreeBSD.org> |
Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
|
#
f1d19042 |
|
07-Dec-1998 |
Archie Cobbs <archie@FreeBSD.org> |
The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
|
#
52b65dbe |
|
17-Sep-1998 |
Bill Fenner <fenner@FreeBSD.org> |
Fix the bind security fix introduced in rev 1.38 to work with multicast: - Don't bother checking for conflicting sockets if we're binding to a multicast address. - Don't return an error if we're binding to INADDR_ANY, the conflicting socket is bound to INADDR_ANY, and the conflicting socket has SO_REUSEPORT set. PR: kern/7713
|
#
98271db4 |
|
15-May-1998 |
Garrett Wollman <wollman@FreeBSD.org> |
Convert socket structures to be type-stable and add a version number. Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs. Convert PF_LOCAL PCB structures to be type-stable and add a version number. Define an external format for infomation about socket structures and use it in several places. Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines. It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
|
#
4565cbea |
|
19-Apr-1998 |
Poul-Henning Kamp <phk@FreeBSD.org> |
According to: ftp://ftp.isi.edu/in-notes/iana/assignments/port-numbers port numbers are divided into three ranges: 0 - 1023 Well Known Ports 1024 - 49151 Registered Ports 49152 - 65535 Dynamic and/or Private Ports This patch changes the "local port range" from 40000-44999 to the range shown above (plus fix the comment in in_pcb.c). WARNING: This may have an impact on firewall configurations! PR: 5402 Reviewed by: phk Submitted by: Stephen J. Roznowski <sjr@home.net>
|
#
08637435 |
|
28-Mar-1998 |
Bruce Evans <bde@FreeBSD.org> |
Moved some #includes from <sys/param.h> nearer to where they are actually used.
|
#
8781d8e9 |
|
28-Mar-1998 |
Bruce Evans <bde@FreeBSD.org> |
Fixed style bugs (mostly) in previous commit.
|
#
3d4d47f3 |
|
24-Mar-1998 |
Garrett Wollman <wollman@FreeBSD.org> |
Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates its own zone; this is used particularly by TCP which allocates both inpcb and tcpcb in a single allocation. (Some hackery ensures that the tcpcb is reasonably aligned.) Also keep track of the number of pcbs of each type allocated, and keep a generation count (instance version number) for future use.
|
#
4049a042 |
|
01-Mar-1998 |
Guido van Rooij <guido@FreeBSD.org> |
Make sure that you can only bind a more specific address when it is done by the same uid. Obtained from: OpenBSD
|
#
c3229e05 |
|
27-Jan-1998 |
David Greenman <dg@FreeBSD.org> |
Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first. Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initialial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however. These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems. Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult. WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).
|
#
42fa505b |
|
24-Dec-1997 |
David Greenman <dg@FreeBSD.org> |
The spl fixes in in_setsockaddr and in_setpeeraddr that were meant to fix PR#3618 weren't sufficient since malloc() can block - allowing the net interrupts in and leading to the same problem mentioned in the PR (a panic). The order of operations has been changed so that this is no longer a problem. Needs to be brought into the 2.2.x branch. PR: 3618
|
#
90d0144c |
|
22-Dec-1997 |
Alexander Langer <alex@FreeBSD.org> |
Removed unnecessary setting of 'error' -- binding to a privileged port by a non-root user always returns EACCES.
|
#
55b211e3 |
|
28-Oct-1997 |
Bruce Evans <bde@FreeBSD.org> |
Removed unused #includes.
|
#
57bf258e |
|
16-Aug-1997 |
Garrett Wollman <wollman@FreeBSD.org> |
Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
|
#
fdc984f7 |
|
18-May-1997 |
Tor Egge <tegge@FreeBSD.org> |
Break apart initialization of s and inp from the declarations in in_setsockaddr and in_setpeeraddr. Suggested by: Justin T. Gibbs <gibbs@plutotech.com>
|
#
db112f04 |
|
18-May-1997 |
Tor Egge <tegge@FreeBSD.org> |
Disallow network interrupts while the address is found and copied in in_setsockaddr and in_setpeeraddr. Handle the case where the socket was disconnected before the network interrupts were disabled. Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
|
#
a29f300e |
|
27-Apr-1997 |
Garrett Wollman <wollman@FreeBSD.org> |
The long-awaited mega-massive-network-code- cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so they they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
|
#
ca98b82c |
|
02-Apr-1997 |
David Greenman <dg@FreeBSD.org> |
Reorganize elements of the inpcb struct to take better advantage of cache lines. Removed the struct ip proto since only a couple of chars were actually being used in it. Changed the order of compares in the PCB hash lookup to take advantage of partial cache line fills (on PPro). Discussed-with: wollman
|
#
fce002fd |
|
24-Mar-1997 |
Bruce Evans <bde@FreeBSD.org> |
Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include it when it is not used. In most cases, the reasons for including it went away when the special ioctl headers became self-sufficient.
|
#
ddd79a97 |
|
03-Mar-1997 |
David Greenman <dg@FreeBSD.org> |
Improved performance of hash algorithm while (hopefully) not reducing the quality of the hash distribution. This does not fix a problem dealing with poor distribution when using lots of IP aliases and listening on the same port on every one of them...some other day perhaps; fixing that requires significant code changes. The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>
|
#
6875d254 |
|
22-Feb-1997 |
Peter Wemm <peter@FreeBSD.org> |
Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
|
#
117bcae7 |
|
18-Feb-1997 |
Garrett Wollman <wollman@FreeBSD.org> |
Convert raw IP from mondo-switch-statement-from-Hell to pr_usrreqs. Collapse duplicates with udp_usrreq.c and tcp_usrreq.c (calling the generic routines in uipc_socket2.c and in_pcb.c). Calling sockaddr()_ or peeraddr() on a detached socket now traps, rather than harmlessly returning an error; this should never happen. Allow the raw IP buffer sizes to be controlled via sysctl.
|
#
1130b656 |
|
14-Jan-1997 |
Jordan K. Hubbard <jkh@FreeBSD.org> |
Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
#
59562606 |
|
13-Dec-1996 |
Garrett Wollman <wollman@FreeBSD.org> |
Convert the interface address and IP interface address structures to TAILQs. Fix places which referenced these for no good reason that I can see (the references remain, but were fixed to compile again; they are still questionable).
|
#
37bd2b30 |
|
29-Oct-1996 |
Peter Wemm <peter@FreeBSD.org> |
Fix braino on my part. When we have three different port ranges (default, "high" and "secure"), we can't use a single variable to track the most recently used port in all three ranges.. :-] This caused the next transient port to be allocated from the start of the range more often than it should.
|
#
6d6a026b |
|
07-Oct-1996 |
David Greenman <dg@FreeBSD.org> |
Improved in_pcblookuphash() to support wildcarding, and changed relavent callers of it to take advantage of this. This reduces new connection request overhead in the face of a large number of PCBs in the system. Thanks to David Filo <filo@yahoo.com> for suggesting this and providing a sample implementation (which wasn't used, but showed that it could be done). Reviewed by: wollman
|
#
321a2846 |
|
23-Aug-1996 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Mark sockets where the kernel chose the port# for. This can be used by netstat to behave more intelligently.
|
#
bbd42ad0 |
|
12-Aug-1996 |
Peter Wemm <peter@FreeBSD.org> |
Add two more portrange sysctls, which control the area of the below IPPORT_RESERVED that is used for selection when bind() is told to allocate a reserved port. Also, implement simple sanity checking for all the addresses set, to make it a little harder for a user/sysadmin to shoot themselves in the feet.
|
#
62d2e87e |
|
30-May-1996 |
Peter Wemm <peter@FreeBSD.org> |
More closely preserve the original operation of rresvport() when using IP_PORTRANGE_LOW.
|
#
2ee45d7d |
|
11-Mar-1996 |
David Greenman <dg@FreeBSD.org> |
Move or add #include <queue.h> in preparation for upcoming struct socket changes.
|
#
33b3ac06 |
|
22-Feb-1996 |
Peter Wemm <peter@FreeBSD.org> |
Make the default behavior of local port assignment match traditional systems (my last change did not mix well with some firewall configurations). As much as I dislike firewalls, this is one thing I I was not prepared to break by default.. :-) Allow the user to nominate one of three ranges of port numbers as candidates for selecting a local address to replace a zero port number. The ranges are selected via a setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &arg) call. The three ranges are: default, high (to bypass firewalls) and low (to get a port below 1024). The default and high port ranges are sysctl settable under sysctl net.inet.ip.portrange.* This code also fixes a potential deadlock if the system accidently ran out of local port addresses. It'd drop into an infinite while loop. The secure port selection (for root) should reduce overheads and increase reliability of rlogin/rlogind/rsh/rshd if they are modified to take advantage of it. Partly suggested by: pst Reviewed by: wollman
|
#
101f9fc8 |
|
19-Jan-1996 |
Peter Wemm <peter@FreeBSD.org> |
Change the default local address range for IP from 1024 through 5000 to 20000 through 30000. These numbers are used for local IP port numbers when an explicit address is not specified. The values are sysctl modifiable under: net.inet.ip.port_{first|last}_auto These numbers do not overlap with any known server addresses, without going above 32768 which are "negative" on some other implementations. 20000 through 30000 is 2.5 times larger than the old range, but some have suggested even that may not be enough... (gasp!) Setting a low address of 10000 should be plenty.. :-)
|
#
0312fbe9 |
|
14-Nov-1995 |
Poul-Henning Kamp <phk@FreeBSD.org> |
New style sysctl & staticize alot of stuff.
|
#
a98ca469 |
|
29-Oct-1995 |
Poul-Henning Kamp <phk@FreeBSD.org> |
Second batch of cleanup changes. This time mostly making a lot of things static and some unused variables here and there.
|
#
2469dd60 |
|
21-Sep-1995 |
Garrett Wollman <wollman@FreeBSD.org> |
Merge 4.4-Lite-2: use M_NOWAIT in in_pcballoc(), and return EACCES rather than EPERM on illegal attempt to bind a reserved port. Obtained from: 4.4BSD-Lite-2
|
#
efe4b0eb |
|
21-Sep-1995 |
Garrett Wollman <wollman@FreeBSD.org> |
Second try: get 4.4-Lite-2 into the source tree. The conflicts don't matter because none of our working source files are on the CSRG branch any more. Obtained from: 4.4BSD-Lite-2
|
#
9b2e5354 |
|
30-May-1995 |
Rodney W. Grimes <rgrimes@FreeBSD.org> |
Remove trailing whitespace.
|
#
0d7b7d3e |
|
03-May-1995 |
David Greenman <dg@FreeBSD.org> |
Changed in_pcblookuphash() to not automatically call in_pcblookup() if the lookup fails. Updated callers to deal with this. Call in_pcblookuphash instead of in_pcblookup() in in_pcbconnect; this improves performance of UDP output by about 17% in the standard case.
|
#
7bc4aca7 |
|
10-Apr-1995 |
David Greenman <dg@FreeBSD.org> |
Added splnet protections for PCB list manipulations and traversals.
|
#
15bd2b43 |
|
08-Apr-1995 |
David Greenman <dg@FreeBSD.org> |
Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash, and in_pcblookuphash.
|
#
b5e8ce9f |
|
16-Mar-1995 |
Bruce Evans <bde@FreeBSD.org> |
Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
|
#
2c8fe19f |
|
14-Mar-1995 |
David Greenman <dg@FreeBSD.org> |
pcb allocations are not always done on behalf of a process; it is not okay to wait.
|
#
3dbdc25c |
|
02-Mar-1995 |
David Greenman <dg@FreeBSD.org> |
Move exact match pcb's to the head of the list to improve lookup performance.
|
#
41f82abe |
|
15-Feb-1995 |
Garrett Wollman <wollman@FreeBSD.org> |
Transaction TCP support now standard. Hack away!
|
#
999f1343 |
|
08-Feb-1995 |
Garrett Wollman <wollman@FreeBSD.org> |
T/TCP changes to generic IP code. This is all ifdefed TTCP so should have no effect on most users for now. (Eventually, once this code is fully tested, the ifdefs will go away.)
|
#
3c4dd356 |
|
02-Aug-1994 |
David Greenman <dg@FreeBSD.org> |
Added $Id$
|
#
26f9a767 |
|
25-May-1994 |
Rodney W. Grimes <rgrimes@FreeBSD.org> |
The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
|
#
df8bae1d |
|
24-May-1994 |
Rodney W. Grimes <rgrimes@FreeBSD.org> |
BSD 4.4 Lite Kernel Sources
|