History log of /openbsd-current/sys/net/bpfdesc.h
Revision Date Author Comments
# 1.49 15-Aug-2024 dlg

add BIOCSETFNR, which is like BIOCSETF but doesn't reset the buffer or stats.

from Matthew Luckie <mjl@luckie.org.nz> via tech@
deraadt@ likes it.


Revision tags: OPENBSD_7_3_BASE OPENBSD_7_4_BASE OPENBSD_7_5_BASE
# 1.48 09-Mar-2023 dlg

add a timeout between capturing a packet and making the buffer readable.

before this, there were three reasons that a bpf read will finish.

the first is the obvious one: the bpf packet buffer in the kernel
fills up. by default this is about 32k, so if you're only capturing
a small packet every few seconds, it can take a long time
for the buffer to fill up before you can read them.

the second is if bpf has been configured to enable immediate mode with
ioctl(BIOCIMMEDIATE). this means that when any packet is written into
the bpf buffer, the buffer is immediately readable. this is fine
if the packet rate is low, but if the packet rate is high you don't
get the benefit of buffering many packets that bpf is supposed to
provide.

the third mechanism is if bpf has been configured with the BIOCSRTIMEOUT
ioctl, which sets a maximum wait time on a bpf read. BIOCSRTIMEOUT
means that a clock starts ticking down when a program (eg pflogd)
reads from bpf. when the clock reaches zero then the read returns
with whatever is in the bpf packet buffer. however, there could be
nothing in the buffer, and the read will still complete.

deraadt@ noticed this behaviour with pflogd. it wants packets logged
by pf to end up on disk in a timely fashion, but it's fine with
tolerating a bit of delay so it can take advantage of buffering
to amortise the cost of the reads per packet. it currently does
this with BIOCSRTIMEOUT set to half a second, which means it's
always waking up every half second even if there's nothing to log.

this diff adds BIOCSWTIMEOUT, which specifies a timeout between when
bpf first puts a packet in the capture buffer and when the buffer
becomes readable.

by default this wait timeout is infinite, meaning the buffer has
to be filled before it becomes readable. BIOCSWTIMEOUT can be set
to enable the new functionality. BIOCIMMEDIATE is turned into a
variation of BIOCSWTIMEOUT with the wait time set to 0, ie, wait 0
seconds between when a packet is written to the buffer and when the
buffer becomes readable. combining BIOCSWTIMEOUT and
BIOCIMMEDIATE simplifies the code a lot.

for pflogd, this means if there are no packets to capture, pflogd
won't wake up every half second to do nothing. however, when a
packet is logged by pf, bpf will wait another half second to see
if any more packets arrive (or the buffer fills up) before the read
fires.

discussed a lot with deraadt@ and sashan@
ok sashan@
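
The wake-up rules described in revision 1.48 can be sketched as a small userland model. This is only an illustration, not the kernel code: struct capture_model, model_readable() and WTIMEOUT_INFINITE are hypothetical names, and "buffer fills up" is simplified to a byte count reaching the buffer size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for "no wait timeout configured" (the default). */
#define WTIMEOUT_INFINITE UINT64_MAX

/* Hypothetical model of one descriptor; not the kernel's struct bpf_d. */
struct capture_model {
	uint64_t wtimeout_ns;	/* wait timeout; 0 models BIOCIMMEDIATE */
	uint64_t first_pkt_ns;	/* when the first packet was buffered */
	uint32_t buffered;	/* bytes currently in the capture buffer */
	uint32_t capacity;	/* size of the capture buffer in bytes */
};

/* Decide whether a read on the modelled buffer may complete at now_ns. */
static bool
model_readable(const struct capture_model *m, uint64_t now_ns)
{
	if (m->buffered == 0)
		return false;	/* nothing captured: the reader stays asleep */
	if (m->buffered >= m->capacity)
		return true;	/* the buffer filled up */
	if (m->wtimeout_ns == WTIMEOUT_INFINITE)
		return false;	/* default: only a full buffer is readable */

	/* The clock starts at the first buffered packet, not at read(). */
	assert(now_ns >= m->first_pkt_ns);
	return now_ns - m->first_pkt_ns >= m->wtimeout_ns;
}
```

In this model a half-second wait timeout keeps an idle reader asleep indefinitely, while a single captured packet makes the buffer readable half a second later, which is the pflogd behaviour the commit describes.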


Revision tags: OPENBSD_7_2_BASE
# 1.47 09-Jul-2022 visa

Unwrap klist from struct selinfo as this code no longer uses selwakeup().

OK jsg@


Revision tags: OPENBSD_7_1_BASE
# 1.46 17-Mar-2022 visa

Use the refcnt API in bpf.

OK sashan@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.45 21-Jan-2021 dlg

let vfs keep track of nonblocking state for us.

ok claudio@ mvs@


# 1.44 02-Jan-2021 cheloha

bpf(4): remove ticks

Change bd_rtout to a uint64_t of nanoseconds. Update the code in
bpfioctl() and bpfread() accordingly.

Add a local copy of nsecuptime() to make the diff smaller. This will
need to move to kern_tc.c if/when we have another user elsewhere in
the kernel.

Prompted by mpi@. With input from dlg@.

ok dlg@ mpi@ visa@


# 1.43 26-Dec-2020 cheloha

bpf(4): bpf_d struct: replace bd_rdStart member with bd_nreaders member

bd_rdStart is strange. It nominally represents the start of a read(2)
on a given bpf(4) descriptor, but there are several problems with it:

1. If there are multiple readers, the bd_rdStart is not set by subsequent
readers, so their timeout is screwed up. The read timeout should really
be tracked on a per-thread basis in bpfread().

2. We set bd_rdStart for poll(2), select(2), and kevent(2), even though
that makes no sense. We should not be setting bd_rdStart in bpfpoll()
or bpfkqfilter().

3. bd_rdStart is buggy. If ticks is 0 when the read starts then
bpf_catchpacket() won't wake up the reader. This is a problem
inherent to the design of bd_rdStart: it serves as both a boolean
and a scalar value, even though 0 is a valid value in the scalar
range.

So let's replace it with a better struct member. "bd_nreaders" is a
count of threads sleeping in bpfread(). It is incremented before a
thread goes to sleep in bpfread() and decremented when a thread wakes
up. If bd_nreaders is greater than zero when we reach bpf_catchpacket()
and fbuf is non-NULL we wake up all readers.

The read timeout, if any, is now tracked locally by the thread in
bpfread().

Unlike bd_rdStart, bpfpoll() and bpfkqfilter() don't touch
bd_nreaders.

Prompted by mpi@. Basic idea from dlg@. Lots of input from dlg@.

Tested by dlg@ with tcpdump(8) (blocking read) and flow-collector
(https://github.com/eait-itig/flow-collector, non-blocking read).

ok dlg@
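
Problem 3 from revision 1.43, and the reader count that replaces bd_rdStart, can be illustrated with a minimal sketch. The struct and function names below are hypothetical and only model the idea; the real fields live in struct bpf_d and are manipulated in bpfread() and bpf_catchpacket().

```c
#include <assert.h>
#include <stdbool.h>

/* Old idea: one field is both "a read is in progress" and "when it began". */
struct old_desc {
	int rd_start;		/* 0 doubles as "no reader" */
};

static void
old_begin_read(struct old_desc *d, int ticks)
{
	d->rd_start = ticks;	/* ticks == 0 is indistinguishable from idle */
}

static bool
old_should_wake(const struct old_desc *d)
{
	return d->rd_start != 0;
}

/* New idea: count sleeping readers; each keeps its timeout locally. */
struct new_desc {
	unsigned int nreaders;
};

static void
new_sleep(struct new_desc *d)
{
	d->nreaders++;		/* incremented before going to sleep */
}

static void
new_wakeup(struct new_desc *d)
{
	assert(d->nreaders > 0);
	d->nreaders--;		/* decremented when the thread wakes up */
}

/* Wake all readers only if someone sleeps and a free buffer exists. */
static bool
new_should_wake(const struct new_desc *d, bool have_fbuf)
{
	return d->nreaders > 0 && have_fbuf;
}
```

With the old layout a read that starts when the tick counter is 0 is never woken; the counter has no such ambiguous value.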


# 1.42 11-Dec-2020 cheloha

bpf(4): BIOCGRTIMEOUT, BIOCSRTIMEOUT: protect bd_rtout with bd_mtx

Reading and writing bd_rtout is not an atomic operation, so it needs
to be done under the per-descriptor mutex.

While here, start annotating locking in bpfdesc.h. There's lots more
to do on this front, but you have to start somewhere.

Tweaked by mpi@.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.41 13-May-2020 cheloha

bpf(4): separate descriptor non-blocking status from read timeout

If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.

This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.

With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.

With insight from dlg@ regarding the purpose of the read timeout.

ok dlg@
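
The coupling that revision 1.41 removes can be modelled as follows. This is a sketch under assumptions: the OLD_NONBLOCK sentinel and all names are hypothetical, chosen only to show one field carrying two states versus two independent fields (bd_rtout and bd_rnonblock in the commit).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sentinel: non-blocking mode stored in the timeout field. */
#define OLD_NONBLOCK (-1L)

/* Before: a single field holds either a timeout or the non-blocking mark. */
struct old_state {
	long rtout;
};

static void
old_set_nonblock(struct old_state *s)
{
	s->rtout = OLD_NONBLOCK;	/* clobbers any read timeout */
}

static void
old_set_timeout(struct old_state *s, long t)
{
	s->rtout = t;			/* silently leaves non-blocking mode */
}

/* After: the two states are stored separately and can coexist. */
struct new_state {
	long rtout;
	bool rnonblock;
};

static void
new_set_nonblock(struct new_state *s, bool on)
{
	s->rnonblock = on;		/* the timeout is left untouched */
}

static void
new_set_timeout(struct new_state *s, long t)
{
	s->rtout = t;			/* the non-blocking flag is left untouched */
}
```

In the first model the order of the two calls decides which setting survives; in the second both settings persist regardless of order.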


Revision tags: OPENBSD_6_7_BASE
# 1.40 02-Jan-2020 claudio

Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal().
OK visa@


# 1.39 21-Oct-2019 sashan

put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

Anton@ made problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()

Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list list and might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be
two processes running concurrently:

process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done

process B:
while true ; do tcpdump -i tun0 ; done

panic happens within few secs (Dell PowerEdge 710)

OK visa@, OK anton@


Revision tags: OPENBSD_6_6_BASE
# 1.38 18-May-2019 sashan

branches: 1.38.2;
BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us to make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOCSQUEUE

nothing uses them, and the implementation makes incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() for consistency and to allow
falling back to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when they're SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. it's blowing up in a bridge setup.

i'll debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
we don't need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, and a process can replace them at
any time. the pointers from the descriptor to those are also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

i've hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the use of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@
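
The difference between the two conditions in revision 1.19 shows up once the counter wraps. The sketch below uses uint32_t as a stand-in for the tick counter so that wrap-around is well defined in C; the kernel's actual type and the function names here are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "interval + start < ticks": the deadline itself can wrap past zero. */
static bool
expired_naive(uint32_t start, uint32_t interval, uint32_t ticks)
{
	return (uint32_t)(start + interval) < ticks;
}

/* "ticks - start > interval": elapsed time is computed modulo 2^32. */
static bool
expired_wrapsafe(uint32_t start, uint32_t interval, uint32_t ticks)
{
	return (uint32_t)(ticks - start) > interval;
}
```

For a read that starts six ticks before the counter wraps with an interval of ten, the naive form reports expiry after only three ticks because the computed deadline has wrapped to a small number, while the subtraction form still sees three elapsed ticks.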


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioctls.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/separate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level
source address in forged frames. from NetBSD. ok art@


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.48 09-Mar-2023 dlg

add a timeout between capturing a packet and making the buffer readable.

before this, there were three reasons that a bpf read will finish.

the first is the obvious one: the bpf packet buffer in the kernel
fills up. by default this is about 32k, so if you're only capturing
a small packet packet every few seconds, it can take a long time
for the buffer to fill up before you can read them.

the second is if bpf has been configured to enable immediate mode with
ioctl(BIOCIMMEDIATE). this means that when any packet is written into
the bpf buffer, the buffer is immediately readable. this is fine
if the packet rate is low, but if the packet rate is high you don't
get the benefit of buffering many packets that bpf is supposed to
provide.

the third mechanism is if bpf has been configured with the BIOCSRTIMEOUT
ioctl, which sets a maximum wait time on a bpf read. BIOCSRTIMEOUT
means than a clock starts ticking down when a program (eg pflogd)
reads from bpf. when the clock reaches zero then the read returns
with whatever is in the bpf packet buffer. however, there could be
nothing in the buffer, and the read will still complete.

deraadt@ noticed this behaviour with pflogd. it wants packets logged
by pf to end up on disk in a timely fashion, but it's fine with
tolerating a bit of delay so it can take advantatage of buffering
to amortise the cost of the reads per packet. it currently does
this with BIOCSRTIMEOUT set to half a second, which means it's
always waking up every half second even if there's nothing to log.

this diff adds BIOCSWTIMEOUT, which specifies a timeout from when
bpf first puts a packet in the capture buffer, and when the buffer
becomes readable.

by default this wait timeout is infinite, meaning the buffer has
to be filled before it becomes readable. BIOCSWTIMEOUT can be set
to enable the new functionality. BIOCIMMEDIATE is turned into a
variation of BIOCSWTIMEOUT with the wait time set to 0, ie, wait 0
seconds between when a packet is written to the buffer and when the
buffer becomes readable. combining BIOCSWTIMEOUT and
BIOCIMMEDIATE simplifies the code a lot.

for pflogd, this means if there are no packets to capture, pflogd
won't wake up every half second to do nothing. however, when a
packet is logged by pf, bpf will wait another half second to see
if any more packets arrive (or the buffer fills up) before the read
fires.

discussed a lot with deraadt@ and sashan@
ok sashan@


Revision tags: OPENBSD_7_2_BASE
# 1.47 09-Jul-2022 visa

Unwrap klist from struct selinfo as this code no longer uses selwakeup().

OK jsg@


Revision tags: OPENBSD_7_1_BASE
# 1.46 17-Mar-2022 visa

Use the refcnt API in bpf.

OK sashan@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.45 21-Jan-2021 dlg

let vfs keep track of nonblocking state for us.

ok claudio@ mvs@


# 1.44 02-Jan-2021 cheloha

bpf(4): remove ticks

Change bd_rtout to a uint64_t of nanoseconds. Update the code in
bpfioctl() and bpfread() accordingly.

Add a local copy of nsecuptime() to make the diff smaller. This will
need to move to kern_tc.c if/when we have another user elsewhere in
the kernel.

Prompted by mpi@. With input from dlg@.

ok dlg@ mpi@ visa@


# 1.43 26-Dec-2020 cheloha

bpf(4): bpf_d struct: replace bd_rdStart member with bd_nreaders member

bd_rdStart is strange. It nominally represents the start of a read(2)
on a given bpf(4) descriptor, but there are several problems with it:

1. If there are multiple readers, the bd_rdStart is not set by subsequent
readers, so their timeout is screwed up. The read timeout should really
be tracked on a per-thread basis in bpfread().

2. We set bd_rdStart for poll(2), select(2), and kevent(2), even though
that makes no sense. We should not be setting bd_rdStart in bpfpoll()
or bpfkqfilter().

3. bd_rdStart is buggy. If ticks is 0 when the read starts then
bpf_catchpacket() won't wake up the reader. This is a problem
inherent to the design of bd_rdStart: it serves as both a boolean
and a scalar value, even though 0 is a valid value in the scalar
range.

So let's replace it with a better struct member. "bd_nreaders" is a
count of threads sleeping in bpfread(). It is incremented before a
thread goes to sleep in bpfread() and decremented when a thread wakes
up. If bd_nreaders is greater than zero when we reach bpf_catchpacket()
and fbuf is non-NULL we wake up all readers.

The read timeout, if any, is now tracked locally by the thread in
bpfread().

Unlike bd_rdStart, bpfpoll() and bpfkqfilter() don't touch
bd_nreaders.

Prompted by mpi@. Basic idea from dlg@. Lots of input from dlg@.

Tested by dlg@ with tcpdump(8) (blocking read) and flow-collector
(https://github.com/eait-itig/flow-collector, non-blocking read).

ok dlg@


# 1.42 11-Dec-2020 cheloha

bpf(4): BIOCGRTIMEOUT, BIOCSRTIMEOUT: protect bd_rtout with bd_mtx

Reading and writing bd_rtout is not an atomic operation, so it needs
to be done under the per-descriptor mutex.

While here, start annotating locking in bpfdesc.h. There's lots more
to do on this front, but you have to start somewhere.

Tweaked by mpi@.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.41 13-May-2020 cheloha

bpf(4): separate descriptor non-blocking status from read timeout

If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.

This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.

With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.

With insight from dlg@ regarding the purpose of the read timeout.

ok dlg@


Revision tags: OPENBSD_6_7_BASE
# 1.40 02-Jan-2020 claudio

Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal().
OK visa@


# 1.39 21-Oct-2019 sashan

put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

Anton@ made problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()

Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list list and might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be
two processes running concurrently:

process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done

process B:
while true ; do tcpdump -i tun0 ; done

panic happens within few secs (Dell PowerEdge 710)

OK @visa, OK @anton


Revision tags: OPENBSD_6_6_BASE
# 1.38 18-May-2019 sashan

branches: 1.38.2;
BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOSGQUEUE

nothing uses them, and the implementation make incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with and allow to
fallback to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when theyre SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. its blowing up in a bridge setup.

ill debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
dont need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, ann a process can replace them at
any time. the pointers from the descriptor to those is also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

ive hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the user of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@
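
A minimal userland sketch of the two comparisons named in this commit,
for illustration only. It assumes an unsigned tick counter so that the
wrap is well defined in portable C; the kernel's ticks variable and the
actual bpf.c code are not reproduced here, and the function names
expired_naive(), expired_wrapsafe() and ticks_wrap_demo() are
hypothetical.

```c
#include <assert.h>
#include <limits.h>

/*
 * Naive form: "interval + start < ticks".
 * The sum can wrap while the counter has not (or the other way
 * around), so the comparison gives the wrong answer near the wrap.
 */
static int
expired_naive(unsigned int start, unsigned int interval, unsigned int now)
{
	return (start + interval < now);
}

/*
 * Wrap-tolerant form: "ticks - start > interval".
 * The unsigned subtraction yields the elapsed tick count modulo
 * UINT_MAX + 1, which stays correct across a single wrap.
 */
static int
expired_wrapsafe(unsigned int start, unsigned int interval, unsigned int now)
{
	return (now - start > interval);
}

/* Exercise both forms with a start time just before the wrap. */
static void
ticks_wrap_demo(void)
{
	unsigned int start = UINT_MAX - 5;
	unsigned int interval = 10;

	/* only 2 ticks elapsed: the naive form wrongly reports expiry */
	assert(expired_naive(start, interval, start + 2) == 1);
	assert(expired_wrapsafe(start, interval, start + 2) == 0);

	/* 20 ticks elapsed and the counter has wrapped: expired */
	assert(expired_wrapsafe(start, interval, start + 20) == 1);
}
```

Calling ticks_wrap_demo() from a test program runs the checks; the
point is only that the subtraction form measures elapsed time, while
the addition form compares two absolute values that may sit on opposite
sides of the wrap.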


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioctls.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/separate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level source address in forged frames. from NetBSD. ok art@


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.46 17-Mar-2022 visa

Use the refcnt API in bpf.

OK sashan@ bluhm@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.45 21-Jan-2021 dlg

let vfs keep track of nonblocking state for us.

ok claudio@ mvs@


# 1.44 02-Jan-2021 cheloha

bpf(4): remove ticks

Change bd_rtout to a uint64_t of nanoseconds. Update the code in
bpfioctl() and bpfread() accordingly.

Add a local copy of nsecuptime() to make the diff smaller. This will
need to move to kern_tc.c if/when we have another user elsewhere in
the kernel.

Prompted by mpi@. With input from dlg@.

ok dlg@ mpi@ visa@


# 1.43 26-Dec-2020 cheloha

bpf(4): bpf_d struct: replace bd_rdStart member with bd_nreaders member

bd_rdStart is strange. It nominally represents the start of a read(2)
on a given bpf(4) descriptor, but there are several problems with it:

1. If there are multiple readers, the bd_rdStart is not set by subsequent
readers, so their timeout is screwed up. The read timeout should really
be tracked on a per-thread basis in bpfread().

2. We set bd_rdStart for poll(2), select(2), and kevent(2), even though
that makes no sense. We should not be setting bd_rdStart in bpfpoll()
or bpfkqfilter().

3. bd_rdStart is buggy. If ticks is 0 when the read starts then
bpf_catchpacket() won't wake up the reader. This is a problem
inherent to the design of bd_rdStart: it serves as both a boolean
and a scalar value, even though 0 is a valid value in the scalar
range.

So let's replace it with a better struct member. "bd_nreaders" is a
count of threads sleeping in bpfread(). It is incremented before a
thread goes to sleep in bpfread() and decremented when a thread wakes
up. If bd_nreaders is greater than zero when we reach bpf_catchpacket()
and fbuf is non-NULL we wake up all readers.

The read timeout, if any, is now tracked locally by the thread in
bpfread().

Unlike bd_rdStart, bpfpoll() and bpfkqfilter() don't touch
bd_nreaders.

Prompted by mpi@. Basic idea from dlg@. Lots of input from dlg@.

Tested by dlg@ with tcpdump(8) (blocking read) and flow-collector
(https://github.com/eait-itig/flow-collector, non-blocking read).

ok dlg@


# 1.42 11-Dec-2020 cheloha

bpf(4): BIOCGRTIMEOUT, BIOCSRTIMEOUT: protect bd_rtout with bd_mtx

Reading and writing bd_rtout is not an atomic operation, so it needs
to be done under the per-descriptor mutex.

While here, start annotating locking in bpfdesc.h. There's lots more
to do on this front, but you have to start somewhere.

Tweaked by mpi@.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.41 13-May-2020 cheloha

bpf(4): separate descriptor non-blocking status from read timeout

If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.

This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.

With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.

With insight from dlg@ regarding the purpose of the read timeout.

ok dlg@


Revision tags: OPENBSD_6_7_BASE
# 1.40 02-Jan-2020 claudio

Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal().
OK visa@


# 1.39 21-Oct-2019 sashan

put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

Anton@ made problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()

Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list list and might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be
two processes running concurrently:

process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done

process B:
while true ; do tcpdump -i tun0 ; done

panic happens within few secs (Dell PowerEdge 710)

OK @visa, OK @anton


Revision tags: OPENBSD_6_6_BASE
# 1.38 18-May-2019 sashan

branches: 1.38.2;
BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOSGQUEUE

nothing uses them, and the implementation make incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with and allow to
fallback to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when theyre SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. its blowing up in a bridge setup.

ill debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
dont need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, ann a process can replace them at
any time. the pointers from the descriptor to those is also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

ive hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the user of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioclts.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/separate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level
source address in forged frames. from NetBSD. ok art@


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.45 21-Jan-2021 dlg

let vfs keep track of nonblocking state for us.

ok claudio@ mvs@


# 1.44 02-Jan-2021 cheloha

bpf(4): remove ticks

Change bd_rtout to a uint64_t of nanoseconds. Update the code in
bpfioctl() and bpfread() accordingly.

Add a local copy of nsecuptime() to make the diff smaller. This will
need to move to kern_tc.c if/when we have another user elsewhere in
the kernel.

Prompted by mpi@. With input from dlg@.

ok dlg@ mpi@ visa@
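
As a rough userland illustration of storing a timeout as a uint64_t of
nanoseconds, the sketch below converts a struct timeval (the type that
BIOCSRTIMEOUT takes) into nanoseconds. The helper name timeval_to_nsec
and its error convention are assumptions made for this example; this is
not the code from bpfioctl().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/*
 * Convert *tv to nanoseconds and store the result in *nsecs.
 * Returns 0 on success, or -1 if tv is negative, has an out-of-range
 * microsecond field, or would overflow a uint64_t.
 */
static int
timeval_to_nsec(const struct timeval *tv, uint64_t *nsecs)
{
	assert(tv != NULL && nsecs != NULL);

	if (tv->tv_sec < 0 || tv->tv_usec < 0 || tv->tv_usec >= 1000000)
		return (-1);

	/* Largest tv_sec for which sec * 1e9 + 999999000 still fits. */
	if ((uint64_t)tv->tv_sec >
	    (UINT64_MAX - 999999000ULL) / 1000000000ULL)
		return (-1);

	*nsecs = (uint64_t)tv->tv_sec * 1000000000ULL +
	    (uint64_t)tv->tv_usec * 1000ULL;
	return (0);
}
```

A half-second timeout, { .tv_sec = 0, .tv_usec = 500000 }, converts to
500000000 nanoseconds.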


# 1.43 26-Dec-2020 cheloha

bpf(4): bpf_d struct: replace bd_rdStart member with bd_nreaders member

bd_rdStart is strange. It nominally represents the start of a read(2)
on a given bpf(4) descriptor, but there are several problems with it:

1. If there are multiple readers, the bd_rdStart is not set by subsequent
readers, so their timeout is screwed up. The read timeout should really
be tracked on a per-thread basis in bpfread().

2. We set bd_rdStart for poll(2), select(2), and kevent(2), even though
that makes no sense. We should not be setting bd_rdStart in bpfpoll()
or bpfkqfilter().

3. bd_rdStart is buggy. If ticks is 0 when the read starts then
bpf_catchpacket() won't wake up the reader. This is a problem
inherent to the design of bd_rdStart: it serves as both a boolean
and a scalar value, even though 0 is a valid value in the scalar
range.

So let's replace it with a better struct member. "bd_nreaders" is a
count of threads sleeping in bpfread(). It is incremented before a
thread goes to sleep in bpfread() and decremented when a thread wakes
up. If bd_nreaders is greater than zero when we reach bpf_catchpacket()
and fbuf is non-NULL we wake up all readers.

The read timeout, if any, is now tracked locally by the thread in
bpfread().

Unlike bd_rdStart, bpfpoll() and bpfkqfilter() don't touch
bd_nreaders.

Prompted by mpi@. Basic idea from dlg@. Lots of input from dlg@.

Tested by dlg@ with tcpdump(8) (blocking read) and flow-collector
(https://github.com/eait-itig/flow-collector, non-blocking read).

ok dlg@
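
Problem 3 and the counter that replaces it can be illustrated with a
simplified userland sketch. The struct and function names below are
hypothetical, locking is left out entirely, and none of this is the
actual bpf.c code.

```c
#include <assert.h>

/* Old scheme (simplified): one field is both a flag and a timestamp. */
struct old_desc {
	int rd_start;	/* 0 means "no reader", but 0 is also a valid tick */
};

static int
old_has_reader(const struct old_desc *d)
{
	return (d->rd_start != 0);
}

/* New scheme (simplified): a plain count of sleeping readers. */
struct new_desc {
	unsigned int nreaders;
};

/* Called just before a reader goes to sleep. */
static void
reader_enter(struct new_desc *d)
{
	d->nreaders++;
}

/* Called when a reader wakes up. */
static void
reader_leave(struct new_desc *d)
{
	assert(d->nreaders > 0);
	d->nreaders--;
}

static int
new_has_reader(const struct new_desc *d)
{
	return (d->nreaders > 0);
}
```

If a read starts when the tick counter happens to be 0, the old scheme
stores 0 and old_has_reader() reports that nobody is waiting. The counter
has no such ambiguous value, and it also copes with several readers at
once.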


# 1.42 11-Dec-2020 cheloha

bpf(4): BIOCGRTIMEOUT, BIOCSRTIMEOUT: protect bd_rtout with bd_mtx

Reading and writing bd_rtout is not an atomic operation, so it needs
to be done under the per-descriptor mutex.

While here, start annotating locking in bpfdesc.h. There's lots more
to do on this front, but you have to start somewhere.

Tweaked by mpi@.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.41 13-May-2020 cheloha

bpf(4): separate descriptor non-blocking status from read timeout

If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.

This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.

With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.

With insight from dlg@ regarding the purpose of the read timeout.

ok dlg@
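
The coupling described above can be sketched as follows. This is a
simplified model, not the bpf.c code: the struct layouts and function
names are hypothetical, and the use of -1 as the "non-blocking" marker in
the old layout is an assumption made for the example.

```c
#include <stdint.h>

/* Before: a single field carries both pieces of state. */
struct desc_before {
	int64_t rtout;	/* > 0 timeout, 0 block forever, -1 non-blocking */
};

static void
before_set_nonblock(struct desc_before *d, int on)
{
	d->rtout = on ? -1 : 0;		/* clobbers any timeout */
}

static void
before_set_timeout(struct desc_before *d, int64_t t)
{
	d->rtout = t;			/* silently clears non-blocking */
}

/* After: the two states live in separate members. */
struct desc_after {
	uint64_t rtout;		/* read timeout, 0 means none */
	int rnonblock;		/* non-zero means non-blocking */
};

static void
after_set_nonblock(struct desc_after *d, int on)
{
	d->rnonblock = (on != 0);
}

static void
after_set_timeout(struct desc_after *d, uint64_t t)
{
	d->rtout = t;
}
```

With the "before" layout, setting a timeout of 500 and then enabling
non-blocking mode leaves rtout at -1, so the timeout is lost. With the
"after" layout both settings survive in either order.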


Revision tags: OPENBSD_6_7_BASE
# 1.40 02-Jan-2020 claudio

Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal().
OK visa@


# 1.39 21-Oct-2019 sashan

put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

Anton@ made problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()

Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list list and might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be
two processes running concurrently:

process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done

process B:
while true ; do tcpdump -i tun0 ; done

panic happens within a few secs (Dell PowerEdge 710)

OK visa@, OK anton@


Revision tags: OPENBSD_6_6_BASE
# 1.38 18-May-2019 sashan

branches: 1.38.2;
BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us to make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOCSQUEUE

nothing uses them, and the implementation makes incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with and allow to
fallback to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when they're SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. it's blowing up in a bridge setup.

i'll debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
don't need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, and a process can replace them at
any time. the pointers from the descriptor to those are also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

i've hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.43 26-Dec-2020 cheloha

bpf(4): bpf_d struct: replace bd_rdStart member with bd_nreaders member

bd_rdStart is strange. It nominally represents the start of a read(2)
on a given bpf(4) descriptor, but there are several problems with it:

1. If there are multiple readers, the bd_rdStart is not set by subsequent
readers, so their timeout is screwed up. The read timeout should really
be tracked on a per-thread basis in bpfread().

2. We set bd_rdStart for poll(2), select(2), and kevent(2), even though
that makes no sense. We should not be setting bd_rdStart in bpfpoll()
or bpfkqfilter().

3. bd_rdStart is buggy. If ticks is 0 when the read starts then
bpf_catchpacket() won't wake up the reader. This is a problem
inherent to the design of bd_rdStart: it serves as both a boolean
and a scalar value, even though 0 is a valid value in the scalar
range.

So let's replace it with a better struct member. "bd_nreaders" is a
count of threads sleeping in bpfread(). It is incremented before a
thread goes to sleep in bpfread() and decremented when a thread wakes
up. If bd_nreaders is greater than zero when we reach bpf_catchpacket()
and fbuf is non-NULL we wake up all readers.

The read timeout, if any, is now tracked locally by the thread in
bpfread().

Unlike bd_rdStart, bpfpoll() and bpfkqfilter() don't touch
bd_nreaders.

Prompted by mpi@. Basic idea from dlg@. Lots of input from dlg@.

Tested by dlg@ with tcpdump(8) (blocking read) and flow-collector
(https://github.com/eait-itig/flow-collector, non-blocking read).

ok dlg@
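
The difference between the two schemes can be sketched in userland C. This is only an illustrative model, not the kernel code: struct desc_model and every function name below are hypothetical, and only the idea (a reader counter checked together with a non-NULL fbuf) comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant parts of a bpf descriptor. */
struct desc_model {
	int	 nreaders;	/* threads currently asleep in read */
	void	*fbuf;		/* free buffer, NULL if there is none */
};

/* Old scheme: one int is both "a read is pending" and the start tick. */
static bool
rdstart_wants_wakeup(int rdstart)
{
	return (rdstart != 0);	/* a read that began at tick 0 is missed */
}

/* New scheme: a plain counter, so no value is overloaded. */
static void
reader_sleep(struct desc_model *d)
{
	d->nreaders++;
}

static void
reader_wakeup(struct desc_model *d)
{
	assert(d->nreaders > 0);
	d->nreaders--;
}

/* Wake readers only if someone is asleep and fbuf is non-NULL. */
static bool
nreaders_wants_wakeup(const struct desc_model *d)
{
	return (d->nreaders > 0 && d->fbuf != NULL);
}
```

In this model a read that starts at tick 0 is invisible to rdstart_wants_wakeup(), while the counter-based check does not depend on the clock value at all.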


# 1.42 11-Dec-2020 cheloha

bpf(4): BIOCGRTIMEOUT, BIOCSRTIMEOUT: protect bd_rtout with bd_mtx

Reading and writing bd_rtout is not an atomic operation, so it needs
to be done under the per-descriptor mutex.

While here, start annotating locking in bpfdesc.h. There's lots more
to do on this front, but you have to start somewhere.

Tweaked by mpi@.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.41 13-May-2020 cheloha

bpf(4): separate descriptor non-blocking status from read timeout

If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.

This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.

With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.

With insight from dlg@ regarding the purpose of the read timeout.

ok dlg@
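
A rough userland model of the change, for illustration only. The field names mirror bd_rtout and bd_rnonblock from the message above, but the structs, the setter functions, and the OLD_NONBLOCK sentinel are assumptions and are not taken from bpf.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed sentinel: how a single field might encode "non-blocking". */
#define OLD_NONBLOCK	(-1L)

/* Before: one field carries both the timeout and the non-blocking state. */
struct old_desc {
	long	rtout;
};

static void
old_set_nonblock(struct old_desc *d, bool on)
{
	d->rtout = on ? OLD_NONBLOCK : 0;	/* clobbers any timeout */
}

static void
old_set_timeout(struct old_desc *d, long t)
{
	d->rtout = t;				/* clobbers non-blocking */
}

/* After: the two states live in separate members. */
struct new_desc {
	long	rtout;		/* read timeout */
	bool	rnonblock;	/* non-blocking status */
};

static void
new_set_nonblock(struct new_desc *d, bool on)
{
	d->rnonblock = on;
}

static void
new_set_timeout(struct new_desc *d, long t)
{
	assert(t >= 0);
	d->rtout = t;
}
```

With the second layout, setting one state leaves the other untouched, which is the property the message describes.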


Revision tags: OPENBSD_6_7_BASE
# 1.40 02-Jan-2020 claudio

Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal().
OK visa@


# 1.39 21-Oct-2019 sashan

put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

Anton@ made problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()

Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list list and might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be
two processes running concurrently:

process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done

process B:
while true ; do tcpdump -i tun0 ; done

panic happens within few secs (Dell PowerEdge 710)

OK @visa, OK @anton


Revision tags: OPENBSD_6_6_BASE
# 1.38 18-May-2019 sashan

branches: 1.38.2;
BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us to make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOCSQUEUE

nothing uses them, and the implementation makes incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with and allow to
fallback to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when theyre SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. its blowing up in a bridge setup.

ill debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
dont need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, and a process can replace them at
any time. the pointers from the descriptor to those are also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

ive hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the use of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@
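
A small C sketch of why the second form tolerates wrapping. It is an illustration under stated assumptions: unsigned int is used so that wraparound is well defined in standard C, and the function names are hypothetical rather than anything in bpf.c.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Naive form: breaks once start + interval wraps past UINT_MAX. */
static bool
expired_naive(unsigned int start, unsigned int interval, unsigned int now)
{
	return (start + interval < now);
}

/* Wrap-tolerant form: elapsed time is computed modulo 2^N. */
static bool
expired_wrapsafe(unsigned int start, unsigned int interval, unsigned int now)
{
	return (now - start > interval);
}

void
wrap_example(void)
{
	unsigned int start = UINT_MAX - 5;	/* began just before the wrap */
	unsigned int interval = 10;

	/* 2 ticks elapsed: not expired, but the naive test claims it is. */
	assert(expired_naive(start, interval, UINT_MAX - 3));
	assert(!expired_wrapsafe(start, interval, UINT_MAX - 3));

	/* 12 ticks elapsed and the counter has wrapped to 6: expired. */
	assert(expired_wrapsafe(start, interval, 6));
}
```

The subtraction yields the elapsed tick count even across the wrap, as long as the real elapsed time is shorter than one full counter period.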


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioctls.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/separate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level
source address in forged frames. from NetBSD. ok art@


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.42 11-Dec-2020 cheloha

bpf(4): BIOCGRTIMEOUT, BIOCSRTIMEOUT: protect bd_rtout with bd_mtx

Reading and writing bd_rtout is not an atomic operation, so it needs
to be done under the per-descriptor mutex.

While here, start annotating locking in bpfdesc.h. There's lots more
to do on this front, but you have to start somewhere.

Tweaked by mpi@.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.41 13-May-2020 cheloha

bpf(4): separate descriptor non-blocking status from read timeout

If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.

This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.

With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.

With insight from dlg@ regarding the purpose of the read timeout.

ok dlg@


Revision tags: OPENBSD_6_7_BASE
# 1.40 02-Jan-2020 claudio

Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal().
OK visa@


# 1.39 21-Oct-2019 sashan

put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

Anton@ made problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()

Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list list and might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be
two processes running concurrently:

process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done

process B:
while true ; do tcpdump -i tun0 ; done

panic happens within few secs (Dell PowerEdge 710)

OK @visa, OK @anton


Revision tags: OPENBSD_6_6_BASE
# 1.38 18-May-2019 sashan

branches: 1.38.2;
BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOSGQUEUE

nothing uses them, and the implementation make incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with and allow to
fallback to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when theyre SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. its blowing up in a bridge setup.

ill debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
dont need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, ann a process can replace them at
any time. the pointers from the descriptor to those is also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

ive hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the user of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structure's with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioclts.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/seperate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level source address in forged frames. from NetBSD. art@ok


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.41 13-May-2020 cheloha

bpf(4): separate descriptor non-blocking status from read timeout

If you set FIONBIO on a bpf(4) descriptor you enable non-blocking mode
and also clobber any read timeout set for the descriptor. The reverse
is also true: do BIOCSRTIMEOUT and you'll set a timeout and
simultaneously disable non-blocking status. The two are mutually
exclusive.

This relationship is undocumented and might cause a bug. At the
very least it makes reasoning about the code difficult.

This patch adds a new member to bpf_d, bd_rnonblock, to store the
non-blocking status of the descriptor. The read timeout is still
kept in bd_rtout.

With this in place, non-blocking status and the read timeout can
coexist. Setting one state does not clear the other, and vice versa.

Separating the two states also clears the way for changing the bpf(4)
read timeout to use the system clock instead of ticks. More on that
in a later patch.

With insight from dlg@ regarding the purpose of the read timeout.

ok dlg@


Revision tags: OPENBSD_6_7_BASE
# 1.40 02-Jan-2020 claudio

Switch bpf to use pgsigio(9) and sigio_init(9) instead of handrolling
something with csignal().
OK visa@


# 1.39 21-Oct-2019 sashan

put bpfdesc reference counting back, revert change introduced in 1.175 as:
BPF: remove redundant reference counting of filedescriptors

Anton@ made problem crystal clear:
I've been looking into a similar bpf panic reported by syzkaller,
which looks somewhat related. The one reported by syzkaller is caused
by issuing ioctl(SIOCIFDESTROY) on the interface which the packet filter
is attached to. This will in turn invoke the following functions
expressed as an inverted stacktrace:
1. bpfsdetach()
2. vdevgone()
3. VOP_REVOKE()
4. vop_generic_revoke()
5. vgonel()
6. vclean(DOCLOSE)
7. VOP_CLOSE()
8. bpfclose()

Note that bpfclose() is called before changing the vnode type. In
bpfclose(), the `struct bpf_d` is immediately removed from the global
bpf_d_list list and might end up sleeping inside taskq_barrier(systq).
Since the bpf file descriptor (fd) is still present and valid, another
thread could perform an ioctl() on the fd only to fault since
bpfilter_lookup() will return NULL. The vnode is not locked in this path
either so it won't end up waiting on the ongoing vclean().

Steps to trigger the similar type of panic are straightforward, let there be
two processes running concurrently:

process A:
while true ; do ifconfig tun0 up ; ifconfig tun0 destroy ; done

process B:
while true ; do tcpdump -i tun0 ; done

panic happens within few secs (Dell PowerEdge 710)

OK @visa, OK @anton


Revision tags: OPENBSD_6_6_BASE
# 1.38 18-May-2019 sashan

branches: 1.38.2;
BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us to make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOSGQUEUE

nothing uses them, and the implementation makes incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with, and allow
falling back to, a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when theyre SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. its blowing up in a bridge setup.

ill debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
dont need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, and a process can replace them at
any time. the pointers from the descriptor to those are also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

ive hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the use of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioctls.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/separate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level source address in forged frames. from NetBSD. art@ok


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.38 18-May-2019 sashan

BPF: remove redundant reference counting of filedescriptors

OK visa@, OK mpi@


# 1.37 15-Apr-2019 sashan

moving BPF to RCU

OK visa@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE
# 1.36 24-Jan-2018 dlg

add support for bpf on "subsystems", not just network interfaces

bpf assumed that it was being unconditionally attached to network
interfaces, and maintained a pointer to a struct ifnet *. this was
mostly used to get at the name of the interface, which is how
userland asks to be attached to a particular interface. this diff
adds a pointer to the name and uses it instead of the interface
pointer for these lookups. this in turn allows bpf to be attached
to arbitrary subsystems in the kernel which just have to supply a
name rather than an interface pointer. for example, bpf could be
attached to pf_test so you can see what packets are about to be
filtered. mpi@ is using this to look at usb transfers.

bpf still uses the interface pointer for bpfwrite, and for enabling
and disabling promisc. however, these are nopped out for subsystems.

ok mpi@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer needs the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us to make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOCSQUEUE

nothing uses them, and the implementation makes incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with and allow to
fallback to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when theyre SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. its blowing up in a bridge setup.

ill debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
dont need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, and a process can replace them at
any time. the pointers from the descriptor to those are also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

ive hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the use of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@
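
The difference between the two comparisons can be seen in a small
userland sketch. This is not the code from bpf.c: the function names
are hypothetical, and the counter is modelled as an unsigned int (an
assumption made here so that wraparound is well defined in C), whereas
the commit message only speaks of "ticks".

```c
#include <limits.h>

/*
 * Hypothetical illustration of the two forms named in the commit
 * message. "start" is the tick count when waiting began, "interval"
 * is the timeout in ticks, and "now" is the current tick count.
 * Unsigned arithmetic is an assumption of this sketch; it makes the
 * wraparound modulo UINT_MAX + 1 well defined.
 */

/* "interval + start < ticks": the sum can wrap past the top of the range. */
static int
expired_naive(unsigned int start, unsigned int interval, unsigned int now)
{
	return (start + interval < now);
}

/* "ticks - start > interval": the elapsed time stays correct across a wrap. */
static int
expired_wrapsafe(unsigned int start, unsigned int interval, unsigned int now)
{
	return (now - start > interval);
}
```

With start = UINT_MAX - 5, interval = 10 and now = UINT_MAX - 2, only
3 ticks have elapsed. The first form reports the timeout as expired
because start + interval wraps around to 4, while the second form
reports it as still pending.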


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioctls.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/separate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level source address in forged frames. from NetBSD. ok art@


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.35 24-Jan-2017 krw

A space here, a space there. Soon we're talking real whitespace
rectification.


# 1.34 09-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

Tested by Hrvoje Popovski who reported a recursion in the previous
attempt.

ok bluhm@


# 1.33 03-Jan-2017 mpi

Revert previous, there's still a problem with recursive entries in
bpf_mpath_ether().

Problem reported by Hrvoje Popovski.


# 1.32 02-Jan-2017 mpi

Use a mutex to serialize accesses to buffer slots.

With this change bpf_catchpacket() no longer need the KERNEL_LOCK().

ok bluhm@, jmatthew@


# 1.31 22-Aug-2016 mpi

Call csignal() and selwakeup() from a KERNEL_LOCK'd task.

This will allow us make bpf_tap() KERNEL_LOCK() free.

Discussed with dlg@ and input from guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.30 30-Mar-2016 dlg

remove support for BIOCGQUEUE and BIOSGQUEUE

nothing uses them, and the implementation make incorrect assumptions
about mbufs within bpf processing that could lead to some weird
failures.

ok sthen@ deraadt@ mpi@


Revision tags: OPENBSD_5_9_BASE
# 1.29 03-Dec-2015 mpi

Use SRPL_HEAD() and SRPL_ENTRY() to be consistent with and allow to
fallback to a SLIST.

ok dlg@, jasper@


# 1.28 09-Sep-2015 dlg

convert bpf to using an srp list for the list of descriptors.

this replaces the hand rolled list. the code has always used hand
rolled lists, but that gets a bit cumbersome when theyre SRPs.

requested ages ago by mpi@


# 1.27 01-Sep-2015 dlg

reintroduce bpf.c r1.121.

this differs slightly from 1.121 in that it uses the new srp_follow()
to walk the list of descriptors on an interface. this is instead
of interleaving srp_enter() and srp_leave(), which can lead to races
and corruption if you're touching the same SRPs at different IPLs
on the same CPU.

ok deraadt@ jmatthew@


# 1.26 23-Aug-2015 dlg

back out bpf+srp. its blowing up in a bridge setup.

ill debug this out of the tree.


# 1.25 16-Aug-2015 dlg

make bpf_mtap mpsafe by using SRPs.

this was originally implemented by jmatthew@ last year, and updated
by us both during s2k15.

there are four data structures that need to be looked after.

the first is the bpf interface itself. it is allocated and freed
at the same time as an actual interface, so if you're able to send
or receive packets, you're able to run bpf on an interface too.
dont need to do any work there.

the second are bpf descriptors. these represent userland attaching
to a bpf interface, so you can have many of them on a single bpf
interface. they were arranged in a singly linked list before. now
the head and next pointers are replaced with SRP pointers and
followed by srp_enter. the list updates are serialised by the kernel
lock.

the third are the bpf filters. there is an inbound and outbound
filter on each bpf descriptor, ann a process can replace them at
any time. the pointers from the descriptor to those is also changed
to be accessed via srp_enter. updates are serialised by the kernel
lock.

the fourth thing is the ring that bpf writes to for userland to
read. there's one of these per descriptor. because these are only
updated when a filter matches (which is hopefully a relatively rare
event), we take the kernel lock to serialise the writes to the ring.

all this together means you can run bpf against a packet without
taking the kernel lock unless you actually caught a packet and need
to send it to userland. even better, you can run bpf in parallel,
so if we ever support multiple rings on a single interface, we can
run bpf on each ring on different cpus safely.

ive hit this pretty hard in production at work (yay dhcrelay) on
myx (which does rx outside the biglock).

ok jmatthew@ mpi@ millert@


Revision tags: OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.24 10-Feb-2015 pelikan

make bpf(4) able to filter based on a pf(4) queue ID for tcpdump -Q qname

ALTQ version has been on tech@ for years, people were generally ok with it.

ok henning


# 1.23 05-Oct-2014 lteo

fix typo in comment: correspoding -> corresponding


Revision tags: OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.22 18-Dec-2013 krw

Revert the *other* part of bpf.c's r1.84. May finally fix RD Thrush's
encounter with "timeout_add: to_ticks (-1) < 0". Pointed out by RD
Thrush.


# 1.21 12-Nov-2013 dlg

try bpf.c r1.84 again, this time without semantic changes to if statements.

cheers to sthen@ and krw@ for properly dealing with the fallout of my
first commit.


# 1.20 11-Nov-2013 sthen

Revert bpf.c 1.84 / bpfdesc.h 1.19 for now, "panic: timeout_add: to_ticks (-1)
< 0" seen by RD Thrush, http://article.gmane.org/gmane.os.openbsd.bugs/20113
where he has a long-running process using bpf which is active at the time of
panic. krw@ agrees with reverting for now.


# 1.19 11-Nov-2013 dlg

replace the use of ticks in a condition like "interval + start < ticks"
with "ticks - start > interval" because the latter copes with the ticks
value wrapping.

pointed out by guenther@
ok krw@
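
A minimal sketch of the comparison described in 1.19, not the code from bpf.c. It assumes an unsigned counter so that wraparound is well defined in C (the kernel's ticks variable is a signed int), and the function names are hypothetical. The naive form adds to the start value, which can wrap and make the deadline look like it is far in the past; the subtraction form measures elapsed time, which stays correct across a wrap as long as the real elapsed time is smaller than the counter range.

```c
#include <limits.h>

/*
 * Naive form from the commit message: "interval + start < ticks".
 * start + interval can wrap past UINT_MAX and become a small number,
 * so the timer looks expired long before it really is.
 */
static int
expired_naive(unsigned int start, unsigned int interval, unsigned int now)
{
	return (interval + start < now);
}

/*
 * Wrap-tolerant form: "ticks - start > interval".
 * now - start is the elapsed tick count modulo 2^N, which is still
 * the true elapsed time even if the counter wrapped in between.
 */
static int
expired_wrapsafe(unsigned int start, unsigned int interval, unsigned int now)
{
	return (now - start > interval);
}
```

With start = UINT_MAX - 5 and interval = 10, the naive check already reports expiry after 2 ticks, while the wrap-tolerant check only reports it once more than 10 ticks have elapsed, including after the counter has wrapped.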


# 1.18 24-Oct-2013 deraadt

Move obvious kernel prototypes (and structures with kernel pointers,
obviously only used in the kernel) behind #ifdef _KERNEL


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE OPENBSD_5_2_BASE OPENBSD_5_3_BASE OPENBSD_5_4_BASE
# 1.17 25-Mar-2006 djm

allow bpf(4) to ignore packets based on their direction (inbound or
outbound), using a new BIOCSDIRFILT ioctl;
guidance, feedback and ok canacar@


Revision tags: OPENBSD_3_9_BASE
# 1.16 21-Nov-2005 millert

Move contents of sys/select.h to sys/selinfo.h in preparation for a
userland-visible sys/select.h. Consistent with what Net and Free do.
OK deraadt@, tested with full ports build by naddy@.


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.15 17-Dec-2004 reyk

knf cleanup, convert old k&r-style functions to ansi-style for a
consistent style in sys/net/bpf.c.

ok henning@, "looks fine" canacar@


Revision tags: OPENBSD_3_6_BASE
# 1.14 22-Jun-2004 canacar

Add a new "filter drop" flag to bpf and related ioctls.
When enabled, it notifies the calling interface that the packet
matches a bpf filter and should be dropped.
ok henning@ markus@ frantzen@


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.13 28-May-2004 grange

bpf device cloning.
Now to have more bpf devices just add device nodes in /dev,
no need to recompile kernel anymore.

Code from form@pdp-11.org.ru, some help from markus@.
ok markus@ canacar@ deraadt@


# 1.12 08-May-2004 canacar

reference count bpf descriptors to protect against disappearing interfaces
while asleep in read. ok deraadt@
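
A rough sketch of the reference counting idea in 1.12, under stated assumptions: the struct and function names below are hypothetical and are not the ones in bpfdesc.h, and the sketch is single-threaded, with none of the locking a kernel would need. The point is only that a sleeping reader holds its own reference, so a detach that drops the owner's reference cannot free the descriptor underneath it.

```c
#include <stdlib.h>

/* Hypothetical descriptor; the real struct bpf_d has many more fields. */
struct desc {
	int refcnt;	/* number of holders keeping the descriptor alive */
};

/* Allocate a descriptor holding one reference for its owner. */
static struct desc *
desc_alloc(void)
{
	struct desc *d = calloc(1, sizeof(*d));

	if (d != NULL)
		d->refcnt = 1;
	return (d);
}

/* Take an extra reference, e.g. before going to sleep in read. */
static void
desc_ref(struct desc *d)
{
	d->refcnt++;
}

/*
 * Drop a reference. Returns 1 if this was the last reference and the
 * descriptor was freed, 0 if other holders remain.
 */
static int
desc_rele(struct desc *d)
{
	if (--d->refcnt > 0)
		return (0);
	free(d);
	return (1);
}
```

In this sketch a reader would call desc_ref() before sleeping and desc_rele() after waking; if the owner released its reference in the meantime, the reader's release is the one that frees the memory.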


Revision tags: OPENBSD_3_5_BASE
# 1.11 22-Oct-2003 canacar

Add locking and write filtering to bpf descriptors.
Locking prevents dangerous ioctls such as changing the
interface and sending signals to be executed by an
unprivileged process. A filter can also be applied
to packets injected through a bpf descriptor.

These features allow programs using bpf descriptors to
safely drop/separate privileges.

ok frantzen@ henning@ mcbride@


Revision tags: OPENBSD_3_4_BASE
# 1.10 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.9 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_3_0_BASE UBC_BASE
# 1.8 09-Jun-2001 angelos

branches: 1.8.4;
By popular demand, protect from multiple inclusion, and fix to use the
same naming style.


# 1.7 28-May-2001 dugsong

add BIOC[GS]HDRCMPLT ioctl for BPF, to disable overwriting of link level source address in forged frames. from NetBSD. ok art@


Revision tags: OPENBSD_2_8_BASE OPENBSD_2_9_BASE
# 1.6 19-Jun-2000 jason

de-#ifdef-ize


Revision tags: OPENBSD_2_6_BASE OPENBSD_2_7_BASE SMP_BASE kame_19991208
# 1.5 08-Aug-1999 niklas

branches: 1.5.4;
Support detaching of network interfaces. Still work to do in ipf, and
other families than inet.


Revision tags: OPENBSD_2_4_BASE OPENBSD_2_5_BASE
# 1.4 26-Jun-1998 deraadt

fix bpf select(); from mts@rare.net


Revision tags: OPENBSD_2_2_BASE OPENBSD_2_3_BASE
# 1.3 31-Aug-1997 deraadt

for non-tty TIOCSPGRP/F_SETOWN/FIOSETOWN pgid setting calls, store uid
and euid as well, then deliver them using new csignal() interface
which ensures that pgid setting process is permitted to signal the
pgid process(es). Thanks to newsham@aloha.net for extensive help and
discussion.


Revision tags: OPENBSD_2_1_BASE
# 1.2 24-Feb-1997 niklas

OpenBSD tags + some prototyping police


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision