#
1.53 |
|
10-Nov-2023 |
bluhm |
Make ifq and ifiq interface MP safe.
Rename ifq_set_maxlen() to ifq_init_maxlen(). This function neither uses WRITE_ONCE() nor a mutex and is called before the ifq mutex is initialized. The new name expresses that it should be used only during interface attach when there is no concurrency.
Protect ifq_len(), ifq_empty(), ifiq_len(), and ifiq_empty() with READ_ONCE(). They can be used without lock as they only read a single integer.
OK dlg@
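the pattern this commit relies on can be sketched in userland C. everything below is a hypothetical, stripped-down model: the real struct ifqueue lives in sys/net/ifq.h with a mutex, counters, and queue discipline state.

```c
/* minimal stand-in for the kernel's READ_ONCE(): a volatile access
 * stops the compiler from caching, duplicating, or tearing the load */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* hypothetical, stripped-down ifq */
struct ifq {
	unsigned int ifq_len;	/* written by other cpus under ifq_mtx */
};

/* lockless readers: safe without the mutex because each one only
 * loads a single aligned integer */
static unsigned int
ifq_len(struct ifq *ifq)
{
	return READ_ONCE(ifq->ifq_len);
}

static int
ifq_empty(struct ifq *ifq)
{
	return ifq_len(ifq) == 0;
}
```

the reader may see a slightly stale length, but never a torn or invented one, which is all these helpers promise.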
|
#
1.52 |
|
08-Oct-2023 |
claudio |
Revert commitid: KtmyJEoS0WWxmlZ5 --- Protect interface queues with read once and mutex.
Reading atomic values needs at least READ_ONCE() and writing values should be done under a mutex. This is what mbuf queues already do. Add READ_ONCE() to ifq and ifiq macros for len and empty. Convert ifq_set_maxlen() to a function that grabs ifq_mtx.
OK mvs@ ---
ifq_set_maxlen() is called before the ifq_mtx is initialized and this at least crashes WITNESS kernels on boot.
Reported-by: syzbot+7b218ef53432b5d56d7d@syzkaller.appspotmail.com
|
#
1.51 |
|
05-Oct-2023 |
bluhm |
Protect interface queues with read once and mutex.
Reading atomic values needs at least READ_ONCE() and writing values should be done under a mutex. This is what mbuf queues already do. Add READ_ONCE() to ifq and ifiq macros for len and empty. Convert ifq_set_maxlen() to a function that grabs ifq_mtx.
OK mvs@
|
Revision tags: OPENBSD_7_4_BASE
|
#
1.50 |
|
30-Jul-2023 |
dlg |
count the number of times a ring was marked as oactive.
this is interesting as an indicator of how busy or overloaded a transmit queue is before the next indicator which is the number of qdrops.
|
Revision tags: OPENBSD_7_3_BASE
|
#
1.49 |
|
09-Jan-2023 |
dlg |
flesh out ifiq_enqueue
|
#
1.48 |
|
09-Jan-2023 |
dlg |
count the number of times a packet was dropped by bpf as fdrops.
|
#
1.47 |
|
22-Nov-2022 |
dlg |
count how many times ifiqs enqueue and dequeue packets.
network cards try to enqueue a list of packets on an ifiq once per interrupt and ifiqs already count how many packets they're handling. this lets us see how well interrupt mitigation is working on a ring or interface. ifiqs are supposed to provide backpressure signalling to a driver if it enqueues a lot more work than it's able to process in softnet, so recording dequeues lets us see this ratio.
|
Revision tags: OPENBSD_7_2_BASE
|
#
1.46 |
|
30-Apr-2022 |
bluhm |
Run IP input and forwarding with shared netlock. Also distribute packets from the interface receive rings into multiple net task queues. Note that we still have only one softnet task. So there will be no concurrency yet, but we can notice wrong exclusive lock assertions. Soon the final step will be to increase the NET_TASKQ define. lots of testing Hrvoje Popovski; OK sashan@
|
Revision tags: OPENBSD_7_1_BASE
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0 packets: 2292 packets bytes: 229846 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets
em0:0:txq:0 packets: 1297 packets bytes: 193413 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets maxqlen: 511 packets oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there are some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate its conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs is kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
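the peek pattern this fixes can be sketched with hypothetical, simplified types (the real code works on struct mbuf and brackets the ifq mutex with ifq_deq_begin/rollback/commit):

```c
#include <stddef.h>

/* hypothetical stand-ins for the kernel types */
struct mbuf {
	struct mbuf	*m_nextpkt;
	int		 m_pktlen;
};

struct ifq {
	struct mbuf	*ifq_head;	/* protected by ifq_mtx in the kernel */
};

/* begin a dequeue: look at the head packet without unlinking it */
static struct mbuf *
ifq_deq_begin(struct ifq *ifq)
{
	/* mtx_enter(&ifq->ifq_mtx) in the real code */
	return ifq->ifq_head;
}

/* rollback: the caller only peeked, so the packet stays queued */
static void
ifq_deq_rollback(struct ifq *ifq, struct mbuf *m)
{
	/* mtx_leave(&ifq->ifq_mtx) in the real code */
	(void)ifq;
	(void)m;
}

/* commit: the caller keeps the packet, so unlink it */
static void
ifq_deq_commit(struct ifq *ifq, struct mbuf *m)
{
	ifq->ifq_head = m->m_nextpkt;
}

/* the fixed ifq_hdatalen: peek at the head and roll back, instead of
 * committing and leaking the packet */
static int
ifq_hdatalen(struct ifq *ifq)
{
	struct mbuf *m;
	int len = 0;

	if ((m = ifq_deq_begin(ifq)) != NULL) {
		len = m->m_pktlen;
		ifq_deq_rollback(ifq, m);	/* was ifq_deq_commit: the bug */
	}
	return len;
}
```

after the fix a caller can ask for the head datalen any number of times and the packet is still there when it finally reads it.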
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. they're being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because they're already running inside the stack.
i'm putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesn't have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and the normal ifq serialiser barrier to guarantee the start routine won't be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct won't get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
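the mitigation decision can be sketched like this. the names ifq_run_start and ifq_defer_start, the `how` field, and the fixed threshold are all inventions for the sketch; the real ifq_start() in ifq.c calls the driver start routine directly or schedules it on a net taskq.

```c
#define IFQ_MIN_BULK 16		/* backlog size borrowed from dragonflybsd */

/* hypothetical, stripped-down ifq */
struct ifq {
	unsigned int	ifq_len;	/* packets currently queued */
	int		how;		/* records which path ran, for the sketch */
};

static void
ifq_run_start(struct ifq *ifq)
{
	ifq->how = 1;	/* the real code calls the driver start routine now */
}

static void
ifq_defer_start(struct ifq *ifq)
{
	ifq->how = 2;	/* the real code does task_add() on a net taskq */
}

/* called after enqueueing a packet: post to the chip immediately only
 * once a batch has built up, amortising the expensive ring update */
static void
ifq_start(struct ifq *ifq)
{
	if (ifq->ifq_len >= IFQ_MIN_BULK)
		ifq_run_start(ifq);
	else
		ifq_defer_start(ifq);
}
```

either way the start routine eventually runs; the threshold only decides whether it runs now or after more packets have accumulated.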
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack runs vlan_input against these packets, vlan_input takes each packet and calls ifiq_input against it. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesn't have to.
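the pressure mechanism can be sketched as below. the threshold value and the exact bookkeeping are made up for the sketch; the commit describes the shape (incremented in ifiq_input, cleared when the softnet side processes the queue), not these numbers.

```c
#define IFIQ_PRESSURE_DROP 8	/* hypothetical threshold */

/* hypothetical, stripped-down ifiq */
struct ifiq {
	unsigned int	ifiq_pressure;	/* enqueues since the stack last ran */
	unsigned int	ifiq_qdrops;
};

/* driver side: called once per interrupt with a list of packets.
 * returns nonzero when the stack looks busy and the driver should
 * shed load. */
static int
ifiq_input(struct ifiq *ifiq)
{
	if (++ifiq->ifiq_pressure > IFIQ_PRESSURE_DROP) {
		ifiq->ifiq_qdrops++;	/* stack is busy: drop */
		return 1;
	}
	/* ... append the packets for the softnet taskq to process */
	return 0;
}

/* stack side: the softnet taskq drained the queue, so the pressure
 * is relieved */
static void
ifiq_process(struct ifiq *ifiq)
{
	ifiq->ifiq_pressure = 0;
}
```

the key property is that pressure measures calls between stack runs rather than queued packets, which is exactly the semantic the later commits wrestle with.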
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously haven't needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
i'm putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
i'm not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops aren't necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
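the deferred-free pattern behind ifq_mfreem (and the ifq_mfreeml variant above) can be sketched with hypothetical types; ifq_free_purge and the `freed` counter are inventions for the sketch, standing in for the real m_freem() calls.

```c
#include <stddef.h>

/* hypothetical stand-ins for the kernel types */
struct mbuf {
	struct mbuf	*m_nextpkt;
};

struct ifq {
	struct mbuf	*ifq_free;	/* mbufs doomed while ifq_mtx was held */
	unsigned int	 ifq_len;
	unsigned int	 freed;		/* stands in for actually freeing */
};

/* callable by a queue discipline during dequeue, while ifq_mtx is
 * held: the mbuf is only stashed, never freed under the lock */
static void
ifq_mfreem(struct ifq *ifq, struct mbuf *m)
{
	ifq->ifq_len--;			/* it no longer counts as queued */
	m->m_nextpkt = ifq->ifq_free;
	ifq->ifq_free = m;
}

/* run at the end of the dequeue op, after ifq_mtx is dropped: now it
 * is safe to hand the mbufs back to the allocator */
static void
ifq_free_purge(struct ifq *ifq)
{
	struct mbuf *m, *n;

	for (m = ifq->ifq_free; m != NULL; m = n) {
		n = m->m_nextpkt;
		ifq->freed++;		/* m_freem(m) in the kernel */
	}
	ifq->ifq_free = NULL;
}
```

splitting "doom" from "free" is what keeps the allocator out of the mutex's critical section.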
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded, we attempt to find a non-empty queue of packets with lower priority than the priority of the packet we're trying to enqueue, and if there's such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via the "set prio" pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
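the policy can be sketched with a hypothetical array-of-counts model (the real priq keeps mbuf_lists per priority; priq_enq and the field names here are inventions for the sketch):

```c
#define PRIQ_NQUEUES 8

/* hypothetical model: pq_len[p] is how many packets sit at priority p */
struct priq {
	unsigned int	pq_len[PRIQ_NQUEUES];
	unsigned int	pq_qlen;	/* aggregate depth across priorities */
	unsigned int	pq_maxlen;
};

/* enqueue at priority prio; returns 0 on success, -1 if the new
 * packet itself must be dropped */
static int
priq_enq(struct priq *pq, unsigned int prio)
{
	unsigned int p;

	if (pq->pq_qlen >= pq->pq_maxlen) {
		/* full: make room by dropping the head of the first
		 * non-empty queue with lower priority than this packet */
		for (p = 0; p < prio; p++) {
			if (pq->pq_len[p] > 0) {
				pq->pq_len[p]--;
				pq->pq_qlen--;
				break;
			}
		}
		if (pq->pq_qlen >= pq->pq_maxlen)
			return -1;	/* nothing lower to evict: drop this one */
	}
	pq->pq_len[prio]++;
	pq->pq_qlen++;
	return 0;
}
```

a high priority packet arriving at a full queue evicts bulk traffic instead of being dropped itself, which is the behavioral change the commit describes.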
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasn't even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isn't set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
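picking an ifq by index can be sketched like this. ifq_pick, NTXQ, and the modulo mapping are inventions for the sketch; the commit only says the conditioner provides an index into an array of ifqs.

```c
#define NTXQ 4	/* hypothetical count passed to if_attach_queues() */

/* hypothetical, stripped-down types */
struct ifq {
	unsigned int	ifq_idx;
};

struct ifnet {
	struct ifq	if_ifqs[NTXQ];
	unsigned int	if_nifqs;
};

/* the conditioner picks a transmit queue by mapping something like a
 * packet's flow id onto the interface's ifq array */
static struct ifq *
ifq_pick(struct ifnet *ifp, unsigned int flowid)
{
	return &ifp->if_ifqs[flowid % ifp->if_nifqs];
}
```

with if_nifqs == 1 (the default) every flow lands on the one queue, which is why existing drivers keep working unchanged.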
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers there's no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
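the shape of the fix can be sketched as below. drv_start and the `starts` counter are inventions for the sketch; the point is that ifq_restart bundles the clear with a fresh start call, and both run under the one serialiser, so a clear can never slip in between "ring is full" and "mark oactive" inside the start routine.

```c
/* hypothetical, stripped-down ifq */
struct ifq {
	unsigned int	ifq_oactive;
	unsigned int	starts;		/* counts start-routine calls */
};

static void
ifq_set_oactive(struct ifq *ifq)
{
	ifq->ifq_oactive = 1;
}

static void
ifq_clr_oactive(struct ifq *ifq)
{
	ifq->ifq_oactive = 0;
}

/* a driver start routine: fills the tx ring and, on a full ring,
 * marks the queue oactive so the stack stops calling it */
static void
drv_start(struct ifq *ifq)
{
	ifq->starts++;
	/* ... post packets to the ring ... */
	ifq_set_oactive(ifq);	/* pretend the ring filled up */
}

/* runs as serialised work, like the start routine itself: clearing
 * oactive and restarting cannot interleave with a start in progress */
static void
ifq_restart(struct ifq *ifq)
{
	ifq_clr_oactive(ifq);
	drv_start(ifq);
}
```

a txeof path that made room on the ring calls ifq_restart instead of ifq_clr_oactive directly, so the queue can never be left oactive with a start pending.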
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.52 |
|
08-Oct-2023 |
claudio |
Revert commitid: KtmyJEoS0WWxmlZ5 --- Protect interface queues with read once and mutex.
Reading atomic values need at least read once and writing values should have a mutex. This is what mbuf queues already do. Add READ_ONCE() to ifq and ifiq macros for len and empty. Convert ifq_set_maxlen() to a function that grabs ifq_mtx.
OK mvs@ ---
ifq_set_maxlen() is called before the ifq_mtx is initalized and this at least crashes WITNESS kernels on boot.
Reported-by: syzbot+7b218ef53432b5d56d7d@syzkaller.appspotmail.com
|
#
1.51 |
|
05-Oct-2023 |
bluhm |
Protect interface queues with read once and mutex.
Reading atomic values need at least read once and writing values should have a mutex. This is what mbuf queues already do. Add READ_ONCE() to ifq and ifiq macros for len and empty. Convert ifq_set_maxlen() to a function that grabs ifq_mtx.
OK mvs@
|
Revision tags: OPENBSD_7_4_BASE
|
#
1.50 |
|
30-Jul-2023 |
dlg |
count the number of times a ring was marked as oactive.
this is interesting as an indicator of how busy or overloaded a transmit queue is before the next indicator which is the number of qdrops.
|
Revision tags: OPENBSD_7_3_BASE
|
#
1.49 |
|
09-Jan-2023 |
dlg |
flesh out ifiq_enqueue
|
#
1.48 |
|
09-Jan-2023 |
dlg |
count the number times a packet was dropped by bpf as fdrops.
|
#
1.47 |
|
22-Nov-2022 |
dlg |
count how many times ifiqs enqueue and dequeue packets.
network cards try to enqueue a list of packets on an ifiq once per interrupt and ifiqs already count how many packets they're handling. this let's us see how well interrupt mitigation is working on a ring or interface. ifiqs are supposed to provide backpressure signalling to a driver if it enqueues a lot more work than it's able to process in softnet, so recording dequeues let's us see this ratio.
|
Revision tags: OPENBSD_7_2_BASE
|
#
1.46 |
|
30-Apr-2022 |
bluhm |
Run IP input and forwarding with shared netlock. Also distribute packets from the interface receive rings into multiple net task queues. Note that we still have only one softnet task. So there will be no concurrency yet, but we can notice wrong exclusive lock assertions. Soon the final step will be to increase the NET_TASKQ define. lots of testing Hrvoje Popovski; OK sashan@
|
Revision tags: OPENBSD_7_1_BASE
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0 packets: 2292 packets bytes: 229846 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets em0:0:txq:0 packets: 1297 packets bytes: 193413 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets maxqlen: 511 packets oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate it's conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
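the decision described above can be modelled in a few lines of userland C. 16 is the backlog size the log mentions; the struct and names are illustrative only, not the kernel's.

```c
/*
 * toy model of the tx mitigation decision: enqueue packets and only
 * signal an immediate call of the hardware start routine once a
 * backlog has built up; smaller backlogs are left to the deferred
 * network taskq. names are illustrative, not the kernel's.
 */
#include <assert.h>
#include <stdbool.h>

#define TX_BACKLOG	16	/* threshold taken from the log above */

struct txq_model {
	unsigned int len;	/* packets queued but not yet started */
};

/* returns true when the caller should run the start routine inline */
static bool
txq_enqueue(struct txq_model *q)
{
	q->len++;
	return q->len >= TX_BACKLOG;
}

/* the start routine (inline or from the taskq) drains the backlog */
static void
txq_start(struct txq_model *q)
{
	q->len = 0;
}
```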
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packets and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
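a minimal userland model of the ifiq_pressure mechanism described above: the counter goes up every time the driver queues packets and is cleared when the stack runs, so a high value means the stack is falling behind. the drop threshold of 4 is an arbitrary value for this sketch, not the kernel's.

```c
/*
 * minimal model of the ifiq_pressure idea: pressure rises on every
 * driver input and is relieved when the stack processes the queue.
 * the threshold of 4 is made up for illustration.
 */
#include <assert.h>

#define PRESSURE_DROP	4

struct ifiq_model {
	unsigned int pressure;	/* inputs since the stack last ran */
	unsigned long qdrops;
};

/* driver rx path: returns 1 if the packets should be dropped */
static int
ifiq_model_input(struct ifiq_model *q)
{
	q->pressure++;
	if (q->pressure >= PRESSURE_DROP) {
		q->qdrops++;
		return 1;
	}
	return 0;
}

/* softnet side: processing the queue relieves the pressure */
static void
ifiq_model_process(struct ifiq_model *q)
{
	q->pressure = 0;
}
```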
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
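the aggregation step can be sketched like this: each ring only ever touches its own counters, and a read for userland sums them, so no rx path writes a shared counter. fields and names below are simplified stand-ins for the real ifiqueue code.

```c
/*
 * sketch of per-ring input counters being aggregated when ifdata is
 * read for userland. field and function names are simplified
 * stand-ins, not the real ifiqueue structures.
 */
#include <assert.h>

struct ifiq_counters {
	unsigned long long ipackets;
	unsigned long long ibytes;
	unsigned long long idrops;
};

/* reader side: sum every ring's private counters into one total */
static void
ifiq_totals(const struct ifiq_counters *rings, int nrings,
    struct ifiq_counters *out)
{
	int i;

	out->ipackets = out->ibytes = out->idrops = 0;
	for (i = 0; i < nrings; i++) {
		out->ipackets += rings[i].ipackets;
		out->ibytes += rings[i].ibytes;
		out->idrops += rings[i].idrops;
	}
}
```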
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
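the policy reads like this in a userland sketch, with small integer counters standing in for the real mbuf lists; the queue sizes and names are made up for illustration.

```c
/*
 * sketch of the enqueue policy: when the aggregate queue is full,
 * try to drop the head of a non-empty queue with lower priority
 * than the incoming packet before giving up and dropping the new
 * packet itself. counters stand in for the real mbuf lists.
 */
#include <assert.h>

#define NPRIO	8
#define MAXLEN	4	/* illustrative aggregate limit */

struct priq_model {
	int len[NPRIO];		/* packets per priority level */
	int total;
	int qdrops;
};

/* returns 1 if the new packet of priority prio was accepted */
static int
priq_model_enq(struct priq_model *pq, int prio)
{
	int p;

	if (pq->total >= MAXLEN) {
		/* look for a lower priority victim to drop instead */
		for (p = 0; p < prio; p++) {
			if (pq->len[p] > 0) {
				pq->len[p]--;
				pq->total--;
				pq->qdrops++;
				break;
			}
		}
		if (pq->total >= MAXLEN) {
			pq->qdrops++;	/* no victim: drop the new packet */
			return 0;
		}
	}
	pq->len[prio]++;
	pq->total++;
	return 1;
}
```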
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.51 |
|
05-Oct-2023 |
bluhm |
Protect interface queues with read once and mutex.
Reading atomic values need at least read once and writing values should have a mutex. This is what mbuf queues already do. Add READ_ONCE() to ifq and ifiq macros for len and empty. Convert ifq_set_maxlen() to a function that grabs ifq_mtx.
OK mvs@
|
Revision tags: OPENBSD_7_4_BASE
|
#
1.50 |
|
30-Jul-2023 |
dlg |
count the number of times a ring was marked as oactive.
this is interesting as an indicator of how busy or overloaded a transmit queue is before the next indicator which is the number of qdrops.
|
Revision tags: OPENBSD_7_3_BASE
|
#
1.49 |
|
09-Jan-2023 |
dlg |
flesh out ifiq_enqueue
|
#
1.48 |
|
09-Jan-2023 |
dlg |
count the number times a packet was dropped by bpf as fdrops.
|
#
1.47 |
|
22-Nov-2022 |
dlg |
count how many times ifiqs enqueue and dequeue packets.
network cards try to enqueue a list of packets on an ifiq once per interrupt and ifiqs already count how many packets they're handling. this let's us see how well interrupt mitigation is working on a ring or interface. ifiqs are supposed to provide backpressure signalling to a driver if it enqueues a lot more work than it's able to process in softnet, so recording dequeues let's us see this ratio.
|
Revision tags: OPENBSD_7_2_BASE
|
#
1.46 |
|
30-Apr-2022 |
bluhm |
Run IP input and forwarding with shared netlock. Also distribute packets from the interface receive rings into multiple net task queues. Note that we still have only one softnet task. So there will be no concurrency yet, but we can notice wrong exclusive lock assertions. Soon the final step will be to increase the NET_TASKQ define. lots of testing Hrvoje Popovski; OK sashan@
|
Revision tags: OPENBSD_7_1_BASE
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0 packets: 2292 packets bytes: 229846 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets em0:0:txq:0 packets: 1297 packets bytes: 193413 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets maxqlen: 511 packets oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate it's conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performanace bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialiazed.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input uses the traditional backpressure or defense mechanism and counts packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is as follows: when the aggregate queue depth of an outgoing queue is exceeded, we attempt to find a non-empty queue of packets with lower priority than the packet we're trying to enqueue, and if there is such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasn't even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isn't set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers there's no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
Revision tags: OPENBSD_7_2_BASE
|
#
1.46 |
|
30-Apr-2022 |
bluhm |
Run IP input and forwarding with shared netlock. Also distribute packets from the interface receive rings into multiple net task queues. Note that we still have only one softnet task. So there will be no concurrency yet, but we can notice wrong exclusive lock assertions. Soon the final step will be to increase the NET_TASKQ define. lots of testing Hrvoje Popovski; OK sashan@
|
Revision tags: OPENBSD_7_1_BASE
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0 packets: 2292 packets bytes: 229846 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets
em0:0:txq:0 packets: 1297 packets bytes: 193413 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets maxqlen: 511 packets oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16 bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate its conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs is kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. they're being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because they're already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesn't have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine won't be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct won't get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesn't have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously haven't needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifq's counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatability wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.49 |
|
09-Jan-2023 |
dlg |
flesh out ifiq_enqueue
|
#
1.48 |
|
09-Jan-2023 |
dlg |
count the number times a packet was dropped by bpf as fdrops.
|
#
1.47 |
|
22-Nov-2022 |
dlg |
count how many times ifiqs enqueue and dequeue packets.
network cards try to enqueue a list of packets on an ifiq once per interrupt and ifiqs already count how many packets they're handling. this let's us see how well interrupt mitigation is working on a ring or interface. ifiqs are supposed to provide backpressure signalling to a driver if it enqueues a lot more work than it's able to process in softnet, so recording dequeues let's us see this ratio.
|
Revision tags: OPENBSD_7_2_BASE
|
#
1.46 |
|
30-Apr-2022 |
bluhm |
Run IP input and forwarding with shared netlock. Also distribute packets from the interface receive rings into multiple net task queues. Note that we still have only one softnet task. So there will be no concurrency yet, but we can notice wrong exclusive lock assertions. Soon the final step will be to increase the NET_TASKQ define. lots of testing Hrvoje Popovski; OK sashan@
|
Revision tags: OPENBSD_7_1_BASE
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0 packets: 2292 packets bytes: 229846 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets em0:0:txq:0 packets: 1297 packets bytes: 193413 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets maxqlen: 511 packets oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate it's conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
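The deq_begin/rollback/commit pattern that fix relies on can be sketched in a few lines of userspace C. This is a simplified model with made-up struct layouts, not the kernel implementation; it just shows why a peek must end in rollback, not commit.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch: deq_begin looks at the head packet, rollback leaves
 * it queued, commit removes it.  A peek that commits leaks the packet
 * from the caller's point of view, which was the bug. */
struct mbuf { struct mbuf *next; int len; };
struct ifq { struct mbuf *head; };

struct mbuf *
ifq_deq_begin(struct ifq *q)
{
	return (q->head);		/* packet stays on the queue */
}

void
ifq_deq_rollback(struct ifq *q, struct mbuf *m)
{
	(void)q; (void)m;		/* nothing to undo: still queued */
}

void
ifq_deq_commit(struct ifq *q, struct mbuf *m)
{
	q->head = m->next;		/* caller now owns m */
}

int
ifq_hdatalen(struct ifq *q)
{
	struct mbuf *m;
	int len = 0;

	if ((m = ifq_deq_begin(q)) != NULL) {
		len = m->len;
		ifq_deq_rollback(q, m);	/* the fix: rollback, not commit */
	}
	return (len);
}
```

Calling the sketch's ifq_hdatalen twice returns the same length both times, because the packet was never consumed.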
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
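The pressure counter described above is simple enough to sketch. This is an illustrative model only: the struct fields, the function shapes, and the threshold of 16 (borrowed from rev 1.28's "16 calls" observation) are assumptions, not the kernel's actual code.

```c
#include <assert.h>

/* Sketch of the ifiq_pressure idea: count ifiq_input calls between
 * runs of the stack; a high count means softnet is falling behind. */
#define IFIQ_PRESSURE_DROP 16	/* illustrative threshold */

struct ifiq { int pressure; unsigned long qdrops; };

/* Returns nonzero if the driver should shed load by dropping. */
int
ifiq_input(struct ifiq *q)
{
	if (++q->pressure >= IFIQ_PRESSURE_DROP) {
		q->qdrops++;
		return (1);	/* stack is busy */
	}
	return (0);
}

/* Called when the net taskq actually processes the queue. */
void
ifiq_process(struct ifiq *q)
{
	q->pressure = 0;	/* the stack caught up */
}
```

The first 15 inputs without an intervening process succeed; the 16th signals backpressure, and a single process call resets the count.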
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
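The aggregation step mentioned above (summing per-ring counters when userland reads ifdata) can be sketched as a simple loop. The struct and function names here are invented for illustration; only the idea — independent per-ring counters folded into one if_data view on read — comes from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-ring input counters, one struct per rx ring. */
struct ifiq { uint64_t ifiq_packets, ifiq_bytes, ifiq_qdrops; };

/* Hypothetical subset of if_data that userland sees. */
struct if_data_sketch { uint64_t ifi_ipackets, ifi_ibytes, ifi_iqdrops; };

/* Aggregate every ring's counters when ifdata is read for userland. */
void
if_getdata_rx(const struct ifiq *rings, int nrings,
    struct if_data_sketch *d)
{
	int i;

	d->ifi_ipackets = d->ifi_ibytes = d->ifi_iqdrops = 0;
	for (i = 0; i < nrings; i++) {
		d->ifi_ipackets += rings[i].ifiq_packets;
		d->ifi_ibytes += rings[i].ifiq_bytes;
		d->ifi_iqdrops += rings[i].ifiq_qdrops;
	}
}
```

Because each ring only ever touches its own counters, the per-ring updates don't race each other; only the read-side sum needs to look at all of them.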
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
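The mitigation decision described above is a small threshold check. The sketch below models it in userspace with flags standing in for "call the driver now" and "task_add to the softnet tq"; all names are illustrative, and only the 4-packet threshold comes from the commit.

```c
#include <assert.h>

#define IFQ_MIN_BULK 4	/* threshold chosen from sephe's dfly testing */

/* Flags stand in for the real actions so the decision is testable. */
struct ifq { int len; int started_direct; int task_added; };

void
ifq_start(struct ifq *q)
{
	if (q->len >= IFQ_MIN_BULK)
		q->started_direct = 1;	/* run the driver start routine now */
	else
		q->task_added = 1;	/* defer: task_add(softnettq, ...) */
}
```

With fewer than 4 packets queued the start is deferred to the softnet task; at 4 or more the driver is called immediately, amortising the cost of poking the hardware.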
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
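The enqueue policy described above can be sketched with counters standing in for the per-priority mbuf_lists. This is a simplified model under those assumptions, not the real priq code: it only tracks queue lengths, and dropping "the first packet" of a victim queue becomes a decrement.

```c
#include <assert.h>

#define PRIQ_NQUEUES 8	/* one queue per priority level */

struct priq { int qlen[PRIQ_NQUEUES]; int total, maxlen; };

/* Returns 0 on success, -1 if the new packet itself must be dropped. */
int
priq_enq(struct priq *pq, int prio)
{
	int p;

	if (pq->total >= pq->maxlen) {
		/* look for a non-empty queue strictly below our priority */
		for (p = 0; p < prio; p++) {
			if (pq->qlen[p] > 0)
				break;
		}
		if (p == prio)
			return (-1);	/* nothing lower: drop newcomer */
		pq->qlen[p]--;		/* drop victim's head packet */
		pq->total--;
	}
	pq->qlen[prio]++;
	pq->total++;
	return (0);
}
```

A high-priority packet arriving at a full queue evicts a lower-priority one and gets queued; a low-priority packet with no lower victim is dropped itself, which is exactly the asymmetry the policy is after.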
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
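The queue selection described above (conditioner provides an index into the ifq array; hfsc maps everything to the first ifq) can be sketched as follows. The struct, field names, and the flow-id modulo mapping are illustrative assumptions; the commit only says an index is picked and that hfsc uses queue 0.

```c
#include <assert.h>

/* Hypothetical slice of an interface: queue count plus an hfsc flag. */
struct ifnet_sketch { int if_nifqs; int hfsc; };

/* Pick a transmit ifq index for a packet's flow id. */
int
ifq_idx(const struct ifnet_sketch *ifp, unsigned int flowid)
{
	if (ifp->hfsc)
		return (0);	/* hfsc transmits through the first ifq */
	return (flowid % ifp->if_nifqs);
}
```

With priq and 4 queues a flow is pinned to one ifq by its flow id, so packets of one flow stay ordered; enabling hfsc collapses everything onto queue 0, leaving the other ifqs configured but unused.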
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
Revision tags: OPENBSD_7_2_BASE
|
#
1.46 |
|
30-Apr-2022 |
bluhm |
Run IP input and forwarding with shared netlock. Also distribute packets from the interface receive rings into multiple net task queues. Note that we still have only one softnet task. So there will be no concurrency yet, but we can notice wrong exclusive lock assertions. Soon the final step will be to increase the NET_TASKQ define. lots of testing Hrvoje Popovski; OK sashan@
|
Revision tags: OPENBSD_7_1_BASE
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0 packets: 2292 packets bytes: 229846 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets
em0:0:txq:0 packets: 1297 packets bytes: 193413 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets maxqlen: 511 packets oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate it's conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performanace bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialiazed.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input uses the traditional backpressure or defense mechanism and counts packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
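the bundling decision above can be modelled in userland C. `ifq_start_model`, its struct, and the threshold constant are hypothetical stand-ins for the real ifq_start code, under the assumption stated in the commit (bundle at least 4 packets, else defer to a softnet task):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * hypothetical model of the tx mitigation decision: once at least
 * IFQ_MIN_BUNDLE packets are queued, the driver start routine is
 * called directly; otherwise the work is left to a deferred task.
 */
#define IFQ_MIN_BUNDLE 4

struct ifq_model {
	unsigned int len;	/* packets currently queued */
	bool task_scheduled;	/* deferred start pending in a softnet tq */
	unsigned int starts;	/* direct calls into the driver */
};

void
ifq_start_model(struct ifq_model *ifq)
{
	if (ifq->len >= IFQ_MIN_BUNDLE) {
		ifq->starts++;		/* post the bundle to hardware now */
		ifq->len = 0;
		ifq->task_scheduled = false;
	} else
		ifq->task_scheduled = true;	/* too few packets: defer */
}
```

the point of the threshold is that one expensive producer ring update covers several packets instead of one.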
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
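the defer-then-drain pattern behind ifq_mfreem can be sketched in userland C; the `mbuf_model` struct and both function names are hypothetical, and the real code does the draining at the end of the ifq serialiser rather than via an explicit call:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * hypothetical model of ifq_mfreem(): a backend never frees a packet
 * while the queue lock is held; it chains the packet onto a free list
 * that is drained only after the dequeue operation lets the lock go.
 */
struct mbuf_model { struct mbuf_model *next; };

struct ifq_free_model {
	struct mbuf_model *free_list;
	unsigned int len;	/* queue length, reduced as packets are doomed */
};

void
ifq_mfreem_model(struct ifq_free_model *ifq, struct mbuf_model *m)
{
	/* called with the "lock" held: just defer, never free here */
	ifq->len--;
	m->next = ifq->free_list;
	ifq->free_list = m;
}

unsigned int
ifq_free_drain(struct ifq_free_model *ifq)
{
	struct mbuf_model *m;
	unsigned int freed = 0;

	/* runs after the lock has been dropped */
	while ((m = ifq->free_list) != NULL) {
		ifq->free_list = m->next;
		free(m);
		freed++;
	}
	return freed;
}
```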
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate depth of an outgoing queue is exceeded, we attempt to find a non-empty queue of packets with lower priority than the packet we're trying to enqueue; if there is such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
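the victim-selection part of this policy can be sketched as a small userland C function; `priq_victim` and `NPRIO_MODEL` are hypothetical names, assuming the queue-full case has already been detected:

```c
#include <assert.h>

#define NPRIO_MODEL 8	/* assumed number of priority levels */

/*
 * hypothetical model of the priq enqueue policy when the aggregate
 * queue is full: find a non-empty queue with lower priority than the
 * incoming packet and report it as the drop victim.  returns the
 * victim priority, or -1 when the incoming packet itself must go.
 */
int
priq_victim(const unsigned int qlen[NPRIO_MODEL], int prio)
{
	int p;

	/* scan from the lowest priority up to (not including) prio */
	for (p = 0; p < prio; p++) {
		if (qlen[p] > 0)
			return p;	/* drop the head of this queue */
	}
	return -1;
}
```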
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
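the index-based queue selection described above amounts to reducing a per-packet value to an array index; this is a hypothetical sketch (the real conditioner decides the mapping, and flow-to-queue hashing details are an assumption here):

```c
#include <assert.h>

/*
 * hypothetical model of transmit queue selection: reduce a packet's
 * flow id to an index into the interface's array of nqueues ifqs, so
 * packets of one flow always land on the same queue and keep their
 * order.
 */
unsigned int
ifq_pick_model(unsigned int flowid, unsigned int nqueues)
{
	return flowid % nqueues;
}
```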
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.46 |
|
30-Apr-2022 |
bluhm |
Run IP input and forwarding with shared netlock. Also distribute packets from the interface receive rings into multiple net task queues. Note that we still have only one softnet task. So there will be no concurrency yet, but we can notice wrong exclusive lock assertions. Soon the final step will be to increase the NET_TASKQ define. lots of testing Hrvoje Popovski; OK sashan@
|
Revision tags: OPENBSD_7_1_BASE
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0
 packets: 2292 packets
 bytes: 229846 bytes
 qdrops: 0 packets
 errors: 0 packets
 qlen: 0 packets
em0:0:txq:0
 packets: 1297 packets
 bytes: 193413 bytes
 qdrops: 0 packets
 errors: 0 packets
 qlen: 0 packets
 maxqlen: 511 packets
 oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate its conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performanace bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take each packet and call ifiq_input against it. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
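the pressure scheme above can be modelled in userland C. the struct, function names, and the threshold value are hypothetical (the real threshold later became a tunable), assuming the driver side and the softnet side are the two callers:

```c
#include <assert.h>

/*
 * hypothetical model of the ifiq pressure scheme: each driver enqueue
 * bumps a pressure counter, each stack run clears it, and enqueues are
 * dropped once pressure crosses a threshold, i.e. once the stack has
 * failed to keep up for several interrupts in a row.
 */
#define IFIQ_PRESSURE_DROP 8	/* assumed threshold */

struct ifiq_model {
	unsigned int pressure;		/* unserviced enqueues so far */
	unsigned long long qdrops;
};

/* driver side: returns 1 if the burst of npkts should be dropped */
int
ifiq_input_model(struct ifiq_model *q, unsigned int npkts)
{
	if (++q->pressure >= IFIQ_PRESSURE_DROP) {
		q->qdrops += npkts;
		return 1;
	}
	return 0;
}

/* stack side: softnet processed the queue, so pressure resets */
void
ifiq_process_model(struct ifiq_model *q)
{
	q->pressure = 0;
}
```

the driver can feed the return value into its rx ring moderation, letting the hardware drop instead of software.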
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
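the peek-without-consuming behaviour can be sketched in userland C; the structs and `ifq_hdatalen_model` are hypothetical stand-ins, with the begin/rollback steps of the real API noted in comments:

```c
#include <assert.h>
#include <stddef.h>

struct pkt_model { size_t datalen; };

struct ifq_head_model {
	struct pkt_model *head;	/* NULL when the queue is empty */
};

/*
 * hypothetical model of ifq_hdatalen(): look at the head packet and
 * return its length, or 0 for an empty queue.  the dequeue is rolled
 * back, so the packet is still queued for the subsequent real read.
 */
size_t
ifq_hdatalen_model(struct ifq_head_model *ifq)
{
	struct pkt_model *m = ifq->head;	/* ifq_deq_begin() */

	if (m == NULL)
		return 0;
	/* ifq_deq_rollback(): m stays on the queue */
	return m->datalen;
}
```

a caller like tun/tap uses the returned length to size its read before actually dequeuing the packet.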
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatability wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.45 |
|
18-Jan-2022 |
dlg |
return EIO, not ENXIO, when the interface underneath ifq_deq_sleep dies.
this is consistent with other drivers when they report their underlying device being detached.
|
Revision tags: OPENBSD_7_0_BASE
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0 packets: 2292 packets bytes: 229846 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets em0:0:txq:0 packets: 1297 packets bytes: 193413 bytes qdrops: 0 packets errors: 0 packets qlen: 0 packets maxqlen: 511 packets oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate it's conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
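the fixed pattern (peek at the head packet, then put it back with a rollback rather than consuming it with a commit) can be sketched with a toy queue. everything below is illustrative; the names and the begin/rollback/commit trio only mimic the shape of the real ifq API.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
	int len;
};

struct toyq {
	struct pkt *head;
};

/* look at the head without removing it; the real code would also
 * take the ifq mutex here */
static struct pkt *
toyq_deq_begin(struct toyq *q)
{
	return (q->head);
}

/* abandon the dequeue: the head was never moved, nothing to undo */
static void
toyq_deq_rollback(struct toyq *q, struct pkt *p)
{
	(void)q;
	(void)p;
}

/* finish the dequeue: actually remove the packet */
static void
toyq_deq_commit(struct toyq *q, struct pkt *p)
{
	q->head = p->next;
}

/* the fixed hdatalen: peek, then rollback so the packet is still
 * there when the caller comes back to read it */
static int
toyq_hdatalen(struct toyq *q)
{
	struct pkt *p;
	int len = 0;

	p = toyq_deq_begin(q);
	if (p != NULL) {
		len = p->len;
		toyq_deq_rollback(q, p);
	}
	return (len);
}
```

the bug was exactly the difference between the last two helpers: committing after the peek made the packet vanish before the read.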
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesn't have to.
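the pressure mechanism described above boils down to a counter that rises on each input call and is cleared each time the stack drains the queue. a minimal single-threaded sketch, with a made-up threshold and made-up names standing in for the real ifiq code:

```c
#include <assert.h>

/* made-up threshold: how many un-drained input calls before we
 * decide the stack is falling behind and start dropping */
#define PRESSURE_DROP 8

struct toy_ifiq {
	unsigned int pressure;	/* input calls since last process */
	unsigned int drops;
};

/* called from the driver for each batch of packets it enqueues;
 * returns 1 when the batch would be dropped */
static int
toy_ifiq_input(struct toy_ifiq *q)
{
	if (++q->pressure >= PRESSURE_DROP) {
		q->drops++;
		return (1);	/* signal the driver to back off */
	}
	return (0);
}

/* called when the net taskq runs the stack over the queue */
static void
toy_ifiq_process(struct toy_ifiq *q)
{
	q->pressure = 0;	/* the stack caught up */
}
```

the return value is the hook the last paragraph talks about: a driver doing rx ring moderation can shrink its ring when input starts returning 1.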
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously haven't needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifq's counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
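the mitigation decision itself is small: run the driver immediately once the threshold is reached, otherwise leave the queued packets for the deferred task. a sketch with stand-in names (only the threshold of 4 comes from the commit; the rest is illustrative):

```c
#include <assert.h>

#define TX_MIT_THRESHOLD 4	/* from the commit: at least 4 packets */

struct toy_txq {
	unsigned int qlen;	/* packets waiting on the ifq */
	unsigned int starts;	/* immediate driver start calls */
	unsigned int deferrals;	/* times we left it to the softnet task */
};

/* stand-in for the replaced ifq_start: batch small bursts, run
 * the driver directly for bigger ones */
static void
toy_ifq_start(struct toy_txq *q)
{
	if (q->qlen >= TX_MIT_THRESHOLD) {
		q->starts++;	/* would call the driver start routine */
		q->qlen = 0;	/* driver drains the queue */
	} else
		q->deferrals++;	/* would task_add() on the softnet tq */
}
```

the deferred task eventually runs the driver too, so small bursts are delayed rather than lost; the win is fewer (expensive) producer ring updates per packet.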
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
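the defer-the-free pattern is simple to model: doomed packets are parked on a free list while the lock is held, and actually released only after the lock is dropped. this toy version is illustrative only (an int flag stands in for the ifq mutex, and the names are made up):

```c
#include <assert.h>
#include <stdlib.h>

struct toy_pkt {
	struct toy_pkt *next;
};

struct toy_deqq {
	int locked;		/* stand-in for the ifq mutex */
	struct toy_pkt *freelist;
};

/* park a packet for later freeing; legal only while "holding"
 * the lock, mirroring ifq_mfreem's use inside dequeue */
static void
toy_mfreem(struct toy_deqq *q, struct toy_pkt *p)
{
	assert(q->locked);
	p->next = q->freelist;
	q->freelist = p;
}

/* release everything that was parked; runs after the lock is
 * dropped, so free() never happens under the mutex */
static int
toy_free_deferred(struct toy_deqq *q)
{
	struct toy_pkt *p, *np;
	int n = 0;

	assert(!q->locked);
	for (p = q->freelist; p != NULL; p = np) {
		np = p->next;
		free(p);
		n++;
	}
	q->freelist = NULL;
	return (n);
}
```

this keeps the "never free an mbuf while holding a lock" invariant the entry describes while still letting the backend discard packets mid-dequeue.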
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded, we attempt to find a non-empty queue of packets with lower priority than the priority of the packet we're trying to enqueue, and if there's such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
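the policy can be sketched with per-priority depth counters: when the aggregate limit is hit, scan for a non-empty strictly-lower-priority queue and drop its head to make room. all sizes and names below are illustrative, not the real priq implementation:

```c
#include <assert.h>

#define NPRIO	8	/* priority levels, 0 lowest */
#define MAXQLEN	4	/* made-up aggregate depth limit */

struct toy_priq {
	unsigned int qlen[NPRIO];	/* depth per priority */
	unsigned int total;
	unsigned int drops;
};

/* returns 1 if the packet was enqueued, 0 if it was rejected */
static int
toy_priq_enq(struct toy_priq *pq, unsigned int prio)
{
	unsigned int p;

	if (pq->total >= MAXQLEN) {
		/* look for a strictly lower-priority victim */
		for (p = 0; p < prio; p++) {
			if (pq->qlen[p] > 0) {
				pq->qlen[p]--;	/* drop its head */
				pq->total--;
				pq->drops++;
				break;
			}
		}
		if (pq->total >= MAXQLEN)
			return (0);	/* nothing lower to drop */
	}
	pq->qlen[prio]++;
	pq->total++;
	return (1);
}
```

note the asymmetry the entry describes: a high priority packet can evict bulk traffic, but a low priority packet arriving at a full queue of higher priority traffic is simply rejected.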
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit, making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isn't set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
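the shape of that rework (one lock guarding both the pending-work list and the "somebody is running it" flag) can be shown in a small single-threaded sketch. the "mutex" here is just an asserted flag so the example stays self-contained, and all names are made up; the real code uses a kernel mutex and struct task.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
	struct toy_task *next;
	int *count;		/* incremented when the task "runs" */
};

struct toy_srl {
	int mtx;		/* stand-in for the single mutex */
	struct toy_task *work;	/* pending work, protected by mtx */
	int running;		/* is someone draining it? also by mtx */
};

static void
toy_lock(struct toy_srl *s)   { assert(s->mtx == 0); s->mtx = 1; }
static void
toy_unlock(struct toy_srl *s) { assert(s->mtx == 1); s->mtx = 0; }

/* queue work and, if nobody is running the list, drain it.  an
 * uncontended pass costs one lock/unlock pair per task run rather
 * than several interlocked ops. */
static void
toy_serialise(struct toy_srl *s, struct toy_task *t)
{
	struct toy_task *w;

	toy_lock(s);
	t->next = s->work;
	s->work = t;
	if (s->running) {
		/* whoever set running will run our work too */
		toy_unlock(s);
		return;
	}
	s->running = 1;
	while ((w = s->work) != NULL) {
		s->work = w->next;
		toy_unlock(s);
		(*w->count)++;	/* run the work with the lock dropped */
		toy_lock(s);
	}
	s->running = 0;
	toy_unlock(s);
}
```

because the flag and the list share one lock, there is no window where a task is queued but no runner sees it, which is the property the serialiser needs.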
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.44 |
|
09-Jul-2021 |
dlg |
ifq_hdatalen can return 0 if ifq_empty is true, which avoids locks.
|
Revision tags: OPENBSD_6_9_BASE
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0
        packets: 2292 packets
          bytes: 229846 bytes
         qdrops: 0 packets
         errors: 0 packets
           qlen: 0 packets
em0:0:txq:0
        packets: 1297 packets
          bytes: 193413 bytes
         qdrops: 0 packets
         errors: 0 packets
           qlen: 0 packets
        maxqlen: 511 packets
        oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there's some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate it's conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performanace bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialiazed.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input uses the traditional backpressure or defense mechanism and counts packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded, we attempt to find a non-empty queue of packets with a lower priority than that of the packet we're trying to enqueue, and if there is such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue, and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via the "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
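the victim-selection part of the policy above can be sketched as a scan from the lowest priority up. this is a hedged userspace model: qlens[] stands in for the per-priority mbuf_lists and all names are hypothetical, not the kernel's:

```c
#include <assert.h>

/* index 0 is the lowest priority level */
#define PRIQ_NQUEUES 8

/*
 * when the aggregate depth is exceeded, pick the lowest-priority
 * non-empty queue below the incoming packet's priority. returns
 * the victim priority level, or -1 if nothing lower holds a packet
 * (in which case the incoming packet itself is dropped).
 */
static int
priq_find_victim(const unsigned int qlens[PRIQ_NQUEUES], int prio)
{
	int p;

	for (p = 0; p < prio; p++) {
		if (qlens[p] > 0)
			return (p);
	}
	return (-1);
}
```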
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit, making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasn't even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isn't set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified between when a start routine discovers there's no space left on a ring and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
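the shape of the restart handshake above can be modelled in a few lines. this is a sketch with hypothetical names, not the kernel code: the point is that clearing oactive and rerunning start happen together under the same serialisation, so a racing txeof cannot leave the queue marked oactive with no further start calls coming:

```c
#include <assert.h>

struct txq_model {
	int oactive;	/* queue marked "output active" (ring full) */
	int starts;	/* times the start routine has run */
};

static void
txq_start(struct txq_model *q)
{
	q->starts++;
}

/* serialized restart: clear oactive, then kick start again */
static void
txq_restart(struct txq_model *q)
{
	q->oactive = 0;
	txq_start(q);
}
```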
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.43 |
|
20-Feb-2021 |
dlg |
default interfaces to bpf_mtap_ether for their if_bpf_mtap handler.
call (*ifp->if_bpf_mtap) instead of bpf_mtap_ether in ifiq_input and if_vinput.
|
#
1.42 |
|
20-Feb-2021 |
dlg |
add a MONITOR flag to ifaces to say they're only used for watching packets.
an example use of this is when you have a span port on a switch and you want to be able to see the packets coming out of it with tcpdump, but do not want these packets to enter the network stack for processing. this is particularly important if the span port is pushing a copy of any packets related to the machine doing the monitoring as it will confuse pf states and the stack.
ok benno@
|
Revision tags: OPENBSD_6_8_BASE
|
#
1.41 |
|
07-Jul-2020 |
dlg |
add kstats for rx queues (ifiqs) and transmit queues (ifqs).
this means you can observe what the network stack is trying to do when it's working with a nic driver that supports multiple rings. a nic with only one set of rings still gets queues though, and this still exports their stats.
here is a small example of what kstat(8) currently outputs for these stats:
em0:0:rxq:0
	packets: 2292 packets
	bytes: 229846 bytes
	qdrops: 0 packets
	errors: 0 packets
	qlen: 0 packets
em0:0:txq:0
	packets: 1297 packets
	bytes: 193413 bytes
	qdrops: 0 packets
	errors: 0 packets
	qlen: 0 packets
	maxqlen: 511 packets
	oactive: false
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16 bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there are some reports that pppx or pipex relies on some implicit locking that it shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate its conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs is kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
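the fix above hinges on the deq_begin/rollback/commit pattern: begin lets you look at the head packet, rollback leaves it queued, commit consumes it. a minimal userspace model of that pattern; struct pkt and the q_* helpers are hypothetical stand-ins for mbufs and the ifq_deq_* API, not the real signatures:

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
	int len;
};

struct q_model {
	struct pkt *head;
};

/* begin a dequeue: look at the head without unlinking it */
static struct pkt *
q_deq_begin(struct q_model *q)
{
	return (q->head);
}

/* rollback: the packet stays queued; nothing to undo in this model */
static void
q_deq_rollback(struct q_model *q, struct pkt *p)
{
	(void)q;
	(void)p;
}

/* commit: actually remove the packet from the queue */
static void
q_deq_commit(struct q_model *q, struct pkt *p)
{
	q->head = p->next;
}

/* ifq_hdatalen analogue: report the head packet's length, keep it */
static int
q_hdatalen(struct q_model *q)
{
	struct pkt *p;
	int len = 0;

	p = q_deq_begin(q);
	if (p != NULL) {
		len = p->len;
		q_deq_rollback(q, p);	/* rollback, not commit */
	}
	return (len);
}
```

with commit instead of rollback, the second peek would find an empty queue, which is exactly the tun/tap bug described above.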
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. they're being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because they're already running inside the stack.
i'm putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesn't have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and the normal ifq serialiser barrier to guarantee the start routine won't be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct won't get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot it was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesn't have to.
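the pressure accounting described above fits in a few lines. this is a hedged userspace model: the struct, function names, and the drop threshold value are all hypothetical (the commit doesn't state the threshold; the real one became a tunable), only the increment-on-input / clear-on-process shape comes from the text:

```c
#include <assert.h>

#define IFIQ_PRESSURE_DROP 8	/* assumed threshold, for illustration */

struct ifiq_model {
	unsigned int pressure;
};

/*
 * ifiq_input analogue: each call before the stack catches up raises
 * the pressure. returns 1 when the burst should be dropped.
 */
static int
ifiq_model_input(struct ifiq_model *q)
{
	if (++q->pressure >= IFIQ_PRESSURE_DROP)
		return (1);
	return (0);
}

/* ifiq_process analogue: the stack ran the queue, pressure resets */
static void
ifiq_model_process(struct ifiq_model *q)
{
	q->pressure = 0;
}
```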
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously haven't needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
i'm not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
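the aggregation step above (per-ring counters summed when ifdata is read) can be sketched simply. this is a model with hypothetical field and function names; the real struct ifiqueue carries more state and its counters are protected by a per-ring mutex:

```c
#include <assert.h>
#include <stdint.h>

struct ifiq_model {
	uint64_t ifiq_packets;
	uint64_t ifiq_bytes;
	uint64_t ifiq_qdrops;
};

/* sum each ring's counters into the interface-wide totals */
static void
if_getdata_model(const struct ifiq_model *rings, unsigned int nrings,
    uint64_t *packets, uint64_t *bytes, uint64_t *qdrops)
{
	unsigned int i;

	*packets = *bytes = *qdrops = 0;
	for (i = 0; i < nrings; i++) {
		*packets += rings[i].ifiq_packets;
		*bytes += rings[i].ifiq_bytes;
		*qdrops += rings[i].ifiq_qdrops;
	}
}
```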
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatability wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.40 |
|
17-Jun-2020 |
dlg |
make ph_flowid in mbufs 16bits by storing whether it's set in csum_flags.
i've been wanting to do this for a while, and now that we've got stoeplitz and it gives us 16 bits, it seems like the right time.
|
#
1.39 |
|
21-May-2020 |
dlg |
back out 1.38. some bits of the stack aren't ready for it yet.
mark patruck found significant packet drops with trunk(4), and there are some reports that pppx or pipex rely on some implicit locking that they shouldn't.
i can fix those without this diff being in the tree.
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate its conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs is kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you tried to read that many bytes it was gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
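the peek-without-consuming pattern described above can be sketched in userspace C. the structs and names here are illustrative stand-ins, not the kernel's mbuf or ifq code, and the real deq_begin also takes the ifq mutex:

```c
#include <assert.h>
#include <stddef.h>

/* minimal stand-ins for an mbuf and an ifq; illustrative only */
struct sketch_mbuf {
	int len;
	struct sketch_mbuf *next;
};

struct sketch_ifq {
	struct sketch_mbuf *head;
};

struct sketch_mbuf *
sketch_deq_begin(struct sketch_ifq *q)
{
	return q->head;		/* real code also grabs the ifq mutex */
}

void
sketch_deq_rollback(struct sketch_ifq *q, struct sketch_mbuf *m)
{
	(void)q; (void)m;	/* leave the packet on the queue */
}

/* like ifq_hdatalen: 0 if empty, else the head packet's length */
int
sketch_hdatalen(struct sketch_ifq *q)
{
	struct sketch_mbuf *m;
	int len = 0;

	m = sketch_deq_begin(q);
	if (m != NULL) {
		len = m->len;
		sketch_deq_rollback(q, m);	/* rollback, not commit */
	}
	return len;
}
```

the bug this commit fixes is exactly the last call: using a commit there consumed the packet, so a second read found nothing.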
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
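the deferral logic above can be sketched in userspace C. the names, the threshold macro, and the pretend-drain behaviour are made up for illustration; the real code queues a struct task on a network taskq:

```c
#include <assert.h>
#include <stdbool.h>

#define TX_BACKLOG	16	/* backlog size borrowed from dragonflybsd */

struct sketch_ifq {
	unsigned int len;	/* packets waiting on the ifq */
	bool task_scheduled;	/* stand-in for task_add(nettq, ...) */
};

/* returns true if the start routine ran directly */
bool
sketch_ifq_start(struct sketch_ifq *q)
{
	if (q->len >= TX_BACKLOG) {
		q->len = 0;		/* pretend start drained the queue */
		return true;
	}
	q->task_scheduled = true;	/* defer to the network taskq */
	return false;
}
```

the point is that a short burst pays one deferred start instead of one ring update per packet, while a full backlog posts to the chip immediately.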
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
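the pressure counter described above can be sketched in userspace C. the names and the threshold value are illustrative, not the kernel's actual ifiq code:

```c
#include <assert.h>

#define PRESSURE_DROP	8	/* hypothetical drop threshold */

struct sketch_ifiq {
	unsigned int pressure;	/* ifiq_input calls since last process */
	unsigned int drops;
};

/* driver side: called once per interrupt with a list of packets */
int
sketch_input(struct sketch_ifiq *q)
{
	if (++q->pressure >= PRESSURE_DROP) {
		q->drops++;
		return 1;	/* tell the driver to back off */
	}
	return 0;
}

/* softnet side: the stack drained the queue, so pressure resets */
void
sketch_process(struct sketch_ifiq *q)
{
	q->pressure = 0;
}
```

so pressure only builds while the stack can't keep up; a single softnet run resets it, which is what smooths out the bursts.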
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
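the idea can be sketched with a flag that tracks membership on a taskq list; these are simplified stand-ins, not the kernel's struct task:

```c
#include <assert.h>
#include <stdbool.h>

/* sketch of a task that knows whether it is queued to run */
struct sketch_task {
	bool onqueue;
};

/* like task_pending: true while the task sits on a taskq list */
#define sketch_task_pending(t)	((t)->onqueue)

void
sketch_task_add(struct sketch_task *t)
{
	t->onqueue = true;
}

void
sketch_task_run(struct sketch_task *t)
{
	t->onqueue = false;	/* cleared when dequeued to execute */
}
```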
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifq's counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
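the length bookkeeping can be sketched in userspace C; the structs are illustrative stand-ins, and the real code frees the chain later, outside the ifq mutex:

```c
#include <assert.h>
#include <stddef.h>

/* minimal stand-ins for an mbuf chain and an ifq */
struct sketch_mbuf {
	struct sketch_mbuf *next;
};

struct sketch_ifq {
	struct sketch_mbuf *free_list;	/* freed later, without the lock */
	unsigned int len;
};

/*
 * like ifq_mfreeml: stash a chain of n packets on the free list
 * and subtract them from the queue length in the same step
 */
void
sketch_mfreeml(struct sketch_ifq *q, struct sketch_mbuf *chain,
    unsigned int n)
{
	struct sketch_mbuf *m = chain;

	while (m->next != NULL)
		m = m->next;
	m->next = q->free_list;
	q->free_list = chain;
	q->len -= n;
}
```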
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue, and if there's such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
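the policy reads cleanly as a small userspace sketch; per-priority packet counts stand in for the real mbuf_lists, and the limit is a made-up number:

```c
#include <assert.h>

#define NPRIO	4
#define MAXLEN	4	/* hypothetical aggregate queue limit */

struct sketch_priq {
	unsigned int qlen[NPRIO];	/* stand-ins for mbuf_lists */
	unsigned int total;
	unsigned int drops;
};

/* returns 0 on success, -1 if the new packet itself was dropped */
int
sketch_priq_enq(struct sketch_priq *pq, unsigned int prio)
{
	if (pq->total >= MAXLEN) {
		unsigned int p;

		/* look for a lower priority queue to steal space from */
		for (p = 0; p < prio; p++) {
			if (pq->qlen[p] > 0)
				break;
		}
		if (p == prio) {
			pq->drops++;	/* nothing lower: drop the new one */
			return -1;
		}
		pq->qlen[p]--;		/* drop that queue's head packet */
		pq->total--;
		pq->drops++;
	}
	pq->qlen[prio]++;
	pq->total++;
	return 0;
}
```

when the queue is full of equal or higher priority traffic, the new packet is the one dropped, which is the unchanged pre-commit behaviour.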
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
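the conditioner's job of returning an index into the ifq array can be sketched like this; both mappings here are invented for illustration, not the actual priq or hfsc code:

```c
#include <assert.h>

#define SKETCH_NPRIO	8	/* number of packet priorities */

/*
 * hypothetical priq-style conditioner: spread priorities across
 * however many ifqs the driver attached with if_attach_queues()
 */
unsigned int
sketch_priq_idx(unsigned int nqueues, unsigned int prio)
{
	return prio * nqueues / SKETCH_NPRIO;
}

/* hfsc-style conditioner: everything goes through the first ifq */
unsigned int
sketch_hfsc_idx(unsigned int nqueues, unsigned int prio)
{
	(void)nqueues; (void)prio;
	return 0;
}
```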
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
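the single-mutex serialiser can be sketched in userspace C; the fixed-size LIFO work array and all the names are simplifications of the real task-list code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* the work list and the running flag share one mutex */
struct sketch_serialiser {
	pthread_mutex_t mtx;
	void (*work[8])(void);		/* tiny fixed-size work list */
	unsigned int nwork;
	bool running;
};

static int sketch_ran;

static void
sketch_count(void)
{
	sketch_ran++;
}

void
sketch_serialise(struct sketch_serialiser *s, void (*fn)(void))
{
	pthread_mutex_lock(&s->mtx);
	s->work[s->nwork++] = fn;
	if (s->running) {
		/* whoever is already running the list will see it */
		pthread_mutex_unlock(&s->mtx);
		return;
	}
	s->running = true;
	while (s->nwork > 0) {
		void (*cur)(void) = s->work[--s->nwork];

		pthread_mutex_unlock(&s->mtx);
		cur();			/* run work with the mutex dropped */
		pthread_mutex_lock(&s->mtx);
	}
	s->running = false;
	pthread_mutex_unlock(&s->mtx);
}
```

an uncontended pass is one lock/unlock to enqueue-and-claim plus one around the work item, which is where the 5-to-2 reduction in interlocked ops comes from.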
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified between when a start routine discovers there's no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.38 |
|
20-May-2020 |
dlg |
defer calling !IFXF_MPSAFE driver start routines to the systq
this reuses the tx mitigation machinery, but instead of deferring some start calls to the nettq, it defers all calls to the systq. this is to avoid taking the KERNEL_LOCK while processing packets in the stack.
i've been running this in production for 6 or so months, and the start of a release is a good time to get more people trying it too.
ok jmatthew@
|
Revision tags: OPENBSD_6_7_BASE
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate it's conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs are kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performanace bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialiazed.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input uses the traditional backpressure or defense mechanism and counts packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
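a userland sketch of the defer-the-free pattern (all names here are made-up stand-ins for the kernel structures): while the lock is held, dropped packets are only linked onto a side list; the actual free runs later, outside the lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pkt {
	struct pkt *next;
};

struct toyq {
	struct pkt *free_list;	/* packets queued for later freeing */
};

/* called with the queue mutex held, like ifq_mfreem during dequeue */
void
toyq_mfreem(struct toyq *q, struct pkt *p)
{
	p->next = q->free_list;
	q->free_list = p;
}

/* called once the mutex is dropped, at the end of a dequeue op */
unsigned int
toyq_run_free(struct toyq *q)
{
	struct pkt *p = q->free_list, *next;
	unsigned int n = 0;

	q->free_list = NULL;
	for (; p != NULL; p = next) {
		next = p->next;
		free(p);	/* the free happens without the lock held */
		n++;
	}
	return n;
}
```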
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
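The victim selection can be sketched as follows (priq_drop_victim and the fixed count of 8 priority levels are illustrative assumptions, not the actual implementation): when the aggregate limit is hit, scan for a non-empty queue of strictly lower priority than the new packet.

```c
#include <assert.h>

#define NPRIO 8	/* assumed number of priority levels */

/*
 * returns the priority level to drop a packet from, or -1 when
 * there is nothing less important, so the new packet is dropped.
 */
int
priq_drop_victim(const unsigned int qlen[NPRIO], unsigned int newprio)
{
	unsigned int prio;

	for (prio = 0; prio < newprio; prio++) {
		if (qlen[prio] > 0)
			return (int)prio;	/* shed lower priority traffic */
	}
	return -1;	/* nothing lower priority is queued */
}
```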
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.37 |
|
10-Mar-2020 |
tobhe |
Make sure return value 'error' is initialized to '0'.
ok dlg@ deraadt@
|
#
1.36 |
|
25-Jan-2020 |
dlg |
tweaks sleeping for an mbuf so it's more mpsafe.
the stack puts an mbuf on the tun ifq, and ifqs protect themselves with a mutex. rather than invent another lock that tun can wrap these ifq ops with and also coordinate its conditionals (reading and dying) with, try and reuse the ifq mtx for the tun stuff too.
because ifqs are more special than tun, this adds a special ifq_deq_sleep to ifq code that tun can call. tun just passes the reading and dying variables to ifq to check, but the tricky stuff about ifqs is kept in the right place.
with this, tun_dev_read should be callable without the kernel lock.
|
Revision tags: OPENBSD_6_6_BASE
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
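a little sketch of why the rollback matters for a peek (the toyifq names are made up): a "headline" style peek must leave the packet on the queue, or the data is gone by the time the caller tries to read it.

```c
#include <assert.h>
#include <stddef.h>

struct toypkt {
	struct toypkt *next;
	int datalen;
};

struct toyifq {
	struct toypkt *head;
};

/* look at the head packet's length without consuming it */
int
toyifq_hdatalen(struct toyifq *q)
{
	struct toypkt *p = q->head;	/* like ifq_deq_begin() */

	if (p == NULL)
		return 0;
	/* like ifq_deq_rollback(): the packet stays on the queue */
	return p->datalen;
}

/* actually consume the head packet, ie, what ifq_deq_commit implies */
struct toypkt *
toyifq_deq(struct toyifq *q)
{
	struct toypkt *p = q->head;

	if (p != NULL)
		q->head = p->next;
	return p;
}
```

the bug was effectively using the consuming variant in the peek, so the follow-up read found an empty queue.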
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of its backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
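the counter mechanics can be sketched like this (the toyifiq names and the threshold of 8 are assumptions for illustration, not the real values): every input bumps the pressure, every process run clears it, so the counter measures how far the stack lags the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define PRESSURE_DROP 8	/* assumed drop threshold */

struct toyifiq {
	unsigned int pressure;
};

/* driver side: returns true when the stack is busy and we should drop */
bool
toyifiq_input(struct toyifiq *q)
{
	q->pressure++;
	return q->pressure >= PRESSURE_DROP;
}

/* softnet side: the stack caught up, so pressure goes back to zero */
void
toyifiq_process(struct toyifiq *q)
{
	q->pressure = 0;
}
```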
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatability wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.35 |
|
08-Oct-2019 |
dlg |
back out the use of ifiq pressure, and go back to using a packet count.
the pressure thresholds were too low in a lot of situations, and still produced hard to understand interactions at high thresholds. until we understand the numbers better, and for release, we're going back to counting the length of the per interface input queues.
this was originally based on a report of bad tcp performance with em(4) by mlarkin, but is very convincingly demonstrated by a bunch of work procter@ has been doing. deraadt@ is keen on the pressure backout so he can cut a release.
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
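as an illustration of the begin/rollback vs begin/commit distinction, here's a small userland C model; the queue type and function names are made up for the sketch and are not the real ifq API:

```c
/*
 * Toy model of peeking at the head of a queue. A plain ring of ints
 * stands in for the mbuf list; names are illustrative only.
 */
struct peekq {
	int	data[8];	/* packet lengths */
	int	head;
	int	len;
};

/* like ifq_hdatalen: report the head packet's size without taking it */
int
peekq_hdatalen(struct peekq *q)
{
	if (q->len == 0)
		return (0);
	/* deq_begin: look at the head... */
	int datalen = q->data[q->head];
	/* ...deq_rollback: leave it on the queue (commit would remove it) */
	return (datalen);
}

/* deq_begin + deq_commit: actually remove the head packet */
int
peekq_dequeue(struct peekq *q)
{
	if (q->len == 0)
		return (0);
	int datalen = q->data[q->head];
	q->head = (q->head + 1) % 8;
	q->len--;
	return (datalen);
}
```

using commit in the peek path, as the bug above did, would make the second lookup fail because the packet is gone.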
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version were from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performance bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialized.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against these packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot it was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
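the pressure mechanism described above can be sketched in plain C; the struct layout, function names, and the threshold value here are illustrative assumptions, not the kernel's actual code:

```c
/*
 * Sketch of the ifiq_pressure idea: count how many times the driver
 * has queued input since softnet last ran the queue, and shed load
 * once that count gets high. The threshold is a made-up value.
 */
#define IFIQ_PRESSURE_DROP 8	/* hypothetical "stack is busy" level */

struct ifiq_sketch {
	unsigned int	 ifiq_pressure;	/* inputs since last process */
	unsigned int	 ifiq_qdrops;
	unsigned int	 ifiq_packets;
};

/* driver side: returns nonzero if the packet burst should be dropped */
int
ifiq_input_sketch(struct ifiq_sketch *ifiq, unsigned int npkts)
{
	if (++ifiq->ifiq_pressure >= IFIQ_PRESSURE_DROP) {
		ifiq->ifiq_qdrops += npkts;
		return (1);	/* stack is busy, shed load */
	}
	ifiq->ifiq_packets += npkts;
	return (0);
}

/* softnet side: the stack ran the queue, so the pressure is relieved */
void
ifiq_process_sketch(struct ifiq_sketch *ifiq)
{
	ifiq->ifiq_pressure = 0;
}
```

the return value from the input side is what a driver could feed back into rx ring moderation.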
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
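the per-ring counter aggregation described above can be modelled with a few lines of C; the struct and function names here are assumptions for the sketch, not the kernel's:

```c
/*
 * Each rx ring keeps its own input counters; a read for userland
 * (the ifdata path) sums them. Four rings is an arbitrary choice.
 */
#define NRINGS 4

struct ifiq_ctrs {
	unsigned long long	ipackets;
	unsigned long long	ibytes;
	unsigned long long	iqdrops;
};

/* aggregate one counter set per ring into a single total */
void
ifiq_getdata_sketch(const struct ifiq_ctrs rings[NRINGS],
    struct ifiq_ctrs *total)
{
	int i;

	total->ipackets = total->ibytes = total->iqdrops = 0;
	for (i = 0; i < NRINGS; i++) {
		total->ipackets += rings[i].ipackets;
		total->ibytes += rings[i].ibytes;
		total->iqdrops += rings[i].iqdrops;
	}
}
```

because each ring only writes its own set, updates from different rings can't be lost to each other the way a single shared counter could lose them.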
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
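the batching rule above can be reduced to a tiny model: call the driver's start routine once 4 packets are queued, otherwise defer. the names and counters here are illustrative, not the real ifq code:

```c
/*
 * Model of tx mitigation: below the burst threshold, work is handed
 * to a softnet task; at or above it, start runs immediately.
 */
#define TXQ_MIN_BURST 4	/* threshold from the commit message */

struct txq_sketch {
	unsigned int	len;		/* packets currently queued */
	unsigned int	starts;		/* direct start-routine calls */
	unsigned int	deferred;	/* enqueues that left it to the task */
};

void
txq_enqueue_sketch(struct txq_sketch *q)
{
	q->len++;
	if (q->len >= TXQ_MIN_BURST)
		q->starts++;	/* would call the driver start routine */
	else
		q->deferred++;	/* would rely on a softnet tq task */
}
```

the win comes from posting one producer ring update to the chip for several packets instead of one per packet.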
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
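the deferred-free pattern behind ifq_mfreem can be sketched in userland C: packets culled while the "lock" is held go onto a local list and are only freed afterwards. the types and names are made up for the sketch:

```c
#include <stdlib.h>

struct pkt {
	struct pkt	*next;
};

struct freelist {
	struct pkt	*head;
	int		 n;
};

/* called while the queue mutex is "held": just stash the packet */
void
ifq_mfreem_sketch(struct freelist *fl, struct pkt *p)
{
	p->next = fl->head;
	fl->head = p;
	fl->n++;
}

/* called after the "lock" is dropped: now it is safe to free */
int
ifq_free_run(struct freelist *fl)
{
	struct pkt *p, *np;
	int n = 0;

	for (p = fl->head; p != NULL; p = np) {
		np = p->next;
		free(p);
		n++;
	}
	fl->head = NULL;
	fl->n = 0;
	return (n);
}
```

splitting the cull from the free is what keeps free() (and anything it might sleep or lock on) out of the mutex-protected section.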
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded, we attempt to find a non-empty queue of packets with lower priority than the packet we're trying to enqueue, and if there's such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
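the drop policy above can be sketched with per-priority depth counters; the sizes and names are assumptions for the sketch, not the kernel priq:

```c
/*
 * Toy priq: NPRIO priority levels sharing one aggregate limit. On
 * overflow, the head of a non-empty lower-priority queue is dropped
 * to make room; if none exists, the new packet itself is dropped.
 */
#define NPRIO	4
#define QLIMIT	4

struct priq_sketch {
	int	len[NPRIO];	/* per-priority queue depths */
	int	total;		/* aggregate depth */
};

/*
 * Returns the priority whose packet was dropped to make room, the
 * new packet's own priority if nothing lower could be dropped, or
 * -1 if no drop was needed.
 */
int
priq_enq_sketch(struct priq_sketch *pq, int prio)
{
	int p;

	if (pq->total >= QLIMIT) {
		/* look for a non-empty queue of lower priority */
		for (p = 0; p < prio; p++) {
			if (pq->len[p] > 0) {
				pq->len[p]--;		/* drop its head */
				pq->len[prio]++;	/* take its slot */
				return (p);
			}
		}
		return (prio);	/* drop the packet being enqueued */
	}
	pq->len[prio]++;
	pq->total++;
	return (-1);
}
```

note the aggregate total is unchanged on a displacement drop: one packet leaves, one enters.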
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit, making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified between when a start routine discovers there's no space left on a ring and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.34 |
|
16-Aug-2019 |
dlg |
ifq_hdatalen should keep the mbuf it's looking at, not leak it.
ie, use ifq_deq_rollback after looking at the head mbuf instead of ifq_deq_commit.
this is used in tun/tap, where it had the effect that you'd get the datalen for the packet, and then when you try to read that many bytes it had gone. cool and normal.
this was found by a student who was trying to do just that. i've always just read the packet into a large buffer.
|
#
1.33 |
|
03-Jul-2019 |
dlg |
add the kernel side of net.link.ifrxq.pressure_return and pressure_drop
these values are used as the backpressure thresholds in the interface rx q processing code. theyre being exposed as tunables to userland while we are figuring out what the best values for them are.
ok visa@ deraadt@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performanace bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialiazed.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue, and if there's such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialiazed.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input uses the traditional backpressure or defense mechanism and counts packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatability wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.32 |
|
01-Jul-2019 |
dlg |
reintroduce ifiq_input counting backpressure
instead of counting the number of packets on an ifiq, count the number of times a nic has tried to queue packets before the stack processes them.
this new semantic interacted badly with virtual interfaces like vlan and trunk, but these interfaces have been tweaked to call if_vinput instead of if_input so their packets are processed directly because theyre already running inside the stack.
im putting this in so we can see what the effect is. if it goes badly i'll back it out again.
ok cheloha@ proctor@ visa@
|
#
1.31 |
|
16-Apr-2019 |
dlg |
have another go at tx mitigation
the idea is to call the hardware transmit routine less since in a lot of cases posting a producer ring update to the chip is (very) expensive. it's better to do it for several packets instead of each packet, hence calling this tx mitigation.
this diff defers the call to the transmit routine to a network taskq, or until a backlog of packets has built up. dragonflybsd uses 16 as the size of it's backlog, so i'm copying them for now.
i've tried this before, but previous versions caused deadlocks. i discovered that the deadlocks in the previous version was from ifq_barrier calling taskq_barrier against the nettq. interfaces generally hold NET_LOCK while calling ifq_barrier, but the tq might already be waiting for the lock we hold.
this version just doesnt have ifq_barrier call taskq_barrier. it instead relies on the IFF_RUNNING flag and normal ifq serialiser barrier to guarantee the start routine wont be called when an interface is going down. the taskq_barrier is only used during interface destruction to make sure the task struct wont get used in the future, which is already done without the NET_LOCK being held.
tx mitigation provides a nice performanace bump in some setups. up to 25% in some cases.
tested by tb@ and hrvoje popovski (who's running this in production). ok visa@
|
Revision tags: OPENBSD_6_5_BASE
|
#
1.30 |
|
29-Mar-2019 |
dlg |
while here, drop ifq_is_serialized and IFQ_ASSERT_SERIALIZED
nothing uses them, and they can generate false positives if the serialiser is running at a lower IPL on the same cpu as a call to ifq_is_serialiazed.
|
#
1.29 |
|
29-Mar-2019 |
dlg |
deprecate ifiq_barrier.
drivers don't need to call it because the stack runs work in ifiqs. again, only the stack has to care about waiting for pending work when shutting down, not drivers. ifiq_destroy already does a task_del and task_barrier dance, so we don't need ifiq_barrier.
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input used the traditional backpressure or defense mechanism and counted packets to decide when to shed load by dropping. it ended up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesn't have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs since 2016, but obviously haven't needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
i'm putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
i'm not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifq's counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the driver's start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops aren't necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasn't even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isn't set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified between when a start routine discovers there's no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.28 |
|
04-Mar-2019 |
dlg |
move back to ifiq_input counting packets instead of queue operations.
the backpressure seems to have kicked in too early, introducing a lot of packet loss where there wasn't any before. secondly, counting operations interacted extremely badly with pseudo-interfaces. for example, if you have a physical interface that rxes 100 vlan encapsulated packets, it will call ifiq_input once for all 100 packets. when the network stack is running vlan_input against thes packets, vlan_input will take the packet and call ifiq_input against each of them. because the stack is running packets on the parent interface, it can't run the packets on the vlan interface, so you end up with ifiq_input being called 100 times, and we dropped packets after 16 calls to ifiq_input without a matching run of the stack.
chris cappuccio hit some weird stuff too.
discussed with claudio@
|
#
1.27 |
|
04-Mar-2019 |
dlg |
don't need to initialise qdrops twice when setting up ifqs and ifiqs.
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input uses the traditional backpressure or defense mechanism and counts packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
i'm putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
i'm not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
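the aggregation step this commit describes (per-ring counters summed when ifdata is read) can be sketched like so; the struct and function names here are illustrative, not the kernel's:

```c
#include <assert.h>

/* independent counters kept by each input ring */
struct ifiq_sim {
	unsigned long long ifiq_packets;
	unsigned long long ifiq_bytes;
	unsigned long long ifiq_qdrops;
};

/* the aggregate view handed to userland */
struct ifdata_sim {
	unsigned long long ifi_ipackets;
	unsigned long long ifi_ibytes;
	unsigned long long ifi_iqdrops;
};

void
if_getdata_sim(const struct ifiq_sim *rings, int nrings,
    struct ifdata_sim *data)
{
	int i;

	/* in the kernel each ring's counters would be read under
	 * that ring's own mutex; here we just sum them */
	for (i = 0; i < nrings; i++) {
		data->ifi_ipackets += rings[i].ifiq_packets;
		data->ifi_ibytes += rings[i].ifiq_bytes;
		data->ifi_iqdrops += rings[i].ifiq_qdrops;
	}
}
```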
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifq's counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the driver's start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
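the policy described here (start the driver directly once 4 packets are bundled, otherwise leave a deferred task for softnet) can be sketched in userland; all names below are illustrative, and the real code uses the ifq mutex and a softnet taskq rather than bare flags:

```c
#include <assert.h>

#define IFQ_MIN_BUNDLE	4

struct ifq_sim {
	int len;		/* packets currently queued */
	int task_pending;	/* deferred start scheduled on "softnet" */
	int starts;		/* times the driver start routine ran */
};

static void
driver_start(struct ifq_sim *q)
{
	q->starts++;
	q->len = 0;	/* pretend the driver drained the queue */
}

/* enqueue one packet and apply the mitigation policy */
void
ifq_start_sim(struct ifq_sim *q)
{
	q->len++;
	if (q->len >= IFQ_MIN_BUNDLE)
		driver_start(q);	/* enough work bundled: start now */
	else
		q->task_pending = 1;	/* too little: defer to softnet */
}

/* the softnet task: starts the driver for any leftover packets */
void
softnet_run(struct ifq_sim *q)
{
	if (q->task_pending) {
		q->task_pending = 0;
		if (q->len > 0)
			driver_start(q);
	}
}
```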
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops aren't necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
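the deferred-free idea behind this can be sketched as follows: packets removed during dequeue are only collected on a list while the "lock" is held, and actually released afterwards. the names and the userland malloc/free here are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

struct pkt {
	struct pkt *p_next;
};

/* packets waiting to be freed once the queue mutex is released */
struct deadlist {
	struct pkt *head;
};

/* called while the queue mutex is held: just collect, never free */
void
mfreem_sim(struct deadlist *dl, struct pkt *m)
{
	m->p_next = dl->head;
	dl->head = m;
}

/* called after the mutex is released: actually free, return count */
int
purge_sim(struct deadlist *dl)
{
	struct pkt *m;
	int n = 0;

	while ((m = dl->head) != NULL) {
		dl->head = m->p_next;
		free(m);
		n++;
	}
	return n;
}
```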
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via the "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
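the policy described above can be sketched in userland: when the aggregate depth is exceeded, evict the head of a non-empty lower-priority sub-queue to make room, else drop the newcomer. the struct layout, names, and limits here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

#define NPRIO	8	/* priority levels (illustrative) */
#define MAXLEN	4	/* aggregate depth limit (illustrative) */

struct pkt {
	struct pkt *p_next;
	int p_prio;
};

struct priq {
	struct pkt *pq_head[NPRIO];
	struct pkt *pq_tail[NPRIO];
	int pq_len;
};

/* append m to its priority list */
static void
priq_put(struct priq *pq, struct pkt *m)
{
	int p = m->p_prio;

	m->p_next = NULL;
	if (pq->pq_tail[p] == NULL)
		pq->pq_head[p] = m;
	else
		pq->pq_tail[p]->p_next = m;
	pq->pq_tail[p] = m;
	pq->pq_len++;
}

/* take the first packet off priority list p */
static struct pkt *
priq_take(struct priq *pq, int p)
{
	struct pkt *m = pq->pq_head[p];

	if (m == NULL)
		return NULL;
	pq->pq_head[p] = m->p_next;
	if (pq->pq_head[p] == NULL)
		pq->pq_tail[p] = NULL;
	pq->pq_len--;
	return m;
}

/*
 * Enqueue m.  On overflow, evict the head of a non-empty sub-queue
 * with lower priority than m, or drop m if none exists.  Returns the
 * dropped packet (freed by the caller outside the lock) or NULL.
 */
struct pkt *
priq_enq(struct priq *pq, struct pkt *m)
{
	int p;

	if (pq->pq_len >= MAXLEN) {
		for (p = 0; p < m->p_prio; p++) {
			if (pq->pq_head[p] != NULL) {
				priq_put(pq, m);
				return priq_take(pq, p);
			}
		}
		return m;	/* nothing lower to evict: drop m */
	}
	priq_put(pq, m);
	return NULL;
}
```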
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit, making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasn't even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isn't set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these mirror what is counted on the output path in the ifnet struct, except the ifq counts both packets and bytes when a packet is queued, instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified between the moment a start routine discovers there's no space left on a ring and the moment it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between those calls, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
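the race and the fix can be illustrated single-threaded by pinning down the two orderings; this is a sketch with made-up names, the real code serialises tasks on the ifq rather than running functions like these:

```c
#include <assert.h>

int oactive;	/* while set, the stack won't call start again */
int ring_space;	/* free tx descriptors */

/*
 * The bad interleaving: the start routine observed a full ring, txeof
 * then emptied the ring and cleared oactive, and only afterwards did
 * the start routine's ifq_set_oactive land.
 */
void
race_interleaving(void)
{
	int ring_was_full = (ring_space == 0);	/* start's observation */

	ring_space = 8;		/* txeof: descriptors completed ... */
	oactive = 0;		/* ... and ifq_clr_oactive called */

	if (ring_was_full)
		oactive = 1;	/* stale decision: queue stuck oactive */
}

/*
 * With ifq_restart(), the clr_oactive + start pair runs in the same
 * serialiser as the start routine, so it always lands after any
 * in-flight set_oactive.
 */
void
serialised_restart(void)
{
	/* the start routine runs to completion first */
	if (ring_space == 0)
		oactive = 1;

	/* then txeof's ifq_restart runs as the next serialised task */
	ring_space = 8;
	oactive = 0;	/* and the start routine is called again */
}
```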
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.26 |
|
01-Mar-2019 |
dlg |
rework how ifiq_input decides the stack is busy and whether it should drop
previously ifiq_input uses the traditional backpressure or defense mechanism and counts packets to decide when to shed load by dropping. currently it ends up waiting for 10240 packets to get queued on the stack before it would decide to drop packets. this may be ok for some machines, but for a lot this was too much.
this diff reworks how ifiqs measure how busy the stack is by introducing an ifiq_pressure counter that is incremented when ifiq_input is called, and cleared when ifiq_process calls the network stack to process the queue. if ifiq_input is called multiple times before ifiq_process in a net taskq runs, ifiq_pressure goes up, and ifiq_input uses a high value to decide the stack is busy and it should drop.
i was hoping there would be no performance impact from this change, but hrvoje popovski notes a slight bump in forwarding performance. my own testing shows that the ifiq input list length grows to a fraction of the 10240 it used to get to, which means the maximum burst of packets through the stack is smoothed out a bit. instead of big lists of packets followed by big periods of drops, we get relatively small bursts of packets with smaller gaps where we drop.
the follow-on from this is to make drivers implementing rx ring moderation to use the return value of ifiq_input to scale the ring allocation down, allowing the hardware to drop packets so software doesnt have to.
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatability wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.25 |
|
16-Dec-2018 |
dlg |
add task_pending
jsg@ wants this for drm, and i've had a version of it in diffs sine 2016, but obviously havent needed to use it just yet.
task_pending is modelled on timeout_pending, and tells you if the task is on a list waiting to execute.
ok jsg@
|
#
1.24 |
|
11-Dec-2018 |
dlg |
provide ifq_is_priq, mostly so things can tell if hfsc is in effect or not.
|
#
1.23 |
|
11-Dec-2018 |
dlg |
add ifq_hdatalen for getting the size of the packet at the head of an ifq
this gets the locks right, and returns 0 if there's no packet available.
ok stsp@
|
Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets > have been queued on the ifq before calling the drivers start routine. > if less than 4 packets get queued, the start routine is called from > a task in a softnet tq. > > 4 packets was chosen this time based on testing sephe did in dfly > which showed no real improvement when bundling more packets. hrvoje > popovski tested this on several nics and found an improvement of > 10 to 20 percent when forwarding across the board. > > because some of the ifq's work could be sitting on a softnet tq, > ifq_barrier now calls taskq_barrier to guarantee any work that was > pending there has finished. > > ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have it's day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifqs counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the drivers start routine. if less than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops arent necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from it's tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to the ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to acheive this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded we attempt to find a non-empty queue of packets with lower priority than the priority of a packet we're trying to enqueue and if there's such queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find the place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasnt even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatability wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isnt set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|
#
1.5 |
|
20-Jan-2017 |
dlg |
keep output packet counters on the ifq structure.
these copy what is counted on the output path on the ifnet struct, except ifq counts both packets and bytes when a packet is queued instead of just the bytes.
all the counters are protected by the ifq mutex except for ifq_errors, which can be updated safely from inside a start routine because the ifq machinery serialises them.
ok mpi@
|
Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
|
#
1.4 |
|
29-Dec-2015 |
dlg |
store curcpu() in ifq_serializer so we can check it.
this in turn gives us ifq_is_serialized() and an IFQ_ASSERT_SERIALIZED() macro.
ok mpi@
|
#
1.3 |
|
09-Dec-2015 |
dlg |
rework ifq_serialise to avoid some atomic ops.
now both the list of work and the flag saying if something is running the list are protected by a single mutex. it cuts the number of interlocked ops for an uncontended run of the queue from 5 down to 2.
jmatthew likes it.
|
#
1.2 |
|
09-Dec-2015 |
dlg |
rework the if_start mpsafe serialisation so it can serialise arbitrary work
work is represented by struct task.
the start routine is now wrapped by a task which is serialised by the infrastructure. if_start_barrier has been renamed to ifq_barrier and is now implemented as a task that gets serialised with the start routine.
this also adds an ifq_restart() function. it serialises a call to ifq_clr_oactive and calls the start routine again. it exists to avoid a race that kettenis@ identified in between when a start routine discovers theres no space left on a ring, and when it calls ifq_set_oactive. if the txeof side of the driver empties the ring and calls ifq_clr_oactive in between the above calls in start, the queue will be marked oactive and the stack will never call the start routine again.
by serialising the ifq_set_oactive call in the start routine and ifq_clr_oactive calls we avoid that race.
tested on various nics ok mpi@
|
#
1.1 |
|
08-Dec-2015 |
dlg |
split the interface send queue (struct ifqueue) implementation out.
the intention is to make it more clear what belongs to a transmit queue and what belongs to an interface.
suggested by and ok mpi@
|
#
1.22 |
|
25-Jan-2018 |
mpi |
Assert that ifiq_destroy() is not called with the NET_LOCK() held.
Calling taskq_barrier() on a softnet thread while holding the lock is clearly a deadlock.
ok visa@, dlg@, bluhm@
|
#
1.21 |
|
04-Jan-2018 |
tb |
Back out tx mitigation again because it breaks suspend and resume at least on x230 and x240. Problem noted by claudio on icb.
ok dlg
|
#
1.20 |
|
02-Jan-2018 |
dlg |
reintroduce tx mitigation
to quote the previous commit:
> this replaces ifq_start with code that waits until at least 4 packets
> have been queued on the ifq before calling the drivers start routine.
> if less than 4 packets get queued, the start routine is called from
> a task in a softnet tq.
>
> 4 packets was chosen this time based on testing sephe did in dfly
> which showed no real improvement when bundling more packets. hrvoje
> popovski tested this on several nics and found an improvement of
> 10 to 20 percent when forwarding across the board.
>
> because some of the ifq's work could be sitting on a softnet tq,
> ifq_barrier now calls taskq_barrier to guarantee any work that was
> pending there has finished.
>
> ok mpi@ visa@
this was backed out because of a race in the net80211 stack that anton@ hit. mpi@ committed a workaround for it in revision 1.30 of src/sys/net80211/ieee80211_pae_output.c.
im putting this in again so we can see what breaks next.
|
#
1.19 |
|
15-Dec-2017 |
dlg |
ifq_barrier should be callable by any nic, not just MPSAFE ones.
if (when) tx mitigation goes in again, all nics will have deferred work that will need a barrier to ensure it isn't running anymore.
found by bluhm@ when tx mit was in.
|
#
1.18 |
|
15-Dec-2017 |
dlg |
add ifiqueues for mp safety and nics with multiple rx rings.
currently there is a single mbuf_queue per interface, which all rings on a nic shove packets onto. while the list inside this queue is protected by a mutex, the counters around it (ie, ipackets, ibytes, idrops) are not. this means updates can be lost, and reading the statistics is also inconsistent. having a single queue means that busy rx rings can dominate and then starve the others.
ifiqueue structs are like ifqueue structs. they provide per ring queues, and independent counters for each ring. when ifdata is read for userland, these counters are aggregated. having a queue per ring now allows for per ring backpressure to be applied. MCLGETI will have its day again.
right now we assume every interface wants an input queue and unconditionally provide one. individual interfaces can opt into more.
im not completely happy about the shape of this atm, but shuffling it around more makes the diff bigger.
ok visa@
|
#
1.17 |
|
14-Dec-2017 |
dlg |
i forgot to convert ifq_barrier_task to cond_signal.
|
#
1.16 |
|
14-Dec-2017 |
dlg |
replace the bare sleep state handling in barriers with wait cond code
|
#
1.15 |
|
14-Nov-2017 |
dlg |
anton@ reports that ifq tx mitigation breaks iwm somehow.
back it out until i can figure the problem out.
|
#
1.14 |
|
14-Nov-2017 |
dlg |
move the adding of an ifq's counters in if_getdata to ifq.c
this keeps the knowledge of ifq locking in ifq.c
ok visa@
|
#
1.13 |
|
14-Nov-2017 |
dlg |
reintroduce tx mitigation, like dfly does and like we used to do.
this replaces ifq_start with code that waits until at least 4 packets have been queued on the ifq before calling the driver's start routine. if fewer than 4 packets get queued, the start routine is called from a task in a softnet tq.
4 packets was chosen this time based on testing sephe did in dfly which showed no real improvement when bundling more packets. hrvoje popovski tested this on several nics and found an improvement of 10 to 20 percent when forwarding across the board.
because some of the ifq's work could be sitting on a softnet tq, ifq_barrier now calls taskq_barrier to guarantee any work that was pending there has finished.
ok mpi@ visa@
|
Revision tags: OPENBSD_6_2_BASE
|
#
1.12 |
|
02-Jun-2017 |
dlg |
be less tricky about when ifq_free is handled.
instead of assuming start routines only run inside the ifq serialiser, only rely on the serialisation provided by the ifq mtx which is explicitly used during ifq_deq ops.
ie, free the mbufs in ifq_free at the end of ifq_deq ops instead of in the ifq_serialiser loop. ifq deq ops aren't necessarily called within the serialiser.
this should fix panics caused by fq codel on top of bce (which calls bce_start from its tx completion path instead of ifq_restart).
ok mikeb@
|
#
1.11 |
|
03-May-2017 |
mikeb |
Provide a function to dispose of a list of mbufs on dequeue
ifq_mfreeml() is similar to ifq_mfreem(), but takes an mbuf list as an argument. This also lets these functions subtract the number of packets to be disposed of from the ifq length.
OK dlg
|
#
1.10 |
|
03-May-2017 |
dlg |
add ifq_mfreem() so ifq backends can free packets during dequeue.
a goal of the ifq api is to avoid freeing an mbuf while holding a lock. to achieve this it allowed the backend enqueue operation to return a single mbuf to be freed. however, mikeb@ is working on a backend that wants to free packets during dequeue. to support this, ifq_mfreem queues a packet during dequeue for freeing at the end of the ifq serialiser.
there's some doco in ifq.h about it.
requested by mikeb@
|
Revision tags: OPENBSD_6_1_BASE
|
#
1.9 |
|
07-Mar-2017 |
mikeb |
Change priq enqueue policy to drop lower priority packets
The new priority queueing enqueue policy is such that when the aggregate queue depth of an outgoing queue is exceeded, we attempt to find a non-empty queue of packets with lower priority than the packet we're trying to enqueue; if there is such a queue, we drop the first packet from it.
This ensures that high priority traffic will almost always find a place on the queue and low priority bulk traffic gets a better chance at regulating its throughput. There's no change in the behavior if altered priorities are not used (e.g. via "set prio" Pf directive, VLAN priorities and so on).
With a correction from dlg@, additional tests by dhill@ OK bluhm, mpi
|
#
1.8 |
|
07-Mar-2017 |
mikeb |
Convert priority queue lists to mbuf_lists
This simplifies the code quite a bit making it easier to reason about. dlg@ has begrudgingly submitted to populism, OK bluhm, mpi
|
#
1.7 |
|
07-Mar-2017 |
dlg |
deprecate ifq_enqueue_try, and let backends drop arbitrary mbufs.
mikeb@ wants priq to be able to drop lower priority packets if the current one is high. because ifq avoids freeing an mbuf while an ifq mutex is held, he needs a way for a backend to return an arbitrary mbuf to drop rather than signal that the current one needs to be dropped.
this lets the backends return the mbuf to be dropped, which may or may not be the current one.
to support this ifq_enqueue_try has to be dropped because it can only signal about the current mbuf. nothing uses it (except ifq_enqueue), so we can get rid of it. it wasn't even documented.
this diff includes some tweaks by mikeb@ around the statistics gathered in ifq_enqueue when an mbuf is dropped.
|
#
1.6 |
|
24-Jan-2017 |
dlg |
add support for multiple transmit ifqueues per network interface.
an ifq to transmit a packet is picked by the current traffic conditioner (ie, priq or hfsc) by providing an index into an array of ifqs. by default interfaces get a single ifq but can ask for more using if_attach_queues().
the vast majority of our drivers still think there's a 1:1 mapping between interfaces and transmit queues, so their if_start routines take an ifnet pointer instead of a pointer to the ifqueue struct. instead of changing all the drivers in the tree, drivers can opt into using an if_qstart routine and setting the IFXF_MPSAFE flag. the stack provides a compatibility wrapper from the new if_qstart handler to the previous if_start handlers if IFXF_MPSAFE isn't set.
enabling hfsc on an interface configures it to transmit everything through the first ifq. any other ifqs are left configured as priq, but unused, when hfsc is enabled.
getting this in now so everyone can kick the tyres.
ok mpi@ visa@ (who provided some tweaks for cnmac).
|