<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>

<title>Postfix Queue Scheduler</title>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body>

<h1><img src="postfix-logo.jpg" width="203" height="98" ALT="">Postfix
Queue Scheduler</h1>

<hr>

<h2> Overview </h2>

<p> The queue manager is by far the most complex part of the Postfix
mail system. It schedules delivery of new mail, retries failed
deliveries at specific times, and removes mail from the queue after
the last delivery attempt. There are two major classes of mechanisms
that control the operation of the queue manager. </p>

<p> Topics covered by this document: </p>

<ul>

<li> <a href="#concurrency"> Concurrency scheduling</a>, concerned
with the number of concurrent deliveries to a specific destination,
including decisions on when to suspend deliveries after persistent
failures.

<li> <a href="#jobs"> Preemptive scheduling</a>, concerned with
the selection of email messages and recipients for a given destination.

<li> <a href="#credits"> Credits</a>, something this document would
not be complete without.

</ul>

<!--

<p> Once started, the qmgr(8) process runs until "postfix reload"
or "postfix stop". As a persistent process, the queue manager has
to meet strict requirements with respect to code correctness and
robustness. Unlike non-persistent daemon processes, the queue manager
cannot benefit from Postfix's process rejuvenation mechanism that
limits the impact from resource leaks and other coding errors
(translation: replacing a process after a short time covers up bugs
before they can become a problem). </p>

-->

<h2> <a name="concurrency"> Concurrency scheduling </a> </h2>

<p> The following sections document the Postfix 2.5 concurrency
scheduler, after a discussion of the limitations of the existing
concurrency scheduler. This is followed by results of medium-concurrency
experiments, and a discussion of trade-offs between performance and
robustness. </p>

<p> The material is organized as follows: </p>

<ul>

<li> <a href="#concurrency_drawbacks"> Drawbacks of the existing
concurrency scheduler </a>

<li> <a href="#concurrency_summary_2_5"> Summary of the Postfix 2.5
concurrency feedback algorithm </a>

<li> <a href="#dead_summary_2_5"> Summary of the Postfix 2.5 "dead
destination" detection algorithm </a>

<li> <a href="#pseudo_code_2_5"> Pseudocode for the Postfix 2.5
concurrency scheduler </a>

<li> <a href="#concurrency_results"> Results for delivery to
concurrency-limited servers </a>

<li> <a href="#concurrency_discussion"> Discussion of concurrency-limited
server results </a>

<li> <a href="#concurrency_limitations"> Limitations of less-than-1
per delivery feedback </a>

<li> <a href="#concurrency_config"> Concurrency configuration
parameters </a>

</ul>

<h3> <a name="concurrency_drawbacks"> Drawbacks of the existing
concurrency scheduler </a> </h3>

<p> From the start, Postfix has used a simple but robust algorithm
where the per-destination delivery concurrency is decremented by 1
after a delivery failed due to a connection or handshake failure,
and incremented by 1 otherwise. Of course the concurrency is never
allowed to exceed the maximum per-destination concurrency limit.
And when a destination's concurrency level drops to zero, the
destination is declared "dead" and delivery is suspended. </p>

<p> Drawbacks of +/-1 concurrency feedback per delivery are: </p>

<ul>

<li> <p> Overshoot due to exponential delivery concurrency growth
with each pseudo-cohort(*). This can be an issue with high-concurrency
channels. For example, with the default initial concurrency of 5,
concurrency would proceed over time as (5-10-20). </p>

<li> <p> Throttling down to zero concurrency after a single
pseudo-cohort(*) failure. This was especially an issue with
low-concurrency channels, where a single failure could be sufficient
to mark a destination as "dead", causing the suspension of further
deliveries to the affected destination. </p>

</ul>

<p> (*) A pseudo-cohort is a number of delivery requests equal to
a destination's delivery concurrency. </p>

<p> The revised concurrency scheduler has a highly modular structure.
It uses separate mechanisms for per-destination concurrency control
and for "dead destination" detection. The concurrency control in
turn is built from two separate mechanisms: it supports less-than-1
feedback per delivery to allow for more gradual concurrency
adjustments, and it uses feedback hysteresis to suppress concurrency
oscillations. And instead of waiting for delivery concurrency to
throttle down to zero, a destination is declared "dead" after a
configurable number of pseudo-cohorts reports connection or handshake
failures. </p>

<h3> <a name="concurrency_summary_2_5"> Summary of the Postfix 2.5
concurrency feedback algorithm </a> </h3>

<p> We want to increment a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
without connection or handshake failure. This is implemented with
positive feedback g(N), where N is the destination's delivery
concurrency. With g(N)=1 feedback per delivery, concurrency increases
by 1 after each positive feedback event; this gives us the old
scheduler's exponential growth in time. With g(N)=1/N feedback per
delivery, concurrency increases by 1 after an entire pseudo-cohort
N of positive feedback reports; this gives us linear growth in time.
Less-than-1 feedback per delivery and integer truncation naturally
give us hysteresis, so that transitions to larger concurrency happen
every 1/g(N) positive feedback events. </p>

<p> We want to decrement a destination's delivery concurrency when
some (not necessarily consecutive) number of deliveries complete
after connection or handshake failure. This is implemented with
negative feedback f(N), where N is the destination's delivery
concurrency. With f(N)=1 feedback per delivery, concurrency decreases
by 1 after each negative feedback event; this gives us the old
scheduler's behavior where concurrency is throttled down dramatically
after a single pseudo-cohort failure. With f(N)=1/N feedback per
delivery, concurrency backs off more gently. Again, less-than-1
feedback per delivery and integer truncation naturally give us
hysteresis, so that transitions to lower concurrency happen every
1/f(N) negative feedback events. </p>
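<p> To make the hysteresis mechanism concrete, here is a small
worked trace (not taken from the Postfix source; it simply applies
the update rules from the pseudocode later in this document). The
destination starts at concurrency N=5, with g(N)=1/N positive
feedback per successful delivery: </p>

<pre>
event   g(N)     accumulated success   concurrency
  1     1/5=0.2        0.2                  5
  2     1/5=0.2        0.4                  5
  3     1/5=0.2        0.6                  5
  4     1/5=0.2        0.8                  5
  5     1/5=0.2        1.0 -> 0.0           6   (one pseudo-cohort done)
  6     1/6=0.167      0.167                6
  ...
</pre>

<p> Concurrency increases by 1 only after a full pseudo-cohort of
positive feedback events; this is the linear growth described
above. </p>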
<p> However, with negative feedback we introduce a subtle twist.
We "reverse" the negative hysteresis cycle so that the transition
to lower concurrency happens at the <b>beginning</b> of a sequence
of 1/f(N) negative feedback events. Otherwise, a correction for
overload would be made too late. This makes the choice of f(N)
relatively unimportant, as borne out by measurements later in this
document. </p>

<p> In summary, the main ingredients for the Postfix 2.5 concurrency
feedback algorithm are a) the option of less-than-1 positive feedback
per delivery to avoid overwhelming servers, b) the option of
less-than-1 negative feedback per delivery to avoid giving up too
fast, c) feedback hysteresis to avoid rapid oscillation, and d) a
"reverse" hysteresis cycle for negative feedback, so that it can
correct for overload quickly. </p>

<h3> <a name="dead_summary_2_5"> Summary of the Postfix 2.5 "dead
destination" detection algorithm </a> </h3>

<p> We want to suspend deliveries to a specific destination after
some number of deliveries suffers connection or handshake failure.
The old scheduler declares a destination "dead" when negative (-1)
feedback throttles the delivery concurrency down to zero. With
less-than-1 feedback per delivery, this throttling down would
obviously take too long. We therefore have to separate "dead
destination" detection from concurrency feedback. This is implemented
by introducing the concept of pseudo-cohort failure. The Postfix
2.5 concurrency scheduler declares a destination "dead" after a
configurable number of pseudo-cohorts suffers from connection or
handshake failures. The old scheduler corresponds to the special
case where the pseudo-cohort failure limit is equal to 1. </p>
<h3> <a name="pseudo_code_2_5"> Pseudocode for the Postfix 2.5
concurrency scheduler </a> </h3>

<p> The pseudocode below shows how the ideas behind the new
concurrency scheduler are implemented as of November 2007. The
actual code can be found in the module qmgr/qmgr_queue.c. </p>

<pre>
Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

After success:
        fail_cohorts = 0
        Be prepared for feedback > hysteresis, or rounding error
        success += g(concurrency)
        while (success >= 1)            Hysteresis 1
            concurrency += 1            Hysteresis 1
            failure = 0
            success -= 1                Hysteresis 1
        Be prepared for overshoot
        if (concurrency > concurrency limit)
            concurrency = concurrency limit

Safety:
        Don't apply positive feedback unless
            concurrency < busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

After failure:
        if (concurrency > 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts > cohort_failure_limit)
                concurrency = 0
        if (concurrency > 0)
            Be prepared for feedback > hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure < 0)
                concurrency -= 1        Hysteresis 1
                failure += 1            Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency < 1)
                concurrency = 1
</pre>
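<p> For readers who prefer compilable code, here is a minimal,
self-contained C sketch of the same update rules. This is not the
Postfix implementation (see qmgr/qmgr_queue.c for that); the struct
and function names are made up for this example, and the pseudocode's
"Safety" guard is omitted for brevity. </p>

<pre>
#include &lt;stdio.h&gt;

/* Illustrative per-destination feedback state; not the actual
 * Postfix data structures. */
struct dest {
    int    concurrency;         /* current delivery concurrency */
    double success;             /* accumulated positive feedback */
    double failure;             /* accumulated negative feedback */
    double fail_cohorts;        /* consecutive failed pseudo-cohorts */
};

#define INIT_CONCURRENCY        5
#define CONCURRENCY_LIMIT       20
#define COHORT_FAILURE_LIMIT    1.0

/* Example feedback functions: 1/N positive and negative feedback. */
static double g(int n) { return 1.0 / n; }
static double f(int n) { return 1.0 / n; }

static void dest_success(struct dest *d)
{
    d->fail_cohorts = 0;
    d->success += g(d->concurrency);
    while (d->success >= 1) {                   /* hysteresis 1 */
        d->concurrency += 1;
        d->failure = 0;
        d->success -= 1;
    }
    if (d->concurrency > CONCURRENCY_LIMIT)     /* overshoot */
        d->concurrency = CONCURRENCY_LIMIT;
}

static void dest_failure(struct dest *d)
{
    if (d->concurrency > 0) {
        d->fail_cohorts += 1.0 / d->concurrency;
        if (d->fail_cohorts > COHORT_FAILURE_LIMIT)
            d->concurrency = 0;                 /* destination is "dead" */
    }
    if (d->concurrency > 0) {
        d->failure -= f(d->concurrency);
        while (d->failure < 0) {                /* reversed hysteresis cycle */
            d->concurrency -= 1;
            d->failure += 1;
            d->success = 0;
        }
        if (d->concurrency < 1)                 /* overshoot */
            d->concurrency = 1;
    }
}

int main(void)
{
    struct dest d = { INIT_CONCURRENCY, 0, 0, 0 };

    for (int i = 0; i < 20; i++) {              /* 20 good deliveries */
        dest_success(&d);
        printf("success %2d: concurrency %d\n", i + 1, d.concurrency);
    }
    dest_failure(&d);                           /* one failed delivery */
    printf("failure:    concurrency %d\n", d.concurrency);
    return 0;
}
</pre>

<p> Running this shows linear concurrency growth (5 to 8 over 20
successes) and, thanks to the reversed hysteresis cycle, an immediate
drop to 7 after a single failure. </p>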
<h3> <a name="concurrency_results"> Results for delivery to
concurrency-limited servers </a> </h3>

<p> Discussions about the concurrency scheduler redesign started
in early 2004, when the primary goal was to find alternatives that
did not exhibit exponential growth or rapid concurrency throttling.
No code was implemented until late 2007, when the primary concern
had shifted towards better handling of server concurrency limits.
For this reason we measure how well the new scheduler does this
job. The tables below compare the mail delivery performance of the
old +/-1 feedback per delivery with several less-than-1 feedback
functions, for different limited-concurrency server scenarios.
Measurements were done with a FreeBSD 6.2 client and with FreeBSD
6.2 and various Linux servers. </p>

<p> Server configuration: </p>

<ul> <li> The mail flow was slowed down with 1 second latency per
recipient ("smtpd_client_restrictions = sleep 1"). The purpose was
to make results less dependent on hardware details, by avoiding
slow-downs by queue file I/O, logging I/O, and network I/O.

<li> Concurrency was limited by the server process limit
("default_process_limit = 5" and "smtpd_client_event_limit_exceptions
= static:all"). Postfix was stopped and started after changing the
process limit, because the same number is also used as the backlog
argument to the listen(2) system call, and "postfix reload" does
not re-issue this call.

<li> Mail was discarded with "local_recipient_maps = static:all"
and "local_transport = discard". The discard action in access maps
or header/body checks could not be used, as it fails to update the
in_flow_delay counters.

</ul>

<p> Client configuration: </p>

<ul>

<li> Queue file overhead was minimized by sending one message to a
virtual alias that expanded into 2000 different remote recipients.
All recipients were accounted for according to the maillog file.
The virtual_alias_expansion_limit setting was increased to avoid
complaints from the cleanup(8) server.

<li> The number of deliveries was maximized with
"smtp_destination_recipient_limit = 2". A smaller limit would cause
Postfix to schedule the concurrency per recipient instead of per
domain, which is not what we want.

<li> Maximum concurrency was limited with
"smtp_destination_concurrency_limit = 20", and
initial_destination_concurrency was set to the same value.

<li> The positive and negative concurrency feedback hysteresis was
1. Concurrency was incremented by 1 at the END of 1/feedback steps
of positive feedback, and was decremented by 1 at the START of
1/feedback steps of negative feedback.

<li> The SMTP client used the default 30s SMTP connect timeout and
300s SMTP greeting timeout.

</ul>
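<p> Pulled together, the server- and client-side settings quoted
above would look roughly like this in main.cf. All parameter names
are quoted from the text above; only the virtual_alias_expansion_limit
value is an arbitrary "large enough" choice, since the text does
not specify one: </p>

<pre>
# Server side: slow down the mail flow and cap the concurrency.
smtpd_client_restrictions = sleep 1
default_process_limit = 5
smtpd_client_event_limit_exceptions = static:all

# Server side: discard all mail while updating in_flow_delay counters.
local_recipient_maps = static:all
local_transport = discard

# Client side: maximize deliveries, cap the concurrency at 20.
smtp_destination_recipient_limit = 2
smtp_destination_concurrency_limit = 20
initial_destination_concurrency = 20

# Client side: allow the 2000-recipient virtual alias expansion
# (illustrative value).
virtual_alias_expansion_limit = 5000
</pre>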
<h4> Impact of the 30s SMTP connect timeout </h4>

<p> The first results are for a FreeBSD 6.2 server, where our
artificially low listen(2) backlog results in a very short kernel
queue for established connections. The table shows that all deferred
deliveries failed due to a 30s connection timeout, and none failed
due to a server greeting timeout. This measurement simulates what
happens when the server's connection queue is completely full under
load, and the TCP engine drops new connections. </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th colspan="2">timed-out in<br> connect/greeting </th> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/N</td> <td align="center">no</td>
<td align="center">9.9</td> <td align="center">19.4</td>
<td align="center">0.49</td> <td align="center">198</td>
<td align="center">-</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/N</td> <td align="center">yes</td>
<td align="center">10.3</td> <td align="center">19.4</td>
<td align="center">0.49</td> <td align="center">206</td>
<td align="center">-</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/sqrt(N)</td> <td align="center">no</td>
<td align="center">10.4</td> <td align="center">19.6</td>
<td align="center">0.59</td> <td align="center">208</td>
<td align="center">-</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/sqrt(N)</td> <td align="center">yes</td>
<td align="center">10.6</td> <td align="center">19.6</td>
<td align="center">0.61</td> <td align="center">212</td>
<td align="center">-</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1</td> <td align="center">no</td>
<td align="center">10.1</td> <td align="center">19.5</td>
<td align="center">1.29</td> <td align="center">202</td>
<td align="center">-</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1</td> <td align="center">yes</td>
<td align="center">10.8</td> <td align="center">19.3</td>
<td align="center">1.57</td> <td align="center">216</td>
<td align="center">-</td> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

</table>

<p> A busy server with a completely full connection queue. N is
the client delivery concurrency. Failed deliveries time out after
30s without completing the TCP handshake. See text for a discussion
of results. </p>

</blockquote>
<h4> Impact of the 300s SMTP greeting timeout </h4>

<p> The next table shows results for a Fedora Core 8 server (results
for RedHat 7.3 are identical). In this case, the artificially small
listen(2) backlog argument does not impact our measurement. The
table shows that practically all deferred deliveries fail after the
300s SMTP greeting timeout. As these timeouts were 10x longer than
with the first measurement, we increased the recipient count (and
thus the running time) by a factor of 10 to keep the results
comparable. The deferred mail percentages are a factor of 10 lower
than with the first measurement, because the 1s per-recipient delay
was 1/300th of the greeting timeout instead of 1/30th of the
connection timeout. </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th colspan="2">timed-out in<br> connect/greeting </th> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/N</td> <td align="center">no</td>
<td align="center">1.16</td> <td align="center">19.8</td>
<td align="center">0.37</td> <td align="center">-</td>
<td align="center">230</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/N</td> <td align="center">yes</td>
<td align="center">1.36</td> <td align="center">19.8</td>
<td align="center">0.36</td> <td align="center">-</td>
<td align="center">272</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/sqrt(N)</td> <td align="center">no</td>
<td align="center">1.21</td> <td align="center">19.9</td>
<td align="center">0.23</td> <td align="center">4</td>
<td align="center">238</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/sqrt(N)</td> <td align="center">yes</td>
<td align="center">1.36</td> <td align="center">20.0</td>
<td align="center">0.23</td> <td align="center">-</td>
<td align="center">272</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1</td> <td align="center">no</td>
<td align="center">1.18</td> <td align="center">20.0</td>
<td align="center">0.16</td> <td align="center">-</td>
<td align="center">236</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1</td> <td align="center">yes</td>
<td align="center">1.39</td> <td align="center">20.0</td>
<td align="center">0.16</td> <td align="center">-</td>
<td align="center">278</td> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

</table>

<p> A busy server with a non-full connection queue. N is the client
delivery concurrency. Failed deliveries complete at the TCP level,
but time out after 300s while waiting for the SMTP greeting. See
text for a discussion of results. </p>

</blockquote>

<h4> Impact of an active server concurrency limiter </h4>

<p> The final concurrency-limited result shows what happens when
SMTP connections don't time out, but are rejected immediately with
the Postfix server's smtpd_client_connection_count_limit feature
(the server replies with a 421 status and disconnects immediately).
Similar results can be expected with concurrency-limiting features
built into other MTAs or firewalls. For this measurement we specified
a server concurrency limit and a client initial destination concurrency
of 5, and a server process limit of 10; all other conditions were
the same as with the first measurement. The same result would be
obtained with a FreeBSD or Linux server, because the "pushing back"
is done entirely by the receiving side. </p>

<blockquote>

<table>

<tr> <th>client<br> limit</th> <th>server<br> limit</th> <th>feedback<br>
style</th> <th>connection<br> caching</th> <th>percentage<br>
deferred</th> <th colspan="2">client concurrency<br> average/stddev</th>
<th>theoretical<br>defer rate</th> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/N</td> <td align="center">no</td>
<td align="center">16.5</td> <td align="center">5.17</td>
<td align="center">0.38</td> <td align="center">1/6</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/N</td> <td align="center">yes</td>
<td align="center">16.5</td> <td align="center">5.17</td>
<td align="center">0.38</td> <td align="center">1/6</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/sqrt(N)</td> <td align="center">no</td>
<td align="center">24.5</td> <td align="center">5.28</td>
<td align="center">0.45</td> <td align="center">1/4</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1/sqrt(N)</td> <td align="center">yes</td>
<td align="center">24.3</td> <td align="center">5.28</td>
<td align="center">0.46</td> <td align="center">1/4</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1</td> <td align="center">no</td>
<td align="center">49.7</td> <td align="center">5.63</td>
<td align="center">0.67</td> <td align="center">1/2</td> </tr>

<tr> <td align="center">20</td> <td align="center">5</td>
<td align="center">1</td> <td align="center">yes</td>
<td align="center">49.7</td> <td align="center">5.68</td>
<td align="center">0.70</td> <td align="center">1/2</td> </tr>

<tr> <td align="center" colspan="9"> <hr> </td> </tr>

</table>

<p> A server with an active per-client concurrency limiter that
replies with 421 and disconnects. N is the client delivery concurrency.
The theoretical defer rate is 1/(1+roundup(1/feedback)). This is
always 1/2 with the fixed +/-1 feedback per delivery; with the
concurrency-dependent feedback variants, the defer rate decreases
with increasing concurrency. See text for a discussion of results.
</p>

</blockquote>
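<p> To see where the theoretical defer rates in the last column
come from, here is the arithmetic from the caption's formula,
evaluated at the measured concurrency of about 5 (a worked example,
not an additional measurement): </p>

<pre>
feedback = 1/N, N = 5:   roundup(1/feedback) = roundup(5)       = 5
                         defer rate = 1/(1+5)                   = 1/6
feedback = 1/sqrt(N):    roundup(1/feedback) = roundup(sqrt(5)) = 3
                         defer rate = 1/(1+3)                   = 1/4
feedback = 1:            roundup(1/feedback) = roundup(1)       = 1
                         defer rate = 1/(1+1)                   = 1/2
</pre>

<p> Intuitively: at the concurrency ceiling, each 421 reply immediately
costs one unit of concurrency, which then takes roundup(1/feedback)
successful deliveries to win back, so one delivery out of every
1+roundup(1/feedback) hits the limiter and is deferred. </p>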
<h3> <a name="concurrency_discussion"> Discussion of concurrency-limited
server results </a> </h3>

<p> All results in the previous sections are based on the first
delivery runs only; they do not include any second etc. delivery
attempts. It's also worth noting that the measurements look at
steady-state behavior only. They don't show what happens when the
client starts sending at a much higher or lower concurrency.
</p>

<p> The first two examples show that the effect of feedback
is negligible when concurrency is limited due to congestion. This
is because the initial concurrency is already at the client's
concurrency maximum, and because there is 10-100 times more positive
than negative feedback. Under these conditions, it is no surprise
that the contribution from SMTP connection caching is also negligible.
</p>

<p> In the last example, the old +/-1 feedback per delivery will
defer 50% of the mail when confronted with an active (anvil-style)
server concurrency limit, where the server hangs up immediately
with a 421 status (a TCP-level RST would have the same result).
Less aggressive feedback mechanisms fare better than more aggressive
ones. Concurrency-dependent feedback fares even better at higher
concurrencies than shown here, but has limitations as discussed in
the next section. </p>

<h3> <a name="concurrency_limitations"> Limitations of less-than-1
per delivery feedback </a> </h3>

<p> Less-than-1 feedback is of interest primarily when sending large
amounts of mail to destinations with active concurrency limiters
(servers that reply with 421, or firewalls that send RST). When
sending small amounts of mail per destination, less-than-1 per-delivery
feedback won't have a noticeable effect on the per-destination
concurrency, because the number of deliveries to the same destination
is too small. You might just as well use zero per-delivery feedback
and stay with the initial per-destination concurrency. And when
mail deliveries fail due to congestion instead of active concurrency
limiters, the measurements above show that per-delivery feedback
has no effect. With large amounts of mail you might just as well
use zero per-delivery feedback and start with the maximal per-destination
concurrency. </p>

<p> The scheduler with less-than-1 concurrency
feedback per delivery solves a problem with servers that have active
concurrency limiters. This works only because feedback is handled
in a peculiar manner: positive feedback will increment the concurrency
by 1 at the <b>end</b> of a sequence of events of length 1/feedback,
while negative feedback will decrement concurrency by 1 at the
<b>beginning</b> of such a sequence. This is how Postfix adjusts
quickly for overshoot without causing lots of mail to be deferred.
Without this difference in feedback treatment, less-than-1 feedback
per delivery would defer 50% of the mail, and would be no better
in this respect than the old +/-1 feedback per delivery. </p>

<p> Unfortunately, the same feature that corrects quickly for
concurrency overshoot also makes the scheduler more sensitive to
noisy negative feedback. The reason is that one lonely negative
feedback event has the same effect as a complete sequence of length
1/feedback: in both cases delivery concurrency is dropped by 1
immediately. As a worst-case scenario, consider multiple servers
behind a load balancer on a single IP address, and no backup MX
address.
When 1 out of K servers fails to complete the SMTP handshake
or drops the connection, a scheduler with 1/N (N = concurrency)
feedback stops increasing its concurrency once it reaches a concurrency
level of about K, even though the good servers behind the load
balancer are perfectly capable of handling more traffic. </p>

<p> This noise problem gets worse as the amount of positive feedback
per delivery gets smaller. A compromise is to use fixed less-than-1
positive feedback values instead of concurrency-dependent positive
feedback. For example, to tolerate 1 bad server out of 4 in the above
load balancer scenario, use positive feedback of 1/4 per "good"
delivery (no connect or handshake error), and use an equal or smaller
amount of negative feedback per "bad" delivery. The downside of
using concurrency-independent feedback is that some of the old +/-1
feedback problems will return at large concurrencies. Sites that
must deliver mail at non-trivial per-destination concurrencies will
require special configuration (see the example after the parameter
table below). </p>

<h3> <a name="concurrency_config"> Concurrency configuration parameters </a> </h3>

<p> The Postfix 2.5 concurrency scheduler is controlled with the
following configuration parameters, where "<i>transport</i>_foo"
provides a transport-specific parameter override. All parameter
default settings are compatible with earlier Postfix versions. </p>

<blockquote>

<table border="0">

<tr> <th> Parameter name </th> <th> Postfix version </th> <th>
Description </th> </tr>

<tr> <td colspan="3"> <hr> </td> </tr>

<tr> <td> initial_destination_concurrency<br>
<i>transport</i>_initial_destination_concurrency </td> <td
align="center"> all<br> 2.5 </td> <td> Initial per-destination
delivery concurrency </td> </tr>

<tr> <td> default_destination_concurrency_limit<br>
<i>transport</i>_destination_concurrency_limit </td> <td align="center">
all<br> all </td> <td> Maximum per-destination delivery concurrency
</td> </tr>

<tr> <td> default_destination_concurrency_positive_feedback<br>
<i>transport</i>_destination_concurrency_positive_feedback </td>
<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination positive
feedback amount, per delivery that does not fail with connection
or handshake failure </td> </tr>

<tr> <td> default_destination_concurrency_negative_feedback<br>
<i>transport</i>_destination_concurrency_negative_feedback </td>
<td align="center"> 2.5<br> 2.5 </td> <td> Per-destination negative
feedback amount, per delivery that fails with connection or handshake
failure </td> </tr>

<tr> <td> default_destination_concurrency_failed_cohort_limit<br>
<i>transport</i>_destination_concurrency_failed_cohort_limit </td>
<td align="center"> 2.5<br> 2.5 </td> <td> Number of failed
pseudo-cohorts after which a destination is declared "dead" and
delivery is suspended </td> </tr>

<tr> <td> destination_concurrency_feedback_debug</td> <td align="center">
2.5 </td> <td> Enable verbose logging of concurrency scheduler
activity </td> </tr>

<tr> <td colspan="3"> <hr> </td> </tr>

</table>

</blockquote>
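<p> As an illustration of the compromise described above, a site
that wants to tolerate 1 bad server out of 4 behind a destination's
load balancer might configure the SMTP transport roughly as follows.
The 1/4 feedback values follow the example in the text; the cohort
limit shown is an arbitrary illustration, and the exact value syntax
is described in postconf(5): </p>

<pre>
# Fixed less-than-1 feedback: +1/4 per "good" delivery, -1/4 per
# "bad" delivery, so one bad server out of four does not stall
# concurrency growth.
smtp_destination_concurrency_positive_feedback = 1/4
smtp_destination_concurrency_negative_feedback = 1/4

# Don't declare the destination "dead" after a single failed
# pseudo-cohort (illustrative value).
smtp_destination_concurrency_failed_cohort_limit = 10
</pre>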
<h2> <a name="jobs"> Preemptive scheduling </a> </h2>

<p>

The following sections describe the new queue manager and its
preemptive scheduler algorithm. Note that the document was originally
written to describe the changes between the new queue manager (in
this text referred to as <tt>nqmgr</tt>, the name it was known by
before it became the default queue manager) and the old queue manager
(referred to as <tt>oqmgr</tt>). This is why it refers to <tt>oqmgr</tt>
every so often.

</p>

<p>

This document is divided into sections as follows:

</p>

<ul>

<li> <a href="#nqmgr_structures"> The structures used by nqmgr </a>

<li> <a href="#nqmgr_pickup"> What happens when nqmgr picks up the
message </a> - how it is assigned to transports, jobs, peers, entries

<li> <a href="#nqmgr_selection"> How the entry selection works </a>

<li> <a href="#nqmgr_preemption"> How the preemption works </a> -
what messages may be preempted, and how the preempting messages are
chosen

<li> <a href="#nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a>

<li> <a href="#nqmgr_memory"> Dealing with memory resource limits </a>

</ul>

<h3> <a name="nqmgr_structures"> The structures used by nqmgr </a> </h3>

<p>

Let's start by recapitulating the structures and terms used when
referring to the queue manager and how it operates. Many of these
are partially described elsewhere, but it is nice to have a coherent
overview in one place (a simplified sketch of these structures in
C follows the list):

</p>

<ul>

<li> <p> Each message structure represents one mail message which
Postfix is to deliver. The message recipients specify to what
destinations the message is to be delivered and what transports are
going to be used for the delivery. </p>

<li> <p> Each recipient entry groups a batch of recipients of one
message which are all going to be delivered to the same destination.
</p>

<li> <p> Each transport structure groups everything that is going
to be delivered by delivery agents dedicated to that transport.
Each transport maintains a set of queues (describing the destinations
it shall talk to) and jobs (referencing the messages it shall
deliver). </p>

<li> <p> Each transport queue (not to be confused with the on-disk
active queue or incoming queue) groups everything that is going to
be delivered to a given destination (aka nexthop) by its transport.
Each queue belongs to one transport, so each destination may be
referred to by several queues, one for each transport. Each queue
maintains a list of all recipient entries (batches of message
recipients) which shall be delivered to the given destination (the
todo list), and a list of recipient entries already being delivered
by the delivery agents (the busy list). </p>

<li> <p> Each queue corresponds to multiple peer structures. Each
peer structure is like the queue structure, belonging to one transport
and referencing one destination. The difference is that it lists
only the recipient entries which all originate from the same message,
unlike the queue structure, whose entries may originate from various
messages. For messages with few recipients, there is usually just
one recipient entry for each destination, resulting in one recipient
entry per peer. But for large mailing-list messages the recipients
may need to be split into multiple recipient entries, in which case
the peer structure may list many entries for a single destination.
</p>

<li> <p> Each transport job groups everything it takes to deliver
one message via its transport. Each job represents one message
within the context of the transport. The job belongs to one transport
and message, so each message may have multiple jobs, one for each
transport. The job groups all the peer structures, which describe
the destinations the job's message has to be delivered to. </p>

</ul>
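<p>

Here is the promised simplified C sketch of how these structures
reference each other. It is illustrative only; the actual definitions
(QMGR_MESSAGE, QMGR_TRANSPORT, QMGR_QUEUE, QMGR_JOB, QMGR_PEER,
QMGR_ENTRY) live in the qmgr sources and carry many more fields.

</p>

<blockquote>
<pre>
/* Illustrative only; see the qmgr sources for the real definitions. */

struct message {                  /* one mail message to deliver */
    int         recipient_count;
    struct job *jobs;             /* one job per transport used */
};

struct entry {                    /* batch of same-destination recipients */
    struct message *message;
    struct entry   *queue_peers;  /* next entry on queue todo/busy list */
    struct entry   *peer_peers;   /* next entry on peer entry list */
};

struct transport {                /* e.g. "smtp" or "local" */
    struct queue *queues;         /* per-destination queues */
    struct job   *job_list;       /* jobs, ordered for entry selection */
};

struct queue {                    /* per-transport, per-destination */
    struct transport *transport;
    struct entry     *todo;       /* entries waiting for delivery */
    struct entry     *busy;       /* entries being delivered */
};

struct job {                      /* one message within one transport */
    struct transport *transport;
    struct message   *message;
    struct peer      *peers;      /* destinations of this job's message */
};

struct peer {                     /* like queue, but for one message */
    struct job   *job;
    struct queue *queue;
    struct entry *entries;        /* entries from this job's message */
};
</pre>
</blockquote>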
<p>

The first four structures are common to both <tt>nqmgr</tt> and
<tt>oqmgr</tt>; the latter two were introduced by <tt>nqmgr</tt>.

</p>

<p>

These terms are used extensively in the text below; feel free to
look up the description above anytime you feel you have lost the
sense of what is what.

</p>

<h3> <a name="nqmgr_pickup"> What happens when nqmgr picks up the
message </a> </h3>

<p>

Whenever <tt>nqmgr</tt> moves a queue file into the active queue,
the following happens: It reads all necessary information from the
queue file as <tt>oqmgr</tt> does, and also reads as many recipients
as possible - more on that later; for now let's just pretend it
always reads all recipients.

</p>

<p>

Then it resolves the recipients as <tt>oqmgr</tt> does, which
means obtaining an (address, nexthop, transport) triple for each
recipient. For each triple, it finds the transport; if it does not
exist yet, it instantiates it (unless it's dead). Within the
transport, it finds the destination queue for the given nexthop; if
it does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
qmgr_resolve() and is basically the same as in <tt>oqmgr</tt>.

</p>

<p>

Then, for each triple which was bound to some queue (and thus
transport), the program finds the job which represents the message
within that transport's context; if it does not exist yet, it
instantiates it. Within the job, it finds the peer which represents
the bound destination queue within this job's context; if it does
not exist yet, it instantiates it. Finally, it stores the address
from the resolved triple to the recipient entry which is appended
to both the queue entry list and the peer entry list. The addresses
for the same nexthop are batched in the entries up to the
recipient_concurrency limit for that transport. This happens in
qmgr_assign(), and apart from the fact that it operates on job and
peer structures it is basically the same as in <tt>oqmgr</tt>.

</p>

<p>

When the job is instantiated, it is enqueued on the transport's job
list based on the time its message was picked up by <tt>nqmgr</tt>.
For the first batch of recipients this means it is appended to the
end of the job list, but the ordering of the job list by the enqueue
time is important, as we will see shortly.

</p>

<p>

[Now you should have a pretty good idea of the state of the
<tt>nqmgr</tt> after a couple of messages have been picked up, and
of the relation between all those job, peer, queue and entry
structures.]

</p>

<h3> <a name="nqmgr_selection"> How the entry selection works </a> </h3>

<p>

Having prepared all those above-mentioned structures, the task of
the <tt>nqmgr</tt>'s scheduler is to choose the recipient entries
one at a time and pass them to the delivery agent for the corresponding
transport. Now how does this work?
</p>

<p>

The first approximation of the new scheduling algorithm is like this:

</p>

<blockquote>
<pre>
foreach transport (round-robin-by-transport)
do
    if transport busy continue
    if transport process limit reached continue
    foreach transport's job (in the order of the transport's job list)
    do
        foreach job's peer (round-robin-by-destination)
            if peer->queue->concurrency < peer->queue->window
                return next peer entry.
        done
    done
done
</pre>
</blockquote>

<p>

Now what is the "order of the transport's job list"? As we know
already, the job list is by default kept in the order the message
was picked up by the <tt>nqmgr</tt>. So by default we get the
top-level round-robin transport, and within each transport we get
FIFO message delivery. The round-robin of the peers by the
destination is perhaps of little importance in most real-life cases
(unless the recipient_concurrency limit is reached, there is only
one peer structure for each destination in a given job), but
theoretically it makes sure that even within single jobs, destinations
are treated fairly.

</p>

<p>

[By now you should have a feeling you really know how the scheduler
works, except for the preemption, under ideal conditions - that is,
no recipient resource limits and no destination concurrency problems.]

</p>

<h3> <a name="nqmgr_preemption"> How the preemption works </a> </h3>

<p>

As you might perhaps expect by now, the transport's job list does
not remain sorted by the job's message enqueue time all the time.
The coolest thing about <tt>nqmgr</tt> is not the simple FIFO
delivery, but that it is able to slip mail with few recipients
past mailing-list bulk mail. This is what the job preemption
is about - shuffling the jobs on the transport's job list to get
the best message delivery rates. Now how is it achieved?

</p>

<p>

First I have to tell you that there are in fact two job lists in
each transport. One is the scheduler's job list, which the scheduler
is free to play with, while the other one keeps the jobs always
listed in the order of the enqueue time and is used for the recipient
pool management we will discuss later. For now, we will deal with
the scheduler's job list only.

</p>

<p>

So, we have the job list, which is first ordered by the time the
jobs' messages were enqueued, oldest messages first, the most recently
picked one at the end. For now, let's assume that there are no
destination concurrency problems. Without preemption, we pick some
entry of the first (oldest) job on the queue, assign it to a delivery
agent, pick another one from the same job, assign it again, and so
on, until all the entries are used and the job is delivered. We
would then move on to the next job, and so on and on. Now how do we
manage to sneak in some entries from the recently added jobs when
the first job on the job list belongs to a message going to the
mailing-list and has thousands of recipient entries?

</p>

<p>

The <tt>nqmgr</tt>'s answer is that we can artificially "inflate"
the delivery time of that first job by some constant for free - it
is basically the same trick you might remember as "accumulation of
potential" from the amortized complexity lessons.
For example,
instead of delivering the entries of the first job on the job list
every time a delivery agent becomes available, we can do it only
every second time. If you view the moments when a delivery agent
becomes available on a timeline as "delivery slots", then instead
of using every delivery slot for the first job, we can use only
every other slot, and still the overall delivery efficiency of the
first job remains the same. So the delivery <tt>11112222</tt> becomes
<tt>1.1.1.1.2.2.2.2</tt> (1 and 2 are the imaginary job numbers, .
denotes the free slot). Now what do we do with the free slots?

</p>

<p>

As you might have guessed, we will use them for sneaking in the
mail with few recipients. For example, if we have one four-recipient
mail followed by four one-recipient mails, the delivery sequence
(that is, the sequence in which the jobs are assigned to the
delivery slots) might look like this: <tt>12131415</tt>. Hmm, fine
for sneaking in the single-recipient mail, but how do we sneak in
mail with more than one recipient? Say we have one four-recipient
mail followed by two two-recipient mails?

</p>

<p>

The simple answer would be to use the delivery sequence <tt>12121313</tt>.
But the problem is that this does not scale well. Imagine you have
mail with a thousand recipients followed by mail with a hundred
recipients. It is tempting to suggest a delivery sequence like
<tt>121212....</tt>, but alas! Imagine that another mail arrives
with, say, ten recipients. There are no free slots anymore, so it
can't slip by, not even if it had only a single recipient. It will
be stuck until the hundred-recipient mail is delivered, which really
sucks.

</p>

<p>

So, it becomes obvious that while inflating the message to get
free slots is a great idea, one has to be really careful about how
the free slots are assigned, otherwise one might corner oneself.
So, how does <tt>nqmgr</tt> really use the free slots?

</p>

<p>

The key idea is that one does not have to generate the free slots
in a uniform way. The delivery sequence <tt>111...1</tt> is no
worse than <tt>1.1.1.1</tt>; in fact, it is even better, as some
entries are selected earlier in the first case than in the second,
and none is selected later! So it is possible to first
"accumulate" the free delivery slots and then use them all at once.
It is even possible to accumulate some, then use them, then accumulate
some more and use them again, as in <tt>11..1.1</tt>.

</p>

<p>

Let's get back to the one hundred recipient example. We now know
that we could first accumulate one hundred free slots, and only
then preempt the first job and sneak the one hundred recipient
mail in. Applying the algorithm recursively, we see the
hundred-recipient job can accumulate ten free delivery slots, and
then we could preempt it and sneak in the ten-recipient mail...
Wait, wait, wait! Could we? Aren't we overinflating the original
one thousand recipient mail?

</p>

<p>

Well, although it may look so at first glance, another trick will
allow us to answer "no, we are not!". If we had said that we would
inflate the delivery time at most twice, and then considered every
other slot a free slot, then we would overinflate in the case of
recursive preemption. BUT! The trick is that if we use only
every n-th slot as a free slot for n>2, there is always some worst
inflation factor which we can guarantee not to be breached, even
if we apply the algorithm recursively. To be precise, if for every
k>1 normally used slots we accumulate one free delivery slot, then
the inflation factor is not worse than k/(k-1), no matter how many
recursive preemptions happen. And it's not worse than (k+1)/k if
only non-recursive preemption happens.

</p>
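<p>

As a worked example, using the default slot cost of k=5 that is
introduced below: a job with 1000 recipient entries donates one
free slot for every 5 slots it uses, so it finishes after at most
1000 &times; 5/4 = 1250 slots even under arbitrarily deep recursive
preemption, and after at most 1000 &times; 6/5 = 1200 slots when
only non-recursive preemption happens.

</p>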
<p>

Now, having got through the theory and the related math, let's see
how <tt>nqmgr</tt> implements this.

</p>

<p>

Each job has a so-called "available delivery slot" counter. Each
transport has a <i>transport</i>_delivery_slot_cost parameter, which
defaults to the default_delivery_slot_cost parameter, which is set
to 5 by default. This is the k from the paragraphs above. Each time
k entries of the job are selected for delivery, this counter is
incremented by one. Once some slots are accumulated, a job which
requires no more than that number of slots to be fully delivered
can preempt this job.

</p>

<p>

[Well, the truth is, the counter is incremented every time an entry
is selected, and it is divided by k when it is used. Or, more
precisely, there is no division; the other side of the comparison
is multiplied by k. But for understanding purposes, the approximation
above is good enough.]

</p>

<p>

OK, so now we know the conditions which must be satisfied for one
job to preempt another one. But which job gets preempted, how do
we choose the job that preempts it if there are several valid
candidates, and when exactly does all this happen?

</p>

<p>

The answer to the first part is simple. The job whose entry was
selected the last time is the so-called current job. Normally, it
is the first job on the scheduler's job list, but destination
concurrency limits may change this, as we will see later. It is
always only the current job which may get preempted.

</p>

<p>

Now for the second part. The current job has a certain number of
recipient entries, and as such may accumulate at most a certain
number of available delivery slots. It might have already accumulated
some, and perhaps even already used some when it was preempted
before (remember, a job can be preempted several times). In either
case, we know how many are accumulated and how many are left to
deliver, so we know how many it may yet accumulate at most. Every
other job which may be delivered with fewer than that number of
slots is a valid candidate for preemption. How do we choose among
them?

</p>

<p>

The answer is: the one with the maximum
enqueue_time/recipient_entry_count. That is, the older the job is,
the more we should try to deliver it in order to get the best message
delivery rates. These rates are of course subject to how many
recipients the message has, hence the division by the recipient
(entry) count. No one shall be surprised that a message with n
recipients takes n times longer to deliver than a message with one
recipient.

</p>
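<p>

A minimal C sketch of this candidate selection might look as follows.
It is not the actual qmgr_choose_candidate() from the <tt>nqmgr</tt>
sources; the job fields and the slots_needed() helper are made up
for this illustration, and "enqueue_time" is interpreted as the age
of the job (time elapsed since its message was enqueued), so that
older jobs score higher.

</p>

<blockquote>
<pre>
#include &lt;time.h&gt;

/* Illustrative job bookkeeping; not the real nqmgr structures. */
struct job {
    time_t      enqueue_time;   /* when the job's message was enqueued */
    int         entry_count;    /* recipient entries left to deliver */
    struct job *next;           /* next job on the scheduler's job list */
};

/* Slots needed to deliver the candidate in full (illustrative). */
static int slots_needed(const struct job *job)
{
    return job->entry_count;
}

/*
 * Among all jobs that can be fully delivered with fewer slots than
 * the current job may still accumulate, pick the one with the best
 * age/recipient_entry_count ratio: the older the job and the fewer
 * entries it has, the better a candidate it is.
 */
static struct job *choose_candidate(struct job *job_list,
                                    struct job *current,
                                    int max_slots_obtainable,
                                    time_t now)
{
    struct job *best = 0;
    double best_score = 0;

    for (struct job *job = job_list; job != 0; job = job->next) {
        if (job == current || slots_needed(job) >= max_slots_obtainable)
            continue;                   /* not a valid candidate */
        double score = difftime(now, job->enqueue_time) / job->entry_count;
        if (best == 0 || score > best_score) {
            best = job;
            best_score = score;
        }
    }
    return best;                        /* 0 when there is no candidate */
}
</pre>
</blockquote>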
<p>

Now let's recap the previous two paragraphs. Isn't it too complicated?
Why don't the candidates come only from among the jobs which can be
delivered within the number of slots the current job has already
accumulated? Why do we need to estimate how much it has yet to
accumulate? If you found out the answer, congratulate yourself. If
we did it this simple way, we would always choose the candidate
with the fewest recipient entries. If there were enough single-recipient
mails coming in, they would always slip by the bulk mail as soon
as possible, and mail with two or more recipients would never get
a chance, no matter how long it has been sitting around in the
job list.

</p>

<p>

This candidate selection has an interesting implication - that when
we choose the best candidate for preemption (this is done in
qmgr_choose_candidate()), it may happen that we may not be able to
use it for preemption immediately. This leads to an answer to the
last part of the original question - when does the preemption happen?

</p>

<p>

The preemption attempt happens every time the transport's next
recipient entry is to be chosen for delivery. To avoid needless
overhead, the preemption is not attempted if the current job could
never accumulate more than <i>transport</i>_minimum_delivery_slots
(defaults to default_minimum_delivery_slots, which defaults to 3).
If enough slots have already been accumulated to preempt the current
job by the chosen best candidate, it is done immediately. This
basically means that the candidate is moved in front of the current
job on the scheduler's job list, and the accumulated slot counter
is decreased by the amount used by the candidate. If there are not
enough slots... well, I could say that nothing happens and another
preemption is attempted the next time. But that's not the whole
truth.

</p>

<p>

The truth is that it turns out that it is not really necessary to
wait until the job's counter accumulates all the delivery slots in
advance. Say we have a ten-recipient mail followed by two two-recipient
mails. If preemption happened only when enough delivery slots
accumulate (assuming a slot cost of 2), the delivery sequence becomes
<tt>11112211113311</tt>. Now what would we get if we waited for
only 50% of the necessary slots to accumulate, promising to wait
for the remaining 50% later, after we get back to the preempted
job? If we use such a slot loan, the delivery sequence becomes
<tt>11221111331111</tt>. As we can see, it is not considerably
worse for the delivery of the ten-recipient mail, but it allows
the small messages to be delivered sooner.

</p>

<p>

The concept of these slot loans is where the
<i>transport</i>_delivery_slot_discount and
<i>transport</i>_delivery_slot_loan parameters come from (they
default to default_delivery_slot_discount and default_delivery_slot_loan,
whose values are by default 50 and 3, respectively). The discount
(resp. loan) specifies how many percent (resp. how many slots) one
"gets in advance" when the number of slots required to deliver the
best candidate is compared with the number of slots the current job
has accumulated so far.

</p>

<p>

And that pretty much concludes this chapter.

</p>
<p>

[Now you should have a feeling that you pretty much understand the
scheduler and the preemption, or at least that you will have it
after you read the last chapter a couple more times. You shall
clearly see the job list and the preemption happening at its head,
in ideal delivery conditions. The feeling of understanding shall
last until you start wondering what happens if some of the jobs
are blocked, which you might eventually figure out correctly from
what has been said already. But I would be surprised if your mental
image of the scheduler's functionality is not completely shattered
once you start wondering how it works when not all recipients may
be read in-core. More on that later.]

</p>

<h3> <a name="nqmgr_concurrency"> How destination concurrency
limits affect the scheduling algorithm </a> </h3>

<p>

The <tt>nqmgr</tt> uses the same algorithm for destination concurrency
control as <tt>oqmgr</tt>. Now what happens when the destination
limits are reached and no more entries for that destination may be
selected by the scheduler?

</p>

<p>

From the user's point of view it is all simple. If some of the peers
of a job can't be selected, those peers are simply skipped by the
entry selection algorithm (the pseudo-code described before) and
only the selectable ones are used. If none of the peers may be
selected, the job is declared a "blocker job". Blocker jobs are
skipped by the entry selection algorithm and they are also excluded
from the candidates for preemption of the current job. Thus the
scheduler effectively behaves as if the blocker jobs didn't exist
on the job list at all. As soon as at least one of the peers of a
blocker job becomes unblocked (that is, the delivery agent handling
the delivery of the recipient entry for the given destination
successfully finishes), the job's blocker status is removed and the
job again participates in all further scheduler actions normally.

</p>

<p>

So the summary is that the users don't really have to be concerned
about the interaction of the destination limits and the scheduling
algorithm. It works well on its own, and there are no knobs they
would need to control it.

</p>

<p>

From a programmer's point of view, the blocker jobs complicate the
scheduler quite a lot. Without them, the jobs on the job list would
normally be delivered in strict FIFO order. If the current job is
preempted, the job preempting it is completely delivered unless it
is preempted itself. Without blockers, the current job is thus
always either the first job on the job list, or the top of the stack
of jobs preempting the first job on the job list.

</p>

<p>

The visualization of the job list and the preemption stack without
blockers would be like this:

</p>

<blockquote>
<pre>
first job->    1--2--3--5--6--8--...    <- job list
on job list    |
               4    <- preemption stack
               |
current job->  7
</pre>
</blockquote>

<p>

In the example above, we see that job 1 was preempted by job 4, and
then job 4 was preempted by job 7. After job 7 is completed, the
remaining entries of job 4 are selected, and once they are all
selected, job 1 continues.
</p>

<p>

As we can see, it's all very clean and straightforward. Now how does
this change because of blockers?

</p>

<p>

The answer is: a lot. Any job may become a blocker job at any time,
and may become a normal job again at any time. This has several
important implications:

</p>

<ol>

<li> <p>

The jobs may be completed in arbitrary order. For example, in the
example above, if the current job 7 becomes blocked, the next job
4 may complete before job 7 becomes unblocked again. Or if both
7 and 4 are blocked, then 1 is completed, then 7 becomes unblocked
and is completed, then 2 is completed, and only after that 4 becomes
unblocked and is completed... You get the idea.

</p>

<p>

[Interesting side note: even when jobs are delivered out of order,
from a single destination's point of view the jobs are still delivered
in the expected order (that is, FIFO unless there was some preemption
involved). This is because whenever a destination queue becomes
unblocked (the destination limit allows selection of more recipient
entries for that destination), all jobs which have peers for that
destination are unblocked at once.]

</p>

<li> <p>

The idea of the preemption stack at the head of the job list is
gone. That is, it must be possible to preempt any job on the job
list. For example, if jobs 7, 4, 1 and 2 in the example above
all become blocked, job 3 becomes the current job. And of course
we do not want the preemption to be affected by whether or not there
are blocked jobs. Therefore, if it turns out that job 3 might be
preempted by job 6, the implementation shall make it possible.

</p>

<li> <p>

The idea of the linear preemption stack itself is gone. It's no
longer true that one job is always preempted by only one job at one
time (that is, directly preempted, not counting the recursively
nested jobs). For example, in the example above, job 1 is directly
preempted by only job 4, and job 4 by job 7. Now assume job 7 becomes
blocked, and job 4 is being delivered. If it accumulates enough
delivery slots, it is natural that it might be preempted, for
example, by job 8. Now job 4 is preempted by both job 7 AND job 8
at the same time.

</p>

</ol>

<p>

Now combine points 2) and 3) with point 1) again and you realize
that the relations on the once linear job list become pretty
complicated. If we extend the point 3) example: jobs 7 and 8 preempt
job 4, now job 8 becomes blocked too, then job 4 completes. Tricky,
huh?

</p>

<p>

If I illustrate the relations after the above-mentioned examples
(except those in point 1)), the situation would look like this:

</p>

<blockquote>
<pre>
                    v- parent

adoptive parent ->  1--2--3--5--...     <- "stack" level 0
                    |  |
parent gone ->      ?  6                <- "stack" level 1
                   / \
children ->       7   8   ^- child      <- "stack" level 2

                  ^- siblings
</pre>
</blockquote>

<p>

Now how does <tt>nqmgr</tt> deal with all these complicated relations?

</p>

<p>

Well, it maintains them all as described, but fortunately, all these
relations are necessary only for the purpose of proper counting of
available delivery slots.
<p>

For purposes of ordering the jobs for entry selection, the original
rule still applies: "the job preempting the current job is moved
in front of the current job on the job list". So for entry selection
purposes, the job relations remain as simple as this:

</p>

<blockquote>
<pre>
7--8--1--2--6--3--5--..    <- scheduler's job list order
</pre>
</blockquote>

<p>

The job list order and the preemption parent/child/sibling relations
are maintained separately. And because the selection works only
with the job list, you can happily forget about those complicated
relations unless you want to study the <tt>nqmgr</tt> sources. In
that case the text above might provide some helpful introduction
to the problem domain. Otherwise I suggest you just forget about
all this and stick with the user's point of view: the blocker jobs
are simply ignored.

</p>

<p>

[By now, you should have a feeling that there are more things going
on under the hood than you ever wanted to know. You decide that
forgetting about this chapter is the best thing you can do for the
sake of your mind's health, and you basically stick with the idea of
how the scheduler works in ideal conditions, when there are no
blockers, which is good enough.]

</p>

<h3> <a name="nqmgr_memory"> Dealing with memory resource
limits </a> </h3>

<p>

When discussing the <tt>nqmgr</tt> scheduler, we have so far assumed
that all recipients of all messages in the active queue are completely
read into memory. This is simply not true. There is an upper
bound on the amount of memory the <tt>nqmgr</tt> may use, and
therefore it must impose some limits on the information it may store
in memory at any given time.

</p>

<p>

First of all, not all messages may be read in-core at once. At any
time, at most qmgr_message_active_limit messages may be read in-core.
When read into memory, the messages are picked from the incoming
and deferred message queues and moved to the active queue (incoming
having priority), so if more than qmgr_message_active_limit messages
are waiting for the active queue, the rest will have to wait until
(some of) the messages in the active queue are completely delivered
(or deferred).

</p>

<p>

Even with the limited number of in-core messages, there is another
limit which must be imposed in order to avoid memory exhaustion.
Each message may contain a huge number of recipients (tens or hundreds
of thousands are not uncommon), so if <tt>nqmgr</tt> read all
recipients of all messages in the active queue, it might easily run
out of memory. Therefore there must be some upper bound on the
number of message recipients which are read into memory at the
same time.

</p>

<p>

Before discussing how exactly <tt>nqmgr</tt> implements the recipient
limits, let's see how the mere existence of the limits affects the
<tt>nqmgr</tt> and its scheduler.

</p>

<p>

The message limit is straightforward - it just limits the size of
the lookahead the <tt>nqmgr</tt>'s scheduler has when choosing which
message can preempt the current one. Messages not in the active
queue are simply not considered at all.

</p>
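<p>

A minimal sketch of the active queue admission just described, with
invented names (the real queue manager is written in C and is
considerably more careful than this):

</p>

<blockquote>
<pre>
def fill_active_queue(incoming, deferred, active, active_limit):
    # Admit messages until qmgr_message_active_limit is reached,
    # taking from the incoming queue first, then from deferred.
    while len(active) < active_limit:
        if incoming:
            active.append(incoming.pop(0))
        elif deferred:
            active.append(deferred.pop(0))
        else:
            break        # nothing is waiting; the rest arrives later
</pre>
</blockquote>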
<p>

The recipient limit complicates more things. First of all, the
message reading code must support reading the recipients in batches,
which among other things means accessing the queue file several
times and continuing where the last recipient batch ended. This is
invoked by the scheduler whenever the current job has space for more
recipients, subject to the transport's refill_limit and refill_delay
parameters. It is also done any time when all in-core recipients
of the message are dealt with (which may also mean they were deferred)
but there are still more in the queue file.

</p>

<p>

The second complication is that with some recipients left unread
in the queue file, the scheduler can't operate with exact counts
of recipient entries. With unread recipients, it is not clear how
many recipient entries there will be, as they are subject to
per-destination grouping. It is not even clear to what transports
(and thus jobs) the recipients will be assigned. And with messages
coming from the deferred queue, it is not even clear how many unread
recipients are still to be delivered. This all means that the
scheduler must use only estimates of how many recipient entries
there will be. Fortunately, it is possible to correctly estimate
the minimum and the maximum, so the scheduler can always err on the
safe side. Obviously, the better the estimates, the better the
results, so it is best when we are able to read all recipients
in-core and turn the estimates into exact counts, or at least to
read as many as possible to make the estimates as accurate as
possible.

</p>

<p>

The third complication is that it is no longer true that the scheduler
is done with a job once all of its in-core recipients are delivered.
It is possible that the job will be revived later, when another
batch of recipients is read in-core. It is also possible that some
jobs will be created for the first time long after the first batch
of recipients was read in-core. The <tt>nqmgr</tt> code must be
ready to handle all such situations.

</p>

<p>

And finally, the fourth complication is that the <tt>nqmgr</tt>
code must somehow impose the recipient limit itself. Now how does
it achieve that?

</p>

<p>

Perhaps the easiest solution would be to say that each message may
have at most X recipients stored in-core, but such a solution would
be poor for several reasons. With reasonable qmgr_message_active_limit
values, the X would have to be quite low to maintain a reasonable
memory footprint. And with a low X, lots of things would not work
well. The <tt>nqmgr</tt> would have problems using the
<i>transport</i>_destination_recipient_limit efficiently. The
scheduler's preemption would be suboptimal as the recipient count
estimates would be inaccurate. The message queue file would have
to be accessed many times to read in more recipients again and
again.

</p>

<p>

Therefore it seems reasonable to have a solution which does not use
a limit imposed on a per-message basis, but which maintains a pool
of available recipient slots which can be shared among all messages
in the most efficient manner. And as we do not want separate
transports to compete for resources whenever possible, it seems
appropriate to maintain such a recipient pool for each transport
separately. This is the general idea; now how does it work in
practice?

</p>
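<p>

As a tiny preview of the mechanics described next, this is roughly
what such a per-transport pool might look like (hypothetical names;
a sketch of the idea, not the actual implementation):

</p>

<blockquote>
<pre>
class Transport:
    def __init__(self, recipient_limit):
        # Pool of recipient slots shared by all jobs (and thus all
        # messages) assigned to this transport.
        self.rcpt_pool = recipient_limit
        self.jobs_by_enqueue_time = []

    def take_all_slots(self):
        # A job borrowing every currently unused slot from the pool.
        taken, self.rcpt_pool = self.rcpt_pool, 0
        return taken
</pre>
</blockquote>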
<p>

First we have to solve a little chicken-and-egg problem. If we want
to use the per-transport recipient pools, we first need to know to
which transport(s) the message is assigned. But we will find that
out only after we read in the recipients. So it is obvious that we
first have to read in some recipients, use them to find out to
which transports the message is to be assigned, and only after that
can we use the per-transport recipient pools.

</p>

<p>

Now how many recipients shall we read the first time? This is
what the qmgr_message_recipient_minimum and qmgr_message_recipient_limit
values control. The qmgr_message_recipient_minimum value specifies
how many recipients of each message we will read the first time,
no matter what. It is necessary to read at least one recipient
before we can assign the message to a transport and create the first
job. However, reading only qmgr_message_recipient_minimum recipients
even when there are only a few messages with few recipients in-core
would be wasteful. Therefore, if there are fewer than
qmgr_message_recipient_limit recipients in-core so far, the first
batch of recipients may be larger than qmgr_message_recipient_minimum
- as large as is required to reach the qmgr_message_recipient_limit.

</p>

<p>

Once the first batch of recipients is read in-core and the message
jobs are created, the size of the subsequent recipient batches (if
any - of course it's best when all recipients are read in one batch)
is based solely on the position of the message jobs on their
corresponding transports' job lists. Each transport has a pool of
<i>transport</i>_recipient_limit recipient slots which it can
distribute among its jobs (how this is done is described later).
A subsequent recipient batch may be as large as the sum of all
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies).

</p>

<p>

For example, if a message has three jobs, the first with 1 recipient
still in-core and 4 recipient slots, the second with 5 recipients
in-core and 5 recipient slots, and the third with 2 recipients
in-core and 0 recipient slots, it has 1+5+2=8 recipients in-core
and 4+5+0=9 job recipient slots in total. This means that we could
immediately read 1+qmgr_message_recipient_minimum more recipients
of that message in-core.

</p>

<p>

The above example illustrates several things which might be worth
mentioning explicitly: first, note that although the per-transport
slots are assigned to particular jobs, we can't guarantee that,
once the next batch of recipients is read in-core, the corresponding
amounts of recipients will be assigned to those jobs. The jobs lend
their slots to the message as a whole, so it is possible that some
jobs end up sponsoring other jobs of their message. For example,
if in the example above the 1 newly read recipient was assigned
to the second job, the first job sponsored the second job with 1
slot.

</p>
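<p>

For the curious, here is the arithmetic of the example above, together
with one plausible reading of the first-batch rule, in a few lines
of Python. The variable names are ours; the defaults shown match
the stock Postfix settings:

</p>

<blockquote>
<pre>
qmgr_message_recipient_minimum = 100
qmgr_message_recipient_limit = 20000

# First batch: at least the minimum, more if the total in-core
# recipient count is still under qmgr_message_recipient_limit.
def first_batch_size(total_in_core):
    return max(qmgr_message_recipient_minimum,
               qmgr_message_recipient_limit - total_in_core)

# Subsequent batch for the three-job example above:
jobs = [(1, 4), (5, 5), (2, 0)]          # (in-core, slots) pairs
in_core = sum(r for r, s in jobs)        # 1+5+2 = 8
slots = sum(s for r, s in jobs)          # 4+5+0 = 9
batch = (slots - in_core) + qmgr_message_recipient_minimum
print(batch)                             # 1 + 100 = 101
</pre>
</blockquote>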
<p>

The second notable thing is the third job, which has more recipients
in-core than it has slots. Apart from sponsoring by another job,
which we just saw, this can be a result of the first recipient
batch, which is sponsored from the global recipient pool of
qmgr_message_recipient_limit recipients. It can also be sponsored
from the message recipient pool of qmgr_message_recipient_minimum
recipients.

</p>

<p>

Now how does each transport distribute the recipient slots among
its jobs? The strategy is quite simple. As most scheduler activity
happens at the head of the job list, it is our intention to make
sure that the scheduler has the best estimates of the recipient
counts for those jobs. As we mentioned above, this means that we
want to try to make sure that the messages of those jobs have all
their recipients read in-core. Therefore the transport distributes
the slots "along" the job list from start to end. In this case the
job list sorted by message enqueue time is used, because it doesn't
change over time as the scheduler's job list does.

</p>

<p>

More specifically, each time a job is created and appended to the
job list, it gets all unused recipient slots from its transport's
pool. It keeps them until all recipients of its message are read.
When this happens, all unused recipient slots are transferred to
the next job on the job list which still has some unread recipients
(which is in fact now the first such job), or eventually back to
the transport pool if there is no such job. Such a transfer also
happens whenever a recipient entry of that job is delivered.

</p>

<p>

There is also a scenario where a job is not appended to the end of
the job list (for example, it was created as a result of a second
or later recipient batch). Then it works exactly as above, except
that if it was put in front of the first unread job (that is, the
job of a message which still has some unread recipients in its queue
file), that job is first forced to return all of its unused recipient
slots to the transport pool.

</p>

<p>

The algorithm just described leads to the following state: the first
unread job on the job list always gets all the remaining recipient
slots of its transport (if there are any). The jobs queued before
this job are completely read (that is, all recipients of their
message were already read in-core) and have at most as many slots
as they still have recipients in-core (the maximum is there because
of the sponsoring mentioned before), and the jobs after this job
get nothing from the transport recipient pool (unless they got
something before and the first unread job was created and enqueued
in front of them later - in such a case they also get at most as
many slots as they have recipients in-core).

</p>

<p>

Things work fine in such a state most of the time, because the
current job is either completely read in-core or holds all the
recipient slots there are, but there is one situation which we still
have to take special care of. Imagine that the current job is
preempted by some unread job from the job list and there are no
more recipient slots available, so this new current job could read
only batches of qmgr_message_recipient_minimum recipients at a time.
This would really degrade performance. For this reason, each transport
has an extra pool of <i>transport</i>_extra_recipient_limit recipient
slots, dedicated exactly to this situation. Each time an unread
job preempts the current job, it gets half of the remaining recipient
slots from the normal pool and this extra pool.

</p>
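<p>

Putting the last few paragraphs together, here is a rough sketch of
the slot bookkeeping. The names are invented, first_unread_job() is
a hypothetical helper, and the last function is just one plausible
reading of the "half of the remaining slots" rule:

</p>

<blockquote>
<pre>
def job_created(transport, job):
    # A new job appended to the job list takes all unused slots.
    job.slots = transport.rcpt_pool
    transport.rcpt_pool = 0

def transfer_unused_slots(transport, job):
    # Called when all recipients of the job's message have been read,
    # and again each time one of its recipient entries is delivered.
    unused = job.slots - job.rcpt_in_core
    if unused > 0:
        job.slots -= unused
        nxt = first_unread_job(transport)
        if nxt is not None:
            nxt.slots += unused            # hand leftovers down the list
        else:
            transport.rcpt_pool += unused  # nothing unread: back to pool

def unread_job_preempts(transport, job):
    # An unread job preempting the current job draws on the normal
    # and the extra pool, so it is not stuck reading
    # qmgr_message_recipient_minimum-sized batches.
    grant = (transport.rcpt_pool + transport.extra_pool) // 2
    from_extra = min(grant, transport.extra_pool)
    transport.extra_pool -= from_extra
    transport.rcpt_pool -= grant - from_extra
    job.slots += grant
</pre>
</blockquote>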
<p>

And that's it. It sure does sound pretty complicated, but fortunately
most people don't really have to care how exactly it works as long
as it works. Perhaps the only important things to know for most
people are the following upper bound formulas:

</p>

<p>

Each transport has at most

</p>

<blockquote>
<pre>
max(
    qmgr_message_recipient_minimum * qmgr_message_active_limit
        + *_recipient_limit + *_extra_recipient_limit,
    qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

recipients in-core.

</p>

<p>

The total number of recipients in-core is

</p>

<blockquote>
<pre>
max(
    qmgr_message_recipient_minimum * qmgr_message_active_limit
        + sum( *_recipient_limit + *_extra_recipient_limit ),
    qmgr_message_recipient_limit
)
</pre>
</blockquote>

<p>

where the sum is taken over all transports in use.

</p>

<p>

And this terribly complicated chapter concludes the documentation
of the <tt>nqmgr</tt> scheduler.

</p>

<p>

[By now you should theoretically know the <tt>nqmgr</tt> scheduler
inside out. In practice, you still hope that you will never have
to really understand the last chapter or two completely, and
fortunately most people really won't. Understanding how the scheduler
works in ideal conditions is more than good enough for the vast
majority of users.]

</p>

<h2> <a name="credits"> Credits </a> </h2>

<ul>

<li> Wietse Venema designed and implemented the initial queue manager
with per-domain FIFO scheduling, and per-delivery +/-1 concurrency
feedback.

<li> Patrik Rak designed and implemented preemption where mail with
fewer recipients can slip past mail with more recipients in a
controlled manner, and wrote up its documentation.

<li> Wietse Venema initiated a discussion with Patrik Rak and Victor
Duchovni on alternatives for the +/-1 feedback scheduler's aggressive
behavior. This is when K/N feedback was reviewed (N = concurrency).
The discussion ended without a good solution for both negative
feedback and dead site detection.

<li> Victor Duchovni resumed work on concurrency feedback in the
context of concurrency-limited servers.

<li> Wietse Venema then re-designed the concurrency scheduler in
terms of the simplest possible concepts: less-than-1 concurrency
feedback per delivery, forward and reverse concurrency feedback
hysteresis, and pseudo-cohort failure. At this same time, concurrency
feedback was separated from dead site detection.

<li> These simplifications, and their modular implementation, helped
to develop further insights into the different roles that positive
and negative concurrency feedback play, and helped to identify some
worst-case scenarios.

</ul>

</body>

</html>