1/* 2 * INET An implementation of the TCP/IP protocol suite for the LINUX 3 * operating system. INET is implemented using the BSD Socket 4 * interface as the means of communication with the user level. 5 * 6 * Implementation of the Transmission Control Protocol(TCP). 7 * 8 * Version: $Id: tcp.c,v 1.2 2010-11-04 09:38:03 $ 9 * 10 * Authors: Ross Biro 11 * Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG> 12 * Mark Evans, <evansmp@uhura.aston.ac.uk> 13 * Corey Minyard <wf-rch!minyard@relay.EU.net> 14 * Florian La Roche, <flla@stud.uni-sb.de> 15 * Charles Hedrick, <hedrick@klinzhai.rutgers.edu> 16 * Linus Torvalds, <torvalds@cs.helsinki.fi> 17 * Alan Cox, <gw4pts@gw4pts.ampr.org> 18 * Matthew Dillon, <dillon@apollo.west.oic.com> 19 * Arnt Gulbrandsen, <agulbra@nvg.unit.no> 20 * Jorge Cwik, <jorge@laser.satlink.net> 21 * 22 * Fixes: 23 * Alan Cox : Numerous verify_area() calls 24 * Alan Cox : Set the ACK bit on a reset 25 * Alan Cox : Stopped it crashing if it closed while 26 * sk->inuse=1 and was trying to connect 27 * (tcp_err()). 28 * Alan Cox : All icmp error handling was broken 29 * pointers passed where wrong and the 30 * socket was looked up backwards. Nobody 31 * tested any icmp error code obviously. 32 * Alan Cox : tcp_err() now handled properly. It 33 * wakes people on errors. poll 34 * behaves and the icmp error race 35 * has gone by moving it into sock.c 36 * Alan Cox : tcp_send_reset() fixed to work for 37 * everything not just packets for 38 * unknown sockets. 39 * Alan Cox : tcp option processing. 40 * Alan Cox : Reset tweaked (still not 100%) [Had 41 * syn rule wrong] 42 * Herp Rosmanith : More reset fixes 43 * Alan Cox : No longer acks invalid rst frames. 44 * Acking any kind of RST is right out. 45 * Alan Cox : Sets an ignore me flag on an rst 46 * receive otherwise odd bits of prattle 47 * escape still 48 * Alan Cox : Fixed another acking RST frame bug. 49 * Should stop LAN workplace lockups. 50 * Alan Cox : Some tidyups using the new skb list 51 * facilities 52 * Alan Cox : sk->keepopen now seems to work 53 * Alan Cox : Pulls options out correctly on accepts 54 * Alan Cox : Fixed assorted sk->rqueue->next errors 55 * Alan Cox : PSH doesn't end a TCP read. Switched a 56 * bit to skb ops. 57 * Alan Cox : Tidied tcp_data to avoid a potential 58 * nasty. 59 * Alan Cox : Added some better commenting, as the 60 * tcp is hard to follow 61 * Alan Cox : Removed incorrect check for 20 * psh 62 * Michael O'Reilly : ack < copied bug fix. 63 * Johannes Stille : Misc tcp fixes (not all in yet). 64 * Alan Cox : FIN with no memory -> CRASH 65 * Alan Cox : Added socket option proto entries. 66 * Also added awareness of them to accept. 67 * Alan Cox : Added TCP options (SOL_TCP) 68 * Alan Cox : Switched wakeup calls to callbacks, 69 * so the kernel can layer network 70 * sockets. 71 * Alan Cox : Use ip_tos/ip_ttl settings. 72 * Alan Cox : Handle FIN (more) properly (we hope). 73 * Alan Cox : RST frames sent on unsynchronised 74 * state ack error. 75 * Alan Cox : Put in missing check for SYN bit. 76 * Alan Cox : Added tcp_select_window() aka NET2E 77 * window non shrink trick. 
78 * Alan Cox : Added a couple of small NET2E timer 79 * fixes 80 * Charles Hedrick : TCP fixes 81 * Toomas Tamm : TCP window fixes 82 * Alan Cox : Small URG fix to rlogin ^C ack fight 83 * Charles Hedrick : Rewrote most of it to actually work 84 * Linus : Rewrote tcp_read() and URG handling 85 * completely 86 * Gerhard Koerting: Fixed some missing timer handling 87 * Matthew Dillon : Reworked TCP machine states as per RFC 88 * Gerhard Koerting: PC/TCP workarounds 89 * Adam Caldwell : Assorted timer/timing errors 90 * Matthew Dillon : Fixed another RST bug 91 * Alan Cox : Move to kernel side addressing changes. 92 * Alan Cox : Beginning work on TCP fastpathing 93 * (not yet usable) 94 * Arnt Gulbrandsen: Turbocharged tcp_check() routine. 95 * Alan Cox : TCP fast path debugging 96 * Alan Cox : Window clamping 97 * Michael Riepe : Bug in tcp_check() 98 * Matt Dillon : More TCP improvements and RST bug fixes 99 * Matt Dillon : Yet more small nasties remove from the 100 * TCP code (Be very nice to this man if 101 * tcp finally works 100%) 8) 102 * Alan Cox : BSD accept semantics. 103 * Alan Cox : Reset on closedown bug. 104 * Peter De Schrijver : ENOTCONN check missing in tcp_sendto(). 105 * Michael Pall : Handle poll() after URG properly in 106 * all cases. 107 * Michael Pall : Undo the last fix in tcp_read_urg() 108 * (multi URG PUSH broke rlogin). 109 * Michael Pall : Fix the multi URG PUSH problem in 110 * tcp_readable(), poll() after URG 111 * works now. 112 * Michael Pall : recv(...,MSG_OOB) never blocks in the 113 * BSD api. 114 * Alan Cox : Changed the semantics of sk->socket to 115 * fix a race and a signal problem with 116 * accept() and async I/O. 117 * Alan Cox : Relaxed the rules on tcp_sendto(). 118 * Yury Shevchuk : Really fixed accept() blocking problem. 119 * Craig I. Hagan : Allow for BSD compatible TIME_WAIT for 120 * clients/servers which listen in on 121 * fixed ports. 122 * Alan Cox : Cleaned the above up and shrank it to 123 * a sensible code size. 124 * Alan Cox : Self connect lockup fix. 125 * Alan Cox : No connect to multicast. 126 * Ross Biro : Close unaccepted children on master 127 * socket close. 128 * Alan Cox : Reset tracing code. 129 * Alan Cox : Spurious resets on shutdown. 130 * Alan Cox : Giant 15 minute/60 second timer error 131 * Alan Cox : Small whoops in polling before an 132 * accept. 133 * Alan Cox : Kept the state trace facility since 134 * it's handy for debugging. 135 * Alan Cox : More reset handler fixes. 136 * Alan Cox : Started rewriting the code based on 137 * the RFC's for other useful protocol 138 * references see: Comer, KA9Q NOS, and 139 * for a reference on the difference 140 * between specifications and how BSD 141 * works see the 4.4lite source. 142 * A.N.Kuznetsov : Don't time wait on completion of tidy 143 * close. 144 * Linus Torvalds : Fin/Shutdown & copied_seq changes. 145 * Linus Torvalds : Fixed BSD port reuse to work first syn 146 * Alan Cox : Reimplemented timers as per the RFC 147 * and using multiple timers for sanity. 148 * Alan Cox : Small bug fixes, and a lot of new 149 * comments. 150 * Alan Cox : Fixed dual reader crash by locking 151 * the buffers (much like datagram.c) 152 * Alan Cox : Fixed stuck sockets in probe. A probe 153 * now gets fed up of retrying without 154 * (even a no space) answer. 155 * Alan Cox : Extracted closing code better 156 * Alan Cox : Fixed the closing state machine to 157 * resemble the RFC. 158 * Alan Cox : More 'per spec' fixes. 159 * Jorge Cwik : Even faster checksumming. 
160 * Alan Cox : tcp_data() doesn't ack illegal PSH 161 * only frames. At least one pc tcp stack 162 * generates them. 163 * Alan Cox : Cache last socket. 164 * Alan Cox : Per route irtt. 165 * Matt Day : poll()->select() match BSD precisely on error 166 * Alan Cox : New buffers 167 * Marc Tamsky : Various sk->prot->retransmits and 168 * sk->retransmits misupdating fixed. 169 * Fixed tcp_write_timeout: stuck close, 170 * and TCP syn retries gets used now. 171 * Mark Yarvis : In tcp_read_wakeup(), don't send an 172 * ack if state is TCP_CLOSED. 173 * Alan Cox : Look up device on a retransmit - routes may 174 * change. Doesn't yet cope with MSS shrink right 175 * but it's a start! 176 * Marc Tamsky : Closing in closing fixes. 177 * Mike Shaver : RFC1122 verifications. 178 * Alan Cox : rcv_saddr errors. 179 * Alan Cox : Block double connect(). 180 * Alan Cox : Small hooks for enSKIP. 181 * Alexey Kuznetsov: Path MTU discovery. 182 * Alan Cox : Support soft errors. 183 * Alan Cox : Fix MTU discovery pathological case 184 * when the remote claims no mtu! 185 * Marc Tamsky : TCP_CLOSE fix. 186 * Colin (G3TNE) : Send a reset on syn ack replies in 187 * window but wrong (fixes NT lpd problems) 188 * Pedro Roque : Better TCP window handling, delayed ack. 189 * Joerg Reuter : No modification of locked buffers in 190 * tcp_do_retransmit() 191 * Eric Schenk : Changed receiver side silly window 192 * avoidance algorithm to BSD style 193 * algorithm. This doubles throughput 194 * against machines running Solaris, 195 * and seems to result in general 196 * improvement. 197 * Stefan Magdalinski : adjusted tcp_readable() to fix FIONREAD 198 * Willy Konynenberg : Transparent proxying support. 199 * Mike McLagan : Routing by source 200 * Keith Owens : Do proper merging with partial SKB's in 201 * tcp_do_sendmsg to avoid burstiness. 202 * Eric Schenk : Fix fast close down bug with 203 * shutdown() followed by close(). 204 * Andi Kleen : Make poll agree with SIGIO 205 * Salvatore Sanfilippo : Support SO_LINGER with linger == 1 and 206 * lingertime == 0 (RFC 793 ABORT Call) 207 * Hirokazu Takahashi : Use copy_from_user() instead of 208 * csum_and_copy_from_user() if possible. 209 * 210 * This program is free software; you can redistribute it and/or 211 * modify it under the terms of the GNU General Public License 212 * as published by the Free Software Foundation; either version 213 * 2 of the License, or(at your option) any later version. 214 * 215 * Description of States: 216 * 217 * TCP_SYN_SENT sent a connection request, waiting for ack 218 * 219 * TCP_SYN_RECV received a connection request, sent ack, 220 * waiting for final ack in three-way handshake. 221 * 222 * TCP_ESTABLISHED connection established 223 * 224 * TCP_FIN_WAIT1 our side has shutdown, waiting to complete 225 * transmission of remaining buffered data 226 * 227 * TCP_FIN_WAIT2 all buffered data sent, waiting for remote 228 * to shutdown 229 * 230 * TCP_CLOSING both sides have shutdown but we still have 231 * data we have to finish sending 232 * 233 * TCP_TIME_WAIT timeout to catch resent junk before entering 234 * closed, can only be entered from FIN_WAIT2 235 * or CLOSING. 
 *				Required because the other end
 *				may not have gotten our last ACK causing it
 *				to retransmit the data packet (which we ignore)
 *
 *	TCP_CLOSE_WAIT		remote side has shutdown and is waiting for
 *				us to finish writing our data and to shutdown
 *				(we have to close() to move on to LAST_ACK)
 *
 *	TCP_LAST_ACK		our side has shutdown after remote has
 *				shutdown. There may still be data in our
 *				buffer that we have to finish sending
 *
 *	TCP_CLOSE		socket is finished
 */

#include <linux/module.h>
#include <linux/types.h>
#include <linux/fcntl.h>
#include <linux/poll.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/random.h>
#include <linux/bootmem.h>
#include <linux/cache.h>
#include <linux/err.h>
#include <linux/crypto.h>

#include <net/icmp.h>
#include <net/tcp.h>
#include <net/xfrm.h>
#include <net/ip.h>
#include <net/netdma.h>

#ifdef CONFIG_INET_GRO
#include <typedefs.h>
#include <bcmdefs.h>
#endif /* CONFIG_INET_GRO */

#include <asm/uaccess.h>
#include <asm/ioctls.h>

int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;

DEFINE_SNMP_STAT(struct tcp_mib, tcp_statistics) __read_mostly;

atomic_t tcp_orphan_count = ATOMIC_INIT(0);

EXPORT_SYMBOL_GPL(tcp_orphan_count);

int sysctl_tcp_mem[3] __read_mostly;
int sysctl_tcp_wmem[3] __read_mostly;
int sysctl_tcp_rmem[3] __read_mostly;

EXPORT_SYMBOL(sysctl_tcp_mem);
EXPORT_SYMBOL(sysctl_tcp_rmem);
EXPORT_SYMBOL(sysctl_tcp_wmem);

atomic_t tcp_memory_allocated;	/* Current allocated memory. */
atomic_t tcp_sockets_allocated;	/* Current number of TCP sockets. */

EXPORT_SYMBOL(tcp_memory_allocated);
EXPORT_SYMBOL(tcp_sockets_allocated);

/*
 * Pressure flag: try to collapse.
 * Technical note: it is used by multiple contexts non atomically.
 * All the sk_stream_mem_schedule() is of this nature: accounting
 * is strict, actions are advisory and have some latency.
 */
int tcp_memory_pressure __read_mostly;

EXPORT_SYMBOL(tcp_memory_pressure);

void tcp_enter_memory_pressure(void)
{
	if (!tcp_memory_pressure) {
		NET_INC_STATS(LINUX_MIB_TCPMEMORYPRESSURES);
		tcp_memory_pressure = 1;
	}
}

EXPORT_SYMBOL(tcp_enter_memory_pressure);

/*
 *	Wait for a TCP event.
 *
 *	Note that we don't need to lock the socket, as the upper poll layers
 *	take care of normal races (between the test and the event) and we don't
 *	go look at any of the socket buffers directly.
 */
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
	unsigned int mask;
	struct sock *sk = sock->sk;
	struct tcp_sock *tp = tcp_sk(sk);

	poll_wait(file, sk->sk_sleep, wait);
	if (sk->sk_state == TCP_LISTEN)
		return inet_csk_listen_poll(sk);

	/* Socket is not locked. We are protected from async events
	   by poll logic and correct handling of state changes
	   made by other threads is impossible in any case.
	 */

	mask = 0;
	if (sk->sk_err)
		mask = POLLERR;

	/*
	 * POLLHUP is certainly not done right. But poll() doesn't
	 * have a notion of HUP in just one direction, and for a
	 * socket the read side is more interesting.
	 *
	 * Some poll() documentation says that POLLHUP is incompatible
	 * with the POLLOUT/POLLWR flags, so somebody should check this
	 * all.
	 * But careful, it tends to be safer to return too many
	 * bits than too few, and you can easily break real applications
	 * if you don't tell them that something has hung up!
	 *
	 * Check-me.
	 *
	 * Check number 1. POLLHUP is _UNMASKABLE_ event (see UNIX98 and
	 * our fs/select.c). It means that after we received EOF,
	 * poll always returns immediately, making poll() for write
	 * impossible in state CLOSE_WAIT. One solution is evident --- to set
	 * POLLHUP if and only if shutdown has been made in both directions.
	 * Actually, it is interesting to look at how Solaris and DUX
	 * solve this dilemma. I would prefer, if POLLHUP were maskable,
	 * then we could set it on SND_SHUTDOWN. BTW examples given
	 * in Stevens' books assume exactly this behaviour, it explains
	 * why POLLHUP is incompatible with POLLOUT.	--ANK
	 *
	 * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
	 * blocking on fresh not-connected or disconnected socket. --ANK
	 */
	if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE)
		mask |= POLLHUP;
	if (sk->sk_shutdown & RCV_SHUTDOWN)
		mask |= POLLIN | POLLRDNORM | POLLRDHUP;

	/* Connected? */
	if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
		/* Potential race condition. If read of tp below will
		 * escape above sk->sk_state, we can be illegally awakened
		 * in SYN_* states. */
		if ((tp->rcv_nxt != tp->copied_seq) &&
		    (tp->urg_seq != tp->copied_seq ||
		     tp->rcv_nxt != tp->copied_seq + 1 ||
		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
			mask |= POLLIN | POLLRDNORM;

		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
			if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)) {
				mask |= POLLOUT | POLLWRNORM;
			} else {  /* send SIGIO later */
				set_bit(SOCK_ASYNC_NOSPACE,
					&sk->sk_socket->flags);
				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

				/* Race breaker. If space is freed after
				 * wspace test but before the flags are set,
				 * IO signal will be lost.
				 */
				if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
					mask |= POLLOUT | POLLWRNORM;
			}
		}

		if (tp->urg_data & TCP_URG_VALID)
			mask |= POLLPRI;
	}
	return mask;
}

int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int answ;

	switch (cmd) {
	case SIOCINQ:
		if (sk->sk_state == TCP_LISTEN)
			return -EINVAL;

		lock_sock(sk);
		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
			answ = 0;
		else if (sock_flag(sk, SOCK_URGINLINE) ||
			 !tp->urg_data ||
			 before(tp->urg_seq, tp->copied_seq) ||
			 !before(tp->urg_seq, tp->rcv_nxt)) {
			answ = tp->rcv_nxt - tp->copied_seq;

			/* Subtract 1, if FIN is in queue.
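			 * The FIN occupies one unit of sequence space but
			 * carries no data byte, so without this adjustment
			 * SIOCINQ would over-report by one once the peer has
			 * closed.
			 *
			 * A minimal userspace sketch of the ioctl (illustrative
			 * only; "fd" is an assumed connected TCP socket, and
			 * FIONREAD is the more widely available spelling of
			 * SIOCINQ):
			 *
			 *	int pending = 0;
			 *	if (ioctl(fd, FIONREAD, &pending) == 0)
			 *		printf("%d bytes readable\n", pending);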
*/ 430 if (answ && !skb_queue_empty(&sk->sk_receive_queue)) 431 answ -= 432 tcp_hdr((struct sk_buff *)sk->sk_receive_queue.prev)->fin; 433 } else 434 answ = tp->urg_seq - tp->copied_seq; 435 release_sock(sk); 436 break; 437 case SIOCATMARK: 438 answ = tp->urg_data && tp->urg_seq == tp->copied_seq; 439 break; 440 case SIOCOUTQ: 441 if (sk->sk_state == TCP_LISTEN) 442 return -EINVAL; 443 444 if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) 445 answ = 0; 446 else 447 answ = tp->write_seq - tp->snd_una; 448 break; 449 default: 450 return -ENOIOCTLCMD; 451 } 452 453 return put_user(answ, (int __user *)arg); 454} 455 456static inline void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb) 457{ 458 TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH; 459 tp->pushed_seq = tp->write_seq; 460} 461 462static inline int forced_push(struct tcp_sock *tp) 463{ 464 return after(tp->write_seq, tp->pushed_seq + (tp->max_window >> 1)); 465} 466 467static inline void skb_entail(struct sock *sk, struct sk_buff *skb) 468{ 469 struct tcp_sock *tp = tcp_sk(sk); 470 struct tcp_skb_cb *tcb = TCP_SKB_CB(skb); 471 472 skb->csum = 0; 473 tcb->seq = tcb->end_seq = tp->write_seq; 474 tcb->flags = TCPCB_FLAG_ACK; 475 tcb->sacked = 0; 476 skb_header_release(skb); 477 tcp_add_write_queue_tail(sk, skb); 478 sk_charge_skb(sk, skb); 479 if (tp->nonagle & TCP_NAGLE_PUSH) 480 tp->nonagle &= ~TCP_NAGLE_PUSH; 481} 482 483static inline void tcp_mark_urg(struct tcp_sock *tp, int flags, 484 struct sk_buff *skb) 485{ 486 if (flags & MSG_OOB) { 487 tp->urg_mode = 1; 488 tp->snd_up = tp->write_seq; 489 TCP_SKB_CB(skb)->sacked |= TCPCB_URG; 490 } 491} 492 493static inline void tcp_push(struct sock *sk, int flags, int mss_now, 494 int nonagle) 495{ 496 struct tcp_sock *tp = tcp_sk(sk); 497 498 if (tcp_send_head(sk)) { 499 struct sk_buff *skb = tcp_write_queue_tail(sk); 500 if (!(flags & MSG_MORE) || forced_push(tp)) 501 tcp_mark_push(tp, skb); 502 tcp_mark_urg(tp, flags, skb); 503 __tcp_push_pending_frames(sk, mss_now, 504 (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle); 505 } 506} 507 508static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset, 509 size_t psize, int flags) 510{ 511 struct tcp_sock *tp = tcp_sk(sk); 512 int mss_now, size_goal; 513 int err; 514 ssize_t copied; 515 long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); 516 517 /* Wait for a connection to finish. 
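	 * The test below uses the usual one-hot state-set idiom: every
	 * TCPF_* flag is defined as (1 << TCP_*), so
	 * "(1 << sk->sk_state) & mask" asks whether the current state is a
	 * member of the set.  A small worked example (assuming the standard
	 * numbering, e.g. TCP_ESTABLISHED == 1 and TCP_CLOSE_WAIT == 8):
	 *
	 *	state		1 << state	outside (ESTABLISHED|CLOSE_WAIT)?
	 *	ESTABLISHED	0x002		no  -> proceed to send
	 *	CLOSE_WAIT	0x100		no  -> proceed to send
	 *	SYN_SENT	0x004		yes -> sk_stream_wait_connect()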
*/ 518 if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) 519 if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) 520 goto out_err; 521 522 clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); 523 524 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 525 size_goal = tp->xmit_size_goal; 526 copied = 0; 527 528 err = -EPIPE; 529 if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) 530 goto do_error; 531 532 while (psize > 0) { 533 struct sk_buff *skb = tcp_write_queue_tail(sk); 534 struct page *page = pages[poffset / PAGE_SIZE]; 535 int copy, i, can_coalesce; 536 int offset = poffset % PAGE_SIZE; 537 int size = min_t(size_t, psize, PAGE_SIZE - offset); 538 539 if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) { 540new_segment: 541 if (!sk_stream_memory_free(sk)) 542 goto wait_for_sndbuf; 543 544 skb = sk_stream_alloc_pskb(sk, 0, 0, 545 sk->sk_allocation); 546 if (!skb) 547 goto wait_for_memory; 548 549 skb_entail(sk, skb); 550 copy = size_goal; 551 } 552 553 if (copy > size) 554 copy = size; 555 556 i = skb_shinfo(skb)->nr_frags; 557 can_coalesce = skb_can_coalesce(skb, i, page, offset); 558 if (!can_coalesce && i >= MAX_SKB_FRAGS) { 559 tcp_mark_push(tp, skb); 560 goto new_segment; 561 } 562 if (!sk_stream_wmem_schedule(sk, copy)) 563 goto wait_for_memory; 564 565 if (can_coalesce) { 566 skb_shinfo(skb)->frags[i - 1].size += copy; 567 } else { 568 get_page(page); 569 skb_fill_page_desc(skb, i, page, offset, copy); 570 } 571 572 skb->len += copy; 573 skb->data_len += copy; 574 skb->truesize += copy; 575 sk->sk_wmem_queued += copy; 576 sk->sk_forward_alloc -= copy; 577 skb->ip_summed = CHECKSUM_PARTIAL; 578 tp->write_seq += copy; 579 TCP_SKB_CB(skb)->end_seq += copy; 580 skb_shinfo(skb)->gso_segs = 0; 581 582 if (!copied) 583 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH; 584 585 copied += copy; 586 poffset += copy; 587 if (!(psize -= copy)) 588 goto out; 589 590 if (skb->len < mss_now || (flags & MSG_OOB)) 591 continue; 592 593 if (forced_push(tp)) { 594 tcp_mark_push(tp, skb); 595 __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH); 596 } else if (skb == tcp_send_head(sk)) 597 tcp_push_one(sk, mss_now); 598 continue; 599 600wait_for_sndbuf: 601 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); 602wait_for_memory: 603 if (copied) 604 tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); 605 606 if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) 607 goto do_error; 608 609 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 610 size_goal = tp->xmit_size_goal; 611 } 612 613out: 614#ifdef CONFIG_BCM47XX 615 if (copied && !(flags & MSG_MORE)) 616#else 617 if (copied) 618#endif 619 tcp_push(sk, flags, mss_now, tp->nonagle); 620 return copied; 621 622do_error: 623 if (copied) 624 goto out; 625out_err: 626 return sk_stream_error(sk, flags, err); 627} 628 629ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, 630 size_t size, int flags) 631{ 632 ssize_t res; 633 struct sock *sk = sock->sk; 634 635 if (!(sk->sk_route_caps & NETIF_F_SG) || 636 !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) 637 return sock_no_sendpage(sock, page, offset, size, flags); 638 639 lock_sock(sk); 640 TCP_CHECK_TIMER(sk); 641 res = do_tcp_sendpages(sk, &page, offset, size, flags); 642 TCP_CHECK_TIMER(sk); 643 release_sock(sk); 644 return res; 645} 646 647#define TCP_PAGE(sk) (sk->sk_sndmsg_page) 648#define TCP_OFF(sk) (sk->sk_sndmsg_off) 649 650static inline int select_size(struct sock *sk) 651{ 652 struct tcp_sock *tp = tcp_sk(sk); 653 int tmp = tp->mss_cache; 654 655 if 
(sk->sk_route_caps & NETIF_F_SG) { 656 if (sk_can_gso(sk)) 657 tmp = 0; 658 else { 659 int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER); 660 661 if (tmp >= pgbreak && 662 tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE) 663 tmp = pgbreak; 664 } 665 } 666 667 return tmp; 668} 669 670int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, 671 size_t size) 672{ 673 struct iovec *iov; 674 struct tcp_sock *tp = tcp_sk(sk); 675 struct sk_buff *skb; 676 int iovlen, flags; 677 int mss_now, size_goal; 678 int err, copied; 679 long timeo; 680 681 lock_sock(sk); 682 TCP_CHECK_TIMER(sk); 683 684 flags = msg->msg_flags; 685 timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); 686 687 /* Wait for a connection to finish. */ 688 if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) 689 if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) 690 goto out_err; 691 692 /* This should be in poll */ 693 clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); 694 695 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 696 size_goal = tp->xmit_size_goal; 697 698 /* Ok commence sending. */ 699 iovlen = msg->msg_iovlen; 700 iov = msg->msg_iov; 701 copied = 0; 702 703 err = -EPIPE; 704 if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)) 705 goto do_error; 706 707 while (--iovlen >= 0) { 708 int seglen = iov->iov_len; 709 unsigned char __user *from = iov->iov_base; 710 711 iov++; 712 713 while (seglen > 0) { 714 int copy; 715 716 skb = tcp_write_queue_tail(sk); 717 718 if (!tcp_send_head(sk) || 719 (copy = size_goal - skb->len) <= 0) { 720 721new_segment: 722 /* Allocate new segment. If the interface is SG, 723 * allocate skb fitting to single page. 724 */ 725 if (!sk_stream_memory_free(sk)) 726 goto wait_for_sndbuf; 727 728 skb = sk_stream_alloc_pskb(sk, select_size(sk), 729 0, sk->sk_allocation); 730 if (!skb) 731 goto wait_for_memory; 732 733 /* 734 * Check whether we can use HW checksum. 735 */ 736 if (sk->sk_route_caps & NETIF_F_ALL_CSUM) 737 skb->ip_summed = CHECKSUM_PARTIAL; 738 739 skb_entail(sk, skb); 740 copy = size_goal; 741 } 742 743 /* Try to append data to the end of skb. */ 744 if (copy > seglen) 745 copy = seglen; 746 747 /* Where to copy to? */ 748 if (skb_tailroom(skb) > 0) { 749 /* We have some space in skb head. Superb! */ 750 if (copy > skb_tailroom(skb)) 751 copy = skb_tailroom(skb); 752 if ((err = skb_add_data(skb, from, copy)) != 0) 753 goto do_fault; 754 } else { 755 int merge = 0; 756 int i = skb_shinfo(skb)->nr_frags; 757 struct page *page = TCP_PAGE(sk); 758 int off = TCP_OFF(sk); 759 760 if (skb_can_coalesce(skb, i, page, off) && 761 off != PAGE_SIZE) { 762 /* We can extend the last page 763 * fragment. */ 764 merge = 1; 765 } else if (i == MAX_SKB_FRAGS || 766 (!i && 767 !(sk->sk_route_caps & NETIF_F_SG))) { 768 /* Need to add new fragment and cannot 769 * do this because interface is non-SG, 770 * or because all the page slots are 771 * busy. */ 772 tcp_mark_push(tp, skb); 773 goto new_segment; 774 } else if (page) { 775 if (off == PAGE_SIZE) { 776 put_page(page); 777 TCP_PAGE(sk) = page = NULL; 778 off = 0; 779 } 780 } else 781 off = 0; 782 783 if (copy > PAGE_SIZE - off) 784 copy = PAGE_SIZE - off; 785 786 if (!sk_stream_wmem_schedule(sk, copy)) 787 goto wait_for_memory; 788 789 if (!page) { 790 /* Allocate new cache page. */ 791 if (!(page = sk_stream_alloc_page(sk))) 792 goto wait_for_memory; 793 } 794 795 /* Time to copy data. We are close to 796 * the end! 
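				 * Roughly, the destination was chosen above
				 * in this order: linear tailroom in the skb
				 * head if any is left, then coalescing into
				 * the last page fragment when page and offset
				 * line up, then the per-socket staging page
				 * (TCP_PAGE/TCP_OFF), falling back to a fresh
				 * segment once MAX_SKB_FRAGS slots are used.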
*/ 797 err = skb_copy_to_page(sk, from, skb, page, 798 off, copy); 799 if (err) { 800 /* If this page was new, give it to the 801 * socket so it does not get leaked. 802 */ 803 if (!TCP_PAGE(sk)) { 804 TCP_PAGE(sk) = page; 805 TCP_OFF(sk) = 0; 806 } 807 goto do_error; 808 } 809 810 /* Update the skb. */ 811 if (merge) { 812 skb_shinfo(skb)->frags[i - 1].size += 813 copy; 814 } else { 815 skb_fill_page_desc(skb, i, page, off, copy); 816 if (TCP_PAGE(sk)) { 817 get_page(page); 818 } else if (off + copy < PAGE_SIZE) { 819 get_page(page); 820 TCP_PAGE(sk) = page; 821 } 822 } 823 824 TCP_OFF(sk) = off + copy; 825 } 826 827 if (!copied) 828 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH; 829 830 tp->write_seq += copy; 831 TCP_SKB_CB(skb)->end_seq += copy; 832 skb_shinfo(skb)->gso_segs = 0; 833 834 from += copy; 835 copied += copy; 836 if ((seglen -= copy) == 0 && iovlen == 0) 837 goto out; 838 839 if (skb->len < mss_now || (flags & MSG_OOB)) 840 continue; 841 842#ifdef CONFIG_INET_GSO 843 if (iov->iov_len > PAGE_SIZE) 844 continue; 845#endif /* CONFIG_INET_GSO */ 846 847 if (forced_push(tp)) { 848 tcp_mark_push(tp, skb); 849 __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH); 850 } else if (skb == tcp_send_head(sk)) 851 tcp_push_one(sk, mss_now); 852 continue; 853 854wait_for_sndbuf: 855 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); 856wait_for_memory: 857 if (copied) 858 tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH); 859 860 if ((err = sk_stream_wait_memory(sk, &timeo)) != 0) 861 goto do_error; 862 863 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); 864 size_goal = tp->xmit_size_goal; 865 } 866 } 867 868out: 869 if (copied) 870 tcp_push(sk, flags, mss_now, tp->nonagle); 871 TCP_CHECK_TIMER(sk); 872 release_sock(sk); 873 return copied; 874 875do_fault: 876 if (!skb->len) { 877 tcp_unlink_write_queue(skb, sk); 878 /* It is the one place in all of TCP, except connection 879 * reset, where we can be unlinking the send_head. 880 */ 881 tcp_check_send_head(sk, skb); 882 sk_stream_free_skb(sk, skb); 883 } 884 885do_error: 886 if (copied) 887 goto out; 888out_err: 889 err = sk_stream_error(sk, flags, err); 890 TCP_CHECK_TIMER(sk); 891 release_sock(sk); 892 return err; 893} 894 895/* 896 * Handle reading urgent data. BSD has very simple semantics for 897 * this, no blocking and very strange errors 8) 898 */ 899 900static int tcp_recv_urg(struct sock *sk, long timeo, 901 struct msghdr *msg, int len, int flags, 902 int *addr_len) 903{ 904 struct tcp_sock *tp = tcp_sk(sk); 905 906 /* No URG data to read. */ 907 if (sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data || 908 tp->urg_data == TCP_URG_READ) 909 return -EINVAL; /* Yes this is right ! */ 910 911 if (sk->sk_state == TCP_CLOSE && !sock_flag(sk, SOCK_DONE)) 912 return -ENOTCONN; 913 914 if (tp->urg_data & TCP_URG_VALID) { 915 int err = 0; 916 char c = tp->urg_data; 917 918 if (!(flags & MSG_PEEK)) 919 tp->urg_data = TCP_URG_READ; 920 921 /* Read urgent data. */ 922 msg->msg_flags |= MSG_OOB; 923 924 if (len > 0) { 925 if (!(flags & MSG_TRUNC)) 926 err = memcpy_toiovec(msg->msg_iov, &c, 1); 927 len = 1; 928 } else 929 msg->msg_flags |= MSG_TRUNC; 930 931 return err ? -EFAULT : len; 932 } 933 934 if (sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN)) 935 return 0; 936 937 /* Fixed the recv(..., MSG_OOB) behaviour. BSD docs and 938 * the available implementations agree in this case: 939 * this call should never block, independent of the 940 * blocking state of the socket. 
941 * Mike <pall@rz.uni-karlsruhe.de> 942 */ 943 return -EAGAIN; 944} 945 946/* Clean up the receive buffer for full frames taken by the user, 947 * then send an ACK if necessary. COPIED is the number of bytes 948 * tcp_recvmsg has given to the user so far, it speeds up the 949 * calculation of whether or not we must ACK for the sake of 950 * a window update. 951 */ 952void tcp_cleanup_rbuf(struct sock *sk, int copied) 953{ 954 struct tcp_sock *tp = tcp_sk(sk); 955 int time_to_ack = 0; 956 957#if TCP_DEBUG 958 struct sk_buff *skb = skb_peek(&sk->sk_receive_queue); 959 960 BUG_TRAP(!skb || before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq)); 961#endif 962 963 if (inet_csk_ack_scheduled(sk)) { 964 const struct inet_connection_sock *icsk = inet_csk(sk); 965 /* Delayed ACKs frequently hit locked sockets during bulk 966 * receive. */ 967 if (icsk->icsk_ack.blocked || 968 /* Once-per-two-segments ACK was not sent by tcp_input.c */ 969 tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss || 970 /* 971 * If this read emptied read buffer, we send ACK, if 972 * connection is not bidirectional, user drained 973 * receive buffer and there was a small segment 974 * in queue. 975 */ 976 (copied > 0 && 977 ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) || 978 ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) && 979 !icsk->icsk_ack.pingpong)) && 980 !atomic_read(&sk->sk_rmem_alloc))) 981 time_to_ack = 1; 982 } 983 984 /* We send an ACK if we can now advertise a non-zero window 985 * which has been raised "significantly". 986 * 987 * Even if window raised up to infinity, do not send window open ACK 988 * in states, where we will not receive more. It is useless. 989 */ 990 if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) { 991 __u32 rcv_window_now = tcp_receive_window(tp); 992 993 /* Optimize, __tcp_select_window() is not cheap. */ 994 if (2*rcv_window_now <= tp->window_clamp) { 995 __u32 new_window = __tcp_select_window(sk); 996 997 /* Send ACK now, if this read freed lots of space 998 * in our buffer. Certainly, new_window is new window. 999 * We can advertise it now, if it is not less than current one. 1000 * "Lots" means "at least twice" here. 1001 */ 1002 if (new_window && new_window >= 2 * rcv_window_now) 1003 time_to_ack = 1; 1004 } 1005 } 1006 if (time_to_ack) 1007 tcp_send_ack(sk); 1008} 1009 1010static void tcp_prequeue_process(struct sock *sk) 1011{ 1012 struct sk_buff *skb; 1013 struct tcp_sock *tp = tcp_sk(sk); 1014 1015 NET_INC_STATS_USER(LINUX_MIB_TCPPREQUEUED); 1016 1017 /* RX process wants to run with disabled BHs, though it is not 1018 * necessary */ 1019 local_bh_disable(); 1020 while ((skb = __skb_dequeue(&tp->ucopy.prequeue)) != NULL) 1021 sk->sk_backlog_rcv(sk, skb); 1022 local_bh_enable(); 1023 1024 /* Clear memory counter. */ 1025 tp->ucopy.memory = 0; 1026} 1027 1028static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) 1029{ 1030 struct sk_buff *skb; 1031 u32 offset; 1032 1033 skb_queue_walk(&sk->sk_receive_queue, skb) { 1034 offset = seq - TCP_SKB_CB(skb)->seq; 1035 if (tcp_hdr(skb)->syn) 1036 offset--; 1037 if (offset < skb->len || tcp_hdr(skb)->fin) { 1038 *off = offset; 1039 return skb; 1040 } 1041 } 1042 return NULL; 1043} 1044 1045/* 1046 * This routine provides an alternative to tcp_recvmsg() for routines 1047 * that would like to handle copying from skbuffs directly in 'sendfile' 1048 * fashion. 1049 * Note: 1050 * - It is assumed that the socket was locked by the caller. 1051 * - The routine does not block. 
1052 * - At present, there is no support for reading OOB data 1053 * or for 'peeking' the socket using this routine 1054 * (although both would be easy to implement). 1055 */ 1056int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, 1057 sk_read_actor_t recv_actor) 1058{ 1059 struct sk_buff *skb; 1060 struct tcp_sock *tp = tcp_sk(sk); 1061 u32 seq = tp->copied_seq; 1062 u32 offset; 1063 int copied = 0; 1064 1065 if (sk->sk_state == TCP_LISTEN) 1066 return -ENOTCONN; 1067 while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) { 1068 if (offset < skb->len) { 1069 size_t used, len; 1070 1071 len = skb->len - offset; 1072 /* Stop reading if we hit a patch of urgent data */ 1073 if (tp->urg_data) { 1074 u32 urg_offset = tp->urg_seq - seq; 1075 if (urg_offset < len) 1076 len = urg_offset; 1077 if (!len) 1078 break; 1079 } 1080 used = recv_actor(desc, skb, offset, len); 1081 if (used < 0) { 1082 if (!copied) 1083 copied = used; 1084 break; 1085 } else if (used <= len) { 1086 seq += used; 1087 copied += used; 1088 offset += used; 1089 } 1090 if (offset != skb->len) 1091 break; 1092 } 1093 if (tcp_hdr(skb)->fin) { 1094 sk_eat_skb(sk, skb, 0); 1095 ++seq; 1096 break; 1097 } 1098 sk_eat_skb(sk, skb, 0); 1099 if (!desc->count) 1100 break; 1101 } 1102 tp->copied_seq = seq; 1103 1104 tcp_rcv_space_adjust(sk); 1105 1106 /* Clean up data we have read: This will do ACK frames. */ 1107 if (copied > 0) 1108 tcp_cleanup_rbuf(sk, copied); 1109 return copied; 1110} 1111 1112/* 1113 * This routine copies from a sock struct into the user buffer. 1114 * 1115 * Technical note: in 2.3 we work on _locked_ socket, so that 1116 * tricks with *seq access order and skb->users are not required. 1117 * Probably, code can be easily improved even more. 1118 */ 1119 1120int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, 1121 size_t len, int nonblock, int flags, int *addr_len) 1122{ 1123 struct tcp_sock *tp = tcp_sk(sk); 1124 int copied = 0; 1125 u32 peek_seq; 1126 u32 *seq; 1127 unsigned long used; 1128 int err; 1129 int target; /* Read at least this many bytes */ 1130 long timeo; 1131 struct task_struct *user_recv = NULL; 1132 int copied_early = 0; 1133 1134 lock_sock(sk); 1135 1136 TCP_CHECK_TIMER(sk); 1137 1138 err = -ENOTCONN; 1139 if (sk->sk_state == TCP_LISTEN) 1140 goto out; 1141 1142 timeo = sock_rcvtimeo(sk, nonblock); 1143 1144 /* Urgent data needs to be handled specially. */ 1145 if (flags & MSG_OOB) 1146 goto recv_urg; 1147 1148 seq = &tp->copied_seq; 1149 if (flags & MSG_PEEK) { 1150 peek_seq = tp->copied_seq; 1151 seq = &peek_seq; 1152 } 1153 1154 target = sock_rcvlowat(sk, flags & MSG_WAITALL, len); 1155 1156#ifdef CONFIG_NET_DMA 1157 tp->ucopy.dma_chan = NULL; 1158 preempt_disable(); 1159 if ((len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) && 1160 !sysctl_tcp_low_latency && __get_cpu_var(softnet_data).net_dma) { 1161 preempt_enable_no_resched(); 1162 tp->ucopy.pinned_list = dma_pin_iovec_pages(msg->msg_iov, len); 1163 } else 1164 preempt_enable_no_resched(); 1165#endif 1166 1167 do { 1168 struct sk_buff *skb; 1169 u32 offset; 1170 1171 /* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */ 1172 if (tp->urg_data && tp->urg_seq == *seq) { 1173 if (copied) 1174 break; 1175 if (signal_pending(current)) { 1176 copied = timeo ? sock_intr_errno(timeo) : -EAGAIN; 1177 break; 1178 } 1179 } 1180 1181 /* Next get a buffer. 
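		 * The queue walk below works in sequence space: "offset" is
		 * how far *seq has advanced into this skb.  A SYN consumes a
		 * sequence number without carrying data, so it is subtracted
		 * back out, and a FIN is handled separately since it also
		 * occupies sequence space but has no byte to copy.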
*/ 1182 1183 skb = skb_peek(&sk->sk_receive_queue); 1184 do { 1185 if (!skb) 1186 break; 1187 1188 /* Now that we have two receive queues this 1189 * shouldn't happen. 1190 */ 1191 if (before(*seq, TCP_SKB_CB(skb)->seq)) { 1192 printk(KERN_INFO "recvmsg bug: copied %X " 1193 "seq %X\n", *seq, TCP_SKB_CB(skb)->seq); 1194 break; 1195 } 1196 offset = *seq - TCP_SKB_CB(skb)->seq; 1197 if (tcp_hdr(skb)->syn) 1198 offset--; 1199 if (offset < skb->len) 1200 goto found_ok_skb; 1201 if (tcp_hdr(skb)->fin) 1202 goto found_fin_ok; 1203 BUG_TRAP(flags & MSG_PEEK); 1204 skb = skb->next; 1205 } while (skb != (struct sk_buff *)&sk->sk_receive_queue); 1206 1207 /* Well, if we have backlog, try to process it now yet. */ 1208 1209 if (copied >= target && !sk->sk_backlog.tail) 1210 break; 1211 1212 if (copied) { 1213 if (sk->sk_err || 1214 sk->sk_state == TCP_CLOSE || 1215 (sk->sk_shutdown & RCV_SHUTDOWN) || 1216 !timeo || 1217 signal_pending(current) || 1218 (flags & MSG_PEEK)) 1219 break; 1220 } else { 1221 if (sock_flag(sk, SOCK_DONE)) 1222 break; 1223 1224 if (sk->sk_err) { 1225 copied = sock_error(sk); 1226 break; 1227 } 1228 1229 if (sk->sk_shutdown & RCV_SHUTDOWN) 1230 break; 1231 1232 if (sk->sk_state == TCP_CLOSE) { 1233 if (!sock_flag(sk, SOCK_DONE)) { 1234 /* This occurs when user tries to read 1235 * from never connected socket. 1236 */ 1237 copied = -ENOTCONN; 1238 break; 1239 } 1240 break; 1241 } 1242 1243 if (!timeo) { 1244 copied = -EAGAIN; 1245 break; 1246 } 1247 1248 if (signal_pending(current)) { 1249 copied = sock_intr_errno(timeo); 1250 break; 1251 } 1252 } 1253 1254 tcp_cleanup_rbuf(sk, copied); 1255 1256 if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) { 1257 /* Install new reader */ 1258 if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) { 1259 user_recv = current; 1260 tp->ucopy.task = user_recv; 1261 tp->ucopy.iov = msg->msg_iov; 1262 } 1263 1264 tp->ucopy.len = len; 1265 1266 BUG_TRAP(tp->copied_seq == tp->rcv_nxt || 1267 (flags & (MSG_PEEK | MSG_TRUNC))); 1268 1269 /* Ugly... If prequeue is not empty, we have to 1270 * process it before releasing socket, otherwise 1271 * order will be broken at second iteration. 1272 * More elegant solution is required!!! 1273 * 1274 * Look: we have the following (pseudo)queues: 1275 * 1276 * 1. packets in flight 1277 * 2. backlog 1278 * 3. prequeue 1279 * 4. receive_queue 1280 * 1281 * Each queue can be processed only if the next ones 1282 * are empty. At this point we have empty receive_queue. 1283 * But prequeue _can_ be not empty after 2nd iteration, 1284 * when we jumped to start of loop because backlog 1285 * processing added something to receive_queue. 1286 * We cannot release_sock(), because backlog contains 1287 * packets arrived _after_ prequeued ones. 1288 * 1289 * Shortly, algorithm is clear --- to process all 1290 * the queues in order. We could make it more directly, 1291 * requeueing packets from backlog to prequeue, if 1292 * is not empty. It is more elegant, but eats cycles, 1293 * unfortunately. 1294 */ 1295 if (!skb_queue_empty(&tp->ucopy.prequeue)) 1296 goto do_prequeue; 1297 1298 /* __ Set realtime policy in scheduler __ */ 1299 } 1300 1301 if (copied >= target) { 1302 /* Do not sleep, just process backlog. 
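			 * release_sock() walks sk->sk_backlog and feeds each
			 * queued skb to sk_backlog_rcv(), so the immediate
			 * release/lock pair below effectively means "drain
			 * whatever arrived while we owned the socket, without
			 * sleeping".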
*/ 1303 release_sock(sk); 1304 lock_sock(sk); 1305 } else 1306 sk_wait_data(sk, &timeo); 1307 1308#ifdef CONFIG_NET_DMA 1309 tp->ucopy.wakeup = 0; 1310#endif 1311 1312 if (user_recv) { 1313 int chunk; 1314 1315 /* __ Restore normal policy in scheduler __ */ 1316 1317 if ((chunk = len - tp->ucopy.len) != 0) { 1318 NET_ADD_STATS_USER(LINUX_MIB_TCPDIRECTCOPYFROMBACKLOG, chunk); 1319 len -= chunk; 1320 copied += chunk; 1321 } 1322 1323 if (tp->rcv_nxt == tp->copied_seq && 1324 !skb_queue_empty(&tp->ucopy.prequeue)) { 1325do_prequeue: 1326 tcp_prequeue_process(sk); 1327 1328 if ((chunk = len - tp->ucopy.len) != 0) { 1329 NET_ADD_STATS_USER(LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE, chunk); 1330 len -= chunk; 1331 copied += chunk; 1332 } 1333 } 1334 } 1335 if ((flags & MSG_PEEK) && peek_seq != tp->copied_seq) { 1336 if (net_ratelimit()) 1337 printk(KERN_DEBUG "TCP(%s:%d): Application bug, race in MSG_PEEK.\n", 1338 current->comm, current->pid); 1339 peek_seq = tp->copied_seq; 1340 } 1341 continue; 1342 1343 found_ok_skb: 1344 /* Ok so how much can we use? */ 1345 used = skb->len - offset; 1346 if (len < used) 1347 used = len; 1348 1349 /* Do we have urgent data here? */ 1350 if (tp->urg_data) { 1351 u32 urg_offset = tp->urg_seq - *seq; 1352 if (urg_offset < used) { 1353 if (!urg_offset) { 1354 if (!sock_flag(sk, SOCK_URGINLINE)) { 1355 ++*seq; 1356 offset++; 1357 used--; 1358 if (!used) 1359 goto skip_copy; 1360 } 1361 } else 1362 used = urg_offset; 1363 } 1364 } 1365 1366 if (!(flags & MSG_TRUNC)) { 1367#ifdef CONFIG_NET_DMA 1368 if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list) 1369 tp->ucopy.dma_chan = get_softnet_dma(); 1370 1371 if (tp->ucopy.dma_chan) { 1372 tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec( 1373 tp->ucopy.dma_chan, skb, offset, 1374 msg->msg_iov, used, 1375 tp->ucopy.pinned_list); 1376 1377 if (tp->ucopy.dma_cookie < 0) { 1378 1379 printk(KERN_ALERT "dma_cookie < 0\n"); 1380 1381 /* Exception. Bailout! */ 1382 if (!copied) 1383 copied = -EFAULT; 1384 break; 1385 } 1386 if ((offset + used) == skb->len) 1387 copied_early = 1; 1388 1389 } else 1390#endif 1391 { 1392 err = skb_copy_datagram_iovec(skb, offset, 1393 msg->msg_iov, used); 1394 if (err) { 1395 /* Exception. Bailout! */ 1396 if (!copied) 1397 copied = -EFAULT; 1398 break; 1399 } 1400 } 1401 } 1402 1403 *seq += used; 1404 copied += used; 1405 len -= used; 1406 1407 tcp_rcv_space_adjust(sk); 1408 1409skip_copy: 1410 if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) { 1411 tp->urg_data = 0; 1412 tcp_fast_path_check(sk); 1413 } 1414 if (used + offset < skb->len) 1415 continue; 1416 1417 if (tcp_hdr(skb)->fin) 1418 goto found_fin_ok; 1419 if (!(flags & MSG_PEEK)) { 1420 sk_eat_skb(sk, skb, copied_early); 1421 copied_early = 0; 1422 } 1423 continue; 1424 1425 found_fin_ok: 1426 /* Process the FIN. */ 1427 ++*seq; 1428 if (!(flags & MSG_PEEK)) { 1429 sk_eat_skb(sk, skb, copied_early); 1430 copied_early = 0; 1431 } 1432 break; 1433 } while (len > 0); 1434 1435 if (user_recv) { 1436 if (!skb_queue_empty(&tp->ucopy.prequeue)) { 1437 int chunk; 1438 1439 tp->ucopy.len = copied > 0 ? 
len : 0; 1440 1441 tcp_prequeue_process(sk); 1442 1443 if (copied > 0 && (chunk = len - tp->ucopy.len) != 0) { 1444 NET_ADD_STATS_USER(LINUX_MIB_TCPDIRECTCOPYFROMPREQUEUE, chunk); 1445 len -= chunk; 1446 copied += chunk; 1447 } 1448 } 1449 1450 tp->ucopy.task = NULL; 1451 tp->ucopy.len = 0; 1452 } 1453 1454#ifdef CONFIG_NET_DMA 1455 if (tp->ucopy.dma_chan) { 1456 struct sk_buff *skb; 1457 dma_cookie_t done, used; 1458 1459 dma_async_memcpy_issue_pending(tp->ucopy.dma_chan); 1460 1461 while (dma_async_memcpy_complete(tp->ucopy.dma_chan, 1462 tp->ucopy.dma_cookie, &done, 1463 &used) == DMA_IN_PROGRESS) { 1464 /* do partial cleanup of sk_async_wait_queue */ 1465 while ((skb = skb_peek(&sk->sk_async_wait_queue)) && 1466 (dma_async_is_complete(skb->dma_cookie, done, 1467 used) == DMA_SUCCESS)) { 1468 __skb_dequeue(&sk->sk_async_wait_queue); 1469 kfree_skb(skb); 1470 } 1471 } 1472 1473 /* Safe to free early-copied skbs now */ 1474 __skb_queue_purge(&sk->sk_async_wait_queue); 1475 dma_chan_put(tp->ucopy.dma_chan); 1476 tp->ucopy.dma_chan = NULL; 1477 } 1478 if (tp->ucopy.pinned_list) { 1479 dma_unpin_iovec_pages(tp->ucopy.pinned_list); 1480 tp->ucopy.pinned_list = NULL; 1481 } 1482#endif 1483 1484 /* According to UNIX98, msg_name/msg_namelen are ignored 1485 * on connected socket. I was just happy when found this 8) --ANK 1486 */ 1487 1488 /* Clean up data we have read: This will do ACK frames. */ 1489 tcp_cleanup_rbuf(sk, copied); 1490 1491 TCP_CHECK_TIMER(sk); 1492 release_sock(sk); 1493 return copied; 1494 1495out: 1496 TCP_CHECK_TIMER(sk); 1497 release_sock(sk); 1498 return err; 1499 1500recv_urg: 1501 err = tcp_recv_urg(sk, timeo, msg, len, flags, addr_len); 1502 goto out; 1503} 1504 1505/* 1506 * State processing on a close. This implements the state shift for 1507 * sending our FIN frame. Note that we only send a FIN for some 1508 * states. A shutdown() may have already sent the FIN, or we may be 1509 * closed. 1510 */ 1511 1512static const unsigned char new_state[16] = { 1513 /* current state: new state: action: */ 1514 /* (Invalid) */ TCP_CLOSE, 1515 /* TCP_ESTABLISHED */ TCP_FIN_WAIT1 | TCP_ACTION_FIN, 1516 /* TCP_SYN_SENT */ TCP_CLOSE, 1517 /* TCP_SYN_RECV */ TCP_FIN_WAIT1 | TCP_ACTION_FIN, 1518 /* TCP_FIN_WAIT1 */ TCP_FIN_WAIT1, 1519 /* TCP_FIN_WAIT2 */ TCP_FIN_WAIT2, 1520 /* TCP_TIME_WAIT */ TCP_CLOSE, 1521 /* TCP_CLOSE */ TCP_CLOSE, 1522 /* TCP_CLOSE_WAIT */ TCP_LAST_ACK | TCP_ACTION_FIN, 1523 /* TCP_LAST_ACK */ TCP_LAST_ACK, 1524 /* TCP_LISTEN */ TCP_CLOSE, 1525 /* TCP_CLOSING */ TCP_CLOSING, 1526}; 1527 1528static int tcp_close_state(struct sock *sk) 1529{ 1530 int next = (int)new_state[sk->sk_state]; 1531 int ns = next & TCP_STATE_MASK; 1532 1533 tcp_set_state(sk, ns); 1534 1535 return next & TCP_ACTION_FIN; 1536} 1537 1538/* 1539 * Shutdown the sending side of a connection. Much like close except 1540 * that we don't receive shut down or set_sock_flag(sk, SOCK_DEAD). 1541 */ 1542 1543void tcp_shutdown(struct sock *sk, int how) 1544{ 1545 /* We need to grab some memory, and put together a FIN, 1546 * and then put it into the queue to be sent. 1547 * Tim MacKenzie(tym@dibbler.cs.monash.edu.au) 4 Dec '92. 1548 */ 1549 if (!(how & SEND_SHUTDOWN)) 1550 return; 1551 1552 /* If we've already sent a FIN, or it's a closed state, skip this. */ 1553 if ((1 << sk->sk_state) & 1554 (TCPF_ESTABLISHED | TCPF_SYN_SENT | 1555 TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) { 1556 /* Clear out any half completed packets. FIN if needed. 
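		 * A typical userspace trigger for this path (sketch only,
		 * "fd" being an assumed connected socket):
		 *
		 *	shutdown(fd, SHUT_WR);	-> how & SEND_SHUTDOWN
		 *
		 * which queues a FIN behind any pending data while still
		 * allowing read() to drain whatever the peer sends back.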
		 */
		if (tcp_close_state(sk))
			tcp_send_fin(sk);
	}
}

void tcp_close(struct sock *sk, long timeout)
{
	struct sk_buff *skb;
	int data_was_unread = 0;
	int state;

	lock_sock(sk);
	sk->sk_shutdown = SHUTDOWN_MASK;

	if (sk->sk_state == TCP_LISTEN) {
		tcp_set_state(sk, TCP_CLOSE);

		/* Special case. */
		inet_csk_listen_stop(sk);

		goto adjudge_to_death;
	}

	/* We need to flush the recv. buffs. We do this only on the
	 * descriptor close, not protocol-sourced closes, because the
	 * reader process may not have drained the data yet!
	 */
	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
		u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
			  tcp_hdr(skb)->fin;
		data_was_unread += len;
		__kfree_skb(skb);
	}

	sk_stream_mem_reclaim(sk);

	/* As outlined in RFC 2525, section 2.17, we send a RST here because
	 * data was lost. To witness the awful effects of the old behavior of
	 * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
	 * GET in an FTP client, suspend the process, wait for the client to
	 * advertise a zero window, then kill -9 the FTP client, wheee...
	 * Note: timeout is always zero in such a case.
	 */
	if (data_was_unread) {
		/* Unread data was tossed, zap the connection. */
		NET_INC_STATS_USER(LINUX_MIB_TCPABORTONCLOSE);
		tcp_set_state(sk, TCP_CLOSE);
		tcp_send_active_reset(sk, GFP_KERNEL);
	} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
		/* Check zero linger _after_ checking for unread data. */
		sk->sk_prot->disconnect(sk, 0);
		NET_INC_STATS_USER(LINUX_MIB_TCPABORTONDATA);
	} else if (tcp_close_state(sk)) {
		/* We FIN if the application ate all the data before
		 * zapping the connection.
		 */

		/* RED-PEN. Formally speaking, we have broken TCP state
		 * machine. State transitions:
		 *
		 * TCP_ESTABLISHED -> TCP_FIN_WAIT1
		 * TCP_SYN_RECV	-> TCP_FIN_WAIT1 (forget it, it's impossible)
		 * TCP_CLOSE_WAIT -> TCP_LAST_ACK
		 *
		 * are legal only when FIN has been sent (i.e. in window),
		 * rather than queued out of window. Purists blame.
		 *
		 * F.e. "RFC state" is ESTABLISHED,
		 * if Linux state is FIN-WAIT-1, but FIN is still not sent.
		 *
		 * The visible deviations are that sometimes
		 * we enter time-wait state, when it is not required really
		 * (harmless), do not send active resets, when they are
		 * required by specs (TCP_ESTABLISHED, TCP_CLOSE_WAIT, when
		 * they look as CLOSING or LAST_ACK for Linux)
		 * Probably, I missed some more holelets.
		 * 						--ANK
		 */
		tcp_send_fin(sk);
	}

	sk_stream_wait_close(sk, timeout);

adjudge_to_death:
	state = sk->sk_state;
	sock_hold(sk);
	sock_orphan(sk);
	atomic_inc(sk->sk_prot->orphan_count);

	/* It is the last release_sock in its life. It will remove backlog. */
	release_sock(sk);


	/* Now socket is owned by kernel and we acquire BH lock
	   to finish close. No need to check for user refs.
	 */
	local_bh_disable();
	bh_lock_sock(sk);
	BUG_TRAP(!sock_owned_by_user(sk));

	/* Have we already been destroyed by a softirq or backlog? */
	if (state != TCP_CLOSE && sk->sk_state == TCP_CLOSE)
		goto out;

	/* This is a (useful) BSD violation of the RFC.
	 * There is a problem with TCP as specified in that the other end
	 * could keep a socket open forever with no application left at this
	 * end. We use a 3 minute timeout (about the same as BSD) then kill
	 * our end. If they send after that then tough - BUT: long enough
	 * that we won't make the old 4*rto = almost no time - whoops
	 * reset mistake.
	 *
	 * Nope, it was not a mistake. It is really desired behaviour
	 * f.e. on http servers, when such sockets are useless, but
	 * consume significant resources. Let's do it with special
	 * linger2 option.					--ANK
	 */

	if (sk->sk_state == TCP_FIN_WAIT2) {
		struct tcp_sock *tp = tcp_sk(sk);
		if (tp->linger2 < 0) {
			tcp_set_state(sk, TCP_CLOSE);
			tcp_send_active_reset(sk, GFP_ATOMIC);
			NET_INC_STATS_BH(LINUX_MIB_TCPABORTONLINGER);
		} else {
			const int tmo = tcp_fin_time(sk);

			if (tmo > TCP_TIMEWAIT_LEN) {
				inet_csk_reset_keepalive_timer(sk,
						tmo - TCP_TIMEWAIT_LEN);
			} else {
				tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
				goto out;
			}
		}
	}
	if (sk->sk_state != TCP_CLOSE) {
		sk_stream_mem_reclaim(sk);
		if (tcp_too_many_orphans(sk,
				atomic_read(sk->sk_prot->orphan_count))) {
			if (net_ratelimit())
				printk(KERN_INFO "TCP: too many orphaned "
				       "sockets\n");
			tcp_set_state(sk, TCP_CLOSE);
			tcp_send_active_reset(sk, GFP_ATOMIC);
			NET_INC_STATS_BH(LINUX_MIB_TCPABORTONMEMORY);
		}
	}

	if (sk->sk_state == TCP_CLOSE)
		inet_csk_destroy_sock(sk);
	/* Otherwise, socket is reprieved until protocol close. */

out:
	bh_unlock_sock(sk);
	local_bh_enable();
	sock_put(sk);
}

/* These states need RST on ABORT according to RFC793 */

static inline int tcp_need_reset(int state)
{
	return (1 << state) &
	       (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
		TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
}

int tcp_disconnect(struct sock *sk, int flags)
{
	struct inet_sock *inet = inet_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	int err = 0;
	int old_state = sk->sk_state;

	if (old_state != TCP_CLOSE)
		tcp_set_state(sk, TCP_CLOSE);

	/* ABORT function of RFC793 */
	if (old_state == TCP_LISTEN) {
		inet_csk_listen_stop(sk);
	} else if (tcp_need_reset(old_state) ||
		   (tp->snd_nxt != tp->write_seq &&
		    (1 << old_state) & (TCPF_CLOSING | TCPF_LAST_ACK))) {
		/* The last check adjusts for discrepancy of Linux wrt.
RFC 1743 * states 1744 */ 1745 tcp_send_active_reset(sk, gfp_any()); 1746 sk->sk_err = ECONNRESET; 1747 } else if (old_state == TCP_SYN_SENT) 1748 sk->sk_err = ECONNRESET; 1749 1750 tcp_clear_xmit_timers(sk); 1751 __skb_queue_purge(&sk->sk_receive_queue); 1752 tcp_write_queue_purge(sk); 1753 __skb_queue_purge(&tp->out_of_order_queue); 1754#ifdef CONFIG_NET_DMA 1755 __skb_queue_purge(&sk->sk_async_wait_queue); 1756#endif 1757 1758 inet->dport = 0; 1759 1760 if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK)) 1761 inet_reset_saddr(sk); 1762 1763 sk->sk_shutdown = 0; 1764 sock_reset_flag(sk, SOCK_DONE); 1765 tp->srtt = 0; 1766 if ((tp->write_seq += tp->max_window + 2) == 0) 1767 tp->write_seq = 1; 1768 icsk->icsk_backoff = 0; 1769 tp->snd_cwnd = 2; 1770 icsk->icsk_probes_out = 0; 1771 tp->packets_out = 0; 1772 tp->snd_ssthresh = 0x7fffffff; 1773 tp->snd_cwnd_cnt = 0; 1774 tp->bytes_acked = 0; 1775 tcp_set_ca_state(sk, TCP_CA_Open); 1776 tcp_clear_retrans(tp); 1777 inet_csk_delack_init(sk); 1778 tcp_init_send_head(sk); 1779 memset(&tp->rx_opt, 0, sizeof(tp->rx_opt)); 1780 __sk_dst_reset(sk); 1781 1782 BUG_TRAP(!inet->num || icsk->icsk_bind_hash); 1783 1784 sk->sk_error_report(sk); 1785 return err; 1786} 1787 1788/* 1789 * Socket option code for TCP. 1790 */ 1791static int do_tcp_setsockopt(struct sock *sk, int level, 1792 int optname, char __user *optval, int optlen) 1793{ 1794 struct tcp_sock *tp = tcp_sk(sk); 1795 struct inet_connection_sock *icsk = inet_csk(sk); 1796 int val; 1797 int err = 0; 1798 1799 /* This is a string value all the others are int's */ 1800 if (optname == TCP_CONGESTION) { 1801 char name[TCP_CA_NAME_MAX]; 1802 1803 if (optlen < 1) 1804 return -EINVAL; 1805 1806 val = strncpy_from_user(name, optval, 1807 min(TCP_CA_NAME_MAX-1, optlen)); 1808 if (val < 0) 1809 return -EFAULT; 1810 name[val] = 0; 1811 1812 lock_sock(sk); 1813 err = tcp_set_congestion_control(sk, name); 1814 release_sock(sk); 1815 return err; 1816 } 1817 1818 if (optlen < sizeof(int)) 1819 return -EINVAL; 1820 1821 if (get_user(val, (int __user *)optval)) 1822 return -EFAULT; 1823 1824 lock_sock(sk); 1825 1826 switch (optname) { 1827 case TCP_MAXSEG: 1828 /* Values greater than interface MTU won't take effect. However 1829 * at the point when this call is done we typically don't yet 1830 * know which interface is going to be used */ 1831 if (val < 8 || val > MAX_TCP_WINDOW) { 1832 err = -EINVAL; 1833 break; 1834 } 1835 tp->rx_opt.user_mss = val; 1836 break; 1837 1838 case TCP_NODELAY: 1839 if (val) { 1840 /* TCP_NODELAY is weaker than TCP_CORK, so that 1841 * this option on corked socket is remembered, but 1842 * it is not activated until cork is cleared. 1843 * 1844 * However, when TCP_NODELAY is set we make 1845 * an explicit push, which overrides even TCP_CORK 1846 * for currently queued segments. 1847 */ 1848 tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH; 1849 tcp_push_pending_frames(sk); 1850 } else { 1851 tp->nonagle &= ~TCP_NAGLE_OFF; 1852 } 1853 break; 1854 1855 case TCP_CORK: 1856 /* When set indicates to always queue non-full frames. 1857 * Later the user clears this option and we transmit 1858 * any pending partial frames in the queue. This is 1859 * meant to be used alongside sendfile() to get properly 1860 * filled frames when the user (for example) must write 1861 * out headers with a write() call first and then use 1862 * sendfile to send out the data parts. 1863 * 1864 * TCP_CORK can be set together with TCP_NODELAY and it is 1865 * stronger than TCP_NODELAY. 
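		 * A minimal userspace sketch of the intended pattern
		 * (illustrative only; "sock" and "filefd" are assumed
		 * descriptors, header/body buffers and lengths likewise):
		 *
		 *	int on = 1, off = 0;
		 *	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
		 *	write(sock, hdr, hdr_len);		    (small header)
		 *	sendfile(sock, filefd, NULL, body_len);	    (bulk body)
		 *	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
		 *
		 * Clearing the option at the end pushes the final partial
		 * frame, so header and body leave in well-filled segments.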
1866 */ 1867 if (val) { 1868 tp->nonagle |= TCP_NAGLE_CORK; 1869 } else { 1870 tp->nonagle &= ~TCP_NAGLE_CORK; 1871 if (tp->nonagle&TCP_NAGLE_OFF) 1872 tp->nonagle |= TCP_NAGLE_PUSH; 1873 tcp_push_pending_frames(sk); 1874 } 1875 break; 1876 1877 case TCP_KEEPIDLE: 1878 if (val < 1 || val > MAX_TCP_KEEPIDLE) 1879 err = -EINVAL; 1880 else { 1881 tp->keepalive_time = val * HZ; 1882 if (sock_flag(sk, SOCK_KEEPOPEN) && 1883 !((1 << sk->sk_state) & 1884 (TCPF_CLOSE | TCPF_LISTEN))) { 1885 __u32 elapsed = tcp_time_stamp - tp->rcv_tstamp; 1886 if (tp->keepalive_time > elapsed) 1887 elapsed = tp->keepalive_time - elapsed; 1888 else 1889 elapsed = 0; 1890 inet_csk_reset_keepalive_timer(sk, elapsed); 1891 } 1892 } 1893 break; 1894 case TCP_KEEPINTVL: 1895 if (val < 1 || val > MAX_TCP_KEEPINTVL) 1896 err = -EINVAL; 1897 else 1898 tp->keepalive_intvl = val * HZ; 1899 break; 1900 case TCP_KEEPCNT: 1901 if (val < 1 || val > MAX_TCP_KEEPCNT) 1902 err = -EINVAL; 1903 else 1904 tp->keepalive_probes = val; 1905 break; 1906 case TCP_SYNCNT: 1907 if (val < 1 || val > MAX_TCP_SYNCNT) 1908 err = -EINVAL; 1909 else 1910 icsk->icsk_syn_retries = val; 1911 break; 1912 1913 case TCP_LINGER2: 1914 if (val < 0) 1915 tp->linger2 = -1; 1916 else if (val > sysctl_tcp_fin_timeout / HZ) 1917 tp->linger2 = 0; 1918 else 1919 tp->linger2 = val * HZ; 1920 break; 1921 1922 case TCP_DEFER_ACCEPT: 1923 icsk->icsk_accept_queue.rskq_defer_accept = 0; 1924 if (val > 0) { 1925 /* Translate value in seconds to number of 1926 * retransmits */ 1927 while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && 1928 val > ((TCP_TIMEOUT_INIT / HZ) << 1929 icsk->icsk_accept_queue.rskq_defer_accept)) 1930 icsk->icsk_accept_queue.rskq_defer_accept++; 1931 icsk->icsk_accept_queue.rskq_defer_accept++; 1932 } 1933 break; 1934 1935 case TCP_WINDOW_CLAMP: 1936 if (!val) { 1937 if (sk->sk_state != TCP_CLOSE) { 1938 err = -EINVAL; 1939 break; 1940 } 1941 tp->window_clamp = 0; 1942 } else 1943 tp->window_clamp = val < SOCK_MIN_RCVBUF / 2 ? 
					   SOCK_MIN_RCVBUF / 2 : val;
		break;

	case TCP_QUICKACK:
		if (!val) {
			icsk->icsk_ack.pingpong = 1;
		} else {
			icsk->icsk_ack.pingpong = 0;
			if ((1 << sk->sk_state) &
			    (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
			    inet_csk_ack_scheduled(sk)) {
				icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
				tcp_cleanup_rbuf(sk, 1);
				if (!(val & 1))
					icsk->icsk_ack.pingpong = 1;
			}
		}
		break;

#ifdef CONFIG_TCP_MD5SIG
	case TCP_MD5SIG:
		/* Read the IP->Key mappings from userspace */
		err = tp->af_specific->md5_parse(sk, optval, optlen);
		break;
#endif

	default:
		err = -ENOPROTOOPT;
		break;
	}

	release_sock(sk);
	return err;
}

int tcp_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
		   int optlen)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	if (level != SOL_TCP)
		return icsk->icsk_af_ops->setsockopt(sk, level, optname,
						     optval, optlen);
	return do_tcp_setsockopt(sk, level, optname, optval, optlen);
}

#ifdef CONFIG_COMPAT
int compat_tcp_setsockopt(struct sock *sk, int level, int optname,
			  char __user *optval, int optlen)
{
	if (level != SOL_TCP)
		return inet_csk_compat_setsockopt(sk, level, optname,
						  optval, optlen);
	return do_tcp_setsockopt(sk, level, optname, optval, optlen);
}

EXPORT_SYMBOL(compat_tcp_setsockopt);
#endif

/* Return information about state of tcp endpoint in API format. */
void tcp_get_info(struct sock *sk, struct tcp_info *info)
{
	struct tcp_sock *tp = tcp_sk(sk);
	const struct inet_connection_sock *icsk = inet_csk(sk);
	u32 now = tcp_time_stamp;

	memset(info, 0, sizeof(*info));

	info->tcpi_state = sk->sk_state;
	info->tcpi_ca_state = icsk->icsk_ca_state;
	info->tcpi_retransmits = icsk->icsk_retransmits;
	info->tcpi_probes = icsk->icsk_probes_out;
	info->tcpi_backoff = icsk->icsk_backoff;

	if (tp->rx_opt.tstamp_ok)
		info->tcpi_options |= TCPI_OPT_TIMESTAMPS;
	if (tp->rx_opt.sack_ok)
		info->tcpi_options |= TCPI_OPT_SACK;
	if (tp->rx_opt.wscale_ok) {
		info->tcpi_options |= TCPI_OPT_WSCALE;
		info->tcpi_snd_wscale = tp->rx_opt.snd_wscale;
		info->tcpi_rcv_wscale = tp->rx_opt.rcv_wscale;
	}

	if (tp->ecn_flags&TCP_ECN_OK)
		info->tcpi_options |= TCPI_OPT_ECN;

	info->tcpi_rto = jiffies_to_usecs(icsk->icsk_rto);
	info->tcpi_ato = jiffies_to_usecs(icsk->icsk_ack.ato);
	info->tcpi_snd_mss = tp->mss_cache;
	info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss;

	info->tcpi_unacked = tp->packets_out;
	info->tcpi_sacked = tp->sacked_out;
	info->tcpi_lost = tp->lost_out;
	info->tcpi_retrans = tp->retrans_out;
	info->tcpi_fackets = tp->fackets_out;

	info->tcpi_last_data_sent = jiffies_to_msecs(now - tp->lsndtime);
	info->tcpi_last_data_recv = jiffies_to_msecs(now - icsk->icsk_ack.lrcvtime);
	info->tcpi_last_ack_recv = jiffies_to_msecs(now - tp->rcv_tstamp);

	info->tcpi_pmtu = icsk->icsk_pmtu_cookie;
	info->tcpi_rcv_ssthresh = tp->rcv_ssthresh;
	info->tcpi_rtt = jiffies_to_usecs(tp->srtt)>>3;
	info->tcpi_rttvar = jiffies_to_usecs(tp->mdev)>>2;
	info->tcpi_snd_ssthresh = tp->snd_ssthresh;
	info->tcpi_snd_cwnd = tp->snd_cwnd;
	info->tcpi_advmss = tp->advmss;
	info->tcpi_reordering = tp->reordering;

	info->tcpi_rcv_rtt =
		jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3;
	info->tcpi_rcv_space = tp->rcvq_space.space;

	info->tcpi_total_retrans = tp->total_retrans;
}

EXPORT_SYMBOL_GPL(tcp_get_info);

static int do_tcp_getsockopt(struct sock *sk, int level,
			     int optname, char __user *optval, int __user *optlen)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	int val, len;

	if (get_user(len, optlen))
		return -EFAULT;

	len = min_t(unsigned int, len, sizeof(int));

	if (len < 0)
		return -EINVAL;

	switch (optname) {
	case TCP_MAXSEG:
		val = tp->mss_cache;
		if (!val && ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)))
			val = tp->rx_opt.user_mss;
		break;
	case TCP_NODELAY:
		val = !!(tp->nonagle&TCP_NAGLE_OFF);
		break;
	case TCP_CORK:
		val = !!(tp->nonagle&TCP_NAGLE_CORK);
		break;
	case TCP_KEEPIDLE:
		val = (tp->keepalive_time ? : sysctl_tcp_keepalive_time) / HZ;
		break;
	case TCP_KEEPINTVL:
		val = (tp->keepalive_intvl ? : sysctl_tcp_keepalive_intvl) / HZ;
		break;
	case TCP_KEEPCNT:
		val = tp->keepalive_probes ? : sysctl_tcp_keepalive_probes;
		break;
	case TCP_SYNCNT:
		val = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
		break;
	case TCP_LINGER2:
		val = tp->linger2;
		if (val >= 0)
			val = (val ? : sysctl_tcp_fin_timeout) / HZ;
		break;
	case TCP_DEFER_ACCEPT:
		val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 :
		      ((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1));
		break;
	case TCP_WINDOW_CLAMP:
		val = tp->window_clamp;
		break;
	case TCP_INFO: {
		struct tcp_info info;

		if (get_user(len, optlen))
			return -EFAULT;

		tcp_get_info(sk, &info);

		len = min_t(unsigned int, len, sizeof(info));
		if (put_user(len, optlen))
			return -EFAULT;
		if (copy_to_user(optval, &info, len))
			return -EFAULT;
		return 0;
	}
	case TCP_QUICKACK:
		val = !icsk->icsk_ack.pingpong;
		break;

	case TCP_CONGESTION:
		if (get_user(len, optlen))
			return -EFAULT;
		len = min_t(unsigned int, len, TCP_CA_NAME_MAX);
		if (put_user(len, optlen))
			return -EFAULT;
		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
			return -EFAULT;
		return 0;
	default:
		return -ENOPROTOOPT;
	}

	if (put_user(len, optlen))
		return -EFAULT;
	if (copy_to_user(optval, &val, len))
		return -EFAULT;
	return 0;
}

int tcp_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
		   int __user *optlen)
{
	struct inet_connection_sock *icsk = inet_csk(sk);

	if (level != SOL_TCP)
		return icsk->icsk_af_ops->getsockopt(sk, level, optname,
						     optval, optlen);
	return do_tcp_getsockopt(sk, level, optname, optval, optlen);
}

#ifdef CONFIG_COMPAT
int compat_tcp_getsockopt(struct sock *sk, int level, int optname,
			  char __user *optval, int __user *optlen)
{
	if (level != SOL_TCP)
		return inet_csk_compat_getsockopt(sk, level, optname,
						  optval, optlen);
	return do_tcp_getsockopt(sk, level, optname, optval, optlen);
}

EXPORT_SYMBOL(compat_tcp_getsockopt);
#endif

struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features)
{
	struct sk_buff *segs = ERR_PTR(-EINVAL);
	struct tcphdr *th;
	unsigned thlen;
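	/* Note on the checksum handling below: oldlen caches the ones'
	 * complement of the original skb length so that, once skb_segment()
	 * has split the frame, each segment's TCP checksum can be patched
	 * incrementally for the new pseudo-header length (old length out,
	 * per-segment thlen + payload in) rather than being recomputed,
	 * except when ip_summed != CHECKSUM_PARTIAL and the full sum is
	 * folded anyway.
	 */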
	unsigned int seq;
	__be32 delta;
	unsigned int oldlen;
	unsigned int len;

	if (!pskb_may_pull(skb, sizeof(*th)))
		goto out;

	th = tcp_hdr(skb);
	thlen = th->doff * 4;
	if (thlen < sizeof(*th))
		goto out;

	if (!pskb_may_pull(skb, thlen))
		goto out;

	oldlen = (u16)~skb->len;
	__skb_pull(skb, thlen);

	if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
		/* Packet is from an untrusted source, reset gso_segs. */
		int type = skb_shinfo(skb)->gso_type;
		int mss;

		if (unlikely(type &
			     ~(SKB_GSO_TCPV4 |
			       SKB_GSO_DODGY |
			       SKB_GSO_TCP_ECN |
			       SKB_GSO_TCPV6 |
			       0) ||
			     !(type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))))
			goto out;

		mss = skb_shinfo(skb)->gso_size;
		skb_shinfo(skb)->gso_segs = (skb->len + mss - 1) / mss;

		segs = NULL;
		goto out;
	}

	segs = skb_segment(skb, features);
	if (IS_ERR(segs))
		goto out;

	len = skb_shinfo(skb)->gso_size;
	delta = htonl(oldlen + (thlen + len));

	skb = segs;
	th = tcp_hdr(skb);
	seq = ntohl(th->seq);

	do {
		th->fin = th->psh = 0;

		th->check = ~csum_fold((__force __wsum)((__force u32)th->check +
						       (__force u32)delta));
		if (skb->ip_summed != CHECKSUM_PARTIAL)
			th->check =
			     csum_fold(csum_partial(skb_transport_header(skb),
						    thlen, skb->csum));

		seq += len;
		skb = skb->next;
		th = tcp_hdr(skb);

		th->seq = htonl(seq);
		th->cwr = 0;
	} while (skb->next);

	delta = htonl(oldlen + (skb->tail - skb->transport_header) +
		      skb->data_len);
	th->check = ~csum_fold((__force __wsum)((__force u32)th->check +
					       (__force u32)delta));
	if (skb->ip_summed != CHECKSUM_PARTIAL)
		th->check = csum_fold(csum_partial(skb_transport_header(skb),
						   thlen, skb->csum));

out:
	return segs;
}
EXPORT_SYMBOL(tcp_tso_segment);

#ifdef CONFIG_INET_GRO
struct sk_buff ** BCMFASTPATH_HOST tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
{
	struct sk_buff **pp = NULL;
	struct sk_buff *p;
	struct tcphdr *th;
	struct tcphdr *th2;
	unsigned int len;
	unsigned int thlen;
	unsigned int flags;
	unsigned int mss = 1;
	int flush = 1;
	int i;

	th = skb_gro_header(skb, sizeof(*th));
	if (unlikely(!th))
		goto out;

	thlen = th->doff * 4;
	if (thlen < sizeof(*th))
		goto out;

	th = skb_gro_header(skb, thlen);
	if (unlikely(!th))
		goto out;

	skb_gro_pull(skb, thlen);

	len = skb_gro_len(skb);
	flags = tcp_flag_word(th);

	for (; (p = *head); head = &p->next) {
		if (!NAPI_GRO_CB(p)->same_flow)
			continue;

		th2 = tcp_hdr(p);

		if ((th->source ^ th2->source) | (th->dest ^ th2->dest)) {
			NAPI_GRO_CB(p)->same_flow = 0;
			continue;
		}

		goto found;
	}

	goto out_check_final;

found:
	flush = NAPI_GRO_CB(p)->flush;
	flush |= flags & TCP_FLAG_CWR;
	flush |= (flags ^ tcp_flag_word(th2)) &
		 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
	flush |= (th->ack_seq ^ th2->ack_seq) | (th->window ^ th2->window);
	for (i = sizeof(*th); !flush && i < thlen; i += 4)
		flush |= *(u32 *)((u8 *)th + i) ^
			 *(u32 *)((u8 *)th2 + i);

	mss = skb_shinfo(p)->gso_size;

	flush |= (len > mss) | !len;
	flush |= (ntohl(th2->seq) + skb_gro_len(p)) ^ ntohl(th->seq);

	if
	    (flush || skb_gro_receive(head, skb)) {
		mss = 1;
		goto out_check_final;
	}

	p = *head;
	th2 = tcp_hdr(p);
	tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);

out_check_final:
	flush = len < mss;
	flush |= flags & (TCP_FLAG_URG | TCP_FLAG_PSH | TCP_FLAG_RST |
			  TCP_FLAG_SYN | TCP_FLAG_FIN);

	if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
		pp = head;

out:
	NAPI_GRO_CB(skb)->flush |= flush;

	return pp;
}
EXPORT_SYMBOL(tcp_gro_receive);

int BCMFASTPATH_HOST tcp_gro_complete(struct sk_buff *skb)
{
	struct tcphdr *th = tcp_hdr(skb);

	skb->csum_start = skb_transport_header(skb) - skb->head;
	skb->csum_offset = offsetof(struct tcphdr, check);
	skb->ip_summed = CHECKSUM_PARTIAL;

	skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;

	if (th->cwr)
		skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;

	return 0;
}
EXPORT_SYMBOL(tcp_gro_complete);
#endif /* CONFIG_INET_GRO */

#ifdef CONFIG_TCP_MD5SIG
static unsigned long tcp_md5sig_users;
static struct tcp_md5sig_pool **tcp_md5sig_pool;
static DEFINE_SPINLOCK(tcp_md5sig_pool_lock);

static void __tcp_free_md5sig_pool(struct tcp_md5sig_pool **pool)
{
	int cpu;
	for_each_possible_cpu(cpu) {
		struct tcp_md5sig_pool *p = *per_cpu_ptr(pool, cpu);
		if (p) {
			if (p->md5_desc.tfm)
				crypto_free_hash(p->md5_desc.tfm);
			kfree(p);
			p = NULL;
		}
	}
	free_percpu(pool);
}

void tcp_free_md5sig_pool(void)
{
	struct tcp_md5sig_pool **pool = NULL;

	spin_lock_bh(&tcp_md5sig_pool_lock);
	if (--tcp_md5sig_users == 0) {
		pool = tcp_md5sig_pool;
		tcp_md5sig_pool = NULL;
	}
	spin_unlock_bh(&tcp_md5sig_pool_lock);
	if (pool)
		__tcp_free_md5sig_pool(pool);
}

EXPORT_SYMBOL(tcp_free_md5sig_pool);

static struct tcp_md5sig_pool **__tcp_alloc_md5sig_pool(void)
{
	int cpu;
	struct tcp_md5sig_pool **pool;

	pool = alloc_percpu(struct tcp_md5sig_pool *);
	if (!pool)
		return NULL;

	for_each_possible_cpu(cpu) {
		struct tcp_md5sig_pool *p;
		struct crypto_hash *hash;

		p = kzalloc(sizeof(*p), GFP_KERNEL);
		if (!p)
			goto out_free;
		*per_cpu_ptr(pool, cpu) = p;

		hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
		if (!hash || IS_ERR(hash))
			goto out_free;

		p->md5_desc.tfm = hash;
	}
	return pool;
out_free:
	__tcp_free_md5sig_pool(pool);
	return NULL;
}

struct tcp_md5sig_pool **tcp_alloc_md5sig_pool(void)
{
	struct tcp_md5sig_pool **pool;
	int alloc = 0;

retry:
	spin_lock_bh(&tcp_md5sig_pool_lock);
	pool = tcp_md5sig_pool;
	if (tcp_md5sig_users++ == 0) {
		alloc = 1;
		spin_unlock_bh(&tcp_md5sig_pool_lock);
	} else if (!pool) {
		tcp_md5sig_users--;
		spin_unlock_bh(&tcp_md5sig_pool_lock);
		cpu_relax();
		goto retry;
	} else
		spin_unlock_bh(&tcp_md5sig_pool_lock);

	if (alloc) {
		/* we cannot hold spinlock here because this may sleep. */
		struct tcp_md5sig_pool **p = __tcp_alloc_md5sig_pool();
		spin_lock_bh(&tcp_md5sig_pool_lock);
		if (!p) {
			tcp_md5sig_users--;
			spin_unlock_bh(&tcp_md5sig_pool_lock);
			return NULL;
		}
		pool = tcp_md5sig_pool;
		if (pool) {
			/* oops, it has already been assigned.
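			 *
			 * Another caller raced with us while the lock was
			 * dropped for the sleeping allocation above and has
			 * already published its pool: keep the published one
			 * (the user count we took earlier still stands) and
			 * free the pool we just built.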
			 */
			spin_unlock_bh(&tcp_md5sig_pool_lock);
			__tcp_free_md5sig_pool(p);
		} else {
			tcp_md5sig_pool = pool = p;
			spin_unlock_bh(&tcp_md5sig_pool_lock);
		}
	}
	return pool;
}

EXPORT_SYMBOL(tcp_alloc_md5sig_pool);

struct tcp_md5sig_pool *__tcp_get_md5sig_pool(int cpu)
{
	struct tcp_md5sig_pool **p;
	spin_lock_bh(&tcp_md5sig_pool_lock);
	p = tcp_md5sig_pool;
	if (p)
		tcp_md5sig_users++;
	spin_unlock_bh(&tcp_md5sig_pool_lock);
	return (p ? *per_cpu_ptr(p, cpu) : NULL);
}

EXPORT_SYMBOL(__tcp_get_md5sig_pool);

void __tcp_put_md5sig_pool(void)
{
	tcp_free_md5sig_pool();
}

EXPORT_SYMBOL(__tcp_put_md5sig_pool);
#endif

void tcp_done(struct sock *sk)
{
	if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
		TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS);

	tcp_set_state(sk, TCP_CLOSE);
	tcp_clear_xmit_timers(sk);

	sk->sk_shutdown = SHUTDOWN_MASK;

	if (!sock_flag(sk, SOCK_DEAD))
		sk->sk_state_change(sk);
	else
		inet_csk_destroy_sock(sk);
}
EXPORT_SYMBOL_GPL(tcp_done);

extern void __skb_cb_too_small_for_tcp(int, int);
extern struct tcp_congestion_ops tcp_reno;

static __initdata unsigned long thash_entries;
static int __init set_thash_entries(char *str)
{
	if (!str)
		return 0;
	thash_entries = simple_strtoul(str, &str, 0);
	return 1;
}
__setup("thash_entries=", set_thash_entries);

void __init tcp_init(void)
{
	struct sk_buff *skb = NULL;
	unsigned long limit;
	int order, i, max_share;

	if (sizeof(struct tcp_skb_cb) > sizeof(skb->cb))
		__skb_cb_too_small_for_tcp(sizeof(struct tcp_skb_cb),
					   sizeof(skb->cb));

	tcp_hashinfo.bind_bucket_cachep =
		kmem_cache_create("tcp_bind_bucket",
				  sizeof(struct inet_bind_bucket), 0,
				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);

	/* Size and allocate the main established and bind bucket
	 * hash tables.
	 *
	 * The methodology is similar to that of the buffer cache.
	 */
	tcp_hashinfo.ehash =
		alloc_large_system_hash("TCP established",
					sizeof(struct inet_ehash_bucket),
					thash_entries,
					(num_physpages >= 128 * 1024) ?
					13 : 15,
					0,
					&tcp_hashinfo.ehash_size,
					NULL,
					0);
	tcp_hashinfo.ehash_size = 1 << tcp_hashinfo.ehash_size;
	for (i = 0; i < tcp_hashinfo.ehash_size; i++) {
		rwlock_init(&tcp_hashinfo.ehash[i].lock);
		INIT_HLIST_HEAD(&tcp_hashinfo.ehash[i].chain);
		INIT_HLIST_HEAD(&tcp_hashinfo.ehash[i].twchain);
	}

	tcp_hashinfo.bhash =
		alloc_large_system_hash("TCP bind",
					sizeof(struct inet_bind_hashbucket),
					tcp_hashinfo.ehash_size,
					(num_physpages >= 128 * 1024) ?
					13 : 15,
					0,
					&tcp_hashinfo.bhash_size,
					NULL,
					64 * 1024);
	tcp_hashinfo.bhash_size = 1 << tcp_hashinfo.bhash_size;
	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
		spin_lock_init(&tcp_hashinfo.bhash[i].lock);
		INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
	}

	/* Try to be a bit smarter and adjust defaults depending
	 * on available memory.
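	 *
	 * The loop below finds the smallest 'order' such that 2^order pages
	 * cover the bind-bucket hash table.  As a purely hypothetical
	 * illustration (4 KiB pages and, say, a 1 MiB bind table), the loop
	 * would stop at order 8, so the 'order >= 4' branch treats the
	 * machine as large and raises the TIME-WAIT bucket, orphan and SYN
	 * backlog limits, while order < 3 scales the same limits down for
	 * small machines.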
	 */
	for (order = 0; ((1 << order) << PAGE_SHIFT) <
			(tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket));
			order++)
		;
	if (order >= 4) {
		tcp_death_row.sysctl_max_tw_buckets = 180000;
		sysctl_tcp_max_orphans = 4096 << (order - 4);
		sysctl_max_syn_backlog = 1024;
	} else if (order < 3) {
		tcp_death_row.sysctl_max_tw_buckets >>= (3 - order);
		sysctl_tcp_max_orphans >>= (3 - order);
		sysctl_max_syn_backlog = 128;
	}

	/* Set the pressure threshold to be a fraction of global memory that
	 * is up to 1/2 at 256 MB, decreasing toward zero with the amount of
	 * memory, with a floor of 128 pages.
	 */
	limit = min(nr_all_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
	limit = (limit * (nr_all_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
	limit = max(limit, 128UL);
	sysctl_tcp_mem[0] = limit / 4 * 3;
	sysctl_tcp_mem[1] = limit;
	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;

	/* Set per-socket limits to no more than 1/128 the pressure threshold */
	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
	max_share = min(4UL*1024*1024, limit);

	sysctl_tcp_wmem[0] = SK_STREAM_MEM_QUANTUM;
	sysctl_tcp_wmem[1] = 16*1024;
	sysctl_tcp_wmem[2] = max(64*1024, max_share);

	sysctl_tcp_rmem[0] = SK_STREAM_MEM_QUANTUM;
	sysctl_tcp_rmem[1] = 87380;
	sysctl_tcp_rmem[2] = max(87380, max_share);

	printk(KERN_INFO "TCP: Hash tables configured "
	       "(established %d bind %d)\n",
	       tcp_hashinfo.ehash_size, tcp_hashinfo.bhash_size);

	tcp_register_congestion_control(&tcp_reno);
}

EXPORT_SYMBOL(tcp_close);
EXPORT_SYMBOL(tcp_disconnect);
EXPORT_SYMBOL(tcp_getsockopt);
EXPORT_SYMBOL(tcp_ioctl);
EXPORT_SYMBOL(tcp_poll);
EXPORT_SYMBOL(tcp_read_sock);
EXPORT_SYMBOL(tcp_recvmsg);
EXPORT_SYMBOL(tcp_sendmsg);
EXPORT_SYMBOL(tcp_sendpage);
EXPORT_SYMBOL(tcp_setsockopt);
EXPORT_SYMBOL(tcp_shutdown);
EXPORT_SYMBOL(tcp_statistics);
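
/*
 * Illustrative userspace sketch only (not part of the kernel build; "fd" is
 * a hypothetical connected TCP socket) of how the TCP_INFO and
 * TCP_CONGESTION getsockopt paths implemented above are typically consumed:
 *
 *	#include <stdio.h>
 *	#include <sys/socket.h>
 *	#include <netinet/in.h>
 *	#include <netinet/tcp.h>
 *
 *	struct tcp_info info;
 *	char ca[16];
 *	socklen_t len = sizeof(info);
 *
 *	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
 *		printf("rtt %u us, cwnd %u\n", info.tcpi_rtt, info.tcpi_snd_cwnd);
 *
 *	len = sizeof(ca);
 *	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, ca, &len) == 0)
 *		printf("congestion control: %.*s\n", (int)len, ca);
 */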