History log of /freebsd-10.0-release/sys/netinet/
Revision Date Author Comments
279264 25-Feb-2015 delphij

Fix integer overflow in IGMP protocol. [SA-15:04]

Fix vt(4) crash with improper ioctl parameters. [EN-15:01]

Updated base system OpenSSL to 1.0.1l. [EN-15:02]

Fix freebsd-update libraries update ordering issue. [EN-15:03]

Approved by: so


/freebsd-10.0-release/UPDATING
/freebsd-10.0-release/crypto/openssl/ACKNOWLEDGMENTS
/freebsd-10.0-release/crypto/openssl/CHANGES
/freebsd-10.0-release/crypto/openssl/Configure
/freebsd-10.0-release/crypto/openssl/FAQ
/freebsd-10.0-release/crypto/openssl/Makefile
/freebsd-10.0-release/crypto/openssl/Makefile.org
/freebsd-10.0-release/crypto/openssl/NEWS
/freebsd-10.0-release/crypto/openssl/README
/freebsd-10.0-release/crypto/openssl/apps/Makefile
/freebsd-10.0-release/crypto/openssl/apps/apps.c
/freebsd-10.0-release/crypto/openssl/apps/apps.h
/freebsd-10.0-release/crypto/openssl/apps/ca.c
/freebsd-10.0-release/crypto/openssl/apps/ciphers.c
/freebsd-10.0-release/crypto/openssl/apps/crl.c
/freebsd-10.0-release/crypto/openssl/apps/crl2p7.c
/freebsd-10.0-release/crypto/openssl/apps/dgst.c
/freebsd-10.0-release/crypto/openssl/apps/ecparam.c
/freebsd-10.0-release/crypto/openssl/apps/enc.c
/freebsd-10.0-release/crypto/openssl/apps/ocsp.c
/freebsd-10.0-release/crypto/openssl/apps/openssl.c
/freebsd-10.0-release/crypto/openssl/apps/pkcs12.c
/freebsd-10.0-release/crypto/openssl/apps/progs.h
/freebsd-10.0-release/crypto/openssl/apps/progs.pl
/freebsd-10.0-release/crypto/openssl/apps/req.c
/freebsd-10.0-release/crypto/openssl/apps/s_cb.c
/freebsd-10.0-release/crypto/openssl/apps/s_client.c
/freebsd-10.0-release/crypto/openssl/apps/s_server.c
/freebsd-10.0-release/crypto/openssl/apps/s_socket.c
/freebsd-10.0-release/crypto/openssl/apps/s_time.c
/freebsd-10.0-release/crypto/openssl/apps/smime.c
/freebsd-10.0-release/crypto/openssl/apps/speed.c
/freebsd-10.0-release/crypto/openssl/config
/freebsd-10.0-release/crypto/openssl/crypto/Makefile
/freebsd-10.0-release/crypto/openssl/crypto/aes/asm/aes-mips.pl
/freebsd-10.0-release/crypto/openssl/crypto/aes/asm/aes-parisc.pl
/freebsd-10.0-release/crypto/openssl/crypto/aes/asm/aesni-x86_64.pl
/freebsd-10.0-release/crypto/openssl/crypto/aes/asm/bsaes-x86_64.pl
/freebsd-10.0-release/crypto/openssl/crypto/aes/asm/vpaes-x86_64.pl
/freebsd-10.0-release/crypto/openssl/crypto/armcap.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/a_int.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/a_strex.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/a_strnid.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/a_utctm.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/ameth_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/asn1.h
/freebsd-10.0-release/crypto/openssl/crypto/asn1/asn1_err.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/asn1_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/asn_mime.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/asn_pack.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/bio_asn1.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/charmap.pl
/freebsd-10.0-release/crypto/openssl/crypto/asn1/evp_asn1.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/t_x509.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/tasn_dec.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/tasn_enc.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/x_crl.c
/freebsd-10.0-release/crypto/openssl/crypto/asn1/x_name.c
/freebsd-10.0-release/crypto/openssl/crypto/bio/bio.h
/freebsd-10.0-release/crypto/openssl/crypto/bio/bio_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/bio/bss_dgram.c
/freebsd-10.0-release/crypto/openssl/crypto/bio/bss_log.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/Makefile
/freebsd-10.0-release/crypto/openssl/crypto/bn/asm/mips-mont.pl
/freebsd-10.0-release/crypto/openssl/crypto/bn/asm/mips.pl
/freebsd-10.0-release/crypto/openssl/crypto/bn/asm/mips3.s
/freebsd-10.0-release/crypto/openssl/crypto/bn/asm/parisc-mont.pl
/freebsd-10.0-release/crypto/openssl/crypto/bn/asm/x86_64-gcc.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/asm/x86_64-gf2m.pl
/freebsd-10.0-release/crypto/openssl/crypto/bn/asm/x86_64-mont5.pl
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn.h
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn_ctx.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn_div.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn_exp.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn_mont.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn_nist.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/bn_sqr.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/bntest.c
/freebsd-10.0-release/crypto/openssl/crypto/bn/exptest.c
/freebsd-10.0-release/crypto/openssl/crypto/buffer/buffer.c
/freebsd-10.0-release/crypto/openssl/crypto/buffer/buffer.h
/freebsd-10.0-release/crypto/openssl/crypto/cms/cms_env.c
/freebsd-10.0-release/crypto/openssl/crypto/cms/cms_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/cms/cms_pwri.c
/freebsd-10.0-release/crypto/openssl/crypto/cms/cms_sd.c
/freebsd-10.0-release/crypto/openssl/crypto/cms/cms_smime.c
/freebsd-10.0-release/crypto/openssl/crypto/conf/conf_def.c
/freebsd-10.0-release/crypto/openssl/crypto/constant_time_locl.h
/freebsd-10.0-release/crypto/openssl/crypto/constant_time_test.c
/freebsd-10.0-release/crypto/openssl/crypto/cryptlib.c
/freebsd-10.0-release/crypto/openssl/crypto/cversion.c
/freebsd-10.0-release/crypto/openssl/crypto/dsa/dsa_ameth.c
/freebsd-10.0-release/crypto/openssl/crypto/dso/dso_dlfcn.c
/freebsd-10.0-release/crypto/openssl/crypto/ebcdic.h
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec.h
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec2_smpl.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec_ameth.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec_asn1.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec_lcl.h
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec_mult.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ec_pmeth.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ecp_mont.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ecp_nist.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ecp_nistp256.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ecp_smpl.c
/freebsd-10.0-release/crypto/openssl/crypto/ec/ectest.c
/freebsd-10.0-release/crypto/openssl/crypto/ecdsa/ecs_vrf.c
/freebsd-10.0-release/crypto/openssl/crypto/engine/eng_dyn.c
/freebsd-10.0-release/crypto/openssl/crypto/engine/eng_list.c
/freebsd-10.0-release/crypto/openssl/crypto/engine/eng_rdrand.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/Makefile
/freebsd-10.0-release/crypto/openssl/crypto/evp/bio_b64.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/digest.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/e_aes.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/e_aes_cbc_hmac_sha1.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/e_des3.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/encode.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/evp_enc.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/evp_pbe.c
/freebsd-10.0-release/crypto/openssl/crypto/evp/p5_crpt2.c
/freebsd-10.0-release/crypto/openssl/crypto/idea/ideatest.c
/freebsd-10.0-release/crypto/openssl/crypto/md32_common.h
/freebsd-10.0-release/crypto/openssl/crypto/md5/asm/md5-x86_64.pl
/freebsd-10.0-release/crypto/openssl/crypto/mem.c
/freebsd-10.0-release/crypto/openssl/crypto/modes/Makefile
/freebsd-10.0-release/crypto/openssl/crypto/modes/asm/ghash-parisc.pl
/freebsd-10.0-release/crypto/openssl/crypto/modes/cbc128.c
/freebsd-10.0-release/crypto/openssl/crypto/modes/ccm128.c
/freebsd-10.0-release/crypto/openssl/crypto/modes/cts128.c
/freebsd-10.0-release/crypto/openssl/crypto/modes/gcm128.c
/freebsd-10.0-release/crypto/openssl/crypto/modes/modes.h
/freebsd-10.0-release/crypto/openssl/crypto/modes/modes_lcl.h
/freebsd-10.0-release/crypto/openssl/crypto/objects/obj_dat.h
/freebsd-10.0-release/crypto/openssl/crypto/objects/obj_dat.pl
/freebsd-10.0-release/crypto/openssl/crypto/objects/obj_xref.h
/freebsd-10.0-release/crypto/openssl/crypto/objects/objxref.pl
/freebsd-10.0-release/crypto/openssl/crypto/ocsp/ocsp_ht.c
/freebsd-10.0-release/crypto/openssl/crypto/ocsp/ocsp_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/ocsp/ocsp_vfy.c
/freebsd-10.0-release/crypto/openssl/crypto/opensslconf.h
/freebsd-10.0-release/crypto/openssl/crypto/opensslv.h
/freebsd-10.0-release/crypto/openssl/crypto/ossl_typ.h
/freebsd-10.0-release/crypto/openssl/crypto/pariscid.pl
/freebsd-10.0-release/crypto/openssl/crypto/pem/pem_info.c
/freebsd-10.0-release/crypto/openssl/crypto/pem/pvkfmt.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs12/p12_crt.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs12/p12_kiss.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/Makefile
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/bio_ber.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/dec.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/des.pem
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/doc
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/enc.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/es1.pem
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/example.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/example.h
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/info.pem
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/infokey.pem
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/p7
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/pk7_doit.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/pkcs7.h
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/pkcs7err.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/server.pem
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/sign.c
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/t
/freebsd-10.0-release/crypto/openssl/crypto/pkcs7/verify.c
/freebsd-10.0-release/crypto/openssl/crypto/pqueue/pqueue.h
/freebsd-10.0-release/crypto/openssl/crypto/rand/md_rand.c
/freebsd-10.0-release/crypto/openssl/crypto/rand/rand.h
/freebsd-10.0-release/crypto/openssl/crypto/rand/rand_err.c
/freebsd-10.0-release/crypto/openssl/crypto/rand/rand_lcl.h
/freebsd-10.0-release/crypto/openssl/crypto/rand/rand_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/rand/randfile.c
/freebsd-10.0-release/crypto/openssl/crypto/rc4/asm/rc4-parisc.pl
/freebsd-10.0-release/crypto/openssl/crypto/rsa/Makefile
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa.h
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_ameth.c
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_chk.c
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_eay.c
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_err.c
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_oaep.c
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_pk1.c
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_pmeth.c
/freebsd-10.0-release/crypto/openssl/crypto/rsa/rsa_sign.c
/freebsd-10.0-release/crypto/openssl/crypto/sha/Makefile
/freebsd-10.0-release/crypto/openssl/crypto/sha/asm/sha1-mips.pl
/freebsd-10.0-release/crypto/openssl/crypto/sha/asm/sha1-parisc.pl
/freebsd-10.0-release/crypto/openssl/crypto/sha/asm/sha1-x86_64.pl
/freebsd-10.0-release/crypto/openssl/crypto/sha/asm/sha512-mips.pl
/freebsd-10.0-release/crypto/openssl/crypto/sha/asm/sha512-parisc.pl
/freebsd-10.0-release/crypto/openssl/crypto/sha/sha512.c
/freebsd-10.0-release/crypto/openssl/crypto/srp/srp_grps.h
/freebsd-10.0-release/crypto/openssl/crypto/srp/srp_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/srp/srp_vfy.c
/freebsd-10.0-release/crypto/openssl/crypto/stack/safestack.h
/freebsd-10.0-release/crypto/openssl/crypto/symhacks.h
/freebsd-10.0-release/crypto/openssl/crypto/ts/ts_rsp_sign.c
/freebsd-10.0-release/crypto/openssl/crypto/ts/ts_rsp_verify.c
/freebsd-10.0-release/crypto/openssl/crypto/ui/ui_lib.c
/freebsd-10.0-release/crypto/openssl/crypto/x509/by_dir.c
/freebsd-10.0-release/crypto/openssl/crypto/x509/x509_vfy.c
/freebsd-10.0-release/crypto/openssl/crypto/x509/x509_vpm.c
/freebsd-10.0-release/crypto/openssl/crypto/x509/x_all.c
/freebsd-10.0-release/crypto/openssl/crypto/x509v3/v3_ncons.c
/freebsd-10.0-release/crypto/openssl/crypto/x509v3/v3_purp.c
/freebsd-10.0-release/crypto/openssl/crypto/x86cpuid.pl
/freebsd-10.0-release/crypto/openssl/doc/HOWTO/certificates.txt
/freebsd-10.0-release/crypto/openssl/doc/HOWTO/proxy_certificates.txt
/freebsd-10.0-release/crypto/openssl/doc/apps/asn1parse.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/c_rehash.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/ca.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/ciphers.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/cms.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/config.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/crl.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/dgst.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/dhparam.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/dsa.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/ec.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/ecparam.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/enc.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/gendsa.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/genrsa.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/ocsp.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/pkcs12.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/req.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/rsa.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/s_client.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/s_server.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/smime.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/ts.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/tsget.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/verify.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/version.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/x509.pod
/freebsd-10.0-release/crypto/openssl/doc/apps/x509v3_config.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/ASN1_generate_nconf.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/BIO_f_base64.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/BIO_push.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/BIO_s_accept.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/BN_BLINDING_new.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/CMS_add1_signer.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/CMS_decrypt.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/CMS_sign_add1_signer.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/CONF_modules_free.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/CONF_modules_load_file.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/ERR_get_error.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_BytesToKey.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_DigestInit.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_DigestVerifyInit.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_EncryptInit.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_PKEY_encrypt.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_PKEY_set1_RSA.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_PKEY_sign.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/EVP_SignInit.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/OPENSSL_config.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/RSA_set_method.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/RSA_sign.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/X509_NAME_ENTRY_get_object.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/X509_NAME_add_entry_by_txt.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/X509_NAME_get_index_by_NID.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/X509_STORE_CTX_get_error.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/X509_STORE_CTX_get_ex_new_index.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/X509_VERIFY_PARAM_set_flags.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/des.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/ecdsa.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/err.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/pem.pod
/freebsd-10.0-release/crypto/openssl/doc/crypto/ui.pod
/freebsd-10.0-release/crypto/openssl/doc/fingerprints.txt
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CIPHER_get_name.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_COMP_add_compression_method.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_add_extra_chain_cert.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_add_session.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_load_verify_locations.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_new.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_cipher_list.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_client_CA_list.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_client_cert_cb.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_mode.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_msg_callback.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_options.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_session_id_context.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_ssl_version.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_tlsext_ticket_key_cb.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_tmp_dh_callback.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_set_verify.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_CTX_use_psk_identity_hint.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_accept.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_clear.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_connect.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_do_handshake.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_get_peer_cert_chain.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_get_version.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_read.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_session_reused.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_set_fd.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_set_session.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_set_shutdown.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_shutdown.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/SSL_write.pod
/freebsd-10.0-release/crypto/openssl/doc/ssl/d2i_SSL_SESSION.pod
/freebsd-10.0-release/crypto/openssl/e_os.h
/freebsd-10.0-release/crypto/openssl/engines/ccgost/gost89.h
/freebsd-10.0-release/crypto/openssl/engines/ccgost/gost_ameth.c
/freebsd-10.0-release/crypto/openssl/engines/ccgost/gosthash.c
/freebsd-10.0-release/crypto/openssl/engines/e_padlock.c
/freebsd-10.0-release/crypto/openssl/ssl/Makefile
/freebsd-10.0-release/crypto/openssl/ssl/d1_both.c
/freebsd-10.0-release/crypto/openssl/ssl/d1_clnt.c
/freebsd-10.0-release/crypto/openssl/ssl/d1_enc.c
/freebsd-10.0-release/crypto/openssl/ssl/d1_lib.c
/freebsd-10.0-release/crypto/openssl/ssl/d1_pkt.c
/freebsd-10.0-release/crypto/openssl/ssl/d1_srvr.c
/freebsd-10.0-release/crypto/openssl/ssl/dtls1.h
/freebsd-10.0-release/crypto/openssl/ssl/heartbeat_test.c
/freebsd-10.0-release/crypto/openssl/ssl/kssl.c
/freebsd-10.0-release/crypto/openssl/ssl/kssl.h
/freebsd-10.0-release/crypto/openssl/ssl/s23_clnt.c
/freebsd-10.0-release/crypto/openssl/ssl/s23_lib.c
/freebsd-10.0-release/crypto/openssl/ssl/s23_srvr.c
/freebsd-10.0-release/crypto/openssl/ssl/s2_enc.c
/freebsd-10.0-release/crypto/openssl/ssl/s2_lib.c
/freebsd-10.0-release/crypto/openssl/ssl/s2_pkt.c
/freebsd-10.0-release/crypto/openssl/ssl/s2_srvr.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_both.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_cbc.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_clnt.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_enc.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_lib.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_meth.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_pkt.c
/freebsd-10.0-release/crypto/openssl/ssl/s3_srvr.c
/freebsd-10.0-release/crypto/openssl/ssl/srtp.h
/freebsd-10.0-release/crypto/openssl/ssl/ssl.h
/freebsd-10.0-release/crypto/openssl/ssl/ssl3.h
/freebsd-10.0-release/crypto/openssl/ssl/ssl_asn1.c
/freebsd-10.0-release/crypto/openssl/ssl/ssl_cert.c
/freebsd-10.0-release/crypto/openssl/ssl/ssl_ciph.c
/freebsd-10.0-release/crypto/openssl/ssl/ssl_err.c
/freebsd-10.0-release/crypto/openssl/ssl/ssl_lib.c
/freebsd-10.0-release/crypto/openssl/ssl/ssl_locl.h
/freebsd-10.0-release/crypto/openssl/ssl/ssl_sess.c
/freebsd-10.0-release/crypto/openssl/ssl/ssl_stat.c
/freebsd-10.0-release/crypto/openssl/ssl/ssl_utst.c
/freebsd-10.0-release/crypto/openssl/ssl/ssltest.c
/freebsd-10.0-release/crypto/openssl/ssl/t1_enc.c
/freebsd-10.0-release/crypto/openssl/ssl/t1_lib.c
/freebsd-10.0-release/crypto/openssl/ssl/tls1.h
/freebsd-10.0-release/crypto/openssl/util/libeay.num
/freebsd-10.0-release/crypto/openssl/util/mk1mf.pl
/freebsd-10.0-release/crypto/openssl/util/mkbuildinf.pl
/freebsd-10.0-release/crypto/openssl/util/mkdef.pl
/freebsd-10.0-release/crypto/openssl/util/mkerr.pl
/freebsd-10.0-release/crypto/openssl/util/pl/BC-32.pl
/freebsd-10.0-release/crypto/openssl/util/pl/VC-32.pl
/freebsd-10.0-release/crypto/openssl/util/pl/netware.pl
/freebsd-10.0-release/crypto/openssl/util/shlib_wrap.sh
/freebsd-10.0-release/crypto/openssl/util/ssleay.num
/freebsd-10.0-release/secure/lib/libcrypto/Makefile
/freebsd-10.0-release/secure/lib/libcrypto/Makefile.inc
/freebsd-10.0-release/secure/lib/libcrypto/Makefile.man
/freebsd-10.0-release/secure/lib/libcrypto/amd64/bsaes-x86_64.S
/freebsd-10.0-release/secure/lib/libcrypto/amd64/vpaes-x86_64.S
/freebsd-10.0-release/secure/lib/libcrypto/i386/x86cpuid.s
/freebsd-10.0-release/secure/lib/libcrypto/man/ASN1_OBJECT_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ASN1_STRING_length.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ASN1_STRING_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ASN1_STRING_print_ex.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ASN1_generate_nconf.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_ctrl.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_f_base64.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_f_buffer.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_f_cipher.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_f_md.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_f_null.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_f_ssl.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_find_type.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_new_CMS.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_push.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_read.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_accept.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_bio.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_connect.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_fd.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_file.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_mem.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_null.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_s_socket.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_set_callback.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BIO_should_retry.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_BLINDING_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_CTX_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_CTX_start.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_add.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_add_word.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_bn2bin.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_cmp.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_copy.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_generate_prime.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_mod_inverse.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_mod_mul_montgomery.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_mod_mul_reciprocal.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_num_bytes.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_rand.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_set_bit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_swap.3
/freebsd-10.0-release/secure/lib/libcrypto/man/BN_zero.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_add0_cert.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_add1_recipient_cert.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_add1_signer.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_compress.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_decrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_encrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_final.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_get0_RecipientInfos.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_get0_SignerInfos.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_get0_type.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_get1_ReceiptRequest.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_sign.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_sign_add1_signer.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_sign_receipt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_uncompress.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_verify.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CMS_verify_receipt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CONF_modules_free.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CONF_modules_load_file.3
/freebsd-10.0-release/secure/lib/libcrypto/man/CRYPTO_set_ex_data.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DH_generate_key.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DH_generate_parameters.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DH_get_ex_new_index.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DH_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DH_set_method.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DH_size.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_SIG_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_do_sign.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_dup_DH.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_generate_key.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_generate_parameters.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_get_ex_new_index.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_set_method.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_sign.3
/freebsd-10.0-release/secure/lib/libcrypto/man/DSA_size.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_GET_LIB.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_clear_error.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_error_string.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_get_error.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_load_crypto_strings.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_load_strings.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_print_errors.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_put_error.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_remove_state.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ERR_set_mark.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_BytesToKey.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_DigestInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_DigestSignInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_DigestVerifyInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_EncryptInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_OpenInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_CTX_ctrl.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_CTX_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_cmp.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_decrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_derive.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_encrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_get_default_digest.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_keygen.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_print_private.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_set1_RSA.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_sign.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_verify.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_PKEY_verify_recover.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_SealInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_SignInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/EVP_VerifyInit.3
/freebsd-10.0-release/secure/lib/libcrypto/man/OBJ_nid2obj.3
/freebsd-10.0-release/secure/lib/libcrypto/man/OPENSSL_Applink.3
/freebsd-10.0-release/secure/lib/libcrypto/man/OPENSSL_VERSION_NUMBER.3
/freebsd-10.0-release/secure/lib/libcrypto/man/OPENSSL_config.3
/freebsd-10.0-release/secure/lib/libcrypto/man/OPENSSL_ia32cap.3
/freebsd-10.0-release/secure/lib/libcrypto/man/OPENSSL_load_builtin_modules.3
/freebsd-10.0-release/secure/lib/libcrypto/man/OpenSSL_add_all_algorithms.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PEM_write_bio_CMS_stream.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PEM_write_bio_PKCS7_stream.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PKCS12_create.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PKCS12_parse.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PKCS7_decrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PKCS7_encrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PKCS7_sign.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PKCS7_sign_add_signer.3
/freebsd-10.0-release/secure/lib/libcrypto/man/PKCS7_verify.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RAND_add.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RAND_bytes.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RAND_cleanup.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RAND_egd.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RAND_load_file.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RAND_set_rand_method.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_blinding_on.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_check_key.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_generate_key.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_get_ex_new_index.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_padding_add_PKCS1_type_1.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_print.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_private_encrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_public_encrypt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_set_method.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_sign.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_sign_ASN1_OCTET_STRING.3
/freebsd-10.0-release/secure/lib/libcrypto/man/RSA_size.3
/freebsd-10.0-release/secure/lib/libcrypto/man/SMIME_read_CMS.3
/freebsd-10.0-release/secure/lib/libcrypto/man/SMIME_read_PKCS7.3
/freebsd-10.0-release/secure/lib/libcrypto/man/SMIME_write_CMS.3
/freebsd-10.0-release/secure/lib/libcrypto/man/SMIME_write_PKCS7.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_NAME_ENTRY_get_object.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_NAME_add_entry_by_txt.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_NAME_get_index_by_NID.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_NAME_print_ex.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_STORE_CTX_get_error.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_STORE_CTX_get_ex_new_index.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_STORE_CTX_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_STORE_CTX_set_verify_cb.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_STORE_set_verify_cb_func.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_VERIFY_PARAM_set_flags.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_new.3
/freebsd-10.0-release/secure/lib/libcrypto/man/X509_verify_cert.3
/freebsd-10.0-release/secure/lib/libcrypto/man/bio.3
/freebsd-10.0-release/secure/lib/libcrypto/man/blowfish.3
/freebsd-10.0-release/secure/lib/libcrypto/man/bn.3
/freebsd-10.0-release/secure/lib/libcrypto/man/bn_internal.3
/freebsd-10.0-release/secure/lib/libcrypto/man/buffer.3
/freebsd-10.0-release/secure/lib/libcrypto/man/crypto.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_ASN1_OBJECT.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_DHparams.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_DSAPublicKey.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_PKCS8PrivateKey.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_RSAPublicKey.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_X509.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_X509_ALGOR.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_X509_CRL.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_X509_NAME.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_X509_REQ.3
/freebsd-10.0-release/secure/lib/libcrypto/man/d2i_X509_SIG.3
/freebsd-10.0-release/secure/lib/libcrypto/man/des.3
/freebsd-10.0-release/secure/lib/libcrypto/man/dh.3
/freebsd-10.0-release/secure/lib/libcrypto/man/dsa.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ecdsa.3
/freebsd-10.0-release/secure/lib/libcrypto/man/engine.3
/freebsd-10.0-release/secure/lib/libcrypto/man/err.3
/freebsd-10.0-release/secure/lib/libcrypto/man/evp.3
/freebsd-10.0-release/secure/lib/libcrypto/man/hmac.3
/freebsd-10.0-release/secure/lib/libcrypto/man/i2d_CMS_bio_stream.3
/freebsd-10.0-release/secure/lib/libcrypto/man/i2d_PKCS7_bio_stream.3
/freebsd-10.0-release/secure/lib/libcrypto/man/lh_stats.3
/freebsd-10.0-release/secure/lib/libcrypto/man/lhash.3
/freebsd-10.0-release/secure/lib/libcrypto/man/md5.3
/freebsd-10.0-release/secure/lib/libcrypto/man/mdc2.3
/freebsd-10.0-release/secure/lib/libcrypto/man/pem.3
/freebsd-10.0-release/secure/lib/libcrypto/man/rand.3
/freebsd-10.0-release/secure/lib/libcrypto/man/rc4.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ripemd.3
/freebsd-10.0-release/secure/lib/libcrypto/man/rsa.3
/freebsd-10.0-release/secure/lib/libcrypto/man/sha.3
/freebsd-10.0-release/secure/lib/libcrypto/man/threads.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ui.3
/freebsd-10.0-release/secure/lib/libcrypto/man/ui_compat.3
/freebsd-10.0-release/secure/lib/libcrypto/man/x509.3
/freebsd-10.0-release/secure/lib/libssl/Makefile.man
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CIPHER_get_name.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_COMP_add_compression_method.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_add_extra_chain_cert.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_add_session.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_ctrl.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_flush_sessions.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_free.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_get_ex_new_index.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_get_verify_mode.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_load_verify_locations.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_new.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_sess_number.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_sess_set_cache_size.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_sess_set_get_cb.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_sessions.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_cert_store.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_cert_verify_callback.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_cipher_list.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_client_CA_list.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_client_cert_cb.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_default_passwd_cb.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_generate_session_id.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_info_callback.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_max_cert_list.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_mode.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_msg_callback.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_options.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_psk_client_callback.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_quiet_shutdown.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_session_cache_mode.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_session_id_context.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_ssl_version.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_timeout.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_tlsext_ticket_key_cb.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_tmp_dh_callback.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_tmp_rsa_callback.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_set_verify.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_use_certificate.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_CTX_use_psk_identity_hint.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_SESSION_free.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_SESSION_get_ex_new_index.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_SESSION_get_time.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_accept.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_alert_type_string.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_clear.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_connect.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_do_handshake.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_free.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_SSL_CTX.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_ciphers.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_client_CA_list.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_current_cipher.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_default_timeout.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_error.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_ex_data_X509_STORE_CTX_idx.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_ex_new_index.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_fd.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_peer_cert_chain.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_peer_certificate.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_psk_identity.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_rbio.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_session.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_verify_result.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_get_version.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_library_init.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_load_client_CA_file.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_new.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_pending.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_read.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_rstate_string.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_session_reused.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_set_bio.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_set_connect_state.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_set_fd.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_set_session.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_set_shutdown.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_set_verify_result.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_shutdown.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_state_string.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_want.3
/freebsd-10.0-release/secure/lib/libssl/man/SSL_write.3
/freebsd-10.0-release/secure/lib/libssl/man/d2i_SSL_SESSION.3
/freebsd-10.0-release/secure/lib/libssl/man/ssl.3
/freebsd-10.0-release/secure/usr.bin/openssl/Makefile.man
/freebsd-10.0-release/secure/usr.bin/openssl/man/CA.pl.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/asn1parse.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/c_rehash.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/ca.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/ciphers.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/cms.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/crl.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/crl2pkcs7.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/dgst.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/dhparam.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/dsa.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/dsaparam.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/ec.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/ecparam.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/enc.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/errstr.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/gendsa.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/genpkey.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/genrsa.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/nseq.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/ocsp.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/openssl.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/passwd.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/pkcs12.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/pkcs7.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/pkcs8.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/pkey.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/pkeyparam.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/pkeyutl.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/rand.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/req.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/rsa.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/rsautl.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/s_client.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/s_server.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/s_time.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/sess_id.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/smime.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/speed.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/spkac.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/ts.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/tsget.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/verify.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/version.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/x509.1
/freebsd-10.0-release/secure/usr.bin/openssl/man/x509v3_config.1
/freebsd-10.0-release/sys/conf/newvers.sh
igmp.c
/freebsd-10.0-release/usr.sbin/freebsd-update/freebsd-update.sh
277808 27-Jan-2015 delphij

Fix SCTP SCTP_SS_VALUE kernel memory corruption and disclosure vulnerability
and SCTP stream reset vulnerability.

Security: FreeBSD-SA-15:02.kmem
Security: CVE-2014-8612
Security: FreeBSD-SA-15:03.sctp
Security: CVE-2014-8613
Approved by: so

271669 16-Sep-2014 delphij

Fix Denial of Service in TCP packet processing.

Security: FreeBSD-SA-14:19.tcp
Approved by: so

268434 08-Jul-2014 delphij

Fix kernel memory disclosure in control message and SCTP notifications.

Security: FreeBSD-SA-14:17.kmem
Security: CVE-2014-3952, CVE-2014-3953
Approved by: so

265124 30-Apr-2014 delphij

Fix devfs rules not applied by default for jails.

Fix OpenSSL use-after-free vulnerability.

Fix TCP reassembly vulnerability.

Security: FreeBSD-SA-14:07.devfs
Security: CVE-2014-3001
Security: FreeBSD-SA-14:08.tcp
Security: CVE-2014-3000
Security: FreeBSD-SA-14:09.openssl
Security: CVE-2010-5298
Approved by: so

260378 06-Jan-2014 glebius

Merge r260319 from stable/10 (r260188 from head):

Fix regression from r249894. Now we pass "gw" as argument to if_output
method, thus for multicast case we need it to point at "dst".

PR: 185395
Approved by: re (gjb)

259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

258890 03-Dec-2013 tuexen

MFC r258574:

Only initialize some mutexes for the default VNET.

In r208160, sctp_it_ctl was made a global variable, across all VNETs.
However, sctp_init() is called for every VNET that is created. This results
in the same global mutexes, which are part of sctp_it_ctl, being initialized
repeatedly. This can result in crashes if many jails are created.

To reproduce the problem:
(1) Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS,
INVARIANTS.
(2) Run this command in a loop:
jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo

(see http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html )

Witness will warn about the same mutex being initialized.

Fix the problem by only initializing these mutexes in the default VNET.

MFC r258765:

In
http://svnweb.freebsd.org/changeset/base/258221
I introduced a bug which initialized global locks
whenever the SCTP stack initialized. This was fixed in
http://svnweb.freebsd.org/changeset/base/258574
by rodrigc@. He just initialized the locks for
the default vnet. This fix reverts to the old
behaviour before r258221, which explicitly makes
sure it is only called once, because this works also on
other platforms.

Approved by: re@ (gjb)


258454 21-Nov-2013 tuexen

MFC r256556:
Remove a buggy comparison when manually setting the path MTU.
After fixing, the comparison would have become redundant.
Thanks to Andrew Galante for reporting the issue.

MFC r257272:
Fix compilation if SCTP_DONT_DO_PRIVADDR_SCOPE is defined.
The issue was reported by Andrew Galante.

MFC r257274:
Fix the value of *optlen when calling getsockopt() for
SCTP_REMOTE_UDP_ENCAPS_PORT.
This issue was reported by Andrew Galante.

MFC r257359:
Terminate a debug output with a \n.

MFC r257555:
Changes from upstream to improve compilation when INET or INET6
or none of them is defined.

MFC r257574:
Unlock the lock before destroying it.
This issue was reported by Andrew Galante.

MFC r257800:
Use htons()/ntohs() appropriately.
These issues were reported by Andrew Galante.

MFC r257803:
Make sure that we don't try to build an ASCONF-ACK chunk
larger than what fits in the mbuf cluster.
This issue was reported by Andrew Galante.

MFC r257804:
Get rid of the artificial limitation enforced by
SCTP_AUTH_RANDOM_SIZE_MAX.
This was suggested by Andrew Galante.

MFC r258221:
Cleanups which result in fixes which have been made upstream
and were partially suggested by Andrew Galante.
There is no functional change in FreeBSD.

MFC r258224:
When determining if an address belongs to an stcb, take the address family
into account for wildcard bound endpoints.

MFC r258228:
Remove a stray write operation.

MFC r258235:
Use SCTP_PR_SCTP_TTL when the user provides a positive
timetolive in sctp_sendmsg().

Approved by: re@


257367 29-Oct-2013 andre

MFC r256920:

The TCP delayed ACK logic isn't aware of LRO passing up large aggregated
segments, thinking it received only one segment. This causes it to
delay the ACK for 100ms to wait for another segment which may never
come because all the data was received already.

Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes
us further away from acking every other packet; b) it introduces additional
delay in responding to the sender. The latter is especially bad because it
is in the nature of LRO to aggregate all segments of a burst with no more
coming until an ACK is sent back.

Change the delayed ACK logic to detect LRO segments by being larger than
the MSS for this connection and issuing an immediate ACK for them to keep
the ACK clock ticking without interruption.
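
The decision described above can be sketched in C. This is a minimal illustration, not the code from this revision: the names `ack_action` and `choose_ack` are hypothetical, and the real input path weighs more conditions before it delays an ACK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the kernel's identifiers. */
enum ack_action { ACK_DELAY, ACK_NOW };

/*
 * Decide whether an in-order data segment may use a delayed ACK.
 * A payload longer than the connection's MSS can only be the result
 * of LRO aggregation, so it is acknowledged immediately.
 */
static enum ack_action
choose_ack(uint32_t seg_len, uint32_t mss, bool delack_enabled)
{
    if (!delack_enabled)
        return ACK_NOW;
    if (seg_len > mss)      /* aggregated by LRO */
        return ACK_NOW;
    return ACK_DELAY;
}
```

With an MSS of 1460, a 4380-byte aggregated segment yields `ACK_NOW`, while a single 1460-byte segment may still be delayed.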

Reported by: julian, cperciva
Tested by: cperciva
Reviewed by: lstewart

Approved by: re (glebius)


256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


256186 09-Oct-2013 glebius

When processing ACK in tcp_do_segment, use sbcut_locked() instead of
sbdrop_locked() to cut acked mbufs from the socket buffer. Free this
chain in a batch manner after the socket buffer lock is dropped.

This measurably reduces contention on socket buffer.
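
The "cut under the lock, free after the lock" pattern can be sketched in userland C. This is an assumption-laden illustration: `struct buf`, `struct sockbuf_sketch`, `cut_chain` and `free_chain` are hypothetical stand-ins, the locking itself is left to the caller, and only whole buffers are cut, whereas the kernel routine also handles a partially acknowledged buffer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an mbuf chain and a socket buffer. */
struct buf {
    struct buf *next;
    size_t len;
};

struct sockbuf_sketch {
    struct buf *head;
    size_t cc;              /* bytes currently queued */
};

/*
 * Detach the leading buffers fully covered by 'acked' bytes and return
 * them as a separate chain. The caller is assumed to hold the buffer
 * lock; nothing is freed here, so the lock is held only briefly.
 */
static struct buf *
cut_chain(struct sockbuf_sketch *sb, size_t acked)
{
    struct buf *first = sb->head, *last = NULL;

    while (sb->head != NULL && sb->head->len <= acked) {
        acked -= sb->head->len;
        sb->cc -= sb->head->len;
        last = sb->head;
        sb->head = sb->head->next;
    }
    if (last == NULL)
        return NULL;
    last->next = NULL;
    return first;
}

/* Free a detached chain; meant to run after the lock is dropped. */
static size_t
free_chain(struct buf *m)
{
    size_t n = 0;

    while (m != NULL) {
        struct buf *next = m->next;

        free(m);
        m = next;
        n++;
    }
    return n;
}
```

The point of the split is that the comparatively slow frees happen outside the critical section.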

Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Approved by: re (marius)


255993 02-Oct-2013 markj

Add a separate translator for headers passed to the TCP probes in the
input path. These probes get some of the fields in host order, whereas the
output probes get them in network order, so a single translator isn't
enough. This workaround ensures that the problem is essentially invisible
to users: none of the probe arguments or their fields have changed.

Approved by: re (hrs)


255759 21-Sep-2013 bz

Introduce spares in the TCP syncache and timewait structures
so that fixed TCP_SIGNATURE handling can later be merged.

This is derived from follow-up work to SVN r183001 posted to
net@ on Sep 13 2008.

Approved by: re (gjb)


255523 13-Sep-2013 trociny

Unregister inet/inet6 pfil hooks on vnet destroy.

Discussed with: andre
Approved by: re (rodrigc)


255434 09-Sep-2013 tuexen

Fix the aborting of association with the iterator using an empty
user initiated error cause (using SCTP_ABORT|SCTP_SENDALL).

Approved by: re (delphij)
MFC after: 1 week


255397 08-Sep-2013 trociny

Release the interface last.

Reviewed by: glebius
Approved by: re (kib)


255337 07-Sep-2013 tuexen

When computing the partial delivery point, take the
receiver socket buffer size correctly into account.

MFC after: 1 week


255249 05-Sep-2013 jhb

Use LIST_FOREACH_SAFE() instead of doing it by hand.


255248 05-Sep-2013 jhb

Use an unsigned long when indexing into mfchashtbl[] and mf6ctable[]. This
matches the types used when computing hash indices and the type of the
maximum size of mfchashtbl[].

PR: kern/181821
Submitted by: Sven-Thorsten Dietrich <sven@vyatta.com> (IPv4)
MFC after: 1 week


255235 05-Sep-2013 ae

Remove unused code and sort variables declarations.

PR: kern/181822
MFC after: 1 week


255190 03-Sep-2013 tuexen

Remove redundant field pr_sctp_on.

MFC after: 1 week


255162 02-Sep-2013 tuexen

Use uint16_t instead of in_port_t for consistency with the SCTP code.

MFC after: 1 week


255160 02-Sep-2013 tuexen

All changes affect only SCTP-AUTH:
* Remove non working code related to SHA224.
* Remove support for non-standardised HMAC-IDs using SHA384 and SHA512.
* Prefer SHA256 over SHA1.
* Minor cleanup.

MFC after: 2 weeks


255010 28-Aug-2013 np

Merge r254336 from user/np/cxl_tuning.

Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries. Drivers decide when to flush and what
the inactivity threshold should be.

Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them. When this happens the final LRO flush (normally when the rx
routine is done) does not occur. Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry. Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.

Flushing only inactive LRO entries works better than any of these
alternates that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
work for later.
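
The inactivity-based flush described in this entry can be sketched as follows. This is a simplified model under stated assumptions: `struct lro_entry_sketch` and `flush_inactive` are hypothetical names, timestamps are plain microsecond counters with `now_us` never earlier than an entry's timestamp, and "flushing" just marks the entry free where real code would pass the aggregated packet up the stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified LRO entry: only what the sketch needs. */
struct lro_entry_sketch {
    bool active;
    uint64_t mtime_us;      /* last time a segment was merged in */
};

/*
 * Flush every active entry that has not been modified for at least
 * timeout_us microseconds. Returns the number of entries flushed.
 * Entries still receiving segments are left alone.
 */
static size_t
flush_inactive(struct lro_entry_sketch *e, size_t n, uint64_t now_us,
    uint64_t timeout_us)
{
    size_t flushed = 0;

    for (size_t i = 0; i < n; i++) {
        if (e[i].active && now_us - e[i].mtime_us >= timeout_us) {
            e[i].active = false;
            flushed++;
        }
    }
    return flushed;
}
```

As the log entry notes, the driver chooses both when to call such a routine and what threshold to pass.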

Reviewed by: andre


254925 26-Aug-2013 jhb

Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by: bde
Tested by: exp-run by bdrewery


254893 26-Aug-2013 markj

The second last argument of udp:::receive is supposed to contain the
connection state, not the IP header.

X-MFC with: r254889


254889 25-Aug-2013 markj

Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by: gnn, hiren
MFC after: 1 month


254854 25-Aug-2013 tuexen

Provide human readable debug output.


254834 25-Aug-2013 andre

For now limit printf(9) %x of the 64bit pkthdr.csum_flags field to 32bits.
The upper 32bits are not occupied for now.

Sponsored by: The FreeBSD Foundation


254804 24-Aug-2013 andre

Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features. The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
layer specific union PH_loc for local use. Protocols can flexibly overlay
their own 8 to 64 bit fields to store information while the packet is
worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
rsstype field. Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
with the packet. It is not yet supported in any drivers but allows us to
get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
from the start of the packet. This is important for various offload
capabilities and to relieve the drivers from having to parse the packet
and protocol headers to find out location of checksums and other
information. Header parsing in drivers is a lot of copy-paste and
unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
packet information, like ether_vtag, tso_segsz and csum fields.
Depending on the csum_flags settings some fields may have different usage
making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
stack) and inbound (up the stack) use. The CSUM flags used to be a bit
chaotic and rather poorly documented leading to incorrect use in many
places. Bring clarity into their use through better naming.
Compatibility mappings are provided to preserve the API. The drivers
can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by: The FreeBSD Foundation


254672 22-Aug-2013 tuexen

Export the inpcb features as a 64-bit entity.
Bump __FreeBSD_version to 1000048 since the
modified structure is user visible and used
by netstat, for example.


254670 22-Aug-2013 tuexen

Also make the features of the association 64-bit.
When exporting to xinpcb, just export the lower
32 bits. Using 64 bits there as well will break the
ABI and will be committed separately.

MFC after: 2 weeks
X-MFC with: 254248


254629 22-Aug-2013 delphij

Fix an integer overflow in computing the size of a temporary buffer
that can result in a buffer which is too small for the requested
operation.
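
The general class of bug can be illustrated with an overflow-checked size computation. This is a generic sketch, not the change made in this revision; `checked_buf_size` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute count * elem_size into *out, refusing the request when the
 * product would wrap around SIZE_MAX. Without such a check a wrapped
 * product yields a small allocation that later copies overrun.
 */
static bool
checked_buf_size(size_t count, size_t elem_size, size_t *out)
{
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return false;
    *out = count * elem_size;
    return true;
}
```

A caller would allocate only when the helper returns true and treat false as an invalid request.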

Security: CVE-2013-3077
Security: FreeBSD-SA-13:09.ip_multicast


254527 19-Aug-2013 andre

Reorder the mbuf defines to make more sense and group related flags
together.

Add M_FLAG_PRINTF for use with printf(9) %b identifier.

Use the generic mbuf flags print names in the net80211 code and adjust
the protocol specific bits for their new positions.

Change SCTP M_PROTO mapping from 5 to 1 to fit within the 16bit field
they use internally to store some additional information.

Discussed with: trociny, glebius


254523 19-Aug-2013 andre

Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with: trociny, glebius


254521 19-Aug-2013 andre

Move the SCTP specific definition of M_NOTIFICATION onto a protocol
specific mbuf flag from sys/mbuf.h to netinet/sctp_os_bsd.h. It is
only relevant within SCTP.

Discussed with: tuexen


254519 19-Aug-2013 andre

Move the global M_SKIP_FIREWALL mbuf flags to a protocol layer specific
flag instead. The flag is only used within the IP and IPv6 layer 3
protocols.

Because some firewall packages treat IPv4 and IPv6 packets the same the
flag should have the same value for both.

Discussed with: trociny, glebius


254518 19-Aug-2013 andre

Move ip_reassemble()'s use of the global M_FRAG mbuf flag to a protocol layer
specific flag instead. The flag is only relevant while the packet stays in
the IP reassembly queue.

Discussed with: trociny, glebius


254517 19-Aug-2013 andre

Remove unused M_FRAG, M_FIRSTFRAG and M_LASTFRAG tagging from ip_fragment().
There wasn't any real driver (and hardware) support for it. Modern hardware
does full fragmentation/segmentation offload instead.


254350 15-Aug-2013 markj

Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after: 2 weeks


254338 14-Aug-2013 tuexen

Don't send uninitialized memory (two instances of 4 bytes) in
every cookie on the wire. This bug was reported in
https://bugzilla.mozilla.org/show_bug.cgi?id=905080

MFC after: 3 days


254292 13-Aug-2013 trociny

Virtualize carp(4) variables to have per vnet control.

Reviewed by: ae, glebius


254248 12-Aug-2013 tuexen

Make the features a 64-bit value instead of 32-bit.
This will allow an easier integration of the support
for NDATA.
While there, do also some minor cleanups.
Obtained from: rrs@
MFC after: 2 weeks


253858 01-Aug-2013 tuexen

Micro-optimization suggested in
https://bugzilla.mozilla.org/show_bug.cgi?id=898234
by pchang9. While there simplify the code.

MFC after: 1 week


253571 23-Jul-2013 ae

Remove the large part of struct ipsecstat. Only a few fields of this
structure are used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.

This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.

Reported by: Taku YAMAMOTO, Maciej Milewski


253493 20-Jul-2013 tuexen

Allow the code to be compiled without warnings for any combination
of INET, INET6 and SCTP_DEBUG defines.
The issue was reported by Lally Singh.

MFC after: 2 weeks


253472 19-Jul-2013 tuexen

Get the code compiling without INET and INET6 being defined.
This is not possible in FreeBSD, but in the upstream code.

MFC after: 2 weeks


253395 16-Jul-2013 andre

Free the non-fatal "timestamp missing" debug string manually as it is
not covered by the catch-all free for the error cases.

Found by: Coverity


253282 12-Jul-2013 trociny

A complete duplication of binding should be allowed if on both new and
duplicated sockets a multicast address is bound and either
SO_REUSEPORT or SO_REUSEADDR is set.

But actually it works for the following combinations:

* SO_REUSEPORT is set for the first socket and SO_REUSEPORT for the new;
* SO_REUSEADDR is set for the first socket and SO_REUSEADDR for the new;
* SO_REUSEPORT is set for the first socket and SO_REUSEADDR for the new;

and fails for this:

* SO_REUSEADDR is set for the first socket and SO_REUSEPORT for the new.

Fix the last case.

PR: 179901
MFC after: 1 month


253254 12-Jul-2013 andre

Unbreak VIMAGE by correctly naming the vnet pointer in struct tcp_syncache.

Reported by: trociny, rodrigc


253210 11-Jul-2013 andre

Improve SYN cookies by encoding the MSS, WSCALE (window scaling) and SACK
information into the ISN (initial sequence number) without the additional
use of timestamp bits and switching to the very fast and cryptographically
strong SipHash-2-4 MAC hash algorithm to protect the SYN cookie against
forgeries.

The purpose of SYN cookies is to encode all necessary session state in
the 32 bits of our initial sequence number to avoid storing any information
locally in memory. This is especially important when under heavy spoofed
SYN attacks where we would either run out of memory or the syncache would
fill with bogus connection attempts swamping out legitimate connections.

The original SYN cookies method only stored an indexed MSS value in the
cookie. This isn't sufficient anymore and breaks down in the presence of
WSCALE information which is only exchanged during SYN and SYN-ACK. If we
can't keep track of it then we may severely underestimate the available
send or receive window. This is compounded with large windows whose size
information on the TCP segment header is even lower numerically. A number
of years back SYN cookies were extended to store the additional state in
the TCP timestamp fields, if available on a connection. While timestamps
are common among the BSD, Linux and other *nix systems Windows never enabled
them by default and thus are not present for the vast majority of clients
seen on the Internet.

The common parameters used on TCP sessions have changed quite a bit since
SYN cookies were invented some 17 years ago. Today we have a lot more
bandwidth available making the use of window scaling almost mandatory. Also
SACK has become standard making recovering from packet loss much more
efficient.

This change moves all necessary information into the ISS removing the need
for timestamps. Both the MSS (16 bits) and send WSCALE (4 bits) are stored
in 3 bit indexed form together with a single bit for SACK. While this is
significantly less than the original range, it is sufficient to encode all
common values with minimal rounding.

The MSS depends on the MTU of the path and with the dominance of ethernet
the main value seen is around 1460 bytes. Encapsulations for DSL lines
and some other overheads reduce it by a few more bytes for many connections
seen. Rounding down to the next lower value in some cases isn't a problem
as we send only slightly more packets for the same amount of data.

The send WSCALE index is a bit more tricky as rounding down under-estimates
the send space available towards the remote host, however a small
number of values dominate and are carefully selected again.

The receive WSCALE isn't encoded at all but recalculated based on the local
receive socket buffer size when a valid SYN cookie returns. A listen socket
buffer size is unlikely to change while active.

The index values for MSS and WSCALE are selected for minimal rounding errors
based on large traffic surveys. These values have to be periodically
validated against newer traffic surveys adjusting the arrays tcp_sc_msstab[]
and tcp_sc_wstab[] if necessary.
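
The indexed encoding can be sketched as below: a 3-bit MSS index, a 3-bit send WSCALE index and one SACK bit packed into seven bits. This is an illustration only. The table values are assumptions made up for the sketch rather than the contents of tcp_sc_msstab[] and tcp_sc_wstab[], all function and type names are hypothetical, and neither the SipHash MAC nor the placement of these bits within the ISS is shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed example values in ascending order; not the kernel tables. */
static const uint16_t mss_tab[8] = { 536, 1024, 1200, 1360, 1400, 1440, 1452, 1460 };
static const uint8_t ws_tab[8] = { 0, 1, 2, 4, 6, 7, 8, 9 };

struct sc_state {
    uint16_t mss;
    uint8_t wscale;
    bool sack;
};

/* Largest index whose table value does not exceed v; index 0 is the floor. */
static unsigned
mss_index(uint16_t v)
{
    unsigned i = 7;

    while (i > 0 && mss_tab[i] > v)
        i--;
    return i;
}

static unsigned
ws_index(uint8_t v)
{
    unsigned i = 7;

    while (i > 0 && ws_tab[i] > v)
        i--;
    return i;
}

/* Pack the state into 7 bits: MSS index, WSCALE index, SACK flag. */
static uint8_t
pack_state(struct sc_state s)
{
    return (uint8_t)((mss_index(s.mss) << 4) | (ws_index(s.wscale) << 1) |
        (s.sack ? 1u : 0u));
}

/* Recover the rounded-down values from the packed bits. */
static struct sc_state
unpack_state(uint8_t bits)
{
    struct sc_state s;

    s.mss = mss_tab[(bits >> 4) & 7];
    s.wscale = ws_tab[(bits >> 1) & 7];
    s.sack = (bits & 1) != 0;
    return s;
}
```

With these assumed tables, an MSS of 1448 survives the round trip as 1440, which shows the rounding-down behaviour the log entry describes.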

In addition the hash MAC to protect the SYN cookies is changed from MD5
to SipHash-2-4, a much faster and cryptographically secure algorithm.

Reviewed by: dwmalone
Tested by: Fabian Keil <fk@fabiankeil.de>


253150 10-Jul-2013 andre

Extend debug logging of TCP timestamp related specification
violations.

Update related comments and style.


253099 09-Jul-2013 tuexen

Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

X-MFC with: r252026


253087 09-Jul-2013 ae

Migrate struct carpstats to PCPU counters.


253086 09-Jul-2013 ae

Migrate structs in6_ifstat and icmp6_ifstat to PCPU counters.


253085 09-Jul-2013 ae

Migrate structs ip6stat, icmp6stat and rip6stat to PCPU counters.


253084 09-Jul-2013 ae

Migrate structs arpstat, icmpstat, mrtstat, pimstat and udpstat to PCPU
counters.


253083 09-Jul-2013 ae

Use new macros to implement ipstat and tcpstat using PCPU counters.
Change the interface of kread_counters() to be similar to kread() in netstat(1).


253081 09-Jul-2013 ae

Prepare network statistics structures for migration to PCPU counters.
Use uint64_t as type for all fields of structures.

Changed structures: ahstat, arpstat, espstat, icmp6_ifstat, icmp6stat,
in6_ifstat, ip6stat, ipcompstat, ipipstat, ipsecstat, mrt6stat, mrtstat,
pfkeystat, pim6stat, pimstat, rip6stat, udpstat.

Discussed with: arch@


252779 05-Jul-2013 tuexen

Fix a bug where only 2048 streams were usable even though more than
2048 streams were negotiated on the wire. While there, remove the
hard coded limit of 2048 streams.

MFC after: 3 days


252718 04-Jul-2013 tuexen

When processing an incoming ABORT, SHUTDOWN_COMPLETE or ERROR (NAT related)
chunk, always take the T-bit into account when checking the verification
tag.

MFC after: 3 days


252710 04-Jul-2013 trociny

In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.

PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
MFC after: 1 week


252585 03-Jul-2013 tuexen

Code cleanups.

MFC after: 3 days


252577 03-Jul-2013 np

Catch up with r238990. LLE_DELETED does not clobber everything else in
la_flags since said revision.


252510 02-Jul-2013 hrs

Fix a panic when leaving MC group in a kernel with VIMAGE enabled.
in_leavegroup() is called from an asynchronous task, and
igmp_change_state() requires that curvnet is set by the caller.


252504 02-Jul-2013 lstewart

Import an implementation of the CAIA Delay-Gradient (CDG) congestion control
algorithm, which is based on the 2011 v0.1 patch release and described in the
paper "Revisiting TCP Congestion Control using Delay Gradients" by David Hayes
and Grenville Armitage. It is implemented as a kernel module compatible with the
modular congestion control framework.

CDG is a hybrid congestion control algorithm which reacts to both packet loss
and inferred queuing delay. It attempts to operate as a delay-based algorithm
where possible, but utilises heuristics to detect loss-based TCP cross traffic
and will compete effectively as required. CDG is therefore incrementally
deployable and suitable for use on shared networks.

In collaboration with: David Hayes <david.hayes at ieee.org> and
Grenville Armitage <garmitage at swin edu au>
MFC after: 4 days
Sponsored by: Cisco University Research Program and FreeBSD Foundation


252055 21-Jun-2013 glebius

Fix kmod_*stat_inc() after r249276. The incorrect code actually
increased the pointer, not the memory it points to.
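
The class of bug can be sketched as follows. This is an illustration of the
precedence problem only, with hypothetical function names; it is not the
kmod_*stat_inc() code itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy form: postfix ++ binds tighter than unary *, so this advances the
 * local copy of the pointer and leaves the counter untouched.
 */
static void
ex_stat_inc_buggy(uint64_t *counter)
{
	(void)*counter++;
}

/* Fixed form: parenthesize so that the pointed-to value is incremented. */
static void
ex_stat_inc_fixed(uint64_t *counter)
{
	(*counter)++;
}
```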

In collaboration with: kib
Reported & tested by: Ian FREISLICH <ianf clue.co.za>
Sponsored by: Nginx, Inc.


252026 20-Jun-2013 ae

Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

MFC after: 2 weeks


251502 07-Jun-2013 bms

Disable IGMPv3 link timers on a transition to IGMPv2.

Submitted by: Alan Smithee


251296 03-Jun-2013 andre

Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8191 bytes) as anything
less wouldn't be very useful anymore. The upper limit is still at
IP_MAXPACKET (65535 bytes). Raising it requires further auditing of
the IPv4/v6 code paths as the length field in the IP header would
overflow, leading to confusion in firewalls and other packet handlers
about the real size of the packet.
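
A sketch of the range check implied by these limits. EX_IP_MAXPACKET and
ex_tsomax_in_range() are hypothetical names; the constant assumes the usual
maximum IPv4 packet size of 65535 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_IP_MAXPACKET	65535u	/* assumed maximum IPv4 packet size */

/* True if a driver-supplied TSO limit lies within the allowed range. */
static bool
ex_tsomax_in_range(uint32_t tsomax)
{
	return (tsomax >= EX_IP_MAXPACKET / 8 && tsomax <= EX_IP_MAXPACKET);
}
```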

The placement into "struct ifnet" is a bit hackish but the best place
that was found. When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by: cperciva (earlier version)
Reviewed by: cperciva
Tested by: cperciva
MFC after: 1 week (using spare struct members to preserve ABI)


251248 02-Jun-2013 tuexen

Use LIST_EMPTY when appropriate.

MFC after: 1 week


251054 28-May-2013 tuexen

Remove redundant checks.

MFC after: 2 weeks


250962 24-May-2013 tuexen

Withdraw http://svnweb.freebsd.org/changeset/base/250809
since the real fix is in http://svnweb.freebsd.org/changeset/base/250952.


250809 19-May-2013 tuexen

Initialize the fibnum for outgoing packets to 0. This avoids
crashing due to the usage of uninitialized fibnum.
This bug became visible after
http://svnweb.freebsd.org/changeset/base/250700

MFC after: 2 weeks


250756 17-May-2013 tuexen

Set errno to ETIMEDOUT if an SCTP association times out during
setup.

MFC after: 1 week


250754 17-May-2013 tuexen

Don't send an ABORT chunk with verification 0.

MFC after: 1 week


250613 13-May-2013 jimharris

Fix typo in net.inet.tcp.minmss sysctl description.

MFC after: 3 days


250523 11-May-2013 hrs

Add IFF_MONITOR support to gre(4).

Tested by: Chip Marshall
MFC after: 1 week


250504 11-May-2013 glebius

Rate limit the number of remotely triggered ARP log messages
to 1 log message per second.


250466 10-May-2013 tuexen

Honor the net.inet6.ip6.v6only sysctl variable and the IPV6_V6ONLY
socket option for SCTP sockets in the same way as for UDP or TCP
sockets.

MFC after: 2 weeks


250300 06-May-2013 andre

Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.


250251 04-May-2013 hrs

Use FF02:0:0:0:0:2:FF00::/104 prefix for IPv6 Node Information Group
Address. Although KAME implementation used FF02:0:0:0:0:2::/96 based on
older versions of draft-ietf-ipngwg-icmp-name-lookup, it has been changed
in RFC 4620.

The kernel always joins the /104-prefixed address, and additionally does
/96-prefixed one only when net.inet6.icmp6.nodeinfo_oldmcprefix=1.
The default value of the sysctl is 1.

ping6(8) -N flag now uses /104-prefixed one. When this flag is specified
twice, it uses /96-prefixed one instead.

Reviewed by: ume
Based on work by: Thomas Scheffler
PR: conf/174957
MFC after: 2 weeks


250000 27-Apr-2013 cperciva

Move IPPROTO_IPV6 from #ifdef __BSD_VISIBLE to #if __POSIX_VISIBLE >= 200112
since POSIX 2001 states that it shall be defined.

Reported by: sbruno
Reviewed by: jilles
MFC after: 1 week


249925 26-Apr-2013 glebius

Add const qualifier to the dst parameter of the ifnet if_output method.


249903 25-Apr-2013 glebius

Fix couple of mbuf leaks in incoming ARP processing.


249894 25-Apr-2013 glebius

Introduce a pointer to const variable gw, which points either at the
same place as dst, or to the sockaddr in the routing table.

The const constraint of gw makes us safe from modifying the routing table
accidentally. And the "constantness" of dst allows us to remove several
bandaids, when we switched it back at &ro->ro_dst, now it always
points there.

Reviewed by: rrs


249848 24-Apr-2013 rrs

This fixes the issue with the "randomly changing" default
route. What it was is there are two places in ip_output.c
where we do a goto again. One place was fine, it
copies out the new address and then resets dst = ro->rt_dst;
But the other place does *not* do that, which means earlier
when we found the gateway, we have dst pointing there
aka dst = ro->rt_gateway is done.. then we do a
goto again.. bam now we clobber the default route.

The fix is just to move the again so we are always
doing dst = &ro->rt_dst; in the again loop.

PR: 174749,157796
MFC after: 1 week


249809 23-Apr-2013 andre

When doing RFC3042 limited transmit on the first or second
duplicate ACK make sure we actually have new data to send.
This prevents us from sending unnecessary pure ACKs.

Reported by: Matt Miller <matt@matthewjmiller.net>
Tested by: Matt Miller <matt@matthewjmiller.net>
MFC after: 2 weeks


249742 21-Apr-2013 oleg

Plug static llentry leak (ipv4 & ipv6 were affected).

PR: kern/172985
MFC after: 1 month


249585 17-Apr-2013 gabor

- Correct misspellings of the word useful

Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)


249562 16-Apr-2013 delphij

Fix incomplete printf.

PR: kern/177889
Submitted by: Sven-Thorsten Dietrich <sven vyatta com>
MFC after: 1 week


249559 16-Apr-2013 delphij

Don't leak lock when returning.

PR: kern/177888
Submitted by: Sven-Thorsten Dietrich <sven vyatta com>
MFC after: 1 week


249411 12-Apr-2013 ae

Reflect removing of the counter_u64_subtract() function in the macro.


249372 11-Apr-2013 glebius

Fix tcp_output() so that tcpcb is updated in the same manner when an
mbuf allocation fails, as in a case when ip_output() returns error.

To achieve that, move large block of code that updates tcpcb below
the out: label.

This fixes a panic, that requires the following sequence to happen:

1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss
2) The retransmit timeout happened for the SYN we had sent,
tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output().
In tcp_output m_get() fails.
3) Later on the SYN|ACK for the SYN sent in step 1) came,
tcp_input sets tp->snd_una += 1, which leads to
tp->snd_una > tp->snd_nxt inconsistency, that later panics in
socket buffer code.
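
The sequence above can be replayed in a toy model. Everything here (struct
ex_tcpcb, the ex_* helpers and the update_on_failure switch) is hypothetical
and greatly simplified; it only tracks the two sequence variables and does
not reproduce tcp_output().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ex_tcpcb {
	uint32_t snd_una;
	uint32_t snd_nxt;
};

/* Sequence-space "greater than", in the style of SEQ_GT(). */
static bool
ex_seq_gt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) > 0);
}

/*
 * Model one attempt to send a SYN. With update_on_failure set, a failed
 * mbuf allocation updates the control block like any other send error.
 */
static void
ex_output_syn(struct ex_tcpcb *tp, bool alloc_ok, bool update_on_failure)
{
	if (alloc_ok || update_on_failure)
		tp->snd_nxt = tp->snd_una + 1;	/* a SYN takes one sequence number */
}

/* Returns true if the scenario ends with snd_una > snd_nxt. */
static bool
ex_scenario_inconsistent(bool update_on_failure)
{
	struct ex_tcpcb tp = { .snd_una = 1000, .snd_nxt = 1000 }; /* iss */

	ex_output_syn(&tp, true, update_on_failure);	/* 1) SYN sent */
	tp.snd_nxt = tp.snd_una;			/* 2) rexmt timer fires */
	ex_output_syn(&tp, false, update_on_failure);	/*    m_get() fails */
	tp.snd_una += 1;				/* 3) SYN|ACK arrives */
	return (ex_seq_gt(tp.snd_una, tp.snd_nxt));
}
```

Calling ex_scenario_inconsistent(false) models the old behaviour and ends in
the inconsistent state; with true the two variables stay in order.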

For reference, this bug was fixed in the DragonFly BSD repo:

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419

Reviewed by: andre
Tested by: pho
Sponsored by: Nginx, Inc.
PR: kern/177456
Submitted by: HouYeFei&XiBoLiu <lglion718 163.com>


249327 10-Apr-2013 glebius

Fix build.


249318 09-Apr-2013 andre

Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.


249317 09-Apr-2013 andre

Fix a race condition on tcp listen socket teardown with pending
connections in the accept queue and contiguous new incoming SYNs.

Compared to the original submitter's patch I've moved the test
next to the SYN handling to have it together in a logical unit
and reworded the comment explaining the issue.

Submitted by: Matt Miller <matt@matthewjmiller.net>
Submitted by: Juan Mojica <jmojica@gmail.com>
Reviewed by: Matt Miller (changes)
Tested by: pho
MFC after: 1 week


249302 09-Apr-2013 glebius

Fix VIMAGE build.


249294 09-Apr-2013 ae

Use IP6STAT_INC/IP6STAT_DEC macros to update ip6 stats.

MFC after: 1 week


249276 08-Apr-2013 glebius

Merge from projects/counters: TCP/IP stats.

Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by: Nginx, Inc.


248953 31-Mar-2013 tuexen

Add a macro for checking for IPv4 link local addresses.

MFC after: 1 week


248914 29-Mar-2013 emaste

Keep fwd_tag around for subsequent pcb lookups

For TIMEWAIT handling tcp_input may have to jump back for an additional
pass through pcblookup. Prior to this change the fwd_tag had been
discarded after the first lookup, so a new connection attempt delivered
locally via 'ipfw fwd' would fail to find a match.

As of r248886 the tag will be detached and freed when passed to the
socket buffer.


248552 20-Mar-2013 melifaro

Add ipfw support for setting/matching DiffServ codepoints (DSCP).

Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.
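
A sketch of the incremental update from RFC 1624 (equation 3,
HC' = ~(~HC + ~m + m')) applied to a DSCP change. The ex_* helpers are
hypothetical and work on 16-bit header words in host order; this is not the
ipfw code.

```c
#include <assert.h>
#include <stdint.h>

/* Fold carries back into the low 16 bits (one's-complement addition). */
static uint16_t
ex_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/* RFC 1624 eqn. 3: new checksum from old checksum, old word, new word. */
static uint16_t
ex_cksum_update(uint16_t hc, uint16_t m_old, uint16_t m_new)
{
	uint32_t sum;

	sum = (uint16_t)~hc;
	sum += (uint16_t)~m_old;
	sum += m_new;
	return ((uint16_t)~ex_fold16(sum));
}

/*
 * Replace the DSCP in the first header word (version/IHL in the high byte,
 * TOS in the low byte), keeping the two low-order ECN bits.
 */
static uint16_t
ex_set_dscp(uint16_t word0, uint8_t dscp)
{
	return ((uint16_t)((word0 & 0xff03) | ((dscp & 0x3f) << 2)));
}
```

Only the changed 16-bit word and the old checksum are needed, so the rest of
the header does not have to be summed again.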

Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).

Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):

Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin

PR: kern/102471, kern/121122
MFC after: 2 weeks


248416 17-Mar-2013 glebius

In m_megapullup() instead of reserving some space at the end of packet,
m_align() it, reserving space to prepend data.

Reviewed by: mav


248373 16-Mar-2013 glebius

- Replace compat macros with function calls.


248326 15-Mar-2013 glebius

We can, and should use M_WAITOK here.

Sponsored by: Nginx, Inc.


248324 15-Mar-2013 glebius

Use m_get/m_gethdr instead of compat macros.

Sponsored by: Nginx, Inc.


248323 15-Mar-2013 glebius

- Use m_getcl() instead of hand allocating.

Sponsored by: Nginx, Inc.


248207 12-Mar-2013 glebius

Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.


248158 11-Mar-2013 glebius

Remove LIBALIAS_LOCK_ASSERT(), including a couple with an uninitialized
argument, in code that isn't compiled in kernel.

PR: kern/176667
Sponsored by: Nginx, Inc.


247906 07-Mar-2013 lstewart

The hashmask returned by hashinit() is a valid index in the returned hash array.
Fix a siftr(4) potential memory leak and INVARIANTS triggered kernel panic in
hashdestroy() by ensuring the last array index in the flow counter hash table is
flushed of entries.
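
The off-by-one can be sketched with a plain array standing in for the hash
table. The ex_* functions are hypothetical; the only point is that a table
with mask `mask` has `mask + 1` buckets, so the loop bound must be inclusive.

```c
#include <assert.h>

/* Buggy flush: stops before the last bucket at index `mask`. */
static void
ex_flush_buggy(int *buckets, unsigned long mask)
{
	unsigned long i;

	for (i = 0; i < mask; i++)
		buckets[i] = 0;
}

/* Fixed flush: `mask` itself is a valid index and is visited too. */
static void
ex_flush_fixed(int *buckets, unsigned long mask)
{
	unsigned long i;

	for (i = 0; i <= mask; i++)
		buckets[i] = 0;
}

/* Count entries still present in all mask + 1 buckets. */
static int
ex_entries_left(const int *buckets, unsigned long mask)
{
	unsigned long i;
	int n = 0;

	for (i = 0; i <= mask; i++)
		n += buckets[i];
	return (n);
}
```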

MFC after: 3 days


247777 04-Mar-2013 davide

- Make callout(9) tickless, relying on eventtimers(4) as backend for
precise time event generation. This greatly improves the granularity of
callouts, which are no longer constrained to wait for the next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather information about the callwheel, introducing
a new sysctl to obtain stats.

This change breaks the KBI. The struct callout fields have been changed; in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.

Together with: mav
Reviewed by: attilio, bde, luigi, phk
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo (amd64, sparc64), marius (sparc64), ian (arm),
markj (amd64), mav, Fabian Keil


247412 27-Feb-2013 tuexen

Fix a potential race in setting errno when an
association goes down.
Reported by Mozilla in
https://bugzilla.mozilla.org/show_bug.cgi?id=845513

MFC after: 3 days


247104 21-Feb-2013 gallatin

Fix tcp_lro_rx_ipv4() for drivers that do not set CSUM_IP_CHECKED.
Specifically, in_cksum_hdr() returns 0 (not 0xffff) when the IPv4
checksum is correct. Without this fix, the tcp_lro code will reject
good IPv4 traffic from drivers that do not implement IPv4 header
hardware csum offload.
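
Why a correct header yields 0 can be seen from a small stand-alone checksum
routine. ex_cksum() is a hypothetical helper operating on 16-bit words in
host order; it is not in_cksum_hdr(), but it follows the same one's-complement
arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement sum of all header words, folded to 16 bits and then
 * complemented. Over a header that already contains a valid checksum the
 * folded sum is 0xffff, so the returned complement is 0.
 */
static uint16_t
ex_cksum(const uint16_t *words, size_t nwords)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += words[i];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/* A received header is good when the checksum over it comes out as 0. */
static int
ex_header_ok(const uint16_t *words, size_t nwords)
{
	return (ex_cksum(words, nwords) == 0);
}
```

A check written as "== 0xffff" against such a routine would reject every
valid header, which matches the failure described above.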

Sponsored by: Myricom Inc.

MFC after: 7 days


247044 20-Feb-2013 pluknet

ip_savecontrol() style fixes. No functional changes.
- fix indentation
- put the operator at the end of the line for long statements
- remove spaces between the type and the variable in a cast
- remove excessive parentheses

Tested by: md5


246687 11-Feb-2013 tuexen

Send the adaptation layer indication only if set by the user.

MFC after: 3 days
Discussed with: rrs


246674 11-Feb-2013 tuexen

Don't send kernel provided information in the User Initiated
ABORT cause, since the user can also provide this kind of
information. So the receiver doesn't know who provided the
information.
While there: Fix a bug where the stack would send a malformed
ABORT chunk when using a send() call with SCTP_ABORT|SCTP_SENDALL
flags.

MFC after: 3 days


246659 11-Feb-2013 glebius

Resolve source address selection in presence of CARP. Add a couple
of helper functions:

- carp_master() - boolean function which is true if an address
is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
and is aware of CARP.

Utilize ifa_preferred() in ifa_ifwithnet().

The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. Maybe we will approach this problem again later.

Reported & tested by: Anton Yuzhaninov <citrin citrin.ru>
Sponsored by: Nginx, Inc


246635 10-Feb-2013 tuexen

Make sure that received packets for removed addresses are handled
consistently. While there, make variable names consistent.

MFC after: 3 days


246595 09-Feb-2013 tuexen

Cleanup the handling of address scopes. Announce in the INIT/INIT-ACK
only the supported address types. While there, do some whitespace
cleanups.

MFC after: 1 week


246588 09-Feb-2013 tuexen

Fix a bug where HEARTBEATs were still sent in SHUTDOWN_SENT or
SHUTDOWN_ACK_SENT state. While there, make the corresponding
code consistent.

MFC after: 1 week


246210 01-Feb-2013 jhb

Add placeholder constants to reserve a portion of the socket option
name space for use by downstream vendors to add custom options.

MFC after: 2 weeks


246208 01-Feb-2013 andre

uma_zone_set_max() directly returns the rounded effective zone
limit. Use the return value directly instead of doing a second
uma_zone_set_max() step.

MFC after: 1 week


246144 31-Jan-2013 glebius

- Move AUTHORS and ACKNOWLEDGEMENTS to the end of the page.
- Add myself to list of authors.


246143 31-Jan-2013 glebius

Retire struct sockaddr_inarp.

Since ARP and routing are separated, "proxy only" entries
don't have any meaning, thus we don't need additional field
in sockaddr to pass SIN_PROXY flag.

New kernel is binary compatible with old tools, since sizes
of sockaddr_inarp and sockaddr_in match, and sa_family are
filled with same value.

The structure declaration is left for compatibility with
third party software, but in tree code no longer use it.

Reviewed by: ru, andre, net@


246130 30-Jan-2013 glebius

Utilize m_get2() to get mbuf of appropriate size.


245934 26-Jan-2013 np

Add checks for SO_NO_OFFLOAD in a couple of places that I missed earlier
in r245915.


245932 26-Jan-2013 np

Teach toe_l2_resolve to resolve IPv6 destinations too.

Reviewed by: bz@


245924 26-Jan-2013 np

Move lle_event to if_llatbl.h

lle_event replaced arp_update_event after the ARP rewrite and ended up
in if_ether.h simply because arp_update_event used to be there too.
IPv6 neighbor discovery is going to grow lle_event support and this is a
good time to move it to if_llatbl.h.

The two in-tree consumers of this event - OFED and toecore - are not
affected.

Reviewed by: bz@


245921 25-Jan-2013 np

There is no need to call into the TOE driver twice in pru_rcvd (tod_rcvd
and then tod_output right after that).

Reviewed by: bz@


245919 25-Jan-2013 np

Add TCP_OFFLOAD hook in syncache_respond for IPv6 too, just like the one
that exists for IPv4.

Reviewed by: bz@


245916 25-Jan-2013 np

Teach toe_4tuple_check() to deal with IPv6 4-tuples too.

Reviewed by: bz@


245915 25-Jan-2013 np

Heed SO_NO_OFFLOAD.

MFC after: 1 week


245914 25-Jan-2013 np

Remove redundant test, we know inp_lport is 0.

MFC after: 1 week


245823 22-Jan-2013 jhb

Use decimal values for UDP and TCP socket options rather than hex to avoid
implying that these constants should be treated as bit masks.

Reviewed by: net
MFC after: 1 week


245783 22-Jan-2013 lstewart

Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"
logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are
unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could
therefore potentially corrupt the result (although under normal operation,
neither variable should legitimately exceed 32 bits).
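
The truncation can be reproduced in isolation. The helpers below are
hypothetical; ex_min32() stands in for a min() that takes 32 bit unsigned
arguments and uint64_t stands in for unsigned long on a 64 bit host.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min() whose parameters are 32 bit unsigned integers. */
static uint32_t
ex_min32(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

/* Width-preserving alternative. */
static uint64_t
ex_min64(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

/* Passing 64 bit values through the 32 bit min drops the upper bits. */
static uint64_t
ex_window_truncating(uint64_t cwnd, uint64_t wnd)
{
	return (ex_min32((uint32_t)cwnd, (uint32_t)wnd));
}

static uint64_t
ex_window_full(uint64_t cwnd, uint64_t wnd)
{
	return (ex_min64(cwnd, wnd));
}
```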

[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html

Submitted by: jhb
MFC after: 1 week


245238 09-Jan-2013 jhb

Don't drop options from the third retransmitted SYN by default. If the
SYNs (or SYN/ACK replies) are dropped due to network congestion, then the
remote end of the connection may act as if options such as window scaling
are enabled but the local end will think they are not. This can result in
very slow data transfers in the case of window scaling disagreements.

The old behavior can be obtained by setting the
net.inet.tcp.rexmit_drop_options sysctl to a non-zero value.

Reviewed by: net@
MFC after: 2 weeks


244989 03-Jan-2013 peter

Temporarily revert rev 244678. This is causing loopback problems with
the lo (loopback) interfaces.


244730 27-Dec-2012 tuexen

Some cleanups.

MFC after: 3 days


244729 27-Dec-2012 tuexen

Minor cleanups of debug messages.

MFC after: 3 days


244728 27-Dec-2012 tuexen

Fix a copy and paste error.

MFC after: 3 days


244683 25-Dec-2012 glebius

Garbage collect carp_cksum().


244681 25-Dec-2012 glebius

Change net.inet.carp.demotion sysctl to add the supplied value
to the current demotion factor instead of assigning it.

This allows external scripts to control demotion factor together
with kernel in a raceless manner.


244680 25-Dec-2012 glebius

Fix sysctl_handle_int() usage. Either arg1 or arg2 should be supplied,
and arg2 doesn't pass size of arg1.


244678 25-Dec-2012 glebius

The SIOCSIFFLAGS ioctl handler runs if_up()/if_down() that notify
all interested parties in case if interface flag IFF_UP has changed.

However, not only SIOCSIFFLAGS can raise the flag, but SIOCAIFADDR
and SIOCAIFADDR_IN6 can, too. The actual |= is done not in the protocol
code, but in code of interface drivers. To fix this historical layering
violation, we will check whether ifp->if_ioctl(SIOCSIFADDR) raised the
IFF_UP flag, and if it did, run the if_up() handler.

This fixes configuring an address under CARP control on an interface
that was initially !IFF_UP.

P.S. I intentionally omitted handling the IFF_SMART flag. This flag was
never ever used in any driver since it was introduced, and since it
means another layering violation, it should be garbage collected instead
of pretended to be supported.


244665 24-Dec-2012 glebius

Minor style(9) changes:
- Remove declaration in initializer.
- Add empty line between logical blocks.


244387 18-Dec-2012 glebius

Fix !INET6 build after r244365.


244386 18-Dec-2012 glebius

Clear correct flag in INET6 case.


244365 17-Dec-2012 ae

Since we use different flags to detect tcp forwarding, and we share the
same code for IPv4 and IPv6 in tcp_input, we should check both
M_IP_NEXTHOP and M_IP6_NEXTHOP flags.

MFC after: 3 days


244183 13-Dec-2012 glebius

Fix problem in r238990. The LLE_LINKED flag should be tested prior to
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.

Reported by: Ian FREISLICH <ianf clue.co.za>


244157 12-Dec-2012 glebius

Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it,
but later after processing and freeing the tag, we need to jump back again
to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to
process and free the tag for second time.

Reported & tested by: Pawel Tyll <ptyll nitronet.pl>
MFC after: 3 days


244033 08-Dec-2012 tuexen

Get it compiling without INET and INET6 support (mainly userland stack).

MFC after: 2 weeks


244031 08-Dec-2012 pjd

More warnings for zones that depend on the kern.ipc.maxsockets limit.

Obtained from: WHEEL Systems


244026 08-Dec-2012 tuexen

Use correct padding of the ABORT chunk when a user initiated
abort cause is used.

MFC after: 2 weeks


244021 08-Dec-2012 tuexen

Ensure that the padding of the last parameter of an INIT chunk
is not included in the chunk length as required by RFC 4960.
While there, cleanup sctp_send_initiate().

MFC after: 2 weeks


243882 05-Dec-2012 glebius

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


243624 27-Nov-2012 andre

Remove unused and unnecessary CSUM_IP_FRAGS checksumming capability.
Checksumming the IP header of fragments is no different from doing
normal IP headers.

Discussed with: yongari
MFC after: 1 week


243621 27-Nov-2012 andre

Add DELACK to list of timers.

MFC after: 1 week


243603 27-Nov-2012 np

Make sure that tcp_timer_activate() correctly sees TCP_OFFLOAD (or not).


243594 27-Nov-2012 alfred

Auto size the tcbhashsize structure based on max sockets.

While here, also make the code that enforces power-of-two more
forgiving, instead of just resetting to 512, graciously round-down
to the next lower power of two.
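
Rounding down to a power of two can be sketched as below.
ex_rounddown_pow2() is a hypothetical helper, not the code used for
tcbhashsize.

```c
#include <assert.h>

/* Largest power of two that is less than or equal to v (v must be > 0). */
static unsigned long
ex_rounddown_pow2(unsigned long v)
{
	unsigned long p = 1;

	assert(v != 0);
	/* Compare against v / 2 so that doubling p can never overflow. */
	while (p <= v / 2)
		p *= 2;
	return (p);
}
```

A value that is already a power of two is returned unchanged.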


243565 26-Nov-2012 tuexen

Add support for sctp_peeloff() also in the front states of the
association.

MFC after: 3 days


243564 26-Nov-2012 tuexen

Find the endpoint for an incoming packet also if the endpoint
comes from sctp_peeloff().

MFC after: 3 days


243558 26-Nov-2012 tuexen

Allow shutdown() to be used on fds returned from sctp_peeloff().

MFC after: 3 days


243516 25-Nov-2012 tuexen

Remove unused function.

MFC after: 1 week


243186 17-Nov-2012 tuexen

Add support for SCTP/UDP/IPV6.
This completes the support of
http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-udp-encaps

MFC after: 1 week


243157 16-Nov-2012 tuexen

Get the accounting working. We now have counters how many
chunks for each SCTP outgoing stream are in the send and
sent queue.
While there, improve the naming of NR-SACK related constants
recently introduced.

MFC after: 1 week


242854 10-Nov-2012 rdivacky

Initialize hdrlen to 0 to avoid clang warning in NOINET case.


242745 08-Nov-2012 bz

Cleanup some whitespace in this file to get it out of an upcoming patch.

MFC after: 10 days


242714 07-Nov-2012 tuexen

Add per outgoing stream accounting for chunks in the send
and sent queue. This provides no functional change, but is
a preparation for an upcoming stream reset improvement.
Done with rrs@.

MFC after: 1 week


242709 07-Nov-2012 tuexen

Add some changes missed in the last commit.

MFC after: 1 week
X-MFC with: 242708


242708 07-Nov-2012 tuexen

Improve PR-SCTP if used in combination with NR-SACK.
Based on work done by Mohammad Rajiullah.

MFC after: 1 week


242692 07-Nov-2012 kevlo

Fix typo; s/ouput/output


242680 06-Nov-2012 mjg

Fix possible spurious sbunlock in sctp_sorecvmsg.

Reviewed by: tuexen
Approved by: trasz (mentor)
MFC after: 3 days


242627 05-Nov-2012 tuexen

Move from early SSN assignment to late SSN assignment.
This doesn't change functionality, but makes upcoming change
much easier.
Developed with rrs@ at the IETF 85.

MFC after: 1 week


242601 05-Nov-2012 andre

Back out r242262. The simplified window change/update logic wasn't
complete and ready for production use.

PR: kern/173309


242463 02-Nov-2012 ae

Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by: andre


242327 29-Oct-2012 tuexen

Whitespace changes due to upstream integration of SCTP changes in the
FreeBSD code base.


242326 29-Oct-2012 tuexen

Add braces (as used elsewhere in the SCTP code).


242325 29-Oct-2012 tuexen

Use ntohs() and htons() in correct order. However, this doesn't change
functionality.


242311 29-Oct-2012 andre

Forced commit to provide the correct commit message to r242251:

Defer sending an independent window update if a delayed ACK is pending
saving a packet. The window update then gets piggy-backed on the next
already scheduled ACK.

Added grammar fixes as well.

MFC after: 2 weeks


242308 29-Oct-2012 andre

Define the delayed ACK timeout value directly as hz/10 instead of
obfuscating it by going through PR_FASTHZ. No functional change.

MFC after: 2 weeks


242267 28-Oct-2012 andre

If the user has closed the socket then drop a persisting connection
after a much reduced timeout.

Typically web servers close their sockets quickly under the assumption
that the TCP connection goes away as well. That is not entirely true
however. If the peer closed the window we're going to wait for a long
time with lots of data in the send buffer.

MFC after: 2 weeks


242266 28-Oct-2012 andre

Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.
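
For orientation, the upper bound given in the draft is
min(10*MSS, max(2*MSS, 14600)) bytes. The helper below is a hypothetical
sketch of that formula only and ignores the sysctl and the loss handling
mentioned further down.

```c
#include <assert.h>
#include <stdint.h>

/* Initial congestion window in bytes: min(10*MSS, max(2*MSS, 14600)). */
static uint32_t
ex_initial_cwnd(uint32_t mss)
{
	uint32_t lower = 2 * mss > 14600 ? 2 * mss : 14600;
	uint32_t upper = 10 * mss;

	return (upper < lower ? upper : lower);
}
```

For a typical Ethernet MSS of 1460 bytes this gives 14600 bytes, that is
10 segments; for very large segments the window is limited to 2 segments.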

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet. Also the restart
window isn't yet increased as allowed. Both will be adjusted with
upcoming changes.

It is enabled by default. In Linux it has been enabled since kernel 3.0.

MFC after: 2 weeks


242264 28-Oct-2012 andre

Update comment to reflect the change made in r242263.

MFC after: 2 weeks


242263 28-Oct-2012 andre

Add SACK_PERMIT to the list of TCP options that are switched off after
retransmitting a SYN three times.

MFC after: 2 weeks


242262 28-Oct-2012 andre

Simplify and enhance the window change/update acceptance logic,
especially in the presence of bi-directional data transfers.

snd_wl1 tracks the right edge, including data in the reassembly
queue, of valid incoming data. This makes it like rcv_nxt plus
reassembly. It never goes backwards to prevent older, possibly
reordered segments from updating the window.

snd_wl2 tracks the left edge of sent data. This makes it a duplicate
of snd_una. However joining them right now is difficult due to
separate update dependencies in different places in the code flow.

snd_wnd tracks the current advertized send window by the peer. In
tcp_output() the effective window is calculated by subtracting the
already in-flight data, snd_nxt less snd_una, from it.
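
The effective window calculation described here amounts to the following.
ex_effective_window() is a hypothetical helper and not the code in
tcp_output(); clamping at zero is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Advertised window minus data in flight, never less than zero. */
static uint32_t
ex_effective_window(uint32_t snd_wnd, uint32_t snd_una, uint32_t snd_nxt)
{
	/* Unsigned subtraction stays correct across sequence wraparound. */
	uint32_t inflight = snd_nxt - snd_una;

	return (inflight >= snd_wnd ? 0 : snd_wnd - inflight);
}
```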

ACK's become the main clock of window updates and will always update
the window when the left edge of what we sent is advanced. The ACK
clock is the primary signaling mechanism in ongoing data transfers.
This works reliably even in the presence of reordering, reassembly
and retransmitted segments. The ACK clock is most important because
it determines how much data we are allowed to inject into the network.

Zero window updates, which get us out of persistence mode, are crucial. Here
a segment that neither moves ACK nor SEQ but enlarges WND is accepted.

When the ACK clock is not active (that is we're not or no longer
sending any data) any segment that moves the extended right SEQ edge,
including out-of-order segments, updates the window. This gives us
updates especially during ping-pong transfers where the peer isn't
done consuming the already acknowledged data from the receive buffer
while responding with data.

The SSH protocol is a prime candidate to benefit from the improved
bi-directional window update logic as it has its own windowing
mechanism on top of TCP and is frequently sending back protocol ACK's.

Tcpdump provided by: darrenr
Tested by: darrenr
MFC after: 2 weeks


242261 28-Oct-2012 andre

For retransmits of SYN|ACK from the syncache use the slightly more
aggressive special tcp_syn_backoff[] retransmit schedule instead of
the normal tcp_backoff[] schedule for established connections.

MFC after: 2 weeks


242260 28-Oct-2012 andre

When retransmitting SYN in TCPS_SYN_SENT state use TCPTV_RTOBASE,
the default retransmit timeout, as base to calculate the backoff
time until next try instead of the TCP_REXMTVAL() macro which only
works correctly when we already have measured an actual RTT+RTTVAR.

Before it would cause the first retransmit at RTOBASE, the next
four at the same time (!) about 200ms later, and then another one
again RTOBASE later.
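
A rough sketch of the intended schedule, assuming the timeout for each
SYN retransmit is a fixed base RTO scaled by a per-attempt multiplier
rather than a value derived from a measured RTT. The base value and both
multiplier tables are illustrative assumptions; the real TCPTV_RTOBASE,
tcp_syn_backoff[] and tcp_backoff[] values may differ:

```c
#include <stddef.h>

#define RTO_BASE_MS	3000	/* assumed base RTO in ms, illustration only */

/* Illustrative multiplier tables, not the kernel's actual values. */
const int syn_backoff[] = { 1, 1, 1, 2, 4, 8 };
const int est_backoff[] = { 1, 2, 4, 8, 16, 32 };

/* Timeout before retransmit number `shift`, clamped to the table length. */
int
rexmt_timeout_ms(const int *table, size_t len, size_t shift)
{
	if (shift >= len)
		shift = len - 1;
	return (RTO_BASE_MS * table[shift]);
}
```

Because the base is a constant, every entry in the schedule is well
defined even before any RTT sample exists.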

MFC after: 2 weeks


242257 28-Oct-2012 andre

Remove bogus 'else' in #ifdef that prevented the rttvar from being reset
tcp_timer_rexmt() on retransmit for IPv6 sessions.

MFC after: 2 weeks


242255 28-Oct-2012 andre

Allow arbitrary MSS sizes and don't mind about the cluster size anymore.
We've got more cluster sizes for quite some time now and the originally
imposed limits and the previously codified thoughts on efficiency gains
are no longer true.

MFC after: 2 weeks


242254 28-Oct-2012 andre

Change the syncache count reporting the current number of entries
from an unprotected u_int that reports garbage on SMP to a function
based sysctl obtaining the current value from UMA.

Also read back the actual cache_limit after page size rounding by UMA.

PR: kern/165879
MFC after: 2 weeks


242253 28-Oct-2012 andre

Simplify implementation of net.inet.tcp.reass.maxsegments and
net.inet.tcp.reass.cursegments.

MFC after: 2 weeks


242252 28-Oct-2012 andre

Prevent a flurry of forced window updates when an application is
doing small reads on a (partially) filled receive socket buffer.

Normally one would send a window update every time the available
space in the socket buffer increases by two times MSS. This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender. There still is available space in
the window and the sender can continue sending data. All window
updates then get carried by the regular ACKs. Only when the socket
buffer was (almost) full and the window closed accordingly does a window
update deliver new information and allow the sender to start
sending more data again.

Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small. The next regular data ACK will carry and
report the exact window size again.
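
The rule above can be sketched as a single predicate. This is a
simplified model, not the tcp_output() code: the parameter names are
hypothetical, and the 8 * MSS cutoff for a "very small" buffer is an
assumption of this sketch:

```c
#include <stdbool.h>

/*
 * adv:    bytes of newly available window not yet advertised to the peer
 * recwin: space currently available in the receive socket buffer
 * hiwat:  full capacity of the receive socket buffer
 * mss:    maximum segment size
 */
bool
should_send_window_update(long adv, long recwin, long hiwat, long mss)
{
	if (adv < 2 * mss)
		return (false);		/* never more often than every two MSS */
	return (recwin <= hiwat / 8 ||	/* buffer nearly full, window nearly closed */
	    adv >= hiwat / 4 ||		/* a large share of the buffer was freed */
	    hiwat <= 8 * mss);		/* assumed cutoff for a very small buffer */
}
```

For a 64 KB buffer and an MSS of 1460, a small read that frees two
segments while plenty of space remains does not trigger an update; the
next regular ACK carries the window instead.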

Reported by: sbruno
Tested by: darrenr
Tested by: Darren Baginski
PR: kern/116335
MFC after: 2 weeks


242251 28-Oct-2012 andre

When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after: 2 weeks


242250 28-Oct-2012 andre

When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment. This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after: 2 weeks


242249 28-Oct-2012 andre

Adjust the initial default CWND upon connection establishment to the
new and increased values specified by RFC5681 Section 3.1.

The even larger initial CWND per RFC3390, if enabled, is not affected.
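
For reference, a minimal sketch of the initial window rule from RFC 5681
Section 3.1, including the requirement to fall back to one segment when
the SYN or SYN/ACK was lost. The function name and signature are
hypothetical and do not mirror the kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Initial congestion window in bytes for a sender MSS of `smss`. */
uint32_t
initial_cwnd(uint32_t smss, bool syn_retransmitted)
{
	if (syn_retransmitted)
		return (smss);		/* SYN or SYN/ACK was lost: one segment */
	if (smss > 2190)
		return (2 * smss);
	if (smss > 1095)
		return (3 * smss);
	return (4 * smss);
}
```

With a typical Ethernet MSS of 1460 bytes this gives three segments
(4380 bytes), or a single segment after a retransmitted SYN.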

MFC after: 2 weeks


242161 26-Oct-2012 glebius

o Remove last argument to ip_fragment(), and obtain all needed information
on checksums directly from mbuf flags. This simplifies code.
o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
hardware. Some drivers may not announce CSUM_IP in their if_hwassist,
although they try to do checksums if CSUM_IP is set on the mbuf. Example is em(4).
o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
After this change CSUM_DELAY_IP vanishes from the stack.

Submitted by: Sebastian Kuzminsky <seb lineratesystems.com>


242079 25-Oct-2012 ae

Remove the IPFIREWALL_FORWARD kernel option and make it possible to turn
on the related functionality at runtime via the sysctl variable
net.pfil.forward. It is turned off by default.

Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks


242077 25-Oct-2012 glebius

After r241923 the updated ip_len is no longer needed.


242076 25-Oct-2012 glebius

Fix error in r241913 that had broken fragment reassembly.


241926 23-Oct-2012 glebius

Use ip_stripoptions() instead of handrolled version.


241925 23-Oct-2012 glebius

Simplify ip_stripoptions() reducing number of intermediate
variables.


241923 23-Oct-2012 glebius

Do not reduce ip_len by size of IP header in the ip_input()
before passing a packet to protocol input routines.
For several protocols this means that now the protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.

Make ip_stripoptions() adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.


241916 22-Oct-2012 delphij

Remove __P.

Submitted by: kevlo
Reviewed by: md5(1)
MFC after: 2 months


241913 22-Oct-2012 glebius

Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert to host
byte order before passing a packet to an application. Probably
this will remain for ages for compatibility.

[2] The ip_input() still subtracts header len from ip->ip_len,
but this is planned to be fixed soon.
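
A small sketch of the convention described above, assuming a
hypothetical two-field stand-in for the IPv4 header: the stored field
stays in network byte order and any arithmetic happens on a host-order
local copy.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical, minimal stand-in for the length-related header fields. */
struct mini_ip {
	uint8_t  hl;	/* header length in 32-bit words */
	uint16_t len;	/* total length, always stored in network byte order */
};

/* Payload length computed in a host-order local; the header is untouched. */
uint16_t
payload_len(const struct mini_ip *ip)
{
	uint16_t total = ntohs(ip->len);	/* host byte order copy */

	return ((uint16_t)(total - (ip->hl << 2)));
}
```

The caller never has to ask which byte order the header is in at a given
point, because the answer is always network byte order.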

Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>


241735 19-Oct-2012 zont

- Update cachelimit after hashsize and bucketlimit were set.

Reported by: az
Reviewed by: melifaro
Approved by: kib (mentor)
MFC after: 1 week


241686 18-Oct-2012 andre

Mechanically remove the last stray remains of spl* calls from net*/*.
They have been Noop's for a long time now.


241648 17-Oct-2012 emaste

Avoid potential bad pointer dereference.

Previously RuleAdd would leave entry->la unset for the first entry in
the proxyList.

Sponsored by: ADARA Networks
MFC After: 1 week


241575 15-Oct-2012 glebius

We don't need to convert ip6_len to host byte order before
ip6_output(), the IPv6 stack is working in net byte order.

The reason this code worked before is that ip6_output()
doesn't look at ip6_plen at all and recalculates it based
on mbuf length.


241547 14-Oct-2012 glebius

Fix a miss from r241344: in ip_mloopback() we need to go to
net byte order prior to calling in_delayed_cksum().

Reported by: Olivier Cochard-Labbe <olivier cochard.me>


241502 13-Oct-2012 melifaro

Cleanup documentation: cloning route support has been removed in r186119.

MFC after: 2 weeks


241481 12-Oct-2012 glebius

Revert fixup of ip_len from r241480. Now stack isn't yet
ready for that change.


241480 12-Oct-2012 glebius

In ip_stripoptions():
- Remove unused argument and incorrect comment.
- Fixup ip_len after stripping.


241406 10-Oct-2012 melifaro

Do not check if found IPv4 rte is dynamic if net.inet.icmp.drop_redirect is
enabled. This eliminates one mtx_lock() per each routing lookup thus improving
performance in several cases (routing to directly connected interface or routing
to default gateway).

Icmp redirects should not be used to provide routing direction nowadays, even
for end hosts. Routers should not use them too (and this is explicitly restricted
in IPv6, see RFC 4861, clause 8.2).

Current commit changes rnh_matchaddr function to 'stock' rn_match (and back) for every
AF_INET routing table in given VNET instance on drop_redirect sysctl change.

This change is part of bigger patch eliminating rte locking.

Sponsored by: Yandex LLC
MFC after: 2 weeks


241394 10-Oct-2012 kevlo

Revert previous commit...

Pointyhat to: kevlo (myself)


241370 09-Oct-2012 kevlo

Prefer NULL over 0 for pointers


241344 08-Oct-2012 glebius

After r241245 it appeared that in_delayed_cksum(), which still expects
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
- ip_output(), ipfilter(4) not changed, since already call
in_delayed_cksum() with header in net byte order.
- divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
there and back.
- mrouting code and IPv6 ipsec now need to switch byte order there and
back, but I hope, this is temporary solution.
- In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
- pf_route() catches up on r241245 changes to ip_output().


241342 08-Oct-2012 glebius

No reason to play with IP header before calling sctp_delayed_cksum()
with offset beyond the IP header.


241245 06-Oct-2012 glebius

A step in resolving mess with byte ordering for AF_INET. After this change:

- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.

Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)


241129 02-Oct-2012 glebius

There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.

This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:

- 2 threads locate pcb, and do in_pcbref() on it.
- These 2 threads drop the inp hash lock.
- Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
doesn't free the pcb due to two references on it. Then it unlocks the pcb.
- 2 aforementioned threads acquire reader lock on the pcb and run
in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
second gets 0 and considers pcb freed, returns.
- The thread that got 1 continues working with detached pcb, which later
leads to panic in the underlying protocol level.

To plug that problem an additional INPCB flag is introduced - INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.
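
A single-threaded model of the idea, with all locking omitted. The
names are hypothetical stand-ins, and the return convention (true means
"do not use this pcb") is an assumption of this sketch rather than the
exact in_pcbrele_rlocked() contract:

```c
#include <stdbool.h>

#define PCB_FREED	0x01	/* hypothetical stand-in for INP_FREED */

struct pcb {
	int refcount;
	int flags;
};

/* Deleting side: flag the pcb as detached, then drop the deleter's reference. */
void
pcb_free(struct pcb *p)
{
	p->flags |= PCB_FREED;
	p->refcount--;
}

/*
 * Lookup side: drop the temporary reference taken during lookup.
 * Returns true when the caller must not use the pcb, either because
 * this was the last reference or because the pcb was flagged as freed.
 */
bool
pcb_rele(struct pcb *p)
{
	bool last = (--p->refcount == 0);

	return (last || (p->flags & PCB_FREED) != 0);
}
```

With the flag in place, both racing lookup threads see the pcb as gone,
not only the one that happens to drop the final reference.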

Discussed with: rwatson, jhb
Reported by: Vladimir Medvedkin <medved rambler-co.ru>


241043 29-Sep-2012 glebius

carp_send_ad() should never return without rescheduling next run.


240985 27-Sep-2012 glebius

Fix bug in TCP_KEEPCNT setting, which slipped in in the last round
of reviewing of r231025.

Unlike other options from this family TCP_KEEPCNT doesn't specify
time interval, but a count, thus parameter supplied doesn't need
to be multiplied by hz.
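
A minimal sketch of the distinction, assuming a fixed tick rate and
hypothetical option names; it does not mirror the actual setsockopt
handling code:

```c
#define HZ	1000	/* assumed tick rate, for illustration only */

enum keep_opt { KEEP_IDLE, KEEP_INTVL, KEEP_CNT };

/* Convert a user-supplied keepalive option value to its stored form. */
unsigned int
keep_opt_store(enum keep_opt opt, unsigned int val)
{
	switch (opt) {
	case KEEP_IDLE:
	case KEEP_INTVL:
		return (val * HZ);	/* seconds to ticks */
	case KEEP_CNT:
		return (val);		/* a probe count, not a time */
	}
	return (val);
}
```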

Reported & tested by: amdmi3


240849 23-Sep-2012 tuexen

Whitespace change.

MFC after: 3 days


240848 23-Sep-2012 tuexen

Declare a static function as such.

MFC after: 3 days


240842 22-Sep-2012 tuexen

Fix a bug related to handling Re-config chunks. It is not true that
the association can be removed if the socket is gone.

MFC after: 3 days


240826 22-Sep-2012 tuexen

Small cleanups. No functional change.

MFC after: 10 days


240725 20-Sep-2012 kevlo

Fix typo: s/pakcet/packet


240520 14-Sep-2012 eadler

s/teh/the/g

Approved by: cperciva
MFC after: 3 days


240507 14-Sep-2012 tuexen

Small cleanups. No functional change.

MFC after: 10 days


240494 14-Sep-2012 glebius

o Create directory sys/netpfil, where all packet filters should
reside, and move there ipfw(4) and pf(4).

o Move most modified parts of pf out of contrib.

Actual movements:

sys/contrib/pf/net/*.c -> sys/netpfil/pf/
sys/contrib/pf/net/*.h -> sys/net/
contrib/pf/pfctl/*.c -> sbin/pfctl
contrib/pf/pfctl/*.h -> sbin/pfctl
contrib/pf/pfctl/pfctl.8 -> sbin/pfctl
contrib/pf/pfctl/*.4 -> share/man/man4
contrib/pf/pfctl/*.5 -> share/man/man5

sys/netinet/ipfw -> sys/netpfil/ipfw

The arguable movement is pf/net/*.h -> sys/net. There are
future plans to refactor pf includes, so I decided not to
break things twice.

Not modified bits of pf left in contrib: authpf, ftp-proxy,
tftp-proxy, pflogd.

The ipfw(4) movement is planned to be merged to stable/9,
to make head and stable match.

Discussed with: bz, luigi


240263 09-Sep-2012 tuexen

Whitespace changes.

MFC after: 10 days


240250 08-Sep-2012 tuexen

Whitespace cleanup.

MFC after: 10 days


240233 08-Sep-2012 glebius

Merge the projects/pf/head branch, that was worked on for last six months,
into head. The most significant achievements in the new code:

o Fine grained locking, thus much better performance.
o Fixes to many problems in pf, that were specific to FreeBSD port.

New code doesn't have that many ifdefs and much less OpenBSDisms, thus
is more attractive to our developers.

Those interested in details, can browse through SVN log of the
projects/pf/head branch. And for reference, here is exact list of
revisions merged:

r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330,
r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656,
r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782,
r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868,
r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223,
r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456,
r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505,
r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168,
r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230,
r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398,
r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548,
r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672,
r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169,
r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442,
r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522,
r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661,
r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.

I'd like to thank people who participated in early testing:

Tested by: Florian Smeets <flo freebsd.org>
Tested by: Chekaluk Vitaly <artemrts ukr.net>
Tested by: Ben Wilber <ben desync.com>
Tested by: Ian FREISLICH <ianf cloudseed.co.za>


240198 07-Sep-2012 tuexen

Don't include a structure containing a flexible array in another
structure.

MFC after: 10 days


240158 06-Sep-2012 tuexen

Get rid of a gcc'ism.

MFC after: 10 days


240148 05-Sep-2012 tuexen

Using %p in a format string requires a void *.

MFC after: 10 days


240115 04-Sep-2012 tuexen

Consistently use the size of a variable. This helps to keep the code
simpler for the userland implementation.

MFC after: 3 days


240114 04-Sep-2012 tuexen

Whitespace change.

MFC after: 3 days


240099 04-Sep-2012 melifaro

Introduce new link-layer PFIL hook V_link_pfil_hook.
Merge ether_ipfw_chk() and part of bridge_pfil() into
unified ipfw_check_frame() function called by PFIL.
This change was suggested by rwatson? @ DevSummit.

Remove ipfw headers from ether/bridge code since they are unneeded now.

Note this change introduces some (temporary) performance penalty since
PFIL read lock has to be acquired for every link-level packet.

MFC after: 3 weeks


240073 03-Sep-2012 glebius

Provide a sysctl switch that allows to install ARP entries
with multicast bit set. FreeBSD refuses to install such
entries since 9.0, and this broke installations running
Microsoft NLB, which are violating standards.

Tested by: Tarasov Oleg <oleg_tarasov sg-tea.com>


240007 02-Sep-2012 tuexen

Fix a typo which results in RTT to be off by a factor of 10, if the RTT is
larger than 1 second.

MFC after: 3 days


239997 01-Sep-2012 eadler

Mark the ipfw interface type as not being ether. This fixes an issue
where uuidgen tried to obtain a ipfw device's mac address which was
always zero.

PR: 170460
Submitted by: wxs
Reviewed by: bdrewery
Reviewed by: delphij
Approved by: cperciva
MFC after: 1 week


239672 25-Aug-2012 rrs

This small change takes care of a race condition
that can occur when both sides close at the same time.
If that occurs, without this fix the connection enters
FIN1 on both sides and they will forever send FIN|ACK at
each other until the connection times out. This is because
we stopped processing the FIN|ACK and thus did not advance
the sequence and so never ACK'd each other's FIN. This
fix adjusts it so we *do* process the FIN properly and
the race goes away ;-)

MFC after: 1 month


239511 21-Aug-2012 np

Correctly handle the case where an inp has already been dropped by the time
the TOE driver reports that an active open failed. toe_connect_failed is
supposed to handle this but it should be provided the inpcb instead of the
tcpcb which may no longer be around.


239395 19-Aug-2012 rrs

Though I disagree, I concede to jhb & Rui. Note
that we still have a problem with this whole structure of
locks and in_input.c [it does not lock which it should not, but
this *can* lead to crashes]. (I have seen it in our SQA
testbed.. besides the one with a refcnt issue that I will
have SQA work on next week ;-)


239353 17-Aug-2012 rrs

Ok jhb, let's move the ifa_free() down to the bottom to
assure that *all* tables and such are removed before
we start to free. This won't protect the Hash in ip_input.c
but in theory should protect any other uses that *do* use locks.

MFC after: 1 week (or more)


239346 17-Aug-2012 lstewart

The TCP PAWS fix for kernels with fast tick rates (r231767) changed the TCP
timestamp related stack variables to reference ms directly instead of ticks.
The h_ertt(4) Khelp module relies on TCP timestamp information in order to
calculate its enhanced RTT estimates, but was not updated as part of r231767.

Consequently, h_ertt has not been calculating correct RTT estimates since
r231767 was committed, which in turn broke all delay-based congestion control
algorithms because they rely on the h_ertt RTT estimates.

Fix the breakage by switching h_ertt to use tcp_ts_getticks() in place of all
previous uses of the ticks variable. This ensures all timestamp related
variables in h_ertt use the same units as the TCP stack and therefore results in
meaningful comparisons and RTT estimate calculations.

Reported & tested by: Naeem Khademi (naeemk at ifi uio no)
Discussed with: bz
MFC after: 3 days


239334 16-Aug-2012 rrs

It's never a good idea to double free the same
address.

MFC after: 1 week (after the other commits ahead of this gets MFC'd)


239124 07-Aug-2012 luigi

s/lenght/length/ in comments


239093 06-Aug-2012 luigi

move functions outside the SYSBEGIN/SYSEND block

(SYSBEGIN/SYSEND are specific to ipfw/dummynet and are used to
emulate sysctl on platforms that do not have them, and they work
by creating an array which contains all the sysctl-ed symbols.)


239092 06-Aug-2012 luigi

use FREE_PKT instead of m_freem to free an mbuf.
The former is the standard form used in ipfw/dummynet, so that
it is easier to remap it to different memory managers depending
on the platform.


239091 06-Aug-2012 tuexen

Fix a bug found by dim@:
Don't use an uninitialized variable, if INVARIANTS is on and an illegal
packet with destination 0 is received.

MFC after: 3 days
X-MFC with: 238003


239075 05-Aug-2012 trociny

In tcp timers, check INP_DROPPED flag a little later, after
callout_deactivate(), so if INP_DROPPED is set we return with the
timer active flag cleared.

For me this fixes negative keep timer values reported by `netstat -x'
for connections in CLOSE state.

Approved by: net (silence)
MFC after: 2 weeks


239052 05-Aug-2012 tuexen

Fix a refcount issue. The called function only decrements if stcb is NULL.

MFC after: 3 days
Discussed with: rrs


239041 04-Aug-2012 tuexen

Fix a bug reported by Simon L. B. Nielsen:
If an SCTP endpoint receives an ASCONF with a wildcard
lookup address and incorrect verification tag, the system
crashes.

MFC after: 3 days.


239035 04-Aug-2012 tuexen

Testing an interface property should depend on the interface, not
on an address.

MFC after: 3 days


238990 02-Aug-2012 glebius

Fix races between in_lltable_prefix_free(), lla_lookup(),
llentry_free() and arptimer():

o Use callout_init_rw() for lle timeout, this allows us safely
disestablish them.
- This allows us to simplify the arptimer() and make it
race safe.
o Consistently use ifp->if_afdata_lock to lock access to
linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
is attached to the hash.
- Use LLE_LINKED to avoid double unlinking via consequent
calls to llentry_free().
- Mark lle with LLE_DELETED via |= operation instead of =,
so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
consistent and provide more informative KASSERTs.

The patch is a collaborative work of all submitters and myself.

PR: kern/165863
Submitted by: Andrey Zonov <andrey zonov.org>
Submitted by: Ryan Stone <rysto32 gmail.com>
Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>


238988 02-Aug-2012 luigi

replace __unused with a portable construct;
fix a couple of signed/unsigned warnings.


238978 01-Aug-2012 luigi

replace inet_ntoa_r with the more standard inet_ntop().
As discussed on -current, inet_ntoa_r() is non standard,
has different arguments in userspace and kernel, and
almost unused (no clients in userspace, only
net/flowtable.c, net/if_llatbl.c, netinet/in_pcb.c, netinet/tcp_subr.c
in the kernel)


238977 01-Aug-2012 luigi

add a cast to avoid a signed/unsigned warning (to be removed
when we will have TUNABLE_UINT constructors)


238967 01-Aug-2012 glebius

Some more whitespace cleanup.


238945 31-Jul-2012 glebius

Some style(9) and whitespace changes.

Together with: Andrey Zonov <andrey zonov.org>


238941 31-Jul-2012 luigi

nobody uses this file except the userspace ipfw code, but the cast
of a pointer to an integer needs a cast to prevent a warning for
size mismatch.

MFC after: 1 week


238790 26-Jul-2012 tuexen

Fix the sctp_sockstore union such that userland programs don't depend
on INET and/or INET6 to be defined and in-tune with how the kernel
was compiled.

MFC after: 3 days
Discussed with: rrs


238769 25-Jul-2012 bz

Fix a problem when CARP is enabled on the interface for IPv4
but not for IPv6. The current checks in nd6_nbr.c along with the
old version will result in ifa being NULL and subsequently the
packet will be dropped. This prevented NS/NA from working and
with that IPv6.

Now return the ifa from the carp lookup function in two cases:
1) if the address matches, is a carp address, and we are MASTER
(as before),
2) if the address matches but it is not a carp address at all (new).

Reported by: Peter Wemm (new Y! FreeBSD cluster, eating our own dogfood)
Tested on: New Y! FreeBSD cluster machines
Reviewed by: glebius


238699 22-Jul-2012 rwatson

Update some stale comments regarding tcbinfo locking in the TCP input
path: read locks on tcbinfo are no longer used, so won't happen. No
functional change.

MFC after: 3 days


238573 18-Jul-2012 glebius

Plug a reference leak: before doing 'goto again' we need to unref
ia->ia_ifa if there is any.

Submitted by: Andrey Zonov <andrey zonov.org>


238572 18-Jul-2012 glebius

When traversing global in_ifaddr list in the IFP_TO_IA() macro, we need
to obtain IN_IFADDR_RLOCK().


238550 17-Jul-2012 tuexen

Fix a refcount bug when freeing an association.
While there: Change code to be consistent.
Discussed with rrs@.
MFC after: 3 days


238516 16-Jul-2012 glebius

If ip_output() returns EMSGSIZE to tcp_output(), then the latter calls
tcp_mtudisc(), which in its turn may call tcp_output(). Under certain
conditions (must admit they are very special) an infinite recursion can
happen.

To avoid recursion we can pass struct route to ip_output() and obtain
correct mtu. This allows us not to use tcp_mtudisc() but call tcp_mss_update()
directly.

PR: kern/155585
Submitted by: Andrey Zonov <andrey zonov.org> (original version of patch)


238501 15-Jul-2012 tuexen

Changes which improve compilation if neither INET nor INET6 is defined.

MFC after: 3 days


238475 15-Jul-2012 tuexen

#ifdef INET and INET6 consistently. This also fixes a bug, where
it was done wrong.

MFC after: 3 days


238458 14-Jul-2012 tuexen

Provide the correct notification type (SCTP_SEND_FAILED_EVENT)
for unsent messages.

MFC after: 3 days


238455 14-Jul-2012 tuexen

Use case for selecting the address family (as in other places).

MFC after: 3 days


238454 14-Jul-2012 tuexen

Use case for selecting the address family (as in other places).

MFC after: 3 days


238294 09-Jul-2012 tuexen

Fix a bug introduced in r237715.

MFC after: 3 days.


238277 09-Jul-2012 hrs

Make ipfw0 logging pseudo-interface clonable. It can be created automatically
by $firewall_logif rc.conf(5) variable at boot time or manually by ifconfig(8)
after a boot.

Discussed on: freebsd-ipfw@


238265 08-Jul-2012 melifaro

Finally fix lookup (account remaining '\0') and deletion
(provide valid key length for radix lookup).

Submitted by: Ihor Kaharlichenko<madkinder at gmail.com> (prev version)
Approved by: kib(mentor)
MFC after: 3 days

Sponsored by: Shtorm ISP


238122 04-Jul-2012 tuexen

Use consistent method to determine IPV4_OUTPUT/IPV6_OUTPUT.

MFC after: 3 days


238121 04-Jul-2012 tuexen

Use CSUM_SCTP_IPV6 for IPv6.

MFC after: 3 days


238092 04-Jul-2012 glebius

When ip_output()/ip6_output() is supplied a struct route *ro argument,
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.

The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. The reference is held by the flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.

- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
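
A simplified model of the cleanup rule, assuming hypothetical struct,
flag and function names. The real RO_RTFREE() macro also deals with
locking, which is omitted here:

```c
#include <stddef.h>

#define RO_NOREF	0x01	/* hypothetical flag: route holds no reference */

struct rt { int refcnt; };
struct ro { struct rt *rt; int flags; };

/* Clean up a route according to how it was filled in. */
void
ro_rtfree(struct ro *ro)
{
	if (ro->rt == NULL)
		return;
	if ((ro->flags & RO_NOREF) == 0)
		ro->rt->refcnt--;	/* owned reference: release it */
	/* A borrowed entry stays referenced by its flow entry. */
	ro->rt = NULL;
}
```

A caller that passed in an empty route can use the same cleanup call
regardless of whether the lookup was served by the flow table or by the
regular routing lookup.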

Tested by: tuexen (SCTP part)


238087 03-Jul-2012 tuexen

Initialize a variable.

MFC after: 3 days


238084 03-Jul-2012 trociny

Don't check for ifp != NULL before KASSERT, as ifp may not be NULL here
(it is dereferenced below).

Discussed with: jhb
MFC after: 1 week


238083 03-Jul-2012 trociny

Fix RTTVAR scale in net.inet.tcp.hostcache.list sysctl.

Reviewed by: andre
MFC after: 3 days


238063 03-Jul-2012 issyl0

- Make ipfw's sched rules case insensitive, for user-friendliness.
- Add a note to the ipfw(8) man page about the rules no longer being
case sensitive.
- Fix some typos in the man page.

PR: docs/164772
Reviewed by: bz
Approved by: gabor (doc mentor, src committer)
MFC after: 2 weeks


238016 02-Jul-2012 glebius

Remove route caching from IP multicast routing code. There is no
reason to do that, and also, cached route never got unreferenced,
which meant a reference leak.

Reviewed by: bms


238003 02-Jul-2012 tuexen

Move common code parts to sctp_common_input_processing().

MFC after: 3 days


238002 02-Jul-2012 tuexen

Remove dead code (on FreeBSD) as suggested by glebius@.

MFC after: 3 days


237715 28-Jun-2012 tuexen

Pass the src and dst address of a received packet explicitly around.

MFC after: 3 days


237569 25-Jun-2012 tuexen

Unify sctp_input() and sctp6_input().

MFC after: 3 days


237565 25-Jun-2012 tuexen

Whitespace cleanup.

MFC after: 3 days


237542 24-Jun-2012 tuexen

Pass the packet length explicitly around.

MFC after: 3 days


237541 24-Jun-2012 tuexen

Remove redundant check.

MFC after: 3 days


237540 24-Jun-2012 tuexen

Do packet logging in a consistent way.

MFC after: 3 days


237479 23-Jun-2012 melifaro

Fix interface matching by ipfw table

Submitted by: Ihor Kaharlichenko <madkinder@gmail.com>
Tested by: Ihor Kaharlichenko <madkinder@gmail.com>
Approved by: kib(mentor)
MFC after: 3 days


237392 21-Jun-2012 tuexen

Remove redundant #ifdef. Reported by gnn@.

MFC after: 3 days


237263 19-Jun-2012 np

- Updated TOE support in the kernel.

- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
These are available as t3_tom and t4_tom modules that augment cxgb(4)
and cxgbe(4) respectively. The cxgb/cxgbe drivers continue to work as
usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs). T4 iWARP in the
works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload? Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded? Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by: bz, gnn
Sponsored by: Chelsio communications.
MFC after: ~3 months (after 9.1, and after ensuring MFC is feasible)


237230 18-Jun-2012 tuexen

Add rate limitation for SCTP OOTB responses.

MFC after: 3 days


237229 18-Jun-2012 tuexen

Cleanup the UDP decapsulation code.

MFC after: 3 days


237049 14-Jun-2012 tuexen

Pass flowid explicitly through the stack instead of taking it from
the mbuf chain at different places.
While there: Fix several bugs related to VRFs.

MFC after: 3 days


237015 13-Jun-2012 joel

mdoc: avoid nested displays. Fixes mandoc warnings.


236961 12-Jun-2012 tuexen

Add a cmsg of type IP_TOS for UDP/IPv4 sockets to specify the TOS byte.

MFC after: 3 days


236959 12-Jun-2012 tuexen

Add an IP_RECVTOS socket option that delivers, for each received UDP/IPv4
packet, a cmsg of type IP_RECVTOS containing the TOS byte, much like
IP_RECVTTL does for the TTL. This allows a protocol implemented on top
of UDP to support ECN.

MFC after: 3 days
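
A minimal sketch of how an application might pick the TOS byte out of the control data returned by recvmsg(2), assuming the cmsg level is IPPROTO_IP and the type is IP_RECVTOS as this log entry describes. The helper name find_recv_tos() is hypothetical, and other systems may deliver the byte under a different cmsg type.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Scan the control messages attached to a received datagram for the
 * TOS byte. Returns 0 and stores the byte in *tos when found, or -1
 * when no matching cmsg is present. (Hypothetical helper.)
 */
int
find_recv_tos(struct msghdr *msg, unsigned char *tos)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == IPPROTO_IP &&
		    cm->cmsg_type == IP_RECVTOS &&
		    cm->cmsg_len >= CMSG_LEN(sizeof(*tos))) {
			memcpy(tos, CMSG_DATA(cm), sizeof(*tos));
			return (0);
		}
	}
	return (-1);
}
```

The socket would first need the option enabled with setsockopt(fd, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on)); the two low-order bits of the returned byte carry the ECN field.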


236956 12-Jun-2012 tuexen

Unify the sending of ABORT, SHUTDOWN-COMPLETE and ERROR chunks.
While there: Fix also some minor bugs and prepare for SCTP/DTLS.

MFC after: 3 days


236949 12-Jun-2012 tuexen

Small cleanup.

MFC after: 3 days


236819 09-Jun-2012 melifaro

Validate IPv4 network mask being passed to ipfw kernel interface.
An incorrect mask may be one of the reasons for the existence of kern/127209.

Approved by: kib(mentor)
MFC after: 3 days
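
One common way to validate an IPv4 netmask is to check that its one bits are contiguous from the most significant bit. The sketch below shows that check; it is an illustration of the idea, not necessarily the test this commit added, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if `mask` (host byte order) is a run of one bits followed
 * only by zero bits, e.g. 0xffffff00. For a valid mask the inverted
 * value has the form 0...01...1, so adding 1 clears every set bit.
 */
bool
ipv4_mask_is_contiguous(uint32_t mask)
{
	uint32_t inverted = ~mask;

	return ((inverted & (inverted + 1)) == 0);
}
```

Both 0.0.0.0 and 255.255.255.255 pass this check; a mask such as 255.0.255.0 does not.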


236596 05-Jun-2012 eadler

Fix style nit: don't use leading zero for dates in .Dd

Prompted by: brueffer
Approved by: brueffer
MFC after: 3 days


236575 04-Jun-2012 emax

Plug more refcount leaks and possible NULL deref for interface
address list.

Submitted by: scottl@
MFC after: 3 days


236522 03-Jun-2012 tuexen

Remove code which is not needed.

MFC after: 3 days


236515 03-Jun-2012 tuexen

Use an existing function to get the source address.

MFC after: 3 days


236493 02-Jun-2012 tuexen

Honor sysctl for TTL.

MFC after: 3 days


236492 02-Jun-2012 tuexen

Don't request data from the IPv6 layer, which is not used.

MFC after: 3 days


236450 02-Jun-2012 tuexen

Remove an unused parameter.

MFC after: 3 days


236394 01-Jun-2012 bz

Make TCP LRO work properly with VIMAGE kernels rather than just panicking.
There's no VIMAGE context set there yet as this is before if_ethersubr.c.

MFC after: 3 days
X-MFC with: r235981


236391 01-Jun-2012 tuexen

Small cleanups. No functional change.

MFC after: 3 days


236332 30-May-2012 tuexen

Separate SCTP checksum offloading for IPv4 and IPv6.
While there: remove some trailing whitespace.

MFC after: 3 days
X-MFC with: 236170


236310 30-May-2012 glebius

Improve style(9) of bcopy() to and from mbuf tag.

Submitted by: bde


236297 30-May-2012 glebius

After r228571 carp_output() expects carp_softc * pointer in the mtag.

Noticed by: thompsa


236170 28-May-2012 bz

It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958


236157 27-May-2012 emaste

Add IPPROTO_MPLS (rfc4023) IP protocol definition

There are currently no in-tree consumers; I'm adding it now for use by
vendor code. This matches the change OpenBSD made while implementing
MPLS in gif(4).


236093 26-May-2012 bz

Trim the extra $FreeBSD$ from the comment below the license. We use
the __FBSDID() macro on the file now instead.

MFC after: 3 days


236087 26-May-2012 tuexen

Get rid of SCTP specific code to avoid CRC32C computations on loopback.
Just use offloading.

MFC after: 3 days


235990 25-May-2012 tuexen

Undefine SCTP_PACKED before including sctp_uio.h, which doesn't
use it. Spotted by Irene Ruengeler.

MFC after: 3 days


235985 25-May-2012 bz

MFp4 bz_ipv6_fast:

Properly protect the inp read access when handling the control code.
In the past this was expensive but given the rlock it's not so much
anymore.

Spotted while: optimizing udp6
Discussed with: rwatson (a few months ago)

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


235981 25-May-2012 bz

In case forwarding is turned on for a given address family, refuse to
queue the packet for LRO and tell the driver to directly pass it on.
This avoids re-assembly and later re-fragmentation problems when
forwarding.

It's not the best solution but the simplest and most effective for
the moment.

Should have been done: ages ago
Discussed with and by: many
MFC after: 3 days


235961 25-May-2012 bz

MFp4 bz_ipv6_fast:

Add code to handle pre-checked TCP checksums as indicated by mbuf
flags to save the entire computation for validation if not needed.

In the IPv6 TCP output path only compute the pseudo-header checksum,
set the checksum offset in the mbuf field along the appropriate flag
as done in IPv4.

In tcp_respond() just initialize the IPv6 payload length to 0 as
ip6_output() will properly set it.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days
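
To illustrate what "only compute the pseudo-header checksum" means, here is a sketch of the one's-complement sum over the IPv6 pseudo-header (source address, destination address, upper-layer length, next header). The function names are hypothetical and this is not the kernel's routine; the result is returned as a host-order value and is left uninverted so that hardware or later code can continue the sum over the TCP segment.

```c
#include <stddef.h>
#include <stdint.h>

/* Add `len` bytes (an even count) to `sum` as big-endian 16-bit words. */
static uint32_t
sum_words(const uint8_t *p, size_t len, uint32_t sum)
{
	for (size_t i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	return (sum);
}

/*
 * Folded, uninverted one's-complement sum over the IPv6 pseudo-header.
 * `ulp_len` is the upper-layer packet length, `next_hdr` the protocol
 * number (6 for TCP).
 */
uint16_t
ip6_pseudo_sum(const uint8_t src[16], const uint8_t dst[16],
    uint32_t ulp_len, uint8_t next_hdr)
{
	uint32_t sum = 0;

	sum = sum_words(src, 16, sum);
	sum = sum_words(dst, 16, sum);
	sum += (ulp_len >> 16) + (ulp_len & 0xffff);
	sum += next_hdr;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}
```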


235950 25-May-2012 bz

MFp4 bz_ipv6_fast:

Factor out the tcp_hc_getmtu() call. As the comments say it
applies to both v4 and v6, so only write it once making it easier
to read the protocol family specific code.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


235944 24-May-2012 bz

MFp4 bz_ipv6_fast:

Significantly update tcp_lro for mostly two things:
1) introduce basic support for IPv6 without extension headers.
2) try hard to also get the incremental checksum updates right,
especially also in the IPv4 case for the IP and TCP header.

Move variables around for better locality, factor things out into
functions, allow checksum updates to be compiled out, ...

Leave a few comments on further things to look at in the future,
though that is not the full list.

Update drivers with appropriate #includes as needed for IPv6 data
type in LRO.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days
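
The "incremental checksum update" mentioned above is, in general, the technique from RFC 1624: when one 16-bit word of a checksummed header changes, the new checksum can be derived from the old one without re-summing the header. The sketch below shows that general technique with hypothetical function names; it is not taken from tcp_lro itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t
fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/* Full Internet checksum over `n` 16-bit words. */
uint16_t
cksum_full(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += words[i];
	return ((uint16_t)~fold16(sum));
}

/*
 * Incremental update per RFC 1624, equation 3:
 *   HC' = ~(~HC + ~m + m')
 * where one word changes from old_word (m) to new_word (m').
 */
uint16_t
cksum_update(uint16_t cksum, uint16_t old_word, uint16_t new_word)
{
	uint32_t sum;

	sum = (uint16_t)~cksum;
	sum += (uint16_t)~old_word;
	sum += new_word;
	return ((uint16_t)~fold16(sum));
}
```

When segments are merged, fields such as the IP total length change, which is the kind of single-word change this update handles.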


235903 24-May-2012 tuexen

Add sn_send_failed_event to sctp_notification.

MFC after: 3 days


235828 23-May-2012 tuexen

Use consistent text at the beginning of the files.

MFC after: 3 days


235644 19-May-2012 marcel

Remove unused inclusion of curses.h


235557 17-May-2012 tuexen

Use a default for max_burst of 4 and l2var of 2.
This was discussed with rrs@.

MFC after: 3 days


235554 17-May-2012 tuexen

Support SCTP_EOF also for 1-to-1 style sockets.

MFC after: 3 days


235474 15-May-2012 bz

Switch to a standard 2 clause BSD license (from bsd-style-copyright).

Approved by: Myricom Inc. (gallatin)
Approved by: Intel Corporation (jfv)


235418 13-May-2012 tuexen

Support SCTP_REMOTE_ERROR notification.

MFC after: 3 days


235416 13-May-2012 tuexen

Provide in the SCTP_SEND_FAILED and SCTP_SEND_FAILED_EVENT notifications
the correct ssf_error or ssfe_error as required by RFC 6458.

MFC after: 3 days


235414 13-May-2012 tuexen

Provide the error code in SCTP_PEER_ADDR_CHANGE notifications as
specified in RFC 6458.

MFC after: 3 days


235412 13-May-2012 tuexen

Remove unused constants.

MFC after: 3 days


235403 13-May-2012 tuexen

Use ECONNABORTED in cases where the ABORT was sent to the peer.

MFC after: 3 days


235402 13-May-2012 tuexen

Ensure the user can read COMM_LOST notifications on 1-to-1 style sockets.

MFC after: 3 days


235360 12-May-2012 tuexen

Provide in the association change notification the received ABORT chunk
in case of SCTP_COMM_LOST or SCTP_CANT_STR_ASSOC as required by RFC 6458.

MFC after: 3 days


235286 11-May-2012 gjb

General mdoc(7) and typo fixes.

PR: 167734
Submitted by: Nobuyuki Koganemaru (kogane!jp.freebsd.org)
MFC after: 3 days


235283 11-May-2012 tuexen

Fix a bug in the handling of association reset request.

MFC after: 3 days


235282 11-May-2012 tuexen

Only provide the supported features in the SCTP_ASSOC_CHANGE notif
if the state is SCTP_COMM_UP or SCTP_RESTART.
While there, do some cleanups.

MFC after: 3 days


235280 11-May-2012 tuexen

Remove a constant which is only used on non-FreeBSD platform.
(The actual code for the socket option handling has been #ifdefed
out forever...)

MFC after: 3 days.


235091 06-May-2012 tuexen

Address clang warnings.

MFC after: 3 days


235081 06-May-2012 tuexen

Add support for the sac_info field in struct sctp_assoc_change
as required by RFC 6458.

MFC after: 3 days


235077 06-May-2012 tuexen

Remove debug code.

MFC after: 3 days


235075 06-May-2012 tuexen

Add support for SCTP_SEND_FAILED_EVENT as required by RFC 6458.

MFC after: 3 days


235066 05-May-2012 tuexen

Provide the flags in the SCTP stream reconfig related notification
as specified in RFC 6525.

MFC after: 3 days


235064 05-May-2012 tuexen

Honor SCTP_ENABLE_STREAM_RESET socket option when processing incoming
requests. Fix also the provided result in the response and use names
as specified in RFC 6525.

MFC after: 3 days


235057 05-May-2012 tuexen

Do error checking for the SCTP_RESET_STREAMS, SCTP_RESET_ASSOC,
and SCTP_ADD_STREAMS socket options as specified by RFC 6525.

MFC after: 3 days


235036 04-May-2012 delphij

Add ToS definitions for DiffServ Codepoints as per RFC2474.

Obtained from: OpenBSD
MFC after: 2 weeks


235021 04-May-2012 tuexen

Add support for the SCTP_ENABLE_STREAM_RESET socket option to
getsockopt(). This improves the support of RFC 6525.

MFC after: 3 days


235009 04-May-2012 tuexen

Add support for SCTP_STREAM_CHANGE_EVENT, SCTP_ASSOC_RESET_EVENT as
required by RFC 6525. This also fixes SCTP_STREAM_RESET_EVENT.

MFC after: 3 days


234996 04-May-2012 tuexen

Call panic() only under INVARIANTS.

MFC after: 3 days


234995 04-May-2012 tuexen

Use SCTP_PRINTF() instead of printf() in all SCTP sources.

MFC after: 3 days


234951 03-May-2012 tuexen

Fix another RFC 6458 issue. Spotted by Irene Ruengeler.

MFC after: 3 days


234946 03-May-2012 melifaro

Revert r234834 per luigi@ request.

Cleaner solution (e.g. adding another header) should be done here.

Original log:
Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h.
Remove ipfw/ip_fw_private.h header from non-ipfw code.

Requested by: luigi
Approved by: kib(mentor)


234834 30-Apr-2012 melifaro

Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h.
Remove ipfw/ip_fw_private.h header from non-ipfw code.

Approved by: ae(mentor)
MFC after: 2 weeks


234832 30-Apr-2012 tuexen

Add support for missing gauth_number_of_chunks field. This bug was
found by Irene Ruengeler.

MFC after: 1 week


234762 28-Apr-2012 tuexen

Whitespace changes.

MFC after: 3 days


234731 27-Apr-2012 tuexen

Remove unused structure.
Reported by Irene Ruengeler.

MFC after: 3 days


234699 26-Apr-2012 tuexen

Fix a typo in an SCTP AUTH related notification. Keep the old name
for backwards compatibility.
Spotted by Irene Ruengeler.

MFC after: 3 days


234614 23-Apr-2012 tuexen

Use the flags defined in RFC 6525 in the stream reset event.


234539 21-Apr-2012 tuexen

Fix check used by stream reset related events.

MFC after: 3 days


234464 19-Apr-2012 tuexen

Whitespace changes.

MFC after: 3 days


234461 19-Apr-2012 tuexen

Use the same pattern for mbuf logging everywhere.

MFC after: 3 days


234460 19-Apr-2012 tuexen

Fix reported errno.

MFC after: 3 days


234459 19-Apr-2012 tuexen

Fix a bug where we copy out more data from an mbuf chain than is
actually in it. This happens when SCTP receives an unknown chunk, which
requires the sending of an ERROR chunk, and there is no final padding but
the chunk is not 4-byte aligned.
Reported by yueting via rwatson@

MFC after: 3 days
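
The failure mode can be sketched as follows: a chunk length is rounded up to a multiple of 4 for copying, but the packet may end without the final padding, so the copy length has to be clamped to what is actually present. The function names are hypothetical and the code only illustrates the arithmetic, not the actual fix.

```c
#include <stddef.h>

/* Round a chunk length up to the next multiple of 4. */
size_t
padded_chunk_len(size_t chunk_len)
{
	return ((chunk_len + 3) & ~(size_t)3);
}

/*
 * Number of bytes that can safely be copied for a chunk of `chunk_len`
 * bytes starting at `offset` in a packet of `pkt_len` bytes: the padded
 * length, but never more than the bytes remaining in the packet.
 */
size_t
safe_copy_len(size_t pkt_len, size_t offset, size_t chunk_len)
{
	size_t want, avail;

	if (offset >= pkt_len)
		return (0);
	avail = pkt_len - offset;
	want = padded_chunk_len(chunk_len);
	return (want < avail ? want : avail);
}
```

For example, a 5-byte chunk at offset 12 of a 17-byte packet has a padded length of 8, but only 5 bytes are available to copy.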


234342 16-Apr-2012 glebius

When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exhausted. In this
case the allocation fails, and the tcp_mss_update() called later finds
nothing in the cache. The fast retransmit is done with an unreduced MSS
and is immediately answered by the remote host with new ICMP datagrams,
and the cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by: az
Reported by: Andrey Zonov <andrey zonov.org>
Reviewed by: andre (previous version of patch)
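
As a rough sketch of "needs to have min_protoh subtracted": an MTU value is turned into an MSS by removing the minimal IP and TCP header sizes. The constants below are the option-free header sizes (20 bytes IPv4, 40 bytes IPv6, 20 bytes TCP); the function name and the -1 error convention are assumptions made for this example.

```c
#include <stdbool.h>

#define IPV4_HDR_LEN	20	/* IPv4 header without options */
#define IPV6_HDR_LEN	40	/* fixed IPv6 header */
#define TCP_HDR_LEN	20	/* TCP header without options */

/*
 * Convert an MTU (for instance the value proposed in an ICMP
 * "fragmentation needed" message) into an MSS by subtracting the
 * minimal protocol headers. Returns -1 if the MTU is too small.
 */
int
mss_from_mtu(int mtu, bool is_ipv6)
{
	int min_protoh;

	min_protoh = (is_ipv6 ? IPV6_HDR_LEN : IPV4_HDR_LEN) + TCP_HDR_LEN;
	if (mtu <= min_protoh)
		return (-1);
	return (mtu - min_protoh);
}
```

Passing the proposed MTU directly means the MSS can be reduced even when no host cache entry could be allocated.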


234297 14-Apr-2012 tuexen

Always send HBs when in PF state.

MFC after: 1 week
X-MFC with: r234296


234296 14-Apr-2012 tuexen

Bugfix: Don't send HBs on paths which are not idle.

MFC after: 1 week


234130 11-Apr-2012 glebius

It is a logic error that carp_multicast_cleanup() looks at the
count of addresses on a particular vhid; it should account
for the number of addresses on the cif.

To achieve this we need to run carp_attach() and
carp_detach() under appropriate cif lock.


234087 10-Apr-2012 glebius

M_DONTWAIT is a flag from historical mbuf(9)
allocator, not malloc(9) or uma(9) flag.


234084 10-Apr-2012 glebius

CARP should be capable of running on if_bridge(4). Unfortunately,
this commit is not enough to enable CARP operation on
if_bridge(4), because the latter doesn't handle or even
initialize its ifp->if_link_state.

Reported by: Alexander Lunev <sol289 gmail.com>


233940 06-Apr-2012 tuexen

Remove duplicate condition in if statement.

Obtained from: brucec@
MFC after: 3 days


233745 31-Mar-2012 glebius

Don't check malloc(M_WAITOK) results.


233660 29-Mar-2012 rrs

Make our stream reset implementation
compliant with RFC 6525.

MFC after: 1 month


233601 28-Mar-2012 zec

Permit tcpdrop in VNET jails.

Submitted by: Miljenko Mikuc
MFC after: 3 days


233597 28-Mar-2012 tuexen

Honor the net.inet.udp.checksum sysctl when using SCTP/UDP/IPv4
encapsulation.
MFCing requires MFCing http://svn.freebsd.org/changeset/base/233554
MFC after: 2 weeks


233554 27-Mar-2012 bz

Export the udp_cksum sysctl for upcoming SCTP work. Rather than always,
SCTP will only do IPv4 UDP checksum calculation as defined by the host
policy. When tunneling SCTP always calculates the inner checksum already
so not doing the outer UDP can save cycles.

While here virtualize the variable.

Requested by: tuexen
MFC after: 2 weeks


233478 25-Mar-2012 melifaro

- Permit number of ipfw tables to be changed in runtime.

net.inet.ip.fw.tables_max is now read-write.

- Bump IPFW_TABLES_MAX to 65535
Default number of tables is still 128

- Remove IPFW_TABLES_MAX from ipfw(8) code.

Sponsored by Yandex LLC

Approved by: kib(mentor)

MFC after: 2 weeks


233311 22-Mar-2012 tuexen

Small cleanup of the code. No functional change (in FreeBSD kernel).

MFC after: 1 week.


233096 17-Mar-2012 rmh

Hide a few declarations from userland (including `struct inpcbgroup'). This
removes the dependency on <machine/param.h> which was introduced with SVN
rev 222748 (due to CACHE_LINE_SIZE).

Reviewed by: bde
MFC after: 10 days


233005 15-Mar-2012 tuexen

Clean up, no functional change.

MFC after: 3 days.


233004 15-Mar-2012 tuexen

Fix bugs which can result in a panic when a non-SCTP socket is
used with an sctp_ system-call which expects an SCTP socket.

MFC after: 3 days.


232868 12-Mar-2012 melifaro

Fix VNET build broken by r232865.
Temporarily remove the ability to assign different number of tables per VNET instance.


232866 12-Mar-2012 rrs

This fixes PR 165210. Basically we just
add in the netgraph interface to the list of
acceptable interfaces. A todo at the next
IETF code blitz, though is we need to review
why we screen interfaces, there was a reason ;-).

PR: 165210
MFC after: 1 week


232865 12-Mar-2012 melifaro

- Add ipfw eXtended tables permitting radix to be used for any kind of keys.
- Add support for IPv6 and interface extended tables
- Make number of tables to be loader tunable in range 0..65534.
- Use IP_FW3 opcode for all new extended table cmds

No ABI changes are introduced. Old userland will see valid tables for
IPv4 tables and no entries otherwise. Flush works for any table.

IP_FW3 socket option is used to encapsulate all new opcodes:
/* IP_FW3 header/opcodes */
typedef struct _ip_fw3_opheader {
    uint16_t opcode;       /* Operation opcode */
    uint16_t reserved[3];  /* Align to 64-bit boundary */
} ip_fw3_opheader;

New opcodes added:
IP_FW_TABLE_XADD, IP_FW_TABLE_XDEL, IP_FW_TABLE_XGETSIZE, IP_FW_TABLE_XLIST

ipfw(8) table argument parsing behavior is changed:
'ipfw table 999 add host' now assumes 'host' to be interface name instead of
hostname.

New tunable:
net.inet.ip.fw.tables_max controls number of tables supported by ipfw in given
VNET instance. 128 is still the default value.

New syntax:
ipfw add skipto tablearg ip from any to any via table(42) in
ipfw add skipto tablearg ip from any to any via table(4242) out

This is a bit hackish, special interface name '\1' is used to signal interface
table number is passed in p.glob field.

Sponsored by Yandex LLC

Reviewed by: ae
Approved by: ae (mentor)

MFC after: 4 weeks
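
The encapsulation can be pictured as an 8-byte header placed in front of an opcode-specific payload. The sketch below repeats the header definition quoted in this entry and adds a hypothetical fw3_pack() helper that lays out header plus payload in a caller-supplied buffer; the opcode value and payload contents are placeholders, not real ipfw definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header layout as quoted in the commit message. */
typedef struct _ip_fw3_opheader {
	uint16_t opcode;	/* Operation opcode */
	uint16_t reserved[3];	/* Align to 64-bit boundary */
} ip_fw3_opheader;

/*
 * Write a zeroed header carrying `opcode`, followed by `paylen` bytes of
 * payload, into `buf`. Returns the number of bytes written, or 0 if
 * `buf` is too small. (Hypothetical helper.)
 */
size_t
fw3_pack(void *buf, size_t buflen, uint16_t opcode,
    const void *payload, size_t paylen)
{
	ip_fw3_opheader hdr;

	if (buflen < sizeof(hdr) || paylen > buflen - sizeof(hdr))
		return (0);
	memset(&hdr, 0, sizeof(hdr));
	hdr.opcode = opcode;
	memcpy(buf, &hdr, sizeof(hdr));
	if (paylen > 0)
		memcpy((char *)buf + sizeof(hdr), payload, paylen);
	return (sizeof(hdr) + paylen);
}
```

Because the header is four 16-bit fields, the payload that follows starts on a 64-bit boundary.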


232726 09-Mar-2012 tuexen

Fix a warning reported by bz@

MFC after: 3 days.


232724 09-Mar-2012 tuexen

Add support for stf interfaces.

MFC after: 3 days.


232723 09-Mar-2012 tuexen

Fix a bug reported by Peter Holm which results in a crash:
Verify in sctp_peeloff() that the socket is a one-to-many
style SCTP socket.

MFC after: 3 days.


232517 04-Mar-2012 zec

Change SYSINIT priorities so that ip_mroute_modevent() is executed
before vnet_mroute_init(), since vnet_mroute_init() depends on mfchashsize
tunable to be set, and that is done in ip_mroute_modevent().
Apparently I broke that ordering with r208744 almost 2 years ago...

PR: kern/162201
Submitted by: Stevan Markovic (mcafee.com)
MFC after: 3 days


232513 04-Mar-2012 bz

Correct typo in the RFC number for the constants based on IANA assignments
for IPv6 Neighbor Discovery Option types for "IPv6 Router Advertisement
Options for DNS Configuration". It is RFC 6106.

MFC after: 3 days


232273 28-Feb-2012 oleg

- Refresh dynamic tcp rule only if both sides answered keepalive packets.
- Remove some useless assignments.

MFC after: 1 month


232272 28-Feb-2012 oleg

lookup_dyn_rule_locked(): style(9) cleanup

MFC after: 1 month


232054 23-Feb-2012 kmacy

When using flowtable, llentries can outlive the interface with which they are
associated, at which point the lle_tbl pointer points to freed memory and the
llt_free pointer is no longer valid.

Move the free pointer into the llentry itself and update the initialization sites.

MFC after: 2 weeks


231991 22-Feb-2012 ae

Don't use `m' after m_megapullup.

PR: kern/165373
MFC after: 3 days


231895 18-Feb-2012 tuexen

Remove two clang warnings.

MFC after: 1 month.


231852 17-Feb-2012 bz

Merge multi-FIB IPv6 support from projects/multi-fibv6/head/:

Extend the so far IPv4-only support for multiple routing tables (FIBs)
introduced in r178888 to IPv6 providing feature parity.

This includes an extended rtalloc(9) KPI for IPv6, the necessary
adjustments to the network stack, and user land support as in netstat.

Sponsored by: Cisco Systems, Inc.
Reviewed by: melifaro (basically)
MFC after: 10 days


231767 15-Feb-2012 bz

Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when
hz >> 1000 and thus getting outside the timestamp clock frequency of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.

Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.

Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.

PR: kern/61404
Reviewed by: jhb, mav, rrs
Discussed with: silby, lstewart
Sponsored by: Sandvine Incorporated (originally in 2011)
MFC after: 6 weeks
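
A small sketch of the arithmetic involved, under the assumption that the timestamp clock is simply uptime expressed in milliseconds and truncated to 32 bits. With that clock, half of the 32-bit space is 2^31 ms, about 24.8 days, which is where the figure in this entry comes from. The function names are hypothetical.

```c
#include <stdint.h>

/*
 * Build a 32-bit millisecond timestamp from an uptime given as seconds
 * plus microseconds. The value wraps modulo 2^32.
 */
uint32_t
ts_from_uptime_ms(uint64_t sec, uint32_t usec)
{
	return ((uint32_t)(sec * 1000u + usec / 1000u));
}

/*
 * Modular comparison as used for wrapped sequence spaces: true if
 * timestamp `a` is older than `b`, valid while the two are less than
 * 2^31 ticks apart.
 */
int
ts_before(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}
```

Because uptime does not jump when the wall clock is set, a clock built this way cannot suddenly move forward by more than the comparison window.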


231672 14-Feb-2012 tuexen

Fix a bug where the wrong protocol overhead was used. This can lead
to a deadlock of an association when an IPv6 socket was used to
communicate with IPv4 and an ICMPv4 fragmentation needed message
was received.
While there, simplify the code a bit.

MFC after: 3 days.


231201 08-Feb-2012 glebius

Set vnet context in callouts and taskqueues.

PR: 164696


231076 06-Feb-2012 glebius

Make the 'tcpwin' option of ipfw(8) accept ranges and lists.

Submitted by: sem


231074 06-Feb-2012 tuexen

Fix a typo which was already fixed by eadler in r227489. We failed
to integrate this fix in our code base, so it was removed in r227755.

MFC after: 3 days.


231025 05-Feb-2012 glebius

Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, which allow controlling the initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by: andre, bz, lstewart
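
A minimal usage sketch for three of these options, assuming the values are given in seconds (for the count, in probes) as documented in tcp(4). The helper name set_keepalive() is hypothetical. TCP_KEEPINIT is left out because it is not available on every system; where it exists it is set the same way.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Enable keepalives on `fd` and set the per-socket idle time, probe
 * interval and probe count. Returns 0 on success, or -1 as soon as one
 * setsockopt() call fails (errno is left set by that call).
 */
int
set_keepalive(int fd, int idle, int intvl, int cnt)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle,
	    sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
	    sizeof(intvl)) == -1)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt,
	    sizeof(cnt)) == -1)
		return (-1);
	return (0);
}
```

A call such as set_keepalive(fd, 60, 10, 5) would ask for the first probe after 60 idle seconds, then a probe every 10 seconds, giving up after 5 unanswered probes.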


230863 01-Feb-2012 glebius

o Provide functions carp_ifa_addroute()/carp_ifa_delroute()
to cleanup routes from a single ifa.
o Implement carp_addroute()/carp_delroute() via above functions.
o Call carp_ifa_delroute() in the carp_detach() to avoid
junk routes left in routing table, in case if user
removes an address in a MASTER state. [1]

Reported by: az [1]


230614 27-Jan-2012 luigi

A variable was erroneously declared as 32 bit instead of 64.

MFC after: 3 days


230508 24-Jan-2012 glebius

Remove unused variable.


230452 22-Jan-2012 bz

Make #error messages string-literals and remove punctuation.

Reported by: bde (for ip_divert)
Reviewed by: bde
MFC after: 3 days


230443 22-Jan-2012 bz

Fix ip_divert handling of inet and inet6 and module building some more.

Properly sort the "carp" case in modules/Makefile after it was renamed.

Reported by: bde (most)
Reviewed by: bde
MFC after: 3 days


230442 22-Jan-2012 bz

Clean up some #endif comments removing from short sections. Add #endif
comments to longer, also refining strange ones.

Properly use #ifdef rather than #if defined() where possible. Four
#if defined(PCBGROUP) occurrences (netinet and netinet6) were ignored to
avoid conflicts with eventually upcoming changes for RSS.

Reported by: bde (most)
Reviewed by: bde
MFC after: 3 days


230387 20-Jan-2012 bz

Remove a superfluous INET6 check (no opt_inet6.h included anyway).

MFC after: 3 days


230379 20-Jan-2012 tuexen

Fix a problem when using the CBAPI.
While there, remove an old comment which does not apply anymore.


230207 16-Jan-2012 glebius

Drop support for SIOCSIFADDR, SIOCSIFNETMASK, SIOCSIFBRDADDR, SIOCSIFDSTADDR
ioctl commands.

PR: 163524
Reviewed by: net


230136 15-Jan-2012 tuexen

Two cleanups. No functional change.


230104 14-Jan-2012 tuexen

Fix two bugs, which result in a panic when calling getsockopt()
using SCTP_RECVINFO or SCTP_NXTINFO.
Reported by Clement Lecigne and forwarded to us by zi@.

MFC after: 3 days.


229850 09-Jan-2012 glebius

Bunch of fixes to pfsync(4) module load/unload:

o Make the pfsync.ko actually usable. Before this change loading it
didn't register protosw, so was a nop. However, the presence of the
module in /boot/kernel confused users.
o Rewrite the way we are joining multicast group:
- Move multicast initialization/destruction to separate functions.
- Don't allocate memory if we aren't going to join a multicast group.
- Use modern API for joining/leaving multicast group.
- Now the utterly wrong pfsync_ifdetach() isn't needed.
o Move module initialization from SYSINIT(9) to moduledata_t method.
o Refuse to unload module, unless asked forcibly.
o Improve a bit some FreeBSD porting code:
- Use separate malloc type.
- Simplify swi scheduling.

This change is probably wrong from VIMAGE viewpoint, however pfsync
wasn't VIMAGE-correct before this change, too.

Glanced at by: bz


229816 08-Jan-2012 glebius

Make it possible to use alternative source hardware address
in the ARP datagram generated by arprequest(). If caller doesn't
supply the address, then it is either picked from CARP or hardware
address of the interface is taken.

While here, make several minor fixes:

- Hold IF_ADDR_RLOCK(ifp) while traversing address list.
- Remove an untrue comment.
- Access internet address and mask via in_ifaddr fields,
rather than ifaddr.


229815 08-Jan-2012 glebius

Provide IA_MASKSIN() macro similar to IA_SIN() and IA_DSTSIN().


229810 08-Jan-2012 glebius

Move arprequest() declaration to if_ether.h.


229805 08-Jan-2012 tuexen

Add an SCTP sysctl "blackhole", similar to the one for TCP.
If set to 1, no ABORT is sent back in response to an incoming
INIT. If set to 2, no ABORT is sent back in response to
an out of the blue packet. If set to 0 (the default), ABORTs
are sent.
Discussed with rrs@.

MFC after: 1 month.
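
The three settings can be summarised as a small decision function. This is a sketch of one reading of the description above, with the added assumption that level 2 also suppresses the ABORT for an incoming INIT (an INIT with no matching association being itself an out of the blue packet). The function name is hypothetical.

```c
#include <stdbool.h>

/*
 * Decide whether an ABORT should be sent in response to a packet that
 * matches no association. `is_init` is true when the packet carries an
 * INIT chunk.
 *   blackhole == 0: always answer with an ABORT (the default)
 *   blackhole == 1: stay silent for INITs only
 *   blackhole == 2: stay silent for every out of the blue packet
 *                   (assumed to include INITs)
 */
bool
should_send_abort(int blackhole, bool is_init)
{
	if (blackhole == 0)
		return (true);
	if (blackhole == 1)
		return (!is_init);
	return (false);
}
```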


229775 07-Jan-2012 tuexen

Retire the SCTP sysctl "strict_init". We always perform the validation
and there is no reason to make it configurable.
Discussed with rrs@.


229774 07-Jan-2012 tuexen

Improve the handling of received INITs. Send an ABORT when
not accepting the connection. Also fix a crash, which
could happen when the user closed the socket.

MFC after: 1 month.


229749 07-Jan-2012 eadler

- Fix sysctl description

PR: 163623
Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru>
Approved by: bz


229729 06-Jan-2012 tuexen

Use NULL instead of 0.

MFC after: 1 month.


229714 06-Jan-2012 np

Always release the inp lock before returning from tcp_detach.

MFC after: 5 days


229700 06-Jan-2012 jhb

Tweak the last fix to match what was actually tested.

Pointy hat to: jhb


229672 06-Jan-2012 pluknet

Fix a typo.

X-MFC-with: 229665


229665 05-Jan-2012 jhb

Remove the assertion from tcp_input() that rcv_nxt is always greater
than or equal to rcv_adv and fix tcp_twstart() to handle this case by
assuming the last window was zero rather than a negative value.

The code in tcp_input() already safely handled this case. It can happen
due to delayed ACKs along with a remote sender that sends data beyond
the window we previously advertised. If we have room in our socket buffer
for the extra data beyond the advertised window, we will accept it.
However, if the ACK for that segment is delayed, then we will not
effectively fixup rcv_adv to account for that extra data until the
next segment arrives and forces out an ACK. When that next segment
arrives, rcv_nxt will be beyond rcv_adv.

Tested by: pjd
MFC after: 1 week
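
The "zero rather than a negative value" handling amounts to clamping a sequence-space difference. The sketch below shows that idea with a hypothetical function name; it is not the code from tcp_twstart().

```c
#include <stdint.h>

/*
 * Window that was last advertised but not yet consumed: the distance
 * from rcv_nxt to rcv_adv in 32-bit sequence space. If the peer sent
 * beyond the advertised window (rcv_nxt is past rcv_adv) the difference
 * is negative, and the window is treated as zero.
 */
uint32_t
last_advertised_window(uint32_t rcv_adv, uint32_t rcv_nxt)
{
	int32_t diff = (int32_t)(rcv_adv - rcv_nxt);

	return (diff > 0 ? (uint32_t)diff : 0);
}
```

Doing the subtraction in unsigned arithmetic and then reinterpreting it as signed keeps the comparison correct across sequence number wraparound.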


229621 05-Jan-2012 jhb

Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by: bz
MFC after: 2 weeks


229478 04-Jan-2012 jhb

Use a helper variable to wrap a long line.


229477 04-Jan-2012 jhb

In the handling of the SIOC[DG]LIFADDR ioctls in in_lifaddr_ioctl(), add
missing interface address list locking and grab a reference on the
matching interface address after dropping the lock while it is used to
avoid a potential use after free.

Reviewed by: bz
MFC after: 1 week


229476 04-Jan-2012 jhb

Fix the SIOC[DG]LIFADDR ioctls in in_lifaddr_ioctl() to work with IPv4
interface address rather than IPv6.

Submitted by: hrs
Reviewed by: bz
MFC after: 1 week


229420 03-Jan-2012 jhb

When cancelling multicast timers on an interface, don't release the
reference on a group in the leaving state while iterating over the loop.
Instead, use the same approach used in igmp_ifdetach() and mld_ifdetach()
of placing the groups to free on pending release list and then releasing
the references after dropping the IF_ADDR_LOCK. This closes an ugly race
where the code was dropping the lock in the middle of iterating over the
list. It also fixes some additional potential use-after-free bugs since
the cancellation routine also applied other changes to the group after
dropping the reference. Now those changes are performed before the
reference is dropped and the group is potentially freed.

Prodded to fix by: glebius
Reviewed by: bz
MFC after: 1 week


229390 03-Jan-2012 jhb

Use TAILQ_FOREACH() instead of TAILQ_FOREACH_SAFE() for some loops that
do not modify the queues they iterate over.

Submitted by: glebius


229265 02-Jan-2012 bz

As I came by and noticed add a comment that inp locking is a bit optimistic
(read: non-existent) here and should be fixed.


228969 29-Dec-2011 jhb

Defer the work of freeing IPv4 multicast options from a socket to an
asynchronous task. This avoids tearing down multicast state including
sending IGMP leave messages and reprogramming MAC filters while holding
the per-protocol global pcbinfo lock that is used in the receive path of
packet processing.

Reviewed by: rwatson
MFC after: 1 month


228966 29-Dec-2011 jhb

Use queue(3) macros instead of home-rolled versions in several places in
the INET6 code. This includes retiring the 'ndpr_next' and 'pfr_next'
macros.

Submitted by: pluknet (earlier version)
Reviewed by: pluknet


228959 29-Dec-2011 glebius

Don't fallback to a CARP address in BACKUP state.


228907 27-Dec-2011 tuexen

Address issues found by clang. While there, fix also some style
issues.

MFC after: 3 months.


228812 22-Dec-2011 glebius

Use a better log message for master down event.


228768 21-Dec-2011 glebius

Provide ABI compatibility shim to enable configuring of addresses
with ifconfig(8) prior to r228571.

Requested by: brooks


228736 20-Dec-2011 glebius

Restore a feature that was present in 5.x and 6.x, and was cleared in
7.x, 8.x and 9.x with pf(4) imports: pfsync(4) should suppress CARP
preemption, while it is running its bulk update.

However, reimplement the feature in more elegant manner, that is
partially inspired by newer OpenBSD:

- Rename term "suppression" to "demotion", to match with OpenBSD.
- Keep a global demotion factor, that can be raised by several
conditions, for now these are:
- interface goes down
- carp(4) has problems with ip_output() or ip6_output()
- pfsync performs bulk update
- Unlike in OpenBSD the demotion factor isn't a counter, but
is actual value added to advskew. The adjustment values for
particular error conditions are also configurable, and their
defaults are maximum advskew value, so a single failure bumps
demotion to maximum. This is for POLA compatibility, and should
satisfy most users.
- Demotion factor is a writable sysctl, so user can do
foot shooting, if he desires to.
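
The effect of the demotion factor on the advertised skew can be sketched as a saturating addition. The cap of 240 used below is an assumption made for this example, as is the function name; the entry above only says that the defaults equal the maximum advskew value.

```c
#define MAX_ADVSKEW	240	/* assumed maximum advskew for this sketch */

/*
 * Skew actually advertised: the configured advskew plus the global
 * demotion factor, clamped to the range 0..MAX_ADVSKEW.
 */
int
effective_advskew(int advskew, int demotion)
{
	int skew = advskew + demotion;

	if (skew > MAX_ADVSKEW)
		skew = MAX_ADVSKEW;
	if (skew < 0)
		skew = 0;
	return (skew);
}
```

With a per-condition adjustment equal to the maximum, a single failure is enough to push any host to the cap, which matches the behaviour described for the defaults.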


228653 17-Dec-2011 tuexen

Fix unused parameter warnings.
While there, fix some whitespace issues.

MFC after: 3 months.


228574 16-Dec-2011 glebius

Since size of struct in_aliasreq has just been changed in r228571,
and thus ifconfig(8) needs recompile, it is a good chance to make
parameter checks on SIOCAIFADDR arguments more strict.


228571 16-Dec-2011 glebius

A major overhaul of the CARP implementation. The ip_carp.c was started
from scratch, copying needed functionality from the old implementation
on demand, with a thorough review of all code. The main change is that
the interface layer has been removed from CARP. Now redundant addresses
are configured directly on the interfaces they run on.

The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.

ifconfig(8) semantics have changed too: now one doesn't need
to clone a carpXX interface, but should directly configure a vhid
on an Ethernet interface.

To supply vhid data from the kernel to an application the getifaddrs(3)
function has been changed to pass ifam_data with each address. [1]

The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.

Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!

PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]


228454 13-Dec-2011 glebius

Belatedly catch up with r151555. in_scrubprefix() also needs this fix. We
should compare not only addresses, but their masks, too, when searching
for matching prefix.
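
A minimal sketch of the comparison this entry describes, assuming
addresses and masks are held in host byte order. The struct and function
names are hypothetical and are not the kernel's own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical prefix representation, host byte order. */
struct sketch_prefix {
	uint32_t addr;
	uint32_t mask;
};

/*
 * Two prefixes match only if their masks are equal and the masked
 * addresses are equal. Comparing masked addresses alone would treat
 * 10.0.0.0/8 and 10.0.0.0/24 as the same prefix.
 */
static bool
sketch_same_prefix(const struct sketch_prefix *a,
    const struct sketch_prefix *b)
{
	if (a->mask != b->mask)
		return (false);
	return ((a->addr & a->mask) == (b->addr & b->mask));
}
```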


228391 10-Dec-2011 tuexen

Fix a bug reported by Irene Ruengeler which resulted in not sending
out HEARTBEATs when requested by the user. The HEARTBEATs were only
queued, but not actually sent out.

MFC after: 2 months.


228313 06-Dec-2011 glebius

Fix a very special case: when SIOCAIFADDR supplies a mask of 0.0.0.0,
don't overwrite the mask with autoguessing based on classes.


228102 28-Nov-2011 tuexen

Remove debug code.

MFC after: 1 month.


228062 28-Nov-2011 glebius

Fix one more fallout from r227791: do not overwrite trimmed sa_len
on the ia_sockmask when doing SIOCSIFNETMASK.

Reported by: Stefan Bethke <stb lassitu.de>, gonzo
Pointy hat to: glebius


228031 27-Nov-2011 tuexen

Fix a warning reported by arundel@.
Fix a bug where the parameter length of a supported address types
parameter is set to a wrong value if the kernel is built
with either INET or INET6, but not both.

MFC after: 3 days.


228016 27-Nov-2011 lstewart

Plug a TCP reassembly UMA zone leak introduced in r226113 by only using the
backup stack queue entry when the zone is exhausted, otherwise we leak a zone
allocation each time we plug a hole in the reassembly queue.

Reported by: many on freebsd-stable@ (thread: "TCP Reassembly Issues")
Tested by: many on freebsd-stable@ (thread: "TCP Reassembly Issues")
Reviewed by: bz (very brief sanity check)
MFC after: 3 days


227959 24-Nov-2011 glebius

Remove superfluous check: SIOCAIFADDR must have ifra_addr supplied.


227958 24-Nov-2011 glebius

Fix stupid typo in r227830.

PR: 162806
Pointy hat to: glebius


227931 24-Nov-2011 tuexen

Move up the address to the top of the sctp_udencaps structure
like in all other structures. This avoids alignment problems.

MFC after: 3 months.


227930 24-Nov-2011 tuexen

Move up the address to the top of the sctp_paddrthlds structure
like in all other structures. This avoids alignment problems.

MFC after: 3 days.


227831 22-Nov-2011 glebius

style(9) nit


227830 22-Nov-2011 glebius

Fix SIOCDIFADDR semantics: if no address is specified, then delete first one.


227801 21-Nov-2011 glebius

This check isn't needed now, sanity checking done in the beginning.
Missed it in last commit.


227791 21-Nov-2011 glebius

Historically in_control() did not check sockaddrs supplied with
structs ifreq/in_aliasreq and there've been several panics due
to that problem. All these panics were fixed just a couple of
lines above the panicking code.

Take a more general approach: sanity check sockaddrs supplied
with SIOCAIFADDR and SIOCSIF*ADDR at the beginning of the
function and drop all checks below.

One check is now disabled due to strange code in ifconfig(8)
that I've removed recently. I'm going to enable it with next
__FreeBSD_version bump.

Historically in_ifinit() was able to recover from an error
and restore the old address. Nowadays this feature works only
for some error cases, not all of them. I suppose no software
relies on this behavior, so I'd like to remove it, since this
simplifies the code a lot.

Also, move if_scrub() earlier in the in_ifinit(). It is more
correct to wipe routes before removing address from local
address list, and interface address list.

Silence from: bz, brooks, andre, rwatson, 3 weeks


227790 21-Nov-2011 glebius

Be more informative for "unknown hardware address format" message.

Submitted by: Andrzej Tobola <ato iem.pw.edu.pl>


227785 21-Nov-2011 glebius

- Reduce severity to LOG_NOTICE for all ARP events that can be
triggered from a remote machine. The exception is "using my IP address".
- Fix multicast ARP warning: add newline and also log the bad MAC address.

Tested by: Alexander Wittig <wittigal msu.edu>


227755 20-Nov-2011 tuexen

Add support for the SCTP_REMOTE_UDP_ENCAPS_PORT socket option.
Retire the now unused sctp_udp_tunneling_for_client_enable
sysctl variable.

MFC after: 3 months.


227655 18-Nov-2011 tuexen

Cleanup comparison of interface names.

MFC after: 1 month.


227540 15-Nov-2011 tuexen

Set the MTU of a path to an appropriate value if the interface MTU
can't be determined.

MFC after: 3 days.


227489 13-Nov-2011 eadler

- fix duplicate "a a" in some comments

Submitted by: eadler
Approved by: simon
MFC after: 3 days


227486 13-Nov-2011 tuexen

Don't copy uninitialized memory. Also simplify the comparison
of interface names.

MFC after: 3 days.


227459 11-Nov-2011 brooks

In r191367 the need for if_free_type() was removed and a new member
if_alloctype was used to store the original interface type. Take
advantage of this change by removing all existing uses of if_free_type()
in favor of if_free().

MFC after: 1 Month


227458 11-Nov-2011 eadler

- add a missing "be" and "in"
- fix other errors introduced when committing r226436
- add 'function' to a sentence where it makes sense

Submitted by: delphij
Submitted by: dougb
Submitted by: jhb
Approved by: dougb
Approved by: jhb


227320 07-Nov-2011 tuexen

When loading addresses from INITs, always use the correct
local address.

MFC after: 3 days.


227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


227293 07-Nov-2011 ed

Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.

This means that their use is restricted to a single C file.


227266 06-Nov-2011 tuexen

Initialize all components of the sent COOKIE.

MFC after: 3 days.


227207 06-Nov-2011 trociny

Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>


227204 06-Nov-2011 trociny

Fix the typo made in r157474.

MFC after: 3 days


227085 04-Nov-2011 bz

Always use the opt_*.h options for ipfw.ko, not just when
compiled into the kernel.
Do not try to build the module in case of no INET support but
keep #error calls for now in case we would compile it into the
kernel.

This should fix an issue where the module would fail to enable
IPv6 support from the rc framework, but also other INET and INET6
parts being silently compiled out without giving a warning in the
module case.

While here garbage collect unneeded opt_*.h includes.
opt_ipdn.h is not used anywhere but we need to leave the DUMMYNET
entry in options for conditional inclusion in kernel so keep the
file with the same name.

Reported by: pluknet
Reviewed by: pluknet, jhb
MFC After: 3 days


227034 02-Nov-2011 pluknet

Restore sysctl names for tcp_sendspace/tcp_recvspace.

They seem to have been changed unintentionally in r226437, and there was
no mention of renaming in the commit log message.

Reported by: Anton Yuzhaninov <citrin citrin ru>


226869 27-Oct-2011 tuexen

When adding a new remote address using sctp_add_remote_addr(),
return the correct net if requested.

MFC after: 3 days.


226868 27-Oct-2011 tuexen

Send out control chunks which have no specific destination.

MFC after: 3 days.


226713 25-Oct-2011 qingli

Exclude host routes when checking for prefix coverage on multiple
interfaces. A host route has a NULL mask so check for that condition.
I have also been told by developers who customize the packet output
path with direct manipulation of the route entry (or the outgoing
interface to be specific). This patch checks for the route mask
explicitly to make sure custom code will not panic.

PR: kern/161805
MFC after: 3 days


226610 21-Oct-2011 ed

Add missing #includes.

According to POSIX, these two header files should be able to be included
by themselves, not depending on other headers. The <net/if.h> header
uses struct sockaddr when __BSD_VISIBLE=1, while <netinet/tcp.h> uses
integer datatypes (u_int32_t, u_short, etc).

MFC after: 2 months


226454 17-Oct-2011 bz

Add syntactic sugar missed in r226437 and then not added either when moving
things around in r226448 but desperately needed to always make things
compile successfully.

MFC after: 1 week


226448 16-Oct-2011 andre

Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after: 1 week


226447 16-Oct-2011 andre

Remove the ss_fltsz and ss_fltsz_local sysctl's which have
long been superseded by the RFC3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after: 1 week


226437 16-Oct-2011 andre

VNET virtualize tcp_sendspace/tcp_recvspace and change the
type to INT. A long is not necessary as the TCP window is
limited to 2**30. A larger initial window isn't useful.

MFC after: 1 week


226436 16-Oct-2011 eadler

- change "is is" to "is" or "it is"
- change "the the" to "the"

Approved by: lstewart
Approved by: sahil (mentor)
MFC after: 3 days


226433 16-Oct-2011 andre

Update the comment and description of tcp_sendspace and tcp_recvspace
to better reflect their purpose.
MFC after: 1 week


226431 16-Oct-2011 ed

Forward declare mbuf and inpcb.

This fixes a compiler warning at WARNS=6 when including the header files
as follows:

#include <sys/types.h>
#include <netinet/in.h>
#include <netinet/ip_var.h>
#include <netinet/udp.h>
#include <netinet/udp_var.h>


226402 15-Oct-2011 glebius

Add support for IPv4 /31 prefixes, as described in RFC3021.

To run a /31 network, participating hosts MUST drop support
for directed broadcasts, and treat the first and last addresses
on subnet as unicast. The broadcast address for the prefix
should be the link local broadcast address, INADDR_BROADCAST.
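
A small sketch of the broadcast rule described in this entry, with
addresses in host byte order. The function name and constant are
hypothetical; only the /31 special case comes from the text.

```c
#include <stdint.h>

#define SKETCH_INADDR_BROADCAST	0xffffffffu

/*
 * On a /31 both addresses are usable unicast addresses, so there is
 * no directed broadcast; fall back to the link-local broadcast.
 * Otherwise the broadcast address is the prefix with all host bits set.
 */
static uint32_t
sketch_broadcast_addr(uint32_t addr, uint32_t mask)
{
	if (mask == 0xfffffffeu)
		return (SKETCH_INADDR_BROADCAST);
	return (addr | ~mask);
}
```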


226401 15-Oct-2011 glebius

Remove last remnants of classful addressing:

- Remove ia_net, ia_netmask, ia_netbroadcast from struct in_ifaddr.
- Remove net.inet.ip.subnetsarelocal, I bet no one needs it in 2011.
- Fix a bug where we were not forwarding to a host which matches a classful
net address. For example, a router with a 192.168.x.y/16 network attached
would not forward traffic to 192.168.*.0, which are legal IPs in the
CIDR world.
- For compatibility, leave autoguessing of mask based on class.
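
For reference, class-based mask guessing of the kind kept for
compatibility can be sketched as follows. The helper is hypothetical and
works on host byte order addresses; the return value for class D/E
addresses is a choice made for this example only.

```c
#include <stdint.h>

/* Hypothetical helper: guess a netmask from the historical class. */
static uint32_t
sketch_classful_mask(uint32_t addr)
{
	if ((addr & 0x80000000u) == 0)
		return (0xff000000u);	/* class A: leading bit 0 */
	if ((addr & 0xc0000000u) == 0x80000000u)
		return (0xffff0000u);	/* class B: leading bits 10 */
	if ((addr & 0xe0000000u) == 0xc0000000u)
		return (0xffffff00u);	/* class C: leading bits 110 */
	return (0xffffffffu);		/* class D/E: sketch-only choice */
}
```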

Reviewed by: andre, bz, rwatson


226367 14-Oct-2011 glebius

Never switch directly from INIT to MASTER, since this produces
nasty status flaps.

PR: kern/161123
Submitted by: Damien Fleuriot <dam my.gd>
OpenBSD: ip_carp.c, rev. 1.115


226339 13-Oct-2011 glebius

De-spl(9).


226318 12-Oct-2011 np

Make sure the inp wasn't dropped when rexmt let go of the inp and
pcbinfo locks.

Reviewed by: andre@
MFC after: 7 days


226252 11-Oct-2011 tuexen

Use the most significant 6 bits of the dscp instead of the least
significant ones.
This has changed in the latest version of the socket API ID and
provides backwards compatibility and gets it in sync with the
usage of the IP_TOS socket option.
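
To make the bit layout concrete: with this convention the DSCP sits in
the upper six bits of the byte, the same position it has in the IP TOS
byte. The helper names are assumptions for this sketch.

```c
#include <stdint.h>

#define SKETCH_DSCP_MASK	0xfcu	/* upper six bits of the byte */

/* Keep only the DSCP bits of a TOS-style byte. */
static uint8_t
sketch_dscp_bits(uint8_t field)
{
	return ((uint8_t)(field & SKETCH_DSCP_MASK));
}

/* Recover the 6-bit code point number from a TOS-style byte. */
static uint8_t
sketch_dscp_codepoint(uint8_t field)
{
	return ((uint8_t)(field >> 2));
}
```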

MFC after: 3 days.


226224 10-Oct-2011 qingli

All indirect routes will fail the rtcheck, except for a special host
route where the destination IP and the gateway IP are the same. This
special case handling is only meant for backward compatibility reasons.
The last commit introduced a bug in the route check logic, where a
valid special case is treated as an error. This patch fixes that bug
along with some code cleanup.

Suggested by: gleb
Reviewed by: kmacy, discussed with gleb
MFC after: 1 day


226222 10-Oct-2011 tuexen

Get struct sctp_net_route in tune with struct route.
struct route was changed in
http://svn.freebsd.org/changeset/base/225698
and since then SCTP support was broken.
This needs to be MFCed to stable/9 to unbreak SCTP support in 9.0
MFC after: 3 days.


226203 10-Oct-2011 tuexen

When moving an stcb to a new inp and we copy over the list of
bound addresses, update the last used address pointer.
If not, it might result in a crash if the old inp goes away.

MFC after: 3 days.


226168 09-Oct-2011 tuexen

Update the inp stored in a HB-timer when moving an stcb to a new inp.
Use only this stored inp when processing a HB timeout.
This fixes a bug which results in a crash.

MFC after: 3 days.


226120 07-Oct-2011 qingli

Do not try removing an ARP entry associated with a given interface
address if that interface does not support ARP. Otherwise the
system will generate error messages unnecessarily due to the missing
entry.

PR: kern/159602
Submitted by: pluknet
MFC after: 3 days


226114 07-Oct-2011 qingli

Remove the reference held on the loopback route when the interface
address is being deleted. Only the last reference holder deletes the
loopback route. All other delete operations just clear the IFA_RTSELF
flag.

PR: kern/159601
Submitted by: pluknet
Reviewed by: discussed on net@
MFC after: 3 days


226113 07-Oct-2011 andre

Prevent TCP sessions from stalling indefinitely in reassembly
when reaching the zone limit of reassembly queue entries.

When the zone limit was reached not even the missing segment
that would complete the sequence space could be processed
preventing the TCP session forever from making any further
progress.

Solve this deadlock by using a temporary on-stack queue entry
for the missing segment followed by an immediate dequeue again
by delivering the contiguous sequence space to the socket.

Add logging under net.inet.tcp.log_debug for reassembly queue
issues.

Reviewed by: lstewart (previous version)
Tested by: Steven Hartland <killing-at-multiplay.co.uk>
MFC after: 3 days


226105 07-Oct-2011 andre

Add back the IP header length to the total packet length field on
raw IP sockets. It was deducted in ip_input() in preparation for
protocols interested only in the payload.

On raw sockets the IP header should be delivered as it came in
from the network except for the byte order swaps in some fields.
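
The adjustment amounts to adding the header length back to the payload
length. In this sketch the function name is hypothetical, and ip_hl is
assumed to be the header length in 32-bit words, as in the IPv4 header.

```c
#include <stdint.h>

/*
 * Restore the total length after the header length was deducted.
 * ip_hl counts 32-bit words, so the header is ip_hl * 4 bytes long.
 */
static uint16_t
sketch_restore_ip_len(uint16_t payload_len, uint8_t ip_hl)
{
	return ((uint16_t)(payload_len + ((unsigned)ip_hl << 2)));
}
```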

This brings us in line with all other OS'es that provide raw
IP sockets.

Reported by: Matthew Cini Sarreo <mcins1-at-gmail.com>
MFC after: 3 days


226060 06-Oct-2011 attilio

For the INP_TIMEWAIT case, there is no valid tcpcb object tied to the
inpcb object.
Skip the TCP_SIGNATURE check in that case as it is consistent with the
output path (no TCP_SIGNATURE for outgoing packets in TIMEWAIT state)
and also because for TIMEWAIT state the verify may be less effective.

Sponsored by: Sandvine Incorporated
Reported by: rwatson
No objections by: rwatson
MFC after: 3 days


225947 03-Oct-2011 qingli

A system may have multiple physical interfaces, all of which are on the
same prefix. Since a single route entry is installed for the prefix
(without RADIX_MPATH), incoming packets on the interfaces that are not
associated with the prefix route may trigger an error message about
being unable to allocate an LLE entry, and fail L2. This patch makes
sure a valid route is present in the system, allows the aforementioned
condition to exist, and treats it as valid.

Reviewed by: bz
MFC after: 5 days


225946 03-Oct-2011 qingli

This patch allows ARP to work properly in the presence of
self-referencing routes. This patch is a rework of r223862.

Reviewed by: bz, zec
MFC after: 5 days


225793 27-Sep-2011 bz

Unbreak no-ip and no-inet6 module builds with ipfw. For now continue to
build the ip_fw_pfil.c hooks and ipfw even in case of no-ip under the
assumption that the private L2 hook (which hopefully eventually will be a
pfil hook as well) can still be useful.

Allow building the module without inet as well.

Glanced at by: jhb
MFC after: 3 days


225676 19-Sep-2011 tuexen

Cleanup the iterator code, remove code that is never executed.

Approved by: re
MFC after: 1 month.


225635 17-Sep-2011 tuexen

Fix the enabling/disabling of Heartbeats and path MTU
discovery when using the SCTP_PEER_ADDR_PARAMS socket option.
Approved by: re
MFC after: 1 month.


225584 15-Sep-2011 tuexen

Fix a typo introduced in
http://svn.freebsd.org/changeset/base/225571
Reported by Ilya A. Arkhipov.

Approved by: re
MFC after: 1 month.


225571 15-Sep-2011 tuexen

Make sure that SCTP rejects broadcast, multicast and wildcard addresses
as remote addresses.

Approved by: re
MFC after: 1 month.


225559 14-Sep-2011 tuexen

Ensure that 1-to-1 style SCTP sockets can only be connected once.
Allow implicit setup also for 1-to-1 style sockets as described
in the latest version of the socket API ID.

Approved by: re
MFC after: 1 month


225549 14-Sep-2011 tuexen

Fix the handling of the flowlabel and DSCP value in the SCTP_PEER_ADDR_PARAMS
socket option.
Honor the net.inet6.ip6.auto_flowlabel sysctl setting.

Approved by: re (bz)
MFC after: 1 month.


225518 12-Sep-2011 jhb

Allow the ipfw.ko module built with a kernel to honor any IPFIREWALL_*
options defined in the kernel config. This more closely matches the
behavior of other modules which inherit configuration settings from the
kernel configuration during a kernel + modules build.

Reviewed by: luigi
Approved by: re (kib)
MFC after: 1 week


225462 09-Sep-2011 tuexen

Improve implementation of the Nagle algorithm for SCTP:
Don't delay the final fragment of a fragmented user message.

Approved by: re
MFC after: 4 weeks


225223 28-Aug-2011 qingli

When an interface address route is removed from the system, another
route with the same prefix is searched for as a replacement. The
current code did not bypass routes that have non-operational
interfaces. This patch fixes that bug and will find a replacement
route with an active interface.

PR: kern/159603
Submitted by: pluknet, ambrisko at ambrisko dot com
Reviewed by: discussed on net@
Approved by: re (bz)
MFC after: 3 days


225169 25-Aug-2011 bz

Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults. They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption. All values are still tunable
by sysctls.
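
The need for the cast mentioned above can be shown with plain integer
arithmetic. The sketch assumes an mbuf size of 256 bytes and a cluster
size of 2048 bytes, and the function names are made up; with a 2MB limit
the intermediate product is exactly 2**32, which wraps to zero in 32-bit
arithmetic.

```c
#include <stdint.h>

#define SKETCH_MSIZE	256u	/* assumed mbuf size */
#define SKETCH_MCLBYTES	2048u	/* assumed mbuf cluster size */

/* Widen before multiplying so the product cannot wrap. */
static uint32_t
sketch_sb_max_adj(uint32_t sb_max)
{
	return ((uint32_t)((uint64_t)sb_max * SKETCH_MCLBYTES /
	    (SKETCH_MSIZE + SKETCH_MCLBYTES)));
}

/* Same formula in 32-bit arithmetic, shown only to expose the wrap. */
static uint32_t
sketch_sb_max_adj_narrow(uint32_t sb_max)
{
	uint32_t product = sb_max * SKETCH_MCLBYTES;

	return (product / (SKETCH_MSIZE + SKETCH_MCLBYTES));
}
```

With the old 256kB limit both versions agree; the difference only shows
up once the limit is raised to 2MB.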

Suggested by: gnn
Discussed on: arch (Mar and Aug 2011)
MFC after: 3 weeks
Approved by: re (kib)


225046 20-Aug-2011 bz

Fix compilation in case of defined(INET) && defined(IPFIREWALL_FORWARD)
but no INET6.

Reported by: avg
Tested by: avg
MFC after: 4 weeks
X-MFC with: r225044
Approved by: re (kib)


225044 20-Aug-2011 bz

Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regression test utilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from: David Dolson at Sandvine Incorporated
(original version for ipfw fwd IPv6 support)
Sponsored by: Sandvine Incorporated
PR: bin/117214
MFC after: 4 weeks
Approved by: re (kib)


225036 20-Aug-2011 bz

Hide IPv6 next header parsing warnings under the verbose sysctl
so people can possibly disable it when their consoles are flooded,
or enable it for debugging.

MFC after: 2 weeks
Approved by: re (kib)


225034 20-Aug-2011 bz

After r225032 fix logging in a similar way masking the IPv6
more fragments flag off so that offset == 0 checks work properly.

PR: kern/145733
Submitted by: Matthew Luckie (mjl luckie.org.nz)
MFC after: 2 weeks
X-MFC with: r225032
Approved by: re (kib)


225033 20-Aug-2011 bz

If we detect an IPv6 fragment header and it is not the first fragment,
then terminate the loop as we will not find any further headers and
for short fragments this could otherwise lead to a pullup error
discarding the fragment.

PR: kern/145733
Submitted by: Matthew Luckie (mjl luckie.org.nz)
MFC after: 2 weeks
Approved by: re (kib)


225032 20-Aug-2011 bz

ipfw internally checks for offset == 0 to determine whether the
packet is a/the first fragment or not. For IPv6 we have added the
"more fragments" flag as well to be able to determine whether
there will be more, as we do not have the fragment header available
for logging, while for IPv4 this information can be derived directly
from the IPv4 header. This allowed fragmented packets to bypass
normal rules as proper masking was not done when checking offset.
Split variables to not need masking for IPv6 to avoid further errors.
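
The masking problem can be sketched as follows. The field layout assumed
here is the IPv6 fragment header's 16-bit offset/flags word in host byte
order (13 bits of offset, two reserved bits, one "more fragments" bit);
the names are hypothetical and not those used in ipfw.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP6F_OFF_MASK	0xfff8u	/* fragment offset bits */
#define SKETCH_IP6F_MORE_FRAG	0x0001u	/* more-fragments flag */

/* Keep offset and flag in separate variables, so no masking is needed. */
struct sketch_frag {
	uint16_t offset;
	bool more;
};

static struct sketch_frag
sketch_split_offlg(uint16_t offlg)
{
	struct sketch_frag f;

	f.offset = (uint16_t)(offlg & SKETCH_IP6F_OFF_MASK);
	f.more = (offlg & SKETCH_IP6F_MORE_FRAG) != 0;
	return (f);
}

/* A first fragment has offset 0, whatever the flag bit says. */
static bool
sketch_is_first_fragment(uint16_t offlg)
{
	return (sketch_split_offlg(offlg).offset == 0);
}
```

A combined word of 0x0001 (first fragment, more to follow) is non-zero,
so a plain "== 0" test on the unmasked value would misclassify it.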

PR: kern/145733
Submitted by: Matthew Luckie (mjl luckie.org.nz)
MFC after: 2 weeks
Approved by: re (kib)


225030 20-Aug-2011 bz

While not explicitly allowed by RFC 2460, in case there is no
translation technology involved (and that section is suggested to
be removed by Errata 2843), single packet fragments do no harm.

There is another errata under discussion to clarify and allow this.
Meanwhile add a sysctl to allow disabling this behaviour again.
We will treat a single packet fragment (a fragment header added
when not needed) as if there was no fragment header.

PR: kern/145733
Submitted by: Matthew Luckie (mjl luckie.org.nz) (original version)
Tested by: Matthew Luckie (mjl luckie.org.nz)
MFC after: 2 weeks
Approved by: re (kib)


224918 16-Aug-2011 tuexen

Fix the handling of [gs]etsockopt() on unconnected 1-to-1 style sockets.
While there:
* Fix a locking issue in setsockopt() of SCTP_CMT_ON_OFF.
* Fix a bug in setsockopt() of SCTP_DEFAULT_PRINFO, where the pr_value
was ignored.

Approved by: re@
MFC after: 2 months.


224870 14-Aug-2011 tuexen

Add support for the spp_dscp field in the SCTP_PEER_ADDR_PARAMS
socket option. Backwards compatibility is provided by still
supporting the spp_ipv4_tos field.

Approved by: re@
MFC after: 2 months.


224747 10-Aug-2011 kevlo

If RTF_HOST flag is specified, then we are interested in destination
address.

PR: kern/159600
Submitted by: Svatopluk Kraus <onwahe at gmail dot com>
Approved by: re (hrs)


224641 03-Aug-2011 tuexen

The result of a joint work between rrs@ and myself at the IETF:
* Decouple the path supervision using a separate HB timer per path.
* Add support for potentially failed state.
* Bring back RTO.min to 1 second.
* Accept packets on IP-addresses already announced via an ASCONF
* While there: do some cleanups.

Approved by: re@
MFC after: 2 months.


224575 01-Aug-2011 glebius

Add missing break; in r223593.

Submitted by: sem
Pointy hat to: glebius
Approved by: re (kib)


224151 17-Jul-2011 bz

Add spares to the network stack for FreeBSD-9:
- TCP keep* timers
- TCP UTO (adjust from what was there already)
- netmap
- route caching
- user cookie (temporary to allow for the real fix)

Slightly re-shuffle struct ifnet moving fields out of the middle
of spares and to better align.

Discussed with: rwatson (slightly earlier version)


224010 14-Jul-2011 bz

Unbreak no-INET kernels after r223839 adding the needed #ifdef INET.

MFC after: 4 weeks


223965 12-Jul-2011 tuexen

Don't check for SOCK_DGRAM anymore. Also remove multicast
related code which is not necessary anymore.


223963 12-Jul-2011 tuexen

The socket API only specifies SCTP for SOCK_SEQPACKET and
SOCK_STREAM, but not SOCK_DGRAM. So don't register it for
SOCK_DGRAM.
While there, fix some indentation.


223862 08-Jul-2011 zec

Permit ARP to proceed for IPv4 host routes for which the gateway is the
same as the host address. This already works fine for INET6 and ND6.

While here, remove two function pointers from struct lltable which are
only initialized but never used.

MFC after: 3 days


223840 07-Jul-2011 ae

Add back the check for log_arp_permanent_modify that was accidentally
removed in r186119.

PR: kern/154831
MFC after: 1 week


223839 07-Jul-2011 andre

Remove the TCP_SORECEIVE_STREAM compile time option. The use of
soreceive_stream() for TCP still has to be enabled with the loader
tunable net.inet.tcp.soreceive_stream.

Suggested by: trociny and others


223799 05-Jul-2011 cperciva

Remove #ifdef notyet code dating back to 4.3BSD Net/2 (and possibly earlier).

I think the benefit of making the code cleaner and easier to understand
outweighs the humour of leaving this intact (or possibly changing it to
#ifdef not_yet_and_probably_never).

MFC after: 2 weeks


223797 05-Jul-2011 cperciva

Don't allow lro->len to exceed 65535, as this will result in overflow
when len is inserted back into the synthetic IP packet and cause a
multiple of 2^16 bytes of TCP "packet loss".

This improves Linux->FreeBSD netperf bandwidth by a factor of 300 in
testing on Amazon EC2.
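
The overflow described in this entry is easy to reproduce in isolation.
The sketch below uses hypothetical helper names and assumes the merged
length is eventually stored in a 16-bit IP length field.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP_MAXPACKET	65535u	/* largest 16-bit IP length */

/* Only merge another segment if the total still fits in 16 bits. */
static bool
sketch_lro_can_append(uint32_t cur_len, uint32_t seg_len)
{
	return (cur_len + seg_len <= SKETCH_IP_MAXPACKET);
}

/* What a 16-bit length field would hold if the check were skipped. */
static uint16_t
sketch_truncated_len(uint32_t len)
{
	return ((uint16_t)len);
}
```

Merging a 1460-byte segment into a 65000-byte aggregate would give 66460
bytes, which truncates to 924: the other 65536 bytes are silently lost.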

Reviewed by: jfv
MFC after: 2 weeks


223773 04-Jul-2011 gjb

- General grammar and mdoc(7) fixes. [1] [2]
- While here, remove a paragraph about userspace operation that
has been outdated for some time. [2]

PR: 158623
Submitted by: Ben Kudak (kaduk % mit!edu) [1]
Reviewed by: glebius [2]
MFC after: 1 week


223765 04-Jul-2011 eri

pf(4) tags now store the state key but tcp_respond tries to reuse a mbuf as an optimization.
This makes pf find the wrong state and cause errors reported with state mismatches.
Clear the cached state link on the pf(4) tag to avoid the state mismatches.

Approved by: bz


223753 04-Jul-2011 ae

ARP code reuses mbuf from ARP request to make a reply, but it does not
reset rcvif to NULL. Since rcvif is not NULL, ipfw(4) supposes that ARP
replies were received on specified interface.
Reset rcvif to NULL for ARP replies to fix this issue.

PR: kern/131817
Reviewed by: glebius
MFC after: 1 month


223697 30-Jun-2011 tuexen

Add the missing sca_keylength field to the sctp_authkey structure,
which is used by the SCTP_AUTH_KEY socket option.

MFC after: 1 month.


223666 29-Jun-2011 ae

Add new rule actions "call" and "return" to ipfw. They make it
possible to organize subroutines with rules.

The "call" action saves the current rule number in the internal
stack and rules processing continues from the first rule with
specified number (similar to skipto action). If later a rule with
"return" action is encountered, the processing returns to the first
rule with number of "call" rule saved in the stack plus one or higher.
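
The call/return behaviour described above can be modelled with a small
stack of rule numbers. This is a sketch under assumptions: the type and
function names are hypothetical, the stack depth of 16 is illustrative,
and a return value of 0 is used here to signal an empty or full stack.

```c
#include <stddef.h>

#define SKETCH_STACK_DEPTH	16	/* illustrative depth */

struct sketch_callstack {
	unsigned rules[SKETCH_STACK_DEPTH];
	size_t depth;
};

/* "call": remember the calling rule number, continue at the target. */
static unsigned
sketch_call(struct sketch_callstack *s, unsigned cur, unsigned target)
{
	if (s->depth == SKETCH_STACK_DEPTH)
		return (0);
	s->rules[s->depth++] = cur;
	return (target);
}

/*
 * "return": pop the saved rule number; processing resumes at the
 * first rule numbered saved + 1 or higher.
 */
static unsigned
sketch_return(struct sketch_callstack *s)
{
	if (s->depth == 0)
		return (0);
	return (s->rules[--s->depth] + 1);
}
```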

Submitted by: Vadim Goncharov
Discussed by: ipfw@, luigi@


223637 28-Jun-2011 bz

Update packet filter (pf) code to OpenBSD 4.5.

You need to update userland (world and ports) tools
to be in sync with the kernel.

Submitted by: mlaier
Submitted by: eri


223613 27-Jun-2011 tuexen

Add support for SCTP_PR_SCTP_NONE which I missed to add.
This constant is defined in the socket API ID.

MFC after: 2 months.


223593 27-Jun-2011 glebius

Add possibility to pass IPv6 packets to a divert(4) socket.

Submitted by: sem


223437 22-Jun-2011 ae

Export AddLink() function from libalias. It can be used when custom
alias address needs to be specified.
Add inbound handler to the alias_ftp module. It helps handle active
FTP transfer mode for the case with external clients and FTP server behind
NAT. Fix passive FTP transfer case for server behind NAT using redirect with
external IP address different from NAT ip address.

PR: kern/157957
Submitted by: Alexander V. Chernikov


223421 22-Jun-2011 ae

Document PKT_ALIAS_SKIP_GLOBAL option.

Submitted by: Alexander V. Chernikov


223358 21-Jun-2011 ae

Do not use SET_HOST_IPLEN() macro for IPv6 packets.

PR: kern/157239
MFC after: 2 weeks


223326 20-Jun-2011 bz

Fix a KASSERT from r212803 to check the correct length also in case of
IPsec being compiled in and used. Improve reporting by adding the length
fields to the panic message, so that we would have some immediate debugging
hints.

Discussed with: jhb


223261 18-Jun-2011 bz

Remove a these days incorrect comment left from before new-arp.

MFC after: 1 week


223162 16-Jun-2011 tuexen

Add SCTP_DEFAULT_PRINFO socket option.
Fix the SCTP_DEFAULT_SNDINFO socket option: Don't clear the
PR SCTP policy when setting sinfo_flags.

MFC after: 1 month.


223152 16-Jun-2011 tuexen

* Fix the handling of addresses in sctp_sendv().
* Add support for SCTP_SENDV_NOINFO.
* Improve the error handling of sctp_sendv() and sctp_recv().

MFC after: 1 month


223132 15-Jun-2011 tuexen

Add support for the newly added SCTP API.
In particular add support for:
* SCTP_SNDINFO, SCTP_PRINFO, SCTP_AUTHINFO, SCTP_DSTADDRV4, and
SCTP_DSTADDRV6 cmsgs.
* SCTP_NXTINFO and SCTP_RCVINFO cmsgs.
* SCTP_EVENT, SCTP_RECVRCVINFO, SCTP_RECVNXTINFO and SCTP_DEFAULT_SNDINFO
socket options.
* Special association ids (SCTP_FUTURE_ASSOC, ...)
* sctp_recvv() and sctp_sendv() functions.

MFC after: 1 month.


223080 14-Jun-2011 ae

Implement "global" mode for ipfw nat. It is similar to natd(8)
"globalport" option for multiple NAT instances.

If an ipfw rule contains the "global" keyword instead of nat_number, then
for each outgoing packet ipfw_nat looks up translation state in all
configured nat instances. If an entry is found, the packet is aliased
according to that entry, otherwise the packet is passed unchanged.

User can specify "skip_global" option in NAT configuration to exclude
an instance from the lookup in global mode.

PR: kern/157867
Submitted by: Alexander V. Chernikov (previous version)
Tested by: Eugene Grosbein


223077 14-Jun-2011 ae

Sort alias mode flags in the increasing order.


223073 14-Jun-2011 ae

Add IPv6 support to the ipfw uid/gid check. Pass an ip_fw_args structure
to the check_uidgid() function, since it contains all needed arguments
and also a pointer to the mbuf, and now it is possible to use the
in_pcblookup_mbuf() function.

Since I cannot test it for the non-FreeBSD case, I keep this ifdef
unchanged.

Tested by: Alexander V. Chernikov
MFC after: 3 weeks


223049 13-Jun-2011 jhb

Advance the advertised window (rcv_adv) to the currently received data
(rcv_nxt) if we are advertising a zero window. This can be true when ACK'ing
a window probe whose one byte payload was accepted rather than dropped
because the socket's receive buffer was not completely full, but the
remaining space was smaller than the window scale.

This ensures that window probe ACKs satisfy the assumption made in r221346
and closes a window where rcv_nxt could be greater than rcv_adv.

Tested by: trasz, pho, trociny
Reviewed by: silby
MFC after: 1 week


222845 08-Jun-2011 bz

Correct comments and debug logging in ipsec to better match reality.

MFC after: 3 days


222809 07-Jun-2011 ae

Fix indentation.


222806 07-Jun-2011 ae

Make the behaviour of the libalias based in-kernel NAT a bit closer to
how natd(8) does work. natd(8) drops packets only when libalias returns
PKT_ALIAS_IGNORED and "deny_incoming" option is set, but ipfw_nat
always did drop packets that were not aliased, even if they should
not be aliased and just are going through.

PR: kern/122109, kern/129093, kern/157379
Submitted by: Alexander V. Chernikov (previous version)
MFC after: 1 month


222787 06-Jun-2011 bz

Unbreak kernels with non-default PCBGROUP included but no WITNESS.
Rather than including lock.h in in_pcbgroup.c in right order, fix it
for all consumers of in_pcb.h by further header file pollution under
#ifdef KERNEL.

Reported by: Pan Tsu (inyaoo gmail.com)


222748 06-Jun-2011 rwatson

Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.

Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).

Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.

Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.

Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).

Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.

Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
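
A minimal userland sketch of the group-selection idea in r222748: a
connection's 4-tuple is hashed and the hash picks one connection group.
The struct, the function names and the mixing function are all
hypothetical; in particular the hash below is a placeholder and not
RSS Toeplitz, and wildcard sockets (members of every group) are not
modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-tuple; field names are illustrative only. */
struct four_tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/*
 * Placeholder mixing function.  This is NOT Toeplitz; it only
 * stands in for "some hash of the 4-tuple".
 */
static uint32_t
tuple_hash(const struct four_tuple *t)
{
	uint32_t h = t->laddr ^ t->faddr;

	h ^= ((uint32_t)t->lport << 16) | t->fport;
	h ^= h >> 16;
	h *= 0x45d9f3bu;
	h ^= h >> 16;
	return (h);
}

/* Map a connection to one of ngroups connection groups. */
static unsigned
pcbgroup_index(const struct four_tuple *t, unsigned ngroups)
{
	assert(ngroups > 0);
	return (tuple_hash(t) % ngroups);
}
```

Because the index depends only on the 4-tuple, every packet of a given
connection lands in the same group, which is what gives a group its
effective CPU affinity when input processing is aligned the same way.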


222742 06-Jun-2011 ae

Do not return EINVAL when user does `ipfw set N flush` on an empty set.

MFC after: 2 weeks


222732 06-Jun-2011 hrs

- Implement RDNSS and DNSSL options (RFC 6106, IPv6 Router Advertisement
Options for DNS Configuration) into rtadvd(8) and rtsold(8). DNS
information received by rtsold(8) will go to resolv.conf(5) by
resolvconf(8) script. This is based on work by J.R. Oldroyd (kern/156259)
but revised extensively[1].

- rtadvd(8) now supports "noifprefix" to disable gathering on-link prefixes
from interfaces when no "addr" is specified[2]. An entry in rtadvd.conf
with "noifprefix" + no "addr" generates an RA message with no prefix
information option.

- rtadvd(8) now supports RTM_IFANNOUNCE message to fix crashes when an
interface is added or removed.

- Correct bogus ND_OPT_ROUTE_INFO value to one in RFC 4191.

Reviewed by: bz[1]
PR: kern/156259 [1]
PR: bin/152458 [2]


222691 04-Jun-2011 rwatson

Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


222690 04-Jun-2011 rwatson

IP divert sockets use their inpcbinfo for port reservation, although not
for lookup. I missed its call to in_pcbbind() when preparing previous
patches, which would lead to a lock assertion failure (although not
an actual race condition, due to global pcbinfo locks providing
required synchronisation -- in this particular case only). This change
adds the missing locking of the pcbhash lock.

(Existing comments in the ipdivert code question the need for using the
global hash to manage the namespace, as really it's a simple port
namespace and not an address/port namespace. Also, although in_pcbbind
is used to manage reservations, the hash tables aren't used for lookup.
It might be a good idea to make them use hashed lookup, or to use a
different reservation scheme.)

Reviewed by: bz
Reported by: Kristof Provost <kristof at sigsegv.be>
Sponsored by: Juniper Networks


222602 02-Jun-2011 rwatson

Do not leak the pcbinfohash lock in the case where in6_pcbladdr() returns
an error during TCP connect(2) on an IPv6 socket.

Submitted by: bz
Sponsored by: Juniper Networks, Inc.


222582 01-Jun-2011 ae

O_FORWARD_IP is the only action which depends on the result of the lookup
of dynamic rules. We are doing forwarding in the following cases:
o For the simple ipfw fwd rule, e.g.

fwd 10.0.0.1 ip from any to any out xmit em0
fwd 127.0.0.1,3128 tcp from any to any 80 in recv em1

o For the dynamic fwd rule, e.g.

fwd 192.168.0.1 tcp from any to 10.0.0.3 3333 setup keep-state

When this rule triggers it creates a dynamic rule, but this
dynamic rule should forward packets only in forward direction.

o And the last case that did not work before - simple fwd rule which
triggers when some dynamic rule is already executed.

PR: kern/147720, kern/150798
MFC after: 1 month


222560 01-Jun-2011 ae

Hide some debug messages under debug macro.

MFC after: 1 week


222559 01-Jun-2011 ae

Hide useless warning under debug macro.

PR: kern/69963
MFC after: 1 week


222503 30-May-2011 bz

Unbreak NOINET kernels after r222488.

Reviewed by: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems!
Pointy hat: to myself for missing this during review?


222488 30-May-2011 rwatson

Decompose the current single inpcbinfo lock into two locks:

- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
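
The "exactly one of these flags" rule from r222488 can be sketched as a
small validity check. The bit values below are illustrative assumptions
(consult in_pcb.h for the real definitions), and lookup_flags_valid()
and lookup_lock_mode() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Illustrative bit values; check in_pcb.h for the real ones. */
#define INPLOOKUP_WILDCARD	0x1
#define INPLOOKUP_RLOCKPCB	0x2
#define INPLOOKUP_WLOCKPCB	0x4
#define INPLOOKUP_MASK		(INPLOOKUP_WILDCARD | \
				    INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)

/* Returns 1 when exactly one lock flag is set and no unknown bits. */
static int
lookup_flags_valid(int flags)
{
	int lock = flags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB);

	if ((flags & ~INPLOOKUP_MASK) != 0)
		return (0);
	return (lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
}

/* Hypothetical helper: which lock would the lookup hand back? */
static char
lookup_lock_mode(int flags)
{
	assert(lookup_flags_valid(flags));
	return ((flags & INPLOOKUP_WLOCKPCB) ? 'w' : 'r');
}
```

INPLOOKUP_WILDCARD may be combined with either lock flag, but passing
both lock flags, or neither, is rejected.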


222474 30-May-2011 ae

Wrap long line.

MFC after: 2 weeks


222473 30-May-2011 ae

Add tablearg support for ipfw setfib.

PR: kern/156410
MFC after: 2 weeks


222459 29-May-2011 tuexen

Get rid of unused functions.

MFC after: 1 week.


222438 29-May-2011 qingli

Supply the LLE_STATIC flag bit to in_ifscrub() when scrubbing interface
address so that proper clean up will take place in the routing code.
This patch fixes the bootp panic on startup problem. Also, added more
error handling and logging code in function in_scrubprefix().

MFC after: 5 days


222272 25-May-2011 bz

Add FEATURE() definitions for IPv4 and IPv6 so that we can use
feature_present(3) to dynamically decide whether to use one or the
other family.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 10 days


222251 24-May-2011 rwatson

An inpcb lock is no longer required in in_pcbref() since the move to
refcount(9).

MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.


222217 23-May-2011 rwatson

Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:

(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.

(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.

(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.

This may well be safe to MFC, but some more KBI analysis is required.

Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
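
Point (1) of r222217 relies on the reference count being updated
atomically, so that no write lock is needed just to take or drop a
reference. A userland analogue using C11 atomics might look like the
sketch below; struct pcb_sketch, pcb_ref() and pcb_rele() are
hypothetical names and this is not the refcount(9) implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object carrying only a reference count. */
struct pcb_sketch {
	atomic_uint refcount;
};

/* Take a reference; the caller must already hold one. */
static void
pcb_ref(struct pcb_sketch *p)
{
	unsigned old = atomic_fetch_add(&p->refcount, 1);

	assert(old > 0);
}

/* Drop a reference; returns true when the last one went away. */
static bool
pcb_rele(struct pcb_sketch *p)
{
	unsigned old = atomic_fetch_sub(&p->refcount, 1);

	assert(old > 0);
	return (old == 1);
}
```

Only the caller that sees pcb_rele() return true may free the object;
every other caller just drops its reference and moves on.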


222215 23-May-2011 rwatson

Move from passing a wildcard boolean to a general set of lookup flags into
in_pcb_lport(), in_pcblookup_local(), and in_pcblookup_hash(), and similarly
for IPv6 functions. In the future, we would like to support other flags
relating to locking strategy.

This change doesn't appear to modify the KBI in practice, as callers already
passed in INPLOOKUP_WILDCARD rather than a simple boolean.

MFC after: 3 weeks
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.


222213 23-May-2011 rwatson

A number of quite incremental refinements to struct inpcbinfo's definition:

(1) Add a locking guide for inpcbinfo.
(2) Annotate inpcbinfo fields with synchronisation information; not all
annotations are 100% satisfactory.
(3) Reorder inpcbinfo fields so that the lock is at the head of the
structure, and close to fields it protects.
(4) Sort fields that will eventually be hashlock/pcbgroup-related together
even though they remain locked by ipi_lock for now.

Reviewed by: bz
Sponsored by: Juniper Networks
X-MFC after: KBI analysis required


222143 20-May-2011 qingli

The statically configured (permanent) ARP entries are removed when an
interface is brought down, even though the interface address is still
valid. This patch maintains the permanent ARP entries as long as the
interface address (having the same prefix as that of the ARP entries)
is valid.

Reviewed by: delphij
MFC after: 5 days


222077 18-May-2011 tuexen

Unbreak INET-less build.
Reported by bz@
MFC after: 1 week


222029 17-May-2011 tuexen

Copy out the mtu when calling getsockopt() with SCTP_GET_PEER_ADDR_INFO.

MFC after: 1 week.


222028 17-May-2011 tuexen

Fix whitespacing.
Reported by scf@

MFC after: 1 week.


221904 14-May-2011 tuexen

Fix the source address selection for boundall sockets
when sending INITs to a global IPv4 address having
only private IPv4 address.
Allow the usage of a private address and make sure
that no other private address will be used by the
association.
Initial work was done by rrs@.

MFC after: 1 week.


221891 14-May-2011 jhb

Oops, fix order of sequence numbers in KASSERT()'s to catch negative
receive windows to match the labels in the panic message.

Submitted by: trociny


221690 09-May-2011 mav

Refactor TCP ISN increment logic. Instead of firing callout at 100Hz to
keep constant ISN growth rate, do the same directly inside tcp_new_isn(),
taking into account how much time (ticks) passed since the last call.

On my test systems this decreases idle interrupt rate from 140Hz to 70Hz.
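
The idea in r221690 can be sketched as follows: rather than bumping an
offset from a periodic callout, advance it on demand in proportion to
the ticks elapsed since the previous call. The struct, the function
name and the growth-rate constant are assumptions for illustration;
the real tcp_new_isn() also mixes in a keyed hash, which is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define ISN_BYTES_PER_SECOND	1048576u	/* assumed growth rate */

/* Hypothetical state kept between calls. */
struct isn_state {
	uint32_t offset;	/* running ISN offset */
	int	 last_ticks;	/* tick counter at the previous call */
};

/* Advance the offset by the time elapsed since the last call. */
static uint32_t
isn_advance(struct isn_state *st, int now_ticks, int hz)
{
	uint32_t elapsed;

	assert(hz > 0);
	/* Unsigned subtraction stays correct when ticks wraps. */
	elapsed = (uint32_t)now_ticks - (uint32_t)st->last_ticks;
	st->offset += ISN_BYTES_PER_SECOND / (uint32_t)hz * elapsed;
	st->last_ticks = now_ticks;
	return (st->offset);
}
```

With no calls there is no work at all, which is consistent with the
lower idle interrupt rate reported above.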


221627 08-May-2011 tuexen

Fix a locking issue showing up on Mac OS X when subscribing to
authentication events. DTLS/SCTP renegotiations trigger the bug.

MFC after: 2 weeks.


221549 06-May-2011 tuexen

Change the name of an internal structure, since the name
is used by a structure of the (new) SCTP API.

MFC after: 1 week.


221521 06-May-2011 ae

Convert delay parameter back to ms when reporting to user.

PR: 156838
MFC after: 1 week


221460 04-May-2011 tuexen

Implement Resource Pooling V2 and an MPTCP like congestion
control.
Based on a patch received from Martin Becke.

MFC after: 2 weeks.


221411 03-May-2011 tuexen

Remove code without any effect.


221410 03-May-2011 tuexen

Add a missing break. This bug was introduced in r221249.

MFC after: 1 week


221346 02-May-2011 jhb

Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowly draining data from the receive buffer), then the single byte
of data in the window probe is accepted. However, this can cause rcv_nxt
to be greater than rcv_adv. This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greater than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1. On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long. In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greater than rcv_adv.

Reviewed by: bz
MFC after: 1 month
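
The arithmetic behind r221346 can be reproduced in userland. The sketch
below uses int64_t in place of a 64-bit long so that it behaves the
same on every platform; the function names are hypothetical and the
code only illustrates the wraparound, not the actual kernel fix.

```c
#include <assert.h>
#include <stdint.h>

/* TCP sequence numbers are 32-bit and wrap around. */
typedef uint32_t tcp_seq;

/*
 * Buggy form: the unsigned difference wraps to 2^32 - 1 when rcv_nxt
 * is one byte ahead of rcv_adv, and widening it keeps that value.
 */
static int64_t
recvwin_buggy(tcp_seq rcv_adv, tcp_seq rcv_nxt)
{
	return ((int64_t)(rcv_adv - rcv_nxt));
}

/*
 * Guarded form: read the difference as a signed 32-bit quantity and
 * treat a negative result as "no window left".
 */
static int64_t
recvwin_guarded(tcp_seq rcv_adv, tcp_seq rcv_nxt)
{
	int32_t diff = (int32_t)(rcv_adv - rcv_nxt);

	return (diff > 0 ? diff : 0);
}

/* Asserting form: the caller promises rcv_nxt <= rcv_adv. */
static int64_t
recvwin_asserted(tcp_seq rcv_adv, tcp_seq rcv_nxt)
{
	int32_t diff = (int32_t)(rcv_adv - rcv_nxt);

	assert(diff >= 0);
	return (diff);
}
```

The guarded and asserting forms correspond to the two remedies named in
the commit message: treat the window as full, or assert that it is not
negative.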


221328 02-May-2011 tuexen

Some more cleanups related to a kernel without INET.

MFC after: 1 week


221264 30-Apr-2011 bz

Fix a mismerge from p4 in that in_localaddr() is not available without INET.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


221251 30-Apr-2011 tuexen

Remove some leftover debug code.

MFC after: 1 week


221250 30-Apr-2011 bz

Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness. To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


221249 30-Apr-2011 tuexen

Improve compilation of SCTP code without INET support.
Some bugs were fixed while doing this:
* ASCONF-ACK messages might use wrong port number when using
IPv6.
* Checking for additional addresses takes the correct address
into account and also does not do more comparisons than
necessary.

This patch is based on one received from bz@ who was
sponsored by The FreeBSD Foundation and iXsystems.

MFC after: 1 week


221248 30-Apr-2011 bz

Make the UDP code compile without INET. Expose udp_usrreq.c to IPv6 only
as well compiling out most functions adding or extending #ifdef INET
coverage.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


221247 30-Apr-2011 bz

Make the PCB code compile without INET support by adding #ifdef INETs
and correcting few #includes.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


221209 29-Apr-2011 jhb

TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer. However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received during the first retransmit timeout window. As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd. However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh. In effect, the socket's send window
is permanently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb. The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by: bz
MFC after: 2 weeks
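
The flag protocol described in r221209 can be sketched in a few lines.
The struct layout, the function names and the flag's bit value are
illustrative assumptions; congestion-window reduction on timeout and
the time-window check on the ACK are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TF_PREVVALID	0x1u	/* illustrative bit value */

/* Hypothetical cut-down control block. */
struct tcb_sketch {
	uint32_t flags;
	int	 rxtshift;
	uint32_t snd_cwnd, snd_ssthresh;
	uint32_t snd_cwnd_prev, snd_ssthresh_prev;
};

/* Retransmit timeout: save state on the first one only. */
static void
rexmt_timeout(struct tcb_sketch *tp)
{
	assert(tp != NULL);
	tp->rxtshift++;
	if (tp->rxtshift == 1) {
		tp->snd_cwnd_prev = tp->snd_cwnd;
		tp->snd_ssthresh_prev = tp->snd_ssthresh;
		tp->flags |= TF_PREVVALID;
	} else
		tp->flags &= ~TF_PREVVALID;
}

/* Entering persist invalidates any saved values. */
static void
enter_persist(struct tcb_sketch *tp)
{
	assert(tp != NULL);
	tp->flags &= ~TF_PREVVALID;
}

/* Restore saved values only if they are known to be valid. */
static int
bad_rexmt_recover(struct tcb_sketch *tp)
{
	assert(tp != NULL);
	if ((tp->flags & TF_PREVVALID) == 0)
		return (0);
	tp->snd_cwnd = tp->snd_cwnd_prev;
	tp->snd_ssthresh = tp->snd_ssthresh_prev;
	tp->flags &= ~TF_PREVVALID;
	return (1);
}
```

A socket that has only ever been in the persist state never has the
flag set, so its zero-initialised *_prev fields are never copied back.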


221134 27-Apr-2011 bz

MfP4 CH=192029:

Expose ip_icmp.c to INET6 as well and only export badport_bandlim()
along with the two sysctls in the non-INET case.
The bandlim types work for all cases I reviewed in IPv6 as well and
the sysctls are available as we export net.inet.* from in_proto.c.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


221131 27-Apr-2011 bz

MfP4 CH=192004:

Move ip_defttl to raw_ip.c where it is actually used. In an IPv6
only world we do not want to compile ip_input.c in for that and
it is a shared default with INET6.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


221130 27-Apr-2011 bz

Make various (pseudo) interfaces compile without INET in the kernel
adding appropriate #ifdefs. For module builds the framework needs
adjustments for at least carp.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


221023 25-Apr-2011 attilio

Add the possibility to verify MD5 hash of incoming TCP packets.
As this is a costly function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, bz
MFC after: 2 weeks


221021 25-Apr-2011 bz

Be less strict on includes than in r220746. We need in.h for both
INET or INET6 as it holds all the IPPROTO_* definitions needed
for the SYSCTL_NODE definitions.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 5 days


220914 21-Apr-2011 glebius

Use size_t for sopt_valsize.

Submitted by: Brandon Gooch <jamesbrandongooch gmail.com>


220880 20-Apr-2011 bz

MFp4 CH=191760:

When compiling out INET we still need the initialization routines
as well as the tuning and monitoring sysctls shared with IPv6.

Move the two send/recvspace variables up from the middle of the
file to ease compiling out the INET only code.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 3 days


220879 20-Apr-2011 bz

MFp4 CH=191470:

Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c. The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days


220878 20-Apr-2011 bz

MFp4 CH=191466:

Move fw_one_pass to where it belongs: it is a property of ipfw,
not of ip_input.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 3 days


220837 19-Apr-2011 glebius

- Rewrite functions that copyin/out NAT configuration, so that they
calculate required memory size dynamically.
- Fix races on chain re-lock.
- Introduce new field to ip_fw_chain - generation count. Now utilized
only in the NAT configuration, but can be utilized wider in ipfw.
- Get rid of NAT_BUF_LEN in ip_fw.h

PR: kern/143653


220832 19-Apr-2011 ae

Add sysctl handlers for net.inet.ip.dummynet.hash_size, .pipe_byte_limit
and .pipe_slot_limit oids to prevent setting incorrect values.

MFC after: 2 weeks


220831 19-Apr-2011 ae

The ipdn_bound_var() function is designed to bound a variable between
a specified minimum and maximum. When the specified default value
is out of bounds, it does not work as expected and does not limit the
variable. Check that the default value is in range and limit it if needed.
Also bump max_hash_size value to 65536 to correspond with manual page.

PR: kern/152887
MFC after: 2 weeks
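
A minimal sketch of the bounding logic that r220831 describes, with the
default clamped into range before it is used. bound_var() is a
hypothetical stand-in and does not reproduce the signature or the
logging of the real ipdn_bound_var().

```c
#include <assert.h>

/*
 * Bound *v to [lo, hi].  A value below lo is replaced by the default,
 * but only after the default itself has been forced into range.
 */
static int
bound_var(int *v, int dflt, int lo, int hi)
{
	assert(lo <= hi);
	if (dflt < lo)
		dflt = lo;
	if (dflt > hi)
		dflt = hi;
	if (*v < lo)
		*v = dflt;
	else if (*v > hi)
		*v = hi;
	return (*v);
}
```

Without the two checks on dflt, an out-of-range default would be
written into *v unchanged, which is the failure the commit fixes.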


220812 19-Apr-2011 ae

Use M_WAITOK instead of M_WAIT for malloc. Remove unneeded checks.

MFC after: 1 week


220800 18-Apr-2011 glebius

LibAliasInit() should allocate memory with M_WAITOK flag. Modify it
and its callers.


220796 18-Apr-2011 glebius

Pull up to TCP header length before matching against 'tcpopts'.

PR: kern/156180
Reviewed by: luigi


220794 18-Apr-2011 jhb

When checking to see if a window update should be sent to the remote peer,
don't force a window update if the window would not actually grow due to
window scaling. Specifically, if the window scaling factor is larger than
2 * MSS, then after the local reader has drained 2 * MSS bytes from the
socket, a window update can end up advertising the same window. If this
happens, the supposed window update actually ends up being a duplicate ACK.
This can result in an excessive number of duplicate ACKs when using a
higher maximum socket buffer size.

Reviewed by: bz
MFC after: 1 month
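
The effect described in r220794 comes from the right shift applied to
the window before it goes on the wire. The sketch below compares the
scaled values before and after the reader drains data; the function
names are hypothetical and the 16-bit cap on the window field is
ignored for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WINSHIFT	14	/* largest window scale shift TCP allows */

/* Value that would be placed in the header's window field. */
static uint32_t
wire_window(uint32_t win_bytes, unsigned scale)
{
	assert(scale <= MAX_WINSHIFT);
	return (win_bytes >> scale);
}

/* A window update is only useful if the scaled value grows. */
static int
window_update_useful(uint32_t old_win, uint32_t new_win, unsigned scale)
{
	return (wire_window(new_win, scale) > wire_window(old_win, scale));
}
```

For example, with a shift of 14 the window is advertised in 16384-byte
units, so draining 2 * 1460 bytes from a 32768-byte window leaves the
advertised value unchanged and the "update" would be a duplicate ACK.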


220746 17-Apr-2011 bz

Make in_proto.c dependent on either inet or inet6.

While it does not provide any functionality for IPv6, it provides
the sysctl nodes for net.inet.* that a lot of functionality shared
between IPv4 and IPv6 depends on. We cannot change these anymore
without breaking a lot of management and tuning.

In case of IPv6 only, we compile out everything but the sysctl node
declarations.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC After: 5 days


220620 14-Apr-2011 trasz

Refactor udp_input(), moving calls to u_tun_func() into udp_append().

Obtained from: Wheel Systems Sp. z o.o.
Reviewed by: bz@


220619 14-Apr-2011 bz

The mbuf_frag_size always was and is file local and not queried from base
user space tools via kvm. Mark it static.

MFC after: 3 days


220592 13-Apr-2011 pluknet

Staticize malloc types.

Approved by: lstewart
MFC after: 1 week


220568 12-Apr-2011 ae

Restore previous behaviour - always match the rule when we are doing tagging,
even when the tag already exists.

Reported by: Vadim Goncharov
MFC after: 1 week


220560 12-Apr-2011 lstewart

Use the full and proper company name for Swinburne University of Technology
throughout the source tree.

Requested by: Grenville Armitage, Director of CAIA at Swinburne University of
Technology
MFC after: 3 days


220428 07-Apr-2011 jfv

Port of the LRO fix from mxge driver to the generic
LRO code. Thanks to Andrew Gallatin for the change.

MFC after: 7 days


220211 31-Mar-2011 ae

Fill up src_port and dst_port variables for SCTP over IPv4.

PR: kern/153415
MFC after: 1 week


220204 31-Mar-2011 ae

Fix malloc types.

MFC after: 1 week


220203 31-Mar-2011 ae

Fix a memory leak. Memory that is allocated for schedulers hash table
was not freed.

PR: kern/156083
MFC after: 1 week


220156 30-Mar-2011 jhb

Clamp the initial advertised receive window when responding to a SYN/ACK
to the maximum allowed window. Growing the window too large would cause
an underflow in the calculations in tcp_output() to decide if a window
update should be sent which would prevent the persist timer from being
started if data was pending and the other end of the connection advertised
an initial window size of 0.

PR: kern/154006
Submitted by: Stefan `Sec` Zehl sec 42 org
Reviewed by: bz
MFC after: 1 week
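
The clamp in r220156 amounts to limiting the initial window to the
largest value the scaled 16-bit window field can express. A sketch,
assuming a maximum unscaled window of 65535 and a hypothetical helper
name:

```c
#include <assert.h>
#include <stdint.h>

#define MAXWIN		65535u	/* largest unscaled TCP window */
#define MAX_WINSHIFT	14	/* largest window scale shift TCP allows */

/* Limit win to what can be advertised with the given scale. */
static uint32_t
clamp_initial_window(uint32_t win, unsigned rcv_scale)
{
	uint32_t maxwin;

	assert(rcv_scale <= MAX_WINSHIFT);
	maxwin = MAXWIN << rcv_scale;
	return (win < maxwin ? win : maxwin);
}
```

With no window scaling a 1000000-byte buffer is clamped to 65535,
while a shift of 7 (maximum 8388480) leaves it untouched.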


220105 28-Mar-2011 weongyo

Covers values if (BYTES_THIS_ACK(tp, th) / tp->t_maxseg) value is from
2.0 to 3.0.

Reviewed by: lstewart


219828 21-Mar-2011 pluknet

Reference ifaddr object before unlocking as it can be freed
from another context at the moment of later access.

PR: kern/155555
Submitted by: Andrew Boyer <aboyer att averesystems.com>
Approved by: avg (mentor)
MFC after: 2 weeks


219819 21-Mar-2011 jeff

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


219779 19-Mar-2011 bz

Properly check for an IPv4 socket after r219579.

In some cases, such as udp6_connect() without an earlier bind(2) to an
address, with v4-mapped sockets allowed and a non-mapped destination
address, we can end up here with both v4 and v6 indicated:
inp_vflag = (INP_IPV4|INP_IPV6|INP_IPV6PROTO)

In that case however laddrp is NULL as the IPv6 path does not
pass in a copy currently.

Reported by: Pawel Worach (pawel.worach gmail.com)
Tested by: Pawel Worach (pawel.worach gmail.com)
MFC after: 6 days
X-MFC with: r219579


219579 12-Mar-2011 bz

Merge the two identical implementations for local port selections from
in_pcbbind_setup() and in6_pcbsetport() in a single in_pcb_lport().

MFC after: 2 weeks


219397 08-Mar-2011 rrs

Tunes and fixes the new DC-CC to seem to hit the
right mix. Still may need some tweaks but it
appears to almost not give away too much to an
RFC2581 flow, but can really minimize the amount of
buffers used in the net.

MFC after: 3 months


219120 01-Mar-2011 rrs

Adds a new Congestion Control that helps reduce
the RTT that a flow will build up in buffers in
transit. It is a slight modification to RFC2581
but is more friendly i.e. less aggressive.

MFC after: 3 months


219071 26-Feb-2011 dim

Fix breakage in sys/netinet/sctp_sysctl.c, introduced by r219057. If
SCTP_HAS_RTTC is not defined, this file fails to compile. Insert the
necessary #ifdefs to make it work.

Pointy hat to: rrs


219057 26-Feb-2011 rrs

Improvements to CC modules:
1) Add four new points that allow you to get more information
to cc algo's
2) Fix the case where user changes module on an existing TCB, in
such a case, the initialization module needs to be called on all nets.
3) Move htcp_cc structure to a union that other modules can use.
4) Add 5th point for get/set socket options for cc_module specific options

MFC after: 2 months


219014 24-Feb-2011 tuexen

* Fix several bugs where the scaled versions of srtt and rttvar
were used incorrectly.
* Use appropriate variable names for RTO instead of RTT.

MFC after: 3 months.


219013 24-Feb-2011 tuexen

* Cleanup the code computing the retransmission timeout.
* Fix an initialization bug for the scaled variance of the RTO.

MFC after: 3 months.


218909 21-Feb-2011 brucec

Fix typos - remove duplicate "the".

PR: bin/154928
Submitted by: Eitan Adler <lists at eitanadler.com>
MFC after: 3 days


218818 18-Feb-2011 tuexen

Bugfix: Get per vnet sysctl variables and statistics working.

MFC after: 3 months.


218757 16-Feb-2011 bz

Mfp4 CH=177274,177280,177284-177285,177297,177324-177325

VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.

While this reduces the number of vnet recursions, in some places such as
NFS, POSIX local sockets and some netgraph code, recursions are
impossible to fix.

The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec

Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks


218741 16-Feb-2011 pluknet

Bump dummynet module version to meet dummynet schedulers' requirements,
and thus unbreak loading dummynet.ko via /boot/loader.conf.

Reported by: rihad <rihad att mail.ru> on freebsd-net
Approved by: kib (mentor)


218641 13-Feb-2011 rrs

Fix a bug reported by Jonathan Leighton in his web-sctp testing
at the Univ-of-Del. Basically, when a 1-to-1 socket did a
socket/bind/send(data)/close sequence and the timing was right,
we would dereference a socket that is NULL.

MFC after: 1 month


218639 13-Feb-2011 tuexen

Fix several bugs related to stream scheduling.

Obtained from: Robin Seggelmann
MFC after: 3 months.


218629 13-Feb-2011 deischen

Oops, revert an accidental local change that got added in
my last commit (r218627). No damage was done in the last
commit, just some duplicated code was added (which is now
removed).


218627 13-Feb-2011 deischen

Allow the SO_SETFIB socket option to select the default (0)
routing table.

Reviewed by: julian


218521 10-Feb-2011 tuexen

Remove addresses from endpoint when there are no associations.
This fixes a bug reported by brucec@.

MFC after: 3 months.


218400 07-Feb-2011 tuexen

Fix bugs related to M_FLOWID:
* Store the flowid when receiving an SCTP/IPv6 packet.
* Store the flowid when receiving an SCTP packet with wrong CRC.
* Initialize flowid correctly.
* Put test code under INVARIANTS.
MFC after: 3 months.


218393 07-Feb-2011 rrs

If not set (due to some error Michael is working on
fixing) set it for the net.

MFC after: 3 months


218392 07-Feb-2011 rrs

1) Track when flowid does get set.
MFC after: 3 months


218371 06-Feb-2011 rrs

1) Use the same scheme Michael and I discussed for selecting a flowid
2) If flowid is not set, arrange so it is stored.
3) If flowid is set by lower layer, use it.

MFC after: 3 Months


218360 05-Feb-2011 luigi

correct the 'output_time' of packets generated by dummynet.
In the dec.2009 rewrite I introduced a bug, using for the
computation the arrival time instead of the time the packet
has exited from the queue.
The bandwidth computation was still correct because it is
computed elsewhere, but traffic was sent out in bursts.

The bug is also present in RELENG_8 after dec.2009

Thanks to Daikichi Osuga for investigating, finding and fixing the
bug with detailed graphs of the behaviour before and after the fix.

Submitted by: Daikichi Osuga
MFC after: 2 weeks


218335 05-Feb-2011 tuexen

Add support for M_FLOWID.


218319 05-Feb-2011 rrs

1) Typo correction in comments and one spacing change.
2) Mass update to all copyrights.
MFC after: 3 Months


218271 04-Feb-2011 jhb

When turning off TCP_NOPUSH, only call tcp_output() to immediately flush
any pending data if the connection is established.

Submitted by: csjp
Reviewed by: lstewart
MFC after: 1 week


218269 04-Feb-2011 rrs

1) Fix cpu mapping per JB's suggestions
2) Fix it so INIT's don't always end up on CPU0

MFC after: 3 months


218264 04-Feb-2011 brucec

Fix typo (Tuneable -> Tunable).


218241 03-Feb-2011 tuexen

Fix several bugs in the stream schedulers.
From Robin Seggelmann.

MFC after: 3 months.


218235 03-Feb-2011 tuexen

Make sure that changing the ECN sysctl does not affect
existing associations and endpoints.

MFC after: 3 months.


218232 03-Feb-2011 rrs

1) Move per John Baldwin to mp_maxid
2) Some signed/unsigned errors found by Mac OS compiler (from Michael)
3) A couple of copyright updates on the affected files.

MFC after: 3 months


218219 03-Feb-2011 rrs

Fix the per CPU stats so that:
1) They don't use the giant "MAX_CPU" define and instead
are allocated dynamically based on mp_ncpus
2) Will zero with the netstat -z -s -p sctp
3) Will be properly handled by both the sctp_init and finish
(the multi-net stuff was incorrectly bzero'ing in sctp_init
the wrong size.. the bzero is now moved to the right places).
And of course the free is put in at the very end.

MFC after: 3 Months


218211 03-Feb-2011 rrs

Adds an experimental option to create a pool of
threads. These serve as input threads and are queued
packets based on the V-tag number. This is similar to
what a modern card can do with queues for TCP... but
alas modern cards know nothing about SCTP.

MFC after: 3 months (maybe)


218186 02-Feb-2011 rrs

1) Allow a chunk to track the cwnd it was at when sent.
2) Add separate max-bursts for retransmit and hb. These
are set to sysctlable values but not settable via the
socket api. This makes sure we don't blast out HB's or
fast-retransmits.
3) Determine on the first data transmission on a net if
it's local-lan (by being under or over a RTT). This
can later be used to think about different algorithms
based on locallan vs big-i (experimental)
4) The cwnd should NOT be allowed to grow when an ECNEcho
is seen (TCP has this same bug). We fix this in SCTP
so an ECNe being seen prevents an advance of cwnd.
5) CWR's should not be sent multiple times to the
same network, instead just updating the TSN being
transmitted if needed.

MFC after: 1 Month


218167 01-Feb-2011 lstewart

Algorithm modules can define their own private congestion signal types in the
top 8 bits of the 32 bit signal bit field space for internal use. These private
signals should not be leaked outside of a module.

Given that many algorithm modules use the NewReno hook functions to simplify
their implementation, the obvious place such a leak would show up is in the
NewReno cong_signal hook function.

- Show the full number of significant bits in the signal type definitions in
<netinet/cc.h>.

- Add a bitmask to simplify figuring out if a given signal is in the private or
public bit range.

- Add a sanity check in newreno_cong_signal() to ensure private signals are not
being leaked into the hook function.
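The public/private split described above can be illustrated with a short Python sketch. The mask value 0xFF000000 follows from "top 8 bits of the 32 bit signal bit field"; the helper function names are hypothetical and are not the kernel's identifiers.

```python
# Assumed mask: the top 8 bits of a 32-bit signal field are module-private.
CC_SIGPRIVMASK = 0xFF000000


def is_private_signal(signal_type):
    """Return True if any bit of the private range is set in signal_type."""
    return (signal_type & CC_SIGPRIVMASK) != 0


def checked_public_signal(signal_type):
    """Sanity check in the spirit of the one added to the NewReno hook:
    refuse signals that leak from a module's private range."""
    if is_private_signal(signal_type):
        raise ValueError("private congestion signal leaked: 0x%08x" % signal_type)
    return signal_type
```

A module that defines, say, `0x01000000` for internal use would have to strip or translate that signal before delegating to a shared hook.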

Sponsored by: FreeBSD Foundation
Discussed with: David Hayes <dahayes at swin edu au>
MFC after: 1 week
X-MFC with: r215166


218156 01-Feb-2011 lstewart

Fix typo in comment: "course" -> "coarse"

Sponsored by: FreeBSD Foundation
Submitted by: jmallett
MFC after: 3 months
X-MFC with: r218152


218155 01-Feb-2011 lstewart

Import an implementation of the CAIA-Hamilton-Delay (CHD) congestion control
algorithm described in the paper "Improved coexistence and loss tolerance for
delay based TCP congestion control" by Hayes and Armitage. It is implemented as
a kernel module compatible with the recently committed modular congestion
control framework.

CHD enhances the approach taken by the Hamilton-Delay (HD) algorithm to provide
tolerance to non-congestion related packet loss and improvements to coexistence
with loss-based congestion control algorithms. A key idea in improving
coexistence with loss-based congestion control algorithms is the use of a shadow
window, which attempts to track how NewReno's congestion window (cwnd) would
evolve. At the next packet loss congestion event, CHD uses the shadow window to
correct cwnd in a way that reduces the amount of unfairness CHD experiences when
competing with loss-based algorithms.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months


218153 01-Feb-2011 lstewart

Import a clean-room implementation of the Hamilton-Delay (HD) congestion control
algorithm based on the paper "A strategy for fair coexistence of loss and
delay-based congestion control algorithms" by Budzisz, Stanojevic, Shorten and
Baker. It is implemented as a kernel module compatible with the recently
committed modular congestion control framework.

HD uses a probabilistic approach to reacting to delay-based congestion. The
probability of reducing cwnd is zero when the queuing delay is very small,
increasing to a maximum at a set threshold, then back down to zero again when
the queuing delay is high. Normal operation keeps the queuing delay below the
set threshold. However, since loss-based congestion control algorithms push the
queuing delay high when probing for bandwidth, having the probability of
reducing cwnd drop back to zero for high delays allows HD to compete with
loss-based algorithms.
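The shape of the backoff probability described above can be sketched in Python as follows. The piecewise-linear form and all parameter values (qmin, qthresh, qmax, pmax) are assumptions chosen for illustration; they are not taken from the module's source.

```python
def backoff_probability(qdelay, qmin=5.0, qthresh=20.0, qmax=100.0, pmax=0.05):
    """Probability of reducing cwnd for a given queuing delay (illustrative).

    Zero for very small delays, rising linearly to pmax at qthresh, then
    falling linearly back to zero at qmax and staying zero beyond it.
    """
    if qdelay <= qmin or qdelay >= qmax:
        return 0.0
    if qdelay <= qthresh:
        # Rising edge between qmin and qthresh.
        return pmax * (qdelay - qmin) / (qthresh - qmin)
    # Falling edge between qthresh and qmax.
    return pmax * (qmax - qdelay) / (qmax - qthresh)
```

The falling edge is what lets a delay-based flow hold its ground when a loss-based competitor has already filled the queue.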

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months


218152 01-Feb-2011 lstewart

Import a clean-room implementation of the VEGAS congestion control algorithm
based on the paper "TCP Vegas: end to end congestion avoidance on a global
internet" by Brakmo and Peterson. It is implemented as a kernel module
compatible with the recently committed modular congestion control framework.

VEGAS uses network delay as a congestion indicator and unlike regular loss-based
algorithms, attempts to keep the network operating with stable queuing delays
and no congestion losses. By keeping network buffers used along the path within
a set range, queuing delays are kept low while maintaining high throughput.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months


218129 31-Jan-2011 rrs

More ECN fixes:
1) We now remove ECN-Nonce since it will no longer continue as an I-D
2) Eliminate last_tsn_echo, this tied us to an assoc not the net
and thus we were not doing m-homing on the ECN-Echo sender's side right.
3) Increment the count going out even if the TSN is lower in the pending
ECN-Echo, this way the receiver knows exactly how many packets were
marked even with network re-ordering
4) Fix so we DO NOT stop doing delayed sack if a ECN Echo is in queue
MFC after: 1 month


218078 29-Jan-2011 bz

Remove duplicate printing of TF_NOPUSH in db_print_tflags().

MFC after: 10 days


218072 29-Jan-2011 rrs

Fixes to ECN in SCTP.
1) ECN was on an association basis, this is incorrect and
will not work with CMT or for that matter if the user
is sending to multiple addresses. This commit makes
ECN on a per path basis.
2) Adopt the new format for the ECN internet draft. This also
maintains compatibility with old format chunks as well.
3) Keep track of the real time of a RTT down to micro seconds.
For some future conditional features (for like a data center
this is good information to have).
MFC after: 1 month


218039 28-Jan-2011 rrs

Keep track of the real last RTT on each net.
This will be used for Data Center congestion
control, we won't want to engage it in the
ECN code unless we KNOW that the RTT is less
than 500us.

MFC after: 1 week


218037 28-Jan-2011 rrs

Fix a bug in the way ECN-Echo chunk
sends were being accounted for. The
counting was such that we counted only
when we queued a chunk, not when we sent it.
Now keep an additional counter for queuing and
one for sending.

MFC after: 1 week


217913 26-Jan-2011 tuexen

* Use 300 ms as the default for RTO_MIN.
* Disable burst mitigation by default.
* Remove unused constant.
Discussed with rrs.
MFC after: 3 months.


217895 26-Jan-2011 tuexen

Make SCTP_MAX_BURST compliant with the latest version of
the socket API ID. This is not compatible with the API
in stable/8.


217894 26-Jan-2011 tuexen

Change infrastructure for SCTP_MAX_BURST to allow compliance
with the latest socket API ID. Especially it can be disabled.

Full compliance needs changing the structure used in the
socket option. Since this breaks the API, it will be a
separate commit which will not be MFCed to stable/8.

MFC after: 3 months.


217888 26-Jan-2011 deischen

Prison check addresses set with multicast interface options.

Reviewed by: bz
MFC after: 1 week


217829 25-Jan-2011 thompsa

When matching an incoming ARP against a bridge, ensure both interfaces belong
to the same bridge.

Submitted by: Alexander Zagrebin


217806 24-Jan-2011 lstewart

Import the ERTT (Enhanced Round Trip Time) Khelp module. ERTT uses the
Khelp/Hhook KPIs to hook into the TCP stack and maintain a per-connection, low
noise estimate of the instantaneous RTT. ERTT's implementation is robust even in
the face of delayed acknowledgements and/or TSO being in use for a connection.

A high quality, low noise RTT estimate is a requirement for applications such as
delay-based congestion control, for which we will be importing some algorithm
implementations shortly.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months


217760 23-Jan-2011 tuexen

Add stream scheduling support.
This work is based on a patch received from Robin Seggelmann.

MFC after: 3 months.


217748 23-Jan-2011 lstewart

An sbuf configured with SBUF_AUTOEXTEND will call malloc with M_WAITOK when a
write to the buffer causes it to overflow. We therefore can't hold the CC list
rwlock over a call to sbuf_printf() for an sbuf configured with SBUF_AUTOEXTEND.

Switch to a fixed length sbuf which should be of sufficient size except in the
very unlikely event that the sysctl is being processed as one or more new
algorithms are loaded. If that happens, we accept the race and may fail the
sysctl gracefully if there is insufficient room to print the names of all the
algorithms.

This should address a WITNESS warning and the potential panic that would occur
if the sbuf call to malloc did sleep whilst holding the CC list rwlock.

Sponsored by: FreeBSD Foundation
Reported by: Nick Hibma
Reviewed by: bz
MFC after: 3 weeks
X-MFC with: r215166


217742 23-Jan-2011 tuexen

Remove unnecessary checking of variable.

MFC after: 3 months.


217683 21-Jan-2011 lstewart

Some correctness and robustness fixes related to CUBIC's mean RTT estimate:

- The mean RTT is updated at the end of each congestion epoch, but if we switch
to congestion avoidance within the first epoch (e.g. if ssthresh was primed
from the hostcache), we'll trigger a divide by zero panic in
cubic_ack_received(). Set the mean to the min in cubic_record_rtt() if the
mean is less than the min to ensure we have a sane mean for use in this
situation. This fixes the panic reported by Nick Hibma.

- Adjust conditions under which we update the mean RTT in cubic_post_recovery()
to ensure a low latency path won't yield an RTT of less than 1. This avoids
another potential divide by zero panic when running CUBIC in networks with
sub-millisecond latencies.

- Remove the "safety" assignment of min into mean when we don't update the mean
because of failed conditions. The above change to the conditions for updating
the mean ensures the safety issue is addressed and I feel it is better to keep
our previous mean estimate around if we can't update than to revert to the
min.

- Initialise the mean RTT to 1 on connection startup to act as a safety belt if
a situation we haven't considered and addressed with the above changes were to
crop up in the wild.

Sponsored by: FreeBSD Foundation
Reported and tested by: Nick Hibma
Discussed with: David Hayes <dahayes at swin edu au>
MFC after: 5 weeks
X-MFC with: r216114


217638 20-Jan-2011 tuexen

Improve comments.

MFC after: 1 week.


217635 20-Jan-2011 rrs

Fix it so we align with new socket API draft for
state's in destination (i.e. ACTIVE/INACTIVE/UNCONFIRMED)

MFC after: 1 week


217611 19-Jan-2011 tuexen

Cleanup the management of CC functions.

MFC after: 3 months.


217597 19-Jan-2011 rrs

Fix style 9 nit that snuck in when I
grabbed the wrong patch ;-0 (thanks Daniel)

MFC after: 1 week


217592 19-Jan-2011 rrs

Fix a bug where Multicast packets sent from a
udp endpoint may end up echoing back to the sender
even without joining the multicast group.

Reviewed by: gnn, bms, bz?
Obtained from: deischen (with help from)


217554 18-Jan-2011 mdf

Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string. For SYSCTL_PROC instances that I
noticed a discrepancy between the CTLTYPE and the format specifier,
fix the CTLTYPE.


217469 16-Jan-2011 tuexen

Add support for resource pooling to CMT.
An original version of the patch was developed by Martin Becke
and Thomas Dreibholz.

MFC after: 3 months


217361 13-Jan-2011 jhb

Use a blocking malloc() to initialize the dummynet taskq.

Reviewed by: luigi


217333 12-Jan-2011 csjp

Un-break the build: use the correct format specifier for sizeof()


217322 12-Jan-2011 mdf

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the net* piece.


217315 12-Jan-2011 gnn

Fix several bugs in the ARP code related to improperly formatted
packets.

*) Reject requests with a protocol length not equal to 4. This is IPv4
and there is no reason to accept anything else.

*) Reject packets that have a multicast source hardware address.

*) Drop requests where the hardware address length is not equal
to the hardware address length of the interface.

Pointed out by: Rozhuk Ivan
MFC after: 1 week


217252 11-Jan-2011 lstewart

Fix some whitespace nits that were introduced in r216758.

Sponsored by: FreeBSD Foundation
Submitted by: pjd
MFC after: 10 weeks
X-MFC with: r216758


217221 10-Jan-2011 lstewart

Reset the last_sack_ack SACK hint for TCP input processing to ensure that the
hint is 0 when no SACK data is received to update the hint with. This was
accidentally omitted from r216753.

Sponsored by: FreeBSD Foundation
MFC after: 10 weeks
X-MFC with: r216753


217169 08-Jan-2011 deischen

Make sure to always do source address selection on
an unbound socket, regardless of any multicast options.
If an address is specified via a multicast option, then
let it override the normal source address selection.

This fixes a bug where source address selection was
not being performed when multicast options were present
but without an interface being specified.

Reviewed by: bz
MFC after: 1 day


217126 07-Jan-2011 jhb

Trim extra spaces before tabs.


217121 07-Jan-2011 gnn

Fix a memory leak in ARP queues.

Pointed out by: jhb@
MFC after: 2 weeks


217113 07-Jan-2011 gnn

Adjust ARP hold queue locking.

Submitted by: Rozhuk Ivan, jhb
MFC after: 2 weeks


217110 07-Jan-2011 jhb

Use a regular taskqueue for dummynet rather than a "fast" taskqueue.

Reviewed by: luigi


216887 02-Jan-2011 tuexen

Bugfix: Make sure that the COMM_UP notification is delivered first also
on the passive side.

MFC after: 3 days.


216878 01-Jan-2011 tuexen

Fix a typo.

MFC after: 3 months.


216857 31-Dec-2010 bz

Try to catch a possible divide-by-zero as early as possible if "mtu" is 0
(also test for negative MTUs if checking it anyway).
An MTU of 0 is arguably a bug elsewhere, but this at least gives us some
more debugging hints.

Sponsored by: ISPsystem (Early 2010)
MFC after: 1 week


216825 30-Dec-2010 tuexen

Define and use SCTP_SSN_GE, SCTP_SSN_GT, SCTP_TSN_GE, SCTP_TSN_GT macros
and use them instead of the generic compare_with_wrap.
Retire compare_with_wrap.
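Wrap-aware comparison of sequence numbers is usually done with serial number arithmetic. The Python sketch below shows the idea for 32-bit TSNs and 16-bit SSNs; the function names are hypothetical and are only meant to mirror what macros such as SCTP_TSN_GT and SCTP_SSN_GT compute, not to reproduce their definitions.

```python
def serial_gt(a, b, bits=32):
    """Return True if a is 'after' b in a sequence space of width bits."""
    mod = 1 << bits
    half = 1 << (bits - 1)
    a %= mod
    b %= mod
    # a is greater when the forward distance from b to a is non-zero
    # and less than half of the sequence space.
    return a != b and ((a - b) % mod) < half


def serial_ge(a, b, bits=32):
    """Return True if a equals b or is 'after' b."""
    mod = 1 << bits
    return (a % mod) == (b % mod) or serial_gt(a, b, bits)
```

With this definition `serial_gt(1, 0xFFFFFFFF)` holds, because TSN 1 follows 0xFFFFFFFF after the counter wraps.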

MFC after: 3 months.


216822 30-Dec-2010 tuexen

Code cleanup: Use LIST_FOREACH, LIST_FOREACH_SAFE, TAILQ_FOREACH,
TAILQ_FOREACH_SAFE where appropriate.
No functional change.

MFC after: 3 months.


216821 30-Dec-2010 tuexen

Fix three bugs related to the sequence number wrap-around affecting
the processing of ECNE and ASCONF chunks.

Reviewed by: rrs
MFC after: 3 days.


216760 28-Dec-2010 lstewart

Add a comment for the ccv member of struct tcpcb.

Sponsored by: FreeBSD Foundation
MFC after: 5 weeks
X-MFC with: r215166


216758 28-Dec-2010 lstewart

- Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
connections. The hooks only run if at least one hook function is registered
for the hook point, ensuring the impact on the stack is effectively nil when
no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding the ABI changes
introduced by this commit and r216753.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz, others along the way
MFC after: 3 months


216753 28-Dec-2010 lstewart

Add a new sack hint to track the most recent and highest sacked sequence number.
This will be used by the incoming Enhanced RTT Khelp module.

Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
Reviewed by: bz and others (as part of a larger patch)
MFC after: 3 months


216749 28-Dec-2010 lstewart

Fix a whitespace nit introduced in r215166.

Sponsored by: FreeBSD Foundation
Spotted by: bz
MFC after: 5 weeks
X-MFC with: r215166


216742 27-Dec-2010 rwatson

Remove comment bemoaning the lack of an INP_INHASHLIST above in_pcbdrop();
I fixed this in r189657 in early 2009, so the comment is OBE.

Reviewed by: bz
MFC after: 3 days


216672 22-Dec-2010 tuexen

Provide a possibility to configure the initial congestion window to the
value defined in RFC 4960.

MFC after: 3 months.


216669 22-Dec-2010 tuexen

Improve plausibility check in sctp_handle_sack().
Allow cmt_on_off to support values 0 (no CMT), 1 (CMT), and 2 (CMT/RP).

MFC after: 3 months.


216621 21-Dec-2010 jhb

Fix a typo in a comment.

MFC after: 1 week


216502 17-Dec-2010 tuexen

Fix a flightsize bug related to the processing of PKTDRP reports.

MFC after: 3 days.


216495 16-Dec-2010 tuexen

Bugfix: Take also the nr-mapping array into account when detecting
gaps.

Reviewed by: rrs@
MFC after: 3 days.


216480 16-Dec-2010 tuexen

Add a missing cast. Reported by blade_ly at yahoo.com.cn.

MFC after: 1 day.


216466 15-Dec-2010 bz

Bring back (most of) NATM to avoid further bitrot after r186119.
Keep three lines disabled which I am unsure if they had been used at all.
This will allow us to seek testers and possibly bring it all back.

Discussed with: rwatson
MFC after: 7 weeks


216397 12-Dec-2010 tuexen

Bugfix: Do correct accounting using the MIB counters when an
association is aborted via sctp_abort_association().

MFC after: 3 days.


216192 05-Dec-2010 bz

Use correct field to track statistics counting error as bad header length.
This assimilates the code to what ip_input has been doing since r1.1 in
this case.

Submitted by: Rozhuk Ivan (rozhuk.im gmail.com)
MFC after: 4 days


216188 04-Dec-2010 tuexen

Fix a bug where also the number of non-renegable gap reports
was considered to be potentially renegable.

MFC after: 1 day.


216115 02-Dec-2010 lstewart

Import a clean-room implementation of the experimental H-TCP congestion control
algorithm based on the Internet-Draft "draft-leith-tcp-htcp-06.txt". It is
implemented as a kernel module compatible with the recently committed modular
congestion control framework.

H-TCP was designed to provide increased throughput in fast and long-distance
networks. It attempts to maintain fairness when competing with legacy NewReno
TCP in lower speed scenarios where NewReno is able to operate adequately. The
paper "H-TCP: A framework for congestion control in high-speed and long-distance
networks" provides additional detail.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: rpaulo (older patch from a few weeks ago)
MFC after: 3 months


216114 02-Dec-2010 lstewart

Import a clean-room implementation of the experimental CUBIC congestion control
algorithm based on the Internet-Draft "draft-rhee-tcpm-cubic-02.txt". It is
implemented as a kernel module compatible with the recently committed modular
congestion control framework.

CUBIC was designed to provide increased throughput in fast and long-distance
networks. It attempts to maintain fairness when competing with legacy NewReno
TCP in lower speed scenarios where NewReno is able to operate adequately. The
paper "CUBIC: A New TCP-Friendly High-Speed TCP Variant" provides additional
detail.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: rpaulo (older patch from a few weeks ago)
MFC after: 3 months


216107 02-Dec-2010 lstewart

General cleanup of the NewReno CC module (no functional changes):

- Remove superfluous includes and unhelpful comments.

- Alphabetically order functions.

- Make functions static.

Sponsored by: FreeBSD Foundation
MFC after: 9 weeks
X-MFC with: r215166


216105 02-Dec-2010 lstewart

- Reinstantiate the after_idle hook call in tcp_output(), which got lost
somewhere along the way due to mismerging r211464 in our development tree.

- Capture the essence of r211464 in NewReno's after_idle() hook. We don't
use V_ss_fltsz/V_ss_fltsz_local yet which needs to be revisited.

Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
MFC after: 9 weeks
X-MFC with: r215166


216103 02-Dec-2010 lstewart

Set ssthresh appropriately on RTO. This change was accidentally not ported from
the pre modular CC stack.

Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
MFC after: 9 weeks
X-MFC with: r215166


216101 02-Dec-2010 lstewart

Pass NULL instead of 0 for the th pointer value. NULL != 0 on all platforms.

Submitted by: David Hayes <dahayes at swin edu au>
MFC after: 9 weeks
X-MFC with: r215166


216075 30-Nov-2010 glebius

Use time_uptime instead of non-monotonic time_second to drive ARP
timeouts.

Suggested by: bde


215956 27-Nov-2010 brucec

Fix more continuous/contiguous typos (cf. r215955)


215817 25-Nov-2010 rrs

Adds new dtrace for cwnd functions and lays the
groundwork for future dtrace points (rwnd flightsize etc).

MFC after: 2 months


215790 24-Nov-2010 glebius

Redo r166423. It is important not only to skip freeing multicast
entries when the underlying interface is detached, but also to purge
pointers to them, to avoid double-free in future.


215701 22-Nov-2010 dim

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


215677 22-Nov-2010 zec

Remove an apparently redundant CURVNET_SET() / CURVNET_RESTORE() pair.

MFC after: 3 days


215553 20-Nov-2010 lstewart

Fix a minor code redundancy nit.

MFC after: 3 days


215552 20-Nov-2010 lstewart

When enabling or disabling SIFTR with a VIMAGE kernel, ensure we add or remove
the SIFTR pfil(9) hook functions to or from all network stacks. This patch
allows packets inbound or outbound from a vnet to be "seen" by SIFTR.

Additional work is required to allow SIFTR to actually generate log messages for
all vnet related packets because the siftr_findinpcb() function does not yet
search for inpcbs across all vnets. This issue will be fixed separately.

Reported and tested by: David Hayes <dahayes at swin edu au>
MFC after: 3 days


215434 17-Nov-2010 gnn

Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after: 2 weeks


215410 16-Nov-2010 tuexen

Add an SCTP socket option to retrieve the number of timeouts
of an association.

MFC after: 3 days.


215395 16-Nov-2010 lstewart

Make the CC framework more VIMAGE friendly by adding the machinery to allow
vnets to select their own default CC algorithm independent of each other and the
base system. If the base system or a vnet has set a default which gets unloaded,
we reset that netstack's default to NewReno.

Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by: bz (briefly)
MFC after: 3 months


215393 16-Nov-2010 lstewart

- Querying the default CC algo is more common than setting it and the function
is small, so there is no good reason not to declare the buffer at the top.

- Fix a whitespace nit.

Sponsored by: FreeBSD Foundation
MFC after: 11 weeks
X-MFC with: r215166


215392 16-Nov-2010 lstewart

Move protocol specific implementation detail out of the core CC framework.

Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
MFC after: 11 weeks
X-MFC with: r215166


215391 16-Nov-2010 lstewart

On CC algorithm module unload, we walk the list of active TCP control blocks.
Any found to be using the algorithm that is about to go away are switched back
to NewReno to avoid leaving dangling pointers which would trigger a panic. For
VIMAGE kernels, there is a list per vnet to walk, yet the implementation was
only examining one of the vnet lists.

Fix the implementation of the above feature for VIMAGE kernels by looping
through all active TCP control blocks across all vnets.

Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by: bz (briefly)
MFC after: 11 weeks


215377 16-Nov-2010 lstewart

cc_init() should only be run once on system boot, but with VIMAGE kernels it
runs on boot and each time a vnet jail is created. Running cc_init() multiple
times results in a panic when attempting to initialise the cc_list lock again,
and so r215166 effectively broke the use of vnet jails.

Switch to using a SYSINIT to run cc_init() on boot. CC algorithm modules loaded
on boot register in the same SI_SUB_PROTO_IFATTACHDOMAIN category as is used in
this patch, so cc_init() is run at SI_ORDER_FIRST to ensure the framework is
initialised before module registration is attempted.

Sponsored by: FreeBSD Foundation
Reported and tested by: Mikolaj Golub <to.my.trociny at gmail com>
MFC after: 11 weeks
X-MFC with: r215166


215317 14-Nov-2010 dim

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


215305 14-Nov-2010 tuexen

Take out special code for disabling CRC computations on
the loopback interface for IPv6. It will be handled
by the loopback interface.


215301 14-Nov-2010 tuexen

Simplify sctp_delayed_cksum() a bit.

MFC after: 3 days.


215241 13-Nov-2010 tuexen

Fix a locking issue reported by brucec@ affecting
1-to-1 style sockets which have not yet been
accepted.

MFC after: 3 days.


215207 12-Nov-2010 gnn

Add a queue to hold packets while we await an ARP reply.

When a fast machine first brings up some non TCP networking program
it is quite possible that we will drop packets due to the fact that
only one packet can be held per ARP entry. This leads to packets
being missed when a program starts or restarts if the ARP data is
not currently in the ARP cache.

This code adds a new sysctl, net.link.ether.inet.maxhold, which defines
a system wide maximum number of packets to be held in each ARP entry.
Up to maxhold packets are queued until an ARP reply is received or
the ARP times out. The default setting is the old value of 1
which has been part of the BSD networking code since time
immemorial.

Expose the time we hold an incomplete ARP entry by adding
the sysctl net.link.ether.inet.wait, which defaults to 20
seconds, the value used when the new ARP code was added.

Reviewed by: bz, rpaulo
MFC after: 3 weeks


215199 12-Nov-2010 tuexen

Don't print an empty line when printing mapping arrays.

MFC after: 3 days.


215198 12-Nov-2010 tuexen

Fix more issues with the SACK/NR-SACK generation code.

MFC after: 3 days.


215179 12-Nov-2010 luigi

The first customer of the SO_USER_COOKIE option:
the "sockarg" ipfw option matches packets associated to
a local socket and with a non-zero so_user_cookie value.
The value is made available as tablearg, so it can be used
as a skipto target or pipe number in ipfw/dummynet rules.

Code by Paul Joe, manpage by me.

Submitted by: Paul Joe
MFC after: 1 week


215166 12-Nov-2010 lstewart

This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the foreseeable future. Implementations of
additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months


215153 12-Nov-2010 lstewart

Standardise all Swinburne related copyright/licence statements throughout the
tree in preparation for another large code import. Swinburne University is the
legal entity that owns copyright and the 2-clause BSD licence is acceptable.


215152 12-Nov-2010 lstewart

The university does not require that its CRICOS number be included in source
code. Remove all references from the tree.

MFC after: 3 days


215134 11-Nov-2010 tuexen

Fix the SACK/NR-SACK generation code.

MFC after: 3 days.


215110 11-Nov-2010 rrs

Fix so that a multicast packet can be sent
even if there is no route out to that mcast address. The code in
in_pcb would inadvertently error (no route) even though
the user may have specified the address with the
proper socket option (to specify the egress interface).
Thanks bz for reminding me I forgot to commit this ;-)

Reviewed by: bz
MFC after: 1 week


215039 09-Nov-2010 tuexen

Improve the scalability by using the local and remote port when
putting inps in the tcpephash.

MFC after: 3 days.


215035 09-Nov-2010 tuexen

Fix a bug which resulted in kevent() reporting an event twice on
1-to-1 style sockets when an ABORT was received.

MFC after: 3 days.


215034 09-Nov-2010 brucec

Fix typos.

PR: bin/148894
Submitted by: olgeni


214939 07-Nov-2010 tuexen

Do not have the MTU table twice in the code. Therefore move the
function from the timer code to util, rename it appropriately and
also fix a bug in sctp_get_prev_mtu(), where calling it with a
value existing in the MTU table did not return a smaller one.

MFC after: 3 days.
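
The lookup behaviour described above can be sketched as follows. This is a minimal illustration, not the committed code: the table contents and the names `sketch_mtu_sizes` and `sketch_get_prev_mtu` are hypothetical. The point is that a value already present in the table must map to the next smaller entry, not to itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, ascending MTU table; the real table lives in the SCTP code. */
static const uint32_t sketch_mtu_sizes[] = {
    68, 296, 508, 1006, 1492, 1500, 4352, 9000
};
#define SKETCH_MTU_COUNT (sizeof(sketch_mtu_sizes) / sizeof(sketch_mtu_sizes[0]))

/*
 * Return the largest table entry strictly smaller than val.
 * If val is at or below the smallest entry, return val unchanged.
 */
static uint32_t
sketch_get_prev_mtu(uint32_t val)
{
    size_t i;

    if (val <= sketch_mtu_sizes[0])
        return val;
    for (i = 1; i < SKETCH_MTU_COUNT; i++) {
        /* "<=" (not "<") makes an exact table hit step down one entry. */
        if (val <= sketch_mtu_sizes[i])
            break;
    }
    return sketch_mtu_sizes[i - 1];
}
```

With this table, an input of 1500 yields 1492, and an input of 1400 yields 1006.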


214933 07-Nov-2010 tuexen

Remove two functions which are not used.

MFC after: 3 days.


214928 07-Nov-2010 tuexen

* Use exponential backoff for retransmission of SHUTDOWN and
SHUTDOWN-ACK chunks.
* While there, do some cleanups.

MFC after: 3 days.


214918 07-Nov-2010 tuexen

Not only stop all timers when entering the SHUTDOWN_SENT state,
but also when entering the SHUTDOWN_ACK_SEND state.

MFC after: 3 days.


214877 06-Nov-2010 tuexen

Do not resend DATA chunks without delay when dropped by the peer and
the CRC was correct.

MFC after: 3 days.


214876 06-Nov-2010 tuexen

* Fix an accounting bug regarding SACK/NR-SACK chunks.
* Fix the generation of the SACK/NR-SACK gap lists.

MFC after: 3 days.


214754 03-Nov-2010 n_hibma

Don't spam the console with loaded modules during boot and/or during
startup of ppp.

Note: This cannot be hidden behind bootverbose as this file is included
from lib/libalias as well.


214675 02-Nov-2010 jhb

Don't leak the LLE lock if the arptimer callout is pending or inactive.

Reported by: David Rhodus
MFC after: 1 month


214509 29-Oct-2010 glebius

Remove meaningless XXXXX, a remnant of a comment that was removed in r186200.


214508 29-Oct-2010 glebius

Revert a small part of the r198301, that is entirely unrelated to the
r198301 itself. It also broke the logic of not sending more than one
ARP request per second, which consequently led to a potential problem
of flooding the network with broadcast packets.

MFC after: 1 week


214303 24-Oct-2010 bz

Add initial inet DDB support for show in_ifaddr and show sin commands which
proved to be useful while debugging address list problems.

MFC after: 6 days


214250 23-Oct-2010 bz

Make the IPsec SADB embedded route cache a union to be able to hold both the
legacy and IPv6 route destination address.
Previously in case of IPv6, there was a memory overwrite due to not enough
space for the IPv6 address.

PR: kern/122565
MFC After: 2 weeks


214054 19-Oct-2010 uqs

mdoc: drop even more redundant .Pp calls

No change in rendered output, fewer mandoc lint warnings.

Tool provided by: Nobuyuki Koganemaru n-kogane at syd.odn.ne.jp


213932 16-Oct-2010 bz

MfP4 CH182763 (original version):

Make it harder to exploit certain in_control() related races between the
initial lookup at the beginning and the time we will remove the entry
from the lists by re-checking that entry is still in the list before
trying to remove it.

(*) It is believed that with the current code and locking strategy we
cannot completely fix all races.

Reported by: Nima Misaghian (nima_misa hotmail.com) on net@ 20100817
Tested by: Nima Misaghian (nima_misa hotmail.com) (original version)
PR: kern/146250
Submitted by: Mikolaj Golub (to.my.trociny gmail.com) (different version)
MFC after: 1 week


213913 16-Oct-2010 lstewart

Retire the system-wide, per-reassembly queue segment limit. The mechanism is far
too coarse grained to be useful and the default value significantly degrades TCP
performance on moderate to high bandwidth-delay product paths with non-zero loss
(e.g. 5+Mbps connections across the public Internet often suffer).

Replace the outgoing mechanism with an individual per-queue limit based on the
number of MSS segments that fit into the socket's receive buffer. This should
strike a good balance between performance and the potential for resource
exhaustion when FreeBSD is acting as a TCP receiver. With socket buffer
autotuning (which is enabled by default), the reassembly queue tracks the
socket buffer and benefits too.

As the XXX comment suggests, my testing uncovered some unexpected behaviour
which requires further investigation. By using so->so_rcv.sb_hiwat
instead of sbspace(&so->so_rcv), we allow more segments to be held across both
the socket receive buffer and reassembly queue than we probably should. The
tradeoff is better performance in at least one common scenario, versus a devious
sender's ability to consume more resources on a FreeBSD receiver.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks
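
As a rough sketch of the per-queue limit described above: the limit is derived from how many MSS-sized segments fit into the receive buffer high-water mark. The function name and the "+ 1" allowance are assumptions made for illustration; only the idea of dividing the buffer size by the MSS comes from the commit message.

```c
/*
 * Hypothetical helper: maximum number of segments a single reassembly
 * queue may hold, given the receive buffer high-water mark (bytes) and
 * the connection's MSS (bytes).
 */
static unsigned int
sketch_reass_queue_limit(unsigned int sb_hiwat, unsigned int mss)
{
    if (mss == 0)
        return 1;               /* avoid division by zero; allow one segment */
    /* Whole segments that fit, plus one (an assumed allowance). */
    return sb_hiwat / mss + 1;
}
```

For a 65536-byte buffer and a 1460-byte MSS this gives 45 segments. Because the calculation uses the buffer's high-water mark and not its currently free space, it grows automatically when socket buffer autotuning enlarges the buffer.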


213912 16-Oct-2010 lstewart

- Switch the "net.inet.tcp.reass.cursegments" and
"net.inet.tcp.reass.maxsegments" sysctl variables to be based on UMA zone
stats. The value returned by the cursegments sysctl is approximate owing to
the way in which uma_zone_get_cur is implemented.

- Discontinue use of V_tcp_reass_qsize as a global reassembly segment count
variable in the reassembly implementation. The variable was used without
proper synchronisation and was duplicating accounting done by UMA already. The
lack of synchronisation was particularly problematic on SMP systems
terminating many TCP sessions, resulting in poor TCP performance for
connections with non-zero packet loss.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo (as part of a larger patch)
MFC after: 2 weeks


213832 14-Oct-2010 bz

Use ifa_ifwithaddr_check() rather than ifa_ifwithaddr() as we are not
interested in the result and would leak a reference otherwise.

PR: kern/151435
Submitted by: Andrew Boyer (aboyer averesystems.com)
MFC after: 3 days


213329 01-Oct-2010 luigi

put back the assignment to sched_time. It was correct, and
it was necessary.

Submitted by: Riccardo Panicucci


213325 01-Oct-2010 bz

Proper bracketing.

PR: kern/151100
Submitted by: SunMinghao (sunminghao hotmail.com)
MFC after: 3 days


213279 29-Sep-2010 luigi

remove an unnecessary (and wrong) assignment.
It was meant to reset idle_time (and it was not needed),
but i even used the wrong field.

Obtained from: Oleg
MFC after: 3 days


213267 29-Sep-2010 luigi

whitespace changes in preparation for future commits


213265 29-Sep-2010 luigi

fix handling of initial credit for an idle pipe.
This fixes the bug where setting bw > 1 MTU/tick resulted in
infinite bandwidth if io_fast=1

PR: 147245 148429
Obtained from: Riccardo Panicucci
MFC after: 3 days


213254 28-Sep-2010 luigi

fix breakage in in-kernel NAT: the code did not honor
net.inet.ip.fw.one_pass and always moved to the next rule
in case of a successful nat.

This should fix several related PRs (waiting for feedback
before closing them)

PR: 145167 149572 150141
MFC after: 3 days


213253 28-Sep-2010 luigi

Whitespace changes to reduce diffs wrt the most recent ipfw/dummynet code:
+ remove an unused macro,
+ adjust the constants in an enum
+ small whitespace changes

MFC after: 3 days


213225 27-Sep-2010 delphij

Add a bandaid for a long-standing race condition during route entry
un-expiring.

The previous version of the code had no locking when testing rt_refcnt.
The lack of locking may result in a condition where
a routing entry has a reference count but at the same time has the
RTPRF_OURS bit set and an expiration timer. This would eventually
lead to a panic:

panic: rtqkill route really not free

This happens, for instance, when the system accepts ICMP redirects
from a local gateway at a moderate frequency.

Commit this workaround for now until we have some better solution.

PR: kern/149804
Reviewed by: bz
Tested by: Zhao Xin, Pete French
MFC after: 2 weeks


213162 25-Sep-2010 lstewart

Log the number of segments currently in the reassembly queue.

Sponsored by: FreeBSD Foundation


213158 25-Sep-2010 lstewart

Internalise reassembly queue related functionality and variables which should
not be used outside of the reassembly queue implementation. Provide a new
function to flush all segments from a reassembly queue and call it from the
appropriate places instead of manipulating the queue directly.

Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks


213103 24-Sep-2010 attilio

Make the RPC specific __rpc_inet_ntop() and __rpc_inet_pton() general
in the kernel (just as inet_ntoa() and inet_aton() are) and sync their
prototypes accordingly with the already mentioned functions.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, rstone
Approved by: dfr
MFC after: 2 weeks


213101 24-Sep-2010 attilio

IP_BINDANY is not correctly handled in getsockopt() case.
Fix it by specifying the correct bits.

Sponsored by: Sandvine Incorporated
Reviewed by: bz, emaste, rstone
Obtained from: Sandvine Incorporated
MFC after: 10 days


212898 20-Sep-2010 glebius

Do not convert some meaningful error value to EINVAL.

Reviewed by: will


212897 20-Sep-2010 tuexen

Fix a locking issue which resulted in aborted associations
due to a corrupted nr-mapping array.

MFC after: 2 weeks.


212851 19-Sep-2010 tuexen

Allow the initial congestion window to be configured
to one MTU. Improve the description.

MFC after: 2 weeks.


212850 19-Sep-2010 tuexen

Fix a locking issue which shows up when the code is used
on Mac OS X.

MFC after: 2 weeks.


212803 17-Sep-2010 andre

Rearrange the TSO code to make it more readable and to clearly
separate the decision logic, of whether we can do TSO, and the
calculation of the burst length into two distinct parts.

Change the way the TSO burst length calculation is done. While
TSO could do bursts of 65535 bytes, that can't be represented in
ip_len together with the IP and TCP header. Account for that and
use IP_MAXPACKET instead of TCP_MAXWIN as base constant (both
have the same value of 64K). When more data is available prevent
less than MSS sized segments from being sent during the current
TSO burst.

Add two more KASSERTs to ensure the integrity of the packets.

Tested by: Ben Wilber <ben-at-desync com>
MFC after: 10 days
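
The burst length calculation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from tcp_output(): the function name, parameter names, and `SKETCH_IP_MAXPACKET` are hypothetical stand-ins. The value 65535 is the largest length that fits in the 16-bit ip_len field.

```c
/* Largest value representable in the 16-bit ip_len field. */
#define SKETCH_IP_MAXPACKET 65535u

/*
 * len:       payload bytes we would like to send in this burst
 * hdrlen:    combined IP and TCP header length in bytes
 * mss:       maximum segment size in bytes
 * more_data: nonzero if more data remains queued beyond this burst
 */
static unsigned int
sketch_tso_burst_len(unsigned int len, unsigned int hdrlen,
    unsigned int mss, int more_data)
{
    unsigned int max_payload = SKETCH_IP_MAXPACKET - hdrlen;

    /* Headers plus payload must fit in ip_len. */
    if (len > max_payload) {
        len = max_payload;
        more_data = 1;          /* the clipped remainder is still pending */
    }
    /*
     * If more data follows, trim to a multiple of the MSS so the burst
     * does not end in a short segment.
     */
    if (more_data && mss > 0 && len > mss)
        len -= len % mss;
    return len;
}
```

For example, a 100000-byte request with 40 bytes of headers and a 1460-byte MSS is first clipped to 65495 bytes and then trimmed to 64240 bytes, which is exactly 44 full segments.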


212801 17-Sep-2010 tuexen

Fix a bug where the wrong PR-SCTP policy was considered.
While there, always use the same code for the check of
TTL expiration.

MFC after: 2 weeks.


212800 17-Sep-2010 tuexen

Make the initial congestion window configurable via sysctl.

MFC after: 2 weeks.


212799 17-Sep-2010 tuexen

* Implement initial version of send buffer splitting.
* Make send/recv buffer splitting switchable via sysctl.
* While there: Fix some comments.


212765 16-Sep-2010 andre

Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework. It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI. It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI. It is always set to 0.

These unused variables in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned. The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.


212731 16-Sep-2010 andre

Improve comment to TCP_MINMSS by taking the wording from lstewart (with
a small difference in the last paragraph though) as suggested by jhb.

Clarify that the 'reviewed by' in r212653 by lstewart was for the
functional change, not the comments in the committed version.


212714 16-Sep-2010 tuexen

Remove old debug code.

MFC after: 2 weeks.


212713 15-Sep-2010 tuexen

Remove unused variable/assignment.

MFC after: 3 weeks.


212712 15-Sep-2010 tuexen

Delay the assignment of a path for DATA chunk until they hit
the sent_queue. Honor a given path when the SCTP_ADDR_OVER
flag is set.

MFC after: 2 weeks.


212711 15-Sep-2010 tuexen

Use TAILQ_EMPTY() for testing if a tail queue is empty.
Set whoFrom to NULL after freeing whoFrom.


212707 15-Sep-2010 tuexen

Remove unused variable/assignment.

MFC after: 2 weeks.


212704 15-Sep-2010 tuexen

Remove assignment without effect.

MFC after: 2 weeks.


212702 15-Sep-2010 tuexen

* Use !TAILQ_EMPTY() for checking if a tail queue is not empty.
* Remove assignment without any effect.

MFC after: 2 weeks.


212653 15-Sep-2010 andre

Change the default MSS for IPv4 and IPv6 TCP connections from an
artificial power-of-2 rounded number to their real values specified
in RFC879 and RFC2460.

From the history and existing comments it appears that the rounded
numbers were intended to be advantageous for the kernel and mbuf
system. However, this hasn't been the case for a long
time. The mbuf clusters used in tcp_output() have enough space
to hold the larger real value for the default MSS for both IPv4 and
IPv6. Note that the default MSS is only used when path MTU discovery
is disabled.

Update and expand related comments.

Reviewed by: lstewart (including some word-smithing)
MFC after: 2 weeks
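
The "real values" follow from simple header arithmetic, sketched below. The constant and function names are illustrative and are not kernel identifiers. RFC 879 derives the IPv4 default from the 576-byte datagram every host must accept, and RFC 2460 requires a minimum link MTU of 1280 bytes for IPv6.

```c
/* Illustrative constants; the names are assumptions, not kernel macros. */
enum {
    SKETCH_IPV4_MIN_DATAGRAM = 576,   /* size every IPv4 host must accept */
    SKETCH_IPV6_MIN_MTU      = 1280,  /* minimum link MTU for IPv6 */
    SKETCH_IPV4_HDR_LEN      = 20,    /* IPv4 header without options */
    SKETCH_IPV6_HDR_LEN      = 40,    /* fixed IPv6 header */
    SKETCH_TCP_HDR_LEN       = 20     /* TCP header without options */
};

/* Default MSS for IPv4: 576 - 20 - 20. */
static int
sketch_default_mss_v4(void)
{
    return SKETCH_IPV4_MIN_DATAGRAM - SKETCH_IPV4_HDR_LEN - SKETCH_TCP_HDR_LEN;
}

/* Default MSS for IPv6: 1280 - 40 - 20. */
static int
sketch_default_mss_v6(void)
{
    return SKETCH_IPV6_MIN_MTU - SKETCH_IPV6_HDR_LEN - SKETCH_TCP_HDR_LEN;
}
```

This yields 536 bytes for IPv4 and 1220 bytes for IPv6, neither of which is a power of two.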


212502 12-Sep-2010 qingli

Adding an address on an interface also requires the loopback route to
that address be installed.

PR: kern/150481
Submitted by: Ingo Flaschberger <if at xip.at>
MFC after: 5 days


212380 09-Sep-2010 tuexen

* Remove code which has no effect.
* Clean up the handling in sctp_lower_sosend().

MFC after: 3 weeks.


212266 06-Sep-2010 will

Fix CARP in backup mode by properly registering its hooks for INET and INET6
using ipproto_{un,}register() and the newly created ip6proto_{un,}register()
so that it can again receive IPPROTO_CARP packets allowing its state machine
to work.

Reviewed by: bz
Approved by: ken (mentor)


212265 06-Sep-2010 will

Fix static kernel builds with carp(4) by changing its SYSINIT order so that
it is initialized after basic protocol initialization, which allows it to
register via pf_proto_register().

Reviewed by: bz
Approved by: ken (mentor)


212256 06-Sep-2010 glebius

in_delayed_cksum() requires host byte order.

Reported by: Alexander Levin <amindomao googlemail.com>
MFC after: 1 week


212242 05-Sep-2010 tuexen

Implement correct handling of address parameter and
sendinfo for SCTP send calls.

MFC after: 4 weeks.


212225 05-Sep-2010 rrs

Fix some CLANG warnings. One clang warning is left
due to the fact that it's bogus. nam->sa_family will
not change from AF_INET6 to AF_INET (but clang
thinks it does ;-D)


212209 04-Sep-2010 bz

In case of RADIX_MPATH do not leak the IN_IFADDR read lock on
early return.

MFC after: 3 days


212155 02-Sep-2010 bz

MFp4 CH=183052 183053 183258:

In protosw we define pr_protocol as short, while on the wire
it is an uint8_t. That way we can have "internal" protocols
like DIVERT, SEND or gaps for modules (PROTO_SPACER).
Switch ipproto_{un,}register to accept a short protocol number(*)
and do an upfront check for valid boundaries. With this we
also consistently report EPROTONOSUPPORT for out of bounds
protocols, as we did for proto == 0. This allows a caller
to not error for this case, which is especially important
if we want to automatically call these from domain handling.

(*) the functions have been without any in-tree consumer
since the initial introduction, so this is considered safe.

Implement ip6proto_{un,}register() similarly to their legacy IP
counterparts to allow modules to hook up dynamically.

Reviewed by: philip, will
MFC after: 1 week


212099 01-Sep-2010 tuexen

Fix a bug which results in peer IPv4 addresses a.b.c.d with 224<=d<=239
incorrectly being detected as multicast addresses on little endian systems.

MFC after: 2 weeks
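
The symptom above is consistent with testing an address in network byte order using a check that expects host byte order: on a little-endian machine the last octet d then lands where the check expects the first octet. The following sketch shows a byte-order-safe test. The function names are hypothetical and this is not the SCTP code itself.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Class D test on an address already in host byte order: 224.0.0.0/4. */
static int
sketch_is_multicast_host(uint32_t addr_host)
{
    return (addr_host & 0xf0000000u) == 0xe0000000u;
}

/*
 * Addresses taken from packets or from sockaddr_in are in network byte
 * order, so convert before testing.
 */
static int
sketch_is_multicast_net(uint32_t addr_net)
{
    return sketch_is_multicast_host(ntohl(addr_net));
}
```

With the conversion in place, 10.0.0.230 is correctly classified as unicast and 224.0.0.1 as multicast, regardless of the host's endianness.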


211992 30-Aug-2010 maxim

o Some programs could send broadcast/multicast traffic to ipfw
pseudo-interface. This leads to a panic due to uninitialized
if_broadcastaddr address. Initialize it and implement ip_output()
method to prevent mbuf leak later.

ipfw pseudo-interface should never send anything therefore call
panic(9) in if_start() method.

PR: kern/149807
Submitted by: Dmitrij Tejblum
MFC after: 2 weeks


211969 29-Aug-2010 tuexen

Fix the SCTP_WITH_NO_CSUM option when used in combination with
interfaces supporting CRC offload. While at it, make use of the
feature that the loopback interface provides CRC offloading.

MFC after: 4 weeks


211950 28-Aug-2010 tuexen

Bugfix: Do not send a packet drop report in response to a received
INIT-ACK with incorrect CRC.


211944 28-Aug-2010 tuexen

Fix the switching on/off of CMT using sysctl and socket option.
Fix the switching on/off of PF and NR-SACKs using sysctl.
Add minor improvement in handling malloc failures.
Improve the address checks when sending.

MFC after: 4 weeks


211888 27-Aug-2010 jhb

Simplify the tcp pcblist estimate logic slightly.

MFC after: 3 days


211874 27-Aug-2010 andre

Use timestamp modulo comparison macro for automatic receive buffer
scaling to correctly handle wrapping of ticks value.

MFC after: 1 week
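
A modulo comparison of this kind can be sketched as below. The function name is hypothetical; the technique is the usual signed-difference test for 32-bit counters that wrap. The conversion of the unsigned difference to a signed type relies on two's-complement behaviour, which holds on all common platforms.

```c
#include <stdint.h>

/*
 * Return nonzero if timestamp a is later than timestamp b, treating both
 * as 32-bit counters that wrap around. Valid as long as the two values
 * are less than 2^31 ticks apart.
 */
static int
sketch_tstmp_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}
```

A plain `a > b` gives the wrong answer just after the counter wraps: 5 is numerically smaller than 0xFFFFFFF0 even though it is 21 ticks later, whereas the signed difference reports it correctly as later.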


211501 19-Aug-2010 anchie

MFp4: anchie_soc2009 branch:

Add kernel side support for Secure Neighbor Discovery (SeND), RFC 3971.

The implementation consists of a kernel module that gets packets from
the nd6 code, sends them to user space on a dedicated socket and reinjects
them back for further processing.

Hooks are used from nd6 code paths to divert relevant packets to the
send implementation for processing in user space. The hooks are only
triggered if the send module is loaded. In case no user space
application is connected to the send socket, processing continues
normally as if the module were not loaded. Unloading the module
is not possible at this time due to missing nd6 locking.

The native SeND socket is similar to a raw IPv6 socket but with its own,
internal pseudo-protocol.

Approved by: bz (mentor)


211464 18-Aug-2010 andre

If a TCP connection has been idle for one retransmit timeout or more
it must reset its congestion window back to the initial window.

RFC3390 has increased the initial window from 1 segment to up to
4 segments.

The initial window increase of RFC3390 wasn't reflected into the
restart window which remained at its original defaults of 4 segments
for local and 1 segment for all other connections. Both values are
controllable through sysctl net.inet.tcp.local_slowstart_flightsize
and net.inet.tcp.slowstart_flightsize.

The increase helps TCP's slow start algorithm to open up the congestion
window much faster.

Reviewed by: lstewart
MFC after: 1 week


211462 18-Aug-2010 andre

Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated. This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR: kern/137317
MFC after: 1 week


211451 18-Aug-2010 bz

When calculating the expected memory size for userspace, also take the
number of syncache entries into account for the surplus we add to account
for a possible increase of records in the re-entry window.

Discussed with: jhb, silby
MFC after: 1 week


211433 17-Aug-2010 jhb

Ensure a minimum "slop" of 10 extra pcb structures when providing a
memory size estimate to userland for pcb list sysctls. The previous
behavior of a "slop" of n/8 does not work well for small values of n
(e.g. no slop at all if you have less than 8 open UDP connections).

Reviewed by: bz
MFC after: 1 week
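
The estimate described above can be sketched as follows, counting records only (the real sysctl handler would also multiply by the record size). The function name is hypothetical.

```c
/*
 * Hypothetical helper: number of pcb records to reserve room for when
 * n are currently present. The slop is n/8, but never fewer than 10.
 */
static int
sketch_pcb_estimate(int n)
{
    int slop = n / 8;

    if (slop < 10)
        slop = 10;
    return n + slop;
}
```

With 4 open connections the old n/8 rule left no headroom at all, while this version reserves room for 14 records. For large n the result matches the old behaviour, for example 900 records for n = 800.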


211333 15-Aug-2010 andre

Fix the interaction between 'ICMP fragmentation needed' MTU updates,
path MTU discovery and the tcp_minmss limiter for very small MTU's.

When the MTU suggested by the gateway via ICMP, or if there isn't
any the next smaller step from ip_next_mtu(), is lower than the
floor enforced by net.inet.tcp.minmss (default 216) the value is
ignored and the default MSS (512) is used instead. However the
DF flag in the IP header is still set in tcp_output() preventing
fragmentation by the gateway.

Fix this by using tcp_minmss as the MSS and clear the DF flag if
the suggested MTU is too low. This turns off path MTU discovery
for the remainder of the session and allows fragmentation to be
done by the gateway.

Only MTU's smaller than 256 are affected. The smallest official
MTU specified is for AX.25 packet radio at 256 octets.

PR: kern/146628
Tested by: Matthew Luckie <mjl-at-luckie org nz>
MFC after: 1 week
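
The decision described above can be sketched as follows. The struct and function names are hypothetical and the real logic lives in the TCP stack. The sketch only captures the rule that an MTU yielding less than the minmss floor results in MSS = minmss with DF cleared.

```c
/* Hypothetical result type for the sketch. */
struct sketch_mtu_decision {
    int mss;     /* MSS to use for the remainder of the session */
    int set_df;  /* 1: keep DF set (PMTU discovery on); 0: allow fragmentation */
};

/*
 * suggested_mtu: MTU from the ICMP message (or the next smaller step)
 * hdr_len:       combined IP and TCP header length in bytes
 * minmss:        floor enforced by net.inet.tcp.minmss
 */
static struct sketch_mtu_decision
sketch_apply_suggested_mtu(int suggested_mtu, int hdr_len, int minmss)
{
    struct sketch_mtu_decision d;
    int mss = suggested_mtu - hdr_len;

    if (mss < minmss) {
        /* MTU too low: clamp to the floor and let the gateway fragment. */
        d.mss = minmss;
        d.set_df = 0;
    } else {
        d.mss = mss;
        d.set_df = 1;
    }
    return d;
}
```

For a suggested MTU of 200 with 40 bytes of headers and the default floor of 216, the result is an MSS of 216 with DF cleared. An ordinary 1500-byte MTU gives an MSS of 1460 with DF still set.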


211332 15-Aug-2010 andre

Initializing the new error variable to zero in syncache_socket()
is not necessary.

Noticed by: bz


211327 15-Aug-2010 andre

Add more logging points for failures in syncache_socket() to
report when a new socket couldn't be created because one of
in_pcbinshash(), in6_pcbconnect() or in_pcbconnect() failed.

Logging is conditional on net.inet.tcp.log_debug being enabled.

MFC after: 1 week


211317 14-Aug-2010 andre

When using TSO and sending more than TCP_MAXWIN sendalot is set
and we loop back to 'again'. If the remainder is less than or equal
to one full segment, the TSO flag was not cleared even though
it isn't necessary anymore. Enabling the TSO flag on a segment
that doesn't require any offloaded segmentation by the NIC may
cause confusion in the driver or hardware.

Reset the internal tso flag in tcp_output() on every iteration
of sendalot.

PR: kern/132832
Submitted by: Renaud Lienhart <renaud-at-vmware com>
MFC after: 1 week


211316 14-Aug-2010 andre

Change the messages of the ICMP bad port bandwidth limiter from
a kernel printf to a log output with the priority of LOG_NOTICE.

This way the messages still show up in /var/log/messages but no
longer spam the console every other second on busy servers that
are port scanned:
"Limiting open port RST response from 114 to 100 packets/sec"

PR: kern/147352
Submitted by: Eugene Grosbein <eugen-at-eg sd rdtc ru>
MFC after: 1 week


211315 14-Aug-2010 andre

Disable TCP inflight limiter by default.

It was experimental and interferes with the normal congestion control
algorithms by instating a separate, possibly lower, ceiling for the
amount of data that is in flight to the remote host. With high speed
internet connections the inflight limit frequently has been estimated
too low due to the noisy nature of the RTT measurements.

This code gives way for the upcoming pluggable congestion control
framework. It is the task of the congestion control algorithm to
set the congestion window and amount of inflight data without external
interference.

Reviewed by: lstewart
MFC after: 1 week
Removal after: 1 month


211193 11-Aug-2010 will

Unbreak LINT by moving all carp hooks to net/if.c / netinet/ip_carp.h, with
the appropriate ifdefs.

Reviewed by: bz
Approved by: ken (mentor)


211157 11-Aug-2010 will

Allow carp(4) to be loaded as a kernel module. Follow precedent set by
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.

Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default. Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.

This commit requires IP6PROTOSPACER, introduced in r211115.

Reviewed by: bz, simon
Approved by: ken (mentor)
MFC after: 2 weeks


211059 08-Aug-2010 delphij

Address an edge condition that we found at work, where the carp(4)
interface goes to issue LINK_UP, then LINK_DOWN, then LINK_UP at
cold boot. This behavior is not observed when carp(4) interface
is created slightly later, when the underlying interface is fully
up.

Before this change what happens at boot is roughly:

- ifconfig creates em0 interface;
- ifconfig clones a carp device using em0;
(em0's link state is DOWN at this point)
- carp state: INIT -> BACKUP [*]
- carp state: BACKUP -> MASTER
- [Some negotiation between em0 and the switch]
- em0 kicks up link state change event
(em0's link state is now UP at this point)
- do_link_state_change() -> carp_carpdev_state()
- carp state: MASTER -> INIT (via carp_set_state(sc, INIT)) [+]
- carp state: INIT -> BACKUP
- carp state: BACKUP -> MASTER

At the [*] stage, em0 did not receive any broadcast message from another
node, and assumes our node is the master, thus carp(4) sets the link
state to "UP" after becoming a master. At [+], the master status
is forcibly set to "INIT", then an election is held, after which our
node would actually become a master.

We believe that at the [*] stage, the master status should remain as
"INIT" since the underlying parent interface's link state is not up.

Obtained from: iXsystems, Inc.
Reported by: jpaetzel
MFC after: 2 months


211057 08-Aug-2010 ed

Don't use struct timezone.

The timezone structure acquired by gettimeofday() is not used at all.
Just remove it.


210866 05-Aug-2010 tuexen

Fix a bug where endpoints bound to wildcard addresses were
using addresses not announced to the peer due to address
scoping.

MFC after: 3 weeks


210714 01-Aug-2010 tuexen

Cleanup code.

MFC after: 2 weeks


210703 31-Jul-2010 bz

Document the mandatory argument to the arptimer() and
nd6_llinfo_timer() functions with a KASSERT().
Note: there is no need to return after panic.

In the legacy IP case, only assign the arg after the check,
in the IPv6 case, remove the extra checks for the table and
interface as they have to be there unless we freed and forgot
to cancel the timer. It doesn't matter anyway as we would
panic on the NULL pointer deref immediately and the bug is
elsewhere.
This unifies the code of both address families to some extent.

Reviewed by: rwatson
MFC after: 6 days


210686 31-Jul-2010 bz

MFp4 @181628:

Free the rtentry after we disconnected it from the FIB and are counting
it as rttrash. There might still be a chance we leak it from a different
code path but there is nothing we can do about this here.

Sponsored by: ISPsystem (in February)
Reviewed by: julian (in February)
MFC after: 2 weeks


210666 30-Jul-2010 andre

Fix a bug in syncache where the initial CWND for new incoming connections
was limited to one segment under the faulty assumption of a retransmit.
Due to this the opportunity to initialize the increased congestion window
according to RFC3390 was missed.

Support for RFC3465 introduced in r187289 uncovered the bug as the ACK
to SYN/ACK no longer caused snd_cwnd increase by MSS (actually, this
increase shouldn't happen as it's explicitly forbidden by RFC3390, but
it's another issue). Snd_cwnd remains really small (1*MSS + 1) and this
causes really bad interaction with delayed acks on other side.

The variable name sc_rxmits is a bit misleading as it counts all transmits,
not just retransmits.

Submitted by: Maxim Dounin <mdounin-at-mdounin-dot-ru>
MFC after: 10 days


210600 29-Jul-2010 rrs

Fix the comment block that has the nice
table to really have the nice table :-)

MFC after: 1 month


210599 29-Jul-2010 rrs

PR SCTP Bugs. Basically a full sized frame of
PR SCTP FWD-TSN's would not be sent and thus
cause a stalled connection. The rwnd
calculation was also off on the receiver side for
PR-SCTP.
MFC after: 1 month


210537 27-Jul-2010 glebius

Fix operation of "netgraph" action in conjunction with the
net.inet.ip.fw.one_pass sysctl.

The "ngtee" action is still broken.

PR: kern/148885
Submitted by: Nickolay Dudorov <nnd mail.nsk.ru>


210495 26-Jul-2010 tuexen

Fix a bug where the length of a FORWARD-TSN chunk was set incorrectly in
the chunk. This resulted in malformed frames.
Remove a duplicate assignment.

MFC after: 2 weeks


210494 26-Jul-2010 rrs

Make sure that we report chunks if a socket
still exists that were not sent. In either
case carefully remove the data if it does not
get taken by the reporting routines.

MFC after: 2 weeks


210493 26-Jul-2010 rrs

When counting the number of chunks in the
retransmission queue to validate the retran count, we
need to include the chunks in the control send queue
too. Otherwise the count will not match and you will get
the invariant warning if invariants are on.

MFC after: 2 weeks


210203 18-Jul-2010 lstewart

- Move common code from the hook functions that fills in a packet node struct to
a separate inline function. This further reduces duplicate code that didn't
have a good reason to stay as it was.

- Reorder the malloc of a pkt_node struct in the hook functions such that it
only occurs if we managed to find a usable tcpcb associated with the packet.

- Make the inp_locally_locked variable's type consistent with the prototype of
siftr_siftdata().

Sponsored by: FreeBSD Foundation


210160 16-Jul-2010 imp

machine/cpu.h isn't appropriate for this file, so remove it.


210123 15-Jul-2010 luigi

remove some conditional #ifdefs (no-op on FreeBSD);
run the timer routine on cpu 0.


210120 15-Jul-2010 luigi

whitespace fixes


210119 15-Jul-2010 luigi

fix a comment and final empty line


209982 13-Jul-2010 lstewart

The SIFTR DPCPU statistics struct was not being zeroed between enable/disable
cycles so the values would accumulate rather than reset for each cycle.

Sponsored by: FreeBSD Foundation


209980 13-Jul-2010 lstewart

Catch up with the rename of DPCPU_SUM to DPCPU_VARSUM in r209978.

Sponsored by: FreeBSD Foundation


209845 09-Jul-2010 glebius

Improve last commit: use bpf_mtap2() to avoid stack usage.

Prodded by: julian


209797 08-Jul-2010 glebius

Since r209216 bpf(4) searches for mbuf_tags(9) and thus will not work with
a stub m_hdr instead of a full mbuf.

PR: kern/148050


209663 03-Jul-2010 rrs

This fixes a crash in SCTP. It was possible to have a
large number of packets queued to a crashing process.
In a specific case you may get 2 ABORT's back (from
say two packets in flight). If the aborts happened to
be processed at the same time it's possible to have
one free the association while the other is trying
to report all the outbound packets. When this occurred
it could lead to a crash.

MFC after: 3 days


209662 03-Jul-2010 lstewart

Import the Statistical Information For TCP Research (SIFTR) kernel module into
FreeBSD. SIFTR logs a range of statistics on active TCP connections to a log
file, providing the ability to make highly granular measurements of TCP
connection state. The tool is aimed at system administrators, developers and
researchers alike. Please take it for a spin and test it out - the man page
should have all the information required to get you going.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: dwmalone, gnn, rpaulo
Tested by: Many on freebsd-current@ and elsewhere over the years
MFC after: 1 month


209644 02-Jul-2010 rrs

Fix a bug that WILL cause a panic. Basically
a read-lock is being called to check the vtag-timewait cache.
Then in two cases (where a vtag is bad i.e. in the time-wait
state) the write-unlock is called NOT the read-unlock. Under
conditions where lots of associations are coming and going
this will cause the system to panic at some point.

MFC after: 3 days


209589 29-Jun-2010 glebius

After processing the O_SKIPTO opcode our cmd points to the next rule, and
"match" processing at the end of inner loop would look ahead into the next
rule, which is incorrect. Particularly, in the case when the next rule
started with F_NOT opcode it was skipped blindly.

To fix this, exit the inner loop with the continue operator forcibly and
explicitly.

PR: kern/147798


209499 24-Jun-2010 tuexen

Fix a bug I introduced in r209470.

MFC after: 3 days


209470 23-Jun-2010 tuexen

* Implement sctp_does_stcb_own_this_addr() correctly. It was taking the
wrong side into account.
* sctp_findassociation_ep_addr() must check the local address if available.
This fixes a bug where ABORT chunks were accepted even in the case where
the local address was not owned by the endpoint.
Thanks to brucec for pointing out a bug in my first version of the fix.
MFC after: 3 days


209289 18-Jun-2010 tuexen

Fix a race condition in the shutdown handling.
The race condition resulted in a panic.

MFC after: 3 days


209178 14-Jun-2010 tuexen

* Fix a bug where the length of the ASCONF-ACK was calculated wrong due
to using an uninitialized variable.
* Fix a bug where a NULL pointer was dereferenced when interfaces
come and go at a high rate.
* Fix a bug where inps were not deregistered from iterators.
* Fix a race condition in freeing an association.
* Fix a refcount problem related to the iterator.
Each of the above bugs results in a panic. It shows up when
interfaces come and go at a high rate.

Obtained from: rrs (partly)
MFC after: 3 days


209029 11-Jun-2010 rrs

3 Fixes -
a) There was a case where an ICMP message could cause
us to return leaving a stuck lock on an stcb.
b) The iterator needed some tweaks to fix its lock
ordering.
c) The ITERATOR_LOCK is no longer needed in the freeing
of a stcb. Now that the timer based one is gone we don't
have a multiple resume situation. Add to that that there
was somewhere a path out of the freeing of an assoc that
did NOT release the iterator_lock.. it was time to clean
this old code up and in the process fix the lock bug.

MFC after: 1 week


208970 09-Jun-2010 rrs

Found by Michael. In cases where we run
out of memory (no more inp space) we don't
properly NULL the INP on return.

Obtained from: tuexen
MFC after: 3 Days


208953 09-Jun-2010 rrs

Fix several bugs all having to do with freeing an
sctp_inpcb:
1) Make sure not to remove the flag on the PCB until
after the close() caller is back in control with the
lock. Otherwise a quickly freeing assoc could kill the
inpcb and cause a panic.

2) Make sure all calls to log_closing have not released
the locks before calling the log function, we don't
want the logging function to crash us due to a freed
inpcb.

3) Make sure that when we get to the end, we release all
locks (after removing them from view) and as long as
we are NOT the inp-kill timer removing the inp, call
the callout_drain() function so a racing timer won't
later call in and cause a racing crash.
MFC after: 1 week


208952 09-Jun-2010 rrs

BUG: Turns out we need to use both bit maps
to calculate the cum-ack (we were not doing
it for the NR-Sack case). With this fix
NR-sack should now work correctly.
MFC after: 1 week


208902 08-Jun-2010 rrs

2 Bugs:

1) Only use both mapping arrays when NR sack is off. This
way we can hold off moving the cumack (not the best but
workable) when NR-sack is on.

2) We must make sure to just return on the move of the
bit to the NR array if the cum-ack has already gone
past the TSN. This prevents marking a bit behind the
array and hitting the invariant code that panic's us.

MFC after: 1 week


208897 07-Jun-2010 rrs

This fixes a BUG in the handling of the cum-ack calculation.
We were only paying attention to the nr-mapping-array. Which
seems to make sense on the surface, by definition things
up to the cum-ack should be deliverable thus in the nr-mapping-array.
However (there is always a gotcha) that's not true when it
comes to large messages. The stack may hold the message
while re-assembling it and not deliver it based on several
thresholds. If that happens (which it would for smaller
large messages) then the cum-ack is figured wrong. We
now properly use both arrays in the cum-ack calculation.

MFC after: 1 week.


208891 07-Jun-2010 rrs

Oops... my bad.. we don't need a SOCK_UNLOCK() after
calling socantrcvmore_locked() since it will unlock
the lock for you.

MFC after: 1 week


208883 07-Jun-2010 rrs

Fix so we call socantrcvmore_locked so we
don't see a race where we unlock to call
the non-locked version and have the socket
go away.

MFC after: 1 week


208879 06-Jun-2010 rrs

1) Optimize the cleanup and don't always depend on
the timer. This is done by considering the locks
we will destroy and if they are contended we consider
it the same as a reference count being up. Fixing this
appears to cleanup another crash that was appearing with
all the timers where the socket buf lock got corrupted.

2) Fix the sysctl code to take a lot more care when looking
at INP's that are in the GONE or ALLGONE state.

MFC after: 1 week


208878 06-Jun-2010 rrs

Ok, yet another bug in killing off all the hundreds
of apitesters.. Basically we end up with attempting
to destroy a lock that's contended on. A cookie echo
arrives at the same time that the close is happening.
The close gets the lock but the cookie echo has already
passed the check for the gone flag and is then locked
waiting on the create lock.. when we go to destroy it
bam. For now we do the timer destroy for all calls
to close.. We can probably optimize this later so that
we check what's being contended on and if there is contention
then do the timer thing. but this is probably safest since
the inp has been removed from all lists and references and
only the timer can find it.. once the locks are released all
other places will instantly see the GONE flag and bail (that's
what the change in sctp_input is, one place that was lacking
the bail code).

MFC after: 1 week


208876 06-Jun-2010 rrs

1) Further enhance the INVARIANT lock validation (no locks are
held) by checking the create and inp locks as well.

2) Fix a bug in that when a socket is closed and an INIT-ACK
is returned, we do NOT unlock the locked_tcb unless it's
different (an unlikely scenario). If we blindly unlock as
we were doing before we can end up unlocking the actual
stcb that's about to be sent down to the free function which
requires the lock be held.

MFC after: 1 week


208875 06-Jun-2010 rrs

Fix a bug in the sctp_inpcb_free. Basically if the socket
was setup to do an abortive close an association that was
in the accept_queue could get stuck and never freed. Now
we properly start the kill timer on the socket and turn
off the flag (same thing we do for the graceful close method).
MFC after: 1 week


208874 06-Jun-2010 rrs

Fix a bug in sctp_abort_assoc(). DON'T call the sctp_inpcb_free
when the gone flag is set. You don't know what locks the
caller has set and there is already a kill timer running.

MFC after: 1 week


208864 06-Jun-2010 rrs

Hopefully this fixes a LOR by making
so we only hold the iterator lock during
updates to the iterators work.

MFC after: 1 week


208863 06-Jun-2010 rrs

Bruce's fix for some return's in
error legs.

MFC after: 1 week


208857 05-Jun-2010 rrs

Purge out a Windows def that somehow slipped
past the scrubber.

MFC after: 1 Week


208856 05-Jun-2010 rrs

Spacing issues

MFC after: 1 Week


208855 05-Jun-2010 rrs

This change does the following:
1) Fix the alignment of a comment.
2) Fix a BUG where we were NOT paying attention
to the RESEND marking on retransmitting control
chunks.. and worse we were not decrementing the
retran count that could cause us to loop forever.
3) Add in the validate_no_lock function on invariants
so that we will really check all ways out to be sure
a lock does not slip out locked.

MFC after: 1 week.


208854 05-Jun-2010 rrs

Use the proper increment macro when increasing the
number on sent_queue_retran_cnt.

MFC after: 1 week


208853 05-Jun-2010 rrs

This does two changes:
1) Makes it so that the INVARIANT function validate nolocks is
available anywhere.
2) Fixes a BUG where a close has been done on a collision socket
and the cookie processing would return leaving a lock held.
MFC after: 1 week


208852 05-Jun-2010 rrs

This fixes a bug in the close up of a socket that
had un-accepted assoc's. Basically the assoc (and inp)
would get stuck and never get cleaned up.

MFC after: 1 week


208744 02-Jun-2010 zec

Virtualize the IPv4 multicast routing code.

Submitted by: iprebeg
Reviewed by: bms, bz, Pavlin Radoslavov
MFC after: 30 days


208553 25-May-2010 qingli

This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after: 3 days


208160 16-May-2010 rrs

This adds back the Iterator to the sctp
code base. We now properly have ONE thread
that services all VNET's. Also we purge out
the old timer based iterator code which had
multiple LOR's and other issues.

MFC after: 3 days


207985 12-May-2010 rrs

Fix an old long time bug in generating a
fwd-tsn. This would appear when greater than
the size of mbuf TSN's would need to be skipped.

MFC after: 3 days


207983 12-May-2010 rrs

More PR-SCTP bugs:
- Make sure that when you kick the streams you add correctly
using a 16 bit unsigned.
- Make sure when sending out you allow FWD-TSN to skip over
and list the ACKED chunks in the stream/seq list (so the
rcv will kick the stream)
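
The commit does not show the arithmetic involved, so the following is
only a sketch of the general pitfall, with a hypothetical function
name: in C, adding 1 to a uint16_t promotes to int, so the sum must be
truncated back to 16 bits for the wrap from 65535 to 0 to compare
correctly.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if seq is the stream sequence number that directly
 * follows last_delivered, including across the 16-bit wrap.
 * Hypothetical helper; not taken from the SCTP sources.
 */
bool
is_next_in_stream(uint16_t last_delivered, uint16_t seq)
{
	/* The cast truncates the int-promoted sum (65535 + 1 becomes 0). */
	uint16_t expected = (uint16_t)(last_delivered + 1);

	return (seq == expected);
}
```

Without the cast, the comparison against 65536 would never match a
16-bit sequence number and the stream would not advance at the wrap.
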
MFC after: 3 days


207966 12-May-2010 tuexen

Get rid of unused constants.

MFC after: 3 days.


207963 12-May-2010 rrs

This fixes PR-SCTP issues:
- Slide the map at the proper place.
- Mark the bits in the nr_array ONLY if there
is no marking.
- When generating a FWD-TSN we allow us to skip past
ACKED chunks too.

MFC after: 1 week


207924 11-May-2010 rrs

This fixes a bug with the one-2-one model socket when a
user sets up a socket to a server sends data and closes
the socket before the server has called accept(). It used
to NOT work at all. Now we add a flag to the assoc and
defer assoc cleanup so that the accept will succeed.


207369 29-Apr-2010 bz

MFP4: @176978-176982, 176984, 176990-176994, 177441

"Whitespace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOBALS (before r185088) and try to
reduce the diff to stable/7 and earlier as much as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days


207277 27-Apr-2010 bz

Enhance the historic behaviour of raw sockets and jails in a way
that we allow all possible jail IPs as source address rather than
forcing the "primary". While IPv6 naturally has source address
selection, for legacy IP we do not go through the pain in case
IP_HDRINCL was not set. People should bind(2) for that.

This will, for example, allow ping(|6) -S to work correctly for
non-primary addresses.

Reported by: (ten 211.ru)
Tested by: (ten 211.ru)
MFC after: 4 days


207275 27-Apr-2010 bms

Fix a regression where DVMRP diagnostic traffic, such as that used
by mrinfo and mtrace, was dropped by the IGMP TTL check. IGMP control
traffic must always have a TTL of 1.

Submitted by: Matthew Luckie
MFC after: 3 days


207197 25-Apr-2010 tuexen

Sending a FWDTSN chunk should not affect the retran count.

MFC after: 3 days.


207191 25-Apr-2010 tuexen

Undo my latest fix since that wasn't one at all.

MFC after: 3 days.


207099 23-Apr-2010 tuexen

* Fix compilation when using SCTP_AUDITING_ENABLED.
* Fix delaying of SACK by taking out old optimization code
which does not optimize anymore.
* Fix fast retransmission of chunks abandoned by the
"number of retransmissions" policy.

MFC after: 3 days.


206989 21-Apr-2010 bz

Avoid memory access after free. Use the (shortened) copy for the
ipsec mtu lookup as well.

PR: kern/145736
Submitted by: Peter Molnar (peter molnar.cc)
MFC after: 3 days


206892 20-Apr-2010 tuexen

Update highest_tsn variables when sliding mapping arrays.


206891 20-Apr-2010 tuexen

Really print the nr_mapping array when it should be printed.

MFC after: 3 days.


206845 19-Apr-2010 luigi

whitespace fixes (trailing whitespace, bad indentation
after a merge, etc.)


206844 19-Apr-2010 ken

Don't clear other flags (e.g. CSUM_TCP) when setting CSUM_TSO. This was
causing TSO to break for the Xen netfront driver.
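
The difference is presumably between assigning the flag word and OR-ing
a bit into it; that reading is an assumption, as the diff is not shown
here. A minimal sketch with made-up bit values (the real CSUM_*
constants are defined in sys/mbuf.h) and hypothetical function names:

```c
#include <stdint.h>

/* Illustrative bit values only; not the kernel's definitions. */
#define EX_CSUM_TCP 0x0002u
#define EX_CSUM_TSO 0x0020u

/* Clobbering pattern: assignment discards flags that were already set. */
uint32_t
set_tso_clobber(uint32_t csum_flags)
{
	(void)csum_flags;
	return (EX_CSUM_TSO);
}

/* Preserving pattern: OR the TSO bit in and keep e.g. the TCP request. */
uint32_t
set_tso_preserve(uint32_t csum_flags)
{
	return (csum_flags | EX_CSUM_TSO);
}
```

A driver that expects both the TCP checksum flag and the TSO flag on a
TSO packet only sees both with the second form.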

Reviewed by: gibbs, rwatson
MFC after: 7 days


206840 19-Apr-2010 tuexen

Get delayed SACK working again.

MFC after: 3 days.


206758 17-Apr-2010 tuexen

Fix a bug where SACKs are not sent when they should.
Move some protection code to INVARIANTS.
Cleanups.

MFC after: 3 days.


206481 11-Apr-2010 bz

Plug reference leaks in the link-layer code ("new-arp") that previously
prevented the link-layer entry from being freed.

In both in.c and in6.c (though that code path seems to be basically dead)
plug a reference leak in case of a pending callout being drained.

In if_ether.c consistently add a reference before resetting the callout
and in case we canceled a pending one remove the reference for that.
In the final case in arptimer, before freeing the expired entry, remove
the reference again and explicitly call callout_stop() to clear the active
flag.

In nd6.c:nd6_free() we are only ever called from the callout function and
thus need to remove the reference there as well before calling into
llentry_free().

In if_llatbl.c when freeing entire tables make sure that in case we cancel
a pending callout to remove the reference as well.

Reviewed by: qingli (earlier version)
MFC after: 10 days
Problem observed, patch tested by: simon on ipv6gw.f.o,
Christian Kratzer (ck cksoft.de),
Evgenii Davidov (dado korolev-net.ru)
PR: kern/144564
Configurations still affected: with options FLOWTABLE


206461 10-Apr-2010 bz

Try to help with a virtualized dummynet after r206428.

This adds the explicit include (so far probably included through one of the
few "hidden" includes in other header files) for vnet.h and adds a cast
to unbreak LINT-VIMAGE.


206456 10-Apr-2010 rpaulo

Honor the CE bit even when the CWR bit is set.

PR: 145600
Submitted by: Richard Scheffenegger <rs at netapp.com>
MFC after: 1 week


206452 10-Apr-2010 bms

Fix a few issues related to the legacy 4.4 BSD multicast APIs.

IPv4 addresses can and do change during normal operation. Testing by
pfSense developers exposed an issue where OpenOSPFD was using the IPv4
address to leave the OSPF link-scope multicast groups on a dynamic
OpenVPN tun interface, rather than using RFC 3678 with the interface
index, which won't be raced when the interface's addresses change.

In inp_join_group():
If we are already a member of an ASM group, and IP_ADD_MEMBERSHIP or
MCAST_JOIN_GROUP ioctls are re-issued, return EADDRINUSE as per the
legacy 4.4BSD multicast API. This bends RFC 3678 slightly, but does
not violate POLA for apps using the old API.
It also stops us falling through to kicking IGMP state transactions
in what is otherwise a no-op case.
[This has already been dealt with in HEAD, but make it explicit before
we MFC the change to 8.]

In inp_leave_group():
Fix a bogus conditional.
Move the ifp null check to ioctls MCAST_LEAVE* in the switch..case
where it actually belongs.
If an interface was specified, by primary IPv4 address, for ioctl
IP_DROP_MEMBERSHIP or MCAST_LEAVE_GROUP (an ASM full leave operation),
then and only then should we look up the ifp from the IPv4 address in
mreqs.imr_interface.
If not, we fall through to imo_match_group() as before, but only in
the IP_DROP_MEMBERSHIP case.

With these changes, the legacy 4.4BSD multicast API idempotence should
be mostly preserved in the SSM enabled IPv4 stack.

Found by: ermal (with pfSense)
MFC after: 3 days


206428 09-Apr-2010 luigi

This commit enables partial operation of dummynet with kernels
compiled with "options VIMAGE".
As it is now, there is still a single instance of the pipes,
and it is only usable from vnet0 (the main instance).
Trying to use a pipe from a different vimage does not crash
the system as it did before, but the traffic coming out from
the pipe goes to the wrong place, and i still need to
figure out where.

Support for per-vimage pipes is almost there (just a matter of
uncommenting the VNET_* definitions for dn_cfg, plus putting into
the structure the remaining static variables), however i need
first to figure out how init/uninit work, and also to understand
where packets are ending up on exit from a pipe.

In summary: vimage support for dummynet is not complete yet,
but we are getting there.


206425 09-Apr-2010 luigi

no need to pass an argument to dn_compat_calc_size()

MFC after: 3 days


206339 07-Apr-2010 luigi

Hopefully fix the recent breakage in rule deletion.
A few more tests and this will also go into -stable where
the problem is more critical.


206281 06-Apr-2010 tuexen

Fix an off-by-one bug in zeroing out the mapping arrays.
Fix sctp_print_mapping_array().

MFC after: 1 week


206151 04-Apr-2010 tuexen

Use also SCTP/IPv6 checksum offloading in special cases.

MFC after: 2 weeks


206137 03-Apr-2010 tuexen

* Fix some race condition in SACK/NR-SACK processing.
* Fix handling of mapping arrays when draining mbufs or processing
FORWARD-TSN chunks.
* Cleanup code (no duplicate code anymore for SACKs and NR-SACKs).
Part of this code was developed together with rrs.
MFC after: 2 weeks.


206022 31-Mar-2010 delphij

Add definition of IPv6 mobility header's protocol number, as assigned by
IANA and defined in RFC 3775.

Obtained from: KAME


205955 31-Mar-2010 luigi

fix bug in previous commit related to rule deletion
(stable/8 just fixed moments ago)


205831 29-Mar-2010 luigi

remove a leftover debugging message


205830 29-Mar-2010 luigi

Fix handling of set manipulations.
This patch has two fixes for potential kernel panics (one wrong
index, one access to the wrong lock) and two fixes to wrong logic
in a conditional. The potential panics are also on stable/8,
so I am going to MFC the fix quickly.


205629 24-Mar-2010 rrs

Adds the option of keeping per-cpu statistics in SCTP. This
may be useful since it gets rid of atomics but I want it to
remain an option until I can do further testing on if it really
speeds things up.


205628 24-Mar-2010 rrs

lagging file I forgot to commit with my nr-sack fixes... oops

Reviewed by: tuexen@freebsd.org


205627 24-Mar-2010 rrs

Fix for NR-Sack code. The code was NOT working properly when
enabled. Basically most of the operations were incorrect causing
bad sacks when you enabled nr-sack. The fixes range across
4 files and unify most of the processing so that we only test
nr_sack flags to decide which type of sack to generate.

Optimization left for this is to combine the sack generation
code and make it capable of generating either sack thus shrinking
out a routine.

Reviewed by: tuexen@freebsd.org


205602 24-Mar-2010 luigi

Honor ip.fw.one_pass when a packet comes out of a pipe without being delayed.
I forgot to handle this case when i did the mtag cleanup three months ago.

PR: 145004


205502 23-Mar-2010 rrs

Fixes a bug where SACKs in the face of
mapping_array expansion would break. Basically
once we expanded the array we no longer had both
mapping arrays in sync which the sack processing code depends on.
This would mean we were randomly referring to memory that was probably
not there. This mostly just gave us bad sack results going back to the peer.
If INVARIANTS was on of course we would hit the panic routine in the sack_check
call.

We also add a print routine for the place where one would panic in
invariants so one can see what the main mapping array holds.

Reviewed by: tuexen@freebsd.org
MFC after: 2 weeks


205488 22-Mar-2010 kmacy

- boot-time size the ipv4 flowtable and the maximum number of flows
- increase flow cleaning frequency and decrease flow caching time
when near the flow limit
- stop allocating new flows when within 3% of maxflows; don't start
allocating again until below 12.5%
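
The two thresholds describe a hysteresis. The sketch below assumes one
possible reading (suspend when fewer than 3% of maxflows remain free,
resume once more than 12.5% are free again); the struct and function
names are hypothetical and the actual flowtable code may measure the
thresholds differently.

```c
#include <stdbool.h>
#include <stdint.h>

struct flow_gate {
	uint32_t maxflows;	/* configured flow limit */
	bool     blocked;	/* true while allocation is suspended */
};

/*
 * Returns true if a new flow may be allocated at the current count.
 * Assumed reading: block below 3% free, unblock above 12.5% (1/8) free.
 */
bool
flow_alloc_allowed(struct flow_gate *g, uint32_t nflows)
{
	uint64_t free_slots = (nflows < g->maxflows) ?
	    (uint64_t)(g->maxflows - nflows) : 0;

	if (g->blocked) {
		/* Resume only after headroom exceeds one eighth of the limit. */
		if (free_slots * 8 > g->maxflows)
			g->blocked = false;
	} else if (free_slots * 100 < (uint64_t)g->maxflows * 3) {
		/* Suspend once headroom drops under 3% of the limit. */
		g->blocked = true;
	}
	return (!g->blocked);
}
```

The gap between the two thresholds keeps the allocator from flapping
on and off while the table hovers near the limit.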

MFC after: 7 days


205417 21-Mar-2010 luigi

Add a priority-based packet scheduler.

Sponsored by: The ONELAB2 Project
Submitted by: Riccardo Panicucci


205415 21-Mar-2010 luigi

no need for ipfw_flush_tables(), we just need ipfw_destroy_tables()


205414 21-Mar-2010 luigi

revise documentation


205391 20-Mar-2010 kmacy

- spread tcp timer callout load evenly across cpus if net.inet.tcp.per_cpu_timers is set to 1
- don't default to acquiring tcbinfo lock exclusively in rexmt

MFC after: 7 days


205251 17-Mar-2010 bz

Add pcb reference counting to the pcblist sysctl handler functions
to ensure type stability while caching the pcb pointers for the
copyout.

Reviewed by: rwatson
MFC after: 7 days


205178 15-Mar-2010 luigi

small fixes to estimate the buffer size when requesting all pipes/flows.


205173 15-Mar-2010 luigi

+ implement (two lines) the kernel side of 'lookup dscp N' to use the
dscp as a search key in table lookups;

+ (re)implement a sysctl variable to control the expire frequency of
pipes and queues when they become empty;

+ add 'queue number' as optional part of the flow_id. This can be
enabled with the command

queue X config mask queue ...

and makes it possible to support priority-based schedulers, where
packets should be grouped according to the priority and not some
fields in the 5-tuple.
This is implemented as follows:
- redefine a field in the ipfw_flow_id (in sys/netinet/ip_fw.h) but
without changing the size or shape of the structure, so there are
no ABI changes. On passing, also document how other fields are
used, and remove some useless assignments in ip_fw2.c

- implement small changes in the userland code to set/read the field;

- revise the functions in ip_dummynet.c to manipulate masks so they
also handle the additional field;

There are no ABI changes in this commit.


205157 14-Mar-2010 rwatson

Abstract out initialization of most aspects of struct inpcbinfo from
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot. As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.

MFC after: 1 month
Reviewed by: bz
Sponsored by: Juniper Networks


205104 12-Mar-2010 rrs

The proper fix for the delayed SCTP checksum is to
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
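
As a sketch of the two offset computations named above (function names
and the EX_ constant are hypothetical; the IPv6 case assumes no
extension headers precede SCTP):

```c
#include <stdint.h>

#define EX_IPV6_HDR_LEN 40u	/* size of the fixed IPv6 header in bytes */

/* IPv4: ip_hl counts 32-bit words, so shift left by 2 to get bytes. */
uint32_t
sctp_offset_v4(uint8_t ip_hl)
{
	return ((uint32_t)ip_hl << 2);
}

/* IPv6: SCTP starts right after the fixed-size header. */
uint32_t
sctp_offset_v6(void)
{
	return (EX_IPV6_HDR_LEN);
}
```

An IPv4 header without options (ip_hl of 5) gives an offset of 20
bytes; the maximum ip_hl of 15 gives 60.
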
PR: 144529
MFC after: 2 weeks


205066 12-Mar-2010 kmacy

- restructure flowtable to support ipv6
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate

MFC after: 7 days


205050 11-Mar-2010 luigi

implement listing of a subset of pipes/queues/schedulers.
The filtering of the output is done in the kernel instead of userland
to reduce the amount of data transferred.


204954 10-Mar-2010 luigi

fix handling of commands issued by RELENG_7 version of /sbin/ipfw.

Submitted by: Riccardo Panicucci


204902 09-Mar-2010 qingli

One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.

The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.

Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.

MFC after: 5 days


204866 08-Mar-2010 luigi

cosmetic changes and C++ compatibility


204865 08-Mar-2010 luigi

don't use C++ keywords as variable names


204862 08-Mar-2010 luigi

do not report an error unnecessarily


204838 07-Mar-2010 bz

Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every stopped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Sponsored by: ISPsystem
Reviewed by: rwatson
MFC after: 5 days


204837 07-Mar-2010 bz

Not only flush the ipfw tables when unloading ipfw or tearing
down a virtual network stack, but also free the Radix Node Head.

Sponsored by: ISPsystem
Reviewed by: julian
MFC after: 5 days


204830 07-Mar-2010 rwatson

Locking the tcbinfo structure should not be necessary in tcp_timer_delack(),
so don't.

MFC after: 1 week
Reviewed by: bz
Sponsored by: Juniper Networks


204829 07-Mar-2010 rwatson

Add comment in tcp_discardcb() talking about how we don't, but should,
address TCP races relating to not calling tcp_drain() on stopped callouts.

Discussed with: bz


204826 07-Mar-2010 rwatson

Make udp_set_kernel_tunneling() less forgiving when its invariants are
violated: so_pcb can never be NULL for a valid UDP socket, and it is
always SOCK_DGRAM. Use sotoinpcb() as the rest of the UDP code does.

MFC after: 1 week
Reviewed by: bz
Sponsored by: Juniper Networks


204810 06-Mar-2010 rwatson

Remove unnecessary locking of divcbinfo lock from div_output(): this has not
been required since FreeBSD 7.0 when the so_pcb pointer leading to inp was
guaranteed to be stable when a valid socket reference is held (as it is in
the output path).

MFC after: 1 week
Reviewed by: bz
Sponsored by: Juniper Networks


204809 06-Mar-2010 rwatson

Add a comment to tcp_usr_accept() to indicate why it is we acquire the
tcbinfo lock there: r175612, which re-added it, masked a race between
sonewconn(2) and accept(2) that could allow an incompletely initialized
address on a newly-created socket on a listen queue to be exposed. Full
details can be found in that commit message.

MFC after: 1 week
Sponsored by: Juniper Networks


204807 06-Mar-2010 bz

Destroy UDP UMA zones (empty or not) upon network stack teardown
to not leak them making the VM subsystem unhappy with every stopped vnet(*).
We will still leak pages (especially as zones are marked NOFREE).

(*) This will also keep vmstat -z more usable.

Sponsored by: ISPsystem
MFC after: 5 days


204806 06-Mar-2010 rwatson

Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

MFC after: 1 week


204763 05-Mar-2010 luigi

plug a memory leak on pipe's reconfiguration


204754 05-Mar-2010 luigi

fix a memory leak when deleting RED queues


204736 04-Mar-2010 luigi

portability fixes


204735 04-Mar-2010 luigi

don't use keywords as variable names.


204714 04-Mar-2010 luigi

use callout_drain() (outside the lock) when unloading the module.
This prevents a potential deadlock.

Submitted by: Francesco Magno


204713 04-Mar-2010 luigi

improve compatibility with RELENG_7.2


204591 02-Mar-2010 luigi

Bring in the most recent version of ipfw and dummynet, developed
and tested over the past two months in the ipfw3-head branch. This
also happens to be the same code available in the Linux and Windows
ports of ipfw and dummynet.

The major enhancement is a completely restructured version of
dummynet, with support for different packet scheduling algorithms
(loadable at runtime), faster queue/pipe lookup, and a much cleaner
internal architecture and kernel/userland ABI which simplifies
future extensions.

In addition to the existing schedulers (FIFO and WF2Q+), we include
a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new,
very fast version of WF2Q+ called QFQ.

Some test code is also present (in sys/netinet/ipfw/test) that
lets you build and test schedulers in userland.

Also, we have added a compatibility layer that understands requests
from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries,
and replies correctly (at least, it does its best; sometimes you
just cannot tell who sent the request and how to answer).
The compatibility layer should make it possible to MFC this code in a
relatively short time.

Some minor glitches (e.g. handling of ipfw set enable/disable,
and a workaround for a bug in RELENG_7's /sbin/ipfw) will be
fixed with separate commits.

CREDITS:
This work has been partly supported by the ONELAB2 project, and
mostly developed by Riccardo Panicucci and myself.
The code for the qfq scheduler is mostly from Fabio Checconi,
and Marta Carbone and Francesco Magno have helped with testing,
debugging and some bug fixes.


204522 01-Mar-2010 joel

The NetBSD Foundation has granted permission to remove clauses 3 and 4 from
their software.

Obtained from: NetBSD


204143 20-Feb-2010 bz

Upon virtual network stack teardown properly release the TCP syncache
resources.

Sponsored by: ISPsystem
Reviewed by: rwatson
MFC After: 5 days


204141 20-Feb-2010 tuexen

Fix handling of SHUTDOWN-ACK chunk in COOKIE_WAIT and COOKIE_ECHOED.

MFC after: 1 week


204140 20-Feb-2010 bz

Split up ip_drain() into an outer lock and iterator part and
a "locked" version that will only handle a single network stack
instance. The latter is called directly from ip_destroy().

Hook up an ip_destroy() function to release resources from the
legacy IP network layer upon virtual network stack teardown.

Sponsored by: ISPsystem
Reviewed by: rwatson
MFC After: 5 days


204096 19-Feb-2010 tuexen

* Fix another u_long -> uint32_t issue.
* Remove an unused global variable.
* Fix an issue reported by Bruce Cran related to reusing SCTP sockets which
were connected.

MFC after: 1 week


204068 18-Feb-2010 pjd

No need to include security/mac/mac_framework.h here.


204040 18-Feb-2010 tuexen

Use uint32_t instead of u_long.

MFC after: 1 week


204003 17-Feb-2010 luigi

remove recursive lock/unlock calls, we do them already before entering
the switch.

Reported by: Marta Carbone


203847 13-Feb-2010 tuexen

Add missing SCTP_PACKED. Spotted by Irene Ruengeler.

MFC after: 1 week


203724 09-Feb-2010 bz

Properly free resources when destroying the TCP hostcache while
tearing down a network stack (in the VIMAGE jail+vnet case).

For that break out the logic from tcp_hc_purge() into an internal
function we can call from both, the sysctl handler and the
tcp_hc_destroy().

Sponsored by: ISPsystem
Reviewed by: silby, lstewart
MFC After: 8 days


203503 04-Feb-2010 tuexen

Restore the checksum received before processing the packet.

MFC after: 1 week


203401 02-Feb-2010 qingli

Some of the existing ppp and vpn related scripts create and set
the IP addresses of the tunnel end points to the same value. In
these cases the loopback route is not installed for the local
end.

Verified by: avg
MFC after: 5 days


203343 01-Feb-2010 luigi

use u_char instead of u_int for short bitfields.

For our compiler the two constructs are completely equivalent, but
some compilers (including MSC and tcc) use the base type for alignment,
which in the cases touched here result in aligning the bitfields
to 32 bit instead of the 8 bit that is meant here.

Note that almost all other headers where small bitfields
are used have u_int8_t instead of u_int.

MFC after: 3 days


202782 22-Jan-2010 tuexen

Use [] instead of [0] for flexible arrays.
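
A minimal sketch of the C99 flexible array member form, assuming a made-up
chunk layout (struct demo_chunk and demo_chunk_alloc() are hypothetical and
not taken from the SCTP headers):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical chunk layout, for illustration only. */
struct demo_chunk {
	uint16_t length;	/* number of payload bytes that follow */
	uint8_t  data[];	/* C99 flexible array member (formerly data[0]) */
};

/* Allocate a chunk with room for len payload bytes; NULL on failure. */
static struct demo_chunk *
demo_chunk_alloc(const uint8_t *payload, uint16_t len)
{
	/* sizeof(*c) does not include the flexible array itself. */
	struct demo_chunk *c = malloc(sizeof(*c) + len);

	if (c == NULL)
		return (NULL);
	c->length = len;
	memcpy(c->data, payload, len);
	return (c);
}
```

The [] form is standard C99, whereas [0] is a compiler extension, which is
presumably why the change helps portability.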

Obtained from: Bruce Cran
MFC after: 1 week


202526 17-Jan-2010 tuexen

Get rid of a lot of duplicated code for NR-SACK handling.
Generalize the SACK code to also handle NR-SACKs.


202523 17-Jan-2010 rrs

Bug fix: If the allocation of a socket failed and we
freed the inpcb, it was possible to not set the
proper flags on the pcb (i.e. the socket is not there).
This is HIGHLY unlikely since no one else should be
able to find the socket.. but for consistency we
do the proper loop thing to make sure that we
mark the socket as gone on the PCB.


202521 17-Jan-2010 rrs

Pulls out another leaked windows ifdef that somehow
made its way through the scrubber.


202520 17-Jan-2010 rrs

This change syncs up the socketAPI stream-reset
values to match those in linux and the I-D
just released to the IETF.


202518 17-Jan-2010 rrs

More leaked ifdefs for APPLE and its mobility stuff.


202517 17-Jan-2010 rrs

Remove another set of "leaked" ifdefs that somehow found
their way into FreeBSD.


202516 17-Jan-2010 rrs

Remove strange APPLE define that leaked
through the scrubber scripts. Scripts are
now fixed so this won't happen again.


202469 17-Jan-2010 bz

Garbage collect references to the no longer implemented tcp_fasttimo().

Discussed with: rwatson
MFC after: 5 days


202468 17-Jan-2010 bz

Add ip4.saddrsel/ip4.nosaddrsel (and equivalent for ip6) to control
whether to use source address selection (default) or the primary
jail address for unbound outgoing connections.

This is intended to be used by people upgrading from single-IP
jails to multi-IP jails but not having to change firewall rules,
application ACLs, ... but to force their connections (unless
otherwise changed) to the primary jail IP they had been using for
years, as well as for people preferring to implement similar policies.

Note that for IPv6, if configured incorrectly, this might lead to
scope violations, which single-IPv6 jails could as well, as by the
design of jails. [1]

Reviewed by: jamie, hrs (ipv6 part)
Pointed out by: hrs [1]
MFC After: 2 weeks
Asked for by: Jase Thew (bazerka beardz.net)


202459 17-Jan-2010 ume

Change 'me' to match any IPv6 address configured on an interface in
the system as well as any IPv4 address.

Reviewed by: David Horn <dhorn2000__at__gmail.com>, luigi, qingli
MFC after: 2 weeks


202449 16-Jan-2010 tuexen

Get rid of support of an old version of the SCTP-AUTH draft.
Get rid of unused MD5 code.

MFC after: 1 week


201811 08-Jan-2010 qingli

Ensure an address is removed from the interface address
list when the installation of that address fails.

PR: 139559


201801 08-Jan-2010 ru

Complete the swap of carp(4) log levels and document the change.

MFC after: 3 days


201758 07-Jan-2010 mbr

Remove extraneous semicolons, no functional changes.

Submitted by: Marc Balmer <marc@msys.ch>
MFC after: 1 week


201745 07-Jan-2010 luigi

we don't use dummynet_drain!


201740 07-Jan-2010 luigi

check that we have an ipv4 packet before swapping ip_len and ip_off.
This should fix the handling of ipv6 packets which i broke when i
made ipfw operate on packets in network format.
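
A rough userland sketch of the idea, assuming a raw packet buffer:
demo_fix_ipv4_fields() is a hypothetical helper, not the ipfw code. It looks
at the version nibble first and only then touches ip_len (offset 2) and
ip_off (offset 6), so IPv6 packets are left alone.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Convert ip_len and ip_off from network to host order, but only when
 * the buffer really holds an IPv4 header. Returns 1 if converted, else 0.
 */
static int
demo_fix_ipv4_fields(uint8_t *pkt, size_t len)
{
	uint16_t v;

	/* The version lives in the high nibble of the first byte. */
	if (len < 20 || (pkt[0] >> 4) != 4)
		return (0);

	memcpy(&v, pkt + 2, sizeof(v));		/* ip_len */
	v = ntohs(v);
	memcpy(pkt + 2, &v, sizeof(v));

	memcpy(&v, pkt + 6, sizeof(v));		/* ip_off */
	v = ntohs(v);
	memcpy(pkt + 6, &v, sizeof(v));
	return (1);
}
```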

Reported by: Hajimu UMEMOTO


201735 07-Jan-2010 luigi

Following up on a request from Ermal Luci to make
ip_divert work as a client of pf(4),
make ip_divert not depend on ipfw.

This is achieved by moving to ip_var.h the struct ipfw_rule_ref
(which is part of the mtag for all reinjected packets) and other
declarations of global variables, and moving to raw_ip.c global
variables for filter and divert hooks.

Note that names and locations could be made more generic
(ipfw_rule_ref is really a generic reference robust to reconfigurations;
the packet filter is not necessarily ipfw; filters and their clients
are not necessarily limited to ipv4), but _right now_ most
of this stuff works on ipfw and ipv4, so i don't feel like
doing a gratuitous renaming, at least for the time being.


201732 07-Jan-2010 luigi

some header shuffling to help decoupling ip_divert from ipfw


201722 07-Jan-2010 luigi

put ip_len in correct order for ip_output().
This prevents a panic when ipfw generates packets on its own
(such as reject or keepalives for dynamic rules).

Reported by: Chagin Dmitry


201568 05-Jan-2010 luigi

this file does not require ip_dummynet.h


201544 05-Jan-2010 qingli

An existing incomplete ARP entry would expire a subsequent
statically configured entry of the same host. This bug was
due to the expiration timer not being cancelled when installing
the static entry. Since there exists a potential race condition
with respect to timer cancellation, simply check for the
LLE_STATIC bit inside the expiration function instead of
cancelling the active timer.
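
The approach can be sketched roughly as follows. This is an illustration
with hypothetical names (struct demo_lle, DEMO_LLE_STATIC, demo_arp_timer()),
not the in-tree arptimer code: the callback itself declines to expire an
entry that became static after the timer was armed.

```c
#include <stdbool.h>

#define DEMO_LLE_STATIC	0x0002	/* hypothetical flag value */

/* Hypothetical, heavily reduced link-layer entry. */
struct demo_lle {
	int  flags;
	bool in_table;
};

/* Timer callback: returns true if the entry was expired. */
static bool
demo_arp_timer(struct demo_lle *lle)
{
	/* A static entry may have replaced the incomplete one; keep it. */
	if (lle->flags & DEMO_LLE_STATIC)
		return (false);

	lle->in_table = false;	/* stand-in for unlinking and freeing */
	return (true);
}
```

Checking the flag in the callback avoids having to reason about whether a
cancellation request won or lost the race with an already running timer.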

MFC after: 5 days


201527 04-Jan-2010 luigi

Various cleanup done in ipfw3-head branch including:
- use a uniform mtag format for all packets that exit and re-enter
the firewall in the middle of a rulechain. On reentry, all tags
containing reinject info are renamed to MTAG_IPFW_RULE so the
processing is simpler.

- make ipfw and dummynet use ip_len and ip_off in network format
everywhere. Conversion is done only once instead of tracking
the format in every place.

- use a macro FREE_PKT to dispose of mbufs. This eases portability.

In passing i also fixed a few typos, staticised or localised variables,
removed useless declarations and other minor things.

Overall the code shrinks a bit and is hopefully more readable.

I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr.
For ng_ipfw i am actually waiting for feedback from glebius@ because
we might have some small changes to make.
For if_bridge and if_ethersubr feedback would be welcome
(there are still some redundant parts in these two modules that
I would like to remove, but first i need to check functionality).


201523 04-Jan-2010 tuexen

Correct usage of parentheses.

PR: kern/142066
Approved by: rrs (mentor)
Obtained from: Henning Petersen, Bruce Cran.
MFC after: 2 weeks


201416 03-Jan-2010 np

Avoid NULL dereference in arpresolve.


201285 30-Dec-2009 qingli

Consolidate the route message generation code for when address
aliases were added or deleted. The announced route entry for
an address alias is no longer empty because this empty route
entry was causing some route daemon to fail and exit abnormally.

MFC after: 5 days


201282 30-Dec-2009 qingli

Proxy ARP entries could not be added into the system over
IFF_POINTOPOINT link types. The reason was that the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations because multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of the loopback route to the local end is
either incremented or decremented. The first instantiation
creates the entry and the last removal deletes the route entry.

MFC after: 5 days


201254 30-Dec-2009 syrinx

Make sure the multicast forwarding cache entry's stall queue is properly
initialized before trying to insert an entry into it.

PR: kern/142052
Reviewed by: bms
MFC after: now


201150 29-Dec-2009 luigi

we really need htonl() here, see the comment a few lines above in the code.


201145 28-Dec-2009 antoine

(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month


201141 28-Dec-2009 bz

Make the compiler happy after r201125:
- + remove two unnecessary initializations in ip_output;
+ + remove one unnecessary initializations in ip_output;


201131 28-Dec-2009 luigi

introduce a local variable rte acting as a cache of ro->ro_rt
within ip_output, achieving (in random order of importance):
- a reduction of the number of 'r's in the source code;
- improved legibility;
- a reduction of 64 bytes in the .text


201125 28-Dec-2009 luigi

+ remove an unused #define print_ip;
+ remove two unnecessary initializations in ip_output;
+ localize 'len';
+ introduce a temporary variable n to count the number of fragments,
the compiler seems unable to identify a common subexpression
(written 3 times, used twice);
+ document some assumptions on ip_len and ip_hl


201124 28-Dec-2009 luigi

bring the NGM_IPFW_COOKIE back into ng_ipfw.h, libnetgraph expects
to find it there. Unfortunately this reintroduces the dependency
on ip_fw_pfil.c


201122 28-Dec-2009 luigi

bring in several cleanups tested in ipfw3-head branch, namely:

r201011
- move most of ng_ipfw.h into ip_fw_private.h, as this code is
ipfw-specific. This removes a dependency on ng_ipfw.h from some files.

- move many equivalent definitions of direction (IN, OUT) for
reinjected packets into ip_fw_private.h

- document the structure of the packet tags used for dummynet
and netgraph;

r201049
- merge some common code to attach/detach hooks into
a single function.

r201055
- remove some duplicated code in ip_fw_pfil. The input
and output processing uses almost exactly the same code so
there is no need to use two separate hooks.
ip_fw_pfil.o goes from 2096 to 1382 bytes of .text

r201057 (see the svn log for full details)
- macros to make the conversion of ip_len and ip_off
between host and network format more explicit

r201113 (the remaining parts)
- readability fixes -- put braces around some large for() blocks,
localize variables so the compiler does not think they are uninitialized,
do not insist on precise allocation size if we have more than we need.

r201119
- when doing a lookup, keys must be in big endian format because
this is what the radix code expects (this fixes a bug in the
recently-introduced 'lookup' option)

No ABI changes in this commit.
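
To illustrate the r201119 point about lookup keys, here is a small sketch,
assuming a 32-bit key such as a port or uid (demo_make_key() is a
hypothetical helper, not the ipfw code). Storing the key in big endian byte
order makes a byte-wise, most-significant-first comparison agree with the
numeric order of the values.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Store a host-order value as the big endian bytes used for the lookup. */
static void
demo_make_key(uint32_t host_value, uint8_t key[4])
{
	uint32_t be = htonl(host_value);

	memcpy(key, &be, sizeof(be));
}
```

For example, the keys for 80 and 256 become 00 00 00 50 and 00 00 01 00, so
a memcmp() of the two orders them the same way as the integers.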

MFC after: 1 week


201121 28-Dec-2009 luigi

readability fixes -- add braces on large blocks, remove unnecessary
initializations


201120 28-Dec-2009 luigi

explain details of operation of table lookups, and improve portability


201046 27-Dec-2009 luigi

diverted packet must re-enter _after_ the matching rule,
or we create loops.
The divert cookie (that can be set from userland too)
contains the matching rule nr, so we must start from nr+1.
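
A sketch of the re-entry rule, assuming the ruleset is available as an
array of rule numbers in ascending order (both helpers are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first rule numbered >= start, or n if none. */
static size_t
demo_first_rule_at(const uint16_t *rulenums, size_t n, uint32_t start)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (rulenums[i] >= start)
			break;
	return (i);
}

/* The divert cookie holds the matching rule number; resume at nr + 1. */
static size_t
demo_reenter_after_divert(const uint16_t *rulenums, size_t n, uint16_t cookie)
{
	/* Widen before adding so that a cookie of 65535 does not wrap. */
	return (demo_first_rule_at(rulenums, n, (uint32_t)cookie + 1));
}
```

Restarting at nr itself would hit the same divert rule again, which is the
loop the commit describes.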

Reported by: Joe Marcus Clarke


200951 24-Dec-2009 luigi

fix poor indentation resulting from a merge


200909 23-Dec-2009 luigi

mostly style changes, such as removal of trailing whitespace,
reformatting to avoid unnecessary line breaks, small block
restructuring to avoid unnecessary nesting, replace macros
with function calls, etc.

As a side effect of code restructuring, this commit fixes one bug:
previously, if a realloc() failed, memory was leaked. Now, the
realloc is not there anymore, as we first count how much memory
we need and then do a single malloc.


200897 23-Dec-2009 luigi

fix build with the new fast lookup structure.
Also remove some unnecessary headers


200896 23-Dec-2009 luigi

fix build on 64-bit architectures.
Also fix the indentation on a few lines.


200855 22-Dec-2009 luigi

merge code from ipfw3-head to reduce contention on the ipfw lock
and remove all O(N) sequences from kernel critical sections in ipfw.

In detail:

1. introduce a IPFW_UH_LOCK to arbitrate requests from
the upper half of the kernel. Some things, such as 'ipfw show',
can be done holding this lock in read mode, whereas insert and
delete require IPFW_UH_WLOCK.

2. introduce a mapping structure to keep rules together. This replaces
the 'next' chain currently used in ipfw rules. At the moment
the map is a simple array (sorted by rule number and then rule_id),
so we can find a rule quickly instead of having to scan the list.
This reduces many expensive lookups from O(N) to O(log N).

3. when an expensive operation (such as insert or delete) is done
by userland, we grab IPFW_UH_WLOCK, create a new copy of the map
without blocking the bottom half of the kernel, then acquire
IPFW_WLOCK and quickly update pointers to the map and related info.
After dropping IPFW_LOCK we can then continue the cleanup protected
by IPFW_UH_LOCK. So userland still costs O(N) but the kernel side
is only blocked for O(1).

4. do not pass pointers to rules through dummynet, netgraph, divert etc,
but rather pass a <slot, chain_id, rulenum, rule_id> tuple.
We validate the slot index (in the array of #2) with chain_id,
and if successful do a O(1) dereference; otherwise, we can find
the rule in O(log N) through <rulenum, rule_id>

All the above does not change the userland/kernel ABI, though there
are some disgusting casts between pointers and uint32_t

Operation costs now are as follows:

Function                                Old     Now             Planned
-------------------------------------------------------------------
+ skipto X, non cached                  O(N)    O(log N)
+ skipto X, cached                      O(1)    O(1)
XXX dynamic rule lookup                 O(1)    O(log N)        O(1)
+ skipto tablearg                       O(N)    O(1)
+ reinject, non cached                  O(N)    O(log N)
+ reinject, cached                      O(1)    O(1)
+ kernel blocked during setsockopt()    O(N)    O(1)
-------------------------------------------------------------------

The only (very small) regression is on dynamic rule lookup and this will
be fixed in a day or two, without changing the userland/kernel ABI

Supported by: Valeria Paoli
MFC after: 1 month


200847 22-Dec-2009 jhb

- Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove
the leading underscores since they are now implemented.
- Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info
structure.

Reviewed by: rwatson
MFC after: 2 weeks


200838 22-Dec-2009 luigi

some mostly cosmetic changes in preparation for upcoming work:

+ in many places, replace &V_layer3_chain with a local
variable chain;
+ bring the counter of rules and static_len within ip_fw_chain
replacing static variables;
+ remove some spurious comments and extern declaration;
+ document which lock protects certain data structures


200673 18-Dec-2009 ru

Added proper attribution.

Requested by: luigi


200654 17-Dec-2009 luigi

Add some experimental code to log traffic with tcpdump,
similar to pflog(4).
To use the feature, just put the 'log' options on rules
you are interested in, e.g.

ipfw add 5000 count log ....

and run
tcpdump -ni ipfw0 ...

net.inet.ip.fw.verbose=0 enables logging to ipfw0,
net.inet.ip.fw.verbose=1 sends logging to syslog as before.

More features can be added, similar to pflog(), to store in
the MAC header metadata such as rule numbers and actions.
Manpage to come once features are settled.


200634 17-Dec-2009 luigi

simplify and document lookup_next_rule()


200629 17-Dec-2009 luigi

simplify the code that finds the next rule after reinjections

MFC after: 1 week


200610 16-Dec-2009 luigi

remove a duplicate sysctl entry


200603 16-Dec-2009 luigi

bring back a couple of #include that are supplied by nesting,
and explain why they are used.


200601 16-Dec-2009 luigi

Various cosmetic cleanup of the files:
- move global variables around to reduce the scope and make them
static if possible;
- add an ipfw_ prefix to all public functions to prevent conflicts
(the same should be done for variables);
- try to pack variable declaration in a uniform way across files;
- clarify some comments;
- remove some misspelling of names (#define V_foo VNET(bar)) that
slipped in due to cut&paste
- remove duplicate static variables in different files;

MFC after: 1 month


200598 16-Dec-2009 imp

Quick fix to make this compile:
Remove redundant extern declarations.
If the maintainer has a better fix, then feel free to back this out.


200590 15-Dec-2009 luigi

more splitting of ip_fw2.c, now extract the 'table' routines
and the sockopt routines (the upper half of the kernel).

Whoever is the author of the 'table' code (Ruslan/glebius/oleg ?)
please change the attribution in ip_fw_table.c. I have copied
the copyright line from ip_fw2.c but it carries my name and I have
neither written nor designed the feature so I don't deserve
the credit.

MFC after: 1 month


200580 15-Dec-2009 luigi

Start splitting ip_fw2.c and ip_fw.h into smaller components.
At this time we pull out from ip_fw2.c the logging functions, and
support for dynamic rules, and move kernel-only stuff into
netinet/ipfw/ip_fw_private.h

No ABI change involved in this commit, unless I made some mistake.
ip_fw.h has changed, though not in the userland-visible part.

Files touched by this commit:

conf/files
now references the two new source files

netinet/ip_fw.h
remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h.

netinet/ipfw/ip_fw_private.h
new file with kernel-specific ipfw definitions

netinet/ipfw/ip_fw_log.c
ipfw_log and related functions

netinet/ipfw/ip_fw_dynamic.c
code related to dynamic rules

netinet/ipfw/ip_fw2.c
removed the pieces that go in the new files

netinet/ipfw/ip_fw_nat.c
minor rearrangement to remove LOOKUP_NAT from the
main headers. This requires a new function pointer.

A bunch of other kernel files that included netinet/ip_fw.h now
require netinet/ipfw/ip_fw_private.h as well.
Not 100% sure i caught all of them.

MFC after: 1 month


200567 15-Dec-2009 luigi

implement a new match option,

lookup {dst-ip|src-ip|dst-port|src-port|uid|jail} N

which searches the specified field in table N and sets tablearg
accordingly.
With dst-ip or src-ip the option replicates two existing options.
When used with other arguments, the option can be useful to
quickly dispatch traffic based on other fields.

Work supported by the Onelab project.

MFC after: 1 week


200473 13-Dec-2009 bz

Throughout the network stack we have a few places of
if (jailed(cred))
left. If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that, as it is used
outside the network stack as well.
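
The distinction can be sketched as below. The struct and both functions
are simplified stand-ins with hypothetical names, not the kernel's ucred
or prison code: the new check treats a jail that owns its own virtual
network stack like an unjailed caller.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of a credential. */
struct demo_cred {
	bool in_jail;		/* credential belongs to a jail */
	bool jail_owns_vnet;	/* that jail has its own network stack */
};

/* Classic check: true for any jailed credential. */
static bool
demo_jailed(const struct demo_cred *cred)
{
	return (cred->in_jail);
}

/* Network stack check: only jails sharing the host stack are restricted. */
static bool
demo_jailed_without_vnet(const struct demo_cred *cred)
{
	return (cred->in_jail && !cred->jail_owns_vnet);
}
```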

Discussed with: julian, zec, jamie, rwatson (back in Sept)
MFC after: 5 days


200361 10-Dec-2009 luigi

use div64 when converting back the burst value for userland


200360 10-Dec-2009 luigi

when draining a flowset free the entire chain, not just one packet.


200358 10-Dec-2009 luigi

centralize the code to free a packet (or a chain) while in dummynet.
Remove an old macro and its stale comment.


200170 05-Dec-2009 oleg

Fix burst processing for WF2Q pipes - do not increase available burst size
unless pipe is idle. This should fix following issues:
- 'dummynet: OUCH! pipe should have been idle!' log messages.
- exceeding configured pipe bandwidth.

MFC after: 1 week


200118 05-Dec-2009 luigi

adjust comment in previous commit after Julian's explanation


200116 05-Dec-2009 luigi

remove a dead block of code, document how the ipfw clients are
hooked and the difference in handling the 'enable' variable
for layer2 and layer3. The latter needs fixing once i figure out
how it worked pre-vnet.

MFC after: 7 days


200113 05-Dec-2009 luigi

fix build with VNET enabled

Reported by: David Wolfskill


200102 04-Dec-2009 ume

Use INET_ADDRSTRLEN and INET6_ADDRSTRLEN rather than hard
coded number.

Spotted by: bz


200059 03-Dec-2009 luigi

preparation work to replace the monster switch in ipfw_chk() with
table of functions.

This commit (which is heavily based on work done by Marta Carbone
in this year's GSOC project), removes the goto's and explicit
return from the inner switch(), so we will have an easier time when
putting the blocks into individual functions.

MFC after: 3 weeks


200055 03-Dec-2009 ume

Teach the debug prints about IPv6.


200040 02-Dec-2009 luigi

- initialize src_ip in the main loop to prevent a compiler warning
(gcc 4.x under linux, not sure how real is the complaint).
- rename a macro argument to prevent name clashes.
- add the macro name on a couple of #endif
- add a blank line for readability.

MFC after: 3 days


200034 02-Dec-2009 luigi

Dispatch sockopt calls to ipfw and dummynet
using the new option numbers, IP_FW3 and IP_DUMMYNET3.
Right now the modules return an error if called with those arguments
so there is no danger of unwanted behaviour.

MFC after: 3 days


200029 02-Dec-2009 luigi

small changes for portability and diff reduction wrt/ FreeBSD 7.
No functional differences.

- use the div64() macro to wrap 64 bit divisions
(which almost always are 64 / 32 bits) so they are easier
to handle with compilers or OS that do not have native
support for 64bit divisions;

- use a local variable for p_numbytes even if not strictly
necessary on HEAD, as it reduces diffs with FreeBSD7

- in dummynet_send() check that a tag is present before
dereferencing the pointer.

- add a couple of blank lines for readability near the end of a function
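
As an illustration of the first item, a wrapper of this kind might look
like the sketch below. demo_div64() and demo_tx_ticks() are hypothetical
names and the formula is only an example, not the dummynet code; the point
is that every 64-bit division goes through one macro that a port to a
platform without native 64-bit division could redefine.

```c
#include <stdint.h>

/* Single place to swap in a helper routine on platforms that need one. */
#define demo_div64(a, b)	((int64_t)(a) / (int64_t)(b))

/*
 * Ticks needed to send len_bytes at bw_bps bits per second with hz ticks
 * per second, rounded up. The caller must pass a non-zero bandwidth.
 */
static int64_t
demo_tx_ticks(uint32_t len_bytes, uint32_t bw_bps, uint32_t hz)
{
	int64_t bits_hz = (int64_t)len_bytes * 8 * hz;

	return (demo_div64(bits_hz + bw_bps - 1, bw_bps));
}
```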

MFC after: 3 days


200027 02-Dec-2009 ume

Teach send_pkt() and ipfw_tick() about IPv6.
This fixes the issue where keep-alives do not work for IPv6.

PR: kern/117234
Submitted by: mlaier, Joost Bekkers <joost__at__jodocus.org>
MFC after: 1 month


200026 02-Dec-2009 glebius

Until this moment carp(4) used a strange logging priority. It used debug
priority for such important information as MASTER/BACKUP state change,
and used a normal logging priority for such innocent messages as receiving
short packet (which is a normal VRRP packet between some other routers) or
receiving a CARP packet on non-carp interface (someone else running CARP).

This commit shifts message logging priorities to a more sane default.


200023 02-Dec-2009 luigi

Add new sockopt names for ipfw and dummynet.

This commit is just grabbing entries for the new names
that will be used in the future, so you don't need to
rebuild anything now.

MFC after: 3 days


200020 02-Dec-2009 luigi

change the type of the opcode from enum *:8 to u_int8_t
so the size and alignment of the ipfw_insn is not compiler dependent.
No changes in the code generated by gcc.

There was only one instance of this kind in our entire source tree,
so i suspect the old definition was a poor choice (which i made).
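
A sketch of the resulting layout, using hypothetical names rather than the
real ipfw_insn definition. With a fixed-width integer for the opcode, the
size of the structure no longer depends on how a compiler treats an enum
bitfield:

```c
#include <stdint.h>

/* Opcode values still come from an enum ... */
enum demo_opcode {
	DEMO_O_NOP,
	DEMO_O_ACCEPT,
	DEMO_O_DENY
};

/* ... but are stored in a plain 8-bit field (was: enum demo_opcode :8). */
struct demo_insn {
	uint8_t  opcode;	/* holds an enum demo_opcode value */
	uint8_t  len;
	uint16_t arg1;
};

/* Catch any layout surprise at compile time (C11). */
_Static_assert(sizeof(struct demo_insn) == 4,
    "demo_insn must stay one 32-bit word");
```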

MFC after: 3 days


199866 27-Nov-2009 tuexen

Use the default stack size for the iterator thread.
This fixes a crash reported by Irene Ruengeler.

Approved by: rrs (mentor)
MFC after: 1 month


199525 19-Nov-2009 bms

Correct a comment.

MFC after: 1 day


199477 18-Nov-2009 tuexen

Fix a bug where the system panics when a SHUTDOWN is received with an
illegal TSN.

Approved by: rrs (mentor)
MFC after: ASAP


199459 17-Nov-2009 tuexen

Get rid of the unused field addr_over, which is never really used,
only copied around.

Approved by: rrs (mentor)


199437 17-Nov-2009 tuexen

Always use LIST_EMPTY instead of sometimes SCTP_LIST_EMPTY,
which is defined as LIST_EMPTY.

Approved by: rrs (mentor)
MFC after: 1 month


199374 17-Nov-2009 tuexen

Fix a bug where queued ASCONF messages are not sent out.

Approved by: rrs (mentor)
Obtained from: Irene Ruengeler
MFC after: 1 month


199373 17-Nov-2009 tuexen

Fix a memory leak when destroying an SCTP stack.
Clean up sctp_pcb_finish().
Approved by: rrs (mentor)
MFC after: 1 month


199372 17-Nov-2009 tuexen

Do not start the iterator when there are no associations.
This fixes a bug found by Irene Ruengeler.

Approved by: rrs (mentor)
MFC after: 1 month


199371 17-Nov-2009 tuexen

Disable (temporarily) the thread-based iterator. It does not work with vnet.

Approved by: rrs (mentor)


199370 17-Nov-2009 tuexen

Allow the UMA to free data. This resolves the UMA related bug reported
by Julian.

Approved by: rrs (mentor)
MFC after: 1 month


199369 17-Nov-2009 tuexen

Do not hold the lock longer than necessary.

Approved by: rrs (mentor)
MFC after: 1 month


199287 15-Nov-2009 bms

Fix a functional regression in multicast.

Userland daemons need to see IGMP traffic regardless of the group;
omit the imo filter check if the proto is IGMP. The kernel part
of IGMP will have already filtered appropriately at this point.

MFC after: ASAP
Submitted by: Franz Struwig
Reported by: Ivor Prebeg, Franz Struwig


199208 12-Nov-2009 attilio

Move inet_aton() (the counterpart of inet_ntoa(), already present in libkern)
into libkern in order to make it usable by modules other than alias_proxy.

Obtained from: Sandvine Incorporated
Sponsored by: Sandvine Incorporated
MFC: 1 week


199102 09-Nov-2009 trasz

Remove ifdefed out part of code, which seems to have originated a decade ago
in OpenBSD. As it is now, there is no way for this to be useful, since IPsec
is free to forward packets via whatever interface it wants, so checking
capabilities of the interface passed from ip_output (fetched from the routing
table) serves no purpose.

Discussed with: sam@


199073 09-Nov-2009 oleg

style(9): add missing parentheses


198990 06-Nov-2009 jhb

Several years ago a feature was added to TCP that caused soreceive() to
send an ACK right away if data was drained from a TCP socket that had
previously advertised a zero-sized window. The current code requires the
receive window to be exactly zero for this to kick in. If window scaling is
enabled and the window is smaller than the scale, then the effective window
that is advertised is zero. However, in that case the zero-sized window
handling is not enabled because the window is not exactly zero. The fix
changes the code to check the raw window value against zero.
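
The arithmetic behind this can be sketched as below, with hypothetical
helper names and no claim to match the tcp_output() code: the value placed
in the header is the receive window in bytes shifted right by the scale
factor, so any window smaller than 1 << scale is seen by the peer as zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* 16-bit window value that goes on the wire once scaling is applied. */
static uint16_t
demo_advertised_window(uint32_t window_bytes, uint8_t rcv_scale)
{
	uint32_t w = window_bytes >> rcv_scale;

	return (w > 65535 ? 65535 : (uint16_t)w);
}

/* True when the peer sees a closed window, even if a few bytes are free. */
static bool
demo_peer_sees_zero_window(uint32_t window_bytes, uint8_t rcv_scale)
{
	return (demo_advertised_window(window_bytes, rcv_scale) == 0);
}
```

With a scale of 7, a window of 100 bytes is advertised as 0 while 128
bytes is advertised as 1, which is why comparing the unscaled byte count
against zero misses the first case.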

Reviewed by: bz
MFC after: 1 week


198845 03-Nov-2009 oleg

Fix two issues that can lead to exceeding configured pipe bandwidth:
- do not expire queues which are not ready to be expired.
- properly calculate available burst size.

MFC after: 3 days


198621 29-Oct-2009 tuexen

Improve round robin stream scheduler and cleanup some code.

Approved by: rrs (mentor)
MFC after: 3 days


198539 28-Oct-2009 brueffer

Close a stream file descriptor leak.

PR: 138130
Submitted by: Patroklos Argyroudis <argp@census-labs.com>
MFC after: 1 week


198522 27-Oct-2009 tuexen

Bugfix: Use formula from section 7.2.3 of RFC 4960. Reported by Martin Becke.

Approved by: rrs (mentor)
MFC after: 3 days


198499 26-Oct-2009 tuexen

Improve the round robin stream scheduler.

Approved by: rrs (mentor)
MFC after: 3 days


198438 24-Oct-2009 rwatson

Correct spelling typo in ip_input comment.

Pointed out by: N.J. Mann <njm at njm.me.uk>,
John Nielsen <john at jnielsen.net>, julian (!), lstewart
MFC after: 2 days


198418 23-Oct-2009 qingli

Use the correct option name in the preprocessor command to enable
or disable diagnostic messages.

Reviewed by: ru
MFC after: 3 days


198393 23-Oct-2009 rwatson

Improve grammar in ip_input comment while attempting to maintain what
might be its meaning.

MFC after: 3 days


198301 20-Oct-2009 qingli

In the ARP callout timer expiration function, the current time_second
is compared against the entry expiration time value (that was set based
on time_second) to check if the current time is larger than the set
expiration time. Due to the +/- timer granularity value, the comparison
returns false, causing the alternative code to be executed. The
alternative code path freed the memory without removing that entry
from the table list, causing a use-after-free bug.

Reviewed by: discussed with kmacy
MFC after: immediately
Verified by: rnoland, yongari


198196 18-Oct-2009 rwatson

Rewrap ip_input() comment so that it prints more nicely.

MFC after: 3 days


198111 15-Oct-2009 qingli

This patch fixes the following issues in the ARP operation:

1. There is a regression issue in the ARP code. The incomplete
ARP entry was timing out too quickly (1 second timeout), as
such, a new entry is created each time arpresolve() is called.
Therefore the maximum attempts made is always 1. Consequently
the error code returned to the application is always 0.
2. Set the expiration of each incomplete entry to a 20-second
lifetime.
3. Return "incomplete" entries to the application.

Reviewed by: kmacy
MFC after: 3 days


198050 13-Oct-2009 bz

Compare pointer to NULL rather than 0.

MFC after: 1 month


197955 11-Oct-2009 tuexen

Fix a race condition where a mutex was destroyed while sleeping on it.
Found while analyzing a report from julian. It might fix his bug.
Approved by: rrs (mentor)
MFC after: 3 days


197952 11-Oct-2009 julian

Virtualize the pfil hooks so that different jails may choose different
packet filters. Also allows ipfw to be enabled on one jail and disabled
on another. In 8.0 it's a global setting.

Sitting around in tree waiting to commit for: 2 months
MFC after: 2 months


197929 10-Oct-2009 tuexen

Correct include order as indicated by bz.

Approved by: re (mentor)
MFC after: 3 days


197914 09-Oct-2009 tuexen

Do not include vnet.h twice.

Approved by: rrs (mentor)
MFC after: 3 days


197868 08-Oct-2009 tuexen

Use correct arguments when calling SCTP_RTALLOC().

Approved by: rrs (mentor)
MFC after: 0 days


197856 08-Oct-2009 rrs

Fix so that round robin stream scheduling works as advertised

MFC after: 0 days


197814 06-Oct-2009 rwatson

Remove tcp_input lock statistics; these are intended for debugging only
and are not intended to ship in 8.0 as they dirty additional cache
lines in a performance-critical per-packet path.

MFC after: 3 days


197795 05-Oct-2009 rwatson

In tcp_input(), we acquire a global write lock at first only if a
segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN).
If we later have to upgrade the lock, we acquire an inpcb reference
and drop both global/inpcb locks before reacquiring in-order. In
that gap, the connection may transition into TIMEWAIT, so we need
to loop back and reevaluate the inpcb after relocking.

MFC after: 3 days
Reported by: Kamigishi Rei <spambox at haruhiism.net>
Reviewed by: bz


197696 02-Oct-2009 qingli

Remove a log message from production code. This log message can be
triggered by a misconfigured host that is sending out gratuitous ARPs.
This log message can also be triggered during a network renumbering
event when multiple prefixes co-exist on a single network segment.

MFC after: immediately


197695 02-Oct-2009 qingli

Previously, if an address alias is configured on an interface, and
this address alias has a prefix matching that of another address
configured on the same interface, then the ARP entry for the alias
is not deleted from the ARP table when that address alias is removed.
This patch fixes the aforementioned issue.

PR: kern/139113
MFC after: 3 days


197342 20-Sep-2009 tuexen

Fix handling of sctp_drain().

Approved by: rrs (mentor)
MFC after: 2 months


197341 20-Sep-2009 tuexen

Fix errnos.

Approved by: rrs(mentor)
MFC after: 3 days.


197328 19-Sep-2009 tuexen

Use appropriate locking when using interface list.

Approved by: rrs (mentor)
MFC after: 1 month.


197327 19-Sep-2009 tuexen

Fix the disabling of sctp_drain().

Approved by: rrs (mentor)
MFC after: 1 month.


197326 19-Sep-2009 tuexen

Get SCTP working in combination with VIMAGE.
Contains code from bz.
Approved by: rrs (mentor)
MFC after: 1 month.


197314 18-Sep-2009 bms

Return ENOBUFS consistently if user attempts to exceed
in_mcast_maxsocksrc resource limit.

Submitted by: syrinx
MFC after: 3 days


197288 17-Sep-2009 rrs

Support for VNET in SCTP (hopefully)


197257 16-Sep-2009 tuexen

Fix a bug reported by Daniel Mentz:
When authenticating DATA chunks some DATA chunks
might get stuck when the MTU gets decreased via
an ICMP message.

Approved by: rrs (mentor)
MFC after: immediately


197244 16-Sep-2009 silby

Add the ability to see TCP timers via netstat -x. This can be a useful
feature when you have a seemingly stuck socket and want to figure
out why it has not been closed yet.

No plans to MFC this, as it changes the netstat sysctl ABI.

Reviewed by: andre, rwatson, Eric Van Gyzen


197236 15-Sep-2009 andre

-Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by: brooks

Once compiled in make it easily switchable for testers by using a tuneable
net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by: rwatson

MFC after: 2 days

M sys/conf/options
M sys/kern/uipc_socket.c
M sys/netinet/tcp_subr.c
M sys/netinet/tcp_usrreq.c


197227 15-Sep-2009 qingli

Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by: bz
MFC after: immediately


197225 15-Sep-2009 qingli

This patch enables the node to respond to ARP requests for
configured proxy ARP entries.

Reviewed by: bz
MFC after: immediately


197210 15-Sep-2009 qingli

The bootp code installs an interface address and the nfs client
module tries to install the same address again. This extra code
is removed, which was discovered by the removal of a call to
in_ifscrub() in r196714. This call to in_ifscrub is put back here
because the SIOCAIFADDR command can be used to change the prefix
length of an existing alias.

Reviewed by: kmacy


197203 14-Sep-2009 qingli

Previously the local end of a point-to-point interface was not reachable
within the system that owns the interface. Packets destined to
the local end point leaked to the wire towards the default gateway
if one existed. This behavior is changed as part of the L2/L3
rewrite efforts. The local end point is now reachable within the
system. The inpcb code needs to consider this fact during the
address selection process.

Reviewed by: bz
MFC after: immediately


197173 13-Sep-2009 rrs

Fixes two bugs:
1) A lock issue, if we ever had to try again
we would double lock the INP lock.
2) We were allowing (at wrap) associd 0... which really
we cannot allow since 0 normally means in most socket
API calls that we are wishing to effect something on
the INP not TCB.

MFC after: 1 week


197148 13-Sep-2009 bms

In expire_mfc(), add an assert on the multicast forwarding cache mutex.

PR: 138666


197136 12-Sep-2009 bms

Comment some flawed assumptions in inp_join_group() about
mixing SSM full-state and delta-based APIs.

ENOTIME to fix right now. No functional changes.

MFC after: 5 days


197135 12-Sep-2009 bms

Don't allow joins w/o source on an existing group.
This is almost always pilot error.

We don't need to check for group filter UNDEFINED state at t1,
because we only ever allocate filters with their groups, so we
unconditionally reject such calls with EINVAL.
Trying to change the active filter mode w/o going through IP_MSFILTER
is also disallowed.

Deals with the case described in PR 137164 upfront, cumulative
with the fix in svn rev 197132 which only calls imo_match_source()
if the source address family was not unspecified.

PR: 137164
MFC after: 5 days


197132 12-Sep-2009 bms

Tighten input checking in inp_join_group():
* Don't try to use the source address, when its family is unspecified.
* If we get a join without a source, on an existing inclusive
mode group, this is an error, as it would change the filter mode.

Fix a problem with the handling of in_mfilter for new memberships:
* Do not rely on imf being NULL; it is explicitly initialized to a
non-NULL pointer when constructing a membership.
* Explicitly initialize *imf to EX mode when the source address
is unspecified.

This fixes a problem with in_mfilter slot recycling in the join path.

PR: 138690
Submitted by: Stef Walter
MFC after: 5 days


197130 12-Sep-2009 bms

Fix an obvious logic error in the IPv4 multicast leave processing,
where the filter mode vector was not updated correctly after the leave.

PR: 138691
Submitted by: Stef Walter
MFC after: 5 days


197129 12-Sep-2009 bms

Fix an API issue in leave processing for IPv4 multicast groups.
* Do not assume that the group lookup performed by imo_match_group()
is valid when ifp is NULL in this case.
* Instead, return EADDRNOTAVAIL if the ifp cannot be resolved for the
membership we are being asked to leave.

Caveat user:
* The way IPv4 multicast memberships are implemented in the inpcb layer
at the moment, has the side-effect that struct ip_moptions will
still hold the membership, under the old ifp, until ip_freemoptions()
is called for the parent inpcb.
* The underlying issue is: the inpcb layer does not get notification
of ifp being detached going away in a thread-safe manner.
This is non-trivial to fix.

But hey, at least the kernel shouldn't panic when you unplug a card.

PR: 138689
Submitted by: Stef Walter
MFC after: 5 days


196995 08-Sep-2009 np

Add arp_update_event. This replaces route_arp_update_event, which
has not worked since the arp-v2 rewrite.

The event handler will be called with the llentry write-locked and
can examine la_flags to determine whether the entry is being added
or removed.

Reviewed by: gnn, kmacy
Approved by: gnn (mentor)
MFC after: 1 month


196967 08-Sep-2009 phk

Move the duplicate definition of struct sockaddr_storage to its own
include file, and include this where the previous duplicate definitions were.

Static program checkers like FlexeLint rightfully take a dim view of
duplicate definitions, even if they currently are identical.


196932 07-Sep-2009 syrinx

When joining a multicast group, the inp_lookup_mcast_ifp call
does a KASSERT that the group address is multicast, so the
check if this is indeed true and eventually return a EINVAL if not,
should be done before calling inp_lookup_mcast_ifp. This fixes a kernel
crash when calling setsockopt (sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,...)
with invalid group address.

Reviewed by: bms
Approved by: bms

MFC after: 3 days
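
The ordering problem described in r196932 can be sketched in user space.
This is an illustration only: the stub below stands in for a lookup that
asserts on its input (as the message says inp_lookup_mcast_ifp does), and
the function names, the counter, and the host-byte-order convention are
assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Host-byte-order test for 224.0.0.0/4, the IPv4 multicast range. */
static int
is_multicast_host_order(uint32_t addr)
{
	return ((addr & 0xf0000000u) == 0xe0000000u);
}

static int lookup_calls;	/* counts how often the stub is reached */

/* Hypothetical stand-in for a lookup that asserts on its argument. */
static void
lookup_ifp_stub(uint32_t group)
{
	assert(is_multicast_host_order(group));
	lookup_calls++;
}

/* Validate first and return EINVAL, so the assertion is never reached. */
static int
join_group_checked(uint32_t group)
{
	if (!is_multicast_host_order(group))
		return (EINVAL);
	lookup_ifp_stub(group);
	return (0);
}
```

Calling join_group_checked() with 10.0.0.1 (0x0a000001) returns EINVAL
without touching the stub; without the early check the assertion would
abort the program, which mirrors the reported kernel crash.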


196881 06-Sep-2009 pjd

Correct comment.


196797 03-Sep-2009 gnn

Add ARP statistics to the kernel and netstat.

New counters now exist for:
requests sent
replies sent
requests received
replies received
packets received
total packets dropped due to no ARP entry
entries timed out
Duplicate IPs seen

The new statistics are seen in the netstat command
when it is given the -s command line switch.

MFC after: 2 weeks
In collaboration with: bz


196738 01-Sep-2009 bz

In case an upper layer protocol tries to send a packet but the
L2 code does not have the ethernet address for the destination
within the broadcast domain in the table, we remember the
original mbuf in `la_hold' in arpresolve() and send out a
different packet with an arp request.
In case there will be more upper layer packets to send we will
free an earlier one held in `la_hold' and queue the new one.

Once we get a packet in, with which we can perfect our arp table
entry we send out the original 'on hold' packet, should there
be any.
Rather than continuing to process the packet that we received,
we returned without freeing the packet that came in, which
basically means that we leaked an mbuf for every arp request
we sent.

Rather than freeing the received packet and returning, continue
to process the incoming arp packet as well.
This should (a) improve some setups, also proxy-arp, in case it was an
incoming arp request and (b) resembles the behaviour FreeBSD had
from day 1, which aligns with RFC826 "Packet reception" (merge case).

Rename 'm0' to 'hold' to make the code more understandable as
well as diffable to earlier versions more easily.

Handle the link-layer entry 'la' lock completely in the block
where needed and release it as early as possible, rather than
holding it longer, down to the end of the function.

Found by: pointyhat, ns1
Bug hunting session with: erwin, simon, rwatson
Tested by: simon on cluster machines
Reviewed by: ratson, kmacy, julian
MFC after: 3 days


196714 31-Aug-2009 qingli

This patch fixes the following issues:

- Routing messages are not generated when adding and removing
interface address aliases.
- Loopback route installed for an interface address alias is
not deleted from the routing table when that address alias
is removed from the associated interface.
- Function in_ifscrub() is called extraneously.

Reviewed by: gnn, kmacy, sam
MFC after: 3 days


196610 28-Aug-2009 tuexen

Fix a bug where vlan interfaces are not supported by SCTP.

Approved by: rrs (mentor)
MFC after: 3 days


196608 28-Aug-2009 qingli

Do not try to free the rt_lle entry of the cached route in
ip_output() if the cached route was not initialized from the
flow-table. The rt_lle entry is invalid unless it has been
initialized through the flow-table.

Reviewed by: kmacy, rwatson
MFC after: immediately


196535 25-Aug-2009 rwatson

Use locks specific to the lltable code, rather than borrow the ifnet
list/index locks, to protect link layer address tables. This avoids
lock order issues during interface teardown, but maintains the bug that
sysctl copy routines may be called while a non-sleepable lock is held.

Reviewed by: bz, kmacy
MFC after: 3 days


196509 24-Aug-2009 tuexen

This fixes a bug where the value set by SCTP_PARTIAL_DELIVERY_POINT
was not honored, if the socket buffer size was not 4 times that large.

Approved by: rrs (mentor)
MFC after: 3 days.


196507 24-Aug-2009 rrs

This fixes two bugs in the NR-Sack code:
1) When calculating the table offset for sliding the sack
array, the two byte values must be "ored" together in order
for us to do the correct sliding of the arrays.
2) We were NOT properly doing CC and other changes to things only
NR-Sacked. The solution here is to make a separate function that
will actually do both CC/updates and free things if its NR sack'd.
This actually shrinks out common code from three places (much better).

MFC after: 3 days


196502 24-Aug-2009 zec

Introduce a div_destroy() function which takes over per-vnet cleanup tasks
from the existing modevent / MOD_UNLOAD handler, and register div_destroy()
in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds. In
nooptions VIMAGE builds, div_destroy() will be invoked from the modevent
handler, resulting in effectively identical operation as it was prior this
change. div_destroy() also tears down hashtables used by ipdivert, which
were previously left behind on ipdivert kldunloads.

For options VIMAGE builds only, temporarily disable kldunloading of ipdivert,
because without introducing additional locking logic it is impossible to
atomically check whether all ipdivert instances in all vnets are idle, and
proceed with cleanup without opening a race window for a vnet to open an
ipdivert socket while ipdivert tear-down is in progress.

While here, staticize div_init(), because it is not used outside of
ip_divert.c.

In cooperation with: julian
Approved by: re (rwatson), julian (mentor)
MFC after: 3 days


196481 23-Aug-2009 rwatson

Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stabilize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by: bz, julian
MFC after: 3 days


196453 23-Aug-2009 julian

Fix another typo right next to the previous one, that amazingly, I did
not see before.

MFC after: 1 week


196451 23-Aug-2009 julian

Fix typo in comment that has been bugging me for days.

MFC after: 1 week


196423 21-Aug-2009 julian

Fix ipfw's initialization functions to get the correct order of evaluation
to allow vnet and non vnet operation. Move some functions from ip_fw_pfil.c
to ip_fw2.c and move to mostly using the SYSINIT and VNET_SYSINIT handlers
instead of the modevent handler. Correct some spelling errors in comments
in the affected code. Note this bug fixes a crash in NON VIMAGE kernels when
ipfw is unloaded.

This patch is a minimal patch for 8.0
I have a much larger patch that actually fixes the underlying problems
that will be applied after 8.0

Reviewed by: zec@, rwatson@, bz@(earlier version)
Approved by: re (rwatson)
MFC after: Immediately


196410 20-Aug-2009 peter

Fix signed comparison bug when ticks goes negative after 24 days of
uptime. This causes the tcp time_wait state code to fail to expire
sockets in timewait state.

Approved by: re (kensmith)
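
The class of bug fixed in r196410 can be shown with a short sketch. It
assumes a 32-bit signed tick counter and a tick rate of 1000 per second,
under which the counter passes INT_MAX after 2^31 / 1000 seconds, roughly
24.9 days. The function names are hypothetical and the code is not the
kernel's; it only contrasts a direct signed comparison with a comparison
of the signed difference.

```c
#include <limits.h>
#include <stdbool.h>

/* Direct comparison: gives the wrong answer once the counter wraps. */
static bool
expired_naive(int now, int deadline)
{
	return (now >= deadline);
}

/*
 * Wrap-safe comparison: subtract in unsigned arithmetic (well defined on
 * overflow) and interpret the difference as signed. The conversion back
 * to int assumes the usual two's complement behaviour.
 */
static bool
expired_wrapsafe(int now, int deadline)
{
	return ((int)((unsigned int)now - (unsigned int)deadline) >= 0);
}
```

For a deadline of INT_MAX - 10 and a current value of INT_MIN + 5 (16
ticks later, after the wrap), the direct comparison reports "not expired"
while the difference-based one reports "expired".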


196397 20-Aug-2009 will

Fix CARP memory leaks on carp_if's malloc'd using M_CARP. This occurs when
CARP tries to free them using M_IFADDR after the last address for a virtual
host is removed and when detaching from the parent interface.

Reviewed by: mlaier
Approved by: re (kib), ken (mentor)


196376 19-Aug-2009 tuexen

Fix a bug in the handling of unreliable messages
which results in stalled associations.

Approved by: re, rrs (mentor)
MFC after: immediately


196368 18-Aug-2009 kmacy

- change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
- check that a cached flow corresponds to the same fib index as the
packet for which we are doing the lookup
- at interface detach time flush any flows referencing stale rtentrys
associated with the interface that is going away (fixes reported
panics)
- reduce the time between cleans in case the cleaner is running at
the time the eventhandler is called and the wakeup is missed less
time will elapse before the eventhandler returns
- separate per-vnet initialization from global initialization
(pointed out by jeli@)

Reviewed by: sam@
Approved by: re@


196364 18-Aug-2009 tuexen

Fix a crash when using one-to-one style socket in non-blocking
mode and there is no listening server.
PR: 137795
Approved by: re, rrs (mentor)
MFC after: immediately.


196322 17-Aug-2009 jhb

Purge mergeinfo in sys/ that is either empty or a subset of the parent
mergeinfo on sys/ itself.

Approved by: re (mergeinfo blanket)


196260 15-Aug-2009 tuexen

* Fix a bug where PR-SCTP settings are ignored when using implicit
association setup.
* Fix a bug where messages with illegal stream ids are not deleted.
* Fix a crash when reporting back unsent messages from the send_queue.
* Fix a bug related to INIT retransmission when the socket is already
closed.
* Fix a bug where associations were stalled when partial delivery API
was enabled.
* Fix a bug where the receive buffer size was smaller than the
partial_delivery_point.

Approved by: re, rrs (mentor)
MFC after: One day.


196234 14-Aug-2009 qingli

In function ip_output(), the cached route is flushed when there is a
mismatch between the cached entry and the intended destination. The
cached rtentry{} is flushed but the associated llentry{} is not. This
causes the wrong destination MAC address being used in the output
packets. The fix is to flush the llentry{} when rtentry{} is cleared.

Reviewed by: kmacy, rwatson
Approved by: re


196229 14-Aug-2009 zec

SCTP is not yet compatible with options VIMAGE kernels although it compiles
with VIMAGE defined, so explicitly disallow building such kernels.

Reviewed by: rrs
Approved by: re (rwatson), julian (mentor)


196201 14-Aug-2009 julian

Fix ipfw crash on uid or gid check.
Receiving any ip packet for which there is no existing socket will
crash if ipfw has a uid or gid test rule, as the uid/gid
of the non existent owner of said non existent socket is tested.
Brooks introduced this error as part of his >16 gids patch.
It appears to be a cut-n-paste error from similar code a few lines
before. The old code used the 'pcb' variable here, but in the
new code that switched the 'inp' variable, which is often NULL
and what is tested in the code further up. The rest of the multi-gid
patch for ipfw seems solid (and cleaner than previous code).

Reviewed by: brooks
Approved by: re (rwatson)


196041 02-Aug-2009 rwatson

Add padding to struct inpcb, missed during our padding sweep earlier in
the release cycle.

Approved by: re (kensmith)


196039 02-Aug-2009 rwatson

Many network stack subsystems use a single global data structure to hold
all pertinent statistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too aggressive for
this point in the release cycle.

Reviewed by: bz
Approved by: re (kib)


196019 01-Aug-2009 rwatson

Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by: bz
Approved by: re (vimage blanket)


195976 30-Jul-2009 delphij

Show interface name which received short CARP packet (e.g. a VRRP packet),
in order to match other codepaths nearby. This makes troubleshooting
easier.

Approved by: re (kib)
MFC after: 1 month


195923 28-Jul-2009 julian

Startup the vnet part of initialization a bit after the global part.
Fixes crash on boot if ipfw compiled in.

Submitted by: tegge@
Reviewed by: tegge@
Approved by: re (kib)


195922 28-Jul-2009 julian

Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by: ambrisko
Approved by: re (kib)
MFC after: 3 days


195919 28-Jul-2009 tuexen

Fix a bug where the wrong initialization value
is used for an SCTP specific sysctl variable.

Approved by: re, rrs(mentor).
MFC after: 2 weeks.


195918 28-Jul-2009 rrs

Turns out that when a receiver forwards through its TSNs the
processing code holds the read lock (when processing a
FWD-TSN for pr-sctp). If it finds stranded data that
can be given to the application, it calls sctp_add_to_readq().
The readq function also grabs this lock. So if INVAR is on
we get a double recurse on a non-recursive lock and panic.

This fix will change it so that readq() function gets a
flag to tell if the lock is held, if so then it does not
get the lock.

Approved by: re@freebsd.org (Kostik Belousov)
MFC after: 1 week


195914 27-Jul-2009 qingli

This patch does the following:

- Allow loopback route to be installed for address assigned to
interface of IFF_POINTOPOINT type.
- Install loopback route for an IPv4 interface address when the
"useloopback" sysctl variable is enabled. Similarly, install
loopback route for an IPv6 interface address when the sysctl variable
"nd6_useloopback" is enabled. Deleting loopback routes for interface
addresses is unconditional in case these sysctl variables were
disabled after an interface address has been assigned.

Reviewed by: bz
Approved by: re


195906 27-Jul-2009 tuexen

Fix the handling of unordered messages when using
PR-SCTP.

Approved by: re, rrs (mentor)
MFC after: 3 weeks.


195904 27-Jul-2009 tuexen

Get rid of unused field. This will also be deleted
in the official specification of the SCTP socket API.

Approved by: re, rrs (mentor)


195894 26-Jul-2009 tuexen

Add a missing unlock for the inp lock when
returning early from sctp_add_to_readq().

Approved by: re, rrs (mentor)
MFC after: 2 weeks.


195862 25-Jul-2009 julian

Catch ipfw up to the rest of the vimage code.
It got left behind when it moved to its new location.

Approved by: re (kensmith)


195837 23-Jul-2009 rwatson

Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
occur each time a network stack is instantiated and destroyed. In the
!VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
For the VIMAGE case, we instead use SYSINIT's to track their order and
properties on registration, using them for each vnet when created/
destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
previously, as well as its dependency scheme: we now just use the
SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
they want init functions to be called for each virtual network stack
rather than just once at boot, compiling down to DOMAIN_SET() in the
non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
of modinfo or DOMAIN_SET() for init/uninit events. In some cases,
convert modular components from using modevent to using sysinit (where
appropriate). In some cases, do minor rejuggling of SYSINIT ordering
to make room for or better manage events.

Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup)
Discussed with: jhb, bz, julian, zec
Reviewed by: bz
Approved by: re (VIMAGE blanket)


195814 21-Jul-2009 bz

sysctl_msec_to_ticks is used with both virtualized and
non-virtualized sysctls so we cannot use one common function.

Add a macro to vnet.h to convert arg1 in the virtualized case,
so that the maths is not exposed all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two other places to use the new macro.

Reviewed by: rwatson
Approved by: re (kib)


195788 20-Jul-2009 rwatson

Back out the moving in r195782 of V_ip_id's initialization from the top
back to the bottom of ip_init() as found in 7.x. I missed the fact that
the bottom half of the init routine only runs in the !VNET case.

Submitted by: zec
Approved by: re (vimage blanket)


195782 20-Jul-2009 rwatson

Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by: bz
Approved by: re (vimage blanket)


195760 19-Jul-2009 rwatson

Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use). Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions. Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by: bz
Approved by: re (kib)


195727 16-Jul-2009 rwatson

Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)


195699 14-Jul-2009 rwatson

Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)


195655 13-Jul-2009 lstewart

Fix a race in the manipulation of the V_tcp_sack_globalholes global variable,
which is currently not protected by any type of lock. When triggered, the bug
would sometimes cause a panic when the TCP activity to an affected machine
eventually slowed during a lull. The panic only occurs if INVARIANTS is compiled
into the kernel, and has lain dormant for some time as a result of INVARIANTS
being off by default except in FreeBSD-CURRENT.

Switch to atomic operations in the locations where the variable is changed.
Reads have not been updated to be protected by atomics, so there is a
possibility of accounting errors in any given calculation where the variable is
read. This is considered unlikely to occur in the wild, and will not cause
serious harm on rare occasions where it does.
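
A minimal userland sketch of the approach, using C11 atomics in place of the
kernel's own atomic primitives. The counter and function names are
hypothetical stand-ins, not the code touched by this commit.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the global hole counter. */
static atomic_int sack_globalholes_model;

/* Writers: every modification is a single atomic read-modify-write. */
static void
hole_alloc_account(void)
{
    atomic_fetch_add(&sack_globalholes_model, 1);
}

static void
hole_free_account(void)
{
    atomic_fetch_sub(&sack_globalholes_model, 1);
}

/*
 * Readers: a relaxed snapshot.  It may be stale by the time it is used,
 * which is the accounting imprecision the message above accepts.
 */
static int
holes_snapshot(void)
{
    return (atomic_load_explicit(&sack_globalholes_model,
        memory_order_relaxed));
}
```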

Thanks to Robert Watson for debugging help.

Reported by: Kamigishi Rei <spambox at haruhiism dot net>
Tested by: Kamigishi Rei <spambox at haruhiism dot net>
Reviewed by: silby
Approved by: re (rwatson), kensmith (mentor temporarily unavailable)


195654 13-Jul-2009 lstewart

Replace struct tcpopt with a proxy toeopt struct in the TOE driver interface to
the TCP syncache. This returns struct tcpopt to being private within the TCP
implementation, thus allowing it to be modified without ABI concerns.

The patch breaks the ABI. Bump __FreeBSD_version to 800103 accordingly. The cxgb
driver is the only TOE consumer affected by this change, and needs to be
recompiled along with the kernel.

Suggested by: rwatson
Reviewed by: rwatson, kmacy
Approved by: re (kensmith), kensmith (mentor temporarily unavailable)


195634 12-Jul-2009 lstewart

Pad the following TCP related structs to allow MFCs of upcoming features/fixes
back to the 8 branch:

tcp_var.h
- struct sackhint
- struct tcpcb
- struct tcpstat

The patch breaks the ABI. Bump __FreeBSD_version to 800102 accordingly. User
space tools that rely on the size of any of these structs (e.g. sockstat) need
to be recompiled.

Reviewed by: rpaulo, sam, andre, rwatson
Approved by: re & mentor (gnn)


195023 26-Jun-2009 rwatson

Update various IPFW-related modules to use if_addr_rlock()/
if_addr_runlock() rather than IF_ADDR_LOCK()/IF_ADDR_UNLOCK().

MFC after: 6 weeks


194971 25-Jun-2009 rwatson

Add address list locking for in6_ifaddrhead/ia_link: as with locking
for in_ifaddrhead, we stick with an rwlock for the time being, which
we will revisit in the future with a possible move to rmlocks.

Some pieces of code require significant further reworking to be
safe from all classes of writer-writer races.

Reviewed by: bz
MFC after: 6 weeks


194962 25-Jun-2009 rwatson

Initialize in_ifaddr_lock using RW_SYSINIT() instead of in ip_init(),
so that it doesn't run multiple times if VIMAGE is being used.

Discussed with: bz
MFC after: 6 weeks


194951 25-Jun-2009 rwatson

Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the
in_ifaddrhead and INADDR_HASH address lists.

Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).

For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support. As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done. This means that one
class of reader-writer races still exists.

MFC after: 6 weeks
Reviewed by: bz


194930 24-Jun-2009 oleg

- fix dummynet 'fast' mode for WF2Q case.
- fix printing of pipe profile data.
- introduce new pipe parameter: 'burst' - how much data can be sent through
pipe bypassing bandwidth limit.


194912 24-Jun-2009 rwatson

Fix CARP build.

Reported by: bz


194907 24-Jun-2009 rwatson

Convert netinet6 to using queue(9) rather than hand-crafted linked lists
for the global IPv6 address list (in6_ifaddr -> in6_ifaddrhead). Adopt
the code styles and conventions present in netinet where possible.

Reviewed by: gnn, bz
MFC after: 6 weeks (possibly not MFCable?)


194837 24-Jun-2009 rwatson

Add missing unlock of if_addr_mtx when an unmatched ARP packet is received.

Reported by: lstewart
MFC after: 6 weeks


194835 24-Jun-2009 rwatson

Clear 'ia' after iterating if_addrhead for unicast address matching: since
'ifa' was used as the TAILQ_FOREACH() iterator argument, and 'ia' was just
derived from it, it could be left non-NULL which confused later
conditional freeing code. This could cause kernel panics if multicast IP
packets were received. [1]

Call 'struct in_ifaddr *' in ip_rtaddr() 'ia', not 'ifa' in keeping with
normal conventions.

When ip_input() returns early because 'ipstealth' is enabled, properly
release the 'ia' reference.

Reported by: lstewart, sam [1]
MFC after: 6 weeks


194820 24-Jun-2009 rwatson

In ARP input, more consistently acquire and release ifaddr references.

MFC after: 6 weeks


194777 23-Jun-2009 bz

Make callers to in6_selectsrc() and in6_pcbladdr() pass in memory
to save the selected source address rather than returning an
unreferenced copy to a pointer that might long be gone by the
time we use the pointer for anything meaningful.
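
The calling convention can be sketched as below: the caller supplies storage
and the selector copies the chosen address into it, so nothing points back
into an address list entry that may be freed later. All types and names here
are hypothetical simplifications, not the real in6_selectsrc() signature.

```c
#include <stddef.h>
#include <stdint.h>

struct addr6_model {
    uint8_t bytes[16];
};

struct ifaddr6_model {
    struct addr6_model addr;
};

/*
 * Pick a source address (here simply the first candidate) and copy it
 * into caller-provided memory.  Returns 0 on success, -1 if none exists.
 */
static int
select_src_model(const struct ifaddr6_model *candidates, size_t n,
    struct addr6_model *srcp)
{
    if (n == 0)
        return (-1);
    *srcp = candidates[0].addr;     /* by-value copy, no borrowed pointer */
    return (0);
}
```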

Asked for by: rwatson
Reviewed by: rwatson


194760 23-Jun-2009 rwatson

Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:

ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)


194739 23-Jun-2009 bz

After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.


194672 22-Jun-2009 andre

Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number of stream-specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
the length of data to be moved into the uio compared to
soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
cases is removed.
o much improved readability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams. Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets. Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4: r164817 //depot/user/andre/soreceive_stream/


194660 22-Jun-2009 zec

V_irtualize flowtable state.

This change should make options VIMAGE kernel builds usable again,
to some extent at least.

Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.

Reviewed by: bz
Approved by: julian (mentor)


194622 22-Jun-2009 rwatson

Add a new function, ifa_ifwithaddr_check(), which rather than returning
a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present. In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.
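
The wrapper pattern can be sketched as follows. The table, the structure and
the function names are hypothetical; the point is only that the boolean
helper acquires and drops the reference internally so its callers never see
it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct addr_model {
    uint32_t addr;
    unsigned int refs;
};

/* Toy address list; each entry starts with the list's own reference. */
static struct addr_model addr_table[] = {
    { 0x0a000001, 1 },
    { 0x0a000002, 1 },
};

/* Lookup that returns a referenced entry; the caller must release it. */
static struct addr_model *
addr_lookup_ref(uint32_t addr)
{
    size_t i;

    for (i = 0; i < sizeof(addr_table) / sizeof(addr_table[0]); i++) {
        if (addr_table[i].addr == addr) {
            addr_table[i].refs++;
            return (&addr_table[i]);
        }
    }
    return (NULL);
}

static void
addr_release(struct addr_model *a)
{
    a->refs--;
}

/* Boolean wrapper: callers that only need yes/no never hold a reference. */
static bool
addr_present(uint32_t addr)
{
    struct addr_model *a = addr_lookup_ref(addr);

    if (a == NULL)
        return (false);
    addr_release(a);
    return (true);
}
```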

MFC after: 3 weeks


194616 22-Jun-2009 bz

Remove a hack from r186086 so that IPsec via loopback routes continued
working. It was targeted for stable/7 compatibility and actually never
did anything in HEAD.

Reminded by: rwatson
X-MFC after: never


194602 21-Jun-2009 rwatson

Clean up common ifaddr management:

- Unify reference count and lock initialization in a single function,
ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Move from using a u_int protected by a mutex to refcount(9) for
reference count management.

The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.
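
A rough userland model of the resulting init/ref/free trio, with C11 atomics
standing in for refcount(9). The ifa_model_*() names are hypothetical and the
structure is reduced to its counter.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct ifaddr_model {
    atomic_uint refcount;
};

/* Allocate an object holding one initial reference. */
static struct ifaddr_model *
ifa_model_alloc(void)
{
    struct ifaddr_model *ifa = malloc(sizeof(*ifa));

    if (ifa != NULL)
        atomic_init(&ifa->refcount, 1);
    return (ifa);
}

/* Take an additional reference; no mutex is needed. */
static void
ifa_model_ref(struct ifaddr_model *ifa)
{
    atomic_fetch_add(&ifa->refcount, 1);
}

/* Drop a reference; returns true if this was the last one and the object was freed. */
static bool
ifa_model_free(struct ifaddr_model *ifa)
{
    if (atomic_fetch_sub(&ifa->refcount, 1) == 1) {
        free(ifa);
        return (true);
    }
    return (false);
}
```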

MFC after: 3 weeks


194581 21-Jun-2009 rdivacky

Switch cmd argument to u_long. This matches what if_ethersubr.c does and
allows the code to compile cleanly on amd64 with clang.

Reviewed by: rwatson
Approved by: ed (mentor)


194498 19-Jun-2009 brooks

Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise them to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care of the details of allocating the
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for ensuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used (or NGRPS) as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
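
The truncate-rather-than-fail behaviour for fixed-size consumers might look
like the sketch below. The structure and helper are hypothetical models; only
the limit of 16 comes from the text above.

```c
#include <stddef.h>
#include <string.h>

#define XU_NGROUPS_MODEL 16     /* fixed limit, per the description above */

typedef unsigned int gid_model_t;

/* Fixed-layout credential export, modelled loosely on the idea of xucred. */
struct xucred_model {
    short ngroups;
    gid_model_t groups[XU_NGROUPS_MODEL];
};

/* Copy at most XU_NGROUPS_MODEL groups; silently drop the remainder. */
static void
xucred_fill(struct xucred_model *xu, const gid_model_t *groups, size_t n)
{
    if (n > XU_NGROUPS_MODEL)
        n = XU_NGROUPS_MODEL;
    memcpy(xu->groups, groups, n * sizeof(groups[0]));
    xu->ngroups = (short)n;
}
```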

Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.

Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867


194368 17-Jun-2009 bz

Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.


194357 17-Jun-2009 bz

Add the explicit include of vimage.h to another five .c files still
missing it.

Remove the "hidden" kernel only include of vimage.h from ip_var.h added
with the very first Vimage commit r181803 to avoid further kernel poisoning.


194355 17-Jun-2009 rrs

Changes to the NR-Sack code so that:
1) All bit disappears
2) The two sets of gaps (nr and non-nr) are
disjointed, you don't have gaps struck in
both places.

This adjusts us to correspond to the new draft. Still
to-do, cleanup the code so that there is only one set
of sack routines (original NR-Sack done by E cloned all
sack code).


194305 16-Jun-2009 jhb

Trim extra sets of ()'s.

Requested by: bde


194304 16-Jun-2009 jhb

Fix edge cases with ticks wrapping from INT_MAX to INT_MIN in the handling
of the per-tcpcb t_badrxtwin.

Submitted by: bde


194303 16-Jun-2009 jhb

- Change members of tcpcb that cache values of ticks from int to u_int:
t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age,
t_badrxtwin.
- Change t_recent in struct timewait from u_long to u_int32_t to match
the type of the field it shadows from tcpcb: ts_recent.
- Change t_starttime in struct timewait from u_long to u_int to match
the t_starttime field in tcpcb.

Requested by: bde (1, 3)


194252 15-Jun-2009 jamie

Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by: zec, julian
Approved by: bz (mentor)


194245 15-Jun-2009 oleg

Since dn_pipe.numbytes is int64_t now - remove unnecessary overflow detection
code in ready_event_wfq().


194076 12-Jun-2009 bz

Move the kernel option FLOWTABLE checking from the header file to the
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.

Reviewed by: rwatson


194062 12-Jun-2009 vanhu

Added support for NAT-Traversal (RFC 3948) in IPsec stack.

Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...

X-MFC: never

Reviewed by: bz
Approved by: gnn(mentor)
Obtained from: NETASQ


194003 11-Jun-2009 jhb

Correct printf format type mismatches.


194002 11-Jun-2009 jhb

Trim extra ()'s.

Submitted by: bde


193941 10-Jun-2009 jhb

Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs. This fixes a few overflow edge cases on 64-bit
platforms. Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.
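
The edge case can be reproduced with plain integer arithmetic. In this
sketch, which is a model and not the kernel code, int32_t stands for the
tick counter and uint64_t for a 64-bit unsigned long; the narrow variant uses
unsigned 32-bit subtraction, the well-defined form of the wrap-safe
difference.

```c
#include <stdint.h>

/*
 * Idle time when the cached tick value lives in a 64-bit unsigned field:
 * both operands are widened, so a wrap of the 32-bit counter shows up as
 * an enormous difference.
 */
static uint64_t
idle_ticks_wide(int32_t now, int32_t cached)
{
    uint64_t stored = (uint64_t)(int64_t)cached;

    return ((uint64_t)(int64_t)now - stored);
}

/*
 * Idle time when the cached value has the same 32-bit width as the
 * counter: modular subtraction yields the true elapsed tick count.
 */
static uint32_t
idle_ticks_narrow(int32_t now, int32_t cached)
{
    return ((uint32_t)now - (uint32_t)cached);
}
```

With a packet received 16 ticks before the counter wraps and the timer
firing right at the wrap, the narrow form reports 16 idle ticks while the
wide form reports a value far beyond any configured idle limit.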

Reviewed by: silby, gnn, lstewart
MFC after: 1 week


193938 10-Jun-2009 imp

These are no longer referenced in the tree, so can be safely removed.

Reviewed by: bms@


193896 10-Jun-2009 luigi

in ip_dn_ctl(), do not allocate a large structure on the stack,
and use malloc() instead if/when it is necessary.

The problem is less relevant in previous versions because
the variable involved (tmp_pipe) is much smaller there.
Still worth fixing though.

Submitted by: Marta Carbone (GSOC)
MFC after: 3 days


193895 10-Jun-2009 bz

Remove the "The option TCPDEBUG requires option INET." requirement.
In case of !INET we will not have a timestamp on the trace for now
but that might only affect spx debugging as long as INET6 requires
INET.

Reviewed by: rwatson (earlier version)


193894 10-Jun-2009 luigi

small simplifications to the code in charge of reaping deleted rules:
- clear the head pointer immediately before using it, so there is
no chance of mistakes;
- call reap_rules() unconditionally. The function can handle a NULL
argument just fine, and the cost of the extra call is hardly
significant given that we do it rarely and outside the lock.
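
The two simplifications can be sketched as below. The structures and
function names are hypothetical models of the pattern, not the ipfw code:
the list head is detached and cleared in one step, and the reaper accepts
NULL so callers need no conditional.

```c
#include <stddef.h>
#include <stdlib.h>

struct rule_model {
    struct rule_model *next;
};

struct chain_model {
    struct rule_model *reap;    /* rules waiting to be freed */
};

/* Detach the reap list and clear the head pointer (done under the lock). */
static struct rule_model *
chain_take_reap(struct chain_model *chain)
{
    struct rule_model *head = chain->reap;

    chain->reap = NULL;
    return (head);
}

/* Free a detached list (done outside the lock); NULL is a valid no-op input. */
static unsigned int
reap_rules_model(struct rule_model *head)
{
    unsigned int n = 0;

    while (head != NULL) {
        struct rule_model *next = head->next;

        free(head);
        head = next;
        n++;
    }
    return (n);
}
```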

MFC after: 3 days


193859 09-Jun-2009 oleg

Close a long-standing race with net.inet.ip.fw.one_pass = 0:
If a packet leaves ipfw for another kernel subsystem (dummynet, netgraph, etc.)
it carries a pointer to the matching ipfw rule. If the packet is then reinjected
into ipfw, ruleset processing starts from that rule. If the rule was deleted
in the meantime, the race could cause a panic (as well as
other odd effects like parsing rules in the 'reap list').

P.S. this commit changes ABI so userland ipfw related binaries should be
recompiled.

MFC after: 1 month
Tested by: Mikolaj Golub


193744 08-Jun-2009 bz

After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.


193731 08-Jun-2009 zec

Introduce an infrastructure for dismantling vnet instances.

Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be fixed incrementally over the next weeks in
smaller commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.

Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)


193664 07-Jun-2009 hrs

Fix and add a workaround on an issue of EtherIP packet with reversed
version field sent via gif(4)+if_bridge(4). The EtherIP
implementation found on FreeBSD 6.1, 6.2, 6.3, 7.0, 7.1, and 7.2 had
an interoperability issue because it sent the incorrect EtherIP
packets and discarded the correct ones.

This change introduces the following two flags to gif(4):

accept_rev_ethip_ver: accepts both correct EtherIP packets and ones
with reversed version field, if enabled. If disabled, the gif
accepts the correct packets only. This flag is enabled by
default.

send_rev_ethip_ver: sends EtherIP packets with reversed version field
intentionally, if enabled. If disabled, the gif sends the correct
packets only. This flag is disabled by default.

These flags are stored in struct gif_softc and can be set by
ifconfig(8) on per-interface basis.
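
The semantics of the two flags can be illustrated with a small model. It
assumes the RFC 3378 layout, in which version 3 occupies the high nibble of
the first header octet (0x30), and takes "reversed" to mean the nibbles are
swapped (0x03). The function names are hypothetical and this is not the
gif(4) implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETHERIP_OCTET_OK_MODEL  0x30    /* version 3 in the high nibble (assumed) */
#define ETHERIP_OCTET_REV_MODEL 0x03    /* version 3 in the low nibble (assumed) */

/* Input side: models accept_rev_ethip_ver. */
static bool
etherip_accept(uint8_t first_octet, bool accept_rev)
{
    if (first_octet == ETHERIP_OCTET_OK_MODEL)
        return (true);
    return (accept_rev && first_octet == ETHERIP_OCTET_REV_MODEL);
}

/* Output side: models send_rev_ethip_ver. */
static uint8_t
etherip_first_octet(bool send_rev)
{
    return (send_rev ? ETHERIP_OCTET_REV_MODEL : ETHERIP_OCTET_OK_MODEL);
}
```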

Note that this is an incompatible change of EtherIP with the older
FreeBSD releases. If you need to interoperate between older FreeBSD boxes and
new versions after this commit, setting "send_rev_ethip_ver" is
needed.

Reviewed by: thompsa and rwatson
Spotted by: Shunsuke SHINOMIYA
PR: kern/125003
MFC after: 2 weeks


193582 06-Jun-2009 zec

Unbreak options VIMAGE build.

Submitted by: julian (mentor)
Approved by: julian (mentor)


193550 05-Jun-2009 pjd

Only four out of nine arguments for ip_ipsec_output() are actually used.
Kill unused arguments except for 'ifp' as it might be used in the future
for detecting IPsec-capable interfaces.


193532 05-Jun-2009 luigi

move kernel ipfw-related sources to a separate directory,
adjust conf/files and modules' Makefiles accordingly.

No code or ABI changes so this and most of previous related
changes can be easily MFC'ed

MFC after: 5 days


193516 05-Jun-2009 luigi

Several ipfw options and actions use a 16-bit argument to indicate
pipes, queues, tags, rule numbers and so on.
These are all different namespaces, and the only thing they have in
common is the fact they use a 16-bit slot to represent the argument.

There is some confusion in the code, mostly for historical reasons,
on how the values 0 and 65535 should be used. At the moment, 0 is
forbidden almost everywhere, while 65535 is used to represent a
'tablearg' argument, i.e. the result of the most recent table() lookup.

For now, try to use explicit constants for the min and max allowed
values, and do not overload the default rule number for that.
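
A sketch of what explicit bounds for such a 16-bit argument could look like.
The constant names and the exact limits (1 and 65534) are assumptions derived
from the description above, in which 0 is forbidden and 65535 is reserved for
'tablearg'.

```c
#include <stdbool.h>
#include <stdint.h>

#define ARG_MIN_MODEL       1       /* 0 is forbidden */
#define ARG_MAX_MODEL       65534   /* largest ordinary value (assumed) */
#define ARG_TABLEARG_MODEL  65535   /* "use the last table() lookup result" */

/* Validate a 16-bit argument; tablearg is only legal where the caller allows it. */
static bool
arg_is_valid(uint16_t arg, bool allow_tablearg)
{
    if (arg == ARG_TABLEARG_MODEL)
        return (allow_tablearg);
    return (arg >= ARG_MIN_MODEL && arg <= ARG_MAX_MODEL);
}
```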

Also, make the MTAG_IPFW declaration only visible to the kernel.

NOTE: I think the issue needs to be revisited before 8.0 is out:
the 2^16 namespace limit for rule numbers and pipe/queue is
annoying, and we can easily bump the limit to 2^32 which gives
a lot more flexibility in partitioning the namespace.

MFC after: 5 days


193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


193510 05-Jun-2009 rwatson

Unifdef MAC label pointer in syncache entries -- in general, ifdef'd
structure contents are a bad idea in the kernel for binary
compatibility reasons, and this is a single pointer that is now included
in compiles by default anyway due to options MAC being in GENERIC.


193502 05-Jun-2009 luigi

More cleanup in preparation of ipfw relocation (no actual code change):

+ move ipfw and dummynet hooks declarations to raw_ip.c (definitions
in ip_var.h) same as for most other global variables.
This removes some dependencies from ip_input.c;

+ remove the IPFW_LOADED macro, just test ip_fw_chk_ptr directly;

+ remove the DUMMYNET_LOADED macro, just test ip_dn_io_ptr directly;

+ move ip_dn_ruledel_ptr to ip_fw2.c which is the only file using it;

To be merged together with rev 193497

MFC after: 5 days


193497 05-Jun-2009 luigi

Small changes (no actual code changes) in preparation of moving ipfw-related
stuff to its own directory, and cleaning headers and dependencies:

In this commit:
+ remove one use of a typedef;
+ document dn_rule_delete();
+ replace one usage of the DUMMYNET_LOADED macro with its value;

No MFC planned until the cleanup is complete.


193435 04-Jun-2009 luigi

fix a bug introduced in rev.190865 related to the signedness
of the credit of a pipe. In passing, also use explicit
signed/unsigned types for two other fields.
Noticed by Oleg Bulyzhin and Maxim Ignatenko long ago;
I forgot to commit the fix.

Does not affect RELENG_7.


193391 03-Jun-2009 rwatson

Continue work to optimize performance of "options MAC" when no MAC policy
modules are loaded by avoiding mbuf label lookups when policies aren't
loaded, pushing further socket locking into MAC policy modules, and
avoiding locking MAC ifnet locks when no policies are loaded:

- Check mac_policy_count before looking for mbuf MAC label m_tags in MAC
Framework entry points. We will still pay label lookup costs if MAC
policies are present but don't require labels (typically a single mbuf
header field read, but perhaps further indirection if IPSEC or other
m_tag consumers are in use).

- Further push socket locking for socket-related access control checks and
events into MAC policies from the MAC Framework, so that sockets are
only locked if a policy specifically requires a lock to protect a label.
This resolves lock order issues during sonewconn() and also in local
domain socket cross-connect where multiple socket locks could not be
held at once for the purposes of propagating MAC labels across multiple
sockets. Eliminate mac_policy_count check in some entry points where it
no longer avoids locking.

- Add mac_policy_count checking in some entry points relating to network
interfaces that otherwise lock a global MAC ifnet lock used to protect
ifnet labels.

Obtained from: TrustedBSD Project


193332 02-Jun-2009 rwatson

Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from: TrustedBSD Project


193272 01-Jun-2009 jhb

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held, however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowledge of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.
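
The return-value protocol can be modelled in userland as below. Everything
here is a hypothetical simplification (a boolean stands in for the socket
buffer lock, and the *_MODEL names are not the kernel identifiers); it only
illustrates that the upcall runs with the lock held and that the connected
transition is deferred until the lock has been dropped.

```c
#include <stdbool.h>
#include <stddef.h>

enum { SU_OK_MODEL = 0, SU_ISCONNECTED_MODEL = 1 };

struct socket_model;

struct sockbuf_model {
    bool locked;
    int (*upcall)(struct socket_model *, void *);
    void *arg;
};

struct socket_model {
    struct sockbuf_model rcv;
    bool connected;
    int bytes_ready;
};

/* Wakeup path: run the upcall under the lock, act on its result afterwards. */
static void
sowakeup_model(struct socket_model *so)
{
    int ret = SU_OK_MODEL;

    so->rcv.locked = true;
    if (so->rcv.upcall != NULL) {
        ret = so->rcv.upcall(so, so->rcv.arg);
        if (ret == SU_ISCONNECTED_MODEL)
            so->rcv.upcall = NULL;      /* cleared automatically */
    }
    so->rcv.locked = false;
    if (ret == SU_ISCONNECTED_MODEL)
        so->connected = true;           /* deferred until unlocked */
}

/* Accept-filter-like upcall: report "connected" once enough data has arrived. */
static int
data_ready_upcall(struct socket_model *so, void *arg)
{
    int want = *(int *)arg;

    return (so->bytes_ready >= want ? SU_ISCONNECTED_MODEL : SU_OK_MODEL);
}
```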

Discussed with: rwatson, rmacklem


193232 01-Jun-2009 bz

Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by: julian, rwatson, zec
X-MFC: not possible


193231 01-Jun-2009 bms

Merge fixes from p4:
* Tighten v1 query input processing.
* Borrow changes from MLDv2 for how general queries are processed.
* Do address field validation upfront before accepting input.
* Do NOT switch protocol version if old querier present timer active.
* Always clear IGMPv3 state in igmp_v3_cancel_link_timers().
* Update comments.

Tested by: deeptech71 at gmail dot com


193219 01-Jun-2009 rwatson

Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processing its own
workstream, or set of per-protocol queues. Threads may be bound
to specific CPUs, or allowed to migrate, based on a global policy.

In the future it would be desirable to support topology-centric
policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
currently be one of:

NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
an implicit or explicit source (such as an interface or socket).

NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
as well as allowing protocols to provide a flow generation function
for mbufs without flow identifiers (m2flow). Falls back on
NETISR_POLICY_SOURCE if no flow ID is available.

NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
being used, as well as a mapping function from workstream to CPU ID,
which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
get and clear drop counters, which query data or apply changes across
all workstreams.

- Add a more extensible netisr registration interface, in which
protocols declare 'struct netisr_handler' structures for each
registered NETISR_ type. These include name, handler function,
optional mbuf to flow ID function, optional mbuf to CPU ID function,
queue limit, and ordering policy. Padding is present to allow these
to be expanded in the future. If no queue limit is declared, then
a default is used.

- Queue limits are now per-workstream, and raised from the previous
IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
with the exception of netnatm, default queue limits. Most protocols
register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
NETISR_POLICY_FLOW, and will therefore take advantage of driver-
generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
the netisr, rather than having polling pretend to be two protocols.
Provide two explicit hooks in the netisr worker for start and end
events for runs: netisr_poll() and netisr_pollmore(), as well as a
function, netisr_sched_poll(), to allow the polling code to schedule
netisr execution. DEVICE_POLLING still embeds single-netisr
assumptions in its implementation, so for now if it is compiled into
the kernel, a single and un-bound netisr thread is enforced
regardless of tunable configuration.
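
As an illustration of flow-based placement with a source fallback, consider
the sketch below. The packet structure and the modulo mapping are assumptions
made for the example; the message above does not state how the framework maps
identifiers to workstreams.

```c
#include <stdbool.h>
#include <stdint.h>

struct pkt_model {
    bool has_flowid;
    uint32_t flowid;        /* e.g. supplied by the driver */
    uint32_t source_id;     /* e.g. derived from the receiving interface */
};

/*
 * Choose a workstream: prefer the flow ID so that one flow always lands
 * on the same workstream, otherwise fall back to the source.
 */
static unsigned int
select_workstream(const struct pkt_model *p, unsigned int nws)
{
    uint32_t key = p->has_flowid ? p->flowid : p->source_id;

    if (nws == 0)
        return (0);
    return (key % nws);
}
```

Because the mapping depends only on the key, packets of one flow keep their
relative order: they are always queued to the same workstream.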

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking. It will be enabled once further rmlock
optimization has taken place. However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by: bz


193217 01-Jun-2009 pjd

- Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easily make it a SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows the use of this
functionality (currently only unjailed root can use it).

Discussed with: julian, adrian, jhb, rwatson, kmacy


193090 30-May-2009 rrs

Adds missing sysctl to manage the vtag_time_wait time. This will
even allow disabling time-wait altogether if you set the value
to 0 (not advisable actually). The default remains the same
i.e. 60 seconds.


193089 30-May-2009 rrs

Fix a small memory leak from the nr-sack code - the mapping array
was not being freed at term of association. Also get rid of
the MICHAELS_EXP code.


193088 30-May-2009 rrs

Make sctp_uio user to kernel structure match the
socket-api draft. Two fields were uint32_t when they
should have been uint16_t.

Reported by Jonathan Leighton at U-del.


192912 27-May-2009 zml

Correct handling of SYN packets that are to the left of the current window of an ESTABLISHED connection.

Reviewed by: net@, gnn
Approved by: dfr (mentor)


192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


192893 27-May-2009 trasz

Don't discard packets with 'Destination Unreachable' at the beginning
of ip_forward(), if the IPSEC is compiled in. It is possible that there
is an SPD that these packets will go through, even if there is no matching
route. If not, ICMP will be sent anyway, after ip_output().

This is somewhat similar in purpose to r191621, except that one was
for the packets sent from the host, while this one is for packets
being forwarded by the host.

Reviewed by: bz@
Sponsored by: Wheel Sp. z o.o. (http://www.wheel.pl)


192848 26-May-2009 jhb

Correct the sense of a test so that this filter always waits for the full
request to arrive. Previously it would end up returning as soon as the
request length stored in the first two bytes had arrived.
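
The message does not show the corrected test. As a rough sketch of the
intended behaviour only, a completeness check for a request that carries
its length in the first two bytes might look like the following. The
helper name, the big-endian byte order, and the assumption that the
length does not count the prefix itself are illustrative, not taken from
the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel code: returns 1 once a complete
 * request is buffered, 0 if more data is still needed.
 * Assumes a two-byte big-endian length prefix that does not count itself.
 */
static int
request_complete(const uint8_t *buf, size_t avail)
{
	size_t need;

	if (avail < 2)
		return (0);		/* length prefix not fully here yet */
	need = ((size_t)buf[0] << 8) | buf[1];
	/* Ready only when the prefix AND the whole payload have arrived. */
	return (avail >= need + 2);
}
```

With a prefix announcing three bytes, the check stays false at two, three
and four buffered bytes and becomes true at five.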

Reviewed by: dwmalone
MFC after: 1 week


192761 25-May-2009 rwatson

Remove comment about moving tcp_reass() to its own file named tcp_reass.c;
that happened a while ago.

MFC after: 3 days


192651 23-May-2009 bz

With the introduction of the UDP control block, the UDP uma zone had to
be named "udp_inpcb" to avoid a naming conflict with tcp[1].
For consistency rename the uma zone for TCP from "inpcb" to "tcp_inpcb".

Found by: rwatson [1]
Discussed with: rwatson


192649 23-May-2009 bz

Implement UDP control block support.

So far the udp_tun_func_t had been (ab)using inp_ppcb for udp in kernel
tunneling callbacks. Move that into the udpcb and add a field for flags
there to be used by upcoming changes instead of sticking udp only flags
into in_pcb flags2.

Bump __FreeBSD_version for ports to detect it and because of vnet* struct
size changes.

Submitted by: jhb (7.x version)
Reviewed by: rwatson


192648 23-May-2009 bz

Add sysctls to toggle the behaviour of the (former) IPSEC_FILTERTUNNEL
kernel option.
This also permits tuning of the option per virtual network stack, as
well as separately per inet, inet6.

The kernel option is left for a transition period, marked deprecated,
and will be removed soon.

Initially requested by: phk (1 year 1 day ago)
MFC after: 4 weeks


192612 22-May-2009 bz

If including vnet.h one has to include opt_route.h as well. This is
because struct vnet_net holds the rt_tables[][] for MRT and array size
is compile time dependent. If you had ROUTETABLES set to >1 after
r192011, V_loif was pointing into nonsense, leading to strange results
or even panics for some people.

Reviewed by: mz


192528 21-May-2009 rwatson

Consolidate and clean up the first section of ip_output.c in light of the
last year or two's work on routing:

- Combine iproute initialization and flowtable lookup blocks, eliminating
unnecessary tests for known-zero'd iproute fields.

- Add a comment indicating (a) why the route entry returned by the
flowtable is considered stable and (b) that the flowtable lookup must
occur after the setup of the mbuf flow ID.

- Assert the inpcb lock before any use of inpcb fields.

Reviewed by: kmacy


192476 20-May-2009 qingli

When an interface address is removed and the last prefix
route is also being deleted, the link-layer address table
(arp or nd6) will flush those L2 llinfo entries that match
the removed prefix.

Reviewed by: kmacy


192351 18-May-2009 bz

Revert the logical change of r192341.

net.inet.ip.fw.one_pass is a classic ip_input.c variable and is used in
the pfil and bridge code as well. As ipfw is loadable we need to always
provide it. That is the reason why it lives in struct vnet_inet and
not in struct vnet_ipfw.


192341 18-May-2009 jhb

- Fix typo in description of 'net.inet.ip.fw.autoinc_step'.
- Use 'vnet_ipfw' instead of 'vnet_inet' for 'net.inet.ip.fw.one_pass'.


192262 17-May-2009 bz

Unbreak options VIMAGE builds, in a followup to r192011 which did not
introduce INIT_VNET_NET() initializers necessary for accessing V_loif.

Submitted by: zec
Reviewed by: julian


192116 14-May-2009 rwatson

Staticize two functions not used outside of in_pcb.c: in_pcbremlists() and
db_print_inpcb().

MFC after: 1 month


192085 14-May-2009 qingli

Ignore the INADDR_ANY address inserted/deleted by DHCP when installing a loopback route
to the interface address.


192011 12-May-2009 qingli

This patch adds a host route to an interface address (that is assigned
to a non loopback/ppp link types) through the loopback interface. Prior
to the new L2/L3 rewrite, this host route is implicitly added by the L2
code during RTM_RESOLVE of that interface address. This host route is
deleted when that interface is removed.

Reviewed by: kmacy


191943 09-May-2009 imp

Remove bogus comment.


191932 09-May-2009 jhb

Convert IPFW_DEFAULT_TO_ACCEPT into a loader tunable
'net.inet.ip.fw.default_to_accept'. The current value can also be queried
via a read-only sysctl of the same name.

Requested by: plosher
MFC after: 1 week


191917 08-May-2009 zec

A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by: bz
Approved by: julian (mentor) (an earlier version of the diff)


191916 08-May-2009 zec

Remove a bogus check that unintentionally slipped in r191816.

This change has no functional impact on nooptions VIMAGE builds.
Submitted by: bz


191891 07-May-2009 rrs

repository sync to multi-OS repo ... spacing change


191890 07-May-2009 rrs

ABI expansions to hopefully future-proof our MIB/netstat code for 8.0


191846 06-May-2009 zec

Remove unnecessary CURVNET_SET() calls where curvnet context is
(i.e. seems to be) already set.

This should reduce console noise due to curvnet recursion reports.

This change has no impact on nooptions VIMAGE builds.
Approved by: julian (mentor)


191845 06-May-2009 zec

Unbreak options VIMAGE kernel builds.

Approved by: julian (mentor)


191816 05-May-2009 zec

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The curvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, with an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typical cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
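
The set/restore discipline described above can be sketched in portable C
as follows. This is a simplified model under stated assumptions, not the
kernel macros: the DEMO_ names, the one-field struct vnet, and the use of
C11 _Thread_local in place of curthread->td_vnet are all stand-ins.

```c
#include <assert.h>
#include <stddef.h>

struct vnet { int id; };	/* stand-in for the kernel's struct vnet */

/* Models the per-thread curvnet pointer (assumption: C11 thread-local). */
static _Thread_local struct vnet *demo_curvnet;

/* Simplified stand-ins for CURVNET_SET() / CURVNET_RESTORE(). */
#define DEMO_CURVNET_SET(vp)					\
	struct vnet *demo_saved_vnet = demo_curvnet;		\
	demo_curvnet = (vp)
#define DEMO_CURVNET_RESTORE()					\
	demo_curvnet = demo_saved_vnet

/* An entry point into "networking code": set context, work, restore. */
static int
demo_socket_op(struct vnet *so_vnet)
{
	int seen;

	DEMO_CURVNET_SET(so_vnet);
	seen = demo_curvnet->id;	/* callees would read curvnet here */
	DEMO_CURVNET_RESTORE();
	return (seen);
}
```

Because the previous value is saved in a local variable, a nested set on
the same thread restores correctly, which is why recursion is permitted
even though it is discouraged.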

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


191738 02-May-2009 zec

Make indentation more uniform across vnet container structs.

This is a purely cosmetic / NOP change.

Reviewed by: bz
Approved by: julian (mentor)
Verified by: svn diff -x -w producing no output


191734 02-May-2009 zec

Unbreak options VIMAGE + nooptions INVARIANTS kernel builds.

Submitted by: julian
Approved by: julian (mentor)


191688 30-Apr-2009 zec

Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with two additional fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
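
The three resolutions of V_ifnet listed under item 1 can be modelled with
a compile-time switch, roughly as below. This is an illustrative sketch:
the DEMO_VIMAGE and DEMO_VIMAGE_GLOBALS symbols, the int-typed _ifnet
field, and demo_touch_ifnet() are assumptions made so the fragment
compiles outside the kernel.

```c
#include <assert.h>

struct vnet_net { int _ifnet; };	/* int stands in for the real list head */

#if defined(DEMO_VIMAGE)
/* options VIMAGE: field reached through a base pointer in scope. */
#define V_ifnet	(((struct vnet_net *)vnet_net)->_ifnet)
#elif defined(DEMO_VIMAGE_GLOBALS)
/* options VIMAGE_GLOBALS: plain global variable. */
static int ifnet;
#define V_ifnet	ifnet
#else
/* default build: field of a single global container struct. */
static struct vnet_net vnet_net_0;
#define V_ifnet	(vnet_net_0._ifnet)
#endif

/* The same source line compiles against all three layouts. */
static int
demo_touch_ifnet(void *vnet_net)
{
	(void)vnet_net;		/* only dereferenced in the DEMO_VIMAGE variant */
	V_ifnet = 42;
	return (V_ifnet);
}
```

The point of the accessor is that code using V_ifnet does not change
between configurations; only the macro definition does.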

Reviewed by: bz, rwatson
Approved by: julian (mentor)


191672 29-Apr-2009 bms

Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:

* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.

NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.

This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.


191661 29-Apr-2009 bms

Add MLDv2 prototypes and defines.


191660 29-Apr-2009 bms

Use KTR_INET for MROUTING CTRs.


191659 29-Apr-2009 bms

Cut over to KTR_INET for CTR.
For clarity, put pointer increment/size decrement on own line
when copying out in-mode source filters to userland.


191658 29-Apr-2009 bms

Do not assume that ip6_moptions is always set, it is
a lazy-allocated structure.


191657 29-Apr-2009 bms

Fix a problem whereby enqueued IGMPv3 filter list changes would be
incorrectly output, if the RB-tree enumeration happened to reuse the
same chain for a mode switch: that is, both ALLOW and BLOCK records
were appended for the same group, in the same mbuf packet chain.

This was introduced during an mbuf chain layout bug fix involving
m_getptr(), which obviously cannot count from offset 0 on the
second pass through the RB-tree when serializing the IGMPv3
group records into the pending mbuf chain.

Cut over to KTR_INET for IGMPv3 CTR usage.


191621 28-Apr-2009 trasz

Don't require packet to match a route (any route; this information wasn't
used anyway, so a typical workaround was to add a dummy route) if it's going
to be sent through IPSec tunnel.

Reviewed by: bz


191570 27-Apr-2009 oleg

Optimize packet flow: if net.inet.ip.fw.one_pass != 0 and packet was
processed by ipfw once - avoid second ipfw_chk() call.
This saves us from unnecessary IPFW_RLOCK(), m_tag_find() calls and
ip/tcp/udp header parsing.

MFC after: 2 month


191548 26-Apr-2009 zec

In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by: bz (an older version of the patch)
Approved by: julian (mentor)


191528 26-Apr-2009 rwatson

Acquire IF_ADDR_LOCK() around most iterations over ifp->if_addrhead
(colloquially known as if_addrlist). Currently not acquired around
interface address loops that call out to the routing code due to
potential lock order issues.

MFC after: 3 weeks


191500 25-Apr-2009 rwatson

Expand coverage of IF_ADDR_LOCK() in in_control() from point of initial
lookup of 'ia' from if_addrhead through most use. Note that we
currently have to drop it prematurely in some cases due to calls out to
the routing and interface code while using 'ia', but this closes many
races. Annotate several potential races that persist after this change.
Move to using M_NOWAIT for allocating new interface addresses due to
lock(s) being held.

MFC after: 3 weeks


191476 24-Apr-2009 rwatson

In in_purgemaddrs(), remove the inm being freed from the address list
before freeing it, rather than vice versa, to avoid potential use
after free.

Reviewed by: bms


191456 24-Apr-2009 rwatson

Relocate permissions checking code in in_control() to before the body
of the implementation of ioctls. This makes the mapping of ioctls to
specific privileges more explicit, and also simplifies the
implementation by reducing the use of FALLTHROUGH handling in switch.

While this is not intended to be a functional change, it does mean
that certain privilege checks are now performed earlier, so EPERM
might be returned in preference to EADDRNOTAVAIL for management
ioctls that could have failed for both reasons.

MFC after: 3 weeks


191443 23-Apr-2009 rwatson

Reorganize in_control() so that invariants are more obvious, and so
that it is easier to lock:

- Handle the unsupported ioctl case at the beginning of in_control(),
handing off to ifp->if_ioctl, rather than looking up interfaces and
addresses unnecessarily in this case.

- Make it an invariant that ifp is always non-NULL when running
in_control()-implemented ioctls, simplifying the code structure.

MFC after: 3 weeks


191356 21-Apr-2009 bms

Bracket struct mfc and struct rtdetq with #ifdef _KERNEL.
Match the bracketing in netstat.
Since the cleanup of MROUTING, ports have broken because they
expect to include <netinet/ip_mroute.h> without including
<sys/queue.h>. Fix breakage at source.

The real fix, of course, is to fix the MROUTING APIs by blowing them
away and replacing them with something else...


191348 21-Apr-2009 bms

remove IFF_ASSERTGIANT


191338 20-Apr-2009 rwatson

Prefer actual field names (if_addrhead, ifa_link) to macros aliasing
those field names in FreeBSD code.

MFC after: 2 weeks


191314 20-Apr-2009 rwatson

In ip_input(), cache the received mbuf's network interface in a local
variable. Acquire the interface address list lock when iterating over
the interface address list searching for a matching received broadcast
address.

MFC after: 2 weeks


191311 20-Apr-2009 rwatson

In icmp_reflect(), acquire the interface address list lock when
searching for a source address to use.

MFC after: 2 weeks
Reviewed by: bz


191288 19-Apr-2009 rwatson

Lock the interface address list when searching for a matching interface
by address, or when implementing 'me' rules on IPv6. Prefer the field
name if_addrhead to the macro if_addrlist.

MFC after: 2 weeks


191287 19-Apr-2009 rwatson

In divert_packet(), lock the interface address list before iterating over
it in search of an address.

MFC after: 2 weeks


191286 19-Apr-2009 rwatson

Lock interface address lists in in_pcbladdr() when searching for a
source address for a connection and there's no route or no interface
for the route.

MFC after: 2 weeks


191285 19-Apr-2009 rwatson

Protect against some writer-writer races in in_control() by acquiring
the interface address list lock around interface address list
modifications. More to do here.

MFC after: 2 weeks


191264 19-Apr-2009 bms

Now that IFF_NEEDSGIANT has been removed from the network
stack, catch up with this in IGMPv3 and remove dead code.
This has the side-effect of not being back-portable to RELENG_7
w/o further changes.


191259 19-Apr-2009 kmacy

- Allocate a small flowtable in ip_input.c (changeable by tunable)
- Use for accelerating ip_output


191160 16-Apr-2009 kmacy

s/void/void */


191158 16-Apr-2009 kmacy

restore spare pointers for MFCing


191148 16-Apr-2009 kmacy

Change if_output to take a struct route as its fourth argument in order
to allow passing a cached struct llentry * down to L2

Reviewed by: rwatson


191129 15-Apr-2009 kmacy

- convert pspare pointers in inpcb to an llentry and rtentry cache
- add flags to indicate their validity


191126 15-Apr-2009 kmacy

- add second flags field to inpcb
- update comments in vflag


191125 15-Apr-2009 kmacy

provide additional convenience macros for inpcb locking (upgrade, downgrade, exclusive)


191120 15-Apr-2009 kmacy

make LLTABLE visible to netinet


191117 15-Apr-2009 kmacy

add an llentry to struct route{_in6} to allow it to be passed around with
the rtentry


191073 14-Apr-2009 rrs

Add missing address lock when we look at the ifa list


191049 14-Apr-2009 rrs

Move the flight size reduction to right after
we recognize it's a retransmit, ahead of the PR-SCTP
work. Without this fix, we end up NOT reducing flight
size and causing a miscalculation when PR-SCTP is active
and data is skipped.

Obtained from: Michael Tuexen.


190978 12-Apr-2009 rwatson

Put TCPSTAT_ADD() and TCPSTAT_INC() behind _KERNEL.

MFC after: 3 days


190968 12-Apr-2009 rwatson

Update stats in struct carpstats using two new macros: CARPSTATS_ADD()
and CARPSTATS_INC(), rather than directly manipulating the fields of
the structure. This will make it easier to change the implementation
of these statistics, such as using per-CPU versions of the data
structure.

MFC after: 3 days


190967 12-Apr-2009 rwatson

Update stats in struct pimstat using two new macros: PIMSTAT_ADD()
and PIMSTAT_INC(), rather than directly manipulating the fields of
the structure. This will make it easier to change the
implementation of these statistics, such as using per-CPU versions
of the data structure.

MFC after: 3 days


190966 12-Apr-2009 rwatson

Update stats in struct mrtstat using two new macros: MRTSTAT_ADD()
and MRTSTAT_INC(), rather than directly manipulating the fields of
the structure. This will make it easier to change the
implementation of these statistics, such as using per-CPU versions
of the data structure.

MFC after: 3 days


190965 12-Apr-2009 rwatson

Update stats in struct igmpstat using two new macros:
IGMPSTAT_ADD() and IGMPSTAT_INC(), rather than directly
manipulating the fields of the structure. This will make it
easier to change the implementation of these statistics,
such as using per-CPU versions of the data structures.

MFC after: 3 days


190964 12-Apr-2009 rwatson

Update stats in struct icmpstat and icmp6stat using four new
macros: ICMPSTAT_ADD(), ICMPSTAT_INC(), ICMP6STAT_ADD(), and
ICMP6STAT_INC(), rather than directly manipulating the fields
of these structures across the kernel. This will make it
easier to change the implementation of these statistics,
such as using per-CPU versions of the data structures.

In one case, icmp6stat members are manipulated indirectly, by
icmp6_errcount(), and this will require further work to fix
for per-CPU stats.

MFC after: 3 days


190962 12-Apr-2009 rwatson

Update stats in struct udpstat using two new macros, UDPSTAT_ADD()
and UDPSTAT_INC(), rather than directly manipulating the fields
across the kernel. This will make it easier to change the
implementation of these statistics, such as using per-CPU versions
of the data structures.

MFC after: 3 days


190951 11-Apr-2009 rwatson

Update stats in struct ipstat using four new macros, IPSTAT_ADD(),
IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly
manipulating the fields across the kernel. This will make it easier
to change the implementation of these statistics, such as using
per-CPU versions of the data structures.
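
As a minimal sketch of the pattern, assuming a single global structure
behind the macros, the four accessors might be shaped like this. The
DEMO_ prefix, the two-counter struct, and demo_input() are hypothetical;
only the ADD/INC/SUB/DEC naming follows the message.

```c
#include <assert.h>
#include <stdint.h>

struct demo_ipstat {			/* hypothetical subset of counters */
	uint64_t ips_total;
	uint64_t ips_forward;
};

static struct demo_ipstat demo_ipstat;

/* All updates funnel through these macros; only they know the storage. */
#define DEMO_IPSTAT_ADD(name, val)	(demo_ipstat.name += (val))
#define DEMO_IPSTAT_SUB(name, val)	(demo_ipstat.name -= (val))
#define DEMO_IPSTAT_INC(name)		DEMO_IPSTAT_ADD(name, 1)
#define DEMO_IPSTAT_DEC(name)		DEMO_IPSTAT_SUB(name, 1)

/* A caller names the counter and never touches the struct directly. */
static void
demo_input(int forwarded)
{
	DEMO_IPSTAT_INC(ips_total);
	if (forwarded)
		DEMO_IPSTAT_INC(ips_forward);
}
```

Switching to per-CPU storage would then mean changing the two base macros
rather than every call site.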

MFC after: 3 days


190948 11-Apr-2009 rwatson

Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after: 3 days


190941 11-Apr-2009 piso

What's the point of adjusting a checksum if we are going to toss the
packet? Anticipate the check/return code.


190938 11-Apr-2009 piso

Plug two bugs introduced with modules conversion:

-UdpAliasIn(): correctly check return code after modules ran.
-alias_nbt: in case of malformed packets (or some other unrecoverable
error), toss the packet.


190935 11-Apr-2009 piso

Remove stale comments.


190909 11-Apr-2009 zec

Introduce vnet module registration / initialization framework with
dependency tracking and ordering enforcement.

With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized. In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.

The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw). Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.

The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module. The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.

For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.

A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on. Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first. The framework will panic if it detects any unresolved
dependencies before completing system initialization. Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.
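
The single-prerequisite rule can be illustrated with a small model in
which a module is initialized only once the module it depends on has
been. This is a sketch under assumptions and not the algorithm in
sys/kern/kern_vimage.c: every demo_ name, the fixed-size table, and the
deferred-initialization loop are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MOD_NONE	(-1)
#define DEMO_MOD_MAX	8

struct demo_modinfo {
	int id;			/* statically assigned module ID */
	int dependson;		/* single prerequisite, or DEMO_MOD_NONE */
	int registered;
	int initialized;
};

static struct demo_modinfo demo_mods[DEMO_MOD_MAX];
static int demo_init_order[DEMO_MOD_MAX];
static int demo_init_count;

/* Initialize every registered module whose prerequisite is ready. */
static void
demo_run_pending(void)
{
	int progress;

	do {
		progress = 0;
		for (int i = 0; i < DEMO_MOD_MAX; i++) {
			struct demo_modinfo *m = &demo_mods[i];

			if (!m->registered || m->initialized)
				continue;
			if (m->dependson != DEMO_MOD_NONE &&
			    !demo_mods[m->dependson].initialized)
				continue;	/* prerequisite not ready yet */
			m->initialized = 1;
			demo_init_order[demo_init_count++] = m->id;
			progress = 1;
		}
	} while (progress);
}

/* Record a module and its prerequisite, then run whatever became ready. */
static void
demo_mod_register(int id, int dependson)
{
	demo_mods[id].id = id;
	demo_mods[id].dependson = dependson;
	demo_mods[id].registered = 1;
	demo_run_pending();
}

/* Number of registered modules still waiting on a prerequisite. */
static int
demo_unresolved(void)
{
	int n = 0;

	for (int i = 0; i < DEMO_MOD_MAX; i++)
		if (demo_mods[i].registered && !demo_mods[i].initialized)
			n++;
	return (n);
}
```

In this model a non-zero demo_unresolved() at the end of boot corresponds
to the unresolved-dependency condition on which the framework panics.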

Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run. In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container. IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.

The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.

Reviewed by: bz
Approved by: julian (mentor)


190880 10-Apr-2009 kmacy

Import "flowid" support for serializing flows across transmit queues

Reviewed by: rwatson and jeli


190865 09-Apr-2009 luigi

Add emulation of delay profiles, which lets you model various
types of MAC overheads such as preambles, link level retransmissions
and more.

Note- this commit changes the userland/kernel ABI for pipes
(but not for ordinary firewall rules) so you need to rebuild
kernel and /sbin/ipfw to use dummynet features.

Please check the manpage for details on the new feature.

The MFC would be trivial but it breaks the ABI, so it will
be postponed until after 7.2 is released.

Interested users are welcome to apply the patch manually
to their RELENG_7 tree.

Work supported by the European Commission, Projects Onelab and
Onelab2 (contract 224263).


190843 08-Apr-2009 rrs

Fix a FR bug. When doing PR-SCTP with the number of rtx
set to a low number, the check for skipping was in the
incorrect place, which meant we would FR chunks we
should not.
MFC after: 1 Month


190842 08-Apr-2009 rrs

Add more padding and a new variable. This will
help us be able to keep ABI compatibility between
8 and 9.
MFC after: Never


190841 08-Apr-2009 piso

-don't pass down, to module's fingerprint function, unused data like
a pointer to the ip header.
-style
-spacing


190800 07-Apr-2009 bz

With the right comparison we get a proper wscale value and thus
more adequate TCP performance with IPv6.

Changes for IPv4, r166403 and r172795, both ignored the
IPv6 counterpart and left it in the state of the art of the year 2000.

The same logic in syncache already shares code between v4 and v6 so
things do not need to be adapted there.
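
The message does not quote the comparison that was corrected. As a
general illustration of why the compared value matters, a window scale
is commonly chosen as the smallest shift whose scaled 16-bit window
covers some buffer limit. The function name and the choice of limit
below are assumptions; 65535 and 14 are the standard maximum unscaled
window and maximum shift.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TCP_MAXWIN		65535u	/* largest unscaled window */
#define DEMO_TCP_MAX_WINSHIFT	14	/* largest shift allowed by RFC 7323 */

/* Smallest shift whose scaled window covers buf_limit, capped at 14. */
static int
demo_request_wscale(uint64_t buf_limit)
{
	int scale = 0;

	while (scale < DEMO_TCP_MAX_WINSHIFT &&
	    ((uint64_t)DEMO_TCP_MAXWIN << scale) < buf_limit)
		scale++;
	return (scale);
}
```

Comparing against a limit of 64 KB or less yields a shift of 0, so the
window can never grow past 65535 bytes, while a 2 MB limit yields a
shift of 6.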

Reported by: Steinar Haug (sthaug nethelp.no)
Tested by: Steinar Haug (sthaug nethelp.no)
MFC after: 3 days


190787 06-Apr-2009 zec

First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.

Modify the existing initializer functions which are invoked via
protosw, like ip_init() et al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).

While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.

Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.

Approved by: julian (mentor)


190753 05-Apr-2009 kan

If KTR_SUBSYS is compiled in, it does not necessarily mean that user
is interested in being spammed by mcast-related printfs.

Use proper check against ktr_mask instead of KTR_COMPILE.


190692 04-Apr-2009 bms

Fix mbuf chain layout pessimization:
in the case where a single mbuf is allocated due to
m_getcl() returning NULL, we already call MH_ALIGN,
so do not increment m->m_data in this case.

Found during MLDv2 port.


190691 04-Apr-2009 bms

Do not obliterate QQI with MAXRESP.

Found during MLDv2 port.


190689 04-Apr-2009 rrs

Many bug fixes (from the IETF hack-fest):
- PR-SCTP had major issues when skipping through a multi-part message.
o Did not look at socket buffer.
o Did not properly handle the reassembly queue.
o The MARKED segments could interfere and un-skip a chunk causing
a problem with the proper FWD-TSN.
o No FR of FWD-TSN's was being done.
- NR-Sack code was basically disabled. It needed fixes that
never got into the real code.
- CMT code had issues when the two paths were NOT the same b/w. We
found a few small bugs, but also the critical one here was not
dividing the rwnd amongst the paths.

Obtained from: Michael Tuexen and myself at the IETF hack-fest ;-)


190633 01-Apr-2009 piso

Implement an ipfw action to reassemble ip packets: reass.


190354 24-Mar-2009 bms

Don't call m_freem() after ip_output(), as it always consumes
the mbuf chain provided to it.

Found by: Pierre Guinoiseau


190233 22-Mar-2009 jmallett

Remove local in6_addr variables for local and foreign addresses in sysctl_drop,
they were passed uninitialized to in6_pcblookup_hash. Instead, do as is done
for IPv4 and use the addresses within the sockaddr structure, which are
correctly populated.

This fixes tcpdrop(8) for IPv6 address pairs.

Reviewed by: bz


190148 20-Mar-2009 bms

Fix brainos introduced during mechanical KTR change.

Pointy hat to: bms


190054 19-Mar-2009 bms

Cleanup: Nuke debug.mrtdebug, and replace it with KTR.


190012 19-Mar-2009 bms

Introduce a number of changes to the MROUTING code.
This is purely a forwarding plane cleanup; no control plane
code is involved.

Summary:
* Split IPv4 and IPv6 MROUTING support. The static compile-time
kernel option remains the same, however, the modules may now
be built for IPv4 and IPv6 separately as ip_mroute_mod and
ip6_mroute_mod.
* Clean up the IPv4 multicast forwarding code to use BSD queue
and hash table constructs. Don't build our own timer abstractions
when ratecheck() and timevalclear() etc will do.
* Expose the multicast forwarding cache (MFC) and virtual interface
table (VIF) as sysctls, to reduce netstat's dependence on libkvm
for this information for running kernels.
* bandwidth meters however still require libkvm.
* Make the MFC hash table size a boot/load-time tunable ULONG,
net.inet.ip.mfchashsize (defaults to 256).
* Remove unused members from struct vif and struct mfc.
* Kill RSVP support, as no current RSVP implementation uses it.
These stubs could be moved to raw_ip.c.
* Don't share locks or initialization between IPv4 and IPv6.
* Don't use a static struct route_in6 in ip6_mroute.c.
The v6 code is still using a cached struct route_in6, this is
moved to mif6 for the time being.
* More cleanup remains to be merged from ip_mroute.c to ip6_mroute.c.

v4 path tested using ports/net/mcast-tools.
v6 changes are mostly mechanical locking and *have not* been tested.
As these changes partially break some kernel ABIs, they will not
be MFCed. There is a lot more work to be done here.

Reviewed by: Pavlin Radoslavov


190011 19-Mar-2009 bms

Comment IGMP_PIM as being very historic, as in, don't use.


189931 17-Mar-2009 bms

Deal with the case where ifma_protospec may be NULL, during
any IPv4 multicast operations which reference it.

There is a potential race because ifma_protospec is set to NULL
when we discover the underlying ifnet has gone away. This write
is not covered by the IF_ADDR_LOCK, and it's difficult to widen
its scope without making it a recursive lock. It isn't clear why
this manifests more quickly with 802.11 interfaces, but does not
seem to manifest at all with wired interfaces.

With this change, the 802.11 related panics reported by sam@
and cokane@ should go away. It is not the right fix, that requires
more thought before 8.0.

Idea from: sam
Tested by: cokane


189851 15-Mar-2009 rwatson

Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free. This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile. They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

if_ar
if_axe
if_aue
if_cdce
if_cue
if_kue
if_ray
if_rue
if_rum
if_sr
if_udav
if_ural
if_zyd

Drivers that were already disabled because of tty changes:

if_ppp
if_sl

Discussed on: arch@


189848 15-Mar-2009 rwatson

Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz


189836 14-Mar-2009 rrs

Oops.. I missed a file on the commit :-)


189829 14-Mar-2009 das

Namespace: Defining htonl() and friends here instead of arpa/inet.h is
a BSD extension.


189790 14-Mar-2009 rrs

Fixes several PR-SCTP related bugs.
- When sending large PR-SCTP messages over a
lossy link we would incorrectly calculate the fwd-tsn
- When receiving large multipart pr-sctp packets we would
incorrectly send back a SACK that would renege improperly
on already received packets thus causing unneeded retransmissions.


189657 11-Mar-2009 rwatson

Add INP_INHASHLIST flag for inpcb->inp_flags to indicate whether
or not the inpcb is currently on various hash lookup lists, rather
than using (lport != 0) to detect this. This means that the full
4-tuple of a connection can be retained after close, which should
lead to more sensible netstat output in the window between TCP
close and socket close.
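
The idea can be sketched in userland C as follows. This is only an
illustration under stated assumptions: struct sk_pcb, SK_INHASHLIST and
the sk_*() helpers are hypothetical names, not the kernel's inpcb code.
Membership is tracked by a flag, so removal does not need to zero the
port fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag, standing in for INP_INHASHLIST. */
#define SK_INHASHLIST 0x1u

/* Hypothetical, heavily reduced control block. */
struct sk_pcb {
	uint16_t lport;		/* local port */
	uint16_t fport;		/* foreign port */
	uint32_t flags;
};

/* Record that the block is now on the lookup lists. */
static void
sk_insert(struct sk_pcb *p)
{
	p->flags |= SK_INHASHLIST;
}

/* Take the block off the lists; the ports are deliberately left intact. */
static void
sk_remove(struct sk_pcb *p)
{
	p->flags &= ~SK_INHASHLIST;
}

/* Membership test that no longer depends on (lport != 0). */
static bool
sk_on_hash(const struct sk_pcb *p)
{
	return ((p->flags & SK_INHASHLIST) != 0);
}
```

After sk_remove() the port pair is still readable, which is the property
that lets a reporting tool keep showing the connection's addresses.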

MFC after: 2 weeks


189637 10-Mar-2009 rwatson

Remove unused v6 macro aliases for inpcb fields:

in6p_ip6_nxt
in6p_vflag
in6p_flags
in6p_socket
in6p_lport
in6p_fport
in6p_ppcb

Remove unused v6 macro aliases for inpcb flags:

IN6P_HIGHPORT
IN6P_LOWPORT
IN6P_ANONPORT
IN6P_RECVIF
IN6P_MTUDISC
IN6P_FAITH
IN6P_CONTROLOPTS

References to in6p_lport and in6p_fport in sockstat are also replaced with
normal inp_lport and inp_fport references.

MFC after: 3 days
Reviewed by: bz


189635 10-Mar-2009 bms

Don't print inm_print() chatter when KTR_IGMPV3 is not enabled
in the KTR_COMPILE mask.

Found by: gnn


189615 10-Mar-2009 rwatson

Remove now-unused INP_UNMAPPABLEOPTS.

MFC after: 3 days
Discussed with: bz


189603 09-Mar-2009 bms

Fix uninitialized use of ifp for ii.

Found by: Peter Holm


189592 09-Mar-2009 bms

Merge IGMPv3 and Source-Specific Multicast (SSM) to the FreeBSD
IPv4 stack.

Diffs are minimized against p4.
PCS has been used for some protocol verification, more widespread
testing of recorded sources in Group-and-Source queries is needed.
sizeof(struct igmpstat) has changed.

__FreeBSD_version is bumped to 800070.


189494 07-Mar-2009 marius

On architectures with strict alignment requirements compensate
the misalignment of the IP header that prepending the EtherIP
header might have caused.

PR: 131921
MFC after: 1 week


189444 06-Mar-2009 rrs

Fixes for window probes:
1) WP should never be marked unless flight size is 0
2) When recovering from wp if the peer ack's it we don't mark for retran
3) When recovering, we must assure a timer is still running.


189371 04-Mar-2009 rrs

- PR-SCTP bug, where the CUM-ACK was not being updated
into the advance_peer_ack point so we would incorrectly
send a wrong value in the FWD-TSN
- PR-SCTP bug, where an PR packet is used for a window
probe which could incorrectly get the packet moved
back into the send_queue, which will cause major issues and
should not happen.
- Fix a trace to use the proper macro.


189359 04-Mar-2009 bms

In ip_output(), do not acquire the IN_MULTI_LOCK(),
and do not attempt to perform a group lookup.
This is a socket layer lock, and the bottom half of IP
really has no business taking it.

Use the value of the in_mcast_loop sysctl to determine
if we should loop back by default, in the absence of
any multicast socket options. Because the check on
group membership is now deferred to the input path,
an m_copym() is now required.

This should increase multicast send performance where the
source has not requested loopback, although this has not been
benchmarked or measured.

It is also a necessary change for IN_MULTI_LOCK to become
non-recursive, which is required in order to implement IGMPv3
in a thread-safe way.


189357 04-Mar-2009 bms

Add sysctl net.inet.ip.mcast.loop. This controls whether or not
IPv4 multicast sends are looped back to senders by default
on a stack-wide basis, rather than relying on the socket option.
Note that the sysctl only applies to newly created multicast sockets.
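
For comparison, the per-socket control that the sysctl supplies a default
for is the standard IP_MULTICAST_LOOP socket option. A minimal userland
sketch, assuming an already created IPv4 datagram socket; the helper name
set_mcast_loop() is made up for this example.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Enable (non-zero) or disable (0) loopback of multicast sends on
 * socket s. Returns 0 on success, -1 on error with errno set by
 * setsockopt(2).
 */
static int
set_mcast_loop(int s, int enable)
{
	unsigned char loop = enable ? 1 : 0;

	return (setsockopt(s, IPPROTO_IP, IP_MULTICAST_LOOP,
	    &loop, sizeof(loop)));
}
```

An application that never calls this gets whatever default the stack
applies when the socket is set up for multicast.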


189347 04-Mar-2009 bms

Merge header file definitions used by the new IGMPv3 implementation.
This is a partial merge. Compatibility defines are retained for
the existing IGMPv2 implementation.


189346 04-Mar-2009 bms

Add various defines/macros required by IGMPv3:
* MCAST_UNDEFINED state.
* in_allhosts() macro (group is 224.0.0.1).
This uses a const endian comparison.
* IP_MAX_GROUP_SRC_FILTER, IP_MAX_SOCK_SRC_FILTER
default resource limits.
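
A "const endian comparison" can be sketched like this: the byte swap is
applied to the constant, which the compiler can fold, rather than to the
address being tested. The function name is_allhosts() is hypothetical and
this is not the kernel macro itself.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/*
 * True if a (in network byte order) is the all-hosts group 224.0.0.1.
 * htonl() is applied to the constant, so no swap of the input is needed.
 */
static bool
is_allhosts(struct in_addr a)
{
	return (a.s_addr == htonl(INADDR_ALLHOSTS_GROUP));
}
```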


189343 04-Mar-2009 bms

Add function ip_checkrouteralert(), which will be used
by IGMPv3 to check for the IPv4 Router Alert [RFC2113]
option in a pulled-up IP mbuf chain.
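
As a rough illustration of what such a check involves, the sketch below
walks the options of a contiguous IPv4 header looking for Router Alert
(type 148, length 4 per RFC 2113). It works on a plain byte buffer; the
name has_router_alert() and its return convention are assumptions and do
not describe the signature of ip_checkrouteralert().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL	0	/* end of option list */
#define OPT_NOP	1	/* no operation (single byte) */
#define OPT_RA	148	/* Router Alert, 0x94 */

/* Scan the IPv4 header at hdr (len bytes available) for Router Alert. */
static bool
has_router_alert(const uint8_t *hdr, size_t len)
{
	size_t hlen, off;

	if (len < 20)
		return (false);
	hlen = (size_t)(hdr[0] & 0x0f) * 4;	/* IHL in bytes */
	if (hlen < 20 || hlen > len)
		return (false);

	for (off = 20; off < hlen; ) {
		uint8_t type = hdr[off];
		uint8_t optlen;

		if (type == OPT_EOL)
			break;
		if (type == OPT_NOP) {
			off++;
			continue;
		}
		if (off + 1 >= hlen)
			return (false);		/* no room for a length byte */
		optlen = hdr[off + 1];
		if (optlen < 2 || off + optlen > hlen)
			return (false);		/* malformed option */
		if (type == OPT_RA && optlen == 4)
			return (true);
		off += optlen;
	}
	return (false);
}
```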


189303 03-Mar-2009 bz

Start removing IPv6 Type 0 Routing header code.
RH0 was deprecated by RFC 5095.

While most of the code had been disabled by #if 0 already, leave a
bit of infrastructure for possible RH2 code and a log message under
BURN_BRIDGES in case a user still tries to send RH0 packets.

Reviewed by: gnn (a bit back, earlier version)


189289 02-Mar-2009 luigi

curr_time is a 64 bit variable so SYSCTL_LONG is not appropriate
as a handler.
The variable was exported only for debugging, but there is little reason
to do it now that the timekeeping is supported by various other variables.
For the time being just comment out the sysctl, but I think this
should go away.


189288 02-Mar-2009 luigi

fw_debug has been unused for ages, so remove it from the list
of sysctl_variables.
I would also remove it from the VNET record but I am unsure if
there is any ABI issue -- so for the time being just mark it as
unused in ip_fw.h, and then we will collect the garbage at some
appropriate time in the future.

MFC after: 3 days


189225 01-Mar-2009 bz

Add size-guards evaluated at compile-time to the main struct vnet_*
which are not in a module of their own like gif.

Single kernel compiles and universe will fail if the size of the struct
changes. The expected values are given in sys/vimage.h.
See the comments there for how to handle this.
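
A compile-time size guard of this kind can be sketched with C11
_Static_assert. The structure and the DEMO_VNET_SIZE constant below are
invented for illustration; the kernel uses its own assertion macro and
the real expected sizes live in the header named above.

```c
#include <stdint.h>

/* Hypothetical container structure. */
struct demo_vnet {
	uint32_t a;
	uint32_t b;
	uint64_t c;
};

/* Hypothetical expected size, recorded next to the structure. */
#define DEMO_VNET_SIZE 16

/* The build fails here if a member is added without updating the size. */
_Static_assert(sizeof(struct demo_vnet) == DEMO_VNET_SIZE,
    "struct demo_vnet changed size; update DEMO_VNET_SIZE");
```

The point of the guard is that an accidental layout change is caught at
build time rather than showing up later as an ABI mismatch.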

Requested by: peter


189196 28-Feb-2009 rwatson

Remove unreachable code for generating RST segments from tcp_twcheck();
this code became stale when T/TCP support was removed.

Discussed with: bz, sam
MFC after: 1 month


189121 27-Feb-2009 rrs

Fix the add stream feature of strm-reset to really work:
- Fix the copy, we can't do a blind copy but must transfer
the data from the old to the new.
- Fix the ACK processing so we properly stop retransmitting
the thing.
- Fix it so if we get a retran we will properly reply with
the saved response without doing anything.

MFC after: 1 month


189106 27-Feb-2009 bz

For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.


189004 24-Feb-2009 rdivacky

Change the functions to ANSI in those cases where it breaks promotion
to int rule. See ISO C Standard: SS6.7.5.3:15.

Approved by: kib (mentor)
Reviewed by: warner
Tested by: silence on -current


188992 24-Feb-2009 rwatson

In tcp_usr_shutdown() and tcp_usr_send(), I missed converting NULL
checks for the tcpcb, previously used to detect complete disconnection,
with INP_DROPPED checks. Correct that, preventing shutdown() from
improperly generating a TCP segment with destination IP and port of
0.0.0.0:0.

PR: kern/132050
Reported by: david gueluy <david.gueluy at netasq.com>
MFC after: 3 weeks


188962 23-Feb-2009 rwatson

In in_rtqkill(), assert the radix head lock, and pass RTF_RNH_LOCKED
to in_rtrequest(); the radix head lock is already acquired before
rnh_walktree is called in in_rtqtimo_one(). This avoids a recursive
acquisition that is no longer permitted in 8.x due to use of an rwlock
for the radix head lock.

Reported by: dikshie <dikshie at gmail.com>
MFC after: 3 days


188854 20-Feb-2009 rrs

Add the add-stream capability. Still needs more
testing..

MFC after: 1 month


188852 20-Feb-2009 rrs

Fix a bug. The sending was being restricted improperly by
the max_burst. It should only be gated by cwnd in the
lower level send.

Obtained from: Michael Tuexen
MFC after: 1 week.


188676 16-Feb-2009 luigi

correct some #include


188673 16-Feb-2009 luigi

remove dependency on eventhandler.h, we only need a forward declaration


188672 16-Feb-2009 luigi

remove dependency on net/if.h of this header


188669 16-Feb-2009 luigi

use a const format string in the log message so we can check the
arguments (if/when we enable those checks)


188626 15-Feb-2009 luigi

remove unnecessary #include from vnet.h and vinet.h

Approved by: Marko Zec


188605 14-Feb-2009 rrs

This commit fixes the issue with alias_sctp.c. No
longer do we require SCTP to be in the kernel for the
lib to be able to handle SCTP. We do this by moving
the CRC32c checksum into libkern/crc32.c and then adjusting
all routines to use the common methods. Note that this
will improve the performance of iSCSI since they were
using the old single 256 bit table lookup versus the
slicing 8 algorithm (which gives a 4x speed up in
CRC32c calculation :-D)
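
For reference, CRC32c itself can be written as a bit-at-a-time loop. The
sketch below is neither the table-driven nor the slicing-by-8 variant
discussed here, only a slow reference implementation (reflected
polynomial 0x82F63B78) that is handy for checking a faster one. The
function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32c (Castagnoli) over len bytes at buf. */
static uint32_t
crc32c_bitwise(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;
	int k;

	while (len--) {
		crc ^= *p++;
		for (k = 0; k < 8; k++) {
			/* XOR in the polynomial when the low bit is set. */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
		}
	}
	return (crc ^ 0xFFFFFFFFu);
}
```

The standard check value for the ASCII string "123456789" is 0xE3069283.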

Reviewed by: rwatson, gnn, scottl, paolo
MFC after: 4 week? (assuming we MFC the alias_sctp changes)


188590 13-Feb-2009 rrs

Have the jail code use the error returned to pass not constant
errors.
Obtained from: jamie@freebsd.org


188580 13-Feb-2009 luigi

remove unnecessary #include, and document some of the others


188578 13-Feb-2009 luigi

Use uint32_t instead of n_long and n_time, and uint16_t instead of n_short.
Add a note next to fields in network format.

The n_* types are not enough for compiler checks on endianness, and their
use often requires an otherwise unnecessary #include <netinet/in_systm.h>

The typedef in in_systm.h are still there.


188577 13-Feb-2009 rrs

Move the new rwnd field down to the very end
of the xsctp structure. This is where all new
fields belong (not that we will be ABI compatible
with 7.x anyway.. sigh).


188398 09-Feb-2009 rrs

Add padding to the end of the xsctp_xxx structures to
allow future changes to be able to maintain ABI compatibility


188388 09-Feb-2009 rrs

Fix minor spacing problem found by s9indent from last
commit.


188387 09-Feb-2009 rrs

Fix INET only build breakage with SCTP - pointy hat to me :-)


188306 08-Feb-2009 bz

Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by: sam
Discussed with: rwatson (renaming to *_internal)
MFC after: 26 days
X-MFC: keep wrapper functions for public symbols?


188299 08-Feb-2009 piso

Silence LINT: add 2 stubs (update_crc32 and sctp_finalize_crc32) to fix LIBALIAS + SCTP_NO_CSUM case.


188294 07-Feb-2009 piso

Add SCTP NAT support.

Submitted by: CAIA (http://caia.swin.edu.au)


188148 05-Feb-2009 jamie

Remove redundant calls of prison_local_ip4 in in_pcbbind_setup, and of
prison_local_ip6 in in6_pcbbind.

Approved by: bz (mentor)


188144 05-Feb-2009 jamie

Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.
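
The return convention described above can be sketched as follows. All
types and names (demo_prison, demo_cred, demo_check_ip4) are hypothetical
stand-ins, not the kernel's prison code; only the three outcomes mirror
the text.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical jail: a list of permitted IPv4 addresses. */
struct demo_prison {
	const uint32_t *ip4;
	size_t nip4;
};

/* Hypothetical credential; prison is NULL when not jailed. */
struct demo_cred {
	const struct demo_prison *prison;
};

/*
 * Returns 0 if addr may be used, EAFNOSUPPORT if the jail has no IPv4
 * addresses at all, EADDRNOTAVAIL if addr is not one of them.
 */
static int
demo_check_ip4(const struct demo_cred *cr, uint32_t addr)
{
	size_t i;

	if (cr->prison == NULL)
		return (0);		/* non-jailed: always succeeds */
	if (cr->prison->nip4 == 0)
		return (EAFNOSUPPORT);
	for (i = 0; i < cr->prison->nip4; i++) {
		if (cr->prison->ip4[i] == addr)
			return (0);
	}
	return (EADDRNOTAVAIL);
}
```

A caller can then simply propagate the returned value instead of
substituting its own constant, and needs no separate jailed test first.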

Approved by: bz (mentor)


188100 03-Feb-2009 rrs

LOR fix - Lock only when calling the actual code that
is messing with the UDP tunnel. This means
that if two users actually tried to change the
tunnel port at the same time interesting things COULD
result, but it's probably very unlikely to happen :-)


188067 03-Feb-2009 rrs

- Cleanup checksum code.
- Prepare for CRC offloading, add MIB counters (RS/MT).
- Bugfix: Disable CRC computation for IPv6 addresses with local scope (MT).
- Bugfix: Handle close() with SO_LINGER correctly when notifications
are generated during the close() call(MT).
- Bugfix: Generate DRY event when sender is dry during subscription.
Only for 1-to-1 style sockets (RS/MT)
- Bugfix: Put vtags for the correct amount of time into time-wait (MT).
- Bugfix: Clear vtag entries correctly on expiration (MT).
- Bugfix: shutdown() indicates ENOTCONN when called for unconnected
1-to-1 style sockets (MT).
- Bugfix: In sctp Auth code (PL).
- Add support for devices that support SCTP csum offload (igb).
- Add missing sctp_associd to mib sysctl xsctp_tcb structure (RS)
Obtained from: With help from Peter Lei and Michael Tuexen


188066 03-Feb-2009 rrs

Adds support for SCTP checksum offload. This means
we, like TCP and UDP, move the checksum calculation
into the IP routines when there is no hardware support
we call into the normal SCTP checksum routine.

The next round of SCTP updates will use
this functionality. Of course the IGB driver needs
a few updates to support the new intel controller set
that actually does SCTP csum offload too.

Reviewed by: gnn, rwatson, kmacy


187822 28-Jan-2009 luigi

initialize a couple of variables, gcc 4.2.4-4 (linux) reports
some possible uninitialized uses and the warning does make sense.


187821 28-Jan-2009 luigi

For some reason (probably dating ages ago) an #ifdef SYSCTL_NODE / #endif
section included a lot of stuff that did not belong there.
So split the block in multiple components each around the relevant stuff.

This said, I wonder if building a kernel where SYSCTL_NODE is not
defined is supported at all.

Submitted by: Marta Carbone


187684 25-Jan-2009 bz

For consistency with prison_{local,remote,check}_ipN rename
prison_getipN to prison_get_ipN.

Submitted by: jamie (as part of a larger patch)
MFC after: 1 week


187585 22-Jan-2009 bz

Add externs to fix build with VIMAGE_GLOBALS after r187289.


187380 18-Jan-2009 sam

remove too noisy DIAGNOSTIC code

Reviewed by: qingli


187304 15-Jan-2009 piso

Silence userland warnings about missing prototypes.

Submitted by: Roman Divacky <rdivacky@freebsd.org>


187289 15-Jan-2009 lstewart

Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.

The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.
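
The window growth rules of RFC 3465 can be sketched as below. This is a
simplified illustration, not the code added to the TCP stack: the struct
and function names are invented, the slow-start limit L is fixed at 2,
and the handling of retransmission timeouts is left out.

```c
#include <stdint.h>

/* Hypothetical per-connection state for the sketch. */
struct abc_state {
	uint32_t cwnd;		/* congestion window, bytes */
	uint32_t ssthresh;	/* slow start threshold, bytes */
	uint32_t bytes_acked;	/* accumulator used in congestion avoidance */
};

#define ABC_L 2			/* slow-start limit, in segments */

/* Account for an ACK that newly acknowledges 'acked' bytes. */
static void
abc_on_ack(struct abc_state *s, uint32_t acked, uint32_t smss)
{
	if (s->cwnd < s->ssthresh) {
		/* Slow start: grow by the bytes acked, capped at L*SMSS. */
		uint32_t limit = ABC_L * smss;

		s->cwnd += (acked < limit) ? acked : limit;
	} else {
		/* Congestion avoidance: one SMSS per cwnd of acked data. */
		s->bytes_acked += acked;
		if (s->bytes_acked >= s->cwnd) {
			s->bytes_acked -= s->cwnd;
			s->cwnd += smss;
		}
	}
}
```

Counting bytes rather than ACKs means a receiver that delays or thins
its ACKs does not slow window growth, and one that splits ACKs cannot
inflate it.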

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by: rpaulo, gnn
Approved by: gnn, kmacy (mentors)
Sponsored by: FreeBSD Foundation


187062 11-Jan-2009 rwatson

Since we allow conditional allocation of labels on syncache entries,
remove historic assertion that labels are always present.


186980 09-Jan-2009 bz

Restrict arp, ndp and theoretically the FIB listing (if not
read with libkvm) to the addresses of a prison, when inside a
jail. [1]
As the patch from the PR was pre-'new-arp', add checks to the
llt_dump handlers as well.

While touching RTM_GET in route_output(), consistently use
curthread credentials rather than the creds from the socket
there. [2]

PR: kern/68189
Submitted by: Mark Delany <sxcg2-fuwxj@qmda.emu.st> [1]
Discussed with: rwatson [2]
Reviewed by: rwatson
MFC after: 4 weeks


186963 09-Jan-2009 adrian

Fix fat-fingered comment.

Noticed-by: julian


186961 09-Jan-2009 adrian

Fix indentation; add FALLTHROUGH.

Thanks Max!


186960 09-Jan-2009 adrian

Better comment what the socket option does. Thanks to Sam Leffler
for suggesting this.


186959 09-Jan-2009 adrian

Comment some potentially confusing logic.

Nitpicking by: mlaier

MFC after: 2 weeks


186955 09-Jan-2009 adrian

Implement a new IP option (not compiled/enabled by default) to allow
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.

This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.

This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealt with by some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commits.

Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.

Obtained from: Julian Elischer and others
MFC after: 2 weeks


186948 09-Jan-2009 bz

Make SIOCGIFADDR and related, as well as SIOCGIFADDR_IN6 and related
jail-aware. Up to now we returned the first address of the interface
for SIOCGIFADDR w/o an ifr_addr in the query. This caused problems for
programs querying for an address but running inside a jail, as the
address returned usually did not belong to the jail.
Like for v6, if there was an ifr_addr given on v4, you could probe
for more addresses on the interfaces that you were not allowed to see
from inside a jail. Return an error (EADDRNOTAVAIL) in that case
now unless the address is on the given interface and valid for the
jail.

PR: kern/114325
Reviewed by: rwatson
MFC after: 4 weeks


186935 09-Jan-2009 harti

Set a minimum of information in the routing message (like version and type)
so that generic routing message parsing code can parse the messages for
L2 info that are retrieved via the sysctl interface.


186821 06-Jan-2009 rrs

Addresses Robert's comments on comments. Also adds
the KASSERT and checks suggested.

Reviewed by: The udp tunneling was discussed on net@ under the
thread entitled "Heads up -- Thinking about UDP and tunneling"


186813 06-Jan-2009 rrs

Add the ability of an alternate transport protocol
to easily tunnel over udp by providing a hook
function that will be called instead of appending
to the socket buffer.


186717 03-Jan-2009 rwatson

Allow the IP_MINTTL socket option to be set to 0 so that it can be
disabled entirely, which is its default state before set to a
non-zero value.
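
The intended semantics can be sketched as a small predicate: a minimum
TTL of 0 means "no check", any other value rejects packets whose TTL is
below it. The function name is hypothetical and the exact comparison
used in the kernel input path is an assumption here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a packet with TTL pkt_ttl passes a socket's minimum
 * TTL setting. min_ttl == 0 disables the check (assumed semantics).
 */
static bool
minttl_accepts(uint8_t min_ttl, uint8_t pkt_ttl)
{
	if (min_ttl == 0)
		return (true);
	return (pkt_ttl >= min_ttl);
}
```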

PR: 128790
Submitted by: Nick Hilliard <nick at foobar dot org>
MFC after: 3 weeks


186708 03-Jan-2009 qingli

Some modules such as SCTP supply a valid route entry as an input argument
to ip_output(). The destination is represented in a sockaddr{} object
that may contain other pieces of information, e.g., port number. This
same destination sockaddr{} object may be passed into L2 code, which
could be used to create a L2 entry. Since there exists a L2 table per
address family, the L2 lookup function can make address family specific
comparison instead of the generic bcmp() operation over the entire
sockaddr{} structure.
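
The difference between the two comparisons can be sketched in userland
for the IPv4 case. The helper name sin_addr_equal() is made up; the real
lookup functions are not reproduced here.

```c
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/*
 * Address-family specific comparison: only the IPv4 address matters,
 * so a port number left in the sockaddr does not cause a mismatch.
 */
static bool
sin_addr_equal(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
	return (a->sin_addr.s_addr == b->sin_addr.s_addr);
}
```

Two sockaddr_in objects for the same host but with different sin_port
values compare unequal under a whole-structure memcmp()/bcmp(), while
sin_addr_equal() treats them as the same destination.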

Note in the IPv6 case the sin6_scope_id is not compared because the
address is currently stored in the embedded form inside the kernel.
The in6_lltable_lookup() has to account for the scope-id if this
storage format were to change in the future.


186544 28-Dec-2008 bz

For consistency use LLE_IS_VALID() in this 4th place that is actually
interested in the (void *)-1 return value hack.
This way we can easily identify those special parts of the code.


186500 26-Dec-2008 qingli

This checkin addresses a couple of issues:
1. The "route" command allows route insertion through the interface-direct
option "-iface". During if_attach(), an sockaddr_dl{} entry is created
for the interface and is part of the interface address list. This
sockaddr_dl{} entry describes the interface in detail. The "route"
command selects this entry as the "gateway" object when the "-iface"
option is present. The "arp" and "ndp" commands also interact with the
kernel through the routing socket when adding and removing static L2
entries. The static L2 information is also provided through the
"gateway" object with an AF_LINK family type, similar to what is
provided by the "route" command. In order to differentiate between
these two types of operations, a RTF_LLDATA flag is introduced. This
flag is set by the "arp" and "ndp" commands when issuing the add and
delete commands. This flag is also set in each L2 entry returned by the
kernel. The "arp" and "ndp" commands follow a convention where a RTM_GET
is issued first followed by a RTM_ADD/DELETE. This RTM_GET request fills
in the fields for a "rtm" object, which is reinjected into the kernel by
a subsequent RTM_ADD/DELETE command. The entry returned from RTM_GET
is a prefix route, so the RTF_LLDATA flag must be specified when issuing
the RTM_ADD/DELETE messages.

2. Enforce the convention that NET_RT_FLAGS with a 0 w_arg is the
specification for retrieving L2 information. Also optimized the
code logic.

Reviewed by: julian


186474 24-Dec-2008 kmacy

Fix missed unlock and reference drop of lle

Found by: pho


186437 23-Dec-2008 bz

Remove long unused netinet/ipprotosw.h (basically since r82884).

Discussed with: rwatson
MFC after: 4 weeks


186411 23-Dec-2008 qingli

Don't create a bogus ARP entry for 0.0.0.0.


186317 19-Dec-2008 qingli

The proxy-arp code was broken and responds to ARP
requests for addresses that are not proxied locally.


186223 17-Dec-2008 bz

Another step assimilating IPv[46] PCB code:
normalize IN6P_* compat flags usage to their equivalent
INP_* counterpart.

Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


186222 17-Dec-2008 bz

Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks


186200 17-Dec-2008 kmacy

default to doing lla_lookup with shared afdata lock and returning a
shared lock on the lle - thus restoring parallel performance to
pre-arpv2 level


186180 16-Dec-2008 rwatson

IPFW's pfil hook/unhook code ignores the return values of pfil_add_hook()
and pfil_remove_hook(), so cast them to (void).

MFC after: pretty soon


186178 16-Dec-2008 kmacy

ipfw doesn't use the radix node head lock to protect the radix tree - remove acquisition


186164 16-Dec-2008 kmacy

check pointer against NULL
add new line after declaration for style


186161 16-Dec-2008 kmacy

don't unlock lle if it is NULL


186150 16-Dec-2008 kmacy

unlock and destroy an llentry's lock before freeing

Found by: sam


186141 15-Dec-2008 bz

Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)


186119 15-Dec-2008 qingli

The main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as many locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplifying the logic in the routing code.

The most notable end result is the obsolescence of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improve and refactor the code, and
provided valuable reviews
- Julian Elischer set up the perforce tree for me and has helped
me maintain that branch before the svn conversion


186086 14-Dec-2008 bz

Add a check, that is currently under discussion for 8 but that we need
to keep for 7-STABLE when MFCing in_pcbladdr() to not change the
behaviour there.

With this a destination route via a loopback interface is treated as
a valid and reachable thing for IPv4 source address selection, even
though nothing of that network is ever directly reachable, but it is
more like a blackhole route.
With this the source address will be selected and IPsec can grab the
packets before we would discard them at a later point, encapsulate them
and send them out from a different tunnel endpoint IP.

Discussed on: net
Reported by: Frank Behrens <frank@harz.behrens.de>
Tested by: Frank Behrens <frank@harz.behrens.de>
MFC after: 4 weeks (just so that I get the mail)


186057 13-Dec-2008 bz

De-virtualize the MD5 context for TCP initial seq number generation
and make it a function local variable like we do almost everywhere
inside the kernel.

Discussed with: rwatson, silby
MFC after: 4 weeks


186054 13-Dec-2008 kmacy

version that will compile


186053 13-Dec-2008 kmacy

radix node head lock needs to be held when calling rnh_addaddr


186052 13-Dec-2008 kmacy

don't acquire lock recursively


186048 13-Dec-2008 bz

Second round of putting global variables, which were virtualized
but formerly missed under VIMAGE_GLOBAL.

Put the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themselves are already.
This will help by the time when we are going to remove the globals
entirely.

Sponsored by: The FreeBSD Foundation


185937 11-Dec-2008 bz

Put global variables, which were virtualized but formerly
missed under VIMAGE_GLOBAL.

Start putting the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themselves are already.
This will help by the time when we are going to remove the globals
entirely.

While there garbage collect a few dead externs from ip6_var.h.

Sponsored by: The FreeBSD Foundation


185934 11-Dec-2008 bz

Use the correct INIT_VNET_INET() as the virtualized variable here
are in vinet.h not in vinet6.h

Sponsored by: The FreeBSD Foundation


185895 10-Dec-2008 zec

Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instantiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


185858 10-Dec-2008 rwatson

Remove inconsistent white space from in_pcballoc().

MFC after: pretty soon


185857 10-Dec-2008 rwatson

Move syncache flag definitions below data structure, compress some vertical
whitespace.

MFC after: pretty soon


185855 10-Dec-2008 rwatson

Move flag definitions for t_flags and t_oobflags below the definition of
struct tcpcb so that the structure definition is a bit more vertically
compact. Can't yet fit it on one printed page, though.

MFC after: pretty soon


185845 10-Dec-2008 kmacy

unlock when done


185844 10-Dec-2008 kmacy

don't reference if_addr_mtx directly


185813 10-Dec-2008 rwatson

Update comment on INP_TIMEWAIT to say what it's about, as we caution
regarding the misplacement of flags in inp_vflag in an earlier comment.

MFC after: pretty soon


185795 09-Dec-2008 rwatson

Enhance one comment relating to recent TCP locking changes, and fix a
typo in another.

MFC after: 6 weeks


185791 09-Dec-2008 rwatson

Move macros defining flags and shortcuts to nested structure fields in
inpcbinfo below the structure definition in order to make inpcbinfo
fit on a single printed page; related style tweaks.

MFC after: pretty soon


185775 08-Dec-2008 rwatson

Move from solely write-locking the global tcbinfo in tcp_input()
to read-locking in the TCP input path, allowing greater TCP
input parallelism where multiple ithreads or ithread and netisr
are able to run in parallel. Previously, most TCP input paths
held a write lock on the global tcbinfo lock, effectively
serializing TCP input.

Before looking up the connection, acquire a write lock if a
potentially state-changing flag is set on the TCP segment header
(FIN, RST, SYN), and otherwise a read lock. We may later have
to upgrade to a write lock in certain cases (ACKs received by the
syncache or during TIMEWAIT) in order to support global state
transitions, but this is never required for steady-state packets.

Upgrading from a read lock to a write lock must be done as a
trylock operation to avoid deadlocks, and actually violates the
lock order as the tcbinfo lock precedes the inpcb lock held at
the time of upgrade. If the trylock fails, we bump the refcount
on the inpcb, drop both locks, and re-acquire in-order. If
another thread has freed the connection while the locks are
dropped, we free the inpcb and repeat the lookup (this should
hardly ever or never happen in practice).

For now, maintain a number of new counters measuring how many
times various cases execute, and in particular whether various
optimistic assumptions about when read locks can be used, whether
upgrades are done using the fast path, and whether connections
close in practice in the above-described race, actually occur.

MFC after: 6 weeks
Discussed with: kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
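
The lock-selection rule described above can be sketched in isolation. This is only an illustration under stated assumptions: the SK_ names, the enum, and both helper functions are hypothetical and are not the kernel's code; only the flag bit values follow the TCP header layout. The sketch models which lock mode is chosen before the connection lookup and when a later upgrade would be needed, without taking any real locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (hypothetical SK_ names, standard bit values). */
#define SK_TH_FIN 0x01
#define SK_TH_SYN 0x02
#define SK_TH_RST 0x04
#define SK_TH_ACK 0x10

/* Hypothetical model of the two ways the global tcbinfo lock can be held. */
enum sk_lock_mode { SK_LOCK_READ, SK_LOCK_WRITE };

/*
 * Choose the initial lock mode from the segment flags: segments that may
 * change global state (FIN, RST, SYN) take the write lock up front, all
 * other segments start with a read lock.
 */
enum sk_lock_mode
sk_initial_lock_mode(uint8_t th_flags)
{
    if (th_flags & (SK_TH_FIN | SK_TH_SYN | SK_TH_RST))
        return (SK_LOCK_WRITE);
    return (SK_LOCK_READ);
}

/*
 * Report whether a path that started with a read lock must upgrade,
 * for example when an ACK turns out to need a global state transition.
 */
bool
sk_needs_upgrade(enum sk_lock_mode held, bool global_state_change)
{
    assert(held == SK_LOCK_READ || held == SK_LOCK_WRITE);
    return (held == SK_LOCK_READ && global_state_change);
}
```

In this model a plain ACK starts read-locked and only upgrades when a global state change is required; the trylock-and-retry fallback itself is not modelled here.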


185773 08-Dec-2008 rwatson

Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or in_pcbrele(). Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.

Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfree() is now a simple wrapper around in_pcbrele().

MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
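
A minimal user-space sketch of the reference-counting pattern described above, assuming a simplified structure and no locking. All sk_ names are hypothetical stand-ins, not the kernel functions, and the polarity of the returned flag (true means the pointer is still valid) is this sketch's own choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct inpcb. */
struct sk_inpcb {
    unsigned int refcount;
};

/* Allocate a pcb holding the single reference owned by its creator. */
struct sk_inpcb *
sk_pcballoc(void)
{
    struct sk_inpcb *inp = calloc(1, sizeof(*inp));

    if (inp != NULL)
        inp->refcount = 1;
    return (inp);
}

/* Take an extra reference so the pcb survives while locks are dropped. */
void
sk_pcbref(struct sk_inpcb *inp)
{
    assert(inp->refcount > 0);
    inp->refcount++;
}

/*
 * Drop one reference. Returns true if the pcb is still valid, false if
 * this was the last reference and the pcb has been freed.
 */
bool
sk_pcbrele(struct sk_inpcb *inp)
{
    assert(inp->refcount > 0);
    if (--inp->refcount > 0)
        return (true);
    free(inp);
    return (false);
}

/* Freeing is just releasing the creator's reference. */
void
sk_pcbfree(struct sk_inpcb *inp)
{
    (void)sk_pcbrele(inp);
}
```

A caller that must drop its locks would call sk_pcbref() first, re-acquire the locks in order, and then use the result of sk_pcbrele() to decide whether the pointer may still be used.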


185713 06-Dec-2008 csjp

in_rtalloc1(9) returns a locked route, so make sure that we use
RTFREE_LOCKED() here. This macro makes sure the reference count
on the route is being managed properly. This eliminates another
case which results in the following message being printed to the
console:

rtfree: 0xc841ee88 has 1 refs

Reviewed by: bz
MFC after: 2 weeks


185694 06-Dec-2008 rrs

Code from the hack-session known as the IETF (and a
bit of debugging afterwards):
- Fix protection code for notification generation.
- Decouple associd from vtag
- Allow vtags to have less stringent requirements in non-uniqueness.
o don't pre-hash them when you issue one in a cookie.
o Allow duplicates and use addresses and ports to
discriminate amongst the duplicates during lookup.
- Add support for the NAT draft draft-ietf-behave-sctpnat-00, this
is still experimental and needs more extensive testing with the
Jason Butt ipfw changes.
- Support for the SENDER_DRY event to get DTLS in OpenSSL working
with a set of patches from Michael Tuexen (hopefully heading to OpenSSL soon).
- Update the support of SCTP-AUTH by Peter Lei.
- Use macros for refcounting.
- Fix MTU for UDP encapsulation.
- Fix reporting back of unsent data.
- Update assoc send counter handling to be consistent with endpoint sent counter.
- Fix a bug in PR-SCTP.
- Fix so we only send another FWD-TSN when a SACK arrives IF and only
if the adv-peer-ack point progressed. However we still make sure
a timer is running if we do have an adv_peer_ack point.
- Fix PR-SCTP bug where chunks were retransmitted if they are sent
unreliable but not abandoned yet.

With the help of: Michael Tuexen and Peter Lei :-)
MFC after: 4 weeks


185636 05-Dec-2008 glebius

In a case of CARP status change run through the if_link_state_change()
routine, so that devd(8) and others are notified about link state change.


185571 02-Dec-2008 bz

Rather than using hidden includes (with circular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation


185435 29-Nov-2008 bz

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


185420 28-Nov-2008 zec

Add an essential .h file that skipped from the last commit (r185419).

Pointy hat #1 on...

Pointed out by: bz


185419 28-Nov-2008 zec

Unhide declarations of network stack virtualization structs from
underneath #ifdef VIMAGE blocks.

This change introduces some churn in #include ordering and nesting
throughout the network stack and drivers but is not expected to cause
any additional issues.

In the next step this will allow us to instantiate the virtualization
container structures and switch from using global variables to their
"containerized" counterparts.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


185382 28-Nov-2008 des

missing V_


185371 27-Nov-2008 bz

Replace most INP_CHECK_SOCKAF() uses checking if it is an
IPv6 socket by comparing a constant inp vflag.
This is expected to help to reduce extra locking.

Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks


185370 27-Nov-2008 bz

Merge in6_pcbfree() into in_pcbfree() which after the previous
IPsec change in r185366 only differed in two additional IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.

Reviewed by: rwatson (as part of a larger changeset)
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.


185366 27-Nov-2008 bz

Unify ipsec[46]_delete_pcbpolicy in ipsec_delete_pcbpolicy.
Ignoring different names because of macros (in6pcb, in6p_sp) and
inp vs. in6p variable name both functions were entirely identical.

Reviewed by: rwatson (as part of a larger changeset)
MFC after: 6 weeks (*)
(*) possibly need to leave stub wrappers in 7 to keep the symbols.


185348 26-Nov-2008 zec

Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


185344 26-Nov-2008 bz

Remove in6_pcbdetach() as it is exactly the same function
as in_pcbdetach() and we don't need the code twice.

Reviewed by: rwatson
MFC after: 6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.


185333 26-Nov-2008 bz

Unify the v4 and v6 versions of pcbdetach and pcbfree as good
as possible so that they are easily diffable.

No functional changes.

Reviewed by: rwatson
MFC after: 6 weeks


185101 19-Nov-2008 julian

Fix a scope problem in the multiple routing table code that stopped the
SO_SETFIB socket option from working correctly.

Obtained from: Ironport
MFC after: 3 days


185088 19-Nov-2008 zec

Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instantiation,
assign initial values to them in initializer functions. As a rule,
initialization at instantiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentially, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methodology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to initialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
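
The motivation for initializer functions can be illustrated with a small sketch. The container type, its field names, and the initial values below are hypothetical and chosen only for illustration; the point is that values assigned in a function can be applied to any number of container instances, whereas a static initializer at instantiation only ever covers one global.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container holding two formerly global variables. */
struct sk_vnet_inet {
    int _ip_defttl;
    int _ipforwarding;
};

/*
 * Assign initial values in an initializer function instead of at
 * instantiation, so every container instance starts from the same state.
 */
void
sk_vnet_inet_init(struct sk_vnet_inet *vi)
{
    assert(vi != NULL);
    vi->_ip_defttl = 64;     /* illustrative initial value */
    vi->_ipforwarding = 0;   /* illustrative initial value */
}
```

Two instances initialized this way are independent: changing a field in one leaves the other untouched.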


184883 12-Nov-2008 rrs

-Improvement: Add '\n' on debug output in sctp_lower_sosend().
-Improvement: panic() on INVARIANTS kernels if memory allocation
fails for a tagblock in sctp_add_vtag_to_timewait().
-Bugfix: Protect code in sctp_is_in_timewait() by
SCTP_INP_INFO_WLOCK/SCTP_INP_INFO_WUNLOCK.
-Cleanup: Get rid of unused variable now in sctp_init_asoc().
-Bugfix: Reuse the correct vtag in sctp_add_vtag_to_timewait().
-Cleanup: Get rid of unused constant SCTP_TIME_WAIT_SHORT
in sctp_constants.h.
-Improvement: Use all hash buckets of the vtag hash table.
-Cleanup: Get rid of then unused constant SCTP_STACK_VTAG_HASH_SIZE_A.
-Bugfix: Handle SHUTDOWN;SACK packet correctly.
-Bugfix: Last TSN in a gap ack block was not being "ack'd"
in the internal scoreboard.
Obtained from: (with help from Michael Tuexen)


184797 09-Nov-2008 bz

For consistency work on the local object passed into the function for the
lock operation instead using the global name.

Submitted by: ganbold
MFC after: 2 months


184731 06-Nov-2008 bz

Fix typo and while here another one.

Reviewed by: keramida
Reported by: keramida
MFC after: 2 months (with r184720)


184722 06-Nov-2008 bz

Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if we are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which led to an endless loop between
tcp_output() and tcp_mtudisc() until we overflowed the stack.

Reported by: kmacy
MFC after: 2 months (along with r182851)


184721 06-Nov-2008 bz

Adapt the comment for tcp_maxmtu(); we are returning a number
not a pointer. While here update the rest of the comment to
better match what we have these days.

MFC after: 2 months


184720 06-Nov-2008 bz

Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

In case we return early and got a metricptr to pass the hostcache
info back to the caller we need to initialize the data to a defined
state (zero it) as tcp_hc_get() would do if there was no hit.
Without that the caller would check on random stack garbage which
could lead to undefined results.

This only affected tcp_mss() if there was no routing entry for the peer,
tcp_mtudisc() was not affected.

MFC after: 2 months (along with r182851)
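
The class of bug fixed here, an output parameter left untouched on an early return, can be sketched as follows. The structure, its fields, the function name, and the placeholder values are all hypothetical; this is not the tcp_mss_update() code, only the pattern of zeroing the caller's buffer before returning early.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the hostcache metrics structure. */
struct sk_hc_metrics {
    unsigned long rmx_mtu;
    unsigned long rmx_rtt;
};

/*
 * Returns 1 if a route was found and metrics were filled in, 0 on the
 * early-return path. In both cases *metricptr is left in a defined state.
 */
int
sk_mss_lookup(int have_route, struct sk_hc_metrics *metricptr)
{
    if (!have_route) {
        /* Early return: zero the buffer so the caller never reads garbage. */
        if (metricptr != NULL)
            memset(metricptr, 0, sizeof(*metricptr));
        return (0);
    }
    if (metricptr != NULL) {
        metricptr->rmx_mtu = 1500;   /* placeholder value */
        metricptr->rmx_rtt = 0;      /* placeholder value */
    }
    return (1);
}
```

Without the memset() on the early-return path, a caller passing a stack-allocated structure would inspect whatever bytes happened to be there.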


184414 28-Oct-2008 oleg

Type of q_time (start of queue idle time) has changed: uint32_t -> uint64_t.
This should fix q_time overflow, which happens after 2^32/(86400*hz) days of
uptime (~50days for hz = 1000).
q_time overflow causes the following:
- traffic shaping may not work in 'fast' mode (not enabled by default).
- incorrect average queue length calculation in RED/GRED algorithm.

NB: due to ABI change this change is not applicable to stable.

PR: kern/128401
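
The uptime figure quoted above follows directly from the counter width. A short sketch of the arithmetic, assuming the timestamp counts ticks at hz per second; the function names are hypothetical:

```c
#include <stdint.h>

/* Days of uptime before a 32-bit tick counter wraps at hz ticks/second. */
double
sk_days_until_wrap32(unsigned int hz)
{
    return (4294967296.0 / (86400.0 * (double)hz));
}

/* A 32-bit timestamp silently wraps to a small value on overflow ... */
uint32_t
sk_tick32_advance(uint32_t now, uint32_t delta)
{
    return (now + delta);
}

/* ... while a 64-bit timestamp keeps counting past 2^32. */
uint64_t
sk_tick64_advance(uint64_t now, uint64_t delta)
{
    return (now + delta);
}
```

For hz = 1000 the first function gives roughly 49.7 days, which matches the "~50 days" in the commit message.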


184340 27-Oct-2008 rrs

More issues with pre-blocking:
a) Need for EEOR mode to take the min of the socket buffer size and the
add more threshold, otherwise if you are so silly as to set a send
buf size less than the add-more you could block forever in eeor mode.

b) We were incorrectly using the sysctl vs the calculated value. This
causes us to block forever if the addmore threshold is larger than
the socket buffer size.


184336 27-Oct-2008 rrs

Two inter-related bugs.
- If we send EXACTLY the size left in the send buffer
and then send again, we end up with exactly 0 bytes and
don't hit the pre-block code to wait for more space.
- If we fall into the loop with our max_len == 0 (the bug
above) we then call in to copy out the data, setup the length
of the waiting to transmit data to 0 and call the mbuf copy routine
which 0 indicates copy all the data to the mbuf chain.. which it
does. This then leaves a "stuck" message on the stream queue with
its size exactly 0 bytes but all the data there and thus nothing
left in the uio structure. We then reach a stuck forever state
never being able to send data.


184334 27-Oct-2008 rrs

Get rid of ifdef for vimage on version 8 comparison. Now the
scrubbing program properly takes care of this.


184333 27-Oct-2008 rrs

Invariants changes that make more sense.


184304 26-Oct-2008 rwatson

In both dropwithreset paths in tcp_input.c, drop the tcbinfo lock
sooner to decomplicate locking and eliminate the need for a rather
chatty comment about why we have to handle the global lock in a
special way for the benefit of ipfw and pf cred rules.

MFC after: 3 days


184298 26-Oct-2008 rwatson

Remove endearing but syntactically unnecessary "return;" statements
directly before the final closing brackets of some TCP functions.

MFC after: 3 days


184295 26-Oct-2008 bz

Style changes only:
- Consistently add parentheses to return statements.
- Use NULL instead of 0 when comparing pointers, also avoiding
unnecessary casts.
- Do not use pointers as booleans.

Reviewed by: rwatson (earlier version)
MFC after: 2 months


184214 23-Oct-2008 des

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


184205 23-Oct-2008 des

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


184097 20-Oct-2008 bz

Update a comment which to my reading had been misplaced in rev. 1.12
already (but probably had been way above as the code was there twice)
and describe what was last changed in rev. 1.199 there (which now is
in sync with in6_src.c r184096).

Pointed at by: mlaier
MFC after: 2 months


184096 20-Oct-2008 bz

Bring over the change switching from using sequential to random
ephemeral port allocation as implemented in netinet/in_pcb.c rev. 1.143
(initially from OpenBSD) and follow-up commits during the last four and
a half years including rev. 1.157, 1.162 and 1.199.
This now is relying on the same infrastructure as has been implemented
in in_pcb.c since rev. 1.199.

Reviewed by: silby, rpaulo, mlaier
MFC after: 2 months


184031 18-Oct-2008 rrs

The flags value was not always being copied out in the recv routine like it
should be.
Obtained from: Michael Tuexen


184030 18-Oct-2008 rrs

New sockets (accepted) were not inheriting the proper snd/rcv buffer value.

Obtained from: Michael Tuexen


184029 18-Oct-2008 rrs

- Peers rwnd is now available for the MIB.
Obtained from: Michael Tuexen


184028 18-Oct-2008 rrs

- Adapt layer indication was always being given (it should only
be given when the user has enabled it). (Michael Tuexen)
- Sack Immediately was not being set properly on the actual chunk, it
was only put in the rcvd_flags which is incorrect. (Michael Tuexen)
- added an ifndef userspace to one of the already present macro's for
inet (Brad Penoff)
Obtained from: Michael Tuexen and Brad Penoff
MFC after: 4 weeks


184027 18-Oct-2008 rrs

Reported by Yehuda Weinraub (yehudasa@gamil.com) - CRC32C algorithm
uses incorrect init_bytes value. It SHOULD have the number
of bytes to get to a 4 byte boundary.

PR: 128134
MFC after: 4 weeks
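
The quantity described above, the number of leading bytes to process before the buffer reaches a 4-byte boundary, can be computed as in the sketch below. The function name is hypothetical, and clamping the result to the buffer length is an assumption of this sketch rather than something stated in the commit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes to consume one at a time before p is 4-byte aligned,
 * limited to len so short buffers are never overrun.
 */
size_t
sk_bytes_to_align4(const void *p, size_t len)
{
    size_t pad = (4 - ((uintptr_t)p & 3)) & 3;

    return (pad < len ? pad : len);
}
```

An already aligned pointer yields 0, and a pointer one byte past an aligned address yields 3.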


183982 17-Oct-2008 bz

Add cr_canseeinpcb() doing checks using the cached socket
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.

Reviewed by: rwatson
MFC after: 3 months (set timer; decide then)


183954 16-Oct-2008 zec

Remove a useless global static variable.

Approved by: bz (ad-hoc mentor)


183887 14-Oct-2008 maxim

o Remove unnecessary parentheses and restore indentation.

Prodded by: mlaier


183881 14-Oct-2008 maxim

o Reformat ipfw nat get|setsockopt code to look more
style(9) compliant. No functional changes.


183744 10-Oct-2008 rwatson

Fix content and spelling of comment on _ipfw_insn.len -- a count of
32-bit words, not 32-byte words.

MFC after: 3 days


183662 07-Oct-2008 rwatson

Don't pass curthread to sbreserve_locked() in tcp_do_segment(), as the
netisr or ithread's socket buffer size limit is not the right limit to
use. Instead, pass NULL as the other two calls to sbreserve_locked()
in the TCP input path (tcp_mss()) do.

In practice, this is a no-op, as ithreads and the netisr run without a
process limit on socket buffer use, and a NULL thread pointer leads to
not using the process's limit, if any. However, if tcp_input() is
called in other contexts that do have limits, this may prevent the
incorrect limit from being used.

MFC after: 3 days


183610 04-Oct-2008 bz

Remove an INP_RUNLOCK() missed in SVN r183606, cvs rev. 1.195 raw_ip.c
when transitioning from so_cred to inp_cred.

MFC after: 6 weeks


183606 04-Oct-2008 bz

Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred


183571 03-Oct-2008 bz

Implement IPv4 source address selection for unbound sockets.

For the jail case we are already looping over the interface addresses
before falling back to the only IP address of a jail in case of no
match. This is in preparation for the upcoming multi-IPv4/v6/no-IP
jail patch this change was developed with initially.

This also changes the semantics of selecting the IP for processes within
a jail as it now uses the same logic as outside the jail (with additional
checks) but no longer is on a mutually exclusive code path.

Benchmarks had shown no difference at 95.0% confidence for either the
plain nor the jail case (even with the additional overhead). See:
http://lists.freebsd.org/pipermail/freebsd-net/2008-September/019531.html

Inspired by a patch from: Yahoo! (partially)
Tested by: latest multi-IP jail patch users (implicitly)
Discussed with: rwatson (general things around this)
Reviewed by: mostly silence (feedback from bms)
Help with benchmarking from: kris
MFC after: 2 months


183550 02-Oct-2008 zec

Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparison between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation


183461 29-Sep-2008 rwatson

Expand comments relating various detach/free/drop inpcb routines.

MFC after: 3 days


183460 29-Sep-2008 rwatson

Fix typo in comment.

MFC after: 3 days


183418 27-Sep-2008 rwatson

When an inpcb doesn't have a socket but the inpcb is passed to ipfw
in the transmit path, such as TCPS_TIMEWAIT, fail the credential
extraction immediately rather than acquiring locks and looking up
the inpcb on the global lists in order to reach the conclusion that
the credential extraction has failed.

This is more efficient, but more importantly, it avoids lock
recursion on the inpcbinfo, which is no longer allowed with rwlocks.
This appears to have been responsible for at least two reported
panics.

MFC after: 3 days
Reported by: ganbold


183398 27-Sep-2008 rwatson

Rather than shadowing global variable 'lookup' in check_uidgid(), rename
it to ugid_lookupp. This should make debugging issues with ipfw uid
rules easier.

MFC after: 3 days


183388 26-Sep-2008 emaste

Move CTASSERT from header file to source file, per implementation note now
in the CTASSERT man page.

Submitted by: Ryan Stone


183356 25-Sep-2008 rwatson

As a follow-on to r183323, correct another case where ip_output() was
called without an inpcb pointer despite holding the tcbinfo global
lock, which led to a deadlock or panic when ipfw tried to further
acquire it recursively.

Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days


183323 24-Sep-2008 rwatson

When dropping a packet and issuing a reset during TCP segment handling,
unconditionally drop the tcbinfo lock (after all, we assert it lines
before), but call tcp_dropwithreset() under both inpcb and inpcbinfo
locks only if we pass in a tcpcb. Otherwise, if the pointer is NULL,
firewall code may later recurse the global tcbinfo lock trying to look
up an inpcb.

This is an instance where a layering violation leads not only
potentially to code reentrance and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden. When these locks were mutexes, they simply recursed.

Reported by: Stefan Ehmann <shoesoft at gmx dot net>
MFC after: 3 days


183240 21-Sep-2008 rik

Export IPFW_TABLES_MAX value for compiled in defaults.


183015 14-Sep-2008 rik

Export IPFW_TABLES_MAX via sysctl. Part of PR: 127058.

PR: 127058


183014 14-Sep-2008 julian

oops commit the version that compiles


183013 14-Sep-2008 julian

Revert a part of the MRT commit that proved un-needed.
rt_check() in its original form proved to be sufficient and
rt_check_fib() can go away (as can its evil twin in_rt_check()).

I believe this does NOT address the crashes people have been seeing
in rt_check.

MFC after: 1 week


183012 14-Sep-2008 rik

Make the comment for the default rule number more clear.

Submitted by: yar@


183001 13-Sep-2008 bz

Implement IPv6 support for TCP MD5 Signature Option (RFC 2385)
the same way it has been implemented for IPv4.

Reviewed by: bms (skimmed)
Tested by: Nick Hilliard (nick netability.ie) (with more changes)
MFC after: 2 months


182885 09-Sep-2008 bz

Work around an integer division resulting in 0 and thus the
congestion window not being incremented, if cwnd > maxseg^2.
As suggested in RFC2581 increment the cwnd by 1 in this case.

See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.

Submitted by: Alana Huebner, Lawrence Stewart,
Grenville Armitage (caia.swin.edu.au)
Reviewed by: dwmalone, gnn, rpaulo
MFC After: 3 days
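
The arithmetic behind this fix can be shown with a small sketch. It assumes the congestion-avoidance increment is computed as maxseg * maxseg / cwnd, as the message implies; the function name and the 64-bit intermediate are choices of this sketch, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-ACK congestion window increment during congestion avoidance.
 * Integer division truncates to 0 once cwnd exceeds maxseg squared,
 * so the result is floored at 1 byte to keep the window growing.
 */
uint32_t
sk_cwnd_ca_increment(uint32_t cwnd, uint32_t maxseg)
{
    uint32_t incr;

    assert(cwnd > 0);
    incr = (uint32_t)(((uint64_t)maxseg * maxseg) / cwnd);
    if (incr == 0)
        incr = 1;
    return (incr);
}
```

With maxseg = 1460 the unpatched expression drops to 0 for any cwnd above 2131600 bytes, which is the stall the commit works around.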


182855 07-Sep-2008 bz

To my reading there are no real consumers of ip6_plen (IPv6
Payload Length) as set in tcpip_fillheaders().
ip6_output() will calculate it based on the length from the
mbuf packet header itself.
So initialize the value in tcpip_fillheaders() in correct
(network) byte order.

With the above change, to my reading, all places calling tcp_trace()
pass in the ip6 header via ipgen as serialized in the mbuf and with
ip6_plen in network byte order.
Thus convert the IPv6 payload length to host byte order before printing.

MFC after: 2 months
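
The byte-order handling described above amounts to one conversion when the field is stored and one when it is printed. A minimal sketch, assuming a reduced header with only the payload length field; the struct and function names are hypothetical:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical reduced header: only the payload length field is modelled. */
struct sk_ip6_hdr {
    uint16_t plen;   /* stored in network byte order */
};

/* Store the payload length the way it appears on the wire. */
void
sk_set_plen(struct sk_ip6_hdr *h, uint16_t payload_len)
{
    h->plen = htons(payload_len);
}

/* Convert back to host byte order before printing or comparing. */
uint16_t
sk_plen_for_print(const struct sk_ip6_hdr *h)
{
    return (ntohs(h->plen));
}
```

Printing h->plen directly on a little-endian host would show a byte-swapped value, which is the tracing issue the second half of the commit addresses.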


182851 07-Sep-2008 bz

Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calculate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR: kern/118455
Reviewed by: silby (back in March)
MFC after: 2 months


182848 07-Sep-2008 bz

V_irtualize SVN r182846 tcp_mssdflt/tcp_v6mssdflt procedure based
sysctl implementations for VIMAGE the same way we did elsewhere:
update the implementation but leave the globals and the SYSCTL
statement untouched.


182846 07-Sep-2008 bz

Convert SYSCTL_INTs for tcp_mssdflt and tcp_v6mssdflt to
SYSCTL_PROCs and check that the default mss for neither v4 nor
v6 goes below the minimum MSS constant (216).

This prevents people from shooting themselves in the foot.

PR: kern/118455 (remotely related)
Reviewed by: silby (as part of a larger patch in March)
MFC after: 2 months
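
The validation that a procedure-based sysctl makes possible can be sketched in user space as below. The function and macro names are hypothetical and the real handler's signature is not reproduced; only the minimum of 216 comes from the commit message, and returning EINVAL for a rejected value is an assumption of this sketch.

```c
#include <errno.h>

/* Minimum MSS constant named in the commit message. */
#define SK_TCP_MINMSS 216

/*
 * Accept a new default MSS only if it is not below the minimum.
 * On rejection the current value is left unchanged.
 */
int
sk_set_mssdflt(int *current, int requested)
{
    if (requested < SK_TCP_MINMSS)
        return (EINVAL);
    *current = requested;
    return (0);
}
```

A plain integer sysctl has no place for such a check, which is why the conversion to a handler function is needed.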


182841 07-Sep-2008 bz

Add a second KASSERT checking for len >= 0 in the tcp output path.

This is different to the first one (as len gets updated between those
two) and would have caught various edge cases (read bugs) at a well
defined place I had been debugging the last months instead of
triggering (random) panics further down the call graph.

MFC after: 2 months


182818 06-Sep-2008 rik

Export the IPFW_DEFAULT_RULE outside ip_fw2.c. This number is not only
the default rule number but also the maximum rule number. User space
software such as ipfw and natd should be aware of its value. The
software that already includes ip_fw.h should use the defined value. All
others are expected to use sysctl (as discussed on net@).

MFC after: 5 days.
Discussed on: net@


182775 05-Sep-2008 keramida

Slightly reword comment and remove typos.


182733 03-Sep-2008 julian

whitespace nit


182633 01-Sep-2008 brooks

Wrap an 81 column SYSCTL_NODE declaration.

Obtained from: //depot/projects/vimage-commit2/...


182591 01-Sep-2008 kmacy

Don't check if an interface can do tcp offload if there are no offload devices registered on the system.

Suggested by: rwatson
MFC after: 3 days


182563 31-Aug-2008 julian

fix tiny nit in comment


182488 30-Aug-2008 csjp

Improve the entropy of the source port randomization for network address
translation. It turns out this is useful for applications which require
source port randomization for security (e.g. DNS servers).

Discussed with: secteam
Requested by: mlaier
MFC after: 2 weeks


182463 29-Aug-2008 gnn

Fix a bug whereby multicast packets that are looped back locally
wind up with the incorrect checksum on the wire when transmitted via
devices that do checksum offloading.

PR: kern/119635
Reviewed by: rwatson
MFC after: 5 days


182411 28-Aug-2008 rpaulo

Fix typo in comment.


182405 28-Aug-2008 rrs

OK, make the function non-static and put its declaration in the .h so
that when we compile with INVARIANTS the compiler will not
complain about the function being unused. Hmm, maybe I should have
made it #ifndef INVARIANTS.


182403 28-Aug-2008 rrs

Fixes compile error when INVARIANTs is on. Adds an
empty goto to keep the compiler happy.


182367 28-Aug-2008 rrs

- Make strict-sacks be the default.
- Change it so that without INVARIANTs there are
no panics in SCTP.
- sctp_timer changes so that we have a recovery mechanism
when the sent list is out of order.


182311 27-Aug-2008 csjp

Fix a panic in MAC kernels that was a result of un-initialized label
storage. We can safely remove the label copying operations since
M_MOVE_PKTHDR will move the mbuf tags (which contain MAC labels) to
the destination mbuf.

MFC after: 1 week
Discussed with: rwatson


182268 27-Aug-2008 rrs

- When we close a socket with pending assoc's that are still
shutting down, NULL out the socket pointer so we won't
ever refer to a dead socket.

Obtained from: Neil Wilson


182148 25-Aug-2008 julian

Another missed V_ instance


182146 25-Aug-2008 julian

Another V_ forgotten


182145 25-Aug-2008 julian

We left out V_static_len from ip_fw2.c
(also a whitespace diff that I'd rather fix here than break in the
vimage branch.)


182129 25-Aug-2008 julian

Move some struct defs around. This is a prep step for Vimage.
No real effect of this at this time.


182114 24-Aug-2008 bz

Make the kernel compile with SCTP and SCTP_DEBUG but
no INET6 defined.


182089 24-Aug-2008 kmacy

Don't calculate checksum if it has already been validated

Obtained from: Chelsio Inc.
MFC after: 3 days


182056 23-Aug-2008 bz

Cache the cred locally in _syncache_add() while holding the locks, so
we can be sure that it's valid.
In case we abort early, free it again; otherwise put it into the syncache.

We need the cred in the syncache to be able to restrict what will be
exported by the sysctl helper function syncache_pcblist() (to netstat)
within jails.

PR: kern/126493
Reviewed by: rwatson (earlier versions)
MFC after: 3 days


182045 23-Aug-2008 bz

Add an explicit comment why we NULLify the two variables.

Reviewed by: rwatson
MFC after: 3 days


181966 21-Aug-2008 rwatson

Remove comments and #ifdef notyet'd code relating to directly dispatching
the IP multicast input code from the output path; we don't allow
reentrance of the input path from the IP output path, it must use the
netisr due to potential lock recursion.

MFC after: 3 days


181888 20-Aug-2008 julian

Fix some of the formatting fixes.. It's amazing how some things stand out
in a commit message.


181887 20-Aug-2008 julian

A bunch of formatting fixes brought to light by, or created by the Vimage commit
a few days ago.


181824 18-Aug-2008 philip

Fix ARP in bridging scenarios where the bridge shares its
MAC address with one of its members (see my r180140).

Pointy hat to: philip
Submitted by: Eygene Ryabinkin <rea-fbsd@codelabs.ru>
MFC after: 3 days


181803 17-Aug-2008 bz

Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch


181782 16-Aug-2008 bz

Fix a regression introduced in r179289 splitting up ip6_savecontrol()
into v4-only vs. v6-only inp_flags processing.
When ip6_savecontrol_v4() is called from ip6_savecontrol() we
were not passing back the **mp thus the information will be missing
in userland.
Instead of going with a *** as suggested in the PR we are returning
**mp now and passing in the v4only flag as a pointer argument.

PR: kern/126349
Reviewed by: rwatson, dwmalone


181464 09-Aug-2008 des

Nit


181365 07-Aug-2008 rwatson

Minor white space tweaks.

MFC after: 1 week


181364 07-Aug-2008 rwatson

Correct comment typo.

MFC after: 1 week (after inpcb rwlocking)


181337 05-Aug-2008 jhb

Minor style tweaks.


181139 01-Aug-2008 julian

The IPFW code accepts the use of the tablearg keyword along with the skipto
keyword. But it doesn't work. Two options.. make it no longer accept it,
or actually make it work.. I chose the 2nd..

Allow the tablearg to be used to specify a skipto destination.

This is actually a very powerful construct if used correctly, or a sink
of cpu cycles if used badly.

Changes to the man page will follow.


181056 31-Jul-2008 rpaulo

MFp4 (//depot/projects/tcpecn/):

TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
TCP ECN is defined in RFC 3168.

Partly reviewed by: dwmalone, silby
Obtained from: NetBSD


181054 31-Jul-2008 rrs

Adds support for the SCTP_PORT_REUSE option
Fixes a refcount bug found in the process

Obtained from: With the help of Michael Tuexen


180956 29-Jul-2008 rrs

Fix build breakage - kthread_exit() in 8 now has no arguments
MFC after: 1 week


180955 29-Jul-2008 rrs

- Out with some printfs.
- Fix an initialization of last_tsn_used
- Fix handling of mapped IPv4 addresses
Obtained from: Michael Tuexen and I :-)
MFC after: 1 week


180874 28-Jul-2008 mav

Some style and assertion fixes to the previous commits hinted by rwatson.
There are no functional changes.


180851 27-Jul-2008 mav

According to in_pcb.h, protocol binding information has double locking.
This allows it to be accessed during list traversal while holding only the
global pcbinfo lock.


180836 26-Jul-2008 mav

Increase UDBHASHSIZE from 16 to 128 items.
Previous value was chosen 10 years ago and not very effective now.
This change gives several percents speedup on 1000 L2TP mpd links.


180833 26-Jul-2008 mav

According to in_pcb.h, protocol binding information has double locking.
This allows it to be accessed during list traversal while holding only the
global pcbinfo lock.
This relaxed locking noticeably increases receive socket lookup performance.


180828 26-Jul-2008 mav

Add hash table lookup for fully connected raw sockets.

This gives significant performance improvements when many raw sockets are used.
Benchmarks of mpd handling 1000 simultaneous PPTP connections show up to 50%
performance boost. With a higher number of connections the benefit becomes even
bigger. PopTop and others should also get some benefits.


180683 22-Jul-2008 avatar

Trying to fix compilation bustage:
- removing 'const' qualifier from an input parameter to conform to the type
required by rw_assert();
- using in_addr->s_addr to retrieve the 32-bit address value.

Observed by: tinderbox


180678 21-Jul-2008 kmacy

make new accessor functions consistent with existing style


180674 21-Jul-2008 kmacy

- Switch to INP_WLOCK macro from inp_wlock
- calling sodisconnect after tcp_twstart is both gratuitous and unsafe - remove

Submitted by: rwatson


180648 21-Jul-2008 kmacy

Add versions of tcp_twstart, tcp_close, and tcp_drop that hide the acquisition of the tcbinfo lock.

MFC after: 1 week


180645 21-Jul-2008 kmacy

add interface for external consumers to syncache_expand - rename syncache_add in a manner consistent with other bits intended for offload


180641 21-Jul-2008 kmacy

Add accessor functions for socket fields.

MFC after: 1 week


180640 21-Jul-2008 kmacy

add inpcb accessor functions for fields needed by TOE devices


180631 20-Jul-2008 trhodes

Document a few sysctls.

Reviewed by: rwatson


180629 20-Jul-2008 bz

ia is a pointer thus use NULL rather than 0 for initialization and
in comparisons to make this more obvious.

MFC after: 5 days


180624 20-Jul-2008 kmacy

remove unused toedev functions and add comments for rest


180593 18-Jul-2008 dwmalone

Add an accept filter for TCP based DNS requests. It waits until the
whole first request is present before returning from accept.


180589 18-Jul-2008 rwatson

Eliminate use of the global ripsrc which was being used to pass address
information from rip_input() to rip_append(). Instead, pass the source
address for an IP datagram to rip_append() using a stack-allocated
sockaddr_in, similar to udp_input() and udp_append().

Prior to the move to rwlocks for inpcbinfo, this was not a problem, as
use of the global was synchronized using the ripcbinfo mutex, but with
read-locking there is the potential for a race during concurrent
receive.

This problem is not present in the IPv6 raw IP socket code, which
already used a stack variable for the address.

Spotted by: mav
MFC after: 1 week (before inpcbinfo rwlock changes)


180558 16-Jul-2008 rwatson

Fix error in comment.

MFC after: 3 weeks


180536 15-Jul-2008 rwatson

Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:

- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an inpcb read lock in udp_output(), which stabilizes the
  local inpcb address and port bindings in order to determine what further
  locking is required:
  - If the inpcb is currently not bound (at all) and we are implicitly
    connecting, we require inpcbinfo and inpcb write locks, so drop the
    read lock and re-acquire.
  - If the inpcb is bound for at least one of the port or address, but an
    explicit source or destination is requested, trylock the inpcbinfo
    lock, and if that fails, drop the inpcb lock, lock the global lock,
    and relock the inpcb lock.
  - Otherwise, no further locking is required (common case).
- Update comments.

In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.

The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a few weeks in the tree.

Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
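
The three locking cases listed in r180536 above can be condensed into a small decision function. This is a sketch of the decision only, under the assumption that "explicit source or destination requested" can be reduced to one flag; the enum and the function name are hypothetical and no real locks are taken.

```c
#include <stdbool.h>

/* Locking level needed after the initial inpcb read lock is held. */
enum udp_lock_need {
	UDP_LOCK_INP_READ,	/* inpcb read lock only (common case) */
	UDP_LOCK_INFO_TRY,	/* trylock inpcbinfo, else drop and relock */
	UDP_LOCK_WRITE		/* inpcbinfo and inpcb write locks */
};

/*
 * addr_bound / port_bound: whether the inpcb has a local address or port.
 * explicit_addr: whether the caller asked for an explicit source or
 * destination (for example sendto(2) with an address).
 */
static enum udp_lock_need
udp_output_lock_need(bool addr_bound, bool port_bound, bool explicit_addr)
{
	if (!explicit_addr)
		return (UDP_LOCK_INP_READ);
	if (!addr_bound && !port_bound)
		return (UDP_LOCK_WRITE);	/* implicit connect on unbound pcb */
	return (UDP_LOCK_INFO_TRY);
}
```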


180535 15-Jul-2008 rpaulo

Fix typo in comment.

M tcp_output.c


180513 14-Jul-2008 eri

Fix carp(4) panics that can occur during carp interface configuration.

Approved by: mlaier (mentor)
Reported by: Scott Ullrich
MFC after: 1 week


180429 10-Jul-2008 rwatson

Slightly rearrange validation of UDP arguments and jail processing in
udp_output() so that argument validation occurs before jail processing.

Add additional comments explaining what's going on when we process
addresses and binding during udp_output().

MFC after: 3 weeks


180427 10-Jul-2008 bz

Pass the ucred along into in{,6}_pcblookup_local for upcoming
prison checks.

Reviewed by: rwatson


180425 10-Jul-2008 bz

For consistency take lport as u_short in in{,6}_pcblookup_local.
All callers either pass in an u_short or u_int16_t.

Reviewed by: rwatson


180422 10-Jul-2008 rwatson

Apply the MAC label to an outgoing UDP packet when other inpcb properties are
processed, meaning that we avoid the cost of MAC label assignment if we're
going to drop the packet due to mbuf exhaustion, etc.

MFC after: 3 weeks


180392 09-Jul-2008 bz

For consistency with the rest of the function use the locally cached
pointer pcbinfo rather than inp->inp_pcbinfo.

MFC after: 3 weeks


180387 09-Jul-2008 rrs

1) Adds the rest of the VIMAGE change macros
2) Adds some __UserSpace__ on some of the common defines that
the user space code needs
3) Fixes a bug when we send up data to a user that failed. We
need to a) trim off the data chunk headers, if present, and
b) make sure the frag bit is communicated properly for the
msgs coming off the stream queues... i.e. we see if some
of the msg has been taken.

Obtained from: jeli contributed the VIMAGE changes on this pass. Thanks Julian!


180368 08-Jul-2008 rwatson

Provide some initial chicken-scratching annotations of locking for
struct inpcb.

Prodded by: bz
MFC after: 3 days


180348 07-Jul-2008 rwatson

Allow udp_notify() to accept read, as well as write, locks on the passed
inpcb. When directly invoking udp_notify() from udp_ctlinput(), acquire
only a read lock; we may still see write locks in udp_notify() as the
in_pcbnotifyall() routine is shared with TCP and always uses a write lock
on the inpcb being notified.

MFC after: 1 month


180346 07-Jul-2008 rwatson

Add additional udbinfo and inpcb locking assertions to udp_output(); for
some code paths, global or inpcb write locks are required, but for other
code paths, read locks or no locking at all are sufficient for the data
structures.

MFC after: 1 month


180344 07-Jul-2008 rwatson

First step towards parallel transmit in UDP: if neither a specific
source or a specific destination address is requested as part of a send
on a UDP socket, read lock the inpcb rather than write lock it. This
will allow fully parallel transmit down to the IP layer when sending
simultaneously from multiple threads on a connected UDP socket.

Parallel transmit for more complex cases, such as when sendto(2) is
invoked with an address and there's already a local binding, will
follow.

MFC after: 1 month


180338 07-Jul-2008 rwatson

Drop read lock on udbinfo earlier during delivery to the last matching
UDP socket for a datagram; the inpcb read lock is sufficient to provide
inpcb stability during udp_append().

MFC after: 1 month


180306 05-Jul-2008 rwatson

Rename raw_append() to rip_append(): the raw_ prefix is generally used
for functions in the generic raw socket library (raw_cb.c, raw_usrreq.c),
and they are not used for IPv4 raw sockets.

MFC after: 3 days


180305 05-Jul-2008 rwatson

Improve approximation of style(9) in raw socket code.


180264 04-Jul-2008 gonzo

Enqueue de-capsulated packet instead of performing direct dispatch. It's
possible to exhaust and garble the stack with a packet that contains a couple
of hundred nested encapsulation levels.

Submitted by: Ming Fu <fming@borderware.com>
Reviewed by: rwatson
PR: kern/85320


180239 04-Jul-2008 rwatson

Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred). Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired. This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acquisition of Giant and any calling
context.

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by: bz
MFC after: 1 month
X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
but the rest can go back.


180215 03-Jul-2008 bz

Remove a bogusly introduced rtalloc_ign() in rev. 1.335/SVN 178029,
generating an RTM_MISS for every IP packet forwarded making user space
routing daemons unhappy.

PR: kern/123621, kern/124540, kern/122338
Reported by: Paul <paul gtcomm.net>, Mike Tancsa <mike sentex.net> on net@
Tested by: Paul and Mike
Reviewed by: andre
MFC after: 3 days


180198 02-Jul-2008 rwatson

Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by: kris, ps


180127 30-Jun-2008 rwatson

In udp_append() and udp_input(), make use of read locking on inpcbs
rather than write locking: while we need to maintain a valid reference
to the inpcb and fix its state, no protocol layer state is modified
during an IPv4 UDP receive -- there are only changes at the socket
layer, which is separately protected by socket locking.

While parallel concurrent receive on a single UDP socket is currently
relatively unusual, introducing read locking in the transmit path,
allowing concurrent receive and transmit, will significantly improve
performance for loads such as BIND, memcached, etc.

MFC after: 2 months
Tested by: gnn, kris, ps


179971 24-Jun-2008 gonzo

In case of interface initialization failure, remove struct in_ifaddr* from
in_ifaddrhashtbl in in_ifinit, because the error handler in in_control removes
entries only for AF_INET addresses. If in_ifinit is called for a cloned
interface that has just been created, its address family is not AF_INET, so
LIST_REMOVE is not called for the respective LIST_INSERT_HEAD; the freed
entries then remain in in_ifaddrhashtbl and lead to memory corruption.

PR: kern/124384


179924 22-Jun-2008 mav

Partially revert previous commit. DeleteLink() does not delete permanent
links so we should be aware of it and try to delete every link only once
or we will loop forever.


179920 21-Jun-2008 mav

Implement UDP transparent proxy support.

PR: bin/54274
Submitted by: Nicolai Petri <nicolai@petri.cc>


179912 21-Jun-2008 mav

Add support for PORT/EPRT FTP commands in lowercase.
Use strncasecmp() instead of huge local implementation to reduce code size.
Check space presence after command/code.

PR: kern/73034
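
A case-insensitive command match of the kind described in r179912 above can be sketched with strncasecmp(3). The helper below is hypothetical (is_ftp_cmd() is not a libalias function) and only illustrates the two checks named in the message: ignore case, and require a space after the command.

```c
#include <stddef.h>
#include <strings.h>

/*
 * Return 1 if buf (len bytes, not necessarily NUL-terminated) starts with
 * cmd in any letter case, immediately followed by a space.
 */
static int
is_ftp_cmd(const char *buf, size_t len, const char *cmd, size_t cmdlen)
{
	if (len <= cmdlen)
		return (0);	/* no room for command plus separator */
	return (strncasecmp(buf, cmd, cmdlen) == 0 && buf[cmdlen] == ' ');
}
```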


179833 16-Jun-2008 ups

Change incorrect stale cookie detection in syncookie_lookup() that prematurely
declared a cookie as expired.

Reviewed by: andre@, silby@
Reported by: Yahoo!


179832 16-Jun-2008 ups

Fix a check in SYN cache expansion (syncache_expand()) to accept packets that arrive in the receive window instead of just on the left edge of the receive window.
This is needed for correct behavior when packets are lost or reordered.

PR: kern/123950
Reviewed by: andre@, silby@
Reported by: Yahoo!, Wang Jin
MFC after: 1 week
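
The difference between "on the left edge" and "in the receive window" in r179832 above can be shown with wraparound-safe sequence arithmetic. This is a sketch under assumptions: the exact bounds used by syncache_expand() are not given in the message, so the window here is taken to be irs + 1 through irs + wnd inclusive, and seq_in_window() is a hypothetical name.

```c
#include <stdint.h>

/* Wraparound-safe comparisons in the style of the TCP SEQ_* macros. */
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)
#define SEQ_LEQ(a, b)	((int32_t)((a) - (b)) <= 0)

/* Old behaviour as described: accept only the left edge of the window. */
static int
seq_on_left_edge(uint32_t seq, uint32_t irs)
{
	return (seq == irs + 1);
}

/* New behaviour as described: accept anything inside the window. */
static int
seq_in_window(uint32_t seq, uint32_t irs, uint32_t wnd)
{
	return (SEQ_GT(seq, irs) && SEQ_LEQ(seq, irs + wnd));
}
```

If the first data segment after the handshake is lost or reordered, the next segment to arrive sits to the right of the left edge; the left-edge-only test rejects it while the window test accepts it.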


179803 15-Jun-2008 rrs

More prep for Vimage:
- only one function to destroy an SCTP stack sctp_finish()
- Make it so this function also arranges for any threads
created by the image to do a kthread_exit()


179786 14-Jun-2008 rrs

- Fixes foobar on my part. Some missing virtualization macros from
specific logging cases.


179783 14-Jun-2008 rrs

- Macro-izes the packed declaration in all headers.
- Vimage prep - these are major restructures to move
all global variables to be accessed via a macro or two.
The variables all go into a single structure.
- Asconf address addition tweaks (add_or_del Interfaces)
- Fix rwnd calculation to be more conservative.
- Support SACK_IMMEDIATE flag to skip delayed sack
by demand of peer.
- Comment updates in the sack mapping calculations
- INVARIANTS panic added.
- Pre-support for UDP tunneling (we can do this on
MAC but will need added support from UDP to
get a "pipe" of UDP packets in).
- clear trace buffer sysctl added when local tracing on.

Note the majority of this huge patch is all the vimage prep stuff :-)


179737 11-Jun-2008 jfv

Add generic TCP LRO into netinet


179490 02-Jun-2008 mlaier

Sort IP addresses before hashing them for the signature. Otherwise carp is
sensitive to address configuration order.

PR: kern/121574
Reported by: Douglas K. Rand, Wouter de Jong
Obtained from: OpenBSD (rev 1.114 + fixes)
MFC after: 2 weeks
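
Why sorting removes the dependence on configuration order in r179490 above can be seen in a small sketch. It assumes IPv4 addresses held as 32-bit integers and uses FNV-1a purely as a stand-in for the keyed hash that carp actually computes; addr_set_digest() and MAX_ADDRS are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ADDRS	16	/* illustrative cap; extra entries are ignored */

static int
cmp_addr(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return ((x > y) - (x < y));
}

/* Digest that depends on the set of addresses, not on their order. */
static uint32_t
addr_set_digest(const uint32_t *addrs, size_t n)
{
	uint32_t sorted[MAX_ADDRS];
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */
	size_t i, j;

	if (n > MAX_ADDRS)
		n = MAX_ADDRS;
	memcpy(sorted, addrs, n * sizeof(sorted[0]));
	qsort(sorted, n, sizeof(sorted[0]), cmp_addr);
	for (i = 0; i < n; i++) {
		for (j = 0; j < 4; j++) {
			h ^= (sorted[i] >> (8 * j)) & 0xff;
			h *= 16777619u;	/* FNV prime */
		}
	}
	return (h);
}
```

Two hosts that list the same addresses in a different order feed identical byte sequences to the hash once both have sorted, so their signatures agree.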


179487 02-Jun-2008 rwatson

When allocating temporary storage to hold a TCP/IP packet header
template, use an M_TEMP malloc(9) allocation rather than an mbuf
with mtod(9) and dtom(9). This eliminates the last use of
dtom(9) in TCP.

MFC after: 3 weeks


179480 01-Jun-2008 mav

Increase LINK_TABLE_OUT_SIZE from 101 to 4001 like LINK_TABLE_IN_SIZE
to reduce performance degradation under heavy outgoing scan/flood.
Scalability is now much more important than several kilobytes of RAM.

Remove unneeded TCP-specific expiration handling. Before this, connected
TCP sessions could never expire. Now connected TCP sessions will expire
after 24 hours of inactivity.

Simplify HouseKeeping() to avoid several mul/div-s per packet. Taking into
account increased LINK_TABLE_OUT_SIZE, precision is still much more than
required.


179478 01-Jun-2008 mav

Make m_megapullup() more intelligent:
- to increase performance do not reallocate mbuf when possible,
- to support up to 16K packets (was 2K max) use mbuf cluster of proper size.
This change depends on recent ng_nat and ip_fw_nat changes.


179473 01-Jun-2008 mav

PKT_ALIAS_FOUND_HEADER_FRAGMENT result is not an error, so pass that packet.
This fixes packet fragmentation handling.

Pass the really available buffer size to libalias instead of the MCLBYTES constant.
The MCLBYTES constant was used in the belief that m_megapullup() always moves
data into a fresh cluster, which is not always the case.


179472 01-Jun-2008 mav

Fix packet fragmentation support broken by copy/paste error in rev.1.60.
ip_id should be u_short, but not u_char.
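
The effect of the type mix-up in r179472 above can be reproduced in isolation. The sketch assumes the narrow type was used for a variable holding the 16-bit IP identification field; the function names are hypothetical and do not come from libalias.

```c
#include <stdint.h>

/* Buggy variant: an 8-bit temporary drops the high byte of the ID. */
static uint16_t
copy_id_narrow(uint16_t ip_id)
{
	uint8_t tmp = (uint8_t)ip_id;

	return (tmp);
}

/* Fixed variant: a 16-bit temporary preserves the full ID. */
static uint16_t
copy_id_wide(uint16_t ip_id)
{
	uint16_t tmp = ip_id;

	return (tmp);
}
```

With the narrow type, IDs that differ only in their high byte become indistinguishable, so fragments of unrelated datagrams can be matched to the wrong header fragment.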


179414 29-May-2008 rwatson

Read lock rather than write lock TCP inpcbs in monitoring sysctls. In
some cases, add explicit inpcb locking rather than relying on the global
lock, as we dereference inp_socket, but also allowing us to drop the
global lock more quickly.

MFC after: 1 week


179412 29-May-2008 rwatson

Employ read locks on UDP inpcbs, rather than write locks, when
monitoring UDP connections using sysctls. In some cases, add
previously missing locking of inpcbs, as inp_socket is followed,
which also allows us to drop global locks more quickly.

MFC after: 1 week


179289 24-May-2008 bz

Factor out the v4-only vs. the v6-only inp_flags processing in
ip6_savecontrol in preparation for udp_append() to no longer
need a WLOCK as we will no longer be modifying socket options.

Requested by: rwatson
Reviewed by: gnn
MFC after: 10 days


179201 22-May-2008 rwatson

Consistently check IPFW and DUMMYNET privileges in the configuration
routines for those modules, rather than in the raw socket code. This
allows each privilege check to occur in exactly one place and avoids
duplicate checks across layers.

MFC after: 3 weeks
Sponsored by: nCircle Network Security, Inc.


179180 21-May-2008 rrs

- sctputil.c - If debug is on, the INPKILL timer can deref a freed value.
Change so that we save off a type field for display and
NULL inp just for good measure.

- sctp_output.c - Fix it so in sending to the loopback we use the
src address of the inbound INIT. We don't want
to do this for non local addresses since otherwise
we might be ingress filtered, so we need to use
the best src address and list the address sent to.

Obtained from: time bug - Neil Wilson
MFC after: 1 week


179157 20-May-2008 rrs

- Adds support for the multi-asconf (From Kozuka-san)
- Adds some prepwork (Not all yet) for vimage in particular
support the delete the sctppcbinfo.xx structs. There is
still a leak in here if it were to be called plus we still
need the regrouping (From Me and Michael Tuexen)
- Adds support for UDP tunneling. For BSD there is no
socket yet setup so its disabled, but major argument
changes are in here to encompass the passing of the port
number (zero when you don't have a udp tunnel, the default
for BSD). Will add some hooks in UDP here shortly (discussed
with Robert) that will allow easy tunneling. (Mainly from
Peter Lei and Michael Tuexen with some BSD work from me :-D)
- Some ease for windows: evidently leave is reserved by their
compiler, so move label leave: -> out:

MFC after: 1 week


179141 20-May-2008 rrs

- Define changes in sctp.h
- Bug in CA that does not get us incrementing the PBA properly which
made us more conservative.
- comment updated in sctp_input.c
- memsets added before we log
- added arg to hmac id's
MFC after: 2 weeks


178960 12-May-2008 gnn

Fix the loopback interface. Cleaning up some code with new macros
was a tad too aggressive.

PR: kern/123568
Submitted by: Vladimir Ermakov <samflanker at gmail dot com>
Obtained from: antoine


178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows different
packet streams to be routed by more than just the destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as an EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.

Implementation method, Compatible version. (part 1)
-------------------------------

For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
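
The indexing scheme just described can be sketched as a toy lookup. Everything except the rt_tables[Y][X] shape and the limit of 16 tables is an assumption made for illustration: struct rt_head, NUM_AF, AF_TOY_INET and both function names are hypothetical, and returning NULL for an out-of-range FIB is a choice of this sketch rather than documented behaviour.

```c
#include <stddef.h>

#define NUM_FIBS	16	/* first-commit limit quoted in the message */
#define NUM_AF		4	/* toy number of protocol families */
#define AF_TOY_INET	2	/* hypothetical index standing in for IPv4 */

struct rt_head {
	int placeholder;	/* stands in for the per-family route trie */
};

/* Row = FIB number (Y), column = protocol family (X). */
static struct rt_head rt_tables[NUM_FIBS][NUM_AF];

/* FIB-aware lookup: only IPv4 honours the requested row. */
static struct rt_head *
rt_table_for(int fib, int af)
{
	if (af < 0 || af >= NUM_AF)
		return (NULL);
	if (af != AF_TOY_INET)
		fib = 0;	/* all other families always use row 0 */
	if (fib < 0 || fib >= NUM_FIBS)
		return (NULL);
	return (&rt_tables[fib][af]);
}

/* What FIB-unaware code effectively sees: the first row only. */
static struct rt_head *
legacy_rt_table(int af)
{
	return (rt_table_for(0, af));
}
```

Because row 0 is laid out exactly like the old one-dimensional array, callers that never pass a FIB number keep working unchanged.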

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from any to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlaier (parts each)
Obtained from: Ironport systems/Cisco


178862 08-May-2008 jhb

Always bump tcpstat.tcps_badrst if we get a RST for a connection in the
syncache that has an invalid SEQ instead of only doing it when we succeed
in mallocing space for the log message.

MFC after: 1 week
Reviewed by: sam, bz


178801 05-May-2008 kmacy

replace spaces added in last change with tabs


178793 05-May-2008 kmacy

add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific
extension fields for tcp_info


178730 02-May-2008 marck

Fix build, together with a bit of style breakage.


178673 29-Apr-2008 rwatson

Fix a comment typo.

MFC after: 3 days


178377 21-Apr-2008 rwatson

With IPv4 raw sockets, read lock rather than write lock the inpcb when
receiving or transmitting.

With IPv6 raw sockets, read lock rather than write lock the inpcb when
receiving. Unfortunately, IPv6 source address selection appears to
require a write lock on the inpcb for the time being.

MFC after: 3 months


178376 21-Apr-2008 rwatson

Read lock, rather than write lock, the inpcb when transmitting with or
delivering to an IP divert socket.

MFC after: 3 months


178349 20-Apr-2008 bz

Revert to rev. 1.161 - switch back to optimized TCP options ordering.

A lot of testing has shown that the problem people were seeing was due
to invalid padding after the end of option list option, which was corrected
in tcp_output.c rev. 1.146.

Thanks to: anders@, s3raphi, Matt Reimer
Thanks to: Doug Hardie and Randy Rose, John Mayer, Susan Guzzardi
Special thanks to: dwhite@ and BitGravity
Discussed with: silby
MFC after: 1 day


178325 20-Apr-2008 rwatson

Teach pf and ipfw to use read locks on inpcbs rather than write locks
when reading credential data from sockets.

Teach pf to unlock the pcbinfo more quickly once it has acquired an
inpcb lock, as the inpcb lock is sufficient to protect the reference.

Assert locks, rather than read locks or write locks, on inpcbs in
subroutines--this is necessary as the inpcb may be passed down with a
write lock from the protocol, or may be passed down with a read lock
from the firewall lookup routine, and either is sufficient.

MFC after: 3 months


178319 19-Apr-2008 rwatson

In ip_output(), allow a read lock as well as a write lock when asserting
a lock on the passed inpcb.

MFC after: 3 months


178318 19-Apr-2008 rwatson

When querying the local or foreign address from an IP socket, acquire
only a read lock on the inpcb.

When an external module requests a read lock, acquire only a read lock.

MFC after: 3 months


178303 19-Apr-2008 kmacy

move tcbinfo lock acquisition in to syncache


178302 19-Apr-2008 kmacy

move cxgb_lt2.[ch] from NIC to TOE
move most offload functionality from NIC to TOE
factor out all socket and inpcb direct access
factor out access to locking in inpcb, pcbinfo, and sockbuf


178290 17-Apr-2008 gnn

Add in check for loopback as well, which was missing from the original patch.

PR: 120958
Submitted by: James Snow <snow at teardrop.org>
MFC after: 2 weeks


178285 17-Apr-2008 rwatson

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock remain exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committed patch)


178280 17-Apr-2008 gnn

Clean up the code that checks the types of address so that it is
done by understandable macros.

Fix the bug that prevented the system from responding on interfaces with
link local addresses assigned.

PR: 120958
Submitted by: James Snow <snow at teardrop.org>
MFC after: 2 weeks


178251 16-Apr-2008 rrs

Allow SCTP to compile without INET6.
PR: 116816
Obtained from: tuexen@fh-muenster.de
MFC after: 2 weeks


178202 14-Apr-2008 rrs

Use the pru_flush infrastructure to avoid a panic

PR: 122710
MFC after: 1 week


178198 14-Apr-2008 rrs

Protection against errant sender sending a stream
seq number out of order with no missing TSN's (a
cisco box has this problem which will make a ssn
be held forever).
MFC after: 1 week


178197 14-Apr-2008 rrs

New logging values.


178196 14-Apr-2008 rrs

1) adds some additional logging
2) changes to use a inqueue_bytes calculated value in max_len calc's.
MFC after: 1 week


178167 13-Apr-2008 qingli

This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
was disallowed. For example,

route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2

Before this change, the second route insertion would trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.
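
The add and delete rules described above can be sketched as a toy Python
model. It is not the radix tree code guarded by RADIX_MPATH: the
RouteTable class, its method names and its exception messages are
hypothetical, and only the behaviour stated in this message is modelled.

```python
class RouteTable:
    """Toy model: maps a destination to its list of gateways."""

    def __init__(self, multipath=True):
        self.multipath = multipath  # stands in for "options RADIX_MPATH"
        self.routes = {}

    def add(self, dest, gateway):
        gateways = self.routes.setdefault(dest, [])
        # Without multipath, a second route to the same destination fails.
        if gateway in gateways or (gateways and not self.multipath):
            raise ValueError("route already in table")
        gateways.append(gateway)

    def delete(self, dest, gateway=None):
        gateways = self.routes.get(dest, [])
        if gateway is None:
            # The gateway may be omitted only when the route is unambiguous.
            if len(gateways) != 1:
                raise LookupError("not in table")
            gateway = gateways[0]
        if gateway not in gateways:
            raise LookupError("not in table")
        gateways.remove(gateway)
        if not gateways:
            del self.routes[dest]
```

With two default routes installed, delete("default") raises in this
model, while delete("default", "10.2.5.2") succeeds and leaves a single
route that can then be deleted without naming a gateway.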

I need to perform more testing on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by: robert, sam, gnn, julian, kmacy


178029 09-Apr-2008 bz

Take the route mtu into account, if available, when sending an
ICMP unreach, frag needed. Up to now we only looked at the
interface MTU. Make sure to only use the minimum of the two.

In case IPSEC is compiled in, loop the mtu through ip_ipsec_mtu()
to avoid any further conditional maths.

Without this, PMTU was broken in those cases when there was a
route with a lower MTU than the MTU of the outgoing interface.
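
As a minimal sketch of the rule above (not the actual ip_forward() code),
the MTU reported in the ICMP error is the smaller of the interface MTU and
the route MTU when a route MTU is available. The function name and the
convention that 0 or None means "no route MTU" are assumptions, and the
IPSEC adjustment via ip_ipsec_mtu() is left out.

```python
def icmp_fragneeded_mtu(if_mtu, route_mtu=None):
    """Pick the MTU to advertise in an ICMP unreach, frag needed."""
    if route_mtu:  # assumed convention: 0/None means no route MTU is set
        return min(if_mtu, route_mtu)
    return if_mtu
```

With a 1500-byte interface and a 1400-byte route MTU this returns 1400,
the case that was previously reported as 1500.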

PR: kern/122338
Tested by: Mark Cammidge mark peralex.com
Reviewed by: silence on net@
MFC after: 2 weeks


177988 07-Apr-2008 andre

Remove TCP options ordering assumptions in tcp_addoptions(). Ordering
was changed in rev. 1.161 of tcp_var.h. All options now test for sufficient
space in TCP header before getting added.

Reported by: Mark Atkinson <atkin901-at-yahoo.com>
Tested by: Mark Atkinson <atkin901-at-yahoo.com>
MFC after: 1 week


177987 07-Apr-2008 andre

Remove now unnecessary comment.


177986 07-Apr-2008 andre

Use #defines for TCP options padding after EOL to be consistent.

Reviewed by: bz


177978 07-Apr-2008 rwatson

Add further TCP inpcb locking assertions to some TCP input code paths.

MFC after: 1 month


177961 06-Apr-2008 rwatson

In in_pcbnotifyall() and in6_pcbnotify(), use LIST_FOREACH_SAFE() and
eliminate unnecessary local variable caching of the list head pointer,
making the code a bit easier to read.

MFC after: 3 weeks


177599 25-Mar-2008 ru

Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by: arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.


177575 24-Mar-2008 kmacy

change inp_wlock_assert to inp_lock_assert


177536 24-Mar-2008 kmacy

Label inp as unused in the non-INVARIANTS case


177530 23-Mar-2008 kmacy

Insulate inpcb consumers outside the stack from the lock type and offset within the pcb by adding accessor functions.

Reviewed by: rwatson
MFC after: 3 weeks


177382 19-Mar-2008 piso

Make the new packet size explicit.

Bug pointed out by: many
Pointy hat to: me :(


177326 17-Mar-2008 piso

Don't cache ptr to nat rule in case of tablearg argument.

Bug spotted by: Dyadchenko Mihail


177323 17-Mar-2008 piso

Don't abuse stack space while in kernel land, use heap instead.


177300 17-Mar-2008 rwatson

Fix indentation for a closing brace in in_pcballoc().

MFC after: 3 days


177175 14-Mar-2008 bz

Correct IPsec behaviour with a 'use' level in SP but no SA available.
In that case return and continue processing the packet without IPsec.

PR: 121384
MFC after: 5 days
Reported by: Cyrus Rahman (crahman gmail.com)
Tested by: Cyrus Rahman (crahman gmail.com) [slightly older version]


177098 12-Mar-2008 piso

-Don't pass down the entire pkt to ProtoAliasIn, ProtoAliasOut, FragmentIn
and FragmentOut.
-Axe the old PacketAlias API: it has been deprecated since 5.x.


176978 09-Mar-2008 bz

Padding after EOL option must be zeros according to RFC793 but
the NOPs used are 0x01.
While we could simply pad with EOLs (which are 0x00), rather use an
explicit 0x00 constant there to not confuse people with 'EOL padding'.
Put in a comment saying just that.

Problem discussed on: src-committers with andre, silby, dwhite as
follow up to the rev. 1.161 commit of tcp_var.h.
MFC after: 11 days


176884 06-Mar-2008 piso

MFP4:
restrict the utilization of direct pointers to the content of
ip packet. These modifications are functionally nop()s thus
can be merged with no side effects.


176805 04-Mar-2008 rpaulo

Change the default port range for outgoing connections by introducing
IPPORT_EPHEMERALFIRST and IPPORT_EPHEMERALLAST with values
10000 and 65535 respectively.
The rationale behind is that it makes the attacker's life more
difficult if he/she wants to guess the ephemeral port range and
also lowers the probability of a port collision (described in
draft-ietf-tsvwg-port-randomization-01.txt).
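
A minimal sketch of picking a port from the new range follows. The two
constants and their values come from this message; the selection strategy
(random starting point, then a linear scan for a free port) and the
pick_ephemeral_port() helper are assumptions for illustration, not the
in_pcbbind_setup() logic.

```python
import random

IPPORT_EPHEMERALFIRST = 10000
IPPORT_EPHEMERALLAST = 65535


def pick_ephemeral_port(in_use, rng=random):
    """Return a free port in the ephemeral range, starting at a random offset."""
    count = IPPORT_EPHEMERALLAST - IPPORT_EPHEMERALFIRST + 1
    start = rng.randint(IPPORT_EPHEMERALFIRST, IPPORT_EPHEMERALLAST)
    for i in range(count):
        # Walk the range once, wrapping around at the upper bound.
        port = IPPORT_EPHEMERALFIRST + (start - IPPORT_EPHEMERALFIRST + i) % count
        if port not in in_use:
            return port
    raise OSError("no free ephemeral port")
```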

While there, remove code duplication in in_pcbbind_setup().

Submitted by: Fernando Gont <fernando at gont.com.ar>
Approved by: njl (mentor)
Reviewed by: silby, bms
Discussed on: freebsd-net


176778 03-Mar-2008 piso

When unloading kld, don't forget to flush the nat pointers.


176765 03-Mar-2008 piso

Raise a bit ipfw kld priority.

Discussed on: net-, ipfw-.


176736 02-Mar-2008 bz

Some "cleanup" of tcp_mss():
- Move the assignment of the socket down before we first need it.
No need to do it at the beginning and then drop out the function
by one of the returns before using it 100 lines further down.
- Use t_maxopd which was assigned the "tcp_mssdflt" for the correct
AF already instead of another #ifdef ? : #endif block doing the same.
- Remove an unneeded (duplicate) assignment of mss to t_maxseg just before
we possibly change mss and re-do the assignment without using t_maxseg
in between.

Reviewed by: silby
No objections: net@ (silence)
MFC after: 5 days


176716 01-Mar-2008 bz

Fix indentation (whitespace changes only).

MFC after: 6 days


176669 29-Feb-2008 piso

Move ipfw's nat code into its own kld: ipfw_nat.


176626 27-Feb-2008 dwmalone

Dummynet has a limit of 100 slots queue size (or 1MB, if you give
the limit in bytes) hard coded into both the kernel and userland.
Make both these limits a sysctl, so it is easy to change the limit.
If the userland part of ipfw finds that the sysctls don't exist,
it will just fall back to the traditional limits.

(100 packets is quite a small limit these days. If you want to test
TCP at 100Mbps, 100 packets can only accommodate a DBP of 12ms.)
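
The 12ms figure can be reproduced with a short calculation, assuming
full-size 1500-byte packets (the packet size is not stated above, so it
is an assumption here, as is the helper name).

```python
def queue_drain_time_ms(slots, link_bps, packet_bytes=1500):
    """Time, in milliseconds, that a full queue represents at link speed."""
    queued_bits = slots * packet_bytes * 8
    return queued_bits * 1000 / link_bps
```

queue_drain_time_ms(100, 100_000_000) gives 12.0: 100 packets of 1500
bytes are 1.2 Mbit, which a 100Mbps link drains in 12ms.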

Note these sysctls in the man page and warn against increasing them
without thinking first.

MFC after: 3 weeks


176517 24-Feb-2008 piso

Add table/tablearg support to ipfw's nat.

MFC After: 1 week


176502 24-Feb-2008 silby

Change FreeBSD 7 so that it returns TCP options in
the same order that FreeBSD 6 and before did. Doug
White and the other bloodhounds at ISC discovered that
while FreeBSD 7's ordering of options was more efficient,
it caused some cable modem routers to ignore the
SYN-ACKs ordered in this fashion.

The placement of sackOK after the timestamp option seems
to be the critical difference:

FreeBSD 6:
<mss 1460,nop,wscale 1,nop,nop,timestamp 3512155768 0,sackOK,eol>

FreeBSD 7.0:
<mss 1460,nop,wscale 3,sackOK,timestamp 1370692577 0>

FreeBSD 7.0 + this change:
<mss 1460,nop,wscale 3,nop,nop,timestamp 7371813 0,sackOK,eol>
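
The restored layout can be sketched by encoding the option bytes in
Python. The option kinds and lengths are the standard TCP values; the
function name is hypothetical, TCPOPT_PAD is used here simply as a name
for the zero pad byte, and this is an illustration of the byte layout
rather than the tcp_addoptions() code.

```python
import struct

TCPOPT_EOL, TCPOPT_NOP = 0, 1
TCPOPT_MAXSEG, TCPOPT_WINDOW = 2, 3
TCPOPT_SACK_PERMITTED, TCPOPT_TIMESTAMP = 4, 8
TCPOPT_PAD = 0x00  # pad byte written after EOL


def build_syn_options(mss, wscale, tsval, tsecr):
    """Encode <mss,nop,wscale,nop,nop,timestamp,sackOK,eol> plus padding."""
    opts = struct.pack("!BBH", TCPOPT_MAXSEG, 4, mss)
    opts += bytes([TCPOPT_NOP, TCPOPT_WINDOW, 3, wscale])
    opts += bytes([TCPOPT_NOP, TCPOPT_NOP])
    opts += struct.pack("!BBII", TCPOPT_TIMESTAMP, 10, tsval, tsecr)
    opts += bytes([TCPOPT_SACK_PERMITTED, 2])
    opts += bytes([TCPOPT_EOL])
    # The options area must be a multiple of 4 bytes; pad with zeros.
    opts += bytes([TCPOPT_PAD]) * (-len(opts) % 4)
    return opts
```

For the third example above, build_syn_options(1460, 3, 7371813, 0)
produces 23 bytes of options followed by one pad byte, 24 bytes in total.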

MFC after: 1 week


176464 22-Feb-2008 rrs

Fixes a memory leak when VRF's are in play.

Submitted by: Prasad Narasimha (snprasad@cisco.com)
Reviewed by: rrs


176463 22-Feb-2008 rrs

- Takes out stray ifdef code that should not have been present.


176093 07-Feb-2008 glebius

If the vhid is already present, return EEXIST instead of
non-informative EINVAL.


176086 07-Feb-2008 glebius

Remove unused structure member from struct in_ifadown_arg.


176042 06-Feb-2008 silby

Replace the random IP ID generation code we
obtained from OpenBSD with an algorithm suggested
by Amit Klein. The OpenBSD algorithm has a few
flaws; see Amit's paper for more information.

For a description of how this algorithm works,
please see the comments within the code.

Note that this commit does not yet enable random IP ID
generation by default. There are still some concerns
that doing so will adversely affect performance.

Reviewed by: rwatson
MFC After: 2 weeks


175892 02-Feb-2008 bz

Rather than passing around a cached 'priv', pass in an ucred to
ipsec*_set_policy and do the privilege check only if needed.

Try to assimilate both ip*_ctloutput code blocks calling ipsec*_set_policy.

Reviewed by: rwatson


175845 31-Jan-2008 rwatson

Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushing it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>


175752 28-Jan-2008 rrs

- Fix a comment about prison.
- Fix it so the VRF is captured while locks are held.
MFC after: 1 week


175751 28-Jan-2008 rrs

- Change back to using priority 0. Which means don't change the
priority when running the thread. (this is for the sctp_iterator thread).

MFC after: 1 week


175750 28-Jan-2008 rrs

- Fix a bug where the socket may have been closed which
could cause a crash in the auth code.
Obtained from: Michael Tuexen
MFC after: 1 week


175748 28-Jan-2008 rrs

- Fixes a comparison wrap issue with sack gap ack blocks that
span the 32 bit roll over mark.
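
The kind of problem described here can be illustrated with
wrap-aware comparison of 32-bit sequence numbers. This is a generic
serial-number sketch, not the SCTP code that was changed; the helper
names are hypothetical. Gap ack block offsets are added to the cumulative
TSN, so the resulting TSNs can cross the 32-bit boundary.

```python
MASK32 = 0xFFFFFFFF


def tsn_gt(a, b):
    """True if 32-bit TSN a is 'after' b, allowing for wrap-around."""
    return a != b and ((a - b) & MASK32) < 0x80000000


def gap_block_tsns(cum_tsn, start_off, end_off):
    """Absolute first and last TSN of a gap ack block, modulo 2**32."""
    return ((cum_tsn + start_off) & MASK32, (cum_tsn + end_off) & MASK32)
```

With cum_tsn = 0xFFFFFFFE and offsets 1 and 5, the block runs from
0xFFFFFFFF to 3. A plain "last > first" test is false for that block,
whereas tsn_gt(3, 0xFFFFFFFF) is true.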


175659 25-Jan-2008 rwatson

Hide ipfw internal data structures behind IPFW_INTERNAL rather than
exposing them to all consumers of ip_fw.h. These structures are
used in both ipfw(8) and ipfw(4), but not part of the user<->kernel
interface for other applications to use, rather, shared
implementation.

MFC after: 3 days
Reported by: Paul Vixie <paul at vix dot com>


175630 24-Jan-2008 bz

Replace the last suser() calls in netinet6/ with privilege checks.

Introduce a new privilege allowing to set certain IP header options
(hop-by-hop, routing headers).

Leave a few comments to be addressed later.

Reviewed by: rwatson (older version, before addressing his comments)


175626 24-Jan-2008 bz

Differentiate between addifaddr and delifaddr for the privilege check.

Reviewed by: rwatson
MFC after: 2 weeks


175612 23-Jan-2008 rwatson

tcp_usrreq.c:1.313 removed tcbinfo locking from tcp_usr_accept(), which
while in principle a good idea, opened us up to a race inherent to
the syncache's direct insertion of incoming TCP connections into the
"completed connection" listen queue, as it transpires that the socket
is inserted before the inpcb is fully filled in by syncache_expand().
The bug manifested with the occasional returning of 0.0.0.0:0 in the
address returned by the accept() system call, which occurred if accept
managed to execute tcp_usr_accept() before syncache_expand() had copied
the endpoint addresses into inpcb connection state.

Re-add tcbinfo locking around the address copyout, which has the effect
of delaying the copy until syncache_expand() has finished running, as
it is run while the tcbinfo lock is held. This is undesirable in that
it increases contention on tcbinfo further, but a more significant
change will be required to how the syncache inserts new sockets in
order to fix this and keep more granular locking here. In particular,
either more state needs to be passed into sonewconn() so that
pru_attach() can fill in the fields *before* the socket is inserted, or
the socket needs to be inserted in the incomplete connection queue
until it is actually ready to be used.

Reported by: glebius (and kris)
Tested by: glebius


175438 18-Jan-2008 rwatson

In tcp_ctloutput(), don't hold the inpcb lock over sooptcopyin(), rather,
drop the lock and then re-acquire it, revalidating TCP connection state
assumptions when we do so. This avoids a potential lock order reversal
(and potential deadlock, although none have been reported) due to the
inpcb lock being held over a page fault.

MFC after: 1 week
PR: 102752
Reviewed by: bz
Reported by: Václav Haisman <v dot haisman at sh dot cvut dot cz>


175025 31-Dec-2007 julian

Don't duplicate the whole of arpresolve to arpresolve 2 for the sake
of two compares against 0. The negative effect of cache flushing
is probably more than the gain by not doing the two compares (the
value is almost certainly in register or at worst, cache).
Note that the uses of m_freem() are in error cases and m_freem()
handles NULL anyhow. So fast-path really isn't changed much at all.


174893 25-Dec-2007 oleg

Workaround p->numbytes overflow, which can result in infinite loop inside
dummynet module (prerequisite is using queues with "fat" pipe).

PR: kern/113548


174857 22-Dec-2007 rwatson

When IPSEC fails to allocate policy state for an inpcb, and MAC is in use,
free the MAC label on the inpcb before freeing the inpcb.

MFC after: 3 days
Submitted by: tanyong <tanyong at ercist dot iscas dot ac dot cn>,
zhouzhouyi


174775 19-Dec-2007 ru

Fix bugs in the TCP syncache timeout code, including:

When system ticks are positive, for entries in the cache
bucket, syncache_timer() ran on every tick (doing nothing
useful) instead of the supposed 3, 6, 12, and 24 seconds
later (when it's time to retransmit SYN,ACK).

When ticks are negative, syncache_timer() was scheduled
for the too far future (up to ~25 days on systems with
HZ=1000), no SYN,ACK retransmits were attempted at all,
and syncache entries added in that period that correspond
to non-established connections stay there forever.
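
Two figures in this message can be checked with a short calculation. It
assumes the tick counter is a signed 32-bit integer, so its negative half
spans 2**31 ticks, and that the retransmit delay doubles from a 3 second
base; both helper names are hypothetical.

```python
def retransmit_delays(base_s=3, attempts=4):
    """Intended SYN,ACK retransmit delays in seconds, doubling each time."""
    return [base_s << n for n in range(attempts)]


def negative_ticks_span_days(hz, bits=31):
    """How long, in days, a signed tick counter stays negative at rate hz."""
    return (1 << bits) / hz / 86400
```

retransmit_delays() gives [3, 6, 12, 24], and
negative_ticks_span_days(1000) is a little under 25 days, matching the
"~25 days" quoted for HZ=1000.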

Only HEAD and RELENG_7 are affected.

Reviewed by: silby, kmacy (earlier version)
Submitted by: Maxim Dounin, ru


174768 19-Dec-2007 kmacy

Remove extraneous debug statements.

Noticed by: Andrey Chernov


174757 18-Dec-2007 kmacy

Incorporate TCP offload hooks in to core TCP code.
- Rename output routines tcp_gen_* -> tcp_output_*.
- Rename notification routines that turn in to no-ops in the absence of TOE
from tcp_gen_* -> tcp_offload_*.
- Fix some minor comment nits.
- Add a /* FALLTHROUGH */

Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack


174736 18-Dec-2007 rrs

- sctp-iterator should run at PI_NET priority ...not 0.

MFC after: 1 week


174704 17-Dec-2007 kmacy

incorporate feedback since initial commit
- rename tcp_ofld.[ch] to tcp_offload.[ch]
- document usage and locking conventions of the functions in the
toe_usrreqs function vector
- document tcpcb, inpcb, and socket fields used by toe
- widen the listen interface into 2 functions
- rename DISABLE_TCP_OFFLOAD to TCP_OFFLOAD_DISABLE
- shrink conditional compilation to reduce the likelihood of bitrot
- replace sc->sc_toepcb checks in tcp_syncache.c with TOEPCB_ISSET


174703 17-Dec-2007 kmacy

widen the routing event interface (arp update, redirect, and eventually pmtu change)
into separate functions

revert previous commit's changes to arpresolve and add a new interface
arpresolve2 which does arp resolution without an mbuf


174699 17-Dec-2007 kmacy

Don't panic in arpresolve if we're given a null mbuf. We could
insist that the caller just pass in an initialized mbuf even
if didn't have any data - but that seems rather contrived.


174651 16-Dec-2007 kmacy

Update tod_connect call to reflect updated interface


174648 16-Dec-2007 kmacy

Move arp update upcall to always be called for ARP replies - previous invocation
would not always get called at the appropriate times


174642 16-Dec-2007 kmacy

Update the toedev's connect interface to reflect the fact that the inpcb
doesn't cache the rtentry in HEAD.


174636 16-Dec-2007 kmacy

Add socket option for setting and retrieving the congestion control algorithm.
The name used is to allow compatibility with Linux.


174623 15-Dec-2007 kmacy

make naming prefixes consistent across tom_info


174569 13-Dec-2007 kmacy

Fix error in previous commit - the style fix changed flag name without
changing references to the flag


174560 12-Dec-2007 kmacy

Fix style issues with initial TCP offload commit

Requested by: rwatson
Submitted by: rwatson


174559 12-Dec-2007 kmacy

add interface for allowing consumers to register for ARP updates,
redirects, and path MTU changes

Reviewed by: silby


174558 12-Dec-2007 kmacy

Add interface for tcp offload to syncache:
- make necessary changes to release offload resources when a syncache
entry is removed before connection establishment
- disable checks for offloaded connection where insufficient information
is available

Reviewed by: silby


174556 12-Dec-2007 kmacy

Add driver independent interface to offload active established TCP connections

Reviewed by: silby


174545 12-Dec-2007 kmacy

Remove spurious timestamp check. RFC 1323 explicitly states that timestamps MAY
be transmitted if negotiated.


174479 09-Dec-2007 dwmalone

If we are walking the IPv6 header chain and we hit an IPPROTO_NONE
header, then don't try to pullup anything, because there is no next
header if we hit IPPROTO_NONE. Set ulp to a non-NULL value so the
search for an upper layer header terminates.

This is based on Pekka's diagnosis, but I chose a simpler fix.

PR: 115261
Submitted by: Pekka Savola <pekkas@netcore.fi>
Reviewed by: mlaier
MFC after: 2 weeks


174388 07-Dec-2007 kmacy

Add padding for anticipated functionality
- vimage
- TOE
- multiq
- host rtentry caching

Rename spare used by 80211 to if_llsoftc

Reviewed by: rwatson, gnn
MFC after: 1 day


174387 07-Dec-2007 rrs

- More fixes for lock misses on the transfer of data to
the sent_queue. Sometimes I wonder why any code
ever works :-)
- Fix the pad of the last mbuf routine, It was working improperly
on non-4 byte aligned chunks which could cause memory overruns.
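
For reference, SCTP chunks are padded to a multiple of 4 bytes. The
amount of padding a chunk needs can be computed as below; this is a
generic sketch with a hypothetical helper name, not the mbuf padding
routine that was fixed.

```python
def sctp_pad_len(chunk_len):
    """Bytes of padding needed to round chunk_len up to a 4-byte boundary."""
    return -chunk_len % 4
```

A 17-byte chunk needs 3 bytes of padding, while a 16-byte chunk needs
none.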

MFC after: 1 week


174348 06-Dec-2007 des

Simpler version of the previous commit.


174323 06-Dec-2007 rrs

- optimize the initialization of the SB max variables.
- Missing lock when sending data and moving it to the
outqueue.
- If a mbuf alloc fails during moving to outqueue the
reassembly of the old mbuf chain was incorrect.
- some_taken becomes a counter in sctputil.c instead of a set to 1.
- Fix a panic to be only under INVARIANTS and have a proper recovery.
- msg_flags needed to be set to the value collected, not or'd.

MFC after: 1 week


174266 04-Dec-2007 rrs

- More fixes for the non-blocking msg send, had the skip of the pre-block
test incorrect.
- Fix the initial buf calculation to be more friendly, calc is the same
but we use different variable to make it easier amongst the different
code versions.

MFC after: 1 week


174258 04-Dec-2007 rrs

- Oops, signedness issue with one of the new var's (this is an issue
mainly in apple but with the right -Wall it could affect us too).

MFC after: 1 week


174257 04-Dec-2007 rrs

- Found a problem in non-blocking sends. When
sending, once the locks are all unlocked to
do the copy's in, its possible that other
events could then raise the number of bytes
outstanding pushing it so not all the message
would fit. This would then cause us to send
only part of the message. This fix makes it
so we keep a "reserved" amount that can be
kept in mind when making calculations to send.
- rcv msg args with a NULL/NULL for to/tolen will return an error incorrectly
for the 1-2-1 model.
- We were not doing 0 len return correctly and not setting cantrcv more
correctly. Previously we "fixed" this area by taking out the socantrcv
since we then could not get the data out. The correct fix is to still
flag the socket but allow a by-pass route to continue to read until
all data is consumed.

MFC after: 1 week


174256 04-Dec-2007 yar

For the sake of convenience, print the name of the network interface
IPv4 address duplication was detected on.

Idea by: marck


174248 04-Dec-2007 silby

Fix SACK negotiation that was broken in rev 1.105.

Before this fix, FreeBSD would negotiate SACK on outgoing
connections, but would always fail to negotiate it on incoming
connections.

Discovered by: James Healy and Lawrence Stewart
Submitted by: James Healy and Lawrence Stewart
MFC after: 3 days


174171 02-Dec-2007 guido

Consider the following situation:
1. A packet comes in that is to be forwarded
2. The destination of the packet is rewritten by some firewall code
3. The next link's MTU is too small
4. The packet has the DF bit set

Then the current code is such that instead of setting the next
link's MTU in the ICMP error, ip_next_mtu() is called and a guess
is sent as to which MTU is supposed to be tried next. This is because
in this case ip_forward() is called with srcrt set to 1. In that
case the ia pointer remains NULL but it is needed to get the MTU
of the interface the packet is to be sent out from.
Thus, we always set ia to the outgoing interface.

MFC after: 2 weeks


174120 30-Nov-2007 bz

Centralize and correct computation of TCP-MD5 signature offset within
the packet (tcp header options field).

Reviewed by: tools/regression/netinet/tcpconnect
MFC after: 3 days
Tested by: Nick Hilliard (see net@)


174119 30-Nov-2007 bz

Move call to tcp_signature_compute() after we adjusted the payload offset
in the tcp header. With relevant parts of the tcp header changing after
the 'signature' was computed, the signature becomes invalid.

Reviewed by: tools/regression/netinet/tcpconnect
MFC after: 3 days
Tested by: Nick Hilliard (see net@)


174023 28-Nov-2007 bz

Let opt be an array. Though &opt[0] == opt == &opt, &opt is highly
confusing and hard to understand so change it to just opt and
remove the extra cast no longer/not needed.

Discussed with: rwatson
MFC after: 3 days


174022 28-Nov-2007 bz

Correctly get the authentication key for TCP-MD5 from the SA.

Submitted by: Nick Hilliard on net@
MFC after: 8 weeks


173884 24-Nov-2007 rwatson

More carefully handle various cases in sysctl_drop(), such as unlocking
the inpcb when there's an inpcb without associated timewait state, and
not unlocking when the inpcb has been freed. This avoids a kernel panic
when tcpdrop(8) is run on a socket in the TIMEWAIT state.

MFC after: 3 days
Reported by: Rako <rako29 at gmail dot com>


173874 23-Nov-2007 jb

Fix strict alias warnings.


173835 21-Nov-2007 bz

Make TSO work with IPSEC compiled into the kernel.

The lookup hurts a bit for connections but had been there anyway
if IPSEC was compiled in. So moving the lookup up a bit gives us
TSO support at no extra cost.

PR: kern/115586
Tested by: gallatin
Discussed with: kmacy
MFC after: 2 months


173771 20-Nov-2007 silby

Comment out the syncache's test which ensures that hosts which negotiate TCP
timestamps in the initial SYN packet actually use them in the rest of the
connection. Unfortunately, during the 7.0 testing cycle users have already
found network devices that violate this constraint.

RFC 1323 states 'and may send a TSopt in other segments' rather than
'and MUST send', so we must allow it.

Discovered by: Rob Zietlow
Tracked down by: Kip Macy
PR: bin/118005


173706 17-Nov-2007 oleg

- New sysctl variable: net.inet.ip.dummynet.io_fast
If it is set to zero (the default), the dummynet module will try to emulate
a real link as closely as possible (bandwidth & latency): a packet will not leave
the pipe faster than it would on a real link with the given bandwidth.
(This is the original behaviour of dummynet, which was altered in the previous commit.)
If it is set to a non-zero value, only bandwidth is enforced: a packet's latency
can be lower compared to a real link with the given bandwidth.

- Document recently introduced dummynet(4) sysctl variables.

Requested by: luigi, julian
MFC after: 3 month


173509 10-Nov-2007 rrs

- Fix a bug in sctp_calc_rwnd() which resulted in wrong rwnd predictions.
- Fix a signedness problem that shows up in some 64 bit platforms (macos).

MFC after: 1 week


173399 06-Nov-2007 oleg

1) dummynet_io() declaration has changed.
2) Alter packet flow inside dummynet: allow certain packets to bypass
dummynet scheduler. Benefits are:

- lower latency: if packet flow does not exceed pipe bandwidth, packets
will not be (up to tick) delayed (due to dummynet's scheduler granularity).
- lower overhead: if packet avoids dummynet scheduler it shouldn't reenter ip
stack later. Such packets can be fastforwarded.
- recursion (which can lead to kernel stack exhaustion) eliminated. This fixes a
long-standing panic, which can be triggered this way:
kldload dummynet
sysctl net.inet.ip.fw.one_pass=0
ipfw pipe 1 config bw 0
for i in `jot 30`; do ipfw add 1 pipe 1 icmp from any to any; done
ping -c 1 localhost

3) Three new sysctl nodes are added:
net.inet.ip.dummynet.io_pkt - packets passed to dummynet
net.inet.ip.dummynet.io_pkt_fast - packets that bypassed the dummynet scheduler
net.inet.ip.dummynet.io_pkt_drop - packets dropped by dummynet

P.S. Above comments are true only for layer 3 packets. Layer 2 packet flow
is not changed yet.

MFC after: 3 month


173398 06-Nov-2007 oleg

style(9) cleanup.

MFC after: 3 month


173179 30-Oct-2007 rrs

- Change the Time Wait of vtags value to match the cookie-life
- Select a tag gains ability to optionally save new tags
off in the timewait system.
- When looking up associations do not give back a stcb that
is in the about-to-be-freed state, and instead continue
looking for other candidates.
- New function to query to see if value is in time-wait.
- Timewait had a time comparison error that caused very
few vtags to actually stay in time-wait.
- When setting tags in time-wait, we now use the time
requested NOT a fixed constant value.
- sstat now gets the proper associd when we do the query.
- When we process an association, we expect the tag chosen
(if we have one from a cookie) to be in time-wait. Before
we would NOT allow the assoc up by checking if its good.
In theory this should have caused almost all assoc not
to come up except for the time-comparison bug above (this
bug was hidden by the time comparison bug :-D).
- Don't save tags for nonce values in the time-wait cache
since these are used only during cookie collisions and do
not matter if they are unique or not.
MFC after: 1 week


173102 28-Oct-2007 rwatson

Continue to move from generic network entry points in the TrustedBSD MAC
Framework by moving from mac_mbuf_create_netlayer() to more specific
entry points for specific network services:

- mac_netinet_firewall_reply() to be used when replying to in-bound TCP
segments in pf and ipfw (etc).

- Rename mac_netinet_icmp_reply() to mac_netinet_icmp_replyinplace() and
add mac_netinet_icmp_reply(), reflecting that in some cases we overwrite
a label in place, but in others we apply the label to a new mbuf.

Obtained from: TrustedBSD Project


173095 28-Oct-2007 rwatson

Move towards more explicit support for various network protocol stacks
in the TrustedBSD MAC Framework:

- Add mac_atalk.c and add explicit entry point mac_netatalk_aarp_send()
for AARP packet labeling, rather than using a generic link layer
entry point.

- Add mac_inet6.c and add explicit entry point mac_netinet6_nd6_send()
for ND6 packet labeling, rather than using a generic link layer entry
point.

- Add explicit entry point mac_netinet_arp_send() for ARP packet
labeling, and mac_netinet_igmp_send() for IGMP packet labeling,
rather than using a generic link layer entry point.

- Remove previous generic link layer entry point,
mac_mbuf_create_linklayer() as it is no longer used.

- Add implementations of new entry points to various policies, largely
by replicating the existing link layer entry point for them; remove
old link layer entry point implementation.

- Make MAC_IFNET_LOCK(), MAC_IFNET_UNLOCK(), and mac_ifnet_mtx global
to the MAC Framework rather than static to mac_net.c as it is now
needed outside of mac_net.c.

Obtained from: TrustedBSD Project


173018 26-Oct-2007 rwatson

Rename 'mac_mbuf_create_from_firewall' to 'mac_netinet_firewall_send' as
we move towards netinet as a pseudo-object for the MAC Framework.

Rename 'mac_create_mbuf_linklayer' to 'mac_mbuf_create_linklayer' to
reflect general object-first ordering preference.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


172970 25-Oct-2007 rwatson

Normalize TCP syncache-related MAC Framework entry points to match most
other entry points in the form mac_<object>_method().

Discussed with: csjp
Obtained from: TrustedBSD Project


172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


172800 19-Oct-2007 rpaulo

Remove IPTOS_CE and IPTOS_ECT constants. They were defined in RFC 2481
but later obsoleted by RFC 3168.
Discussed on freebsd-net with no objections.

Approved by: njl (mentor), rwatson


172795 19-Oct-2007 silby

Pick the smallest possible TCP window scaling factor that will still allow
us to scale up to sb_max, aka kern.ipc.maxsockbuf.

We do this because there are broken firewalls that will corrupt the window
scale option, leading to the other endpoint believing that our advertised
window is unscaled. At scale factors larger than 5 the unscaled window will
drop below 1500 bytes, leading to serious problems when traversing these
broken firewalls.

With the default maxsockbuf of 256K, a scale factor of 3 will be chosen by
this algorithm. Those who choose a larger maxsockbuf should watch out
for the compatibility problems mentioned above.

Reviewed by: andre
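
The selection rule described in r172795 can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed code: the constants follow the RFC 1323 limits (16-bit window field, shift of at most 14), and pick_window_scale() and window_field() are hypothetical helper names.

```c
#include <assert.h>

/* Values as given by RFC 1323: 16-bit window field, shift of at most 14. */
#define TCP_MAXWIN       65535UL
#define TCP_MAX_WINSHIFT 14

/* Smallest shift whose maximum scaled window still reaches sb_max. */
static int
pick_window_scale(unsigned long sb_max)
{
	int scale = 0;

	while (scale < TCP_MAX_WINSHIFT && (TCP_MAXWIN << scale) < sb_max)
		scale++;
	return (scale);
}

/*
 * Value placed in the window field for a given real window.
 * A peer that lost the scale option reads this number as plain bytes.
 */
static unsigned long
window_field(unsigned long window_bytes, int scale)
{
	assert(scale >= 0 && scale <= TCP_MAX_WINSHIFT);
	return (window_bytes >> scale);
}
```

With sb_max = 262144 (256K), shifts 0 through 2 top out at 65535, 131070 and 262140 bytes, all short of 262144, so the loop settles on 3. Assuming a window of 65535 bytes (a figure the commit message does not state), the field carries 2047 at scale 5 and 1023 at scale 6, which is where a peer that ignores the scale would see less than 1500 bytes.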


172703 16-Oct-2007 rrs

- fix sctp_ifn initial refcount issue (prevents deletion)
- fix a bug during cookie collision that prevented an
association from coming up in a specific restart case.
- Fix it so the shutdown-pending flag gets removed (this is
more for correctness than needed) when we enter shutdown-sent
or shutdown-ack-sent states.
- Fix a bug that caused the receiver to sometimes NOT send
a SACK when a duplicate TSN arrived. Without this fix
it was possible for the association to fall down if the
- Deleted primary destination is also stored when SCTP_MOBILITY_BASE.
(Previously, it is stored when only SCTP_MOBILITY_FASTHANDOFF)
- Fix a locking issue where we might call send_initiate_ack() and
incorrectly state the lock held/not held. Also fix it so that
when we release the lock the inp cannot be deleted on us.
- Add the debug option that can cause the stack to panic instead
of aborting an assoc. This does not and should never show up
in options but is useful for debugging unexpected aborts.
- Add cumack_log sent to track sending cumack information for
the debug case where we are running a special log per assoc.
- Added extra () around sctp_sbspace macro to avoid compile warnings.
MFC after: 1 week


172568 12-Oct-2007 kevlo

Spelling fix for interupt -> interrupt


172467 07-Oct-2007 silby

Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by: re (kensmith)


172464 07-Oct-2007 silby

Improve the debugging message:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received data after socket was closed, sending RST and removing tcpcb

So that it also includes how many bytes of data were received. It now looks
like this:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received X bytes of data after socket was closed, sending RST and removing tcpcb

Approved by: re (gnn)


172458 06-Oct-2007 rrs

- Fix the one-2-one model to properly do a socantrecv()
Approved by: re@freeBSD.org (Ken Smith)


172454 05-Oct-2007 rwatson

Disable TCP syncache debug logging by default. While useful in debugging
problems with the syncache, it produces a lot of console noise and has led
to quite a few false positive bug reports. It can be selectively
re-enabled when debugging specific problems by frobbing the same sysctl.

Discussed with: silby
Approved by: re (gnn)


172437 04-Oct-2007 rrs

- We should return error = 0 and the upper processing would
return a zero length read. Otherwise we don't return the
right error indication.

Approved by: re@freebsd.org (gnn)


172396 01-Oct-2007 rrs

- Bug fix managing congestion parameter on immediate
retransmission by handover event (fast mobility code)
- Fixed problem of mobility code which is caused by remaining
parameters in the deleted primary destination.
- Add a missing lock. When a peer sends an INIT, and while we
are processing it to send an INIT-ACK the socket is closed,
we did not hold a lock to keep the socket from going away.
Add protection for this case.
- Fix so that arwnd always uses the minimal rwnd if the user
has set the socket buffer smaller. Found this when the test
org decided to see what happens when you set in a rwnd of 10
bytes (which is not allowed per RFC .. 4k is minimum).
- Fixes so a cookie-echo ootb will NOT cause an abort to
be sent. This was happening in a MPI collision case.
- Examined all panics and unless there was no recovery, moved
any that were not already to INVARIANTS.

Approved by: re@freebsd.org (gnn)


172387 29-Sep-2007 maxim

o For dynamic rules log a parent rule number. Prefix a log message
by 'ipfw: '.

PR: kern/115755
Submitted by: sem
Approved by: re (gnn)
MFC after: 4 weeks


172312 24-Sep-2007 kib

Revert rev. 1.94. After recent tcp backouts, tcp_close() may return NULL.
Check the return value of tcp_close() being NULL before dereferencing it
in #ifdef TCPDEBUG block.

Reviewed by: rwatson
Approved by: re (gnn)


172309 24-Sep-2007 silby

Two changes:

- Reintegrate the ANSI C function declaration change
from tcp_timer.c rev 1.92

- Reorganize the tcpcb structure so that it has a single
pointer to the "tcp_timer" structure which contains all
of the tcp timer callouts. This change means that when
the single tcp timer change is reintegrated, tcpcb will
not change in size, and therefore the ABI between
netstat and the kernel will not change.

Neither of these changes should have any functional
impact.

Reviewed by: bmah, rrs
Approved by: re (bmah)


172307 23-Sep-2007 csjp

Certain consumers of rtalloc like gif(4) and if_stf(4) lookup the
route and once they are done with it, call rtfree(). rtfree() should
only be used when we are certain we hold the last reference to the
route. This bug results in console messages like the following:

rtfree: 0xc40f7000 has 1 refs

This patch switches the rtfree() to use RTFREE_LOCKED() instead,
which should handle the reference counting on the route better.

Approved by: re@ (gnn)
Reviewed by: bms
Reported by: many via net@ and current@
Tested by: many


172266 21-Sep-2007 rrs

- fix (global) address handling in the presence of duplicates, the
last interface should own the address, but the current code
fumbles the handoff. This fixes that.
- move address related debugs to PCB4 and add additional ones to
help in debugging address problems.

Approved by: re@freebsd.org (K Smith)


172218 18-Sep-2007 rrs

- The address lock is changed to a rwlock. This
also involves macro changes to have a RLOCK and a WLOCK
and placing the correct version within the code.
- The INP-INFO lock is changed to a rwlock.
- When sctp_shutdown() is called on Mac OS X, the socket lock is held.
So call sctp_chunk_output with SCTP_SO_LOCKED and
not SCTP_SO_NOT_LOCKED.
- Add SCTP_IPI_ADDR_[RW]LOCK and SCTP_IPI_ADDR_[RW]UNLOCK for Mac OS X.
- u_int64_t -> uint64_t
- add missing addr unlock for error return path
Approved by: re@freebsd.org (K Smith)


172203 16-Sep-2007 rrs

- For the 1-to-1 model, fix an off by one error that
allowed an extra connection over the backlog (by one)
Approved by: re@freebsd.org (B. Mah)


172190 15-Sep-2007 rrs

- Get rid of unused constants for sysctl variables.
- Fix panic from mutex unlock on freed lock when ASCONF-ACK
aborts an assoc
- Fix panic from addr lock recursion when ASCONFs are queued
in the front states
- ASCONFs "queued" in the front states should really be
bundled after the COOKIE-ACK, not in front of it
- Fix issue with addresses deleted in the front states from
being sent with ASCONF(DELETE)-- replaced
sctp_asconf_queue_add_sa() with delete specific function
- Comment change in sctp.h the drafts are now RFC's
Approved by: re@freebsd.org (B Mah)


172157 13-Sep-2007 rrs

- DF bit was on for COOKIE-ECHO chunks. This is
incorrect and should be OFF letting IP fragment
large cookie-echos.
- Rename sysctl variable logging to log_level.
- Fix description of sysctl variable stats.
- Add sysctl variable log to make sctp_log readable via sysctl
mechanism (this is by compile switch and targets non KTR platforms or
when someone wants to do performance wise tracing).
- Removed debug code

Approved by: re@freebsd.org (B Mah)


172156 13-Sep-2007 rrs

- Incorrect error EAGAIN returned for invalid send on a locked
stream (using EEOR mode). Changed to EINVAL (in sctp_output.c)
- Static analysis comments added
- fix in mobility code to return a value (static analysis found).
- sctp6_notify function made visible instead of
static (this is needed for Panda).

Approved by: re@freebsd.org (B Mah)


172137 10-Sep-2007 rrs

- Removed debug code and more C++ style comments in the mobility
code in sctp_asconf.c
Approved by: re@freebsd.org (B Mah)


172118 10-Sep-2007 rrs

- Added some comments to tell where the htcp
code comes from.
- Fix a LOR on Mac OS X: Do not hold an stcb lock when
calling soisconnected for a socket which has the
SS_INCOMP bit set on so_state.
- fix a comment to be non c++ style.

Approved by: re@freebsd.org (B Mah)


172116 10-Sep-2007 kensmith

Make sure that either inp is NULL or we have obtained a lock on it before
jumping to dropunlock to avoid a panic. While here move the calls to
ipsec4_in_reject() and ipsec6_in_reject() so they are after we obtain
the lock on inp.

Original patch to avoid panic: pjd
Review of locking adjustments: gnn, sam
Approved by: re (rwatson)


172114 10-Sep-2007 rwatson

Further UDPv4 cleanup:

- Resort includes a bit.
- Correct typos and wording problems in comments.
- Rename udpcksum to udp_cksum to be consistent with other UDP-related
configuration variables.
- Remove indirection of udp_notify through local notify variable in
udp_ctlinput(), which is presumably due to copying and pasting from TCP,
where multiple notify routines exist.

Approved by: re (kensmith)


172091 08-Sep-2007 rrs

- send call has a reference to uio->uio_resid in
the recent send code, but uio may be NULL on sendfile
calls. Change to use sndlen variable.
- EMSGSIZE is not being returned in non-blocking mode
and needs a small tweak to look if the msg would
ever fit when returning EWOULDBLOCK.
- FWD-TSN has a bug in stream processing which could
cause a panic. This is a follow on to the codenomicon
fix.
- PDAPI level 1 and 2 do not work unless the reader
gets his returned buffer full. Fix so we can break
out when at level 1 or 2.
- Fix fast-handoff features to copy across properly on
accepted sockets
- Fix sctp_peeloff() system call when no true system call
exists to screen arguments for errors. In cases where a
real system call exists the system call itself does this.
- Fix raddr leak in recent add-ip code change for bundled
asconfs (even when non-bundled asconfs are received)
- Make sure ipi_addr lock is held when walking global addr
list. Need to change this lock type to a rwlock().
- Add don't wake flag on both input and output when the
socket is closing.
- When deleting an address verify the interface is correct
before allowing the delete to process. This protects panda
and unnumbered.
- Clean up old sysctl stuff and get rid of the old Open/Net
BSD structures.
- Add a function to watch the ranges in the sysctl sets.
- When appending in the reassembly queue, validate that
the assoc has not gone to about to be freed. If so
(in the middle) abort out. Note this especially affects
MAC I think due to the lock/unlock they do (or with
LOCK testing in place).
- Netstat patch to get rid of warnings.
- Make sure that no data gets queued to inactive/unconfirmed
destinations. This especially affects CMT but also has an
impact on regular SCTP as well.
- During init collision when we detect seq number out
of sync we need to treat it like Case C and discard
the cookie (no invariant needed here).
- Atomic access to the random store.
- When we declare a vtag good, we need to shove it
into the time wait hash to prevent further use. When
the tag is put into the assoc hash, we need to remove it
from the twait hash (where it will surely be). This prevents
duplicate tag assignments.
- Move decr-ref count to better protect sysctl out of
data.
- ltrace error corrections in sctp6_usrreq.c
- Add hook for interface up/down to be sent to us.
- Make sysctl() exported structures independent of processor
architecture.
- Fix route and src addr cache clearing for delete address case.
- Make sure address marked SCTP_DEL_IP_ADDRESS is never selected
as src addr.
- in icmp handling fixed so we actually look at the icmp codes
to figure out what to do.
- Modified mobility code.
Reception of DELETE IP ADDRESS for a primary destination and
SET PRIMARY for a new primary destination is used for
retransmission trigger to the new primary destination.
Also, in this case, destination of chunks in send_queue are
changed to the new primary destination.
- Fix so that we disallow sending by mbuf to ever have EEOR
mode set upon it.

Approved by: re@freebsd.org (B Mah)


172090 08-Sep-2007 rrs

- Locking compatibility changes. This involves adding
additional flags to many function calls. The flags only
get used in BSD when we compile with lock testing. These
flags allow apple to escape the "giant" lock it holds on
the socket and have more fine-grained locking in the NKE.
It also allows us to test (with witness) the locking used
by apple via a compile switch (manually applied).

Approved by: re@freebsd.org(B Mah)


172074 07-Sep-2007 rwatson

Back out tcp_timer.c:1.93 and associated changes that reimplemented the many
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change. This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).

In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI. The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.

This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.

Approved by: re (kensmith)
Submitted by: silby
Tested by: rwatson, kris
Problems reported by: peter, kris, others


172006 29-Aug-2007 green

Repair ALTQ-tagging rules in IPFW which got broken in the last PF
import. The PF mbuf-tagging support routines changed to link the
allocated tags into the provided mbuf themselves, so the left-over
m_tag_prepend() was trying to add a bogus (usually NULL) tag.

Reviewed by: mlaier
Approved by: re


171990 27-Aug-2007 rrs

- During shutdown pending, when the last sack came in and
the last message on the send stream was "null" but still
there, a state we allow, we could get hung and not clean
it up and wait for the shutdown guard timer to clear the
association without a graceful close. Fix this so that
we properly clean up.
- Added support for Multiple ASCONF per new RFC. We only
(so far) accept input of these and cannot yet generate
a multi-asconf.
- Sysctl'd support for experimental Fast Handover feature. Always
disabled unless sysctl or socket option changes to enable.
- Error case in add-ip where the peer supports AUTH and ADD-IP
but does NOT require AUTH of ASCONF/ASCONF-ACK. We need to
ABORT in this case.
- According to the Kyoto summit of socket api developers
(Solaris, Linux, BSD). We need to have:
o non-eeor mode messages be atomic - Fixed
o Allow implicit setup of an assoc in 1-2-1 model if
using the sctp_**() send calls - Fixed
o Get rid of HAVE_XXX declarations - Done
o add a sctp_pr_policy in hole in sndrcvinfo structure - Done
o add a PR_SCTP_POLICY_VALID type flag - yet to-do in a future patch!
- Optimize sctp6 calls to reuse code in sctp_usrreq. Also optimize
when we close sending out the data and disabling Nagle.
- Change key concatenation order to match the auth RFC
- When sending OOTB shutdown_complete always do csum.
- Don't send PKT-DROP to a PKT-DROP
- For abort chunks just always checksums same for
shutdown-complete.
- inpcb_free front state had a bug where in queue
data could wedge an assoc. We need to just abandon
ones in front states (free_assoc).
- If a peer sends us a 64k abort, we would try to
assemble a response packet which may be larger than
64k. This then would be dropped by IP. Instead make
a "minimum" size for us 64k-2k (we want at least
2k for our initack). If we receive such an init
discard it early without all the processing.
- When we peel off we must increment the tcb ref count
to keep it from being freed from underneath us.
- handling fwd-tsn had bugs that caused memory overwrites
when given faulty data, fixed so can't happen and we
also stop at the first bad stream no.
- Fixed so comm-up generates the adaption indication.
- peeloff did not get the hmac params copied.
- fix it so we lock the addr list when doing src-addr selection
(in future we need to use a multi-reader/one writer lock here)
- During lowlevel output, we could end up with a _l_addr set
to null if the iterator is calling the output routine. This
means we would possibly crash when we gather the MTU info.
Fix so we only do the gather where we have a src address
cached.
- we need to be sure to set abort flag on conn state when
we receive an abort.
- peeloff could leak a socket. Moved code so the close will
find the socket if the peeloff fails (uipc_syscalls.c)

Approved by: re@freebsd.org(Ken Smith)


171989 26-Aug-2007 maxim

o Fix bug I introduced in the previous commit (ipfw set extension):
pack a set number correctly.

Submitted by: oleg

o Plug a memory leak.

Submitted by: oleg and Andrey V. Elsukov
Approved by: re (kensmith)
MFC after: 1 week


171943 24-Aug-2007 rrs

- Fix address add handling to clear cached routes and source addresses
when peer acks the add in case the routing table changes.
- Fix sctp_lower_sosend to send shutdown chunk for mbuf send
case when sndlen = 0 and sinfoflag = SCTP_EOF
- Fix sctp_lower_sosend for SCTP_ABORT mbuf send case with null data,
So that it does not send the "null" data mbuf out and cause
it to get freed twice.
- Fix so auto-asconf sysctl actually affects the socket's asconf state.
- Do not allow SCTP_AUTO_ASCONF option to be used on subset bound sockets.
- Memset bug in sctp_output.c (arguments were reversed),
found and reported by Dave Jones (davej@codemonkey.org.uk).
- PD-API point needs to be invoked >= not just > to conform to socket api
draft this fixes sctp_indata.c in the two places need to be >=.
- move M_NOTIFICATION to use M_PROTO5.
- PEER_ADDR_PARAMS did not fail properly if you specify an address
that is not in the association with a valid assoc_id. This meant
you got or set the stcb level values instead of the destination
you thought you were going to get/set. Now validate if the
stcb is non-null and the net is NULL that the sa_family is
set and the address is unspecified otherwise return an error.
- The thread based iterator could crash if associations were freed
at the exact time it was running. rework the worker thread to
use the increment/decrement to prevent this and no longer use
the markers that the timer based iterator uses.
- Fix the memleak in sctp_add_addr_to_vrf() for the case when it is
detected that ifa is already pointing to a ifn.
- Fix it so that if someone is so insane that they drop the
send window below the minimal add mark, they still can send.
- Changed all state for associations to use mask safe macro.
- During front states in association freeing in sctp_inpcbfree, we
had a locking problem where locks were not in place where they
should have been.
- Free association calls were not testing the return value in
sctp_inpcb_free() properly... others should be cast void returns
where we don't care about the return value.
- If a reference count is held on an assoc, even from the "force free"
we should not do the actual free.. but instead let the timer
free it.
- When we enter sctp_input(), if the SCTP_ASOC_ABOUT_TO_BE_FREED
flag is set, we must NOT process the packet but handle it like
ootb. This is because while freeing an assoc we release the
locks to get all the higher order locks so we can purge all
the hash tables. This leaves a hole if a packet comes in
just at that point. Now sctp_common_input_processing() will
call the ootb code in such a case.
- Change MBUF M_NOTIFICATION to use M_PROTO5 (per Sam L). This makes
it so we don't have a conflict (I think this is a Coverity change).
We made this change AFTER some conversation and looking to make sure
that M_PROTO5 does not have a problem between SCTP and the 802.11
stuff (which is the only other place its used).
- Fixed lock order reversal and missing atomic protection around
locked_tcb during association lookup and the 1-2-1 model.
- Added debug to source address selection.
- V6 output must always do checksum even for loopback.
- Remove more locks around inp that are not needed for an atomically
added/subtracted ref count.
- slight optimization in the way we zero the array in sctp_sack_check()
- It was possible to respond to a ABORT() with bad checksum with
a PKT-DROP. This led to a PKT-DROP/ABORT war. Add code to NOT
send a PKT-DROP to any ABORT().
- Add an option for local logging (useful for macintosh or when
you need better performance during debugging). Note no commands
are here to get the log info, you must just use kgdb.
- The timer code needs to be aware of if it needs to call
sctp_sack_check() to slide the maps and adjust the cum-ack.
This is because it may be out of sync cum-ack wise.
- Added threshold management logging.
- If the user picked just the right size, that just filled the send
window minus one mtu, we would enter a forever loop not copying and
at the same time not blocking. Change from < to <= solves this.
- Sysctl added to control the fragment interleave level which defaults
to 1.
- My rwnd control was not being used to control the rwnd properly (we
did not add and subtract to it :-() this is now fixed so we handle
small messages (1 byte etc) better to bring our rwnd down more
slowly.

Approved by: re@freebsd.org (Bruce Mah)


171858 16-Aug-2007 rrs

- Remove extra comment for 7.0 (no GIANT here).
- Remove unneeded WLOCK/UNLOCK of inp for getting TCB lock.
- Fix panic that may occur when freeing an assoc that has partial
delivery in progress (may dereference null socket pointer when
queuing partial delivery aborted notification)
- Some spacing and comment fixes.
- Fix address add handling to clear cached routes and source addresses
when peer acks the add in case the routing table changes.
Approved by: re@freebsd.org (Bruce Mah)


171857 16-Aug-2007 qingli

Use the sequence number comparison macro to compare
projected_offset against isn_offset to account for
wrap around.

Reviewed by: gnn, kmacy, silby
Submitted by: yusheng.huang@bluecoat.com
Approved by: re
MFC: 3 days
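
Why r171857 needs a sequence number comparison rather than a plain < can be sketched as below. The macros are written in the style of SEQ_LT()/SEQ_GT() from netinet/tcp_seq.h but are an approximation, not a copy; the helper functions and the sample values are hypothetical, and the cast to int32_t assumes the usual two's complement conversion.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wraparound-safe comparisons: the sign of the 32-bit difference
 * decides the order, so values on either side of 2^32 compare sanely.
 */
#define SEQ_LT(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)
#define SEQ_GT(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)

/* Plain unsigned comparison, wrong once the counter wraps. */
static int
naive_less(uint32_t a, uint32_t b)
{
	return (a < b);
}

/* Comparison that accounts for wraparound. */
static int
seq_less(uint32_t a, uint32_t b)
{
	return (SEQ_LT(a, b));
}

/* An offset just before the wrap versus one 32 steps later. */
static void
check_wraparound(void)
{
	uint32_t before = 0xfffffff0U;
	uint32_t after = before + 0x20U;   /* wraps to 0x10 */

	assert(after == 0x10U);
	assert(!naive_less(before, after)); /* plain < gets it wrong */
	assert(seq_less(before, after));    /* difference is -32 */
	assert(SEQ_GT(after, before));
}
```

The comparison is only meaningful while the two values are less than 2^31 apart, which is the usual assumption for sequence-space arithmetic.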


171746 06-Aug-2007 csjp

Over the past couple of years, there have been a number of reports relating
the use of divert sockets to dead locks. A number of LORs have been reported
between divert and a number of other network subsystems including: IPSEC, Pfil,
multicast, ipfw and others. Other dead locks could occur because of recursive
entry into the IP stack. This change should take care of most if not all of
these issues.

A summary of the changes follows:

- We disallow multicast operations on divert sockets. It really doesn't make
semantic sense to allow this, since typically you would set multicast
parameters on multicast end points.

NOTE: As a part of this change, we actually dis-allow multicast options on
any socket that IS a divert socket OR IS NOT a SOCK_RAW or SOCK_DGRAM family

- We check to see if there are any socket options that have been specified on
the socket, and if there were (which is very uncommon and also probably
doesn't make sense to support) we duplicate the mbuf carrying the options.

- We then drop the INP/INFO locks over the call to ip_output(). It should be
noted that since we no longer support multicast operations on divert sockets
and we have duplicated any socket options, we no longer need the reference
to the pcb to be coherent.

- Finally, we replaced the call to ip_input() to use netisr queuing. This
should remove the recursive entry into the IP stack from divert.

By dropping the locks over the call to ip_output() we eliminate all the lock
ordering issues above. By switching over to netisr on the inbound path,
we can no longer recursively enter the ip_input() code via divert.

I have tested this change by using the following command:

ipfwpcap -r 8000 - | tcpdump -r - -nn -v

This should exercise the input and re-injection (outbound) path, which is
very similar to the work load performed by natd(8). Additionally, I have
run some ospf daemons which have a heavy reliance on raw sockets and
multicast.

Approved by: re@ (kensmith)
MFC after: 1 month
LOR: 163
LOR: 181
LOR: 202
LOR: 203
Discussed with: julian, andre et al (on freebsd-net)
In collaboration with: bms [1], rwatson [2]

[1] bms helped out with the multicast decisions
[2] rwatson submitted the original netisr patches and came up with some
of the original ideas on how to combat this issue.


171745 06-Aug-2007 rrs

- change number assignments for SHA224-512 (match artisync
for bakeoff.. using the next sequential ones)
- In cookie processing 1-2-1, we did not increment the stcb
refcnt before releasing the tcb lock. We need to do this
to keep the tcb from being freed by an abort or the like (unlikely,
but worth doing). Also get rid of unneeded INP_WLOCK.
- extra receive info included the rcvinfo which killed the
padding/alignment. We now redefine all the fields properly
so that both align properly to 128 bytes.
- A peeled off socket would not close without an error due to
its misguided idea that sctp_disconnect() was not supported
on it. This fixes it so it goes through the proper path.
- When an assoc was being deleted after abort (via a timer) a
small race condition exists where we might take a packet for
the old assoc (since we are waiting for a cleanup timer). This
state especially happens in mac. We now add a state in the asoc
so these can properly handle the packet as OOTB.
Approved by: re@freebsd.org(Ken Smith)


171744 06-Aug-2007 rwatson

Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminating
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.

Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)


171732 05-Aug-2007 bz

Rename option IPSEC_FILTERGIF to IPSEC_FILTERTUNNEL.
Also rename the related functions in a similar way.
There are no functional changes.

For a packet coming in with IPsec tunnel mode, the default is
to only call into the firewall with the "outer" IP header and
payload.

With this option turned on, in addition to the "outer" parts,
the "inner" IP header and payload are passed to the
firewall too when going through ip_input() the second time.

The option was never only related to a gif(4) tunnel within
an IPsec tunnel and thus the name was very misleading.

Discussed at: BSDCan 2007
Best new name suggested by: rwatson
Reviewed by: rwatson
Approved by: re (bmah)


171677 31-Jul-2007 peter

Change TCPTV_MIN to be independent of HZ. While it was documented to
be in ticks "for algorithm stability" when originally committed, it turns
out that it has a significant impact in timing out connections. When we
changed HZ from 100 to 1000, this had a big effect on reducing the time
before dropping connections.

To demonstrate, boot with kern.hz=100. ssh to a box on local ethernet
and establish a reliable round-trip-time (ie: type a few commands).
Then unplug the ethernet and press a key. Time how long it takes to
drop the connection.

The old behavior (with hz=100) caused the connection to typically drop
between 90 and 110 seconds of getting no response.

Now boot with kern.hz=1000 (default). The same test causes the ssh session
to drop after just 9-10 seconds. This is a big deal on a wifi connection.

With kern.hz=1000, change sysctl net.inet.tcp.rexmit_min from 3 to 30.
Note how it behaves the same as when HZ was 100. Also, note that when
booting with hz=100, net.inet.tcp.rexmit_min *used* to be 30.

This commit changes TCPTV_MIN to be scaled with hz. rexmit_min should
always be about 30. If you set hz to Really Slow(TM), there is a safety
feature to prevent a value of 0 being used.

This may be revised in the future, but for the time being, it restores the
old, pre-hz=1000 behavior, which is significantly less annoying.

As a workaround, to avoid rebooting or rebuilding a kernel, you can run
"sysctl net.inet.tcp.rexmit_min=30" and add "net.inet.tcp.rexmit_min=30"
to /etc/sysctl.conf. This is safe to run from 6.0 onwards.

Approved by: re (rwatson)
Reviewed by: andre, silby
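
The scaling described above can be sketched as follows. This is an illustration only, not the committed code: the helper name rexmit_min_ticks() and the REXMIT_MIN_MS constant are hypothetical, and the 30 ms target is taken from the "about 30" figure in the message.

```c
#include <assert.h>

/* Assumed target for the minimum retransmit timeout, in milliseconds. */
#define REXMIT_MIN_MS	30

/*
 * Hypothetical helper: convert the millisecond floor into clock ticks
 * for a given hz, so the wall-clock value stays the same at any hz.
 */
static int
rexmit_min_ticks(int hz)
{
	int ticks;

	assert(hz > 0);
	ticks = (hz * REXMIT_MIN_MS) / 1000;
	/* Safety net for a very low hz: never hand back 0 ticks. */
	if (ticks < 1)
		ticks = 1;
	return (ticks);
}
```

With hz=100 this yields 3 ticks and with hz=1000 it yields 30 ticks, both roughly 30 ms of wall-clock time.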


171656 30-Jul-2007 des

Make tcpstates[] static, and make sure TCPSTATES is defined before
<netinet/tcp_fsm.h> is included into any compilation unit that needs
tcpstates[]. Also remove incorrect extern declarations and TCPDEBUG
conditionals. This allows kernels both with and without TCPDEBUG to
build, and unbreaks the tinderbox.

Approved by: re (rwatson)


171652 29-Jul-2007 bmah

Fix a typo in a log message: s/Reveived/Received/.

Approved by: re (rwatson)


171648 29-Jul-2007 mjacob

Fix compilation problems- tcpstates is only available if TCPDEBUG
is set.

Approved by: re (in spirit)


171643 28-Jul-2007 silby

Fix a panic introduced in rev 1.126.

Approved by: re (rwatson)


171640 28-Jul-2007 andre

Provide a sysctl to toggle reporting of TCP debug logging:

net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl net.inet.tcp.log_in_vain
uses the same logging function as log_debug. Enabling the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

"ignored" means a segment contains invalid flags/information and
is dropped without changing state or issuing a reply.

"rejected" means a segment contains invalid flags/information but
is causing a reply (usually RST) and may cause a state change.

Approved by: re (rwatson)


171639 28-Jul-2007 andre

o Move setting/resetting logic of syncache timer from macro
SYNCACHE_TIMEOUT to new function syncache_timeout().
o Fix inverted timeout callout engagement logic to actually
enable the timer for the bucket row. Before SYN|ACK was
not retransmitted.
o Simplify SYN|ACK retransmit timeout backoff calculation.
o Improve logging of retransmit and timeout events.
o Reset timeout when duplicate SYN arrives.
o Add comments.
o Rearrange SYN cookie statistics counting.

Bug found by: silby
Submitted by: silby (different version)
Approved by: re (rwatson)


171638 28-Jul-2007 andre

o Move all detailed checks for RST in LISTEN state from tcp_input() to
syncache_rst().
o Fix tests for flag combinations of RST and SYN, ACK, FIN. Before
a RST for a connection in syncache did not properly free the entry.
o Add more detailed logging.

Approved by: re (rwatson)


171637 28-Jul-2007 rwatson

Replace references to NET_CALLOUT_MPSAFE with CALLOUT_MPSAFE, and remove
definition of NET_CALLOUT_MPSAFE, which is no longer required now that
debug.mpsafenet has been removed.

The once over: bz
Approved by: re (kensmith)


171605 27-Jul-2007 silby

Export the contents of the syncache to netstat.

Approved by: re (kensmith)
MFC after: 2 weeks


171591 25-Jul-2007 andre

Fix comments in tcp_do_segment().

Approved by: re (kensmith)


171572 24-Jul-2007 rrs

- take out a needless panic under invariants for sctp_output.c
- Fix addrs's error checking of sctp_sendx(3) when addrcnt is less than
SCTP_SMALL_IOVEC_SIZE
- re-add back inpcb_bind local address check bypass capability
- Fix it so sctp_opt_info is independent of assoc_id position.
- Fix cookie life set to use MSEC_TO_TICKS() macro.
- asconf changes
o More comment changes/clarifications related to the old local address
"not" list which is now an explicit restricted list.

o Rename some functions for clarity:
- sctp_add/del_local_addr_assoc to xxx_local_addr_restricted()
- asconf related iterator functions to sctp_asconf_iterator_xxx()

o Fix bug when the same address is deleted and added (and removed from
the asconf queue) where the ifa is "freed" twice refcount wise,
possibly freeing it completely.

o Fix bug in output where the first ASCONF would not go out after the
last address is changed (e.g. only goes out when retransmitted).

o Fix bug where multiple ASCONFs can be bundled in the same packet
and with the same serial numbers.

o Fix asconf stcb iterator to not send ASCONF until after all work
queue entries have been processed.

o Change behavior so that when the last address is deleted (auto asconf
on a bound all endpoint) no action is taken until an address is
added; at that time, an ASCONF add+delete is sent (if the assoc
is still up).

o Fix local address counting so that address scoping is taken into
account.

o #ifdef SCTP_TIMER_BASED_ASCONF the old timer triggered sending
of ASCONF (after an RTO). The default now is to send
ASCONF immediately (except for the case of changing/deleting the
last usable address).
Approved by: re(ken smith)@freebsd.org


171531 21-Jul-2007 rrs

- remove duplicate code from sctp_asconf.c
- remove duplicate #include <sys/priv.h> that is not under
#ifdef FreeBSD version to allow compile on 6.1
- static analysis changes per the cisco SA tool including:
o some SA_IGNORE comments
o some checks for NULL before unlock.
o type corrections int -> size_t
- Fix it so sctp_alloc_asoc takes a thread/proc argument. Without this
we pass a NULL in to bind on implicit assoc setup and crash :-(
Approved by: re@freebsd.org(Ken Smith)


171508 19-Jul-2007 rwatson

Attempt to improve feature parity between UDPv4 and UDPv6 by merging
UDPv4 features to UDPv6:

- Add MAC checks on delivery and MAC labeling on transmit.
- Check for (and reject) datagrams with destination port 0.
- For multicast delivery, check the source port only if the socket being
considered as a destination has been connected.
- Implement UDP blackholing based on net.inet.udp.blackhole.
- Add a new ICMPv6 unreachable reply rate limiting category for failed
delivery attempts and implement rate limiting for UDPv6 (submitted by
bz).

Approved by: re (kensmith)
Reviewed by: bz


171477 17-Jul-2007 rrs

- added pre-checks to the bindx call.
- use proper tick gathering macro instead of ticks directly.
- Placed reasonable boundaries on sets that a user can do
that are converted to ticks from ms.
- Fix CMT_PF to always check to be sure CMT is on.
- Fix ticks use of CMT_PF.
- put back code to allow asconfs to be queued while INITs are in flight
and before the assoc is established.
- During window probes, an ack'd packet might be left with the window
probe mark on it causing it to be retransmitted. Change so that
the flight decrease macro clears the window_probe mark.
- Additional logging flight size/reading and ASOC LOG. This
is only enabled if you manually insert things into opt_sctp.h
since its a set of debug code only.
- Found an interesting SMP race in the way data was appended which
could cause a reader to lose a part of a message, had to
reorder when we marked the message was complete to after
the data was appended.
- bug in ADD-IP for the subset bound socket case when the peer has only
one address
- fix ASCONF implicit success/error handling case
- proper support of jails in Freebsd 6>
- copy out the timeval for the 64 bit sparc world on cookie-echo
(alignment error crashes without this).
Approved by: re(Ken Smith)


171440 14-Jul-2007 rrs

- Modular congestion control, with RFC2581 being the default.
- CMT_PF states added (w/sysctl to turn the PF version on)
- sctp_input.c had a missing incr of cookie case when the
auth was bad. This meant a free was called without an
increment to refcnt, added increment like rest of code.
- There was a case, unlikely, when the scope of the destination
changed (this is a TSNH case). In that case, it would not free
the alloc'ed asoc (in sctp_input.c).
- When listed addresses found a colliding cookie/Init, then
the collided upon tcb was not unlocked in sctp_pcb.c
- Add error checking on arguments of sctp_sendx(3) to prevent it from
referencing a NULL pointer.
- Fix an error return of sctp_sendx(3), it was returning
ENOMEM not -1.
- Get assoc id was changed to use the sanctified socket api
method for getting an assoc id (PEER_ADDR_INFO instead of
PEER_ADDR_PARAMS).
- Fix it so a peeled off socket will get a proper error return
if it tries to send to a different address than it is connected to.
- Fix so that select_a_stream can avoid an endless loop that
could hang a caller.
- time_entered (state set time) was not being set in all cases
to the time we went established.
Approved by: re(ken smith)


171339 10-Jul-2007 rwatson

Further cleanup of UDPv4:

- Move udp_sendspace and udp_recvspace global variables and associated
sysctls to the top of the file where most other such things are present.

- Rename static variable 'blackhole' to 'udp_blackhole' and unstaticize
so that we can add blackhole support for UDPv6 using the same MIB
variable.

- Move udp_append() above udp_input() to match the function order in
udp6_usrreq.c.

Approved by: re (kensmith)


171317 09-Jul-2007 bms

Fix a regression in IPv4 multicast join path (IP_ADD_MEMBERSHIP).

With the in_mcast.c code, if an interface for an IPv4 multicast join was
not specified, and a route did not exist for the specified group in the
unicast forwarding tables, the join would be rejected with the error
EADDRNOTAVAIL.
This change restores the old behaviour whereby if no interface is specified,
and no route exists for the group destination, the IPv4 address list is
walked to find a non-loopback, multicast-capable interface to satisfy
the join request.
This should resolve problems with starting multicast services during
system boot or when a default forwarding entry does not exist.

Approved by: re (rwatson)


171290 07-Jul-2007 rwatson

Minor UDPv4 cleanup: capitalize comment, move statistics update after mbuf
free to be consistent with other error handling, and release socket buffer
lock before freeing mbufs and statistics updates rather than after.

Approved by: re (kensmith)


171230 05-Jul-2007 peter

Fix a second warning, introduced by my last "fix". I committed the wrong
diff from the wrong machine.

Pointy hat to: peter
Approved by: re (rwatson - blanket, several days ago)


171229 05-Jul-2007 peter

Fix cast-qualifiers warning when INET6 is not present

Approved by: re (rwatson)


171173 03-Jul-2007 mlaier

Link pf 4.1 to the build:
- move ftp-proxy from libexec to usr.sbin
- add tftp-proxy
- new altq mtag link

Approved by: re (kensmith)


171167 03-Jul-2007 gnn

Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing


171158 02-Jul-2007 rrs

- Consolidate the code that free's chunks to actually also
call the sctp_free_remote_address() function.
- Assure that when we allocate a chunk the whoTo is NULL,
also when we free it and place it into the cache we NULL
it (that way the consolidation code will always work).
- Fix a small race, when a empty data holder is left on the stream
out queue, and both sides do a shutdown, the empty data holder
would prevent us from sending a SHUTDOWN-ACK and at the same time we
never would cleanup the empty holder (since nothing was ever in queue).
We now add a utility function that a) cleans up empty holders and
b) properly determines if there are still pending data chunks on
the stream out wheel.
Approved by: re@freebsd.org (Ken Smith)


171157 02-Jul-2007 rwatson

Continue pre-7.0 privilege cleanup: update suser(9) comments to be priv(9)
comments.

Approved by: re (bmah)


171139 01-Jul-2007 gnn

Fix a dangling netinet6 to netipsec transition for SCTP include files.

Approved by: re


171133 01-Jul-2007 gnn

Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing


171088 29-Jun-2007 rrs

- When a SCTP socket is closed, but the last data
SACK is lost, we would incorrectly abort the association
instead of retransmitting the SACK.
Approved by: re@freebsd.org (Ken Smith)


171032 25-Jun-2007 rrs

- Update bindx address checking to properly screen out address
per the socket api, adding port validation. We allow port 0
or the already bound port number and no others.

Approved by: re@freebsd.org (Ken Smith)


170994 22-Jun-2007 rrs

- Fix type casts in calling sctp_m_getptr, it expects a int not
an unsigned (returned by sizeof) also add cast to comparison check
for size bounds.
Approved by: re(bmah@freebsd.org)


170992 22-Jun-2007 rrs

- Fix stream reset so it limits the number of streams that can be listed
- Fix fwd-tsn to use proper accessor so it does not overrun mbufs
- Fix stream reset error reporting to actually work (it has always been
broken if the peer rejects a stream reset)
- Some 64 bit friendly changes

Approved by: re(bmah@freebsd.org)


170943 18-Jun-2007 rrs

- Two more static analysis bugs found by cisco's tool on a subsequent
run.


170931 18-Jun-2007 rrs

- Fixes static issues found by cisco sa tool (missing frees and such
on error legs)
- align sctp_sockstore to 64 bit boundary ..


170923 18-Jun-2007 maxim

o Make ipfw set more robust -- now it is possible:
- to show a specific set: ipfw set 3 show
- to delete rules from the set: ipfw set 9 delete 100 200 300
- to flush the set: ipfw set 4 flush
- to reset rules counters in the set: ipfw set 1 zero

PR: kern/113388
Submitted by: Andrey V. Elsukov
Approved by: re (kensmith)
MFC after: 6 weeks


170921 18-Jun-2007 rrs

Add additional logging level mask for packet_logging too.


170899 17-Jun-2007 rrs

- The packet log needs to copy all of the buffer not to the end.


170894 17-Jun-2007 rrs

Back out last change to inpcb_free. Turns out we need
to hold off freeing if there is data pending ... someone
might do send/close. Which means we want the data to
go and then close it after startup. Added comments to
the code as well to note that this is done for a reason.


170861 17-Jun-2007 mjacob

Make gcc4.2 happy and zero save_ip for the unlikely (blackhole != 0)
codepath.


170859 17-Jun-2007 rrs

- For sctp_input/sctp6_input add announcement when a packet arrives (debug)
- re-factor the packet drop in sctp_output a bit more, we don't need the
trim after all, but the size calc is now corrected.
- When a assoc is in the COOKIE-ECHO/COOKIE-WAIT state and the user
closes, it should not matter if data is queued, the assoc should be
purged.
- In error leg a missing free_chunk when iph comes in NULL (should not
happen but just in case).


170856 17-Jun-2007 mjacob

Replace incorrect local OFFSET_OF macro with the correct and generic
offsetof macro.
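
For reference, a small sketch of the standard offsetof macro from <stddef.h>. The structure and helper below are hypothetical and exist only to illustrate the macro; they are not the code touched by this change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure, used only to illustrate offsetof(). */
struct example_hdr {
	char	tag;
	int	length;
	char	payload[8];
};

/* The first member of a structure always sits at offset 0. */
static_assert(offsetof(struct example_hdr, tag) == 0,
    "first member sits at offset 0");

/* Byte offset of the payload member, padding included. */
static size_t
example_payload_offset(void)
{
	return (offsetof(struct example_hdr, payload));
}
```

Because the compiler computes the offset, padding and alignment are accounted for without hand-written pointer arithmetic.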


170855 17-Jun-2007 mjacob

Simplification to quiet a gcc4.2 warning. Just by setting match.s_addr
to nonzero you fulfill the same function as the variable 'cmp'. so you
might as well zero match and test against it later.

Reviewed by: timeout on review request


170824 16-Jun-2007 rrs

- Better handle sending large pkt-drops. We were not trimming
the data with m_adj if a large pkt arrived with a bad csum;
some systems can't handle you not trimming the tail (think panda :-D)


170814 16-Jun-2007 rrs

- Raise max range of sctp_logging sysctl so panda does not disallow
us to turn on logging levels.


170806 16-Jun-2007 rrs

- Matthew's changes to get inlines out, plus a few of my own
to deal with the VRF inline function -> becomes a macro now.
Submitted by: Matthew Jacobs


170800 15-Jun-2007 mjacob

Garbage collect some debug code that not only no longer could
work but in fact probably causes random pointer dereferences.
Garbage collect the tp variable too.


170791 15-Jun-2007 rrs

Name change SCTP_KTR_SUBSYS -> KTR_SCTP


170790 15-Jun-2007 rrs

Remove extraneous extern (it's gotten from sctp_sysctl.h)


170788 15-Jun-2007 rrs

When removing a stream from the output-stream-wheel, if it's the
first stream we saw we must update the starting point in the
wheel, else we may loop in an endless loop.


170786 15-Jun-2007 rrs

- Update the comment lines in sctp_input.c
- We need to init the INP_LOCK since otherwise for
non-SMP kernels you crash when you set the TOS.


170785 15-Jun-2007 bms

Stub out imported IGMPv3 definitions which clash with those of
the XORP router; the IGMPv3 definitions will be updated at a later
point in time when IGMPv3/MLDv2 support is fully merged.


170781 15-Jun-2007 rrs

- Issue one, new stack reduction left packet_drop handling still
thinking it had the whole chunk. This could cause a crash if
a large packet drop came in. Fixed by adjusting the trunc length
down to the limit.
- Large sacks with lots of segments could also have same issue. Changed
duplicate and segment handling to use proper get_m_ptr function to
pull each block from mbuf chains.


170751 15-Jun-2007 rrs

- Add VRF id to sctp_ifa structure, needed mainly in panda but useful
during deletes of ifa's in diff VRF's when applicable.


170747 15-Jun-2007 rrs

KTR_GEN -> KTR_SUBSYS (for Kris).


170744 14-Jun-2007 rrs

- Fix so ifn's are properly deleted when the ref count goes to 0.
- Fix so VRF's will clean themselves up when no references are around.
- Allow sctp_ifa to be passed into inpcb_bind, addr_mgmt_ep_sa to bypass
normal validation checks.
- turn auto-asconf off for subset bound sockets
- Moves all logging to use KTR. This gets rid of most
of the logging #ifdef's with a few exceptions reducing
the number of config options for SCTP.


170665 13-Jun-2007 rrs

- fix bindx to check addresses against socket's protocol family


170664 13-Jun-2007 rwatson

Remove IPX over IP tunneling support, which allows IPX routing over IP
tunnels, and was not MPSAFE. The code can be easily restored in the
event that someone with an IPX over IP tunnel configuration can work
with me to test patches.

This removes one of five remaining consumers of NET_NEEDS_GIANT.

Approved by: re (kensmith)


170642 13-Jun-2007 rrs

- Fixed cookie handling to calc an RTO when
it's an INIT collision case.
- Fixed RTO calc to maintain a separate variable to track
if an RTO calc has been done; this allows the RTO var to be
doubled during initial timeouts.
- Reduces the amount of stack used by process control.
- Use a constant for the peer chunk overhead.
- Name change to spell candidate correctly.


170613 12-Jun-2007 bms

Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accommodate IPv6, IGMPv3 and MLDv2 with only a little additional work.

This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.

The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html

Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.

This work was financially supported by another FreeBSD committer.

Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)


170606 12-Jun-2007 rrs

- Restructure so bindx functions are not done inline to socket option
but are a separate call that can be re-used if needed.
- 64 bit issues
o re-arrange cookie so it is better 64 bit aligned
o For wire level things we need the packed attribute.


170587 12-Jun-2007 rwatson

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


170516 10-Jun-2007 andre

Fix a case in tcp_do_segment() where tcp_update_sack_list() would
be called with an incorrect segment end value. tcp_reass() may
trim segments when they overlap with already existing ones in the
reassembly queue. Instead of saving the segment end value before
the call to tcp_reass() compute it on the fly based on the effective
segment length afterwards.

This bug was not really problematic as no information got lost and
the eventual SACK information computation was correct nonetheless.

MFC after: 1 week


170515 10-Jun-2007 andre

Fix style for comments, be more verbose and add some more.


170470 09-Jun-2007 andre

Make the handling of the tcp window explicit for the SYN_SENT case
in tcp_output(). This is currently not strictly necessary but paves
the way to simplify the entire SYN options handling quite a bit.
Clarify comment. No change in effective behaviour with this commit.

RFC1323 requires the window field in a SYN (i.e., a <SYN> or
<SYN,ACK>) segment itself never be scaled.


170469 09-Jun-2007 andre

Remove some bogosity from the SYN_SENT case in tcp_do_segment
and simplify handling of the send/receive window scaling. No
change in effective behaviour.

RFC1323 requires the window field in a SYN (i.e., a <SYN> or
<SYN,ACK>) segment itself never be scaled.

Noticed by: yar
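
The RFC1323 rule quoted above can be sketched as follows. This is an illustration under stated assumptions, not the kernel code: window_field() and TCP_MAXWIN_UNSCALED are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value that fits in the 16-bit TCP window field. */
#define TCP_MAXWIN_UNSCALED	65535u

/*
 * Hypothetical helper: compute the 16-bit window field for a segment.
 * recv_win is the receive window in bytes, shift is the negotiated
 * window scale, is_syn is nonzero for <SYN> and <SYN,ACK> segments.
 */
static uint16_t
window_field(uint32_t recv_win, unsigned shift, int is_syn)
{
	uint32_t w;

	assert(shift <= 14);	/* RFC 1323 caps the shift count at 14 */
	/* SYN segments carry the window unscaled. */
	w = is_syn ? recv_win : (recv_win >> shift);
	if (w > TCP_MAXWIN_UNSCALED)
		w = TCP_MAXWIN_UNSCALED;
	return ((uint16_t)w);
}
```

A 128 KB window with a shift of 2 is advertised as 32768 on an ordinary segment, but is clamped to 65535 on a SYN because no scaling applies there.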


170467 09-Jun-2007 andre

Don't send pure window updates when the peer has closed the connection
and won't ever send more data.


170464 09-Jun-2007 andre

Handle a race condition on >2 core machines in tcp_timer() when
a timer issues a shutdown and a simultaneous close on the socket
happens. This race condition is inherent in the current socket/
inpcb life cycle system but can be handled well.

Reported by: kris
Tested by: kris (on 8-core machine)


170463 09-Jun-2007 rrs

- Oops.. takes out debug printfs I accidentally left in :-(


170462 09-Jun-2007 rrs

- fix send_failed notification contents
- Reorder send failed to be in correct order.
- Fixed calculation of init-ack to be right off
mbuf lengths instead of the precalculated value. This
will fix one 64 bit platform issue.


170435 08-Jun-2007 yar

Replace a constant with an already defined symbolic name for it.

Tested with: md5(1)


170434 08-Jun-2007 yar

Add a sysctl for the purge run interval so that it can
be tuned along with the rest of hostcache parameters.
The new sysctl name is `net.inet.tcp.hostcache.prune'.


170428 08-Jun-2007 rrs

- RTO was not being initialized to 0, thus the rtt calculation
algorithm would not go through the proper initialization.
- The initialization was incorrect as well, causing problems in
sat networks with > 1sec RTT
- Get rid of magic numbers in RTT calculations.


170405 07-Jun-2007 andre

In tcp_hc_insert() we may have the case where we have hit the global
cache size limit but this bucket row is empty. Normally we want to
recycle the oldest entry in the bucket row. If there isn't any the
TAILQ_REMOVE leads to a panic by trying to remove a non-existing
element. Fix this by just returning NULL and failing the insert.
This is not a problem as the TCP hostcache is only advisory.

Submitted by: jhb
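
The empty-row check described above can be sketched in userland with the <sys/queue.h> TAILQ macros. The structures and the hc_recycle_oldest() helper are hypothetical, simplified stand-ins, and the sketch assumes the oldest entry sits at the tail of the row.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, simplified stand-ins for the hostcache structures. */
struct hc_entry {
	TAILQ_ENTRY(hc_entry) link;
	int	id;
};
TAILQ_HEAD(hc_bucket, hc_entry);

/*
 * Pick the oldest entry (assumed to be the tail) of a bucket row for
 * reuse. Returns NULL when the row is empty instead of calling
 * TAILQ_REMOVE on a non-existent element.
 */
static struct hc_entry *
hc_recycle_oldest(struct hc_bucket *row)
{
	struct hc_entry *victim;

	assert(row != NULL);
	victim = TAILQ_LAST(row, hc_bucket);
	if (victim == NULL)
		return (NULL);		/* empty row: fail the insert */
	TAILQ_REMOVE(row, victim, link);
	return (victim);
}
```

A caller that gets NULL back simply skips caching for that host, which is acceptable for an advisory cache.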


170385 06-Jun-2007 andre

Correctly print SEQ and IRS in the corresponding log message in
syncache_expand().


170373 06-Jun-2007 glebius

Do not leak lock in the case of EEXIST error.

PR: kern/92776
Submitted by: Ed Schouten <Ed.Schouten tunix.nl>


170354 06-Jun-2007 rrs

- Fixes a case where doing a sysctl would leave locks held
when copying out association data.
- Fixes a small bug that prevented the SCTP_UNORDERED indication
from going up to the app on a recv in the sinfo_flags field.


170289 04-Jun-2007 dwmalone

Despite several examples in the kernel, the third argument of
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.

Remove the instances where a sizeof(variable) is passed to stop
people accidentally cutting and pasting these examples.

In a few places sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.
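
The truncation described above can be illustrated in userland. The two functions below are hypothetical stand-ins, not the kernel handlers: one copies only sizeof(int) bytes from the source, the other copies a full 64-bit value. The sketch assumes int is narrower than 64 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for an int-sized export routine: it copies only
 * sizeof(int) bytes from the source, whatever the real object's size.
 */
static void
export_as_int(const void *src, void *dst)
{
	assert(src != NULL && dst != NULL);
	memcpy(dst, src, sizeof(int));
}

/* Hypothetical stand-in for a 64-bit ("quad") export: copies all bytes. */
static void
export_as_quad(const void *src, void *dst)
{
	assert(src != NULL && dst != NULL);
	memcpy(dst, src, sizeof(int64_t));
}
```

Passing a 64-bit counter through the int-sized path loses the bytes beyond sizeof(int), which is why the affected sysctls were switched to the quad handler.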


170205 02-Jun-2007 rrs

- fix initial pcb vrf setting when the initial vrf is not the
default_vrf_id
- Missing lock/unlock of inp added as well in the v6 side.
- IFN hash table moves to sctppcbinfo since indexes are
unique across systems (including different VRFs) this makes it easier
to do ifn lookups.


170181 01-Jun-2007 rrs

- Take out the broken table-id concept. Panda Routers have a M-VRF
concept that is NOT well thought out for a multi-homed transport
protocol. So the useless table-id entries passed around need to
be removed.
- Add a event timer for the zero copy api.
- Fix a bug in sctp_timer.c when searching for an alternate
with the largest ssthresh (the compare was wrong).


170174 01-Jun-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to its exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


170153 31-May-2007 rwatson

(1) In tcp_usrclosed(), tp can never become NULL, so don't test for NULL
before handling the socket disconnection case.

(2) Clean up surrounding comments and formatting.

Found with: Coverity Prevent(tm) (1)
CID: 2203


170140 30-May-2007 rrs

- Fixed (Apple) compiler warnings in sctp_input.c, sctputil.c, sctp_output.c
- Fixed a LOR in handling a cookie. Turns out create lock is applied.
And if we abort processing, this causes LOR. Changed to force the
timer to clean up, that way create lock is released.


170138 30-May-2007 rrs

- Fix a memory overwrite when the mapping array
is expanded, size of expansion was not taken into consideration.
- Fix so vtag hash is 1 bigger so that it modulo's out
correctly, avoids a panic when restart with right modulo happens.
- do not dereference stcb when control->do_not_ref_stcb is set
- Fix up packet logging to not often use a lock and also to
add to options.
- Fix some logging option duplication in the sctputil.h


170099 29-May-2007 rrs

Adds gcc attribute to prevent inlining of a function. If
it goes inline we may well blow the stack if witness and
such are enabled.


170094 29-May-2007 rrs

- Fix spelling errors in comments per Ruslan (.. thanks... )


170091 29-May-2007 rrs

- Fixes so we won't try to start a timer when we
hold a wq lock for the iterator. Panda uses a
silly recursive lock they hold through the timer.
- Add poor man's wireshark compile option.
- Allocate and start using SCTP_M_XXX for all SCTP_MALLOC() calls.
- sysctl now will get back the refcnt for viewing by onlookers.

Reviewed by: gnn


170078 28-May-2007 andre

Make log messages more verbose and simpler to understand for non-experts.
Update comments to be more conscious, verbose and fully reflect reality.


170058 28-May-2007 andre

Fix indentation of the syncache_expand() section in tcp_input().


170056 28-May-2007 rrs

- fixed autoclose to not allow setting on 1-2-1 model.
- bounded cookie-life to 1 second minimum in socket option set.
- Delayed_ack_time becomes delayed_ack per new socket api document.
- Improve port number selection, we now use low/high bounds and
no chance of an endless loop. Only one call to random per bind
as well.
- fixes so set_peer_primary pre-screens addresses to be
valid to this host.
- maxseg did not allow setting on an assoc basis. We needed
to thus track and use an association value instead of an inp value.
- Fixed ep get of HB status to report back properly.
- use settings flag to tell if assoc level hb is on off not
the timer.. since the timer may still run if unconf address
are present.
- check for crazy ENABLE/DISABLE conditions.
- set and get of pmtud (fixed path mtu) not always taking into account ovh.
- Getting PMTU info on stcb only needs to return PMTUD_ENABLED if
any net is doing PMTU discovery.
- Panic or warning fixed to not do so when a valid ip frag is
taking place.
- sndrcvinfo appearing in both inp and stcb was full size, instead
of the non-pad version. This saves about 92 bytes from each struct
by carefully converting to use the smaller version.
- one-2-one model get(maxseg) would always get ep value, never the
tcb's value.
- The delayed ack time could be under a tick, this fixes so
it bounds it to at least 1 tick for platforms whose tick
is more than a ms.
- Fragment interleave level set to wrong default value.
- Fragment interleave could not set level 0.
- Deferred stream reset was broken due to a guard check and ntohl issue.
- Found two lock order reversals and fixed.
- Tighten up address checking, if the user gives an address the sa_len
had better be set properly.
- Get asoc by assoc-id would return a locked tcb when it was asked
not to if the tcb was in the restart hash.
- sysctl to dig down and get more association details

Reviewed by: gnn


170055 28-May-2007 andre

Refactor and rewrite in parts the SYN handling code on listen sockets
in tcp_input():

o tighten the checks on allowed TCP flags to be RFC793 and
tcp-secure conform
o log check failures to syslog at LOG_DEBUG level
o rearrange the code flow to be easier to follow
o add KASSERTs to validate assumptions of the code flow

Add sysctl net.inet.tcp.syncache.rst_on_sock_fail, enabled by default,
that controls the behavior on socket creation failure for an otherwise
successful 3-way handshake. The socket creation can fail due to global
memory shortage, listen queue limits and file descriptor limits. The
sysctl allows choosing between two options to deal with this. One is
to send a reset to the other endpoint to notify it about the failure
(default). The other one is to ignore and treat the failure as a
transient error and have the other endpoint retransmit for another try.

Reviewed by: rwatson (in general)


170030 27-May-2007 rwatson

Normalize spelling and grammar in TCP hostcache comments.


170024 27-May-2007 rwatson

In tcp_timer_2msl(), tp can never become NULL, so don't check it for
NULL before entering tcp_trace().

Found with: Coverity Prevent(tm)
CID: 1840


170019 27-May-2007 rwatson

Don't assign sp to the value of s when we're about to assign it instead to
s + strlen(s).

Found with: Coverity Prevent(tm)
CID: 2243


169997 25-May-2007 andre

The printf %b list in PRINT_TH_FLAGS has to be in octal numbering.
Thus convert \8 to \10 and the warnings go away.

Pointed out by: sam, ru, thompsa
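
The octal requirement can be illustrated with a minimal userland sketch of
how a %b-style mask is decoded. decode_bits() and TH_FLAG_NAMES are
hypothetical names used only for this illustration and are not the kernel's
code; the mask layout (a leading output-base byte, then entries made of a
1-based bit number followed by the bit name) is assumed to follow printf(9).
C has no "\8" escape, so bit 8 has to be written as "\10" (octal).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mask in %b style: base byte \20 (hex), then bit/name pairs. */
#define TH_FLAG_NAMES "\20\1FIN\2SYN\3RST\4PUSH\5ACK\6URG\7ECE\10CWR"

/*
 * Write the comma-separated names of the bits set in 'value' into 'out'.
 * Only the name list is decoded; the numeric prefix that %b prints is
 * left out of this sketch.
 */
static void
decode_bits(unsigned int value, const char *mask, char *out, size_t outlen)
{
	const char *p;
	size_t used = 0;

	assert(mask != NULL && out != NULL && outlen > 0);
	out[0] = '\0';
	if (mask[0] == '\0')
		return;
	p = mask + 1;			/* skip the output-base byte */
	while (*p != '\0') {
		int bit = *p++;		/* 1-based bit number, e.g. \10 == 8 */
		size_t len = 0;

		/* A name runs until the next byte that is <= ' ' (or NUL). */
		while ((unsigned char)p[len] > ' ')
			len++;
		if (bit >= 1 && bit <= 32 && (value & (1u << (bit - 1)))) {
			int n = snprintf(out + used, outlen - used, "%s%.*s",
			    used ? "," : "", (int)len, p);
			if (n < 0 || (size_t)n >= outlen - used)
				break;	/* output buffer is full */
			used += (size_t)n;
		}
		p += len;
	}
}
```

With this mask, a flags value of 0x12 decodes to "SYN,ACK" and 0x80 to
"CWR".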


169914 23-May-2007 andre

Add CWR back into the PRINT_TH_FLAGS list as gcc42 doesn't complain
about \8 in a string anymore.


169913 23-May-2007 andre

In tcp_log_addrs():
o add the hex output of the th_flags field to the example log
line in comments
o simplify the log line length calculation and make it less
evil
o correct the test for the length panic; the line isn't on
the stack but malloc'ed


169686 18-May-2007 andre

Be more restrictive with segment validity checks in syncache_expand()
and log check failures to syslog at LOG_DEBUG level.

Always prefill the sc->sc_ts field to use it in the checks.


169685 18-May-2007 andre

o Add syslog logging under LOG_DEBUG to various failures caused by
bogus segments
o Add more KASSERT()s
o Update comments


169683 18-May-2007 andre

Add tcp_log_addrs() function to generate a standardized TCP log line
for use throughout the tcp subsystem.

It is IPv4 and IPv6 aware and creates a line in the following format:

"TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"

A "\n" is not included at the end. The caller is supposed to add
further information after the standard tcp log header.

The function returns a NUL terminated string which the caller has
to free(s, M_TCPLOG) after use. All memory allocation is done
with M_NOWAIT and the return value may be NULL in memory shortage
situations.

Either struct in_conninfo || (struct tcphdr && (struct ip || struct
ip6_hdr)) have to be supplied.

Due to ip[6].h header inclusion limitations and ordering issues the
struct ip and struct ip6_hdr parameters have to be cast and passed
as void * pointers.

tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr,
void *ip6hdr)

Usage example:

struct ip *ip;
char *tcplog;

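/* tcp_log_addrs() may return NULL under memory shortage. */
if ((tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) != NULL) {
	log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
	    tcplog, __func__);
	/* Free the string that was returned, not an unrelated variable. */
	free(tcplog, M_TCPLOG);
}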


169682 18-May-2007 jhb

Fix statistical accounting for bytes and packets during sack retransmits.

MFC after: 1 week
Submitted by: mohans


169664 17-May-2007 jinmei

- Disabled responding to NI queries from a global address by default as
specified in RFC4620. A new flag for icmp6_nodeinfo was added to enable the
feature.
- Also cleaned up the code so that the semantics of the icmp6_nodeinfo
flags is clearer (i.e., defined specific macro names instead of using
hard-coded values).

Approved by: gnn (mentor)
MFC after: 1 week


169655 17-May-2007 rrs

- Fixed 1-2-1 model to not worry about associd in sockopts
- Fixed RTOinfo for bounding.
- Fixed connect() to return ECONNREFUSED when an ABORT is received.
- Added comments to direct Static Analysis not to look at some things
it does not understand (comments are /* sa_ignore XXXXX */)
- Bind when colliding was broken, missing not_found = 1 before
checking to see if the port was in use caused endless bind loop.
- Cookie life needs to be in milliseconds to conform to socket api.
- Cookie life is not supposed to change if it's 0. On the assoc
level set we changed it to 0, oops.
- Two more static analysis issues identified by the cisco
tool. Null checks needed.
- An issue for sendfile(). Need to validate the correct
input argument.
- When sending failed due to a no route to host, we leaked
the mbuf chain failing to call m_freem().
- Fix #ifdef issue for getting hash block len when HAVE_SHA2 is NOT defined
Reviewed by: gnn


169635 17-May-2007 oleg

Unbreak IPv4 kernel build.


169625 16-May-2007 rwatson

Remove leading spaces before tabs spotted thanks to silby using
kwrite to read ip_input.c.


169613 16-May-2007 andre

Remove now unused stuff forgotten in the previous commit.


169608 16-May-2007 andre

Move TIME_WAIT related functions and timer handling from files
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:

tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()

tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()

This is a mechanical move with appropriate renames and making
them static if used only locally.

The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.


169598 16-May-2007 dwmalone

When verifying the IPv4 UDP checksum, don't overwrite the checksum
value in the mbuf with the result of the calculation. Previously,
if we chose to return an ICMP message, the quoted UDP checksum bytes
would be different to what was sent.

PR: 112471
Submitted by: Matthew Luckie <mluckie@cs.waikato.ac.nz>
MFC after: 3 weeks
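
The idea of verifying without overwriting can be sketched in userland as
follows. This is a simplified illustration and not the kernel code:
cksum16() and datagram_cksum_ok() are hypothetical helpers, and the UDP
pseudo-header is left out. The point is that the computed value stays in a
local variable and the buffer is only read (note the const pointer), so
the checksum bytes quoted in a later ICMP message match what was sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum over a byte buffer. */
static uint16_t
cksum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (i < len)			/* odd trailing byte, zero padded */
		sum += (uint32_t)buf[i] << 8;
	while (sum >> 16)		/* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/*
 * A datagram that carries a correct checksum sums to zero. The result
 * is kept in a local variable instead of being stored into the datagram.
 */
static bool
datagram_cksum_ok(const uint8_t *dgram, size_t len)
{
	uint16_t result = cksum16(dgram, len);

	return (result == 0);
}
```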


169541 13-May-2007 andre

Complete the (mechanical) move of the TCP reassembly and timewait
functions from their original place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait from tcp_subr.c -> tcp_timewait.c


169482 11-May-2007 andre

Drop everything that doesn't belong into this new file.
It's neither functional nor connected to the build yet.


169481 11-May-2007 andre

Drop everything that doesn't belong into this new file.
It's neither functional nor connected to the build yet.


169480 11-May-2007 andre

Make the TCP timer callout obtain Giant if the network stack is marked
as non-mpsafe.

This change is to be removed when all protocols are mp-safe.


169477 11-May-2007 andre

Add the timestamp offset to struct tcptw so we can generate proper
ACKs in TIME_WAIT state that don't get dropped by the PAWS check
on the receiver.


169469 11-May-2007 rwatson

Coalesce two identical UCB licenses into a single license instance with
one set of copyright years.

White space and comment cleanup.

Export $FreeBSD$ via __FBSDID.


169467 11-May-2007 rwatson

Minor white space and style cleanups.


169466 11-May-2007 rwatson

White space and style cleanup.


169465 11-May-2007 rwatson

Minor white space/style normalization.


169464 11-May-2007 rwatson

Normalize style a bit: reduce pseudo-randomness of comment layout and
white space. Remove 'register'.


169462 11-May-2007 rwatson

Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr
protocol entry points using functions named proto_getsockaddr and
proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr.
While it's true that sockaddrs are allocated and set, the net effect is
to retrieve (get) the socket address or peer address from a socket, not
set it, so align names to that intent.


169461 11-May-2007 rwatson

Remove unneeded wrappers for in_setsockaddr() and in_setpeeraddr(), which
used to exist so pcbinfo locks could be acquired, but are no longer
required as a result of socket/pcb reference model refinements.


169457 10-May-2007 andre

Fix an incorrect replace of a timer reference made during the TCP timer
rewrite in rev. 1.132. This unmasked yet another bug that causes certain
connections to get indefinitely stuck in LAST_ACK state.


169454 10-May-2007 rwatson

Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.


169420 09-May-2007 rrs

Two major items here:
- All printf that was surrounded by #ifdef SCTP_DEBUG moves to
a macro that does all of this. This removes all printfs from
the code and makes the code more portable and easier to
read.
- Static Analysis (cisco) - found a few bugs, but mostly we
add checks for NULL pointers and such to make the tool
happy. We now pass the Cisco SA tools checks except for
where it does not understand tailq/lists. We still need
to look at the coverity tools output too (this is like
the cisco SA tool) and see if it wants us to fix any other
items. Hopefully this will be the last major churn in the
code other than bug fixes.


169417 09-May-2007 maxim

o Fix style(9) bugs introduced in the last commit.

Pointed out by: bde


169405 09-May-2007 maxim

o Unbreak "options TCPDEBUG" && "nooptions INET6" kernel build.

PR: kern/112517
Submitted by: vd


169382 08-May-2007 rrs

- Copyright change, cisco's silly tool wants it to say:
"Copyright (c) 2001-2007, by Cisco Systems,"
instead of
"Copyright (c) 2001-2007, Cisco Systems,"

- Also fix a few stragglers that were still in 2006.


169380 08-May-2007 rrs

- Get rid of the sctp_inpcb_free() "magic numbers", now they
are sensible defines that tell what you are directing
the function to do.


169378 08-May-2007 rrs

- Static analysis fixes for cisco's commit (this is equivalent
to the coverity tool.. may even be the same one.. not sure).
- A bug in the way sctp_abort() and friends were
setting the IP_CLOSE flag.. and NOT passing the
last argument as a (,1)... so that things would
get freed..


169352 08-May-2007 rrs

- More macros for OS compatibility
- PR-SCTP would ignore FWD-TSN's above a rwnd's worth
of TSN's (1 byte msgs).. this left the peer hopelessly
out of sync.. or an attacker. So now we abort the assoc.
- New IFN hash, also rename hashes to match addr/ifn now
that the vrf has multiple.
- Do not enable SCTP_PCB_FLAGS_RECVDATAIOEVNT per default
as defined in the Socket API ID.
- Export MTU information via sysctl.
- Vrf's need table id's. This is default for
BSD, but may be other things later when BSD
fully supports VRFs.
- Additional stream reset bug (caught by cisco dev-test).
- Additional validations for the address in sending a message (socket api).
-------- and -----
- Fix association notifications not to give the active open
side false notifications.
- Fix so sendfile and SENDALL will work properly (missing
flag to say socket sender is done).
- Fix Bug that prevented COOKIES from being retransmitted.
- Break out connectx into helper sub-models so that iox routines can
reuse the helpers.
- When an address is added during system init (non-dynamic mode) make
sure that the "defer use" flag is not set.
** its compiling on XR now :-D **

Reviewed by: gnn


169350 07-May-2007 rwatson

Rather than selectively zeroing fields in the tcp_debug structure
throughout tcp_trace(), zero the entire structure up front.

Minor style fixes.


169349 07-May-2007 rwatson

Since udp_peeraddr() and udp_sockaddr() directly wrap in_setpeeraddr()
and in_setsockaddr(), containing only stale comments on why they
exist, remove them and initialize the protosw for UDP to directly
reference in_setpeeraddr() and in_setsockaddr().


169348 07-May-2007 rwatson

Minor style tweaks.


169347 07-May-2007 rwatson

When setting up timewait state for a TCP connection, don't hold the
socket lock over a crhold() of so_cred: so_cred is constant after
socket creation, so doesn't require locking to read.


169318 06-May-2007 andre

Remove unused requested_s_scale from struct tcpcb.


169317 06-May-2007 andre

Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a dedicated sack_enable int for this bool. Change all users accordingly.


169316 06-May-2007 andre

o Remove redundant tcp reassembly check in header prediction code
o Rearrange code to make intent in TCPS_SYN_SENT case more clear
o Assorted style cleanup
o Comment clarification for tcp_dropwithreset()


169315 06-May-2007 andre

Reorder the TCP header prediction test to check for the most volatile
values first to spend less time on a fallback to normal processing.


169314 06-May-2007 andre

Remove the defunct remains of the TCPS_TIME_WAIT cases from tcp_do_segment
and change it to a void function.

We use a compressed structure for TCPS_TIME_WAIT to save memory. Any late
segments arriving for such a connection are handled directly in the TW
code.


169309 06-May-2007 andre

Fix two comments.


169295 06-May-2007 rrs

Two bugs:
- Locks were not being unlocked when an invalid size chunk is
sent in.
- When a notification comes in, we cannot use it to look up
the fragment interleave stream information since it's not
on a stream.


169272 04-May-2007 rwatson

Add global mutex tcp_debug_mtx, which will protect global TCP debugging
state tcp_debug, tcp_debx. Acquire and drop as required in tcp_trace().

Move to ANSI C function header, correct prototype types so that short TCP
state is no longer promoted to int unnecessarily.

Add comments.

MFC after: 3 weeks


169268 04-May-2007 rwatson

Tweak comment at end of tcp_input() when calling into tcp_do_segment(): the
pcbinfo lock will be released as well, not just the pcb lock.


169254 04-May-2007 rrs

Fixes a missing unlock in the one-2-one hash table, if
it was full and a collision occurred, then we would leave
an inp locked. Also fixes a missing inp unlock if IPSEC was
on and it failed during the attach. Bug found by Weongyo Jeong.


169245 04-May-2007 bz

Add support for filtering on Routing Header Type 0 and
Mobile IPv6 Routing Header Type 2 in addition to filter
on the non-differentiated presence of any Routing Header.

MFC after: 3 weeks


169236 03-May-2007 rwatson

sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags
on each socket buffer with the socket buffer's mutex. This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.

This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.

While here, fix two historic bugs:

(1) a bug allowing I/O to be occasionally interlaced during long I/O
operations (discovered by Isilon).

(2) a bug in which failed non-blocking acquisition of the socket buffer
I/O serialization lock might be ignored (discovered by sam).

SCTP portion of this patch submitted by rrs.


169208 02-May-2007 rrs

- Somehow the disable fragment option got lost. We could
set/clear it but would not do it. Now we will.
- Moved to latest socket api for extended sndrcv info struct.
- Moved to support all new levels of fragment interleave (0-2).
- Codenomicon security test updates - length checks and such.
- Bug in stream reset (2 actually).
- setpeerprimary could unlock a null pointer, fixed.
- Added a flag in the pcb so netstat can see if we are listening easier.

Obtained from: (some of the Listen changes from Weongyo Jeong)


169179 01-May-2007 rwatson

Remove unused pcbinfo arguments to in_setsockaddr() and
in_setpeeraddr().


169154 30-Apr-2007 rwatson

Rename some fields of struct inpcbinfo to have the ipi_ prefix,
consistent with the naming of other structure field members, and
reducing improper grep matches. Clean up and comment structure
fields in structure definition.


169149 30-Apr-2007 maxim

o Kill EOLWS while I'm here.


169148 30-Apr-2007 maxim

o Fix strtoul() error conditions check.

PR: kern/108211
Submitted by: Yong Tang
MFC after: 2 weeks


168986 23-Apr-2007 andre

o Fix INP lock leak in the minttl case
o Remove indirection in the decision of unlocking inp
o Further annotation of locking in tcp_input()


168961 23-Apr-2007 rrs

Fixes cut and paste bug using wrong pointer reference.


168945 22-Apr-2007 rrs

Moves the PCB features and flags from sctp_pcb.h to
sctp.h so that netstat can access and display these
values.


168943 22-Apr-2007 rrs

- Somehow the disable fragment option got lost. We could
set/clear it but would not do it. Now we will.
- Moved to latest socket api for extended sndrcv info struct.
- Moved to support all new levels of fragment interleave.


168906 20-Apr-2007 andre

o Remove unnecessary TOF_SIGLEN flag from struct tcpopt
o Correctly set to->to_signature in tcp_dooptions()
o Update comments


168905 20-Apr-2007 andre

Add more KASSERT's.


168904 20-Apr-2007 andre

o Remove unused and redundant TCP option definitions
o Replace usage of MAX_TCPOPTLEN with the correctly constructed and
derived MAX_TCPOPTLEN


168903 20-Apr-2007 andre

Remove bogus check for accept queue length and associated failure handling
from the incoming SYN handling section of tcp_input().

Enforcement of the accept queue limits is done by sonewconn() after the
3WHS is completed. It is not necessary to have an earlier check before a
connection request enters the SYN cache awaiting the full handshake. It
rather limits the effectiveness of the syncache by preventing legit and
illegit connections from entering it and having them shaken out before we
hit the real limit which may have vanished by then.

Change return value of syncache_add() to void. No status communication
is required.


168902 20-Apr-2007 andre

Simplify syncache_expand() and clarify its semantics. Zero is returned
when the ACK is invalid and doesn't belong to any registered connection,
either in syncache or through SYN cookies. True but a NULL struct socket
is returned when the 3WHS completed but the socket could not be created
due to insufficient resources or limits reached.

For both cases an RST is sent back in tcp_input().

A logic error leading to a panic is fixed where syncache_expand() would
free the mbuf on socket allocation failure but tcp_input() later supplies
it to tcp_dropwithreset() to issue a RST to the peer.

Reported by: kris (the panic)


168901 20-Apr-2007 andre

Only update TCP timestamp on SYN duplication if it is present on
current SYN in syncache_add(). Otherwise disable timestamps.


168900 20-Apr-2007 andre

o Plug memory leak in syncache_add() on MAC label allocation failure.
o Simplify code flow with 'done' goto label.
o Remove mbuf argument from syncache_respond(). It doesn't make use
of it.


168859 19-Apr-2007 rrs

- More work on making send lock contention.
- Removed free-oqueue cache.
- Fix counter for sq entries
- Increased the amount of information retained
on ASOC_TSN logging on the association.
- Made it so with the ASOC_TSN logging on
sending or receiving an abort we dump the log.
- Went through and added invariant's around some
panic's that needed them.
- decrements went to atomic_subtract_int instead of add -1
- Removed residual count increment that threw off a
strm oq count.
- Track and complain if we don't have a LAST fragment and
clean up the sp structure.
- Track a new stat that counts number of abandoned msgs that
happen if you close without reading.
- Fix lookup of frag point to be aware of a 0 assoc-id.
Reviewed by: gnn


168845 18-Apr-2007 andre

Make tcp_twrespond() use tcp_addoptions() instead of a home grown version.


168817 17-Apr-2007 andre

When we run into the syncache entry limits syncache_add() tries
to free the oldest entry in the current bucket row. The global
entry limit may be smaller than the bucket rows and their limit
combined however. Thus only try to free a syncache entry if we
found one in this bucket row.

Reported by: kris


168812 17-Apr-2007 rwatson

Shorten text string for ip_fw2 dynamic rules zone by removing the word
"zone", which is generally not present in zone names. This reduces the
incidence of line-wrapping in "vmstat -z " using 80-column displays.

MFC after: 3 days


168769 15-Apr-2007 rwatson

Remove unused variable tcbinfo_mtx.


168757 15-Apr-2007 rrs

Fix stupid syntax error - Pointy hat to me :-(


168755 15-Apr-2007 rrs

- Add more comments to sctps_stats structure in sctp_uio.h
- Fix bug that prevented EEOR mode from working
and simplified the can_we_split code in the process.
- Reduce lock contention for the tcb_send_lock. I did
this especially for EEOR mode, still need to look at
why I need a lock when removing from the tailq and the
->next is NOT null. A lock fixes it but it implies a
bug yet exists.
- Activated Andre's proposed changes to better use the mbuf
infrastructure.
- Fixed places that were not using the aloc macro's to take
advantage of the per assoc cache.
- Adds ifdef fix so any logging will enable stat_logging to
get the right data structures in place (suggested by Max Laier).


168731 14-Apr-2007 mlaier

Fix a typo - unbreak the build.


168709 14-Apr-2007 rrs

- fix source address selection when picking an acceptable address
- name change of prefered -> preferred
- CMT fast recover code added.
- Comment fixes in CMT.
- We were not giving a reason of cant_start_asoc per socket api
if we failed to get init/or/cookie to bring up an assoc. Change
so we don't just give a generic "comm lost" but look at actual
states of dying assoc.
- change "crc32" arguments to "crc32c" to silence strict/noisy
compiler warnings when crc32() is also declared
- A few minor tweaks to get the portable stuff truly portable
for sctp6_usrreq.c :-D
- one-2-one style vrf match problem.
- window recovery would leave chks marked for retran
during window probes on the sent queue. This would then
cause an out-of-order problem and assure that the flight
size "problem" would occur.
- Solves a flight size logging issue that caused rwnd
overruns, flight size off as well as false retransmissions.
- Macroize the up and down of flight size.
- Fix a ECNE bug in its counting.
- The strict_sacks options was causing aborts when window probing
was active, fix to make strict sacks a bit smarter about what
the next unsent TSN is.
- Fixes a one-2-one wakeup bug found by Martin Kulas.
- If-defed out form, Andre's copy routines pending his
commit of at least m_last().. need to adjust for 6.2 as
well.. since m_last won't exist.
Reviewed by: gnn


168621 11-Apr-2007 ru

Make "struct tcp_timer" visible only to the kernel, and unbreak world.


168615 11-Apr-2007 andre

Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by: rwatson (earlier version)
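
The merged model can be sketched as follows: each connection keeps one
deadline per logical timer and arms a single callout for whichever fires
next. This is only an illustration under stated assumptions. struct
timer_set, timer_next() and timer_activate() are hypothetical names, the
enum labels are illustrative, a deadline of 0 is assumed to mean
"inactive", a delta of 0 is assumed to stop a timer, and tick wraparound
is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Five logical timers per connection (labels are illustrative). */
enum { TT_DELACK, TT_REXMT, TT_PERSIST, TT_KEEP, TT_2MSL, TT_MAX };

struct timer_set {
	long deadline[TT_MAX];	/* absolute tick; 0 means inactive */
};

/* Return the index of the earliest active timer, or -1 if none is set. */
static int
timer_next(const struct timer_set *ts)
{
	int best = -1;

	for (int i = 0; i < TT_MAX; i++) {
		if (ts->deadline[i] == 0)
			continue;
		if (best == -1 || ts->deadline[i] < ts->deadline[best])
			best = i;
	}
	return (best);
}

/*
 * Start (delta > 0) or stop (delta == 0) one logical timer and report
 * which timer the single underlying callout should now be armed for.
 */
static int
timer_activate(struct timer_set *ts, int which, long now, long delta)
{
	assert(ts != NULL && which >= 0 && which < TT_MAX);
	assert(now > 0 && delta >= 0);
	ts->deadline[which] = (delta == 0) ? 0 : now + delta;
	return (timer_next(ts));
}
```

A single dispatch function would then look up timer_next() when the
callout fires and call the handler for that logical timer, which keeps the
race handling in one place.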


168590 10-Apr-2007 rwatson

Add a new privilege, PRIV_NETINET_REUSEPORT, which will replace superuser
checks to see whether bind() can reuse a port/address combination while
it's already in use (for some definition of use).


168459 07-Apr-2007 piso

Prevent the usage of an uninitialized variable: do not accept
StartMediaTx message before an OpnRcvChnAck message was received.

Reviewed by: glebius
Approved by: glebius (mentor)
MFC after: 3 days
Found with: Coverity Prevent(tm)
CID: 498


168458 07-Apr-2007 piso

Silence Coverity about an unused variable.

Reviewed by: glebius
Approved by: glebius (mentor)
MFC after: 3 days
CID: 538


168369 04-Apr-2007 andre

Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some
further INP_INFO_WLOCK_ASSERT() while there.


168368 04-Apr-2007 andre

Move last tcpcb initialization for the inbound connection case from
tcp_input() to syncache_socket() where it belongs and the majority
of it already happens.

The "tp->snd_up = tp->snd_una" is removed as it is done with the
tcp_sendseqinit() macro a few lines earlier.


168365 04-Apr-2007 andre

Some local and style(9) cleanups.


168364 04-Apr-2007 andre

Retire unused TCP_SACK_DEBUG.


168363 04-Apr-2007 andre

In tcp_dooptions() skip over SACK options if it is a SYN segment.


168346 04-Apr-2007 kan

Include string.h for non-kernel builds to get proper memcpy prototype.


168344 04-Apr-2007 kan

Include string.h for non-kernel builds to get proper strcpy, strlen
prototypes.


168342 04-Apr-2007 kan

Do not assign result of (char *) cast to u_char * variable.


168328 03-Apr-2007 julian

Since we switched to using monotonically increasing timestamps,
they have been reported back to the userland as being in 1970.
Add boot time to the timestamp to give the time in the scale of the 'current'
real timescale. Not perfect if you change the time a lot but good enough
to keep all the rules correct relative to each other in terms
of time relative to "now".


168299 03-Apr-2007 rrs

- fixed several places where we did not release INP locks.
- fixed a refcount bug in the new ifa structures.
- use vrf's from default stcb or inp whenever possible.
- Address limits raised to account for a full IP fragmented
packet (1000 addresses).
- flight size correcting updated to include one message only
and to handle case where the peer does not cumack the
next segment aka lists 1/1 in sack blocks..
- Various bad init/init-ack handling could cause a panic
since we tried to unlock the destroyed mutex. Fixes
so we properly exit when we need to destroy an assoc.
(Found by Cisco DevTest team :D)
- name rename in src-addr-selection from pass to sifa.
- route structure typedef'd to allow different platforms
and updated into sctp_os_bsd file.
- Max retransmissions a chunk can be made added.
Reviewed by: gnn


168124 31-Mar-2007 rrs

- Found bug in min split point bundling which caused
incorrect, non-bundlable fragmentation.
- Added min residual to better control split points for
both how big a msg must be as well as how much needs
to be left over.
- With our new algo in place, we need to implicitly
set "end of msg" on the sp-> structure otherwise we
end up with "hung" associations.
- Room reserved up front in IP header by pushing IP
header to back of mbuf.
- Fix so FR's peg count of retransmissions needed.
- Fix so an unlucky chunk that never gets across
will kill the assoc via the kill timer and send an
abort too.
- Fix bug in sctp_input which can result in a crash.
- Do not strip off IP options anymore.
- Clean up sctp_calculate_rto().
- Get rid of unused sysctl.
- Fixed so we discard all M-Cast
- Fixed so port check done AFTER checksum
- Fixed bug in fragmentation code that prevented
us from fragmenting a small complete message when
we needed to.
- Window probes were not marked back to unsent and
flight adjusted when a sack came in with no
window change or accepting of the probe data.
We now fix this with having a mark on the net and
the chunk so we can clear it out when the sack arrives
forcing it to retran just like it was "new" this
improves the handling of window probes, which were
dropped by the receiver.
- Tighten AUTH protocol error checks during INIT/INIT-ACK exchange


168032 29-Mar-2007 bms

Fix a bug in IPv4 address configuration exposed by refcounting.
* Join the IPv4 all-hosts multicast group 224.0.0.1 once only;
that is, when an IPv4 address is first configured on an interface.
* Do not join it for subsequent IPv4 addresses as this violates IGMP.
* Be sure to leave the group when all IPv4 addresses have been removed
from the interface.
* Add two DIAGNOSTIC printfs related to the issue.

Further care and attention is needed in this area; it is suggested that
netinet's attachment to the ifnet structure be compartmentalized and
non-implicit.

Bug found by: andre
MFC after: 1 month
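
The join-once and leave-on-last rule above can be sketched with a
per-interface address count. This is a hypothetical model and not the
netinet code: struct ifstate, addr_added() and addr_removed() are invented
for illustration, and the boolean stands in for actual membership of
224.0.0.1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-interface IPv4 state. */
struct ifstate {
	int naddrs;		/* IPv4 addresses currently configured */
	bool joined_allhosts;	/* stands in for membership of 224.0.0.1 */
};

/* Join the all-hosts group only when the first address is configured. */
static void
addr_added(struct ifstate *ifs)
{
	assert(ifs != NULL);
	if (ifs->naddrs++ == 0)
		ifs->joined_allhosts = true;
}

/* Leave the group only when the last address has been removed. */
static void
addr_removed(struct ifstate *ifs)
{
	assert(ifs != NULL && ifs->naddrs > 0);
	if (--ifs->naddrs == 0)
		ifs->joined_allhosts = false;
}
```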


167989 28-Mar-2007 andre

When blackholing do a 'dropunlock' in the new world order to prevent the
INP_INFO_LOCK from leaking.

Reported by: ache
Found by: rwatson


167960 28-Mar-2007 rwatson

Remove stale comment about not enabling inpcb and inpcbinfo lock assertions
when IPv6 is enabled.

MFC after: 3 days


167888 25-Mar-2007 andre

In tcp_sack_doack() remove too tight KASSERT() added in last revision. This
function may be called without any TCP SACK option blocks present. Protect
iteration over SACK option blocks by checking for SACK options present flag
first.

Bug reported by: wkoszek, keramida, Nicolas Blais


167886 25-Mar-2007 rwatson

Replace a comment about RSVP/mrouting with a different but similar comment
explaining that some more locking is needed. The routing pieces are done,
but there is an interlocking issue between optionally compiled code and
mandatory code.

Spotted by: kris


167873 24-Mar-2007 maxim

o Use a define for a buffer size.

Prodded by: db

o Add missed vars for TCPDEBUG in tcp_do_segment().

Prodded by: tinderbox


167839 23-Mar-2007 andre

Split tcp_input() into its two functional parts:

o tcp_input() now handles TCP segment sanity checks and preparations
including the INPCB lookup and syncache.
o tcp_do_segment() handles all data and ACK processing and is IPv4/v6
agnostic.

Change all KASSERT() messages to ("%s: ", __func__).

The changes in this commit are primarily of mechanical nature and no
functional changes besides the function split are made.

Discussed with: rwatson


167834 23-Mar-2007 andre

Tidy up some code to conform better to surroundings and style(9), 0 = NULL
and space/tab.


167833 23-Mar-2007 andre

Bring SACK option handling in tcp_dooptions() in line with all other
options and adjust users accordingly.


167831 23-Mar-2007 bms

Purge two redundant case labels.


167796 22-Mar-2007 glebius

Remove global list of all llinfo_arp entries and use a callout per
instance expiry of the ARP entries. Since we no longer abuse the IPv4
radix head lock, we can now enter arp_rtrequest() with a lock held on
an arbitrary rt_entry.

Reviewed by: bms


167785 21-Mar-2007 andre

ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.


167784 21-Mar-2007 andre

Match up SYSCTL declarations in style.


167780 21-Mar-2007 andre

Subtract optlen in the maximum length check for TSO and finally avoid
slightly oversized TSO mbuf chains.

Submitted by: kmacy


167779 21-Mar-2007 andre

Tidy up IPFIREWALL_FORWARD sections and comments.


167778 21-Mar-2007 andre

Update and clarify comments in first section of tcp_input().


167777 21-Mar-2007 andre

Tidy up the ACCEPTCONN section of tcp_input(), adjust comments and remove
old dead T/TCP code.


167775 21-Mar-2007 andre

Tidy up tcp_log_in_vain and blackhole.


167774 21-Mar-2007 andre

Make TCP_DROP_SYNFIN a standard part of TCP. It is disabled by default,
does not impede normal operation, and is only a few lines of
code. Its close relatives blackhole and log_in_vain aren't options
either.


167772 21-Mar-2007 andre

Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code. Should
the problem surface in the future we can simply resurrect it from cvs
history.


167739 20-Mar-2007 bms

Increase default size of raw IP send and receive buffers to the same as
udp_sendspace, to avoid a situation where jumbograms (datagrams > 9KB)
are unnecessarily fragmented.

A common use case for this is OSPF link-state database synchronization
during adjacency bringup on a high speed network with a large MTU.

It is not possible to auto-tune this setting until a socket is bound to
a given interface, and because the laddr part of the inpcb tuple may be
overridden, it makes no sense to do so. Applications may request a larger
socket buffer size by using the SO_SNDBUF and SO_RCVBUF socket options.

Certain applications such as Quagga ospfd do not probe for interface MTU
and therefore do not increase SO_SNDBUF in this use case.
XORP is not affected by this problem as it preemptively uses SO_SNDBUF
and SO_RCVBUF to account for any possible additional latency in XRL IPC.

PR: kern/108375
Requested by: Vladimir Ivanov
MFC after: 1 week
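
As a user-space illustration of the advice above, here is a minimal sketch of an application asking for larger socket buffers through the standard socket-level options SO_SNDBUF and SO_RCVBUF. The helper name ex_set_bufsizes() is hypothetical and not part of this commit; the kernel may clamp or adjust the requested size.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: request the same send and receive buffer size
 * on an already-open socket. Returns 0 on success, -1 on failure
 * (errno is left as set by setsockopt()).
 */
static int
ex_set_bufsizes(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) == -1)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == -1)
        return -1;
    return 0;
}
```

A routing daemon would typically call such a helper once, right after creating its raw socket and before the first large send.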


167736 20-Mar-2007 rrs

- window update sacks sent incorrectly after
shutdown which caused extra abort from peer.
- RTT time calculation was not being done in
express sack handling since it referred to an unused
variable (rto_pending). Removed variable.
- socket buffer high water access macro-ized.


167729 20-Mar-2007 bms

Implement reference counting for ifmultiaddr, in_multi, and in6_multi
structures. Detect when ifnet instances are detached from the network
stack and perform appropriate cleanup to prevent memory leaks.

This has been implemented in such a way as to be backwards ABI compatible.
Kernel consumers are changed to use if_delmulti_ifma(); in_delmulti()
is unable to detect interface removal by design, as it performs searches
on structures which are removed with the interface.

With this architectural change, the panics FreeBSD users have experienced
with carp and pfsync should be resolved.

Obtained from: p4 branch bms_netdev
Reviewed by: andre
Sponsored by: Garance A Drosehn
Idea from: NetBSD
MFC after: 1 month


167721 19-Mar-2007 andre

Match up SYSCTL declaration style.


167718 19-Mar-2007 andre

Match up SYSCTL_INT declarations in style.


167715 19-Mar-2007 andre

Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream
sockets.

Adjust tcp_output() to make use of it.

Tested by: gallatin
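
The pointer-and-offset idea can be sketched outside the kernel as follows. This is only an illustration under stated assumptions: struct ex_buf and struct ex_chain are hypothetical stand-ins for an mbuf and a socket buffer, and ex_chain_seek() is not the function this commit added.

```c
#include <stddef.h>

struct ex_buf {                 /* stand-in for one mbuf */
    struct ex_buf *next;
    size_t len;                 /* bytes of data in this buffer */
};

struct ex_chain {               /* stand-in for a socket buffer */
    struct ex_buf *head;
    struct ex_buf *hint;        /* buffer found by the last lookup */
    size_t hint_off;            /* chain offset of hint's first byte */
};

/*
 * Return the buffer holding byte 'off' of the chain and store the
 * offset inside that buffer in *inner. The walk resumes from the
 * cached hint when 'off' lies at or beyond it, so sequential reads
 * at growing offsets do not rescan the chain from the head.
 */
static struct ex_buf *
ex_chain_seek(struct ex_chain *c, size_t off, size_t *inner)
{
    struct ex_buf *b = c->head;
    size_t base = 0;

    if (c->hint != NULL && off >= c->hint_off) {
        b = c->hint;
        base = c->hint_off;
    }
    while (b != NULL && off >= base + b->len) {
        base += b->len;
        b = b->next;
    }
    if (b == NULL)
        return NULL;            /* offset is past the end of the chain */
    c->hint = b;
    c->hint_off = base;
    *inner = off - base;
    return b;
}
```

Any code that drops data from the front of the chain would also have to reset or rebase the hint; that bookkeeping is left out of the sketch.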


167698 19-Mar-2007 rrs

Adds a hash table to speed local address lookup
on a per VRF basis (BSD has only one VRF currently).
Hash table is sized to 16 but may need to be adjusted
for machines with large numbers of addresses.
Reviewed by: gnn


167695 19-Mar-2007 rrs

- errno -> becomes error in sctp_output.c and sctputil.c
- SB_CLEAR macro defined and used for sb clearing.
- Fix for CMT express_sack_handling did not do proper
pseudo-cumack updates.
- Get rid of extraneous function that was never used ip_2_ip6_hdr()
- Fixed source address selection bug (initialization problem).
- Source address selection debug added.


167682 18-Mar-2007 bms

In IPv4 fast forwarding path, send ICMP unreachable messages for
routes which have RTF_REJECT set *and* a zero expiry timer.

PR: kern/109246
MFC after: 10 days
Submitted by: Ingo Flaschberger


167659 17-Mar-2007 andre

Unbreak IPv6 after consolidation of TCP options insertion.

Submitted by: tegge


167658 17-Mar-2007 kmacy

Fix the most obvious of the bugs introduced by recent syncache changes

- *ip is not initialized in the case of inet6 connection, but ip->ip_len is
being changed anyway

Now the question is, why does it think an ipv4 connection is an ipv6 connection?
xemacs still doesn't work over X11 forwarding, but the kernel no longer panics.


167636 16-Mar-2007 rwatson

Remove unused and #if 0'd net.inet.tcp.tcp_rttdflt sysctl.


167606 15-Mar-2007 andre

Consolidate insertion of TCP options into a segment from within tcp_output()
and syncache_respond() into its own generic function tcp_addoptions().

tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

In struct tcpopt rename to_requested_s_scale to just to_wscale.

Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."

Reviewed by: silby, mohans, julian
Sponsored by: TCP/IP Optimization Fundraise 2005


167598 15-Mar-2007 rrs

- Sysctl's move to separate file
- moved away from ifn/ifa access to sctp_ifa/sctp_ifn
built and managed by the add-ip code.
- cleaned up add-ip code to use the iterator
- made iterator be a thread, which enables auto-asconf now.
- rewrote and cleaned up source address selection (also
made it use new structures).
- Fixed a couple of memory leaks.
- DACK now settable as to how many packets to delay as
well as time.
- connectx() to latest socket API, new associd arg.
- Fixed issue with revoking and losing potential to
send when we inflate the flight size. We now inflate
the cwnd too and deflate it later when the revoked
chunk is sent or acked.
- Got rid of some temp debug code
- src addr selection moved to a common file (sctp_output.c)
- Support for simple VRF's (we have support for multi-vrf
via compile switch that is scrubbed from BSD but we won't
need multi-vrf until we first get VRF :-D)
- Rest of mib work for address information now done
- Limit number of addresses in INIT/INIT-ACK to
a #def (30).

Reviewed by: gnn


167593 15-Mar-2007 bms

Diff reduction with NetBSD; use IN_LOCAL_GROUP() to check if an address
is within the locally scoped multicast range 224.0.0.0/24.


167342 08-Mar-2007 bms

Fix IP_SENDSRCADDR semantics.

* To use this option with a UDP socket, it must be bound to a local port,
and INADDR_ANY, to disallow possible collisions with existing udp inpcbs
bound to the same port on other interfaces at send time.

* If the socket is bound to INADDR_ANY, specifying IP_SENDSRCADDR with
INADDR_ANY will be rejected as it is ambiguous.

* If the socket is bound to an address other than INADDR_ANY, specifying
IP_SENDSRCADDR with INADDR_ANY will be disallowed by in_pcbbind_setup().

Reviewed by: silence on -net
Tested with: src/tools/regression/netinet/ipbroadcast
MFC after: 4 days


167310 07-Mar-2007 qingli

This patch fixes a couple of deployment issues observed
in the field. In one situation, one end of the TCP connection sends
a back-to-back RST packet while, because of delayed ACK, the last_ack_sent
variable has not been updated yet. When tcp_insecure_rst is turned off, the code
treats the RST as invalid because last_ack_sent instead of rcv_nxt is
compared against th_seq. Apparently there is some kind of firewall that
sits in between the two ends and that RST packet is the only RST
packet received. With short-lived HTTP connections, the symptom is
a large accumulation of connections over a short period of time.

The +/-(1) factor is to take care of implementations out there that
generate RST packets with these types of sequence numbers. This
behavior has also been observed in live environments.

Reviewed by: silby, Mike Karels
MFC after: 1 week
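
The +/-(1) tolerance described above can be illustrated with a small sketch. It assumes the usual modular 32-bit sequence arithmetic of BSD TCP and models only a single check against rcv_nxt; the names EX_SEQ_GEQ, EX_SEQ_LEQ and ex_rst_seq_ok() are hypothetical, and the real test in tcp_input() has more conditions than shown here.

```c
#include <stdint.h>

/* Modular comparisons on 32-bit TCP sequence numbers. */
#define EX_SEQ_GEQ(a, b) ((int32_t)((a) - (b)) >= 0)
#define EX_SEQ_LEQ(a, b) ((int32_t)((a) - (b)) <= 0)

/*
 * Accept an RST whose sequence number is within one of the next
 * sequence number we expect to receive. Wraparound at 2^32 is
 * handled by the modular comparisons.
 */
static int
ex_rst_seq_ok(uint32_t th_seq, uint32_t rcv_nxt)
{
    return EX_SEQ_GEQ(th_seq, rcv_nxt - 1) &&
        EX_SEQ_LEQ(th_seq, rcv_nxt + 1);
}
```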


167205 04-Mar-2007 bms

Purge an out-of-date comment.


167141 01-Mar-2007 bms

Fix undirected broadcast sends for the case where SO_DONTROUTE has also
been set at the socket layer, in our somewhat convoluted IPv4 source
selection logic in ip_output().

IP_ONESBCAST is actually a special case of SO_DONTROUTE, as 255.255.255.255
must always be delivered on a local link with a TTL of 1.

If IP_ONESBCAST has been set at the socket layer, also perform destination
interface lookup for point-to-point interfaces based on the destination
address of the link; previously it was not possible to use the option with
such interfaces; also, the destination/broadcast address fields map to the
same field within struct ifnet, which doesn't help matters.

One more valid fix going forward for these issues is to treat 255.255.255.255
as a destination in its own right in the forwarding trie. Other
implementations do this. It fits with the use of multiple paths, though
it then becomes necessary to specify interface preference.
This hack will eventually go away when that comes to pass.

Reviewed by: andre
MFC after: 1 week


167139 01-Mar-2007 andre

Prevent TSO mbuf chain from overflowing a few bytes by subtracting the
TCP options size before the TSO total length calculation.

Bug found by: kmacy
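
The arithmetic behind this fix can be sketched as a simple clamp. This is an assumption-laden illustration, not the code in tcp_output(): ex_tso_clamp() is a hypothetical helper, max_total stands for whatever upper bound applies to the whole chain, and hdrlen is taken to mean the fixed headers without TCP options.

```c
/*
 * Limit the payload length so that payload + fixed headers + TCP
 * options never exceeds max_total. Leaving optlen out of the
 * subtraction is what lets the chain overflow by a few bytes.
 */
static long
ex_tso_clamp(long len, long max_total, long hdrlen, long optlen)
{
    long limit = max_total - hdrlen - optlen;

    return (len > limit) ? limit : len;
}
```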


167120 28-Feb-2007 mohans

In the SYN_SENT case, Initialize the snd_wnd before the call to tcp_mss().
The TCP hostcache logic in tcp_mss() depends on the snd_wnd being initialized.


167116 28-Feb-2007 bms

Style: Move declaration of subsystem mutex to where other
mutexes are in this file, and use macros for dealing with it.


167107 28-Feb-2007 glebius

Add EHOSTDOWN and ENETUNREACH to the list of soft errors, that shouldn't
be returned up to the caller.

PR: 100172
Submitted by: "Andrew - Supernews" <andrew supernews.net>
Reviewed by: rwatson, bms
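
A minimal sketch of what a "soft error" classification might look like as a switch() block. Only EHOSTDOWN and ENETUNREACH are named in the commit message; the other two cases are assumptions about what the list already contained, and ex_is_soft_error() is a hypothetical name. The conditions under which the real code applies this classification are not modeled.

```c
#include <errno.h>

/*
 * Return 1 if the error should only be remembered on the connection
 * rather than returned to the caller, 0 if it should be reported.
 */
static int
ex_is_soft_error(int error)
{
    switch (error) {
    case EHOSTDOWN:     /* named in the commit message */
    case ENETUNREACH:   /* named in the commit message */
    case EHOSTUNREACH:  /* assumed to be on the list already */
    case ENETDOWN:      /* assumed to be on the list already */
        return 1;
    default:
        return 0;
    }
}
```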


167106 28-Feb-2007 glebius

Rework the code that handles errors from ip_output() to make it more
readable:
- Merge two embedded if() into one.
- Introduce switch() block to handle different kinds of errors.

Reviewed by: rwatson, bms


167072 27-Feb-2007 bms

Add INADDR_ALLRPTS_GROUP define for 224.0.0.22 for future IGMPv3 support.

Obtained from: OpenSolaris


167036 26-Feb-2007 mohans

Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigates
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.


166972 25-Feb-2007 bms

Unlock a mutex which should be unlocked before returning.

MFC after: 1 week


166938 24-Feb-2007 bms

Make IPv6 multicast forwarding dynamically loadable from a GENERIC kernel.
It is built in the same module as IPv4 multicast forwarding, i.e. ip_mroute.ko,
if and only if IPv6 support is enabled for loadable modules.
Export IPv6 forwarding structs to userland netstat(1) via sysctl(9).


166842 20-Feb-2007 rwatson

Rename two identically named log_in_vain variables: tcp_input.c's static
log_in_vain to tcp_log_in_vain, and udp_usrreq's global log_in_vain to
udp_log_in_vain.

MFC after: 1 week


166841 20-Feb-2007 rwatson

Gratuitous UDP restyling toward style(9) in 7.x.


166811 18-Feb-2007 rwatson

#ifdef INET6 printing of inpcb IPv6 addresses in DDB. Patch committed
with minor adjustments.

Submitted by: Florian C. Smeets <flo at kasimir dot com>


166807 17-Feb-2007 rwatson

Add "show inpcb", "show tcpcb" DDB commands, which should come in handy
for debugging sblock and other network panics.


166793 16-Feb-2007 rwatson

Remove unused inp6_ifindex field from inpcb, as well as unused macro
shortcut for it.


166792 16-Feb-2007 rwatson

Remove unused in6p_ip6_hlim macro shortcut for non-present
inp_depend6.inp6_hlim field in the inpcb.


166675 12-Feb-2007 rrs

- Copyright updates (aka 2007)
- ZONE get now also takes a type cast so it does the
cast like mtod does.
- New macro SCTP_LIST_EMPTY, which in bsd is just
LIST_EMPTY
- Removal of const in some of the static hmac functions
(not needed)
- Store length changes to allow for new fields in auth
- Auth code updated to current draft (this should be the
RFC version we think).
- use uint8_t instead of u_char in LOOPBACK address comparison
- Some u_int32_t converted to uint32_t (in crc code)
- A bug was found in the mib counts for ordered/unordered
count, this was fixed (was referencing a freed mbuf).
- SCTP_ASOCLOG_OF_TSNS added (code will probably disappear
after my testing completes. It allows us to keep a
small log on each assoc of the last 40 TSN's in/out and
stream assignment. It is NOT in options and so is only
good for private builds).
- Some CMT changes in prep for Jana fixing his problem
with reneging when CMT is enabled (Concurrent Multipath
Transfer = CMT).
- Some missing mib stats added.
- Correction to number of open assoc's count in mib
- Correction to os_bsd.h to get right sha2 macros
- Add of special AUTH_04 flags so you can compile the code
with the old format (in case the peer does not yet support
the latest auth code).
- Nonce sum was incorrectly being set in when ecn_nonce was
NOT on.
- LOR in listen with implicit bind found and fixed.
- Moved away from using mbuf's for socket options to using
just data pointers. The mbufs were used to harmonize
NetBSD code since both Net and Open used this method. We
have decided to move away from that and more conform to
FreeBSD style (which makes more sense).
- Very very nasty bug found in some of my "debug" code. The
cookie_how collision case tracking had an endless loop in
it if you got a second retransmission of a cookie collision
case. This would lock up a CPU .. ugly..
- auth function goes to using size_t instead of int which
conforms to socketapi better
- Found the nasty bug that happens after 9 days of testing.. you
get the data chunk, deliver it and due to the reference to a ch->
that every now and then has been deleted (depending on the position
in the mbuf) you have an invalid ch->ch.flags.. and thus you don't
advance the stream sequence number.. so you block the stream
permanently. The fix is to make local variables of these guys
and set them up before you have any chance of trimming the
mbuf.
- style fix in sctp_util.h, not sure how this got bad maybe in
the last patch? (aka it may not be in the real source).
- Found interesting bug when using the extended snd/rcv info where
we would get an error on receiving with this. That's because
it was NOT padded to the same size as the snd_rcv info. We
increase (add the pad) so the two structs are the same size
in sctp_uio.h
- In sctp_usrreq.c one of the most common things we did for
socket options was to cast the pointer and validate the size.
This has been macro-ized to help make the code more readable.
- in sctputil.c two things, the socketapi class found a missing
flag type (the next msg is a notification) and a missing
scope recovery was also fixed.

Reviewed by: gnn


166629 10-Feb-2007 bms

Use MAXTTL.

Obtained from: NetBSD


166623 10-Feb-2007 bms

If the rendezvous point for a group is not specified, do not send
IGMPMSG_WHOLEPKT notifications to the userland PIM routing daemon,
as an optimization to mitigate the effects of high multicast
forwarding load.

This is an experimental change, therefore it must be explicitly enabled by
setting the sysctl/tunable net.inet.pim.squelch_wholepkt to a non-zero value.
The tunable may be set from the loader or from within the kernel environment
when loading ip_mroute.ko as a module.

Submitted by: edrt <edrt at citiz.net>
See also: http://mailman.icsi.berkeley.edu/pipermail/xorp-users/2005-June/000639.html


166622 10-Feb-2007 bms

Build PIM by default as part of the IPv4 multicast forwarding path.
Make PIM dynamically loadable by using encap_attach_func().
PIM may now be loaded into a GENERIC kernel.

Tested with: ports/net/pimdd && tcpreplay && wireshark
Reviewed by: Pavlin Radoslavov


166576 08-Feb-2007 bms

Store the cached route in vifp in the normal send_packet() case.
The VIFF_TUNNEL case no longer exists, therefore this field is free to
use, and its use eliminates a static data member.


166575 08-Feb-2007 bms

Nuke the token bucket filter code. Attempting to request rate limiting
by the token bucket filter will result in EINVAL being returned.

If you want to rate-limit traffic in future, use ALTQ or dummynet; this
isn't a general purpose QoS engine.

Preserve the now unused fields in struct vif so as to avoid having to
recompile netstat(1) and other tools.

Reviewed by: Pavlin Radoslavov, Bill Fenner


166555 07-Feb-2007 bms

eliminate redundant macro MC_SEND()


166549 07-Feb-2007 bms

Remove support for IPIP tunnels in IPv4 multicast forwarding. XORP has
never used them; with mrouted, their functionality may be replaced by
explicitly configuring gif(4) instances and specifying them with the
'phyint' keyword.

Bump __FreeBSD_version to 700030, and update UPDATING.
A doc update is forthcoming.

Discussed on: net
Reviewed by: fenner
MFC after: 3 months


166507 05-Feb-2007 bms

When fast-forwarding is enabled, do not forward directed IPv4 broadcasts
to locally attached broadcast networks.

Note well: This relies on the layer 2 route cloning behaviour in BSD.

PR: 98799
Tested by: Dmitry Sergienko
MFC after: 1 week


166479 03-Feb-2007 alc

Include opt_ipdivert.h so that the message announcing ipfw correctly
describes the state of IPDIVERT.


166452 03-Feb-2007 bms

In fast forwarding path, defer processing of 169.254.0.0/16
to ip_input(). See RFC 3927 section 2.7.


166450 03-Feb-2007 bms

In regular forwarding path, reject packets destined for 169.254.0.0/16
link-local addresses. See RFC 3927 section 2.7.


166436 02-Feb-2007 bms

Comply with RFC 3927, by forcing ARP replies which contain a source
address within the link-local IPv4 prefix 169.254.0.0/16, to be
broadcast at link layer.

Reviewed by: fenner
MFC after: 2 weeks


166433 02-Feb-2007 bms

Expose smoothed RTT and RTT variance measurements to userland via
socket option TCP_INFO.
Note that the units used in the original Linux API are in microseconds,
so use a 64-bit mantissa to convert FreeBSD's internal measurements
from struct tcpcb from ticks.
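
The reason for the wider intermediate can be shown with a short sketch. It assumes a plain tick count and a clock rate hz; the fixed-point scaling that struct tcpcb applies to its RTT fields is not modeled, and ex_ticks_to_usec() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Convert a tick count to microseconds. The multiplication is done
 * in 64 bits because ticks * 1000000 overflows 32 bits as soon as
 * ticks exceeds 4294, even when the final result fits.
 */
static uint32_t
ex_ticks_to_usec(uint32_t ticks, uint32_t hz)
{
    return (uint32_t)(((uint64_t)ticks * 1000000U) / hz);
}
```

With hz = 1000, a value of 5000 ticks converts to 5000000 microseconds, although the intermediate product (5000000000) does not fit in 32 bits.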


166423 02-Feb-2007 glebius

Since rev. 1.94 of netinet/in.c, the netinet layer frees all its
multicast memberships, when interface is detached. Thus, when
an underlying interface is detached, we do not need to free
our multicast memberships.

Reviewed by: bms


166405 01-Feb-2007 andre

Auto sizing TCP socket buffers.

Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
net.inet.tcp.sendbuf_auto=1 (enabled)
net.inet.tcp.sendbuf_inc=8192 (8K, step size)
net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
net.inet.tcp.recvbuf_auto=1 (enabled)
net.inet.tcp.recvbuf_inc=16384 (16K, step size)
net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by: many (on HEAD and RELENG_6)
Approved by: re
MFC after: 1 month
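
The throughput figures quoted in this message follow from the fact that a sender can have at most one buffer of data in flight per round trip. A small sketch of that arithmetic, with a hypothetical helper name:

```c
/*
 * Upper bound on throughput in Mbit/s when at most buf_bytes can be
 * outstanding per round-trip time of rtt_seconds.
 */
static double
ex_max_mbit_per_s(double buf_bytes, double rtt_seconds)
{
    return (buf_bytes * 8.0) / rtt_seconds / 1e6;
}
```

For a 32768-byte buffer this gives about 2.6 Mbit/s at 100ms and 1.3 Mbit/s at 200ms; for the 262144-byte growth limit it gives about 21 Mbit/s and 10.5 Mbit/s respectively, in line with the rounded numbers above.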


166403 01-Feb-2007 andre

Change the way the advertised TCP window scaling is computed. Instead of
upper-bounding it to the size of the initial socket buffer lower-bound it
to the smallest MSS we accept. Ideally we'd use the actual MSS information
here but it is not available yet.

For socket buffer auto sizing to be effective we need room to grow the
receive window. The window scale shift is determined at connection setup
and can't be changed afterwards. The previous, original, method effectively
just did a power of two roundup of the socket buffer size at connection
setup severely limiting the headroom for larger socket buffers.

Tested by: many (as part of the socket buffer auto sizing patch)
MFC after: 1 month
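
One possible reading of "lower-bound it to the smallest MSS we accept" is sketched below: pick the smallest shift whose window granularity (1 << shift) is at least the minimum MSS. This is an interpretation rather than the committed code; ex_window_shift() is a hypothetical name, and the limit of 14 is the maximum shift count from RFC 1323.

```c
#define EX_TCP_MAX_WINSHIFT 14  /* largest shift count RFC 1323 allows */

/*
 * Return the smallest window scale shift such that one unit of the
 * scaled window (1 << shift bytes) is at least min_mss bytes,
 * capped at EX_TCP_MAX_WINSHIFT.
 */
static int
ex_window_shift(unsigned int min_mss)
{
    int wscale = 0;

    while (wscale < EX_TCP_MAX_WINSHIFT && (1U << wscale) < min_mss)
        wscale++;
    return wscale;
}
```

With an illustrative minimum MSS of 216 bytes this yields a shift of 8, which would leave far more headroom for growing receive buffers than a shift derived from a 64K initial buffer.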


166368 31-Jan-2007 bms

Import macros IN_LINKLOCAL(), IN_PRIVATE(), IN_LOCAL_GROUP(), IN_ANY_LOCAL().
This is not a functional change.

IN_LINKLOCAL() tests if an address falls within the IPv4 link-local prefix.
IN_PRIVATE() tests if an address falls within an RFC 1918 private prefix.
IN_LOCAL_GROUP() tests if an address falls within the statically assigned
link-local multicast scope specified in RFC 2365.
IN_ANY_LOCAL() tests for either of IN_LINKLOCAL() or IN_LOCAL_GROUP().

As with the existing macros in the FreeBSD netinet stack, comparisons
are performed in host-byte order.

See also: RFC 1918, RFC 2365, RFC 3927
Obtained from: NetBSD (dyoung@)
MFC after: 2 weeks
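
For illustration, tests of this kind can be written as mask-and-compare macros on host-byte-order addresses. The EX_ names below are hypothetical and the definitions are a sketch based on the prefixes named above (169.254.0.0/16, the RFC 1918 blocks, and 224.0.0.0/24), not a copy of the imported header.

```c
#include <stdint.h>

/* All arguments are IPv4 addresses in host byte order. */

/* 169.254.0.0/16 */
#define EX_IN_LINKLOCAL(a) \
    (((uint32_t)(a) & 0xffff0000U) == 0xa9fe0000U)

/* 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 */
#define EX_IN_PRIVATE(a) \
    ((((uint32_t)(a) & 0xff000000U) == 0x0a000000U) || \
     (((uint32_t)(a) & 0xfff00000U) == 0xac100000U) || \
     (((uint32_t)(a) & 0xffff0000U) == 0xc0a80000U))

/* 224.0.0.0/24 */
#define EX_IN_LOCAL_GROUP(a) \
    (((uint32_t)(a) & 0xffffff00U) == 0xe0000000U)

/* Either link-local unicast or link-local multicast. */
#define EX_IN_ANY_LOCAL(a) \
    (EX_IN_LINKLOCAL(a) || EX_IN_LOCAL_GROUP(a))
```

An address taken from a packet would need ntohl() applied before being passed to any of these.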


166228 25-Jan-2007 glebius

Make it possible for carpdetach() to unlock on return. Then, in
carp_clone_destroy() we are on the safe side: we don't need to
unlock the cif, which may already be non-existent at this point.

Reported by: Anton Yuzhaninov <citrin rambler-co.ru>


166226 25-Jan-2007 glebius

Spacing.


166086 18-Jan-2007 rrs

- most all includes (#include <>) migrate to the sctp_os_bsd.h file
- Finally all splxx() are removed
- Count error fixed in mapping array which might
cause a wrong cumack generation.
- Invariants around panic for case D + printf when no invariants.
- one-to-one model race condition fixed by using
a pre-formed connection and then completing the
work so accept won't happen on a non-formed
association.
- Some additional paranoia checks in sctp_output.
- Locks that were missing in the accept code.

Approved by: gnn


166023 15-Jan-2007 rrs

- Macroizes the V6ONLY flag check.
- Added a short time wait (not used yet) constant
- Corrected the type of the crc32c table (it was
unsigned long and really is a uint32_t).
- Got rid of the use of MHeaders until they
are truly needed by lower layers.
- Fixed an initialization problem in the readq structure
(ordering was off).
- Found yet another collision bug when the random number
generator returns two numbers on one side (during a collision)
that are the same. Also added some tracking of cookies
that will go away when we know that we have the last collision
bug gone.
- Fixed an init bug for book_size_scale, that was causing
Early FR code to run when it should not.
- Fixed a flight size tracking bug that was associated with
Early FR but due to above bug also affected all FR's
- Fixed it so Max Burst also will apply to Fast Retransmit.
- Fixed a bug in the temporary logging code that allowed a
static log array overflow
- hashinit_flags is now used.
- Two last mcopym's were converted to the macro sctp_m_copym that
has always been used by all other places
- macro sctp_m_copym was converted to upper case.
- We now validate sinfo_flags on input (we did not before).
- Fixed a bug that prevented a user from sending data and immediately
shutting down with one send operation.
- Moved to use hashdestroy instead of free() in our macros.
- Fixed an init problem in our timed_wait vtag where we
did not fully initialize our time-wait blocks.
- Timer stops were re-positioned.
- A pcb cleanup method was added, however this probably will
not be used in BSD.. unless we make module loadable protocols
- I think this fixes the mysterious timer bug.. it was an
ordering of locks problem in the way we did timers. It
now conforms to the timeout(9) manual (except for the
_drain part, we had to do this a different way due
to locks).
- Fixed error return code so we get either CONNREUSED or CONNRESET
depending on where one is in progression
- Purged an unused clone macro.
- Fixed a read error code issue where we were NOT getting the proper
error when the connection was reset.
Approved by: gnn


166010 14-Jan-2007 maxim

o Increment the requests counter only right before an ARP query is actually sent.
Otherwise the code could lead to spurious EHOSTDOWN errors.

PR: kern/107807
Submitted by: Dmitrij Tejblum
MFC after: 1 month


165966 12-Jan-2007 imp

Marking this as __packed was needed to get the alignment and offset of
members right. However, it also said it was aligned(1), which meant
that gcc generated really bad code. Mark this as aligned(4). This
makes things a little faster on arm (a couple percent), but also saves
about 30k on the size of the kernel for arm.

I talked about doing this with bde, but didn't check with him before
the commit, so I'm hesitant to say 'reviewed by: bde'.


165919 09-Jan-2007 julian

Remove two lines that somehow snuck back in after testing.
ip is now an argument to the function ipfw_log()


165831 06-Jan-2007 maxim

o One more typo in the comment.

PR: kern/107609
Submitted by: Dr. Markus Waldeck


165802 05-Jan-2007 piso

Prevent adding a rule with a nat action in case IPFIREWALL_NAT was not defined.

Reviewed by: luigi


165750 03-Jan-2007 piso

Wrap ipfw nat support in a new kernel config option named
"IPFIREWALL_NAT": this way nat is turned off by default and
POLA is preserved.

Reviewed by: rwatson


165738 02-Jan-2007 julian

Remove a bunch of dependencies in the IP header being the first thing in the
mbuf. First moves toward being able to cope better with having layer 2 (or
other encapsulation data) before the IP header in the packet being examined.
More commits to come to round out this functionality. This commit should
have no practical effect but clears the way for what is coming.
Reviewed by: luigi, yar
MFC After: 2 weeks


165710 01-Jan-2007 imp

Fix typo in comment.

Submitted by: remko


165709 31-Dec-2006 imp

Add comment about udp checksums being off in BSD 4.2 compatibility mode.

Submitted by: Dr. Markus Waldeck
PR: kern/106657


165657 30-Dec-2006 jhb

Whitespace fix and remove an extra cast.


165648 29-Dec-2006 piso

Summer of Code 2005: improve libalias - part 2 of 2

With the second (and last) part of my previous Summer of Code work, we get:

-ipfw's in kernel nat

-redirect_* and LSNAT support

General information about nat syntax and some examples are available
in the ipfw(8) man page. The redirect and LSNAT syntax are identical
to natd, so please refer to the natd(8) man page.

To enable in kernel nat in rc.conf, two options were added:

o firewall_nat_enable: equivalent to natd_enable

o firewall_nat_interface: equivalent to natd_interface

Remember to set net.inet.ip.fw.one_pass to 0, if you want the packet
to continue being checked by the firewall ruleset after being
(de)aliased.

NOTA BENE: due to some problems with libalias architecture, in kernel
nat won't work with TSO enabled nic, thus you have to disable TSO via
ifconfig (ifconfig foo0 -tso).

Approved by: glebius (mentor)


165647 29-Dec-2006 rrs

a) macro-ization of all mbuf and random number
access plus timers. This makes the code
more portable and able to change out the
mbuf or timer system used more easily ;-)
b) removal of all use of pkt-hdr's until only
the places we need them (before ip_output routines).
c) remove a bunch of code not needed due to (b), aka
worrying about pkthdr's :-)
d) There was, it looks, one last reorder problem: if
a restart occurs and we release and relock (at
the point where we setup our alias vtag) we would
end up possibly getting the wrong TSN in place. The
code that fixed the TSN's just needed to be shifted
around BEFORE the release of the lock.. also code that
set the state (since this also could contribute).
Approved by: gnn


165634 29-Dec-2006 jhb

Some whitespace nits and remove a few casts.


165243 15-Dec-2006 piso

o made in kernel libalias mpsafe
o fixed a comment
o made in kernel libalias a bit less verbose (disabled automatic
logging every time a new link is added or deleted)

Approved by: glebius (mentor)


165220 14-Dec-2006 rrs

1) Fixes on a number of different collision case LOR's.
2) Fix all "magic numbers" to be constants.
3) A collision case that would generate two associations to
the same peer due to a missing lock is fixed.
4) Added tracking of where timers are stopped.
Approved by: gnn


165149 13-Dec-2006 csjp

Fix LOR between the syncache and inpcb locks when MAC is present in the
kernel. This LOR snuck in with some of the recent syncache changes. To
fix this, the inpcb handling was changed:

- Hang a MAC label off the syncache object
- When the syncache entry is initially created, the PCB lock
is held because we extract information from it while initializing the
syncache entry. While we do this, copy the MAC label associated with
the PCB and use it for the syncache entry.
- When the packet is transmitted, copy the label from the syncache entry
to the mbuf so it can be processed by security policies which analyze
mbuf labels.

This change required that the MAC framework be extended to support the
label copy operations from the PCB to the syncache entry, and then from
the syncache entry to the mbuf.

These functions really should be referencing the syncache structure instead
of the label. However, due to some of the complexities associated with
exposing this syncache structure we operate directly on its label pointer.
This should be OK since we aren't making any access control decisions within
this code directly, we are merely allocating and copying label storage so
we can properly initialize mbuf labels for any packets the syncache code
might create.

This also has a nice side effect of caching. Prior to this change, the
PCB would be looked up/locked for each packet transmitted. Now the label
is cached at the time the syncache entry is initialized.

Submitted by: andre [1]
Discussed with: rwatson

[1] andre submitted the tcp_syncache.c changes


165123 12-Dec-2006 bz

In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.

This is the "+ one more change" missed in the original commit.

Noticed by: tinderbox
Pointy hat to: me (#1)


165118 12-Dec-2006 bz

MFp4: 92972, 98913 + one more change

In ip6_sprintf no longer use and return one of eight static buffers
for printing/logging ipv6 addresses.
The caller now has to hand in a sufficiently large buffer as first
argument.
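
The caller-supplied-buffer pattern described above can be sketched in
userland C. This is not the kernel's ip6_sprintf(); fmt_ip6() is a
hypothetical helper that wraps the standard inet_ntop() call, and it only
illustrates why a caller-owned buffer avoids sharing static storage.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption, not the kernel function): format an
 * IPv6 address into a buffer owned by the caller rather than into one of a
 * fixed set of static buffers.  Returns buf on success, or NULL if the
 * buffer is too small to hold the text form.
 */
static const char *
fmt_ip6(char *buf, size_t buflen, const struct in6_addr *addr)
{
	return (inet_ntop(AF_INET6, addr, buf, (socklen_t)buflen));
}
```

A caller would typically declare char buf[INET6_ADDRSTRLEN] on its own
stack, so two concurrent log calls can never overwrite each other's output.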


165082 10-Dec-2006 bms

Back out revision 1.264.

Fixing the IP accounting issue, if we plan to do so, needs to be better
thought out; the 'fix' introduces a hash lookup and a possible kernel panic.

Reported by: Mark Tinguely


164863 04-Dec-2006 rwatson

Improve style(9) conformance of igmp.c.


164808 01-Dec-2006 imp

Make sure that carp_header is 36 bytes long
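
A size check like this is usually expressed as a compile-time assertion.
The sketch below uses C11 _Static_assert on an illustrative structure; the
field names and layout are assumptions made for the example and are not
copied from ip_carp.h.

```c
#include <stdint.h>

/*
 * Illustrative layout only (assumed, not the kernel definition): a header
 * made of six one-byte fields, a 16-bit checksum, a 64-bit counter split
 * into two 32-bit words, and a 20-byte digest.
 */
struct carp_header_sketch {
	uint8_t		vers_type;	/* version and type, 4 bits each */
	uint8_t		vhid;
	uint8_t		advskew;
	uint8_t		authlen;
	uint8_t		pad1;
	uint8_t		advbase;
	uint16_t	cksum;
	uint32_t	counter[2];
	uint8_t		md[20];
};

/* Compilation fails if padding or a field change alters the wire size. */
_Static_assert(sizeof(struct carp_header_sketch) == 36,
    "header sketch must be 36 bytes");
```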


164798 01-Dec-2006 piso

Make libalias.conf parsing a bit smarter.
This closes PR kern/106112.

While here, add mbuf's #includes I forgot in the previous commit.

Approved by: gleb


164797 01-Dec-2006 piso

Remove m_megapullup from ng_nat and put it under libalias.

Approved by: gleb


164768 30-Nov-2006 rwatson

Consistently use #ifdef INET6 rather than mixing and matching with
#if defined(INET6).

Don't comment the end of short #ifdef blocks.

Comment cleanup.

Line wrap.


164516 22-Nov-2006 sam

Change error codes returned by protocol operations when an inpcb is
marked INP_DROPPED or INP_TIMEWAIT:
o return ECONNRESET instead of EINVAL for close, disconnect, shutdown,
rcvd, rcvoob, and send operations
o return ECONNABORTED instead of EINVAL for accept

These changes should reduce confusion in applications since EINVAL is
normally interpreted to mean an invalid file descriptor. This change
does not conflict with POSIX or other standards I checked. The return
of EINVAL has always been possible but rare; it's become more common
with recent changes to the socket/inpcb handling and with finer-grained
locking and preemption.

Note: there are other instances of EINVAL for this state that were
left unchanged; they should be reviewed.

Reviewed by: rwatson, andre, ru
MFC after: 1 month


164258 13-Nov-2006 bz

Add SCTP as a known upper layer protocol over v6.
We are not yet aware of the protocol internals but this way
SCTP traffic over v6 will not be discarded.

Reported by: Peter Lei via rrs
Tested by: Peter Lei <peterlei cisco.com>


164205 11-Nov-2006 rrs

In a true restart case, the send_lock was
not being acquired. This meant that when we cleanup
the outbound we may have one in transit to be
added with the old sequence number. This is bad
since then we lose a message :(

Also the report_outbound needed to have the right
lock when its called which it did not.. I added
the lock with of course a flag since we want to
have the lock before we call it in the restart
case.

This also fixed the FIX ME case where, in the cookie
collision case, we mark for retransmit any that
were bundled with the cookie that was dropped.
This also means changes to the output routine
so we can assure getting the COOKIE-ACK sent
BEFORE we retransmit the Data.

Approved by: gnn


164181 11-Nov-2006 rrs

Turns out we would reset the TSN seq counter during
a colliding INIT. This is fine except when we have
data outstanding... we basically reset it to the
previous value it was.. so then we end up assigning
the same TSN to two different data chunks.
This patch:

1) Finds a missing lock for when we change the stream
numbers during COOKIE and INIT-ACK processing.. we
were NOT locking the send_buffer.. which COULD cause
problems (found by inspection looking for <2>)

2) Fixes a case during a colliding INIT where we incorrectly
reset the sending Sequence thus in some cases duplicately
assigning a TSN.

3) Additional enhancements to logging so we can see strm/tsn in
the receiver AND new tracking to watch what the sender
is doing with TSN and STRM seq's.

Approved by: gnn


164144 10-Nov-2006 rrs

This patch fixes a LOR that happens during INIT-ACK collision.
We were calling select_a_tag() inside sctp_send_initiate_ack().
During collision cases we have a stcb and thus a SCTP_LOCK. When
we call select_a_tag it (below it) locks the INFO lock. We now
1) pre-select the nonce-tie-tags in sctputil.c during setup of
a tcb.
2) In the other case where we have to select tags, we unlock after
incr the ref cnt (so assoc won't go away) and then do the
tag selection followed by a relock and decr the refcnt.
Approved by: gnn


164139 09-Nov-2006 rrs

Fixes an issue with handling of stream reset. When a
reset comes in we need to calculate the length and
therefore the number of listed streams (if any) based
on the TLV type. Otherwise if we get a retran we could
in theory panic by sending a notification to a user with
an incorrect list and thus no memory listing the streams.
Found in IOS by devtest :-)
Approved by: gnn


164085 08-Nov-2006 rrs

-Fixes first of all the getcred on IPv6 and V4. The
copies were incorrect and so was the locking.
-A bug was also found that would create a race and
panic when an abort arrived on a socket being read
from.
-Also fix the reader to get MSG_TRUNC when a partial
delivery is aborted.
-Also addresses a couple of coverity caught error path
memory leaks and a couple of other valid complaints
Approved by: gnn


164075 07-Nov-2006 marcus

Fix TFTP NAT support by making sure the appropriate fingerprinting checks
are done.

Reviewed by: piso


164039 06-Nov-2006 rwatson

Convert three new suser(9) calls introduced between when the priv(9)
patch was prepared and committed to priv(9) calls. Add XXX comments
as, in each case, the semantics appear to differ from the TCP/UDP
versions of the calls with respect to jail, and because cr_canseecred()
is not used to validate the query.

Obtained from: TrustedBSD Project


164038 06-Nov-2006 rrs

This change tracks down the EEOR->NonEEOR mode failure
to wakeup on close of the sender. It basically moves
the return (when the asoc has a reader/writer) further
down and gets the wakeup and assoc appending (of the
PD-API event) moved up before the return. It also
moves the flag set right before the return so we can
assure only once adding the PD-API events.

Approved by: gnn


164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


163998 05-Nov-2006 ru

Revert previous commit, and instead make the expression in rev. 1.2
match the style of this file.

OK'ed by: rrs


163996 05-Nov-2006 rrs

Tons of fixes to get all the 64bit issues removed.
This also moves two 16 bit int's to become 32 bit
values so we do not have to use atomic_add_16.
Most of the changes are %p, casts and other various
nasties that were in the original code base. With this
commit my machine will now do a build universe.. however
I as yet have not tested on a 64bit machine .. it may not work :-(


163980 04-Nov-2006 ru

Fix pointer arithmetic to be 64-bit friendly.


163979 04-Nov-2006 ru

Remove bogus casts that Randall for some reason didn't borrow
from my supplied patch.


163974 04-Nov-2006 jb

Remove a bogus cast in an attempt to fix the tinderbox builds on
lots of arches.


163964 03-Nov-2006 rrs

More 64 bit pointer fun.
%p changed in multiple prints
the mtod() was also fixed.


163959 03-Nov-2006 rrs

Fix two of the 64bit errors on the printfs.


163957 03-Nov-2006 rrs

Somehow I missed this one. The sys/cdefs.h was out
of order with respect to the FBSDID..


163954 03-Nov-2006 rrs

Oops... in my fix up of all the $FreeBSD:$-> $FreeBSD$ I
inserted a few to the new files.. but I failed to
add the #include <sys/cdefs.h>

Which causes a compile error.. sorry about that... got it
now :-)

Approved by: gnn


163953 03-Nov-2006 rrs

Ok, here it is, we finally add SCTP to current. Note that this
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need attaboys too:
****
peterlei@cisco.com
tuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0

So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.

I will see about committing some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)

There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but that's another beast too..

If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)

Reviewed by: gnn
Approved by: gnn


163758 29-Oct-2006 oleg

- Use non-recursive mutex. MTX_RECURSE is unnecessary since rev. 1.70
- Pay respect to net.isr.direct: use netisr_dispatch() instead of ip_input()

Reviewed by: glebius, rwatson

- purge_flow_set():
- Do not leak memory while purging queues which are not bound to pipe.
- style(9) cleanup

MFC after: 2 months


163721 27-Oct-2006 oleg

- Convert
net.inet.ip.dummynet.curr_time
net.inet.ip.dummynet.searches
net.inet.ip.dummynet.search_steps
to SYSCTL_LONG nodes. It will prevent frequent wrap around on 64bit archs.

- Implement simple mechanics for dummynet(4) internal time correction.
Under certain circumstances (system high load, dummynet lock contention, etc)
dummynet's tick counter can be significantly slower than it should be.
(I've observed up to 25% difference on one of my production servers).
Since this counter is used for packet scheduling, its accuracy is vital for
precise bandwidth limitation.

Introduce new sysctl nodes:
net.inet.ip.dummynet.
tick_lost - number of ticks coalesced by taskqueue thread.
tick_adjustment - number of time corrections done.
tick_diff - adjusted vs non-adjusted tick counter difference
tick_delta - last vs 'standard' tick difference (usec).
tick_delta_sum - accumulated (and not corrected yet) time
difference (usec).

Reviewed by: glebius
MFC after: 2 month


163720 27-Oct-2006 oleg

Use separate thread for servicing dummynet(4).
Utilize taskqueue(9) API.

Submitted by: glebius
MFC after: 2 month


163717 27-Oct-2006 oleg

style(9) cleanup.

MFC after: 2 month


163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


163548 21-Oct-2006 julian

revert last change.. premature.. need to wait until if_ethersubr.c
uses pfil to get to ipfw.


163545 20-Oct-2006 julian

Move some variables to a more likely place
and remove "temporary" stuff that is not needed any more.


163237 11-Oct-2006 maxim

o Do not do args->f_id.addr_type == 6 when there is
IS_IP6_FLOW_ID() exactly for that.


163236 11-Oct-2006 maxim

o Kill a nit in the comment.


163235 11-Oct-2006 maxim

o Extend not very informative ipfw(4) message 'drop session, too many
entries' by src:port and dst:port pairs. IPv6 part is non-functional
as ``limit'' does not support IPv6 flows.

PR: kern/103967
Submitted by: based on Bruce Campbell patch
MFC after: 1 month


163224 11-Oct-2006 ru

Merge the rest of my changes.


163127 08-Oct-2006 piso

Various mdoc and grammar fixes.

Approved by: glebius
Reviewed by: glebius, ru


163069 07-Oct-2006 bz

Set scope on MC address so IPv6 carp advertisement will not get dropped
in ip6_output. In case this fails handle the error directly and log it[1].
In addition permit CARP over v6 in ip_fw2.

PR: kern/98622
Similar patch by: suz
Discussed with: glebius [1]
Tested by: Paul.Dekkers surfnet.nl, Philippe.Pegon crc.u-strasbg.fr
MFC after: 3 days


163006 04-Oct-2006 glebius

Save space on stack moving token ring stuff to its own hack block.


163005 04-Oct-2006 glebius

Style rev. 1.152.


162798 29-Sep-2006 andre

Remove stone-aged and irrelevant "#ifndef notdef".


162797 29-Sep-2006 bms

Nits.

Submitted by: ru


162794 29-Sep-2006 bms

Push removal of mrouted down to the rest of the tree.


162768 29-Sep-2006 maxim

o Convert w/spaces to tabs in the previous commit.


162767 29-Sep-2006 silby

Rather than autoscaling the number of TIME_WAIT sockets to maxsockets / 5,
scale it to min(ephemeral port range / 2, maxsockets / 5) so that people
with large gobs of memory and/or large maxsockets settings will not
exhaust their entire ephemeral port range with sockets in the TIME_WAIT
state during periods of heavy load.

Those who wish to tweak the size of the TIME_WAIT zone can still do so with
net.inet.tcp.maxtcptw.

Reviewed by: glebius, ru
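
The scaling rule above amounts to taking the smaller of two bounds. A
minimal sketch, assuming the ephemeral range is given as first and last
port numbers; tw_auto_limit() and its parameters are hypothetical names,
and the real kernel code may apply additional limits.

```c
/*
 * Hypothetical helper: size the TIME_WAIT zone as
 * min(ephemeral port range / 2, maxsockets / 5).
 */
static int
tw_auto_limit(int port_first, int port_last, int maxsockets)
{
	int by_ports, by_sockets;

	by_ports = (port_last - port_first) / 2;	/* half the port range */
	by_sockets = maxsockets / 5;			/* previous rule */
	return (by_ports < by_sockets ? by_ports : by_sockets);
}
```

With a very large maxsockets value the port-range term becomes the limit,
which is the case the change is meant to protect.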


162739 28-Sep-2006 andre

When tcp_output() receives an error upon sending a packet it reverts parts
of its internal state to ignore the failed send and try again a bit later.
If the error is EPERM the packet got blocked by the local firewall and the
revert may cause the session to get stuck and retry indefinitely. This way
we treat it like a packet loss and let the retransmit timer and timeouts
do their work over time.

The correct behavior is to drop a connection that gets an EPERM error.
However this _may_ introduce some POLA problems and a two commit approach
was chosen.

Discussed with: glebius
PR: kern/25986
PR: kern/102653


162725 28-Sep-2006 andre

When doing TSO correctly do the check to prevent a maximum sized IP packet
from overflowing.


162719 28-Sep-2006 bms

Fix the IPv4 multicast routing detach path. On interface detach whilst
the MROUTER is running, the system would panic as described in the PR.

The fix in the PR is a good start, however, the other state associated
with the multicast forwarding cache has to be freed in order to avoid
leaking memory and other possible panics.

More care and attention is needed in this area.

PR: kern/82882
MFC after: 1 week


162718 28-Sep-2006 bms

The IPv4 code should clean up multicast group state when an interface
goes away. Without this change, it leaks in_multi (and often ether_multi
state) if many clonable interfaces are created and destroyed in quick
succession.

The concept of this fix is borrowed from KAME. Detailed information about
this behaviour, as well as test cases, are available in the PR.

PR: kern/78227
MFC after: 1 week


162685 27-Sep-2006 piso

Compilation.


162674 26-Sep-2006 piso

Summer of Code 2005: improve libalias - part 1 of 2

With the first part of my previous Summer of Code work, we get:

-made libalias modular:

-support for 'particular' protocols (like ftp/irc/etcetc) is no more
hardcoded inside libalias, but it's available through external
modules loadable at runtime

-modules are available both in kernel (/boot/kernel/alias_*.ko) and
user land (/lib/libalias_*)

-protocols/applications modularized are: cuseeme, ftp, irc, nbt, pptp,
skinny and smedia

-added logging support for kernel side

-cleanup

After a buildworld, do a 'mergemaster -i' to install the file libalias.conf
in /etc or manually copy it.

During startup (and after every HUP signal) user land applications running
the new libalias will try to read a file in /etc called libalias.conf:
that file contains the list of modules to load.

User land applications affected by this commit are ppp and natd:
if libalias.conf is present in /etc you won't notice any difference.

The only kernel land bit affected by this commit is ng_nat:
if you are using ng_nat, and it doesn't correctly handle
ftp/irc/etcetc sessions anymore, remember to kldload
the correspondent module (i.e. kldload alias_ftp).

General information and details about the inner working are available
in the libalias man page under the section 'MODULAR ARCHITECTURE
(AND ipfw(4) SUPPORT)'.

NOTA BENE: this commit affects _ONLY_ libalias, ipfw in-kernel nat
support will be part of the next libalias-related commit.

Approved by: glebius
Reviewed by: glebius, ru


162642 26-Sep-2006 jmg

fix calculating to_tsecr... This prevents the rtt calculations from
going all wonky...


162627 25-Sep-2006 bms

Fix an incompatibility between CARP and IPv4 multicast routing, whereby
the VRRPv2 advertisements will originate from the wrong source address.
This only affects kernels compiled with MROUTING and after the MRT_INIT
ioctl() has been issued.
Set imo_multicast_vif in carp's softc to the invalid value -1 after it is
zeroed by softc allocation, to stop the ip_output() path looking up the
incorrect source address thinking a vif is set.

PR: kern/100532
Submitted by: Bohus Plucinsky
MFC after: 1 week


162625 25-Sep-2006 bms

Spleling

Submitted by: pjd


162615 25-Sep-2006 bms

Account for output IP datagrams on the ifaddr where they originated from,
*not* the first ifaddr on the ifp. This is similar to what NetBSD does.

PR: kern/72936
Submitted by: alfred
Reviewed by: andre


162612 25-Sep-2006 jmg

if min is greater than max, prefer max over min... I managed to get a
retransmit timer that was going to take 19 days to trigger...

Reviewed by: silby
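
The "prefer max over min" behaviour can be sketched as a clamp whose upper
bound is applied last. range_set() is a hypothetical function written for
this example, not the kernel macro that the commit changed.

```c
/*
 * Hypothetical clamp: when tvmin > tvmax (a misconfiguration), the upper
 * bound is applied last and therefore wins, so the timer value can never
 * exceed tvmax.
 */
static int
range_set(int value, int tvmin, int tvmax)
{
	int tv = value;

	if (tv < tvmin)
		tv = tvmin;
	if (tv > tvmax)		/* checked last: max takes priority */
		tv = tvmax;
	return (tv);
}
```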


162586 23-Sep-2006 jmg

now that we don't automagically increase the MTU of host routes, when we copy
the loopback interface, copy its mtu also.. This means that we again have
large mtu support for local ip addresses...


162580 23-Sep-2006 bms

Always set the IP version in the TCP input path, to preserve
the header field for possible later IPSEC SPD lookup, even
when the kernel is built without 'options INET6'.

PR: kern/57760
MFC after: 1 week
Submitted by: Joachim Schueth


162376 17-Sep-2006 andre

Make tcp_usr_send() free the passed mbufs on error in all cases as the
comment to it claims.

Sponsored by: TCP/IP Optimization Fundraise 2005


162351 16-Sep-2006 jhay

Handle a list of IPv6 src and dst addresses correctly, eg.
ipfw add allow ip6 from any to 2000::/16,2002::/16

PR: 102422 (part 3)
Submitted by: Andrey V. Elsukov <bu7cher at yandex dot ru>
MFC after: 5 days


162325 15-Sep-2006 andre

When doing TSO subtract hdrlen from TCP_MAXWIN to prevent ip->ip_len
from wrapping when we generate a maximally sized packet for later
segmentation.

Noticed by: gallatin
Sponsored by: TCP/IP Optimization Fundraise 2005
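
The wrap being avoided here comes from ip_len being a 16-bit field. A
minimal sketch of the cap, assuming a maximum window of 65535 bytes;
TCP_MAXWIN_SKETCH and tso_cap_len() are names invented for this example.

```c
#include <stdint.h>

#define	TCP_MAXWIN_SKETCH	65535u	/* assumed value for illustration */

/*
 * Hypothetical helper: cap the payload length so that payload plus
 * headers still fits in a 16-bit IP length field.
 */
static uint32_t
tso_cap_len(uint32_t len, uint32_t hdrlen)
{
	uint32_t limit = TCP_MAXWIN_SKETCH - hdrlen;

	return (len > limit ? limit : len);
}
```

Without the subtraction, a 65535-byte payload plus 40 bytes of headers
would be truncated to 39 when stored in a 16-bit field.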


162306 14-Sep-2006 ache

Add missing #ifdef INET6 (can't be compiled)


162278 13-Sep-2006 andre

Remove unnecessary includes and follow common ordering style.


162277 13-Sep-2006 andre

Rewrite of TCP syncookies to remove locking requirements and to enhance
functionality:

- Remove a rwlock acquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for, and stored per, syncache bucket row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.

This implementation extends the original idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.

The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.

A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.

Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


162238 12-Sep-2006 csjp

Introduce a new entry point, mac_create_mbuf_from_firewall. This entry point
exists to allow the mandatory access control policy to properly initialize
mbufs generated by the firewall. An example where this might happen is keep
alive packets, or ICMP error packets in response to other packets.

This takes care of kernel panics associated with uninitialized mbuf labels
when the firewall generates packets.

[1] I modified this patch from its original version, the initial patch
introduced a number of entry points which were programmatically
equivalent. So I introduced only one. Instead, we should leverage
mac_create_mbuf_netlayer() which is used for similar situations,
an example being icmp_error()

This will minimize the impact associated with the MFC

Submitted by: mlaier [1]
MFC after: 1 week

This is a RELENG_6 candidate


162231 11-Sep-2006 andre

Fix a NULL pointer dereference of ro->ro_rt->rt_flags by checking for the
validity of ro->ro_rt first. This prevents crashing on any non-normally
routed IP packet.

Coverity CID: 162 (incorrectly, it was re-introduced by previous commit)


162205 10-Sep-2006 jmg

make use of the host route's mtu for processing. This means we can now
support a network w/ split mtu's by assigning each host route the correct
mtu. an aspiring programmer could write a daemon to probe hosts and find
out if they support a larger mtu.


162151 08-Sep-2006 glebius

Add a sysctl net.inet.tcp.nolocaltimewait that allows suppressing the
creation of compressed TIME_WAIT states, if both connection endpoints
are local. Default is off.


162111 07-Sep-2006 ru

Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that it has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).


162110 07-Sep-2006 andre

Second step of TSO (TCP segmentation offload) support in our network stack.

TSO is only used if we are in a pure bulk sending state. The presence of
TCP-MD5, SACK retransmits, SACK advertisements, IPSEC and IP options prevent
using TSO. With TSO the TCP header is the same (except for the sequence number)
for all generated packets. This makes it impossible to transmit any options
which vary per generated segment or packet.

The length of TSO bursts is limited to TCP_MAXWIN.

The sysctl net.inet.tcp.tso globally controls the use of TSO and is enabled.

TSO enabled sends originating from tcp_output() have the CSUM_TCP and CSUM_TSO
flags set, m_pkthdr.csum_data filled with the header pseudo-checksum and
m_pkthdr.tso_segsz set to the segment size (net payload size, not counting
IP+TCP headers or TCP options).

IPv6 currently lacks a pseudo-header checksum function and thus doesn't support
TSO yet.

Tested by: Jack Vogel <jfvogel-at-gmail.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


162108 07-Sep-2006 ru

Remove a microoptimization for i386 that was a micropessimization for amd64.


162084 06-Sep-2006 andre

First step of TSO (TCP segmentation offload) support in our network stack.

o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005


162071 06-Sep-2006 andre

Check inp_flags instead of inp_vflag for INP_ONESBCAST flag.

PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


162068 06-Sep-2006 andre

Fix the socket option IP_ONESBCAST by giving it its own case in ip_output()
and skip over the normal IP processing.

Add a supporting function ifa_ifwithbroadaddr() to verify and validate the
supplied subnet broadcast address.

PR: kern/99558
Tested by: Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


162064 06-Sep-2006 glebius

o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
badly under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No purging of stale entries should be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.

Reviewed by: ru


162035 05-Sep-2006 glebius

Finally fix rev. 1.256

Pointy hat to: glebius


162033 05-Sep-2006 glebius

Remove extra parenthesis in last commit.

Nitpicked by: ru


162031 05-Sep-2006 glebius

- Make net.inet.tcp.maxtcptw modifiable at run time.
- If net.inet.tcp.maxtcptw was ever set explicitly, do
not change it if kern.ipc.maxsockets is changed.


161974 04-Sep-2006 thomas

Fix typo in comment.


161767 31-Aug-2006 jhay

Recognise IPv6 PIM packets.

MFC after: 1 week


161645 26-Aug-2006 mohans

Fix for a bug that causes the computation of "len" in tcp_output() to
get messed up, resulting in an inconsistency between the TCP state
and so_snd.


161456 18-Aug-2006 julian

comply with style police

Submitted by: ru
MFC after: 1 month


161424 17-Aug-2006 julian

Allow ipfw to forward to a destination that is specified by a table.
for example:
fwd tablearg ip from any to table(1)
where table 1 has entries of the form:
1.1.1.0/24 10.2.3.4
208.23.2.0/24 router2

This allows trivial implementation of a secondary routing table implemented
in the firewall layer.

I expect more work (under discussion with Glebius) to follow this to clean
up some of the messy parts of ipfw related to tables.

Reviewed by: Glebius
MFC after: 1 month


161380 17-Aug-2006 julian

Remove the IPFIREWALL_FORWARD_EXTENDED option and make it on by default as it always was
in older versions of FreeBSD. This option is pointless as it is needed in just
about every interesting usage of forward that I have ever seen. It doesn't make
the system any safer and just wastes huge amounts of developer time
when the system doesn't behave as expected when code is moved from
4.x to 6.x or 7.x.
Reviewed by: glebius
MFC after: 1 week


161226 11-Aug-2006 mohans

Fixes an edge case bug in timewait handling: when ticks rolls over so that
the timewait expiry is exactly 0, the timewait queues (and that entry) get corrupted.
Reviewed by: silby


160981 04-Aug-2006 brooks

With exception of the if_name() macro, all definitions in net_osdep.h
were unused or already in if_var.h so add if_name() to if_var.h and
remove net_osdep.h along with all references to it.

Longer term we may want to kill off if_name() entirely since all modern
BSDs have if_xname variables rendering it unnecessary.


160966 04-Aug-2006 oleg

Remove useless NULL pointer check: we are using M_WAITOK flag for memory
allocation.

Submitted by: Andrey Elsukov <bu7cher at yandex dot ru>
Approved by: glebius (mentor)
MFC after: 1 week


160925 02-Aug-2006 rwatson

Move soisdisconnected() in tcp_discardcb() to one of its calling contexts,
tcp_twstart(), but not to the other, tcp_detach(), as the socket is
already being torn down and therefore there are no listeners. This avoids
a panic if kqueue state is registered on the socket at close(), and
eliminates two XXX comments. There is one case remaining in which
tcp_discardcb() reaches up to the socket layer as part of the TCP host
cache, which would be good to avoid.

Reported by: Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>


160920 02-Aug-2006 oleg

Do not leak memory while flushing rules.

Noticed by: yar
Approved by: glebius (mentor)
MFC after: 1 week


160549 21-Jul-2006 rwatson

Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket. pru_abort is now a
notification of close also, and no longer detaches. pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket. This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach if the protocol retained a reference.

This facilitates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree(). With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by: gnn


160491 18-Jul-2006 ups

Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by: rwatson@, mohans@
MFC after: 1 week


160195 09-Jul-2006 sam

Revise network interface cloning to take an optional opaque
parameter that can specify configuration parameters:
o rev cloner api's to add optional parameter block
o add SIOCCREATE2 that accepts parameter data
o rev vlan support to use new api (maintain old code)

Reviewed by: arch@


160164 08-Jul-2006 mlaier

Make in-kernel multicast protocols for pfsync and carp work after enabling
dynamic resizing of multicast membership array.

Reported and testing by: Maxim Konovalov, Scott Ullrich
Reminded by: thompsa
MFC after: 2 weeks


160134 06-Jul-2006 rwatson

Remove unneeded mac.h include.

MFC after: 3 days


160123 05-Jul-2006 oleg

Complete timebase (time_second -> time_uptime) conversion.

PR: kern/94249
Reviewed by: andre (few months ago)
Approved by: glebius (mentor)


160097 04-Jul-2006 maxim

o Kill BUGS section as it is not valid since rev. 1.4 alias_pptp.c.

Spotted by: ru.unix.bsd activists
MFC after: 1 week


160038 29-Jun-2006 yar

There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures. Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.

Suggested by: glebius, jhb, ru


160032 29-Jun-2006 yar

Use TAILQ_FOREACH consistently.


160027 29-Jun-2006 glebius

Fix URL to Bellovin's paper.

Submitted by: Anton Yuzhaninov <citrin rambler-co.ru>


160025 29-Jun-2006 bz

Eliminate the offset argument from send_reject. It's not been
used since FreeBSD-SA-06:04.ipfw.
Adapt send_reject6 to what had been done for legacy IP: no longer
send or permit sending rejects for any but the first fragment.

Discussed with: oleg, csjp (some weeks ago)


160024 29-Jun-2006 bz

Use INPLOOKUP_WILDCARD instead of just 1 more consistently.

OKed by: rwatson (some weeks ago)


159976 27-Jun-2006 pjd

- Use suser_cred(9) instead of directly checking cr_uid.
- Change the order of conditions to first verify that we actually need
to check for privileges and then eventually check them.

Reviewed by: rwatson


159955 26-Jun-2006 andre

In syncache_respond() do not reply with a MSS that is larger than what
the peer announced to us but make it at least tcp_minmss in size.

Sponsored by: TCP/IP Optimization Fundraise 2005
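
The clamping rule described in r159955 can be sketched in a few lines of C. This is only an illustration under stated assumptions: `clamp_reply_mss()` and its parameter names are hypothetical and are not the kernel's code; only the rule itself (never exceed the peer's announced MSS, never go below the minimum) comes from the commit message.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel function: pick the MSS to put in a
 * SYN|ACK given our own MSS, the MSS the peer announced, and a floor
 * (tcp_minmss in the commit message).
 */
static uint16_t
clamp_reply_mss(uint16_t our_mss, uint16_t peer_mss, uint16_t min_mss)
{
    /* Never reply with more than the peer announced. */
    uint16_t mss = our_mss < peer_mss ? our_mss : peer_mss;

    /* ...but keep the result at least min_mss. */
    return mss < min_mss ? min_mss : mss;
}
```

Note that the floor wins when the peer announces a value below the minimum, which matches the wording "make it at least tcp_minmss in size".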


159950 26-Jun-2006 andre

Some cleanups and janitorial work to tcp_syncache:

o don't assign remote/local host/port information manually between provided
struct in_conninfo and struct syncache, bcopy() it instead
o rename sc_tsrecent to sc_tsreflect in struct syncache to better capture
the purpose of this field
o rename sc_request_r_scale to sc_requested_r_scale for ditto reasons
o fix IPSEC error case printf's to report correct function name
o in syncache_socket() only transpose enhanced tcp options parameters to
struct tcpcb when the inpcb doesn't have TF_NOOPT set
o in syncache_respond() reorder stack variables
o in syncache_respond() remove bogus KASSERT()

No functional changes.

Sponsored by: TCP/IP Optimization Fundraise 2005


159949 26-Jun-2006 andre

Some cleanups and janitorial work to tcp_dooptions():

o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
usage accordingly
o update the comments to the tcp_dooptions() invocation in
tcp_input():after_listen to reflect reality
o move the logic checking the echoed timestamp out of tcp_dooptions() to the
only place that uses it next to the invocation described in the previous
item
o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
o add comments in to struct tcpopt.to_flags #defines

No functional changes.

Sponsored by: TCP/IP Optimization Fundraise 2005


159945 26-Jun-2006 andre

Reverse the source/destination parameters to in[6]_pcblookup_hash() in
syncache_respond() for the #ifdef MAC case.

Submitted by: Tai-hwa Liang <avatar-at-mmlab.cse.yzu.edu.tw>


159944 26-Jun-2006 rwatson

In tcp6_usr_attach(), return immediately if SS_ISDISCONNECTED, to
avoid dereferencing an uninitialized inp variable.

Submitted by: Michiel Boland <michiel at boland dot org>
MFC after: 1 month


159922 25-Jun-2006 andre

Decrement the global syncache counter in syncache_expand() when the entry
is removed from the bucket. This fixes the syncache statistics.


159859 22-Jun-2006 andre

Move the syncookie MD5 context from globals to the stack to make it MP safe.


159857 22-Jun-2006 ume

- Pullup even when the extension header is unknown, to prevent
infinite loop with net.inet6.ip6.fw.deny_unknown_exthdrs=0.
- Teach ipv6 and ipencap as they appear in an IPv4/IPv6 over IPv6
tunnel.
- Test the next extension header even when the routing header type
is unknown with net.inet6.ip6.fw.deny_unknown_exthdrs=0.

Found by: xcast-fan-club
MFC after: 1 week


159787 20-Jun-2006 andre

Allocate a zero'ed syncache hashtable. mtx_init() tests the supplied
memory location for already existing/initialized mutexes. With random
data in the memory location this fails (ie. after a soft reboot).

Reported by: brueffer, YAMAMOTO Shigeru
Submitted by: YAMAMOTO Shigeru <shigeru-at-iij.ad.jp>
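
Why r159787 needs zeroed memory can be illustrated with a user-space sketch. Everything here is an assumption made for illustration: `struct bucket`, `LOCK_MAGIC`, `bucket_lock_init()` and `table_alloc()` are hypothetical stand-ins, and the "magic marker" check only mimics the idea that the lock initializer rejects memory that already looks like an initialized lock.

```c
#include <stddef.h>
#include <stdlib.h>

#define LOCK_MAGIC 0x4c4f434bu   /* hypothetical "already initialized" marker */

struct bucket {
    unsigned lock_magic;         /* stand-in for the per-bucket mutex */
    int      nentries;
};

/* Returns 0 on success, -1 if the memory already looks like a live lock. */
static int
bucket_lock_init(struct bucket *b)
{
    if (b->lock_magic == LOCK_MAGIC)
        return -1;
    b->lock_magic = LOCK_MAGIC;
    return 0;
}

/* Allocate a zeroed table so that stale bytes can never trip the check. */
static struct bucket *
table_alloc(size_t n)
{
    struct bucket *t = calloc(n, sizeof(*t));
    size_t i;

    if (t == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        if (bucket_lock_init(&t[i]) != 0) {
            free(t);
            return NULL;
        }
    }
    return t;
}
```

With an unzeroed allocation, leftover data could by chance resemble an initialized lock and make the initializer fail, which is the failure mode the commit message describes after a soft reboot.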


159772 19-Jun-2006 dwmalone

When we receive an out-of-window SYN for an "ESTABLISHED" connection,
ACK the SYN as required by RFC793, rather than ignoring it. NetBSD
have had a similar change since 1999.

PR: 93236
Submitted by: Grant Edwards <grante@visi.com>
MFC after: 1 month


159733 18-Jun-2006 andre

Remove T/TCP RFC1644 Connection Count comparison macros. They are no longer
used or needed.

Sponsored by: TCP/IP Optimization Fundraise 2005


159727 18-Jun-2006 andre

Do not access syncache entry before it was allocated for the TF_NOOPT case
in syncache_add().

Found by: Coverity Prevent
CID: 1473


159725 18-Jun-2006 andre

Move all syncache related structures to tcp_syncache.c. They are only used
there.

This unbreaks userland programs that include tcp_var.h.

Discussed with: rwatson


159722 18-Jun-2006 andre

Remove double lock acquisition in syncookie_lookup() which came from last
minute conversions to macros.

Pointy hat to: andre


159701 17-Jun-2006 andre

Fix the !INET6 compile.

Reported by: alc


159698 17-Jun-2006 andre

Rearrange fields in struct syncache and syncache_head to make them more
cache line friendly.

Sponsored by: TCP/IP Optimization Fundraise 2005


159697 17-Jun-2006 andre

ANSIfy and tidy up comments.

Sponsored by: TCP/IP Optimization Fundraise 2005


159695 17-Jun-2006 andre

Add locking to TCP syncache and drop the global tcpinfo lock as early
as possible for the syncache_add() case. The syncache timer no longer
acquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.

On a P4 the additional locks cause a slight degradation of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.

Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005


159636 15-Jun-2006 oleg

Add support of 'tablearg' feature for:
- 'tag' & 'untag' action parameters.
- 'tagged' & 'limit' rule options.
Rule examples:
pipe 1 tag tablearg ip from table(1) to any
allow ip from any to table(2) tagged tablearg
allow tcp from table(3) to any 25 setup limit src-addr tablearg

sbin/ipfw/ipfw2.c:
1) new macros
GET_UINT_ARG - support of 'tablearg' keyword, argument range checking.
PRINT_UINT_ARG - support of 'tablearg' keyword.
2) strtoport(): do not silently truncate/accept invalid port list expressions
like: '1,2-abc' or '1,2-3-4' or '1,2-3x4'. style(9) cleanup.

Approved by: glebius (mentor)
MFC after: 1 month


159635 15-Jun-2006 oleg

install_state(): style(9) cleanup

Approved by: glebius (mentor)
MFC after: 1 month


159448 09-Jun-2006 thompsa

Enable proxy ARP answers on any of the bridged interfaces if proxy record
belongs to another interface within the bridge group.

PR: kern/94408
Submitted by: Eygene A. Ryabinkin
MFC after: 1 month


159398 08-Jun-2006 oleg

install_state() should properly initialize 'addr_type' field of newly created
flows for O_LIMIT rules. Otherwise 'ipfw -d show' is unable to display
PARENT rules properly.
(This bug was exposed by ipfw2.c rev.1.90)

Approved by: glebius (mentor)
MFC after: 2 weeks


159397 08-Jun-2006 oleg

Fix following rules: pipe X (tag|altq) Y ...

Approved by: glebius (mentor)
MFC after: 2 weeks


159218 04-Jun-2006 rwatson

Push acquisition of pcbinfo lock out of tcp_usr_attach() into
tcp_attach() after the call to soreserve(), as it doesn't require
the global lock. Rearrange inpcb locking here also.

MFC after: 1 month


159199 03-Jun-2006 rwatson

When entering a timer on a tcpcb, don't continue processing if it has been
dropped. This prevents a bug introduced during the socket/pcb refcounting
work from occurring, in which the retransmit timer may occasionally fire
after a connection has been reset, causing the resulting R|A TCP
packet to have a source port of 0, as the port reservation has been
released.

While here, fix up some RUNLOCK->WUNLOCK bugs.

MFC after: 1 month


159198 03-Jun-2006 rwatson

Acquire udbinfo lock after call to soreserve() rather than before, as it
is not required. This simplifies error-handling, and reduces the time
that this lock is held.

MFC after: 1 month


159180 02-Jun-2006 csjp

Fix the following bpf(4) race condition which can result in a panic:

(1) bpf peer attaches to interface netif0
(2) Packet is received by netif0
(3) ifp->if_bpf pointer is checked and handed off to bpf
(4) bpf peer detaches from netif0 resulting in ifp->if_bpf being
initialized to NULL.
(5) ifp->if_bpf is dereferenced by bpf machinery
(6) Kaboom

This race condition likely explains the various different kernel panics
reported around sending SIGINT to tcpdump or dhclient processes. But really
this race can result in kernel panics anywhere you have frequent bpf attach
and detach operations with high packet per second load.

Summary of changes:

- Remove the bpf interface's "driverp" member
- When we attach bpf interfaces, we now set the ifp->if_bpf member to the
bpf interface structure. Once this is done, ifp->if_bpf should never be
NULL. [1]
- Introduce bpf_peers_present function, an inline operation which will do
a lockless read of the bpf peer list associated with the interface. It should
be noted that the bpf code will pickup the bpf_interface lock before adding
or removing bpf peers. This should serialize the access to the bpf descriptor
list, removing the race.
- Expose the bpf_if structure in bpf.h so that the bpf_peers_present function
can use it. This also removes the struct bpf_if; hack that was there.
- Adjust all consumers of the raw if_bpf structure to use bpf_peers_present

Now what happens is:

(1) Packet is received by netif0
(2) Check to see if bpf descriptor list is empty
(3) Pickup the bpf interface lock
(4) Hand packet off to process

From the attach/detach side:

(1) Pickup the bpf interface lock
(2) Add/remove from bpf descriptor list

Now that we are storing the bpf interface structure with the ifnet, there is
no need to walk the bpf interface list to locate the correct bpf interface.
We now simply look up the interface, and initialize the pointer. This has a
nice side effect of changing a bpf interface attach operation from O(N) (where
N is the number of bpf interfaces), to O(1).

[1] From now on, we can no longer check ifp->if_bpf to tell us whether or
not we have any bpf peers that might be interested in receiving packets.

In collaboration with: sam@
MFC after: 1 month
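
The receive-side pattern from r159180 (test whether the descriptor list is empty instead of testing the interface pointer for NULL) can be sketched as below. This is a single-threaded illustration under assumptions: `struct bpf_if_sketch`, `peers_present()` and `tap()` are hypothetical names, and the place where the real code would take the bpf interface lock is only marked by a comment.

```c
#include <stdbool.h>
#include <stddef.h>

struct peer {
    struct peer *next;
};

/* Stand-in for the per-interface bpf structure, which is never NULL once attached. */
struct bpf_if_sketch {
    struct peer *peers;      /* descriptor list; empty means nobody is listening */
    int          delivered;  /* counts packets handed to peers, for the example */
};

/* Cheap unlocked check: is anyone interested in packets on this interface? */
static bool
peers_present(const struct bpf_if_sketch *bp)
{
    return bp->peers != NULL;
}

static void
tap(struct bpf_if_sketch *bp)
{
    const struct peer *p;

    if (!peers_present(bp))
        return;
    /* The real code would pick up the bpf interface lock here. */
    for (p = bp->peers; p != NULL; p = p->next)
        bp->delivered++;
    /* ...and drop it here. */
}
```

The point of the design is that the structure holding the list outlives any single peer, so a detach can empty the list but cannot pull the structure out from under a packet in flight.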


159163 02-Jun-2006 rwatson

Minor restyling and cleanup around ipport_tick().

MFC after: 1 month


158879 24-May-2006 oleg

Implement internal (i.e. inside kernel) packet tagging using mbuf_tags(9).
Since tags are kept while packet resides in kernelspace, it's possible to
use other kernel facilities (like netgraph nodes) for altering those tags.

Submitted by: Andrey Elsukov <bu7cher at yandex dot ru>
Submitted by: Vadim Goncharov <vadimnuclight at tpu dot ru>
Approved by: glebius (mentor)
Idea from: OpenBSD PF
MFC after: 1 month


158800 21-May-2006 maxim

o In udp|rip_disconnect() acquire a socket lock before the socket
state modification. To prevent races do that while holding inpcb
lock.

Reviewed by: rwatson


158799 21-May-2006 maxim

o Add missed error check: in ip_ctloutput() sooptcopyin() returns a
result but we never examine it.

Reviewed by: rwatson
MFC after: 2 weeks


158729 18-May-2006 bms

Initialize the new members of struct ip_moptions as
a defensive programming measure.

Note that whilst these members are not used by the ip_output()
path, we are passing an instance of struct ip_moptions here
which is declared on the stack (which could be considered a
bad thing).

ip_output() does not consume struct ip_moptions, but in case it
does in future, declare an in_multi vector on the stack too to
behave more like ip_findmoptions() does.


158645 16-May-2006 glebius

Since m_pullup() can return a new mbuf, change gre_input2() to
return mbuf back to gre_input(). If the former returns mbuf back
to the latter, then pass it to raw_input().

Coverity ID: 829
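
The calling convention adopted in r158645 can be sketched in user space. The names below (`struct pkt`, `pkt_pullup()`, `inner_input()`, `outer_input()`, `fallback_calls`) are hypothetical; the only behavior taken from the source is that a pullup may hand back a different buffer, so the inner routine must return the current pointer to its caller, which then passes it on.

```c
#include <stddef.h>
#include <stdlib.h>

struct pkt {
    size_t contig;   /* bytes available contiguously at the front */
};

static int fallback_calls;   /* counts hand-offs to the fallback path */

/*
 * Ensure `need` contiguous bytes. May return a different packet than it
 * was given; on failure the old packet is freed and NULL is returned.
 */
static struct pkt *
pkt_pullup(struct pkt *p, size_t need)
{
    struct pkt *np;

    if (p->contig >= need)
        return p;
    np = malloc(sizeof(*np));
    free(p);                 /* the old pointer is invalid either way */
    if (np == NULL)
        return NULL;
    np->contig = need;
    return np;
}

/* Returns NULL when the packet was consumed, else the packet to pass on. */
static struct pkt *
inner_input(struct pkt *p, size_t hdrlen, int accept)
{
    p = pkt_pullup(p, hdrlen);
    if (p == NULL)
        return NULL;
    if (accept) {
        free(p);             /* consumed here */
        return NULL;
    }
    return p;                /* not ours: hand back the current pointer */
}

static void
outer_input(struct pkt *p, size_t hdrlen, int accept)
{
    p = inner_input(p, hdrlen, accept);   /* never reuse the old pointer */
    if (p != NULL) {
        fallback_calls++;                 /* stand-in for the raw input path */
        free(p);
    }
}
```

If `outer_input()` kept using its original argument after the call, it would touch freed memory whenever the pullup had replaced the buffer.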


158644 16-May-2006 glebius

- Backout one line from 1.78. The tp can be freed by tcp_drop().
- Style next line.

Coverity ID: 912


158588 15-May-2006 maxim

o In rip_disconnect() do not call rip_abort(), just mark a socket
as not connected. In soclose() case rip_detach() will kill inpcb for
us later.

This keeps the rawconnect regression test from panicking the system.

Reviewed by: rwatson
X-MFC after: with all 1st April inpcb changes


158580 14-May-2006 mlaier

Use only lower 64bit of src/dest (and src/dest port) for hashing of IPv6
connections and get rid of the flow_id, as it is not guaranteed to be stable;
some (most?) current implementations seem to just zero it out.

PR: kern/88664
Reported by: jylefort
Submitted by: Joost Bekkers (w/ changes)
Tested by: "regisr" <regisrApoboxDcom>
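
The hashing change in r158580 can be illustrated as follows. This is a sketch under assumptions: `struct flow6` and `hash_flow6()` are hypothetical, and the XOR mixing is a placeholder rather than the hash the firewall uses. What it demonstrates is the property from the commit message: only the lower 64 bits of each address and the ports feed the hash, and the flow_id is ignored.

```c
#include <stdint.h>
#include <string.h>

struct flow6 {
    uint8_t  src[16];
    uint8_t  dst[16];
    uint16_t sport;
    uint16_t dport;
    uint32_t flow_id;    /* deliberately not hashed */
};

static uint32_t
hash_flow6(const struct flow6 *f)
{
    uint32_t w[4];

    /* Bytes 8..15 are the lower 64 bits of each address. */
    memcpy(&w[0], f->src + 8, 8);
    memcpy(&w[2], f->dst + 8, 8);

    /* Placeholder mixing; the real function is not shown in the log. */
    return w[0] ^ w[1] ^ w[2] ^ w[3] ^ f->sport ^ f->dport;
}
```

Two packets of the same connection therefore land in the same bucket even if an implementation along the path rewrites or zeroes the flow label.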


158563 14-May-2006 bms

Fix a long-standing limitation in IPv4 multicast group membership.

By making the imo_membership array a dynamically allocated vector,
this minimizes disruption to existing IPv4 multicast code. This
change breaks the ABI for the kernel module ip_mroute.ko, and may
cause a small amount of churn for folks working on the IGMPv3 merge.

Previously, sockets were subject to a compile-time limitation on
the number of IPv4 group memberships, which was hard-coded to 20.
The imo_membership relationship, however, is 1:1 with regards to
a tuple of multicast group address and interface address. Users who
ran routing protocols such as OSPF ran into this limitation on machines
with a large system interface tree.
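
Replacing a fixed-size membership array with a dynamically allocated vector, as r158563 describes, can be sketched like this. The structure and function names are hypothetical, and the growth policy (start at 20 slots, the old compile-time limit, then double) is an assumption chosen for the example, not something the log states.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-socket multicast options. */
struct moptions_sketch {
    size_t    num;          /* memberships in use */
    size_t    max;          /* slots allocated */
    uint32_t *membership;   /* dynamically sized vector */
};

/* Record a membership, growing the vector when it is full. */
static int
moptions_join(struct moptions_sketch *m, uint32_t group)
{
    if (m->num == m->max) {
        size_t nmax = m->max != 0 ? m->max * 2 : 20;
        uint32_t *nv = realloc(m->membership, nmax * sizeof(*nv));

        if (nv == NULL)
            return -1;      /* old vector is still valid */
        m->membership = nv;
        m->max = nmax;
    }
    m->membership[m->num++] = group;
    return 0;
}
```

Because the structure now holds a pointer and a capacity instead of an inline array, its layout changes, which is consistent with the commit's warning about the ip_mroute.ko ABI.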


158500 12-May-2006 mlaier

Remove ip6fw. Since ipfw has full functional IPv6 support now and - in
contrast to ip6fw - is properly locked, it is time to retire ip6fw.


158470 12-May-2006 mlaier

Reintroduce net.inet6.ip6.fw.enable sysctl to dis/enable the ipv6 processing
separately. Also use pfil hook/unhook instead of keeping the check
functions in pfil just to return there based on the sysctl. While here fix
some whitespace on a nearby SYSCTL_ macro.


158433 11-May-2006 mlaier

Don't claim "(+ipv6)" if we didn't build with INET6.


158332 06-May-2006 rwatson

Modify UDP to use sosend_dgram() instead of sosend(). This allows
for significantly optimized UDP socket I/O when using a single UDP
socket from many threads or processes that share it, by avoiding
significant locking and other overhead in the general sosend()
path that isn't necessary for simple datagram sockets. Specifically,
this change results in a significant performance improvement for
threaded name service in BIND9 under load.

Suggested by: Jinmei_Tatsuya at isc dot org


158305 05-May-2006 bz

Make sure the ip data pointer is correct before touching it again
after ipsec4_output processing else KAME IPSec using the handbook
configuration with gif(4) will panic the kernel.

Problem reported by: t. patterson <tp lot.org>
Tested by: t. patterson <tp lot.org>


158304 05-May-2006 rwatson

Only return (tw) from tcp_twclose() if reuse is passed, otherwise
return NULL. In principle this shouldn't change the behavior, but
avoids returning a potentially invalid/inappropriate pointer to
the caller.

Found with: Coverity Prevent (tm)
Submitted by: pjd
MFC after: 3 months


158302 05-May-2006 pjd

/tmp/cvsTXPIwQ


158036 25-Apr-2006 marcel

In in_pcbdrop(), fix !INVARIANTS build.


158021 25-Apr-2006 rwatson

Rename 'last' to 'inp' in udp_append(): the name 'last' is due to
the fact that the loop through inpcb's in udp_input() tracks the
last inpcb while looping. We keep that name in the calling loop
but not in the delivery routine itself.

MFC after: 3 months


158009 25-Apr-2006 rwatson

Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop(). Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after: 3 months


157993 24-Apr-2006 rwatson

Instead of calling tcp_usr_detach() from tcp_usr_abort(), break out
common pcb tear-down logic into tcp_detach(), which is called from
either. Invoke tcp_drop() from the tcp_usr_abort() path rather than
tcp_disconnect(), as we want to drop it immediately not perform a
FIN sequence. This is one reason why some people were experiencing
panics in sodealloc(), as the netisr and aborting thread were
simultaneously trying to tear down the socket. This bug could often
be reproduced using repeated runs of the listenclose regression test.

MFC after: 3 months
PR: 96090
Reported by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris
Tested by: Peter Kostouros <kpeter at melbpc dot org dot au>, kris


157977 23-Apr-2006 rwatson

Replace isn_mtx direct use with ISN_*() lock macros so that locking
details/strategy can be changed without touching every use.

MFC after: 3 months


157967 22-Apr-2006 rwatson

Introduce a new TCP mutex, isn_mtx, which protects the initial sequence
number state, rather than re-using pcbinfo. This introduces some
additional mutex operations during isn query, but avoids hitting the TCP
pcbinfo lock out of yet another frequently firing TCP timer.

MFC after: 3 months


157966 22-Apr-2006 rwatson

Assert the inpcb lock when rehashing an inpcb.

Improve consistency of style around some current assertions.

MFC after: 3 months


157965 22-Apr-2006 rwatson

Remove pcbinfo locking from in_setsockaddr() and in_setpeeraddr();
holding the inpcb lock is sufficient to prevent races in reading
the address and port, as both the inpcb lock and pcbinfo lock are
required to change the address/port.

Improve consistency of spelling in assertions about inp != NULL.

MFC after: 3 months


157927 21-Apr-2006 ps

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


157833 18-Apr-2006 glebius

Merge rev. 1.240 of ip_output.c, so that IPFIREWALL_FORWARD_EXTENDED
kernel option will affect both forwarding methods - classic and fast.


157609 09-Apr-2006 rwatson

Modify tcp_timewait() to accept an inpcb reference, not a tcptw
reference. For now, we allow the possibility that the in_ppcb
pointer in the inpcb may be NULL if a timewait socket has had its
tcptw structure recycled. This allows tcp_timewait() to
consistently unlock the inpcb.

Reported by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


157569 06-Apr-2006 mohans

Eliminate debug code that catches bugs in the hinting of sack variables
(tcp_sack_output_debug checks cached hints against computed values by walking the
scoreboard and reports discrepancies). The sack hinting code has been stable for
many months now so it is time for the debug code to go. Leaving tcp_sack_output_debug
ifdef'ed out in case we need to resurrect it at a later point.


157534 05-Apr-2006 rwatson

Don't unlock a timewait structure if the pointer is NULL in
tcp_timewait(). This corrects a bug (or lack of fixing of a bug)
in tcp_input.c:1.295.

Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


157526 05-Apr-2006 mohans

Certain (bad) values of sack blocks can end up corrupting the sack scoreboard.
Make the checks in tcp_sack_doack() more robust to prevent this.

Submitted by: Raja Mukerji (raja@mukerji.com)
Reviewed by: Mohan Srinivasan


157478 04-Apr-2006 glebius

Add a tunable net.inet.tcp.maxtcptw, that allows to set a limit
on tcptw zone independently from setting a limit on socket zone.


157474 04-Apr-2006 rwatson

Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL. We currently do allow this to happen, but may want to remove that
possibility in the future. This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled. This will
be cleaned up in the future.

Found by: Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after: 3 months


157433 03-Apr-2006 rwatson

In TCP notify routines, check inpcb for INP_TIMEWAIT and INP_DROPPED.
The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT
checks appear to have always been required, but not been there, which
is/was a bug. This avoids unconditionally casting of in_ppcb to a tcpcb,
when it may be a twtcb, which may have resulted in obscure ICMP-related
panics in earlier releases.

MFC after: 3 months


157432 03-Apr-2006 rwatson

Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign the result of intotcpcb() to tp during variable declaration
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be considered
safe.

MFC after: 3 months


157431 03-Apr-2006 rwatson

Style tweaks: convert to ANSI from K&R function prototypes.

MFC after: 3 months


157430 03-Apr-2006 rwatson

Update comment on tcp_close() for new world order.

MFC after: 3 months


157429 03-Apr-2006 rwatson

Clarify comment on handling of non-timewait TCP states in
tcp_usr_detach().

MFC after: 3 months


157427 03-Apr-2006 rwatson

Fix up locking surrounding tcp_drop sysctl: in the new world order, we
don't free inpcbs until after the socket is closed, so we always need
to unlock an inpcb after calling tcp_drop() on it.

MFC after: 3 months


157424 03-Apr-2006 rwatson

After checking for SS_ISDISCONNECTED in tcp_usr_accept(), return
immediately rather than jumping to the normal output handling, which
assumes we've pulled out the inpcb, which hasn't happened at this
point (and isn't necessary).

Return ECONNABORTED instead of EINVAL when the inpcb has entered
INP_TIMEWAIT or INP_DROPPED, as this is the documented error value.

This may correct the panic seen by Ganbold.

MFC after: 1 month
Reported by: Ganbold <ganbold at micom dot mng dot net>


157423 03-Apr-2006 rwatson

Correct incorrect assertion in div_bind(): inp must not be NULL here.

Reported by: tegge
MFC after: 3 months


157410 02-Apr-2006 rwatson

During reformulation of tcp_usr_detach(), the call to initiate TCP
disconnect for fully connected sockets was dropped, meaning that if
the socket was closed while the connection was alive, it would be
leaked. Structure tcp_usr_detach() so that there are two clear
parts: initiating disconnect, and reclaiming state, and reintroduce
the tcp_disconnect() call in the first part.

MFC after: 3 months


157386 01-Apr-2006 rwatson

Properly handle an edge case previously not handled correctly: a
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close(). In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).

MFC after: 3 months


157376 01-Apr-2006 rwatson

Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.

- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.

- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after: 3 months


157374 01-Apr-2006 rwatson

Update in_pcb-derived basic socket types following changes to
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, in protocol
shutdown methods, and in raw IP send.

- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.

- Invoke in_pcbfree() after in_pcbdetach() in order to free the
detached in_pcb structure for a socket.

MFC after: 3 months


157373 01-Apr-2006 rwatson

Break out in_pcbdetach() into two functions:

- in_pcbdetach(), which removes the link between an inpcb and its
socket.

- in_pcbfree(), which frees a detached pcb.

Unlike the previous in_pcbdetach(), neither of these functions will
attempt to conditionally free the socket, as they are responsible only
for managing in_pcb memory. Mirror these changes into in6_pcbdetach()
by breaking it into in6_pcbdetach() and in6_pcbfree().

While here, eliminate undesired checks for NULL inpcb pointers in
sockets, as we will now have as an invariant that sockets will always
have valid so_pcb pointers.

MFC after: 3 months


157370 01-Apr-2006 rwatson

Change protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail"; they either occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.

MFC after: 3 months


157366 01-Apr-2006 rwatson

Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.

MFC after: 3 months


157143 26-Mar-2006 rwatson

Define two new inpcb flags in the inp_vflag field, which for whatever
reason, seems to be where new flags are getting defined:

INP_DROPPED - The protocol has terminated this connection and the socket
is not reusable: when the socket code enters the protocol,
an error is immediately returned. This will substitute for
NULLing the so_pcb socket field, helping to implement the
invariant that all valid sockets have valid pcb's in TCP.

INP_SOCKREF - The protocol has become the owner of the socket reference,
and will need to free it when freeing the pcb, which will
be used when a TCP socket is closed but still has queued
data.

MFC after: 1 month
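
How the two flags from r157143 interact, as later described in r157376, can be sketched as below. The bit values, the `SK_` prefixed names, the structure, and the choice of ECONNRESET as the returned error are all assumptions for illustration; the log only says that an error is returned for a dropped connection and that a pcb holding the socket reference is released instead of being flagged as dropped.

```c
#include <errno.h>

#define SK_INP_DROPPED 0x01u   /* hypothetical bit value */
#define SK_INP_SOCKREF 0x02u   /* hypothetical bit value */

struct inpcb_sketch {
    unsigned flags;
};

/* Entry check for socket-layer calls: dropped connections fail at once. */
static int
proto_enter(const struct inpcb_sketch *inp)
{
    if (inp->flags & SK_INP_DROPPED)
        return ECONNRESET;   /* assumed error value */
    return 0;
}

/*
 * Called when the connection ends. Returns 1 if the protocol owns the
 * socket reference and must free pcb and socket now; returns 0 if the pcb
 * is only marked dropped and stays until the socket is closed.
 */
static int
conn_closed(struct inpcb_sketch *inp)
{
    if (inp->flags & SK_INP_SOCKREF)
        return 1;
    inp->flags |= SK_INP_DROPPED;
    return 0;
}
```

The flag replaces the older convention of setting so_pcb to NULL, so the socket always has a valid pcb to inspect.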


157142 26-Mar-2006 rwatson

Minor style tweak: tab after #define, not space.

MFC after: 1 month


157136 26-Mar-2006 rwatson

Explicitly assert socket pointer is non-NULL in tcp_input() so as to
provide better debugging information.

Prefer explicit comparison to NULL for tcpcb pointers rather than
treating them as booleans.

MFC after: 1 month


156947 21-Mar-2006 glebius

o Introduce carp_multicast_cleanup(), which removes and frees
multicast addresses from carp interface. [1]
o Rewrite carpdetach(), so that it does the following things: [1]
- Stops callouts.
- Decrements carp_suppress_preempt, if needed.
- Downs interface and sets CARP state to INIT.
- Calls carp_multicast_cleanup().
- Detaches softc from carp_if and, if we are the last, frees
the carp_if.
o Use new carpdetach() in carp_clone_destroy().
o In carp_ifdetach() acquire the carp_if lock and cleanup all
interfaces hanging on carp_if. [1]
o Make carp_ifdetach() static and use EVENT(9) to call it
from if_detach(). [2]
o In carp_setrun() exit if the softc doesn't have a valid pointer
to parent. [1]

Obtained from: OpenBSD [1]
Submitted by: Dan Lukes <dan obluda.cz> [2]
PR: kern/82908 [2]


156926 20-Mar-2006 keramida

Add descriptions for the sysctls:

net.inet.icmp.drop_redirect
net.inet.icmp.log_redirect
net.inet.icmp.icmplim
net.inet.icmp.icmplim_output

Approved & text by: andre


156877 19-Mar-2006 dwmalone

Make net.inet.ip.portrange.reservedhigh and
net.inet.ip.portrange.reservedlow apply to IPv6 as well as IPv4.

We could have made new sysctls for IPv6, but that potentially makes
things complicated for mapped addresses. This seems like the least
confusing option and least likely to cause obscure problems in the
future.

This change makes the mac_portacl module useful with IPv6 apps.

Reviewed by: ume
MFC after: 1 month


156763 16-Mar-2006 rwatson

Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.


156409 07-Mar-2006 thompsa

Further refine the bridge hack in the arp code. Only do the special arp
handling for interfaces which are actually in the bridge group, ignore all
others.

MFC after: 3 days


156240 03-Mar-2006 glebius

- Do not leak read lock in IP_FW_TABLE_GETSIZE case of ipfw_ctl().
- Acquire read (not write) lock in case of IP_FW_TABLE_LIST.

In collaboration with: ru


156125 28-Feb-2006 andre

Rework TCP window scaling (RFC1323) to properly scale the send window
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).

Further changes to this code to come. This is a first incremental step
to a general overhaul and streamlining of the TCP code.

PR: kern/15095
PR: kern/92690 (partly)
Reviewed by: qingli (and tested with ANVL)
Sponsored by: TCP/IP Optimization Fundraise 2005


155961 23-Feb-2006 qingli

This patch fixes the problem where the current TCP code can not handle
simultaneous open. Both the bug and the patch were verified using the
ANVL test suite.

PR: kern/74935
Submitted by: qingli (before I became committer)
Reviewed by: andre
MFC after: 5 days


155861 20-Feb-2006 ume

Obey opt_inet6.h in kernel build directory.

Reported by: Peter Losher <plosher-keyword-freebsd.a36e57__at__plosh.net>
MFC after: 3 days


155819 18-Feb-2006 andre

Remove unneeded includes and provide more accurate description
to others.

Submitted by: garys
PR: kern/86437


155817 18-Feb-2006 andre

Add missing TH_PUSH to the TH_FLAGS enumeration.

Submitted by: Andre Albsmeier <Andre.Albsmeier-at-siemens.com>
PR: kern/85203


155767 16-Feb-2006 andre

Have TCP Inflight disable itself if the RTT is below a certain
threshold. Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage. It defaults
to 10ms.

Tested by: Joao Barros <joao.barros-at-gmail.com>,
Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by: TCP/IP Optimization Fundraise 2005


155759 16-Feb-2006 andre

In in_pcbconnect_setup() reduce code duplication and use ip_rtaddr()
to find the outgoing interface for this connection.

Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 2 weeks


155758 16-Feb-2006 andre

Make sysctl_msec_to_ticks(SYSCTL_HANDLER_ARGS) generally available instead
of being private to tcp_timer.c.

Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


155659 14-Feb-2006 ru

When sending a packet from dummynet, indicate that we're forwarding
it so that ip_id etc. don't get overwritten. This fixes forwarding
of fragmented IP packets through a dummynet pipe -- fragments came
out with modified and different(!) ip_id's, making it impossible to
reassemble a datagram at the receiver side.

Submitted by: Alexander Karptsov (reworked by me)
MFC after: 3 days


155487 09-Feb-2006 qingli

Set the M_ZERO flag when calling uma_zalloc() to allocate a syncache entry.

Reviewed by: andre, glebius
MFC after: 3 days


155463 08-Feb-2006 qingli

Redo the previous fix by setting the UMA_ZONE_ZINIT bit in the syncache
zone, eliminating the need to call bzero() after each syncache entry
allocation.

Suggested by: glebius
Reviewed by: andre
MFC after: 3 days


155439 07-Feb-2006 qingli

Fix a crash caused by the memory of the newly allocated syncache entry
in syncache_lookup() not being cleared, which may lead to an arbitrary and
bogus rtentry pointer that later gets free'd.

Reviewed by: andre
MFC after: 3 days


155425 07-Feb-2006 oleg

Fix a five-year-old bug in ip_reass(): if we are using 'full' (i.e. including
pseudo header) hardware rx checksum offloading ip_reass() fails to calculate
TCP/UDP checksum for reassembled packet correctly. This also should fix
recent 'NFS over UDP over bge' issue exposed by if_bge.c rev. 1.123

Reviewed by: sam (earlier version), bde
Approved by: glebius (mentor)
MFC after: 2 weeks


155277 04-Feb-2006 ume

Never select the PCB that has INP_IPV6 flag and is bound to :: if
we have another PCB which is bound to 0.0.0.0. If a PCB has the
INP_IPV6 flag, then we set its cost higher than IPv4 only PCBs.

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME
MFC after: 1 week


155248 03-Feb-2006 glebius

Dropping the lock in the transmit_event() is not safe, because we
store some pipe pointers on stack. If user reconfigures dummynet
in the interlock gap, we can work with freed pipes after relock.

To fix this, we decided not to send packets in transmit_event(),
but fill a queue. At the end of dummynet() and dummynet_io(),
after the lock is dropped, if there is something in the queue
we run dummynet_send() to process the queue.

In collaboration with: ru
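
The deferral described above can be sketched in userland C. This is only an
illustration under stated assumptions, not the dummynet code: the names
transmit_event_sketch(), send_ready_sketch(), pipe_lock and struct pkt are
hypothetical, and a pthread mutex stands in for the kernel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
	int id;
};

static pthread_mutex_t pipe_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pkt *ready_head;	/* protected by pipe_lock */
static int sent_count;		/* touched only by the sending thread */

/* Stand-in for the real transmit path, which must run unlocked. */
static void
send_pkt(struct pkt *p)
{
	(void)p;
	sent_count++;
}

/* Called while scheduling: only queue the packet, never transmit. */
static void
transmit_event_sketch(struct pkt *p)
{
	assert(p != NULL);
	pthread_mutex_lock(&pipe_lock);
	p->next = ready_head;	/* LIFO for brevity; a real queue keeps order */
	ready_head = p;
	pthread_mutex_unlock(&pipe_lock);
}

/* Detach the whole queue under the lock, then send after unlocking. */
static int
send_ready_sketch(void)
{
	struct pkt *batch, *next;
	int n = 0;

	pthread_mutex_lock(&pipe_lock);
	batch = ready_head;
	ready_head = NULL;
	pthread_mutex_unlock(&pipe_lock);

	for (; batch != NULL; batch = next) {
		next = batch->next;
		send_pkt(batch);
		n++;
	}
	return (n);
}
```

Because nothing is transmitted while the lock is held, no pointer into
lock-protected state has to survive an unlock/relock gap.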


155245 03-Feb-2006 glebius

Axe unused function.


155221 02-Feb-2006 csjp

Use PFIL_HOOKED macros in if_bridge and pass the right argument to
rw_assert. This un-breaks the build.

Submitted by: Kostik Belousov
Pointy hat to: csjp


155201 02-Feb-2006 csjp

Somewhat re-factor the read/write locking mechanism associated with the packet
filtering mechanisms to use the new rwlock(9) locking API:

- Drop the variables stored in the pfil_head structure which were specific to
conditions and the home rolled read/write locking mechanism.
- Drop some includes which were used for condition variables
- Drop the inline functions, and convert them to macros. Also, move these
macros into pfil.h
- Move pfil list locking macros into pfil.h as well
- Rename ph_busy_count to ph_nhooks. This variable will represent the number
of IN/OUT hooks registered with the pfil head structure
- Define PFIL_HOOKED macro which evaluates to true if there are any
hooks to be run by pfil_run_hooks
- In the IP/IP6 stacks, change the ph_busy_count comparison to use the new
PFIL_HOOKED macro.
- Drop optimization in pfil_run_hooks which checks to see if there are any
hooks to be run, and returns if not. This check is already performed by the
IP stacks when they call:

if (!PFIL_HOOKED(ph))
goto skip_hooks;

- Drop in assertion which makes sure that the number of hooks never drops
below 0 for good measure. This in theory should never happen, and if it
does then there are problems somewhere
- Drop special logic around PFIL_WAITOK because rw_wlock(9) does not sleep
- Drop variables which support home rolled read/write locking mechanism from
the IPFW firewall chain structure.
- Swap out the read/write firewall chain lock internal to use the rwlock(9)
API instead of our home rolled version
- Convert the inlined functions to macros

Reviewed by: mlaier, andre, glebius
Thanks to: jhb for the new locking API


155179 01-Feb-2006 andre

Move the IPSEC related code blocks to their own file to unclutter
and significantly improve the readability of ip_input() and
ip_output() again.

The resulting IPSEC hooks in ip_input() and ip_output() may be
used later on for making IPSEC loadable.

This move is mostly mechanical and should preserve current IPSEC
behaviour as-is. Nothing shall prevent improvements in the way
IPSEC interacts with the IPv4 stack.

Discussed with: bz, gnn, rwatson; (earlier version)


155166 01-Feb-2006 ru

Brain-o (use standard int types now).


155152 31-Jan-2006 ru

Fix multicast routing on 64-bit platforms.

Tested on: amd64
MFC after: 3 days


155145 31-Jan-2006 thompsa

Now that the bridge also processes Ethernet frames as itself, two arp replies
will be sent if there is an address on the bridge. Exclude the bridge from the
special arp handling.

This has been tested with all combinations of addresses on the bridge and members.

Pointed out by: Michal Mertl


155037 30-Jan-2006 glebius

Add some initial locking to gif(4). It doesn't cover the whole driver,
however IPv4-in-IPv4 tunnels are now stable on SMP. Details:

- Add per-softc mutex.
- Hold the mutex on output.

The main problem was the rtentry, placed in softc. It could be
freed by ip_output(). Meanwhile, another thread being in
in_gif_output() can read and write this rtentry.

Reported by: many
Tested by: Alexander Shiryaev <aixp mail.ru>


155018 29-Jan-2006 thompsa

Back out of r1.148, it causes two arp replies to be sent with different mac
addresses. One for the bridged interface with the IP address assigned but then
another with the mac for the bridge itself.


154780 24-Jan-2006 andre

When doing IP forwarding with [FAST_]IPSEC compiled into the kernel
ip_forward() would report back a zero MTU in ICMP needfrag messages
because on a IPSEC SP lookup failure no MTU got computed.

Fix this by changing the logic to compute a new MTU in any case if
IPSEC didn't do it.

Change MTU computation logic to use egress interface MTU if available
or the next smaller MTU compared to the current packet size instead
of falling back to a very small fixed MTU.

Fix associated comment.

PR: kern/91412
MFC after: 3 days
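
The MTU selection described above can be sketched as follows. This is a
minimal illustration, not the ip_forward() code: needfrag_mtu() and
next_smaller_mtu() are hypothetical names, and the table is assumed to be
the plateau list from RFC 1191.

```c
#include <assert.h>
#include <stddef.h>

/* MTU plateaus from RFC 1191, largest first. */
static const int mtu_plateaus[] = {
	65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68
};

/* Next plateau strictly smaller than pktlen, or 0 if there is none. */
static int
next_smaller_mtu(int pktlen)
{
	size_t i;

	for (i = 0; i < sizeof(mtu_plateaus) / sizeof(mtu_plateaus[0]); i++)
		if (mtu_plateaus[i] < pktlen)
			return (mtu_plateaus[i]);
	return (0);
}

/*
 * MTU to report in a "fragmentation needed" error. if_mtu is 0 when the
 * egress interface is not known.
 */
static int
needfrag_mtu(int if_mtu, int pktlen)
{
	assert(pktlen > 0);
	return (if_mtu > 0 ? if_mtu : next_smaller_mtu(pktlen));
}
```

Falling back to the next smaller plateau lets the sender converge in a few
steps instead of dropping straight to a very small fixed value.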


154779 24-Jan-2006 andre

In ip_mdq() compute the TV_DELTA the correct way around.

PR: kern/91851
Submitted by: SAKAI Hiroaki <sakai.hiroaki-at-jp.fujitsu.com>
MFC after: 3 days


154777 24-Jan-2006 andre

In in_control() remove the temporary in_ifaddr structure from the
ia_hash only if it actually is an AF_INET address. All other places
test for sa_family == AF_INET but this one.

PR: kern/92091
Submitted by: Seth Kingsley <sethk-at-meowfishies.com>
MFC after: 3 days


154769 24-Jan-2006 oleg

Fix minor bug in uRPF:
If net.link.ether.inet.useloopback=1 and we send broadcast packet using our
own source ip address it may be rejected by uRPF rules.

Same bug was fixed for IPv6 in rev. 1.115 by suz.

PR: kern/76971
Approved by: glebius (mentor)
MFC after: 3 days


154767 24-Jan-2006 glebius

Implement 'ipfw fwd laddr,port' feature for UDP. According to ipfw(8)
it should work, however it never did. People expect it to work.

PR: kern/90834


154733 23-Jan-2006 glebius

Fix build.


154728 23-Jan-2006 andre

Simplify ip_next_mtu() and make its logic easier to see while
silencing code analysis tools.

Found by: Coverity Prevent(tm)
Coverity ID: CID341
Sponsored by: TCP/IP Optimization Fundraise 2005


154666 22-Jan-2006 rwatson

Convert remaining functions to ANSI C function declarations; remove
'register' where present.

MFC after: 1 week


154665 22-Jan-2006 rwatson

Convert last remaining function in ip_gre.c to ANSI C function
declaration.

MFC after: 1 week


154625 21-Jan-2006 bz

Fix stack corruptions on amd64.

Vararg functions have a different calling convention than regular
functions on amd64. Casting a vararg function to a regular one to
match the function pointer declaration will hide the varargs from
the caller and we will end up with an incorrectly setup stack.

Entirely remove the varargs from these functions and change the
functions to match the declaration of the function pointers.
Remove the now unnecessary casts.

Lots of explanations and help from: peter
Reviewed by: peter
PR: amd64/89261
MFC after: 6 days
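
The shape of the fix can be illustrated with a small sketch. The names
input_fn, proto_input_sketch() and dispatch() are hypothetical and only
show the pattern: the handler is declared with exactly the prototype of the
function pointer it is called through, so no cast is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer type the dispatcher calls through. */
typedef int (*input_fn)(const char *pkt, int off);

/*
 * Handler declared with exactly the pointer's prototype. Declaring it as
 * "int handler(const char *pkt, ...)" and casting it to input_fn would make
 * the call undefined, because variadic functions may use a different
 * calling convention.
 */
static int
proto_input_sketch(const char *pkt, int off)
{
	return (pkt[off]);
}

static int
dispatch(input_fn fn, const char *pkt, int off)
{
	assert(fn != NULL && pkt != NULL);
	return (fn(pkt, off));	/* types match, so no cast is required */
}
```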


154567 20-Jan-2006 csjp

- Change the return type for init_tables from void to int so we can propagate
errors from rn_inithead back to the ipfw initialization function.
- Check return value of rn_inithead for failure, if table allocation has
failed for any reason, free up any tables we have created and return ENOMEM
- In ipfw_init check the return value of init_tables and free up any mutexes or
UMA zones which may have been created.
- Assert that the supplied table is not NULL before attempting to dereference.

This fixes panics which were a result of invalid memory accesses due to failed
table allocation. This is an issue mainly because the R_Zalloc function is a
malloc(M_NOWAIT) wrapper, thus making it possible for allocations to fail.

Found by: Coverity Prevent (tm)
Coverity ID: CID79
MFC after: 1 week
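
The error propagation described above can be sketched as follows. This is
an illustration only: NTABLES, table_alloc() and the fail_at test hook are
hypothetical, and calloc() stands in for the non-sleeping allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define NTABLES 4		/* illustrative, not the ipfw constant */

static int fail_at = -1;	/* test hook: index whose allocation fails */

/* Stand-in for an allocation that is allowed to fail. */
static void *
table_alloc(int i)
{
	return (i == fail_at ? NULL : calloc(1, 64));
}

/* Returns 0 on success; on failure frees what was created so far. */
static int
init_tables_sketch(void *tables[NTABLES])
{
	int i, j;

	assert(tables != NULL);
	for (i = 0; i < NTABLES; i++) {
		tables[i] = table_alloc(i);
		if (tables[i] == NULL) {
			for (j = 0; j < i; j++) {
				free(tables[j]);
				tables[j] = NULL;
			}
			return (ENOMEM);
		}
	}
	return (0);
}
```

A caller that sees a non-zero return can then release its own resources
(locks, zones) before failing initialization.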


154563 20-Jan-2006 csjp

Destroy the dynamic rule zone in the event that we fail to insert the
initial default rule.

MFC after: 1 week


154528 18-Jan-2006 andre

Do not dereference the ip header pointer in the IPv6 case.
This fixes a bug in the previous commit.

Found by: Coverity Prevent(tm)
Coverity ID: CID253
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


154526 18-Jan-2006 andre

In in_delayed_cksum() we can't perform a m_pullup() as it may
change the mbuf pointer and we don't have any way of passing
it back to the callers. Instead just fail silently without
updating the checksum but leaving the mbuf+chain intact.

A search in our GNATS database did not turn up any match for
the existing warning message when this case is encountered.

Found by: Coverity Prevent(tm)
Coverity ID: CID779
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


154524 18-Jan-2006 andre

In syncache_expand() insert a proper syncache_free() to fix a case
that currently can't be triggered. But better be safe than sorry
later on. Additionally it properly silences Coverity Prevent for
future tests.

Found by: Coverity Prevent(tm)
Coverity ID: CID802
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


154520 18-Jan-2006 andre

Prevent dereferencing a NULL route pointer when trying to update the
route MTU.

This bug is very difficult to reach and not remotely exploitable.

Found by: Coverity Prevent(tm)
Coverity ID: CID162
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


154518 18-Jan-2006 andre

Return mbuf pointer or NULL from ip_fastforward() as the mbuf pointer
may have changed by m_pullup() during fastforward processing.

While this is a bug it is actually never triggered in real world
situations and it is not remotely exploitable.

Found by: Coverity Prevent(tm)
Coverity ID: CID780
Sponsored by: TCP/IP Optimization Fundraise 2005
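
The pointer-return convention can be illustrated with a userland analogy.
This is a sketch under stated assumptions: realloc() stands in for
m_pullup(), since both may hand back a different pointer, and the names
pullup_like() and forward_like() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Like m_pullup(): may move the data, and on failure frees the original
 * and returns NULL.
 */
static char *
pullup_like(char *buf, size_t newlen)
{
	char *nbuf = realloc(buf, newlen);

	if (nbuf == NULL)
		free(buf);
	return (nbuf);
}

/*
 * Returns the (possibly moved) buffer, or NULL if it was consumed. The
 * caller must continue with the returned pointer, never its old copy.
 */
static char *
forward_like(char *pkt, size_t need)
{
	assert(need > 0);
	pkt = pullup_like(pkt, need);
	if (pkt == NULL)
		return (NULL);
	pkt[need - 1] = '\0';
	return (pkt);
}
```

A function that returns void or a status code has no way to tell its caller
that the buffer moved, which is the situation the commit addresses.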


154400 15-Jan-2006 rwatson

Modify the IP fragment reassembly code so that it uses a new UMA zone,
ipq_zone, to allocate fragment headers from, rather than using cast mbuf
storage. This was one of the few remaining uses of mbuf storage for
local data structures that relied on dtom(). Implement the resource
limit on ipq's using UMA zone limits, but preserve current sysctl
semantics using a sysctl proc.

MFC after: 3 weeks


154395 15-Jan-2006 rwatson

Staticize ipqlock, since it is local to ip_input.c.

MFC after: 3 days


154366 14-Jan-2006 gnn

Check the correct TTL in both the IPv6 and IPv4 cases.

Submitted by: glebius
Reviewed by: gnn, bz
Found with: Coverity Prevent(tm)


154355 14-Jan-2006 glebius

UMA can return NULL not only in case when our zone is full, but
also in case of generic memory shortage. In the latter case we may
not find an old entry.

Found with: Coverity Prevent(tm)


154349 14-Jan-2006 rwatson

Remove dead code: 'opts' is not used in udp_append(), only in udp_input(),
so no need to assign it to NULL or conditionally free it.

Found with: Coverity Prevent(tm)
MFC after: 3 days


154271 12-Jan-2006 thompsa

Include the bridge interface itself in the special arp handling.

PR: 90973
MFC after: 1 week


154216 11-Jan-2006 cperciva

Correct insecure temporary file usage in texindex. [06:01]
Correct insecure temporary file usage in ee. [06:02]
Correct a race condition when setting file permissions, sanitize file
names by default, and fix a buffer overflow when handling files
larger than 4GB in cpio. [06:03]
Fix an error in the handling of IP fragments in ipfw which can cause
a kernel panic. [06:04]

Security: FreeBSD-SA-06:01.texindex
Security: FreeBSD-SA-06:02.ee
Security: FreeBSD-SA-06:03.cpio
Security: FreeBSD-SA-06:04.ipfw


153621 21-Dec-2005 thompsa

Add RFC 3378 EtherIP support. This change makes it possible to add gif
interfaces to bridges, which will then send and receive IP protocol 97 packets.
Packets are Ethernet frames with an EtherIP header prepended.

Obtained from: NetBSD
MFC after: 2 weeks


153553 20-Dec-2005 delphij

Use consistent indent character as other IPPROTO_* lines did.


153552 20-Dec-2005 gnn

Add protocol number for SCTP.

Submitted by: Randall Stewart rrs at cisco.com
MFC after: 1 week


153513 18-Dec-2005 glebius

Add a knob to suppress logging of attempts to modify
permanent ARP entries.

Submitted by: Andrew Alcheyev <buddy telenet.ru>


153478 16-Dec-2005 emaste

Add descriptions for sysctl -d.

Approved by: glebius
Silence from: rwatson (mentor)


153476 16-Dec-2005 glebius

Cleanup __FreeBSD_version.


153461 15-Dec-2005 jhb

Use %t (ptrdiff_t modifier) to print a couple of pointer differences rather
than casting them to int.
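
A minimal userland sketch of the same idea, using the standard C99 %td
conversion for ptrdiff_t; format_offset() is a hypothetical helper, not
code from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats the distance between two pointers into the same buffer. */
static int
format_offset(char *out, size_t outlen, const char *base, const char *cur)
{
	ptrdiff_t off;

	assert(cur >= base);
	off = cur - base;
	/* %td takes a ptrdiff_t directly; no narrowing cast to int. */
	return (snprintf(out, outlen, "offset %td", off));
}
```

On 64-bit platforms a cast to int would truncate large differences, while
the t modifier always matches the width of ptrdiff_t.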


153427 14-Dec-2005 mux

Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by: Thomas Hurst <tom@hur.st>
MFC after: 3 days


153374 13-Dec-2005 glebius

Add a new feature for optimizing ipfw rulesets - substitution of the
action argument with the value obtained from table lookup. The feature
is now applicable only to "pipe", "queue", "divert", "tee", "netgraph"
and "ngtee" rules.

An example usage:

ipfw pipe 1000 config bw 1000Kbyte/s
ipfw pipe 4000 config bw 4000Kbyte/s
ipfw table 1 add x.x.x.x 1000
ipfw table 1 add x.x.x.y 4000
ipfw pipe tablearg ip from table(1) to any

In the example above the rule will throw different packets to different pipes.

TODO:
- Support "skipto" action, but without searching all rules.
- Improve parser, so that it warns about bad rules. These are:
- "tablearg" argument to action, but no "table" in the rule. All
traffic will be blocked.
- "tablearg" argument to action, but "table" searches for entry with
a specific value. All traffic will be blocked.
- "tablearg" argument to action, and two "table" looks - for src and
for dst. The last lookup will match.


153164 06-Dec-2005 glebius

When we drop packet due to no space in output interface output queue, also
increase the ifp->if_snd.ifq_drops.

PR: 72440
Submitted by: ikob


153163 06-Dec-2005 glebius

Optimize parallel processing of ipfw(4) rulesets eliminating the locking
of the radix lookup tables. Since several rnh_lookup() can run in
parallel on the same table, we can piggyback on the shared locking
provided by ipfw(4).
However, the single entry cache in the ip_fw_table can't be used lockless,
so it is removed. This pessimizes two cases: processing of bursts of similar
packets and matching one packet against the same table several times during
one ipfw_chk() lookup. To optimize the processing of similar packet bursts
administrator should use stateful firewall. To optimize the second problem
a solution will be provided soon.

Details:
o Since we piggyback on the ipfw(4) locking, and the latter is per-chain,
the tables are moved from the global declaration to the
struct ip_fw_chain.
o The struct ip_fw_table is shrunk to one entry and thus vanished.
o All table manipulating functions are extended to accept the struct
ip_fw_chain * argument.
o All table modifying functions use IPFW_WLOCK_ASSERT().


153072 04-Dec-2005 ru

Fix -Wundef.


152928 29-Nov-2005 ume

obey opt_inet6.h and opt_ipsec.h in kernel build directory.

Requested by: hrs


152917 29-Nov-2005 glebius

Garbage-collect now unused struct _ipfw_insn_pipe and flush_pipe_ptrs(),
thus removing a few XXXes.
Document the ABI breakage in UPDATING.


152910 29-Nov-2005 glebius

First step in removing welding between ipfw(4) and dummynet.

o Do not use ipfw_insn_pipe->pipe_ptr in locate_flowset(). The
_ipfw_insn_pipe isn't touched by this commit to preserve ABI
compatibility.
o To optimize the lookup of the pipe/flowset in locate_flowset()
introduce hashes for pipes and queues:
- To preserve ABI compatibility utilize the place of global list
pointer for SLIST_ENTRY.
- Introduce locate_flowset(queue nr) and locate_pipe(pipe nr).
o Rework all the dummynet code to deal with the hashes, not global
lists. Also did some style(9) changes in the code blocks that were
touched by this sweep:
- Be conservative about flowset and pipe variable names on stack,
use "fs" and "pipe" everywhere.
- Cleanup whitespaces.
- Sort variables.
- Give variables more meaningful names.
- Uppercase and dots in comments.
- ENOMEM when malloc(9) failed.
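
The hashed lookup described above can be sketched as follows. This is an
illustration only: HASHSIZE, HASH(), struct pipe_entry and the _sketch
function names are hypothetical, and a hand-rolled singly linked chain
stands in for SLIST.

```c
#include <assert.h>
#include <stddef.h>

#define HASHSIZE 16	/* illustrative bucket count */
#define HASH(num) ((unsigned)(num) % HASHSIZE)

struct pipe_entry {
	struct pipe_entry *next;	/* bucket chain */
	int pipe_nr;
	int bandwidth;
};

static struct pipe_entry *pipehash[HASHSIZE];

static void
pipe_insert_sketch(struct pipe_entry *p)
{
	assert(p != NULL);
	p->next = pipehash[HASH(p->pipe_nr)];
	pipehash[HASH(p->pipe_nr)] = p;
}

/* Walks only one bucket instead of a global list of all pipes. */
static struct pipe_entry *
locate_pipe_sketch(int pipe_nr)
{
	struct pipe_entry *p;

	for (p = pipehash[HASH(pipe_nr)]; p != NULL; p = p->next)
		if (p->pipe_nr == pipe_nr)
			return (p);
	return (NULL);
}
```

Looking a pipe up by number on each packet removes the need for rules to
cache a raw pointer to the pipe.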


152767 24-Nov-2005 ru

Fix prototype.


152655 21-Nov-2005 ps

Fix for a bug that causes SACK scoreboard corruption when the limit
on holes per connection is reached.

Reported by: Patrik Roos
Submitted by: Mohan Srinivasan
Reviewed by: Raja Mukerji, Noritoshi Demizu


152612 19-Nov-2005 andre

Remove 'ipprintfs' which were protected under DIAGNOSTIC. It doesn't
have any knob to enable it from userland and could only be enabled by
either setting it to 1 at compile time or through the kernel debugger.

In the future it may be brought back as KTR tracing points.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


152608 19-Nov-2005 andre

Move MAX_IPOPTLEN and struct ipoption back into ip_var.h as
userland programs depend on it.

Pointed out by: le
Sponsored by: TCP/IP Optimization Fundraise 2005


152592 18-Nov-2005 andre

Consolidate all IP Options handling functions into ip_options.[ch] and
include ip_options.h into all files making use of IP Options functions.

From ip_input.c rev 1.306:
ip_dooptions(struct mbuf *m, int pass)
save_rte(m, option, dst)
ip_srcroute(m0)
ip_stripoptions(m, mopt)

From ip_output.c rev 1.249:
ip_insertoptions(m, opt, phlen)
ip_optcopy(ip, jp)
ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m)

No functional changes in this commit.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


152583 18-Nov-2005 andre

Purge layer specific mbuf flags on layer crossings to avoid confusing
upper or lower layers.

Sponsored by: TCP/IP Optimization Fundraise 2005


152582 18-Nov-2005 andre

Rework icmp_error() to deal with truncated IP packets from
ip_forward() when doing extended quoting in error messages.

Sponsored by: TCP/IP Optimization Fundraise 2005


152581 18-Nov-2005 andre

In ip_forward() copy as much into the temporary error mbuf as we
have free space in it. Allocate correct mbuf from the beginning.
This allows icmp_error() to quote the entire TCP header in error
messages.

Sponsored by: TCP/IP Optimization Fundraise 2005


152550 17-Nov-2005 glebius

MFOpenBSD 1.62:

Prevent backup CARP hosts from replying to arp requests, fixes strangeness
with some layer-3 switches. From Bill Marquette.

Tested by: Kazuaki Oda <kaakun highway.ne.jp>


152410 14-Nov-2005 ru

Unbreak for !INET6 case.


152315 11-Nov-2005 ru

- Store pointer to the link-level address right in "struct ifnet"
rather than in ifindex_table[]; all (except one) accesses are
through ifp anyway. IF_LLADDR() works faster, and all (except
one) ifaddr_byindex() users were converted to use ifp->if_addr.

- Stop storing a (pointer to) Ethernet address in "struct arpcom",
and drop the IFP2ENADDR() macro; all users have been converted
to use IF_LLADDR() instead.


152288 10-Nov-2005 suz

Fix a bug where uRPF does not work properly for an IPv6 packet bound for
the sending machine itself (a bug introduced by a change in
ip6_input.c:Rev.1.83).

Pointed out by: Sean McNeil and J.R.Oldroyd
MFC after: 3 days


152242 09-Nov-2005 ru

Use sparse initializers for "struct domain" and "struct protosw",
so they are easier to follow for the human being.
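
A small sketch of what such initializers look like, using C99 designated
initializers. The structure and constants below are hypothetical and much
smaller than the real "struct protosw".

```c
#include <stddef.h>

#define SK_DGRAM	2	/* illustrative values */
#define PROTO_UDP	17

struct protosw_sketch {
	int type;
	int protocol;
	int flags;
	void (*init)(void);
	void (*input)(void);
};

static void
udp_input_stub(void)
{
}

/*
 * Each field is named, so the reader does not have to count positions,
 * and members that are left out are zero-initialized.
 */
static const struct protosw_sketch udp_sw = {
	.type =		SK_DGRAM,
	.protocol =	PROTO_UDP,
	.input =	udp_input_stub
};
```

With positional initializers the same entry would need explicit 0 and NULL
placeholders for every skipped member.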


152209 08-Nov-2005 thompsa

Move the cloned interface list management in to if_clone. For some drivers the
softc lists and associated mutex are now unused so these have been removed.

Calling if_clone_detach() will now destroy all the cloned interfaces for the
driver and in most cases is all that's needed to unload.

Idea by: brooks
Reviewed by: brooks


152188 08-Nov-2005 glebius

Rework ARP retransmission algorithm so that ARP requests are
retransmitted without suppression, while there is demand for
such ARP entry. As before, retransmission is rate limited to
one packet per second. Details:
- Remove net.link.ether.inet.host_down_time
- Do not set/clear RTF_REJECT flag on route, to
avoid rt_check() returning error. We will generate error
ourselves.
- Return EWOULDBLOCK on first arp_maxtries failed
requests, and return EHOSTDOWN/EHOSTUNREACH
on further requests.
- Retransmit ARP request always, independently from return
code. Ratelimit to 1 pps.
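
The behaviour listed above can be sketched as a small state function. This
is a simplified illustration, not the arpresolve() code: struct
arp_entry_sketch and arp_unresolved_sketch() are hypothetical, and only
EHOSTDOWN is modelled for the "further requests" case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct arp_entry_sketch {
	int asked;		/* requests sent without a reply */
	long last_sent;		/* second in which the last request went out */
};

/*
 * Called for each packet that needs an unresolved entry. Returns the error
 * for the caller and reports through *sent whether a request was emitted.
 */
static int
arp_unresolved_sketch(struct arp_entry_sketch *la, long now, int maxtries,
    bool *sent)
{
	int error;

	assert(la != NULL && sent != NULL);
	error = (la->asked < maxtries) ? EWOULDBLOCK : EHOSTDOWN;

	/* Always retransmit on demand, but at most once per second. */
	*sent = false;
	if (la->last_sent != now) {
		la->last_sent = now;
		la->asked++;
		*sent = true;
	}
	return (error);
}
```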


151967 02-Nov-2005 andre

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


151897 31-Oct-2005 rwatson

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


151888 30-Oct-2005 rwatson

Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.


151824 28-Oct-2005 glebius

First fill in structure with valid values, and only then attach it
to the global list.

Reviewed by: rwatson


151688 26-Oct-2005 yar

Since carp(4) interfaces presently are kinda fake yet possess
IP addresses, mark them with LOOPBACK so that routing daemons
take them easy for link-state routing protocols.

Reviewed by: glebius


151556 22-Oct-2005 mlaier

Fix build after in6_joingroup change. It remains unclear if DAD breaks CARP
or not.


151555 22-Oct-2005 glebius

In in_addprefix() compare not only route addresses, but their masks,
too. This fixes problem when connected prefixes overlap.

Obtained from: OpenBSD (rev. 1.40 by claudio);
[ I came to this fix myself, and then found out that
OpenBSD had already fixed it the same way.]
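
The comparison can be sketched as follows. This is an illustration, not
the in_addprefix() code: struct prefix_sketch and same_prefix() are
hypothetical and use host byte order for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct prefix_sketch {
	uint32_t addr;	/* host byte order */
	uint32_t mask;
};

/* Two prefixes are the same route only if network AND mask both match. */
static bool
same_prefix(const struct prefix_sketch *a, const struct prefix_sketch *b)
{
	assert(a != NULL && b != NULL);
	return ((a->addr & a->mask) == (b->addr & b->mask) &&
	    a->mask == b->mask);
}
```

Comparing addresses alone would treat 10.0.0.0/8 and 10.0.0.0/24 as the
same prefix, which is the overlap case the commit mentions.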


151539 21-Oct-2005 suz

sync with KAME regarding NDP

- introduced fine-grain-timer to manage ND-caches and IPv6 Multicast-Listeners
- supports Router-Preference <draft-ietf-ipv6-router-selection-07.txt>
- better prefix lifetime management
- more spec-conformant DAD advertisement
- updated RFC/internet-draft revisions

Obtained from: KAME
Reviewed by: ume, gnn
MFC after: 2 month


151464 19-Oct-2005 rwatson

Convert if (tp->t_state == TCPS_LISTEN) panic() into a KASSERT.

MFC after: 2 weeks
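
The conversion can be sketched in userland. KASSERT_SKETCH below is a
hypothetical stand-in with a simplified message argument, not the kernel's
KASSERT(); the state names and struct are likewise illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

/* Active only in debugging builds, compiled out otherwise. */
#ifdef INVARIANTS_SKETCH
#define KASSERT_SKETCH(exp, msg) do {					\
	if (!(exp)) {							\
		fprintf(stderr, "%s\n", (msg));				\
		abort();						\
	}								\
} while (0)
#else
#define KASSERT_SKETCH(exp, msg) do { } while (0)
#endif

enum { ST_CLOSED, ST_LISTEN, ST_ESTABLISHED };	/* illustrative states */

struct tcpcb_sketch {
	int t_state;
};

static int
handle_segment(struct tcpcb_sketch *tp)
{
	/* Before: if (tp->t_state == ST_LISTEN) panic("..."); */
	KASSERT_SKETCH(tp->t_state != ST_LISTEN,
	    "handle_segment: unexpected listen state");
	return (tp->t_state == ST_ESTABLISHED);
}
```

The assertion documents the invariant and costs nothing in builds where
the checks are disabled.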


151266 12-Oct-2005 thompsa

Change the reference counting to count the number of cloned interfaces for each
cloner. This ensures that ifc->ifc_units is not prematurely freed in
if_clone_detach() before the clones are destroyed, resulting in memory modified
after free. This could be triggered with if_vlan.

Assert that all cloners have been destroyed when freeing the memory.

Change all simple cloners to destroy their clones with ifc_simple_destroy() on
module unload so the reference count is properly updated. This also cleans up
the interface destroy routines and allows future optimisation.

Discussed with: brooks, pjd, -current
Reviewed by: brooks


151263 12-Oct-2005 maxim

o INP_ONESBCAST is inpcb.inp_vflag flag not inp_flags. The confusion
with IP_PORTRANGE_HIGH leads to the incorrect checksum calculation.

PR: kern/87306
Submitted by: Rickard Lind
Reviewed by: bms
MFC after: 2 weeks


151254 12-Oct-2005 philip

Unbreak the net.inet6.tcp6.getcred sysctl.

This makes inetd/auth work again in IPv6 setups.

Pointy hat to: ume/KAME


150942 04-Oct-2005 thompsa

When bridging is enabled and an ARP request is received on a member interface,
the arp code will search all local interfaces for a match. This triggers a
kernel log if the bridge has been assigned an address.

arp: ac:de:48:18:83:3d is using my IP address 192.168.0.142!

bridge0: flags=8041<UP,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.142 netmask 0xffffff00
ether ac:de:48:18:83:3d

Silence this warning for 6.0 to stop unnecessary bug reports, the code will need
to be reworked.

Approved by: mlaier (mentor)
MFC after: 3 days


150941 04-Oct-2005 andre

Correct brainfart in SO_BINTIME test.

Pointed out by: nate
Pointy hat to: andre


150940 04-Oct-2005 andre

Make SO_BINTIME timestamps available on raw_ip sockets.

Sponsored by: TCP/IP Optimization Fundraise 2005


150853 03-Oct-2005 rwatson

Unlock Giant symmetrically with respect to lock acquire order as that's
generally nicer.

Spotted by: johan
MFC after: 1 week


150852 03-Oct-2005 rwatson

Acquire Giant conditionally in in_addmulti() and in_delmulti() based on
whether the interface being accessed is IFF_NEEDSGIANT or not. This
avoids lock order reversals when calling into the interface ioctl
handler, which could potentially lead to deadlock.

The long term solution is to eliminate non-MPSAFE network drivers.

Discussed with: jhb
MFC after: 1 week


150804 02-Oct-2005 maxim

o Teach sysctl_drop() how to deal with the sockets in TIME_WAIT state.
This is a special case because tcp_twstart() destroys a tcp control
block via tcp_discardcb() so we cannot call tcp_drop(struct tcpcb *) on
such connections. Use tcp_twclose() instead.

MFC after: 5 days


150636 27-Sep-2005 mlaier

Remove bridge(4) from the tree. if_bridge(4) is a full functional
replacement and has additional features which make it superior.

Discussed on: -arch
Reviewed by: thompsa
X-MFC-after: never (RELENG_6 as transition period)


150594 26-Sep-2005 andre

Implement IP_DONTFRAG IP socket option enabling the Don't Fragment
flag on IP packets. Currently this option is only respected on udp
and raw ip sockets. On tcp sockets the DF flag is controlled by the
path MTU discovery option.

Sending a packet larger than the MTU size of the egress interface
returns an EMSGSIZE error.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005


150351 19-Sep-2005 andre

Use monotonic 'time_uptime' instead of 'time_second' as timebase
for rt->rt_rmx.rmx_expire.


150350 19-Sep-2005 andre

Use monotonic 'time_uptime' instead of 'time_second' as timebase
for timeouts.
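
The reasoning behind this and the preceding change can be shown with a
userland analogy. This is a sketch, not kernel code: CLOCK_MONOTONIC stands
in for 'time_uptime', and uptime_seconds() and timeout_expired() are
hypothetical helpers.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds on a clock that never jumps when the wall time is changed. */
static long
uptime_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((long)ts.tv_sec);
}

/* Expiry times are stored and compared on the same monotonic clock. */
static int
timeout_expired(long expire_at)
{
	return (uptime_seconds() >= expire_at);
}
```

A timeout computed from wall-clock seconds can fire early or late if the
system time is stepped; a monotonic base avoids that.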


150296 18-Sep-2005 rwatson

Take a first cut at cleaning up ifnet removal and multicast socket
panics, which occur when stale ifnet pointers are left in struct
moptions hung off of inpcbs:

- Add in_ifdetach(), which matches in6_ifdetach(), and allows the
protocol to perform tear-down on the interface early in
if_detach().

- Annotate that if_detach() needs careful consideration.

- Remove calls to in_pcbpurgeif0() in the handling of SIOCDIFADDR --
this is not the place to detect interface removal! This also
removes what is basically a nasty (and now unnecessary) hack.

- Invoke in_pcbpurgeif0() from in_ifdetach(), in both raw and UDP
IPv4 sockets.

It is now possible to run the msocket_ifnet_remove regression test
using HEAD without panicking.

MFC after: 3 days


150131 14-Sep-2005 andre

Do not ignore all other TCP options (eg. timestamp, window scaling)
when responding to TCP SYN packets with TCP_MD5 enabled and set.

PR: kern/82963
Submitted by: <demizu at dd.iij4u.or.jp>
MFC after: 3 days


150122 14-Sep-2005 bz

Fix panic when kernel compiled without INET6 by rejecting
IPv6 opcodes which are behind #if(n)def INET6 now.

PR: kern/85826
MFC after: 3 days


149929 10-Sep-2005 andre

In tcp_ctlinput() do not swap ip->ip_len a second time. It
has been done in icmp_input() already.

This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was
proposed in the ICMP reply.

PR: kern/81813
Submitted by: Vitezslav Novy <vita at fio.cz>
MFC after: 3 days


149909 09-Sep-2005 glebius

- Do not hold route entry lock, when calling arprequest(). One such
call was introduced by me in 1.139, the other one was present before.
- Do all manipulations with rtentry and la before dropping the lock.
- Copy interface address from route into local variable before dropping
the lock. Supply this copy as argument to arprequest()

LORs fixed:
http://sources.zabbadoz.net/freebsd/lor/003.html
http://sources.zabbadoz.net/freebsd/lor/037.html
http://sources.zabbadoz.net/freebsd/lor/061.html
http://sources.zabbadoz.net/freebsd/lor/062.html
http://sources.zabbadoz.net/freebsd/lor/064.html
http://sources.zabbadoz.net/freebsd/lor/068.html
http://sources.zabbadoz.net/freebsd/lor/071.html
http://sources.zabbadoz.net/freebsd/lor/074.html
http://sources.zabbadoz.net/freebsd/lor/077.html
http://sources.zabbadoz.net/freebsd/lor/093.html
http://sources.zabbadoz.net/freebsd/lor/135.html
http://sources.zabbadoz.net/freebsd/lor/140.html
http://sources.zabbadoz.net/freebsd/lor/142.html
http://sources.zabbadoz.net/freebsd/lor/145.html
http://sources.zabbadoz.net/freebsd/lor/152.html
http://sources.zabbadoz.net/freebsd/lor/158.html


149907 09-Sep-2005 glebius

When a carp(4) interface is being destroyed and is in a promiscuous mode,
first interface is detached from parent and then bpfdetach() is called.
If the interface was the last carp(4) interface attached to parent, then
the mutex on parent is destroyed. When bpfdetach() calls if_setflags()
we panic on destroyed mutex.

To prevent the above scenario, clear pointer to parent, when we detach
ourselves from parent.


149783 04-Sep-2005 sam

clear lock on error in O_LIMIT case of install_state

Submitted by: Ted Unangst
MFC after: 3 days


149635 30-Aug-2005 andre

Use the correct mbuf type for MGET().


149506 26-Aug-2005 glebius

Add newline to debugging printf.

PR: kern/85271
Submitted by: Simon Morgan


149455 25-Aug-2005 glebius

- Refuse hashsize of 0, since it is invalid.
- Use defined constant instead of 512.


149451 25-Aug-2005 glebius

When we have a published ARP entry for some IP address, reply to
ARP requests only on the network to which this IP address belongs.

Before this change we replied on all interfaces. This could
lead to an IP address conflict with the host we are doing ARP proxy
for.

PR: kern/75634
Reviewed by: andre


149404 24-Aug-2005 ps

Remove a KASSERT in the sack path that fails because of an interaction
between sack and a bug in the "bad retransmit recovery" logic. This is
a workaround, the underlying bug will be fixed later.

Submitted by: Mohan Srinivasan, Noritoshi Demizu


149403 24-Aug-2005 ps

Fix up the comment for MAX_SACK_BLKS.

Submitted by: Noritoshi Demizu


149391 23-Aug-2005 andre

Remove unnecessary IPSEC includes.

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


149378 22-Aug-2005 andre

o Fix a logic error when not doing mbuf cluster allocation.
o Change an old panic() to a clean function exit.

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


149371 22-Aug-2005 andre

Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005
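
A GTSM-style listener would set the option described above to 255 before accepting traffic. The following is a minimal sketch under stated assumptions: the helper name set_minttl() is hypothetical, an int option value is assumed, and the range check and #else branch are additions for illustration only.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: drop received packets whose TTL is below 'ttl'.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int
set_minttl(int s, int ttl)
{
	if (ttl < 0 || ttl > 255) {	/* the IPv4 TTL field is 8 bits wide */
		errno = EINVAL;
		return (-1);
	}
#ifdef IP_MINTTL
	return (setsockopt(s, IPPROTO_IP, IP_MINTTL, &ttl, sizeof(ttl)));
#else
	/* Assumption: report the option as unsupported where it is missing. */
	(void)s;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

Calling set_minttl(s, 255) on a listening socket only makes sense if the peer also sends with a TTL of 255.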


149370 22-Aug-2005 andre

Always quote the entire TCP header when responding and allocate an mbuf
cluster if needed.

Fixes the TCP issues raised in I-D draft-gont-icmp-payload-00.txt.

This aids in-the-wild debugging a lot and allows the receiver to do
more elaborate checks on the validity of the response.

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005


149369 22-Aug-2005 andre

Handle pure layer 2 broad- and multicasts properly and simplify related
checks.

PR: kern/85052
Submitted by: Dmitrij Tejblum <tejblum at yandex-team.ru>
MFC after: 3 days


149350 21-Aug-2005 andre

Commit correct version of the change and note the name of the new
sysctl: net.inet.icmp.quotelen, which defaults to 8 bytes.

Pointy hat to: andre


149349 21-Aug-2005 andre

Add a sysctl to change the length of the quotation of the original
packet in an ICMP reply. The minimum of 8 bytes is internally
enforced. The maximum quotation is the remaining space in the
reply mbuf.

This option is added in response to the issues raised in I-D
draft-gont-icmp-payload-00.txt.

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005
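
The bounds described above (at least 8 bytes, at most the space left in the reply mbuf) amount to a clamp. A minimal sketch, not the committed code; the helper name, the constant name, and the use of size_t are assumptions:

```c
#include <stddef.h>

#define ICMP_MIN_QUOTE	8	/* minimum named in the commit message */

/*
 * Hypothetical helper: clamp a requested quotation length to the range
 * [8, space left in the reply buffer].
 */
size_t
clamp_quotelen(size_t requested, size_t space_left)
{
	if (requested < ICMP_MIN_QUOTE)
		requested = ICMP_MIN_QUOTE;
	if (requested > space_left)
		requested = space_left;	/* never quote past the buffer */
	return (requested);
}
```

In this sketch the upper bound wins if the buffer has fewer than 8 bytes left; how the kernel handles that corner case is not stated in the log.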


149347 21-Aug-2005 andre

Add an option to have ICMP replies to non-local packets generated with
the IP address the packet came through in. This is useful for routers
to show in traceroutes the actual path a packet has taken instead of
the possibly different return path.

The new sysctl is named net.inet.icmp.reply_from_interface and defaults
to off.

MFC after: 2 weeks


149221 18-Aug-2005 glebius

In order to support CARP interfaces the kernel was taught to handle more
than one interface in one subnet. However, some userland apps rely on
the belief that this configuration is impossible.

Add a sysctl switch net.inet.ip.same_prefix_carp_only. If the switch
is on, then the kernel will refuse to add an additional interface to
an already connected subnet unless the interface is CARP. Default
value is off.

PR: bin/82306
In collaboration with: mlaier


149052 14-Aug-2005 bz

Fix broken build of rev. 1.108 in case of no INET6 and IPFIREWALL
compiled into kernel.

Spotted and tested by: Michal Mertl <mime at traveller.cz>


149020 13-Aug-2005 bz

* Add dynamic sysctl for net.inet6.ip6.fw.
* Correct handling of IPv6 Extension Headers.
* Add unreach6 code.
* Add logging for IPv6.

Submitted by: sysctl handling derived from patch from ume needed for ip6fw
Obtained from: is_icmp6_query and send_reject6 derived from similar
functions of netinet6,ip6fw
Reviewed by: ume, gnn; silence on ipfw@
Test setup provided by: CK Software GmbH
MFC after: 6 days


148980 12-Aug-2005 rodrigc

Add NATM_LOCK() and NATM_UNLOCK() in places where npcb_add() and
npcb_free() are called, in order to eliminate witness panics.
This was overlooked in removal of GIANT from ATM.

Reviewed by: rwatson


148955 11-Aug-2005 glebius

o Fix a race between three threads: output path,
incoming ARP packet and route request adding/removing
ARP entries. The root of the problem is that
struct llinfo_arp was accessed without any locks.
To close race we will use locking provided by
rtentry, that references this llinfo_arp:
- Make arplookup() return a locked rtentry.
- In arpresolve() hold the lock provided by
rt_check()/arplookup() until the end of function,
covering all accesses to the rtentry itself and
llinfo_arp it refers to.
- In in_arpinput() do not drop lock provided by
arplookup() during first part of the function.
- Simplify logic in the first part of in_arpinput(),
removing one level of indentation.
- In the second part of in_arpinput() hold rtentry
lock while copying address.

o Fix a condition when route entry is destroyed, while
another thread is contested on its lock:
- When storing a pointer to rtentry in llinfo_arp list,
always add a reference to this rtentry, to prevent
rtentry being destroyed via RTM_DELETE request.
- Remove this reference when removing entry from
llinfo_arp list.

o Further cleanup of arptimer():
- Inline arptfree() into arptimer().
- Use official queue(3) way to pass LIST.
- Hold rtentry lock while reading its structure.
- Do not check that sdl_family is AF_LINK, but
assert this.

Reviewed by: sam
Stress test: http://www.holm.cc/stress/log/cons141.html
Stress test: http://people.freebsd.org/~pho/stress/log/cons144.html


148920 10-Aug-2005 obrien

Remove public declarations of variables that were forgotten when they were
made static.


148918 10-Aug-2005 obrien

Match IPv6 and use a static struct pr_usrreqs nousrreqs.


148903 09-Aug-2005 rwatson

Add helper function ip_findmoptions(), which accepts an inpcb, and attempts
to atomically return either an existing set of IP multicast options for the
PCB, or a newly allocated set with default values. The inpcb is returned
locked. This function may sleep.

Call ip_moptions() to acquire a reference to a PCB's socket options, and
perform the update of the options while holding the PCB lock. Release the
lock before returning.

Remove garbage collection of multicast options when values return to the
default, as this complicates locking substantially. Most applications
allocate a socket either to be multicast, or not, and don't tend to keep
around sockets that have previously been used for multicast, then used for
unicast.

This closes a number of race conditions involving multiple threads or
processes modifying the IP multicast state of a socket simultaneously.

MFC after: 7 days


148887 09-Aug-2005 rwatson

Propagate rename of IFF_OACTIVE and IFF_RUNNING to IFF_DRV_OACTIVE and
IFF_DRV_RUNNING, as well as the move from ifnet.if_flags to
ifnet.if_drv_flags. Device drivers are now responsible for
synchronizing access to these flags, as they are in if_drv_flags. This
helps prevent races between the network stack and device driver in
maintaining the interface flags field.

Many __FreeBSD__ and __FreeBSD_version checks maintained and continued;
some less so.

Reviewed by: pjd, bz
MFC after: 7 days


148883 09-Aug-2005 glebius

In preparation for fixing races in ARP (and probably in other
L2/L3 mappings) make rt_check() return a locked rtentry.


148682 03-Aug-2005 rwatson

Introduce in_multi_mtx, which will protect IPv4-layer multicast address
lists, as well as accessor macros. For now, this is a recursive mutex
due to code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.

For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.

Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().

Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.

Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.

Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.

Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.

Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 10 days


148653 02-Aug-2005 rwatson

Modify network protocol consumers of the ifnet multicast address lists
to lock if_addr_mtx.

Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 1 week


148616 01-Aug-2005 ume

recover the line which was wrongly removed during scope cleanup.
tcpdrop(8) should work for IPv6, again.


148613 01-Aug-2005 bz

Add support for IPv6 over GRE [1]. PR kern/80340 includes the
FreeBSD specific ip_newid() changes NetBSD does not have.
Correct handling of non AF_INET packets passed to bpf [2].

PR: kern/80340[1], NetBSD PRs 29150[1], 30844[2]
Obtained from: NetBSD ip_gre.c rev. 1.34,1.35, if_gre.c rev. 1.56
Submitted by: Gert Doering <gert at greenie.muc.de>[2]
MFC after: 4 days


148414 26-Jul-2005 ume

include scope6_var.h for in6_clearscope().


148387 25-Jul-2005 ume

include netinet6/scope6_var.h.


148385 25-Jul-2005 ume

scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
scoped addresses as a special case.
- scope boundary check will be stricter. For example, the current
*BSD code allows a packet with src=::1 and dst=(some global IPv6
address) to be sent outside of the node, if the application does:
s = socket(AF_INET6);
bind(s, "::1");
sendto(s, some_global_IPv6_addr);
This is clearly wrong, since ::1 is only meaningful within a single
node, but the current implementation of the *BSD kernel cannot
reject this attempt.

Submitted by: JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from: KAME


148324 23-Jul-2005 keramida

Misc spelling and/or English fixes in comments.

Reviewed by: glebius, andre


148176 20-Jul-2005 ume

move RFC3542 related definitions into ip6.h.

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Reviewed by: mlaier
Obtained from: KAME


148171 20-Jul-2005 ume

add missing RFC3542 definition.

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME


148169 20-Jul-2005 ume

update comments:
- RFC2292bis -> RFC3542
- typo fixes

Submitted by: Keiichi SHIMA <keiichi__at__iijlab.net>
Obtained from: KAME


148157 19-Jul-2005 rwatson

Remove no-op spl references in in_pcb.c, since in_pcb locking has been
basically complete for several years now. Update one spl comment to
reference the locking strategy.

MFC after: 3 days


148156 19-Jul-2005 rwatson

Remove no-op spl's and most comment references to spls, as TCP locking
is believed to be basically done (modulo any remaining bugs).

MFC after: 3 days


148155 19-Jul-2005 rwatson

Remove spl() calls from ip_slowtimo(), as IP fragment queue locking was
merged several years ago.

Submitted by: gnn
MFC after: 1 day


148015 14-Jul-2005 mlaier

Export pfsyncstats via sysctl "net.inet.pfsync" in order to print them with
netstat (separate commit).

Requested by: glebius
MFC after: 1 week


147785 05-Jul-2005 rwatson

Eliminate MAC entry point mac_create_mbuf_from_mbuf(), which is
redundant with respect to existing mbuf copy label routines. Expose
a new mac_copy_mbuf() routine at the top end of the Framework and
use that; use the existing mpo_copy_mbuf_label() routine on the
bottom end.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA, SPAWAR
Approved by: re (scottl)


147781 05-Jul-2005 ps

Fix for a bug in newreno partial ack handling where if a large amount
of data is partial acked, snd_cwnd underflows, causing a burst.

Found, Submitted by: Noritoshi Demizu
Approved by: re
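
The underflow arises because the window is an unsigned quantity: subtracting more than it holds wraps around to a very large value. A minimal sketch of a guarded subtraction, offered as an illustration rather than the committed fix; the helper name and the 32-bit type are assumptions:

```c
#include <stdint.h>

/*
 * Hypothetical helper: shrink an unsigned congestion window by the number
 * of newly acknowledged bytes without wrapping below zero.
 */
uint32_t
deflate_cwnd(uint32_t cwnd, uint32_t acked)
{
	if (acked >= cwnd)
		return (0);	/* plain cwnd - acked would wrap around */
	return (cwnd - acked);
}
```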


147758 03-Jul-2005 mlaier

Remove ambiguity from hlen. IPv4 is now indicated by is_ipv4 and we need a
proper hlen value for IPv6 to implement O_REJECT and O_LOG.

Reviewed by: glebius, brooks, gnn
Approved by: re (scottl)


147744 02-Jul-2005 thompsa

Check the alignment of the IP header before passing the packet up to the
packet filter. This would cause a panic on architectures that require strict
alignment such as sparc64 (tier1) and ia64/ppc (tier2).

This adds two new macros that check the alignment, these are compile time
dependent on __NO_STRICT_ALIGNMENT which is set for i386 and amd64 where
alignment isn't needed so the cost is avoided.

IP_HDR_ALIGNED_P()
IP6_HDR_ALIGNED_P()

Move bridge_ip_checkbasic()/bridge_ip6_checkbasic() up so that the alignment
is checked for ipfw and dummynet too.

PR: ia64/81284
Obtained from: NetBSD
Approved by: re (dwhite), mlaier (mentor)
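
An alignment test of this kind usually inspects the low bits of the header pointer. The sketch below is written as a plain function rather than the IP_HDR_ALIGNED_P() macro named above; the helper name and the 4-byte boundary are assumptions, and the compile-time switch on __NO_STRICT_ALIGNMENT is left out so the check is always performed.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if 'p' sits on a 4-byte boundary, so that
 * 32-bit header fields can be read directly on strict-alignment machines.
 */
bool
hdr_aligned_p(const void *p)
{
	return (((uintptr_t)p & 3) == 0);
}
```

A caller that gets false would typically copy the header into aligned storage before handing the packet on.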


147735 01-Jul-2005 ps

Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by: Andrey Chernov.
Submitted by: Noritoshi Demizu, Raja Mukerji.
Approved by: re


147734 01-Jul-2005 ps

Fix for a SACK crash caused by a bug in tcp_reass(). tcp_reass()
does not clear tlen and frees the mbuf (leaving th pointing at
freed memory), if the data segment is a complete duplicate.
This change works around that bug. A fix for the tcp_reass() bug
will appear later (that bug is benign for now, as neither th nor
tlen is referenced in tcp_input() after the call to tcp_reass()).

Found by: Pawel Jakub Dawidek.
Submitted by: Raja Mukerji, Noritoshi Demizu.
Approved by: re


147718 01-Jul-2005 glebius

When doing ARP load balancing source IP is taken in network byte order,
so residue of division for all hosts on net is the same, and thus only
one VHID answers. Convert source IP to host byte order.

Reviewed by: mlaier
Approved by: re (scottl)
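
The effect of the byte order on the remainder can be sketched as follows. The helper name is hypothetical and the selection rule (source address modulo the number of virtual hosts) is a simplification of what the commit message describes:

```c
#include <stdint.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: choose one of 'count' virtual hosts from a source
 * address given in network byte order.  'count' must be non-zero.
 */
unsigned int
pick_vhost(uint32_t src_net, unsigned int count)
{
	/* Convert first so that the low-order bits are the host part. */
	return (ntohl(src_net) % count);
}
```

On a little-endian machine the unconverted value has the first octet of the address in its low-order byte, so neighbouring hosts such as 192.168.1.1 and 192.168.1.2 would share the same low-order bits.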


147666 29-Jun-2005 simon

Fix ipfw packet matching errors with address tables.

The ipfw tables lookup code caches the result of the last query. The
kernel may process multiple packets concurrently, performing several
concurrent table lookups. Due to insufficient locking, a cached
result can become corrupted, which could cause some addresses to be
incorrectly matched against a lookup table.

Submitted by: ru
Reviewed by: csjp, mlaier
Security: CAN-2005-2019
Security: FreeBSD-SA-05:13.ipfw

Correct bzip2 permission race condition vulnerability.

Obtained from: Steve Grubb via RedHat
Security: CAN-2005-0953
Security: FreeBSD-SA-05:14.bzip2
Approved by: obrien

Correct TCP connection stall denial of service vulnerability.

TCP packets with the SYN flag set are accepted for established
connections, allowing an attacker to overwrite certain TCP options.

Submitted by: Noritoshi Demizu
Reviewed by: andre, Mohan Srinivasan
Security: CAN-2005-2068
Security: FreeBSD-SA-05:15.tcp

Approved by: re (security blanket), cperciva


147637 27-Jun-2005 ps

- Postpone SACK option processing until after PAWS checks. SACK option
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.

Submitted by: Noritoshi Demizu
Reviewed by: Mohan Srinivasan, Raja Mukerji
Approved by: re


147636 27-Jun-2005 phk

Libalias incorrectly applies proxy rules to the global divert
socket: it should only look for existing translation entries,
not create new ones (no matter how it got the idea).

Approved by: re(scottl)


147623 27-Jun-2005 glebius

Disable checksum processing in LibAlias, when it works as a
kernel module. LibAlias is not aware of checksum offloading,
so the caller should provide checksum calculation. (The only
current consumer is ng_nat(4)). When TCP packet internals have
been changed and checksum recalculation is required, a cookie
is set in the th_x2 field of the TCP packet, to inform the caller
that it needs to recalculate the checksum. This ugly hack will be
removed when LibAlias is made more kernel friendly.

Incremental checksum updates are left as is, since they don't
conflict with offloading.

Approved by: re (scottl)


147611 26-Jun-2005 dwmalone

Fix some long standing bugs in writing to the BPF device attached to
a DLT_NULL interface. In particular:

1) Consistently use type u_int32_t for the header of a
DLT_NULL device - it continues to represent the address
family as always.
2) In the DLT_NULL case get bpf_movein to store the u_int32_t
in a sockaddr rather than in the mbuf, to be consistent
with all the DLT types.
3) Consequently fix a bug in bpf_movein/bpfwrite which
only permitted packets up to 4 bytes less than the MTU
to be written.
4) Fix all DLT_NULL devices to have the code required to
allow writing to their bpf devices.
5) Move the code to allow writing to if_lo from if_simloop
to looutput, because it only applies to DLT_NULL devices
but was being applied to other devices that use if_simloop
possibly incorrectly.

PR: 82157
Submitted by: Matthew Luckie <mjl@luckie.org.nz>
Approved by: re (scottl)


147605 25-Jun-2005 ups

Fix a timer ticks wrap around bug for minmssoverload processing.

Approved by: re (scottl,dwhite)
MFC after: 4 weeks
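
Wrap-around bugs of this kind typically come from comparing two tick values directly. A common wrap-safe idiom is to compare the sign of their difference; the sketch below illustrates that idiom only and is not taken from the commit. The helper name and the 32-bit counter width are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if tick value 'a' is at or after 'b', even
 * when the counter has wrapped in between.  Valid while the two values
 * are less than 2^31 ticks apart.
 */
bool
ticks_reached(uint32_t a, uint32_t b)
{
	/* Unsigned subtraction wraps; the sign of the difference decides. */
	return ((int32_t)(a - b) >= 0);
}
```

A direct a >= b comparison would give the wrong answer for a value taken just after the wrap and one taken just before it.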


147549 23-Jun-2005 imp

Add back missing copyright and license statement. This is identical
to the statement in ip_mroute.h, as well as being the same as what
OpenBSD has done with this file. It matches the copyright in NetBSD's
1.1 through 1.14 versions of the file as well, which they subsequently
added back.

It appears to have been lost in the 4.4-lite1 import for FreeBSD 2.0,
but where and why I've not investigated further. OpenBSD had the same
problem. NetBSD had a copyright notice until Multicast 3.5 was
integrated verbatim back in 1995. This appears to be the version that
made it into 4.4-lite1.

Approved by: re (scottl)
MFC after: 3 days


147535 23-Jun-2005 ps

Fix for a bug in tcp_sack_option() causing crashes.

Submitted by: Noritoshi Demizu, Mohan Srinivasan.
Approved by: re (scottl blanket SACK)


147503 20-Jun-2005 bz

Fix IP(v6) over IP tunneling most likely broken with ifnet changes.

Reviewed by: gnn
Approved by: re (dwhite), rwatson (mentor)


147501 20-Jun-2005 glebius

- Don't use legacy function in a non-legacy one. This makes it
possible to compile libalias without legacy support.
- Use correct way to mark variable as unused.

Approved by: re (dwhite)


147418 16-Jun-2005 mlaier

In verify_rev_path6():
- do not use static memory as we are under a shared lock only
- properly rtfree routes allocated with rtalloc
- rename to verify_path6()
- implement the full functionality of the IPv4 version

Also make O_ANTISPOOF work with IPv6.

Reviewed by: gnn
Approved by: re (blanket)


147415 16-Jun-2005 mlaier

Fix indentation in INET6 section in preparation of more serious work.

Approved by: re (blanket ip6fw removal)


147319 12-Jun-2005 mlaier

When doing matching based on dst_ip/src_ip make sure we are really looking
at an IPv4 packet as these variables are uninitialized if not. This used to
allow arbitrary IPv6 packets depending on the value in the uninitialized
variables.

Some opcodes (most notably O_REJECT) do not support IPv6 at all right now.

Reviewed by: brooks, glebius
Security: IPFW might pass IPv6 packets depending on stack contents.
Approved by: re (blanket)


147256 10-Jun-2005 brooks

Stop embedding struct ifnet at the top of driver softcs. Instead the
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.

This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.

Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.

Reviewed by: sobomax, sam


147247 10-Jun-2005 green

Modify send_pkt() to return the generated packet and have the caller
do the subsequent ip_output() in IPFW. In ipfw_tick(), the keep-alive
packets must be generated from the data that resides under the
stateful lock, but they must not be sent at that time, as this would
cause a lock order reversal with the normal ordering (interface's
lock, then locks belonging to the pfil hooks).

In practice, this caused deadlocks when using IPFW and if_bridge(4)
together to do stateful transparent filtering.

MFC after: 1 week


147205 10-Jun-2005 thompsa

Add dummynet(4) support to if_bridge, this code is largely based on bridge.c.

This is the final piece to match bridge.c in functionality, we can now be a
drop-in replacement.

Approved by: mlaier (mentor)


147180 09-Jun-2005 ps

Fix a mis-merge. Remove a redundant call to tcp_sackhole_insert

Submitted by: Mohan Srinivasan


147169 09-Jun-2005 ps

Fix for a crash in tcp_sack_option() caused by hitting the limit on
the number of sack holes.

Reported by: Andrey Chernov
Submitted by: Noritoshi Demizu
Reviewed by: Raja Mukerji


147061 06-Jun-2005 ps

Fix for a bug in the change that walks the scoreboard backwards from
the tail (in tcp_sack_option()). The bug was caused by incorrect
accounting of the retransmitted bytes in the sackhint.

Reported by: Kris Kennaway.
Submitted by: Noritoshi Demizu.


146986 05-Jun-2005 thompsa

Add hooks into the networking layer to support if_bridge. This changes struct
ifnet so a buildworld is necessary.

Approved by: mlaier (mentor)
Obtained from: NetBSD


146962 04-Jun-2005 green

Better explain, then actually implement the IPFW ALTQ-rule first-match
policy. It may be used to provide more detailed classification of
traffic without actually having to decide its fate at the time of
classification.

MFC after: 1 week


146953 04-Jun-2005 ps

Changes to tcp_sack_option() that
- Walk the scoreboard backwards from the tail to reduce the number of
comparisons for each sack option received.
- Introduce functions to add/remove sack scoreboard elements, making
the code more readable.

Submitted by: Noritoshi Demizu
Reviewed by: Raja Mukerji, Mohan Srinivasan


146894 03-Jun-2005 mlaier

Add support for IPv4 only rules to IPFW2 now that it supports IPv6 as well.
This is the last requirement before we can retire ip6fw.

Reviewed by: dwhite, brooks(earlier version)
Submitted by: dwhite (manpage)
Silence from: -ipfw


146883 02-Jun-2005 iedowse

Use IFF_LOCKGIANT/IFF_UNLOCKGIANT around calls to the interface
if_ioctl routine. This should fix a number of code paths through
soo_ioctl() that could call into Giant-locked network drivers without
first acquiring Giant.


146866 01-Jun-2005 rwatson

When aborting tcp_attach() due to a problem allocating or attaching the
tcpcb, lock the inpcb before calling in_pcbdetach() or in6_pcbdetach(),
as they expect the inpcb to be passed locked.

MFC after: 7 days


146865 01-Jun-2005 rwatson

Assert tcbinfo lock, inpcb lock in tcp_disconnect().
Assert tcbinfo lock, inpcb lock in tcp_usrclosed().

MFC after: 7 days


146864 01-Jun-2005 rwatson

Assert tcbinfo lock in tcp_drop() due to its call of tcp_close()
Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach()
Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop()

MFC after: 7 days


146863 01-Jun-2005 rwatson

Assert that tcbinfo is locked in tcp_input() before calling into
tcp_drop().

MFC after: 7 days


146862 01-Jun-2005 rwatson

Assert the tcbinfo lock whenever tcp_close() is to be called by
tcp_input().

MFC after: 7 days


146861 01-Jun-2005 rwatson

Assert tcbinfo lock in tcp_attach(), as it is required; the caller
(tcp_usr_attach()) currently grabs it.

MFC after: 7 days


146860 01-Jun-2005 rwatson

Commit correct version of previous commit (in_pcb.c:1.164). Use the
local variables as currently named.

MFC after: 7 days


146859 01-Jun-2005 rwatson

Assert pcbinfo lock in in_pcbdisconnect() and in_pcbdetach(), as the
global pcb lists are modified.

MFC after: 7 days


146858 01-Jun-2005 rwatson

Slight white space tweak.

MFC after: 7 days


146854 01-Jun-2005 rwatson

De-spl UDP.

MFC after: 3 days


146704 28-May-2005 tanimura

Let OSPFv3 go through ipfw. Some additional checks would be
desirable, though.


146630 25-May-2005 ps

This conforms with the terminology in

M.Mathis and J.Mahdavi,
"Forward Acknowledgement: Refining TCP Congestion Control"
SIGCOMM'96, August 1996.

Submitted by: Noritoshi Demizu, Raja Mukerji


146552 23-May-2005 ps

Rewrite of tcp_sack_option(). Kentaro Kurahone (NetBSD) pointed out
that if we sort the incoming SACK blocks, we can update the scoreboard
in one pass of the scoreboard. The added overhead of sorting up to 4
sack blocks is much lower than traversing (potentially) large
scoreboards multiple times. The code was updating the scoreboard with
multiple passes over it (once for each sack option). The rewrite fixes
that, reducing the complexity of the main loop from O(n^2) to O(n).

Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.


146463 21-May-2005 ps

Replace t_force with a t_flag (TF_FORCEDATA).

Submitted by: Raja Mukerji.
Reviewed by: Mohan, Silby, Andre Oppermann.


146304 16-May-2005 ps

Introduce routines to alloc/free sack holes. This cleans up the code
considerably.

Submitted by: Noritoshi Demizu.
Reviewed by: Raja Mukerji, Mohan Srinivasan.


146226 15-May-2005 glebius

- When carp interface is destroyed, and it affects global preemption
suppression counter, decrease the latter. [1]
- Add sysctl to monitor preemption suppression.

PR: kern/80972 [1]
Submitted by: Frank Volf [1]
MFC after: 1 week


146193 13-May-2005 ps

Fix for a bug where the "nexthole" sack hint is out of sync with the
real next hole to retransmit from the scoreboard, caused by a bug
which did not update the "nexthole" hint in one case in
tcp_sack_option().

Reported by: Daniel Eriksson
Submitted by: Mohan Srinivasan


146182 13-May-2005 glebius

In div_output() explicitly set m->m_nextpkt to NULL. If divert socket
is not userland, but ng_ksocket, then m->m_nextpkt may be non-NULL. In
this case we would panic in sbappend.


146123 11-May-2005 ps

When looking for the next hole to retransmit from the scoreboard,
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.

This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.

The debug code that sanity checks the hints against the computed
value will be removed eventually.

Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.


145978 07-May-2005 cperciva

Fix two issues which were missed in FreeBSD-SA-05:08.kmem.

Reported by: Uwe Doering


145963 06-May-2005 glebius

Add a workaround for 64-bit archs: store unsigned long return value in
temporary variable, check it and then cast to in_addr_t.
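
The pattern described above (keep the full-width result, range-check it, then narrow) can be sketched in userland as follows. The helper name is hypothetical, uint32_t stands in for in_addr_t, and strtoul() is assumed to be the source of the unsigned long value:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal number into a 32-bit value.
 * Returns 0 on success, -1 if the text is not a number or does not fit.
 */
int
parse_u32(const char *s, uint32_t *out)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(s, &end, 10);	/* keep the full-width result */
	if (errno != 0 || end == s || *end != '\0')
		return (-1);
	if (val > UINT32_MAX)		/* only possible where long is 64 bits */
		return (-1);
	*out = (uint32_t)val;		/* the cast can no longer truncate */
	return (0);
}
```

Casting before the check would silently discard the upper 32 bits on 64-bit architectures, so an out-of-range input could pass as a valid address.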


145961 06-May-2005 glebius

s/DEBUG/LIBALIAS_DEBUG/, since DEBUG is defined in LINT and
not supported for kernel build.


145953 06-May-2005 cperciva

If we are going to
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.

Security: FreeBSD-SA-05:08.kmem
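
The three steps listed above can be sketched in userland as follows. The helper name is hypothetical, snprintf() stands in for whatever bounded copy the kernel code uses, and the copyout step is omitted because it has no userland equivalent:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: fill a fixed-length buffer with a string so that
 * no stale bytes remain after the terminator.
 */
void
fill_name(char *buf, size_t buflen, const char *name)
{
	memset(buf, 0, buflen);			/* step 0: zero everything */
	snprintf(buf, buflen, "%s", name);	/* step 1: bounded copy */
	/* step 2 (not shown): hand all buflen bytes to the consumer */
}
```

Without the memset(), the bytes between the terminator and the end of the buffer would still hold whatever was there before, and copying the whole buffer out would disclose them.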


145933 05-May-2005 glebius

More bits for kernel version:
- copy inet_aton() from libc
- disable getservbyname() lookup and accept only numeric port


145932 05-May-2005 glebius

Always include alias.h before alias_local.h


145931 05-May-2005 glebius

When used in kernel define NO_FW_PUNCH, NO_LOGGING, NO_USE_SOCKETS.


145930 05-May-2005 glebius

Fix argument order for bcopy() in last commit.

Noticed by: njl
Pointy hat to: glebius


145929 05-May-2005 glebius

Use bcopy() instead of memmove().


145928 05-May-2005 glebius

Hide fflush(3) under ifdef DEBUG.


145927 05-May-2005 glebius

Things required to build libalias as kernel module:
- kernel module declarations and handler.
- macros to map malloc(3) calls to malloc(9) ones.
- malloc(9) declarations.
- call finishoff() from module handler MOD_UNLOAD case
instead of atexit(3).
- use panic(9) instead of abort(3)
- take time from time_second instead of gettimeofday(2)
- define INADDR_NONE


145926 05-May-2005 glebius

Add NO_USE_SOCKETS knob, which cuts off the socket binding functionality.


145925 05-May-2005 glebius

Add NO_LOGGING knob, which cuts off functionality of debug logging to a file.


145921 05-May-2005 glebius

Play with includes so that libalias can be compiled both as userland
library and kernel module.


145869 04-May-2005 andre

If we don't get a suggested MTU during path MTU discovery
look up the packet size of the packet that generated the
response, step down the MTU by one step through ip_next_mtu()
and try again.

Suggested by: dwmalone
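
A rough sketch of stepping down through a plateau table. The function name next_mtu_down() is hypothetical and the table is an assumption: the plateau values listed in RFC 1191 plus the 1280 step mentioned elsewhere in this log. It is not a copy of ip_next_mtu().

```c
#include <stddef.h>

/*
 * Return the largest plateau strictly below mtu, or 0 if there is none.
 * The table must stay sorted in descending order.
 */
static int
next_mtu_down(int mtu)
{
	static const int plateaus[] = {
		65535, 32000, 17914, 8166, 4352, 2002,
		1492, 1280, 1006, 508, 296, 68
	};
	size_t i;

	for (i = 0; i < sizeof(plateaus) / sizeof(plateaus[0]); i++)
		if (plateaus[i] < mtu)
			return (plateaus[i]);
	return (0);
}
```

A sender that had been using 1500 would retry with 1492, then 1280, and so on, until a packet gets through.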


145868 04-May-2005 glebius

Cleanup IPFW2 ifdefs.


145867 04-May-2005 glebius

Makefile is not needed here.


145866 04-May-2005 andre

Add another step of 1280 (gif(4) tunnels) to ip_next_mtu().


145864 04-May-2005 glebius

IPFW version 2 is the only option in HEAD and RELENG_5.
Thus, clean up the now-unnecessary ifdefs.


145863 04-May-2005 andre

Pass icmp_error() the MTU argument directly instead of
an interface pointer. This simplifies a couple of uses
and removes some XXX workarounds.


145773 01-May-2005 rwatson

Remove now unused inirw variable from previous use of COMMON_END().

Reported by: csjp


145771 01-May-2005 grehan

Fix typo in last commit.

Approved by: rwatson


145766 01-May-2005 rwatson

Slide unlocking of the tcbinfo lock earlier in tcp_usr_send(), as it's
needed only for implicit connect cases. Under load, especially on SMP,
this can greatly reduce contention on the tcbinfo lock.

NB: Ambiguities about the state of so_pcb need to be resolved so that
all use of the tcbinfo lock in non-implicit connection cases can be
eliminated.

Submitted by: Kazuaki Oda <kaakun at highway dot ne dot jp>


145565 26-Apr-2005 brooks

Introduce a struct icmphdr which contains the type, code, and cksum
fields of an ICMP packet.

Use this to allow ipfw to pullup only these values since it does not use
the rest of the packet and it was failing on ICMP packets because they
were not long enough.

struct icmp should probably be modified to use these at some point, but
that will break a fair bit of code so it can wait for another day.

On the off chance that adding this struct breaks something in ports,
bump __FreeBSD_version.

Reported by: Randy Bush <randy at psg dot com>
Tested by: Randy Bush <randy at psg dot com>


145373 21-Apr-2005 ps

Remove some code that snuck in by accident.

Submitted by: Mohan Srinivasan


145372 21-Apr-2005 ps

Fix for interaction problems between TCP SACK and TCP Signature.
If TCP Signatures are enabled, the maximum allowed sack blocks aren't
going to fit. The fix is to compute how many sack blocks fit and tack
these on last. Also on SYNs, defer padding until after the SACK
PERMITTED option has been added.

Found by: Mohan Srinivasan.
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.


145371 21-Apr-2005 ps

Undo rev 1.71 as it is the wrong change.


145370 21-Apr-2005 ps

- Make the sack scoreboard logic use the TAILQ macros. This improves
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.

Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.


145369 21-Apr-2005 ps

Fix for 2 bugs related to TCP Signatures :
- If the peer sends the Signature option in the SYN, use of Timestamps
and Window Scaling were disabled (even if the peer supports them).
- The sender must not disable signatures if the option is absent in
the received SYN. (See comment in syncache_add()).

Found, Submitted by: Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>.
Reviewed by: Mohan Srinivasan <mohans at yahoo-inc dot com>.


145360 21-Apr-2005 andre

Move Path MTU discovery ICMP processing from icmp_input() to
tcp_ctlinput() and subject it to active tcpcb and sequence
number checking. Previously any ICMP unreachable/needfrag
message would cause an update to the TCP hostcache. Now only
ICMP PMTU messages belonging to an active TCP session with
the correct src/dst/port and sequence number will update the
hostcache and complete the path MTU discovery process.

Note that we don't entirely implement the recommended counter
measures of Section 7.2 of the paper. However we close down
the possible degradation vector from trivially easy to really
complex and resource intensive. In addition we have limited
the smallest acceptable MTU with net.inet.tcp.minmss sysctl
for some time already, further reducing the effect of any
degradation due to an attack.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.2
MFC after: 3 days


145355 21-Apr-2005 andre

Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, deprecated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days


145321 20-Apr-2005 glebius

Remove anti-LOR bandaid, it is not needed now.

Sponsored by: Rambler


145268 19-Apr-2005 phk

Make DUMMYNET compile without INET6


145267 19-Apr-2005 phk

typo


145266 19-Apr-2005 phk

Make IPFIREWALL compile without INET6


145246 18-Apr-2005 brooks

Add IPv6 support to IPFW and Dummynet.

Submitted by: Mariano Tortoriello and Raffaele De Lorenzo (via luigi)


145244 18-Apr-2005 ps

Rewrite of tcp_update_sack_list() to make it simpler and more readable
than our original OpenBSD derived version.

Submitted by: Noritoshi Demizu
Reviewed by: Mohan Srinivasan, Raja Mukerji


145093 15-Apr-2005 brooks

Centralized finding the protocol header in IP packets in preparation for
IPv6 support. The header in IPv6 is more complex than in IPv4 so we
want to handle skipping over it in one location.

Submitted by: Mariano Tortoriello and Raffaele De Lorenzo (via luigi)


145087 14-Apr-2005 ps

Fix for a TCP SACK bug where more than (win/2) bytes could have been
in flight in SACK recovery.

Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>


144858 10-Apr-2005 ps

- Tighten up the Timestamp checks to prevent a spoofed segment from
setting ts_recent to an arbitrary value, stopping further
communication between the two hosts.
- If the Echoed Timestamp is greater than the current time,
fall back to the non RFC 1323 RTT calculation.

Submitted by: Raja Mukerji (raja at moselle dot com)
Reviewed by: Noritoshi Demizu, Mohan Srinivasan


144857 10-Apr-2005 ps

- If the reassembly queue limit was reached or if we couldn't allocate
a reassembly queue state structure, don't update (receiver) sack
report.
- Similarly, if tcp_drain() is called, freeing up all items on the
reassembly queue, clean the sack report.

Found, Submitted by: Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).


144856 10-Apr-2005 ps

When the rightmost SACK block expands, rcv_lastsack should be updated.
(Fix for kern/78226).

Submitted by : Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by : Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).


144855 10-Apr-2005 ps

Remove some unused sack fields.

Submitted by : Noritoshi Demizu, Mohan Srinivasan.


144792 08-Apr-2005 maxim

o Nano optimize ip_reass() code path for the first fragment: do not
try to reassemble the packet from the fragments queue with the only
fragment, finish with the first fragment as soon as we create a queue.

Spotted by: Vijay Singh

o Drop the fragment if maxfragsperpacket == 0, no chances we
will be able to reassemble the packet in future.

Reviewed by: silby


144786 08-Apr-2005 maxim

o Tweak the comment a bit.


144785 08-Apr-2005 maxim

o Disable random port allocation when ip.portrange.first ==
ip.portrange.last and there is the only port for that because:
a) it is not wise; b) it leads to a panic in the random ip port
allocation code. In general we need to disable ip port allocation
randomization if the last - first delta is ridiculously small.

PR: kern/79342
Spotted by: Anjali Kulkarni
Glanced at by: silby
MFC after: 2 weeks
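
A sketch of the guard, assuming the panic comes from reducing a random value modulo a zero-width range (the log does not state the exact cause). The name pick_port() and the rnd parameter are hypothetical:

```c
#include <stdint.h>

/* Pick a port in [first, last]; rnd is any caller-supplied random value. */
static uint16_t
pick_port(uint16_t first, uint16_t last, uint32_t rnd)
{
	uint32_t span;

	if (first > last) {	/* normalize a reversed range */
		uint16_t tmp = first;

		first = last;
		last = tmp;
	}
	span = (uint32_t)last - first;
	if (span == 0)		/* single port: nothing to randomize */
		return (first);
	return ((uint16_t)(first + rnd % (span + 1)));
}
```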


144712 06-Apr-2005 glebius

When a packet has been reinjected into ipfw(4) after dummynet(4) processing
we have a non-NULL args.rule. If the same packet later is subject to "tee"
rule, its original is sent again into ipfw_chk() and it reenters at the same
rule. This leads to infinite loop and frozen router.

Assign args.rule to NULL, any time we are going to send packet back to
ipfw_chk() after a tee rule. This is a temporary workaround, which we
will leave for RELENG_5. In HEAD we are going to make divert(4) save
next rule the same way as dummynet(4) does.

PR: kern/79546
Submitted by: Oleg Bulyzhin
Reviewed by: maxim, andre
MFC after: 3 days


144693 06-Apr-2005 brooks

Use ACTION_PTR(r) instead of (r->cmd + r->act_ofs).

Reviewed by: md5


144691 05-Apr-2005 brooks

Make dummynet_flush() match its prototype.


144666 05-Apr-2005 phk

natd core dumps when -reverse switch is used because of a bug in
libalias.

In /usr/src/lib/libalias/alias.c, the functions LibAliasIn and
LibAliasOutTry call the legacy PacketAliasIn/PacketAliasOut instead
of LibAliasIn/LibAliasOut when the PKT_ALIAS_REVERSE option is set.
In this case, the context variable "la" gets lost because the legacy
compatibility routines expect "la" to be global. This was obviously
an oversight when rewriting the PacketAlias* functions to the
LibAlias* functions.

The fix (as shown in the patch below) is to remove the legacy
subroutine calls and replace with the new ones using the "la" struct
as the first arg.

Submitted by: Gil Kloepfer <fgil@kloepfer.org>
Confirmed by: <nicolai@catpipe.net>
PR: 76839
MFC after: 3 days


144329 30-Mar-2005 glebius

When several carp interfaces are attached to Ethernet interface,
carp_carpdev_state_locked() is called every time carp interface is attached.
The first call backs up flags of the first interface, and the second
call backs up them again, erasing correct values.
To solve this, a carp_sc_state_locked() function is introduced. It is
called when interface is attached to parent, instead of calling
carp_carpdev_state_locked. carp_carpdev_state_locked() calls
carp_sc_state_locked() for each sc in chain.

Reported by: Yuriy N. Shkandybin, sem


144301 29-Mar-2005 glebius

- Don't free mbuf, passed to interface output method if the latter
returns error. In this case mbuf has already been freed. [1]
- Remove redundant declaration.

PR: kern/78893 [1]
Submitted by: Liang Yi [1]
Reviewed by: sam
MFC after: 1 day


144260 29-Mar-2005 sam

eliminate extraneous null ptr checks

Noticed by: Coverity Prevent analysis tool


144163 26-Mar-2005 sam

deal with malloc failures

Noticed by: Coverity Prevent analysis tool
Together with: mdodd


144016 23-Mar-2005 maxim

o Document net.inet.ip.portrange.random* sysctls.
o Correct a comment about random port allocation threshold
implementation.

Reviewed by: silby, ru
MFC after: 3 days


143881 20-Mar-2005 glebius

ifma_protospec is a pointer. Use NULL when assigning or comparing it.


143868 20-Mar-2005 glebius

Remove a workaround from previous revision. It proved to be incorrect.
Add two other workarounds for carp(4) interfaces:
- do not add connected route when address is assigned to carp(4) interface
- do not add connected route when other interface goes down

Embrace workarounds with #ifdef DEV_CARP


143806 18-Mar-2005 glebius

If vhid exists return more informative EEXIST instead of EINVAL. While here
remove redundant brackets.


143804 18-Mar-2005 glebius

Fix a potential crash that could occur when CARP_LOG is being used.

Obtained from: OpenBSD (pat)


143676 16-Mar-2005 sam

plug resource leak

Noticed by: Coverity Prevent analysis tool


143610 14-Mar-2005 rwatson

In tcp_usr_send(), broaden coverage of the socket buffer lock in the
non-OOB case so that the sbspace() check is performed under the same
lock instance as the append to the send socket buffer.

MFC after: 1 week


143491 13-Mar-2005 glebius

Embrace with #ifdef DEV_CARP carp-related code.


143374 10-Mar-2005 glebius

Add antifootshooting workaround, which will make all routes "connected"
to carp(4) interfaces host routes. This prevents a problem, when connected
network is routed to carp(4) interface.


143339 09-Mar-2005 ps

Add limits on the number of elements in the sack scoreboard both
per-connection and globally. This eliminates potential DoS attacks
where SACK scoreboard elements tie up too much memory.

Submitted by: Raja Mukerji (raja at moselle dot com).
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).


143314 09-Mar-2005 glebius

Make ARP not complain about wrong interface if correct interface
is a carp one and address matched it.

Reviewed by: brooks


143083 03-Mar-2005 marcus

Fix a problem in the Skinny ALG where a specially crafted packet could cause
a libalias application (e.g. natd, ppp, etc.) to crash. Note: Skinny support
is not enabled in natd or ppp by default.

Approved by: secteam (nectar)
MFC after: 1 day
Security: This fixes a remote DoS exploit


142996 02-Mar-2005 glebius

Fix typo. Unbreak build. Take pointy hat.


142914 01-Mar-2005 glebius

Add more locking when reading/writing to carp softc. When carp softc is
attached to a parent interface we use its mutex to lock the softc. This
means that in several places like carp_ioctl() we lock softc conditionally.
This should be redesigned.

To avoid LORs when MII announces us a link state change, we schedule
a quick callout and call carp_carpdev_state_locked() from it.

Initialize callouts using NET_CALLOUT_MPSAFE.

Sponsored by: Rambler
Reviewed by: mlaier


142911 01-Mar-2005 glebius

- Add carp_mtx. Use it to protect list of all carp interfaces.
- In carp_send_ad_all() walk through list of all carp interfaces
instead of walking through list of all interfaces.

Sponsored by: Rambler
Reviewed by: mlaier


142906 01-Mar-2005 glebius

Use NET_CALLOUT_MPSAFE macro.


142901 01-Mar-2005 glebius

Revert change to struct ifnet. Use ifnet pointer in softc. Embedding
ifnet into smth will soon be removed.

Requested by: brooks


142897 01-Mar-2005 glebius

Remove debugging printf.

Reviewed by: mlaier


142798 28-Feb-2005 yar

Support running carp(4) over a vlan(4) parent interface.

Encouraged by: glebius


142785 28-Feb-2005 glebius

Remove unused field from carp softc.

OK'ed by: mcbride@OpenBSD


142784 28-Feb-2005 glebius

Fix tcpdump(8) on carp(4) interface:
- Use our loop DLT type, not OpenBSD. [1]
- The fields that are converted to network byte order are not 32-bit
fields but 16-bit fields, so htons should be used instead of htonl. [1]
- Secondly, ip_input changes ip->ip_len into its value without
the ip-header length. So, restore the length to make bpf happy. [1]
- Use bpf_mtap2(), use temporary af1, since bpf_mtap2 doesn't
understand uint8_t af identifier.

Submitted by: Frank Volf [1]


142688 27-Feb-2005 ps

If the receiver sends an ack that is out of [snd_una, snd_max],
ignore the sack options in that segment. Else we'd end up
corrupting the scoreboard.

Found by: Raja Mukerji (raja at moselle dot com)
Submitted by: Mohan Srinivasan
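
The range test can be sketched with wraparound-safe sequence comparisons. The macros follow the usual serial-number arithmetic idiom; the function name ack_in_window() is hypothetical:

```c
#include <stdint.h>

/* Wraparound-safe comparisons of 32-bit sequence numbers. */
#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/* Returns 1 if ack lies in [snd_una, snd_max], else 0. */
static int
ack_in_window(uint32_t ack, uint32_t snd_una, uint32_t snd_max)
{
	return (!SEQ_LT(ack, snd_una) && !SEQ_GT(ack, snd_max));
}
```

A caller would process the SACK options of a segment only when ack_in_window() returns 1.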


142641 27-Feb-2005 mlaier

Unbreak the build. carp_iamatch6 and carp_macmatch6 are not supposed to be
static as they are used elsewhere.


142564 26-Feb-2005 glebius

Remove carp_softc.sc_ifp member in favor of union pointers in struct ifnet.

Obtained from: OpenBSD


142559 26-Feb-2005 glebius

Staticize local functions.


142452 25-Feb-2005 glebius

New lines when logging.


142451 25-Feb-2005 glebius

Embrace macros with do {} while (0)

Submitted by: maxim
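
For illustration, a multi-statement macro wrapped this way expands to exactly one statement, so it stays correct in an unbraced if/else. DEMO_LOG and log_calls are hypothetical names, not taken from the CARP code:

```c
#include <stdio.h>

static int log_calls;	/* counter so the effect can be observed */

/* Two statements that behave as a single one at the call site. */
#define DEMO_LOG(msg) do {				\
	log_calls++;					\
	fprintf(stderr, "demo: %s\n", (msg));		\
} while (0)

static int
demo(int verbose)
{
	if (verbose)
		DEMO_LOG("verbose path");	/* safe without braces */
	else
		DEMO_LOG("quiet path");
	return (log_calls);
}
```

With a bare { ... } block instead, the semicolon after the macro call would end the if statement and leave the else without a matching if.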


142447 25-Feb-2005 glebius

Call carp_carpdev_state() from carp_set_addr6(). See log for rev 1.4.

Sponsored by: Rambler


142446 25-Feb-2005 glebius

Improve logging:
- Simplify CARP_LOG() and make it work (we don't have addlog in FreeBSD).
- Introduce CARP_DEBUG() which logs with LOG_DEBUG severity when
net.inet.carp.log > 1
- Use CARP_DEBUG to log state changes of carp interfaces.

After CARP_LOG() cleanup it appeared that carp_input_c() does not need sc
argument. Remove it.

Sponsored by: Rambler


142371 24-Feb-2005 glebius

Fix problem when master comes up with one interface down, and preempts
mastering on all other interfaces:

- call carp_carpdev_state() on initialize instead of just setting to INIT
- in carp_carpdev_state() check that interface is UP, instead of checking
that it is not DOWN, because a rebooted machine may have interface in
UNKNOWN state.

Sponsored by: Rambler
Obtained from: OpenBSD (partially)


142268 23-Feb-2005 sam

fix potential invalid index into ip_protox array

Noticed by: Coverity Prevent analysis tool


142266 23-Feb-2005 mux

Unbreak CARP build on 64-bit architectures.

Tested on: sparc64


142248 22-Feb-2005 andre

Bring back the full packet destination manipulation for 'ipfw fwd'
with the kernel compile time option:

options IPFIREWALL_FORWARD_EXTENDED

This option has to be specified in addition to IPFIREWALL_FORWARD.

With this option even packets targeted for an IP address local
to the host can be redirected. All restrictions to ensure proper
behaviour for locally generated packets are turned off. Firewall
rules have to be carefully crafted to make sure that things like
PMTU discovery do not break.

Document the two kernel options.

PR: kern/71910
PR: kern/73129
MFC after: 1 week


142243 22-Feb-2005 glebius

Remove promisc counter from parent interface in carp_clone_destroy(),
so that parent interface is not left in promiscuous mode after carp
interface is destroyed.

This is not perfect, since promisc counter is added when carp
interface is assigned an IP address. However, when address is removed
parent interface is still in promiscuous mode. Only removal of
carp interface removes promisc from parent. Same way in OpenBSD.

Sponsored by: Rambler


142215 22-Feb-2005 glebius

Add CARP (Common Address Redundancy Protocol), which allows multiple
hosts to share an IP address, providing high availability and load
balancing.

Original work on CARP done by Michael Shalayeff, with many
additions by Marco Pfatschbacher and Ryan McBride.

FreeBSD port done solely by Max Laier.

Patch by: mlaier
Obtained from: OpenBSD (mickey, mcbride)


142212 22-Feb-2005 glebius

We can make code simpler after last change.

Noticed by: Andrew Thompson


142207 22-Feb-2005 glebius

In in_pcbconnect_setup() jailed sockets are treated specially: if local
address is not supplied, then jail IP is chosen and in_pcbbind() is called.
Since udp_output() does not save local addr after call to in_pcbconnect_setup(),
in_pcbbind() is called for each packet, and this is incorrect.

So, we shall treat jailed sockets specially in udp_output(), we will save
their local address.

This fixes a long standing bug with broken sendto() system call in jails.

PR: kern/26506
Reviewed by: rwatson
MFC after: 2 weeks


142206 22-Feb-2005 glebius

In in_pcbconnect_setup() remove a check that route points at
loopback interface. Nobody has explained to me the sense of this check.
It breaks connect() system call to a destination address which is
loopback routed (e.g. blackholed).

Reviewed by: silence on net@
MFC after: 2 weeks


142190 21-Feb-2005 rwatson

In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely required
elsewhere in the socket/protocol code.

Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn


142031 17-Feb-2005 ps

Remove 2 (SACK) fields from the tcpcb. These are only used by a
function that is called from tcp_input(), so they oughta be passed on
the stack instead of stuck in the tcpcb.

Submitted by: Mohan Srinivasan


141961 16-Feb-2005 ps

Fix for a SACK (receiver) bug where incorrect SACK blocks are
reported to the sender - in the case where the sender sends data
outside the window (as WinXP does :().

Reported by: Sam Jensen <sam at wand dot net dot nz>
Submitted by: Mohan Srinivasan


141928 14-Feb-2005 ps

- Retransmit just one segment on initiation of SACK recovery.
Remove the SACK "initburst" sysctl.
- Fix bugs in SACK dupack and partialack handling that can cause
large bursts while in SACK recovery.

Submitted by: Mohan Srinivasan


141886 14-Feb-2005 maxim

o Add handling of an IPv4-mapped IPv6 address.
o Use SYSCTL_IN() macro instead of direct call of copyin(9).

Submitted by: ume

o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no need for
a separate struct tcp_ident_mapping.

Suggested by: ume


141383 06-Feb-2005 glebius

Jump to common action checks after doing specific ones. This fixes adding
of divert rules, which I broke in previous commit.

Pointy hat to: glebius


141381 06-Feb-2005 maxim

o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8)
utility:

The tcpdrop command drops the TCP connection specified by the
local address laddr, port lport and the foreign address faddr,
port fport.

Obtained from: OpenBSD
Reviewed by: rwatson (locking), ru (man page), -current
MFC after: 1 month


141351 05-Feb-2005 glebius

Add a ng_ipfw node, implementing a quick and simple interface between
ipfw(4) and netgraph(4) facilities.

Reviewed by: andre, brooks, julian


141282 04-Feb-2005 ume

teach scope of IPv6 address to net.inet6.tcp6.getcred.

MFC after: 1 week


141078 31-Jan-2005 rwatson

Update an additional reference to the rate of ISN tick callouts that was
missed in tcp_subr.c:1.216: projected_offset must also reflect how often
the tcp_isn_tick() callout will fire.

MFC after: 2 weeks
Submitted by: silby


141076 31-Jan-2005 csjp

Change the state allocator from using regular malloc to using
a UMA zone instead. This should eliminate a bit of the locking
overhead associated with malloc and reduce the memory
consumption associated with each new state.

Reviewed by: rwatson, andre
Silence on: ipfw@
MFC after: 1 week


141072 30-Jan-2005 rwatson

Have tcp_isn_tick() fire 100 times a second, rather than HZ times a
second; since the default hz has changed to 1000 times a second,
this resulted in unnecessary work being performed.

MFC after: 2 weeks
Discussed with: phk, cperciva
General head nod: silby
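
As a small sketch of the arithmetic, a callout that should fire 100 times a second is rescheduled every hz/100 clock ticks. The names below are hypothetical, and the clamp to one tick for small hz values is an added assumption, not something the log states:

```c
#define ISN_TICKS_PER_SEC	100	/* hypothetical name for the target rate */

/* Clock ticks between ISN callouts for a given kernel hz. */
static int
isn_tick_interval(int hz)
{
	int ticks = hz / ISN_TICKS_PER_SEC;

	return (ticks > 0 ? ticks : 1);	/* never schedule a zero-tick callout */
}
```

With hz = 1000 this gives an interval of 10 ticks, so the callout does a tenth of the work it did when firing on every tick.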


141064 30-Jan-2005 rwatson

Prefer (NULL) spelling of (0) for pointers.

MFC after: 3 days


141063 30-Jan-2005 rwatson

Remove clause three from tcp_syncache.c license per permission of
McAfee. Update copyright to McAfee from NETA.


140675 23-Jan-2005 alc

Correctly move the packet header in ip_insertoptions().

Reported by: Anupam Chanda
Reviewed by: sam@
MFC after: 2 weeks


140505 20-Jan-2005 ru

Sort sections.


140345 16-Jan-2005 glebius

- Reduce number of arguments passed to dummynet_io(), we already have cookie
in struct ip_fw_args itself.
- Remove redundant &= 0xffff from dummynet_io().


140224 14-Jan-2005 glebius

o Clean up interface between ip_fw_chk() and its callers:

- ip_fw_chk() returns action as function return value. Field retval is
removed from args structure. Action is not flag any more. It is one
of integer constants.
- Any action-specific cookies are returned either in new "cookie" field
in args structure (dummynet, future netgraph glue), or in mbuf tag
attached to packet (divert, tee, some future action).

o Convert parsing of return value from ip_fw_chk() in ipfw_check_{in,out}()
to a switch structure, so that the functions are more readable, and future
actions can be added with fewer modifications.

Approved by: andre
MFC after: 2 months


140138 12-Jan-2005 ps

Fix a TCP SACK related crash resulting from incorrect computation
of len in tcp_output(), in the case where the FIN has already been
transmitted. The mis-computation of len is because of a gcc
optimization issue, which this change works around.

Submitted by: Mohan Srinivasan


139976 10-Jan-2005 brian

include "alias.h", not <alias.h>

MFC after: 3 days


139823 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


139606 03-Jan-2005 silby

Add a sysctl (net.inet.tcp.insecure_rst) which allows one to specify
that the RFC 793 specification for accepting RST packets should be
followed. When followed, this makes one vulnerable to the attacks
described in "slipping in the window", but it may be necessary in
some odd circumstances.


139558 02-Jan-2005 silby

Port randomization leads to extremely fast port reuse at high
connection rates, which is causing problems for some users.

To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.

Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds. These thresholds may be tuned via sysctl.

Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.

MFC After: 20 seconds
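
The threshold behaviour can be sketched as a once-per-second check. This is an illustration under stated assumptions, not the committed code: the struct and function names are hypothetical, and only the randomcps and randomtime parameters come from the log message.

```c
struct rand_state {
	int randomcps;	/* connections per second that trigger the cutoff */
	int randomtime;	/* seconds to keep randomization disabled */
	int count;	/* connections seen in the current second */
	int stop_left;	/* seconds of disabled randomization remaining */
};

/* Called for every new connection that needs a local port. */
static void
note_connection(struct rand_state *rs)
{
	rs->count++;
}

/* Called once per second: age the cutoff, then re-arm it on a burst. */
static void
second_tick(struct rand_state *rs)
{
	if (rs->stop_left > 0)
		rs->stop_left--;
	if (rs->count > rs->randomcps)
		rs->stop_left = rs->randomtime;
	rs->count = 0;
}

/* Returns 1 if a random port should be chosen, 0 for sequential. */
static int
use_random(const struct rand_state *rs)
{
	return (rs->stop_left == 0);
}
```

With the defaults quoted above (10 and 45), a single second with 11 connections switches allocation to sequential for the next 45 seconds.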


139310 25-Dec-2004 rwatson

Remove an errant blank line apparently introduced in
ip_output.c:1.194.


139298 25-Dec-2004 rwatson

In the dropafterack case of tcp_input(), it's OK to release the TCP
pcbinfo lock before calling tcp_output(), as holding just the inpcb
lock is sufficient to prevent garbage collection.


139297 25-Dec-2004 rwatson

Revert parts of tcp_input.c:1.255 associated with the header predicted
cases for tcp_input():

While it is true that the pcbinfo lock provides a pseudo-reference to
inpcbs, both the inpcb and pcbinfo locks are required to free an
un-referenced inpcb. As such, we can release the pcbinfo lock as
long as the inpcb remains locked with the confidence that it will not
be garbage-collected. This leads to a less conservative locking
strategy that should reduce contention on the TCP pcbinfo lock.

Discussed with: sam


139222 23-Dec-2004 rwatson

Attempt to consistently use () around return values in calls to
return() in newer code (sysctl, ISN, timewait).


139221 23-Dec-2004 rwatson

Remove an XXXRW comment relating to whether or not the TCP timers are
MPSAFE: they are now believed to be.

Correct a typo in a second comment.

MFC after: 2 weeks


139220 23-Dec-2004 rwatson

Remove the now unused tcp_canceltimers() function. tcpcb timers are
now stopped as part of tcp_discardcb().

MFC after: 2 weeks


139219 23-Dec-2004 rwatson

Remove an annotation of a minor race relating to the update of
multiple MIB entries using sysctl in short order, which might
result in unexpected values for tcp_maxidle being generated by
tcp_slowtimo. In practice, this will not happen, or at least,
doesn't require an explicit comment.

MFC after: 2 weeks


138653 10-Dec-2004 glebius

In certain cases ip_output() can free our route, so check
for its presence before RTFREE().

Noticed by: ru


138652 10-Dec-2004 glebius

Revert last change.

Andre:
First lets get major new features into the kernel in a clean and nice way,
and then start optimizing. In this case we don't have any obfuscation that
makes later profiling and/or optimizing difficult in any way.

Requested by: csjp, sam


138642 10-Dec-2004 csjp

This commit adds a shared locking mechanism very similar to the
mechanism used by pfil. This shared locking mechanism will remove
a nasty lock order reversal which occurs when ucred based rules
are used which results in hard locks while mpsafenet=1.

So this removes the debug.mpsafenet=0 requirement when using
ucred based rules with IPFW.

It should be noted that this locking mechanism does not guarantee
fairness between read and write locks, and that it will favor
firewall chain readers over writers. This seemed acceptable since
write operations to firewall chains protected by this lock tend to
be less frequent than reads.

Reviewed by: andre, rwatson
Tested by: myself, seanc
Silence on: ipfw@
MFC after: 1 month


138631 09-Dec-2004 glebius

Check that DUMMYNET_LOADED before seeking dummynet m_tag.

Reviewed by: andre
MFC after: 1 week


138615 09-Dec-2004 mlaier

More fixing of multiple addresses in the same prefix. This time do not try
to arp resolve "secondary" local addresses.

Found and submitted by: ru
With additions from: OpenBSD (rev. 1.47)
Reviewed by: ru


138499 06-Dec-2004 ru

Time out routes created by redirect.


138470 06-Dec-2004 glebius

- Make route caching optional, configurable via IFF_LINK0 flag.
- Turn it off by default.

Requested by: many
Reviewed by: andre
Approved by: julian (mentor)
MFC after: 3 days


138416 05-Dec-2004 rwatson

Assert the tcptw inpcb lock in tcp_timer_2msl_reset(), as fields in
the tcptw undergo non-atomic read-modify-writes.

MFC after: 2 weeks


138410 05-Dec-2004 rwatson

Assert inpcb lock in:

tcpip_fillheaders()
tcp_discardcb()
tcp_close()
tcp_notify()
tcp_new_isn()
tcp_xmit_bandwidth_limit()

Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).

MFC after: 2 weeks


138409 05-Dec-2004 rwatson

Minor grammar fix in comment.


138408 05-Dec-2004 rwatson

Pass the inpcb reference into ip_getmoptions() rather than just the
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.

Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().

MFC after: 2 weeks


138407 05-Dec-2004 rwatson

Define INP_UNLOCK_ASSERT() to assert that an inpcb is unlocked.

MFC after: 2 weeks


138404 05-Dec-2004 rwatson

Push the inpcb argument into ip_setmoptions() when setting IP multicast
socket options, so that it is available for locking.


138397 05-Dec-2004 rwatson

Start working through inpcb locking for ip_ctloutput() by cleaning up
modifications to the inpcb IP options mbuf:

- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
simultaneous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
pointer in the inpcb.
- Assert the inpcb lock in ip_pcbopts().
- Convert one or two uses of a pointer as a boolean or an integer
comparison to a comparison with NULL for readability.


138199 29-Nov-2004 ps

Fixes a bug in SACK causing us to send data beyond the receive window.

Found by: Pawel Worach and Daniel Hartmeier
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


138148 28-Nov-2004 rwatson

Assert the inpcb lock in tcp_xmit_timer() as it performs read-modify-
write of various time/rtt-related fields in the tcpcb.


138147 28-Nov-2004 rwatson

Expand coverage of the receive socket buffer lock when handling urgent
pointer updates: test available space while holding the socket buffer
mutex, and continue to hold until the pointer update has been
performed.

MFC after: 2 weeks


138136 27-Nov-2004 rwatson

Do export the advertised receive window via the tcpi_rcv_space field of
struct tcp_info.


138118 26-Nov-2004 rwatson

Implement parts of the TCP_INFO socket option as found in Linux 2.6.
This socket option allows processes to query a TCP socket for some low
level transmission details, such as the current send, bandwidth, and
congestion windows. Linux provides a 'struct tcpinfo' structure
containing various variables, rather than separate socket options;
this makes the API somewhat fragile as it makes it difficult to add
new entries of interest as requirements and implementation evolve.
As such, I've included a large pad at the end of the structure.
Right now, relatively few of the Linux API fields are filled in, and
some contain no logical equivalent on FreeBSD. I've included __'d
entries in the structure to make it easier to figure out what is and
isn't omitted. This API/ABI should be considered unstable for the
time being.


138098 25-Nov-2004 silby

Fix a problem where our TCP stack would ignore RST packets if the receive
window was 0 bytes in size. This may have been the cause of unsolved
"connection not closing" reports over the years.

Thanks to Michiel Boland for providing the fix and providing a concise
test program for the problem.

Submitted by: Michiel Boland
MFC after: 2 weeks


138040 23-Nov-2004 rwatson

In tcp_reass(), assert the inpcb lock on the passed tcpcb, since the
contents of the tcpcb are read and modified in volume.

In tcp_input(), replace th comparison with 0 with a comparison with
NULL.

At the 'findpcb', 'dropafterack', and 'dropwithreset' labels in
tcp_input(), assert 'headlocked'. Try to improve consistency between
various assertions regarding headlocked to be more informative.

MFC after: 2 weeks


138025 23-Nov-2004 rwatson

tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_timer_2msl_reset(), which require it. Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_reset()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after: 2 weeks


138024 23-Nov-2004 rwatson

De-spl tcp_slowtimo; tcp_maxidle assignment is subject to possible
but unlikely races that could be corrected by having tcp_keepcnt
and tcp_keepintvl modifications go through handler functions via
sysctl, but probably is not worth doing. Updates to multiple
sysctls within evaluation of a single addition are unlikely.

Annotate that tcp_canceltimers() is currently unused.

De-spl tcp_timer_delack().

De-spl tcp_timer_2msl().

MFC after: 2 weeks


138020 23-Nov-2004 rwatson

Assert the inpcb lock in tcp_twstart(), which not only does read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller. This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after: 2 weeks


138019 23-Nov-2004 rwatson

Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after: 2 weeks


138018 23-Nov-2004 rwatson

Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after: 2 weeks


137988 22-Nov-2004 rwatson

Remove "Unlocked read" annotations associated with previously unlocked
use of socket buffer fields in the TCP input code. These references
are now protected by use of the receive socket buffer lock.

MFC after: 1 week


137971 21-Nov-2004 rwatson

s/send/sent/ in comment describing TCPS_SYN_RECEIVED.


137860 18-Nov-2004 glebius

- Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag
from divert sockets.
- Remove div_disconnect() method, since it shouldn't be called now.
- Remove div_abort() method. It was never called directly, since protocol
doesn't have listen queue. It was called only from div_disconnect(),
which is removed now.

Reviewed by: rwatson, maxim
Approved by: julian (mentor)
MT5 after: 1 week
MT4 after: 1 month


137833 17-Nov-2004 mlaier

Fix host route addition for more than one address to a loopback interface
after allowing more than one address with the same prefix.

Reported by: Vladimir Grebenschikov <vova NO fbsd SPAM ru>
Submitted by: ru (also NetBSD rev. 1.83)
Pointyhat to: mlaier


137668 13-Nov-2004 mlaier

Merge copyright notices.

Requested by: njl


137630 12-Nov-2004 glebius

Fix ng_ksocket(4) operation as a divert socket, which is pretty useful
and has been broken twice:

- in the beginning of div_output() replace KASSERT with assignment, as
it was in rev. 1.83. [1] [to be MFCed]
- refactor changes introduced in rev. 1.100: do not prepend a new tag
unconditionally. Before doing this check whether we have one. [2]

A small note for all hacking in this area:
when divert socket is not a real userland, but ng_ksocket(4), we receive
_the same_ mbufs, that we transmitted to socket. These mbufs have rcvif,
the tags we've put on them. And we should treat them correctly.

Discussed with: mlaier [1]
Silence from: green [2]
Reviewed by: maxim
Approved by: julian (mentor)
MFC after: 1 week


137628 12-Nov-2004 mlaier

Change the way we automatically add prefix routes when adding a new address.
This makes it possible to have more than one address with the same prefix.
The first address added is used for the route. On deletion of an address
with IFA_ROUTE set, we try to find a "fallback" address and hand over the
route if possible.
I plan to MFC this in 4 weeks, hence I keep the - now obsolete - argument to
in_ifscrub as it must be considered KAPI as it is not static in in.c. I will
clean this after the MFC.

Discussed on: arch, net
Tested by: many testers of the CARP patches
Nits from: ru, Andrea Campi <andrea+freebsd_arch webcom it>
Obtained from: WIDE via OpenBSD
MFC after: 1 month


137584 11-Nov-2004 phk

Add missing '='

Spotted by: obrien


137450 09-Nov-2004 andre

Fix a double-free in the 'hlen > m->m_len' sanity check.

Bug report by: <james@towardex.com>
MFC after: 2 weeks


137396 08-Nov-2004 suz

support TCP-MD5(IPv4) in KAME-IPSEC, too.

MFC after: 3 weeks


137386 08-Nov-2004 phk

Initialize struct pr_userreqs in new/sparse style and fill in common
default elements in net_init_domain().

This makes it possible to grep these structures and see any bogosities.


137349 07-Nov-2004 rwatson

Do some re-sorting of TCP pcbinfo locking and assertions: make sure to
retain the pcbinfo lock until we're done using a pcb in the in-bound
path, as the pcbinfo lock acts as a pseudo-reference to prevent the pcb
from potentially being recycled. Clean up assertions and make sure to
assert that the pcbinfo is locked at the head of code subsections where
it is needed. Free the mbuf at the end of tcp_input after releasing
any held locks to reduce the time the locks are held.

MFC after: 3 weeks


137302 06-Nov-2004 andre

Fix a double-free in the 'm->m_len < sizeof (struct ip)' sanity check.

Bug report by: <james@towardex.com>
MFC after: 2 weeks


137183 04-Nov-2004 phk

Hide udp_in6 behind #ifdef INET6


137179 04-Nov-2004 bms

When performing IP fast forwarding, immediately drop traffic which is
destined for a blackhole route.

This also means that blackhole routes do not need to be bound to lo(4)
or disc(4) interfaces for the net.inet.ip.fastforwarding=1 case.

Submitted by: james at towardex dot com
Sponsored by: eXtensible Open Router Project <URL:http://www.xorp.org/>
MFC after: 3 weeks


137176 04-Nov-2004 rwatson

Until this change, the UDP input code used global variables udp_in,
udp_in6, and udp_ip6 to pass socket address state between udp_input(),
udp_append(), and soappendaddr_locked(). While fine in the default
configuration, when running with multiple netisrs or direct ithread
dispatch, this can result in races wherein user processes using
recvmsg() get back the wrong source IP/port. To correct this and
related races:

- Eliminate udp_ip6, which is believed to be generated but then never
used. Eliminate ip_2_ip6_hdr() as it is now unneeded.

- Eliminate setting, testing, and existence of 'init' status fields
for the IPv6 structures. While with multiple UDP delivery this
could lead to amortization of IPv4 -> IPv6 conversion when
delivering an IPv4 UDP packet to an IPv6 socket, it added
substantial complexity and side effects.

- Move global structures into the stack, declaring udp_in in
udp_input(), and udp_in6 in udp_append() to be used if a conversion
is required. Pass &udp_in into udp_append().

- Re-annotate comments to reflect updates.

With this change, UDP appears to operate correctly in the presence of
substantial inbound processing parallelism. This solution avoids
introducing additional synchronization, but does increase the
potential stack depth.

Discovered by: kris (Bug Magnet)
MFC after: 3 weeks


137139 02-Nov-2004 andre

Remove RFC1644 T/TCP support from the TCP side of the network stack.

A complete rationale and discussion is given in this message
and the resulting discussion:

http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplementation of the
previous T/TCP functionality.

Discussed on: -arch


137066 30-Oct-2004 rwatson

Correct a bug in TCP SACK that could result in wedging of the TCP stack
under high load: only set function state to loop and continue sending
if there is no data left to send.

RELENG_5_3 candidate.

Feet provided: Peter Losher <Peter underscore Losher at isc dot org>
Diagnosed by: Daniel Hartmeier <daniel at benzedrine dot cx>
Submitted by: mohan <mohans at yahoo-inc dot com>


136967 26-Oct-2004 rwatson

Add a matching tunable for net.inet.tcp.sack.enable sysctl.


136960 26-Oct-2004 bms

Check that rt_mask(rt) is non-NULL before dereferencing it, in the
RTM_ADD case, thus avoiding a panic.

Submitted by: Iasen Kostov


136953 25-Oct-2004 andre

IPDIVERT is a module now; tell the other parts of the kernel about it.
IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.


136910 24-Oct-2004 ru

For variables that are only checked with defined(), don't provide
any fake value.


136792 22-Oct-2004 andre

Shave 40 unused bytes from struct tcpcb.


136790 22-Oct-2004 andre

When printing the initialization string and IPDIVERT is not compiled into the
kernel refer to it as "loadable" instead of "disabled".


136788 22-Oct-2004 andre

Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload.

Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8)
man pages.


136717 19-Oct-2004 andre

Destroy the UMA zone on unload.


136716 19-Oct-2004 andre

Slightly extend the locking during unload to fully cover the protocol
deregistration. This does not entirely close the race but narrows the
even previously extremely small chance of a race some more.


136715 19-Oct-2004 rwatson

Annotate a newly introduced race present due to the unloading of
protocols: it is possible for sockets to be created and attached
to the divert protocol between the test for sockets present and
successful unload of the registration handler. We will need to
explore more mature APIs for unregistering the protocol and then
draining consumers, or an atomic test-and-unregister mechanism.


136714 19-Oct-2004 andre

Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability
of protocols. The call to divert_packet() is done through a function pointer. All
semantics of IPDIVERT remain intact. If IPDIVERT is not loaded ipfw will refuse to
install divert rules and natd will complain about 'protocol not supported'. Once
it is loaded both will work and accept rules and open the divert socket. The module
can only be unloaded if no divert sockets are open. It does not close any divert
sockets when an unload is requested but will return EBUSY instead.


136713 19-Oct-2004 andre

Properly declare the "net.inet" sysctl subtree.


136712 19-Oct-2004 andre

Pre-emptively define IPPROTO_SPACER to 32767, the same value as PROTO_SPACER
to document that this value is globally assigned for a special purpose and
may not be reused within the IPPROTO number space.


136695 19-Oct-2004 andre

Make use of the PROTO_SPACER functionality for dynamically loadable
protocols in inetsw[] and define initially eight spacer slots.

Remove conflicting declaration 'struct pr_usrreqs nousrreqs'. It is
now declared and initialized in kern/uipc_domain.c.


136694 19-Oct-2004 andre

Support for dynamically loadable and unloadable IP protocols in the ipmux.

With pr_proto_register() it has become possible to dynamically load protocols
within the PF_INET domain. However the PF_INET domain has a second important
structure called ip_protox[] that is derived from the 'struct protosw inetsw[]'
and takes care of the de-multiplexing of the various protocols that ride on
top of IP packets.

The functions ipproto_[un]register() allow the ip_protox[] array mux to be
adjusted dynamically in a consistent and easy way. To register a protocol within
ip_protox[] the existence of a corresponding and matching protocol definition
in inetsw[] is required. The function does not allow overwriting an already
registered protocol. The unregister function simply replaces the mux slot with
the default index pointer to IPPROTO_RAW as it was previously.
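
The register/unregister rules described above can be sketched in userland C
roughly as follows. This is only an illustration under stated assumptions:
struct toy_mux, toy_mux_init(), toy_proto_register() and
toy_proto_unregister() are hypothetical names, the errno values are chosen
for the sketch, and none of this is the kernel's actual ipproto_register()
code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PROTO_MAX 256   /* stands in for IPPROTO_MAX (assumption) */
#define TOY_RAW_SLOT  0     /* switch slot of the raw fallback (assumption) */

/* Hypothetical miniature of inetsw[] plus ip_protox[]. */
struct toy_mux {
    const int *sw;                         /* protocol number per switch slot */
    size_t nsw;                            /* number of switch slots */
    unsigned char protox[TOY_PROTO_MAX];   /* protocol number -> switch slot */
};

/* Point every mux slot at the raw fallback entry. */
static void
toy_mux_init(struct toy_mux *mux, const int *sw, size_t nsw)
{
    size_t i;

    assert(mux != NULL && sw != NULL && nsw > 0 && nsw <= 256);
    mux->sw = sw;
    mux->nsw = nsw;
    for (i = 0; i < TOY_PROTO_MAX; i++)
        mux->protox[i] = TOY_RAW_SLOT;
}

/* Returns 0 on success or an errno value (values are the sketch's choice). */
static int
toy_proto_register(struct toy_mux *mux, int proto)
{
    size_t i;

    if (proto <= 0 || proto >= TOY_PROTO_MAX)
        return (EINVAL);
    if (mux->protox[proto] != TOY_RAW_SLOT)
        return (EEXIST);    /* never overwrite a registered protocol */
    for (i = 0; i < mux->nsw; i++) {
        if (i != TOY_RAW_SLOT && mux->sw[i] == proto) {
            mux->protox[proto] = (unsigned char)i;
            return (0);
        }
    }
    return (ENOENT);        /* no matching switch entry exists */
}

/* Reset the mux slot to the raw fallback, as it was before registration. */
static int
toy_proto_unregister(struct toy_mux *mux, int proto)
{
    if (proto <= 0 || proto >= TOY_PROTO_MAX)
        return (EINVAL);
    if (mux->protox[proto] == TOY_RAW_SLOT)
        return (ENOENT);
    mux->protox[proto] = TOY_RAW_SLOT;
    return (0);
}
```

In this sketch a second registration of the same protocol fails with EEXIST,
and unregistering leaves the slot pointing at the raw fallback again.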


136691 19-Oct-2004 andre

Add a macro for the destruction of INP_INFO_LOCK's used by loadable modules.


136690 19-Oct-2004 andre

Make comments more clear. Change the order of one if() statement to check the
more likely variable first.


136682 18-Oct-2004 rwatson

Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>


136449 12-Oct-2004 rwatson

Don't release the udbinfo lock until after the last use of UDP inpcb
in udp_input(), since the udbinfo lock is used to prevent removal of
the inpcb while in use (i.e., as a form of reference count) in the
in-bound path.

RELENG_5 candidate.


136441 12-Oct-2004 rwatson

Modify the thrilling "%D is using my IP address %s!" message so that
it isn't printed if the IP address in question is '0.0.0.0', which is
used by nodes performing DHCP lookup, and so constitute a false
positive as a report of misconfiguration.


136440 12-Oct-2004 rwatson

When the access control on creating raw sockets was modified so that
processes in jail could create raw sockets, additional access control
checks were added to raw IP sockets to limit the ways in which those
sockets could be used. Specifically, only the socket option IP_HDRINCL
was permitted in rip_ctloutput(). Other socket options were protected
by a call to suser(). This change was required to prevent processes
in a Jail from modifying system properties such as multicast routing
and firewall rule sets.

However, it also introduced a regression: processes that create a raw
socket with root privilege, but then downgraded credential (i.e., a
daemon giving up root, or a setuid process switching back to the real
uid) could no longer issue other unprivileged generic IP socket option
operations, such as IP_TOS, IP_TTL, and the multicast group membership
options, which prevented multicast routing daemons (and some other
tools) from operating correctly.

This change pushes the access control decision down to the granularity
of individual socket options, rather than all socket options, on raw
IP sockets. When rip_ctloutput() doesn't implement an option, it will
now pass the request directly to in_control() without an access
control check. This should restore the functionality of the generic
IP socket options for raw sockets in the above-described scenarios,
which may be confirmed with the ipsockopt regression test.

RELENG_5 candidate.

Reviewed by: csjp


136327 09-Oct-2004 rwatson

Acquire the send socket buffer lock around tcp_output() activities
reaching into the socket buffer. This prevents a number of potential
races, including dereferencing of sb_mb while unlocked leading to
a NULL pointer deref (how I found it). Potentially this might also
explain other "odd" TCP behavior on SMP boxes (although haven't
seen it reported).

RELENG_5 candidate.


136226 07-Oct-2004 rwatson

When running with debug.mpsafenet=0, initialize IP multicast routing
callouts as non-CALLOUT_MPSAFE. Otherwise, they may trigger an
assertion regarding Giant if they enter other parts of the stack from
the callout.

MFC after: 3 days
Reported by: Dikshie < dikshie at ppk dot itb dot ac dot id >


136151 05-Oct-2004 ps

- Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.

Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>


136075 03-Oct-2004 green

Add support to IPFW for matching by TCP data length.


136073 03-Oct-2004 green

Add support to IPFW for classification based on "diverted" status
(that is, input via a divert socket).


136071 03-Oct-2004 green

Add to IPFW the ability to do ALTQ classification/tagging.


135977 30-Sep-2004 green

Validate the action pointer to be within the rule size, so that trying to
add corrupt ipfw rules would not potentially panic the system or worse.


135920 29-Sep-2004 mlaier

Add an additional struct inpcb * argument to pfil(9) in order to enable
passing along socket information. This is required to work around a LOR with
the socket code which results in an easily reproducible hard lockup with
debug.mpsafenet=1. This commit does *not* fix the LOR, but enables us to do
so later. The missing piece is to turn the filter locking into a leaf lock
and will follow in a separate (later) commit.

This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in
the foreseeable future.

Suggested by: rwatson
A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;)
Reviewed by: rwatson, csjp
Tested by: -pf, -ipfw, LINT, csjp and myself
MFC after: 3 days

LOR IDs: 14 - 17 (not fixed yet)


135919 29-Sep-2004 rwatson

Assign so_pcb to NULL rather than 0 as it's a pointer.

Spotted by: dwhite


135731 24-Sep-2004 maxim

o Turn net.inet.ip.check_interface sysctl off by default.

When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in
rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to
0 (accidentally?) in rev. 1.130.2.20 in RELENG_4 only. Along with the
fact that this knob is not documented, it breaks POLA, especially in a bridge
environment.

OK'ed by: andre
Reviewed by: -current


135318 16-Sep-2004 andre

Fix an out of bounds write during the initialization of the PF_INET protocol
family to the ip_protox[] array. The protocol number of IPPROTO_DIVERT is
larger than IPPROTO_MAX and was initializing memory beyond the array.
Catch all these kinds of errors by ignoring protocols that are higher than
IPPROTO_MAX or 0 (zero).

Add more comments to ip_init().
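
As a rough illustration of the bounds check described in this entry, the
userland sketch below fills a small demultiplexing table and skips protocol
numbers that would index outside it. The names toy_protox_init() and
TOY_PROTO_MAX are hypothetical, the out-of-range value 300 used in the note
is made up, and treating a number equal to the maximum as out of range is an
assumption of the sketch, not a statement about the committed code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PROTO_MAX 256   /* stands in for IPPROTO_MAX (assumption) */

/*
 * Fill protox[] so that protox[p] holds the switch slot for protocol p.
 * Every entry first falls back to raw_slot; protocol numbers that are 0 or
 * would index past the array are ignored. Returns how many were ignored.
 */
static int
toy_protox_init(unsigned char protox[TOY_PROTO_MAX], const int *protos,
    size_t nprotos, unsigned char raw_slot)
{
    size_t i;
    int skipped = 0;

    assert(protox != NULL && protos != NULL && nprotos <= 256);
    for (i = 0; i < TOY_PROTO_MAX; i++)
        protox[i] = raw_slot;
    for (i = 0; i < nprotos; i++) {
        if (protos[i] <= 0 || protos[i] >= TOY_PROTO_MAX) {
            skipped++;      /* would write outside protox[] */
            continue;
        }
        protox[protos[i]] = (unsigned char)i;
    }
    return (skipped);
}
```

Called with the protocol list { 0, 6, 17, 300 }, the sketch maps 6 and 17 to
their slots and skips the other two entries instead of writing out of bounds.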


135275 15-Sep-2004 andre

Clarify some comments for the M_FASTFWD_OURS case in ip_input().


135274 15-Sep-2004 andre

Remove the last two global variables that are used to store packet state while
it travels through the IP stack. This wasn't much of a problem because IP
source routing is disabled by default but when enabled together with SMP and
preemption it would have very likely cross-corrupted the IP options in transit.

The IP source route options of a packet are now stored in a mtag instead of the
global variable.


135168 13-Sep-2004 andre

Do not allow 'ipfw fwd' command when IPFIREWALL_FORWARD is not compiled into
the kernel. Return EINVAL instead.


135167 13-Sep-2004 andre

If we have to 'ipfw fwd'-tag a packet the second time in ipfw_pfil_out() don't
prepend an already existing tag again. Instead unlink it and prepend it again
to have it as the first tag in the chain.
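
The unlink-and-prepend step can be pictured with a toy singly linked tag
chain. This is a sketch only: struct toy_tag and toy_tag_promote() are
hypothetical stand-ins, not the kernel's m_tag structures or routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag node; the real packet tags carry more state. */
struct toy_tag {
    int type;
    struct toy_tag *next;
};

/*
 * If a tag of the given type is already on the chain, unlink it and put it
 * back at the head instead of adding a duplicate. Returns the promoted tag,
 * or NULL when no such tag exists (the caller would then prepend a new one).
 */
static struct toy_tag *
toy_tag_promote(struct toy_tag **headp, int type)
{
    struct toy_tag **pp, *t;

    assert(headp != NULL);
    for (pp = headp; (t = *pp) != NULL; pp = &t->next) {
        if (t->type == type) {
            *pp = t->next;      /* unlink from its current position */
            t->next = *headp;   /* prepend to the chain */
            *headp = t;
            return (t);
        }
    }
    return (NULL);
}
```

The chain keeps the same number of tags after promotion; only the order
changes, so a second pass through the output hook does not grow it.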

PR: kern/71380


135160 13-Sep-2004 andre

Make comments more clear for the packet changed cases after pfil hooks.


135158 13-Sep-2004 andre

Fix ip_input() fallback for the destination modified cases (from the packet
filters). After the ipfw to pfil move ip_input() expects M_FASTFWD_OURS
tagged packets to have ip_len and ip_off in host byte order instead of
network byte order.

PR: kern/71652
Submitted by: mlaier (patch)


135154 13-Sep-2004 andre

Make 'ipfw tee' behave as intended and designed. A tee'd packet is copied
and sent to the DIVERT socket while the original packet continues with the
next rule. Unlike a normally diverted packet no IP reassembly attempts are
made on tee'd packets and they are passed upwards totally unmodified.

Note: This will not be MFC'd to 4.x because of major infrastructure changes.

PR: kern/64240 (and many others collapsed into that one)


134991 09-Sep-2004 glebius

Check flag do_bridge always, even if kernel was compiled without
BRIDGE support. This makes dynamic bridge.ko working.

Reviewed by: sam
Approved by: julian (mentor)
MFC after: 1 week


134852 06-Sep-2004 jmg

revert comment from rev1.158 now that rev1.225 backed it out..

MFC after: 3 days


134823 05-Sep-2004 glebius

Recover normal behavior: return EINVAL on an attempt to add a divert rule
when module is built without IPDIVERT.

Silence from: andre
Approved by: julian (mentor)


134793 05-Sep-2004 jmg

fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...


134391 27-Aug-2004 andre

Apply error and success logic consistently to the function netisr_queue() and
its users.

netisr_queue() now returns (0) on success and ERRNO on failure. At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.

Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success. Due to this
schednetisr() was never called to kick the scheduling of the isr. However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.
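
The return convention described above (0 on success, an errno value on
failure) can be sketched in userland C as follows. The toy_queue type and
toy_queue_enqueue() are hypothetical and do not reproduce the kernel's
netisr or interface queue code; the kicks counter merely stands in for the
schednetisr() call that must follow a successful enqueue.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOYQ_MAXLEN 2       /* tiny limit so the full case is easy to hit */

/* Hypothetical bounded packet queue. */
struct toy_queue {
    int enabled;                /* 0 models "queue not functional" */
    size_t len;                 /* packets currently queued */
    void *slots[TOYQ_MAXLEN];
    int kicks;                  /* times the consumer was scheduled */
};

/* Returns 0 on success, ENXIO if disabled, ENOBUFS if the queue is full. */
static int
toy_queue_enqueue(struct toy_queue *q, void *pkt)
{
    assert(q != NULL);
    if (!q->enabled)
        return (ENXIO);
    if (q->len >= TOYQ_MAXLEN)
        return (ENOBUFS);
    q->slots[q->len++] = pkt;
    q->kicks++;                 /* only reached on success */
    return (0);
}
```

Because success is exactly 0, a caller can test the result with a plain
"if (error)" and cannot misread a success code as a failure or vice versa.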

PR: kern/70988
Found by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 3 days


134385 27-Aug-2004 andre

In the case the destination of a packet was changed by the packet filter
to point to a local IP address; and the packet was sourced from this host
we fill in the m_pkthdr.rcvif with a pointer to the loopback interface.

Before the function ifunit("lo0") was used to obtain the ifp. However
this is sub-optimal from a performance point of view and might be dangerous
if the loopback interface has been renamed. Use the global variable 'loif'
instead which always points to the loopback interface.

Submitted by: brooks


134384 27-Aug-2004 andre

Remove a junk line left over from the recent IPFW to PFIL_HOOKS conversion.


134383 27-Aug-2004 andre

Always compile PFIL_HOOKS into the kernel and remove the associated kernel
compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and
thus it becomes a standard part of the network stack.

If no hooks are connected the entire packet filter hooks section and related
activities are jumped over. This removes any performance impact if no hooks
are active.

Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.


134346 26-Aug-2004 ru

Revert the last change to sys/modules/ipfw/Makefile and fix a
standalone module build in a better way.

Silence from: andre
MFC after: 3 days


134290 25-Aug-2004 pjd

Allocate memory when dumping pipes with M_WAITOK flag.
On a system with a huge number of pipes, M_NOWAIT fails almost always,
because of memory fragmentation.
My fix is different than the patch proposed by Pawel Malachowski,
because in FreeBSD 5.x we cannot sleep while holding dummynet mutex
(in 4.x there is no such lock).
My fix is also ugly, but there is no easy way to prepare nice and clean fix.

PR: kern/46557
Submitted by: Eugene Grosbein <eugen@grosbein.pp.ru>
Reviewed by: mlaier


134172 22-Aug-2004 mlaier

Allow early drop for non-ALTQ enabled queues in an ALTQ-enabled kernel.
Previously the early drop was disabled unconditionally for ALTQ-enabled
kernels.

This should give some benefit for the normal gateway + LAN-server case with
a busy LAN leg and an ALTQ managed uplink.

Reviewed and style help from: cperciva, pjd


134142 22-Aug-2004 rwatson

When sliding the m_data pointer forward, update m_pkthdr.len as well
as m_len, or the pkthdr length will be inconsistent with the actual
length of data in the mbuf chain. The symptom of this occurring was
"out of data" warnings from in_cksum_skip() on large UDP packets sent
via the loopback interface.
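
A minimal sketch of the invariant, assuming a single-buffer packet: the
struct toy_mbuf layout and toy_mbuf_trim_front() are hypothetical and much
simpler than a real mbuf chain, but they show the two lengths moving
together when the data pointer advances.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-buffer packet; real mbufs can be chained. */
struct toy_mbuf {
    char *m_data;       /* start of valid data */
    int m_len;          /* bytes of data in this buffer */
    int pkthdr_len;     /* total packet length (models m_pkthdr.len) */
    char buf[64];
};

/* Drop n bytes from the front, keeping both length fields consistent. */
static void
toy_mbuf_trim_front(struct toy_mbuf *m, int n)
{
    assert(m != NULL && n >= 0 && n <= m->m_len);
    m->m_data += n;
    m->m_len -= n;
    m->pkthdr_len -= n;     /* forgetting this line recreates the bug */
}
```

With only m_len updated, a consumer that walks pkthdr_len bytes would run
past the data actually present, which matches the symptom reported above.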

Foot shot: green


134122 21-Aug-2004 csjp

When a prison is given the ability to create raw sockets (when the
security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged
access to jails is given out, it is possible for prison root to manipulate
various network parameters which affect the host environment. This commit
plugs a number of security holes associated with the use of raw sockets
and prisons.

This commit makes the following changes:

- Add a comment to rtioctl warning developers that if they add
any ioctl commands, they should use super-user checks where necessary,
as it is possible for PRISON root to make it this far in execution.
- Add super-user checks for the execution of the SIOCGETVIFCNT
and SIOCGETSGCNT IP multicast ioctl commands.
- Add a super-user check to rip_ctloutput(). If the calling cred
is PRISON root, make sure the socket option name is IP_HDRINCL,
otherwise deny the request.

Although this patch corrects a number of security problems associated
with raw sockets and prisons, the warning in jail(8) should still
apply, and by default we should keep the default value of
security.jail.allow_raw_sockets MIB to 0 (or disabled) until
we are certain that we have tracked down all the problems.

Looking forward, we will probably want to eliminate the
references to curthread.

This may be a MFC candidate for RELENG_5.

Reviewed by: rwatson
Approved by: bmilekic (mentor)


134119 21-Aug-2004 rwatson

When prepending space onto outgoing UDP datagram payloads to hold the
UDP/IP header, make sure that space is also allocated for the link
layer header. If an mbuf must be allocated to hold the UDP/IP header
(very likely), then this will avoid an additional mbuf allocation at
the link layer. This trick is also used by TCP and other protocols to
avoid extra calls to the mbuf allocator in the ethernet (and related)
output routines.


134055 20-Aug-2004 andre

Fix a stupid typo which prevented an ipfw KLD unload from successfully cleaning
up its remains. Do not terminate 'if' lines with ';'.

Spotted by: claudio@OpenBSD.ORG (sitting 3m from my desk)
Pointy hat to: andre


134049 19-Aug-2004 andre

When unloading ipfw module use callout_drain() to make absolutely sure that
all callouts are stopped and finished. Move it before IPFW_LOCK() to avoid
deadlocking when draining callouts.


134041 19-Aug-2004 andre

For IPv6, access the pointer to the tcpcb only after we have checked it is valid.

Found by: Coverity's automated analysis (via Ted Unangst)


134026 19-Aug-2004 andre

Give a useful error message if someone tries to compile IPFIREWALL into the
kernel without specifying PFIL_HOOKS as well.


134023 19-Aug-2004 andre

Do not unconditionally ignore IPDIVERT and IPFIREWALL_FORWARD when building
the ipfw KLD.

For IPFIREWALL_FORWARD this does not have any side effects. If the module
has it but not the kernel it just doesn't do anything.

For IPDIVERT the KLD will be unloadable if the kernel doesn't have IPDIVERT
compiled in too. However this is the least disturbing behaviour. The user
can just recompile either module or the kernel to match the other one. The
access to the machine is not denied if ipfw refuses to load.


134022 19-Aug-2004 andre

Bring back the sysctl 'net.inet.ip.fw.enable' to unbreak the startup scripts
and to be able to disable ipfw if it was compiled directly into the kernel.


133994 19-Aug-2004 rwatson

Push down pcbinfo and inpcb locking from udp_send() into udp_output().
This provides greater context for the locking and allows us to avoid
locking the pcbinfo structure if no binding operations will take
place (i.e., already bound, connected, and no explicit sendto()
address).


133993 19-Aug-2004 rwatson

In in_pcbrehash(), do assert the inpcb lock as well as the pcbinfo lock.


133923 18-Aug-2004 rwatson

Fix build of ip_input.c with "options IPSEC" -- the "pass:" label
is used with both FAST_IPSEC and IPSEC, but was defined for only
FAST_IPSEC.


133922 18-Aug-2004 peter

Make the kernel compile again if you are not using PFIL_HOOKS


133920 17-Aug-2004 andre

Convert ipfw to use PFIL_HOOKS. This change is transparent to userland
and preserves the ipfw ABI. The ipfw core packet inspection and filtering
functions have not been changed, only how ipfw is invoked is different.

However, there are many changes to how ipfw and its add-ons are handled:

In general ipfw is now called through the PFIL_HOOKS and most associated
magic, that was in ip_input() or ip_output() previously, is now done in
ipfw_check_[in|out]() in the ipfw PFIL handler.

IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to
be diverted is checked if it is fragmented, if yes, ip_reass() gets in for
reassembly. If not, or all fragments arrived and the packet is complete,
divert_packet is called directly. For 'tee' no reassembly attempt is made
and a copy of the packet is sent to the divert socket unmodified. The
original packet continues its way through ip_input/output().

ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet
with the new destination sockaddr_in. A check if the new destination is a
local IP address is made and the m_flags are set appropriately. ip_input()
and ip_output() have some more work to do here. For ip_input() the m_flags
are checked and a packet for us is directly sent to the 'ours' section for
further processing. Destination changes on the input path are only tagged
and the 'srcrt' flag to ip_forward() is set to disable destination checks
and ICMP replies at this stage. The tag is going to be handled on output.
ip_output() again checks for m_flags and the 'ours' tag. If found, the
packet will be dropped back to the IP netisr where it is going to be picked
up by ip_input() again and then directly sent to the 'ours' section. When
only the destination changes, the route's 'dst' is overwritten with the
new destination from the forward m_tag. Then it jumps back at the route
lookup again and skips the firewall check because it has been marked with
M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with
'option IPFIREWALL_FORWARD' to enable it.

DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for
a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will
then inject it back into ip_input/ip_output() after it has served its time.
Dummynet packets are tagged and will continue from the next rule when they
hit the ipfw PFIL handlers again after re-injection.

BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as
they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS.

More detailed changes to the code:

conf/files
Add netinet/ip_fw_pfil.c.

conf/options
Add IPFIREWALL_FORWARD option.

modules/ipfw/Makefile
Add ip_fw_pfil.c.

net/bridge.c
Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw
is still directly invoked to handle layer2 headers and packets would
get a double ipfw when run through PFIL_HOOKS as well.

netinet/ip_divert.c
Removed divert_clone() function. It is no longer used.

netinet/ip_dummynet.[ch]
Neither the route 'ro' nor the destination 'dst' need to be stored
while in dummynet transit. Structure members and associated macros
are removed.

netinet/ip_fastfwd.c
Removed all direct ipfw handling code and replaced it with the new
'ipfw forward' handling code.

netinet/ip_fw.h
Removed 'ro' and 'dst' from struct ip_fw_args.

netinet/ip_fw2.c
(Re)moved some global variables and the module handling.

netinet/ip_fw_pfil.c
New file containing the ipfw PFIL handlers and module initialization.

netinet/ip_input.c
Removed all direct ipfw handling code and replaced it with the new
'ipfw forward' handling code. ip_forward() no longer requires
the 'next_hop' struct sockaddr_in argument. Disable early checks
if 'srcrt' is set.

netinet/ip_output.c
Removed all direct ipfw handling code and replaced it with the new
'ipfw forward' handling code.

netinet/ip_var.h
Add ip_reass() as general function. (Used from ipfw PFIL handlers
for IPDIVERT.)

netinet/raw_ip.c
Directly check if ipfw and dummynet control pointers are active.

netinet/tcp_input.c
Rework the 'ipfw forward' to local code to work with the new way of
forward tags.

netinet/tcp_sack.c
Remove include 'opt_ipfw.h' which is not needed here.

sys/mbuf.h
Remove m_claim_next() macro which was exclusively for ipfw 'forward'
and is no longer needed.

Approved by: re (scottl)


133874 16-Aug-2004 rwatson

White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>


133849 16-Aug-2004 obrien

Put the 'antispoof' opcode in the proper place in the opcode list such
that it doesn't break the ipfw2 ABI.


133720 14-Aug-2004 dwmalone

Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD
have already done this, so I have styled the patch on their work:

1) introduce a ip_newid() static inline function that checks
the sysctl and then decides if it should return a sequential
or random IP ID.

2) named the sysctl net.inet.ip.random_id

3) IPv6 flow IDs and fragment IDs are now always random.
Flow IDs and frag IDs are significantly less common in the
IPv6 world (ie. rarely generated per-packet), so there should
be smaller performance concerns.

The sysctl defaults to 0 (sequential IP IDs).

Reviewed by: andre, silby, mlaier, ume
Based on: NetBSD
MFC after: 2 months
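
A minimal sketch of the run-time switch described in item 1, assuming a
plain global variable in place of the net.inet.ip.random_id sysctl. All
names prefixed with sketch_ are hypothetical, the random source is a toy
xorshift generator rather than the kernel's algorithm, and byte-order
conversion is left out. The commit describes a static inline function; a
plain function is used here to keep the example easy to call.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the sysctl; 0 selects sequential IDs (the stated default). */
int sketch_random_id = 0;

static uint16_t sketch_ip_id;				/* sequential counter */
static uint32_t sketch_rng_state = 0x2545f491u;		/* must be non-zero */

/* Toy xorshift generator; NOT the kernel's random ID algorithm. */
static uint16_t
sketch_randomid(void)
{
	sketch_rng_state ^= sketch_rng_state << 13;
	sketch_rng_state ^= sketch_rng_state >> 17;
	sketch_rng_state ^= sketch_rng_state << 5;
	return ((uint16_t)(sketch_rng_state & 0xffff));
}

/* Check the switch on every call and return the next IP ID. */
uint16_t
sketch_ip_newid(void)
{
	if (sketch_random_id)
		return (sketch_randomid());
	return (sketch_ip_id++);
}
```

Because the switch is read on each call, the behaviour can change without
a kernel rebuild, which is the point of replacing the compile-time option.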


133719 14-Aug-2004 phk

Fix outgoing ICMP on global instance.


133600 12-Aug-2004 csjp

Add the ability to associate ipfw rules with a specific prison ID.
Since the only thing truly unique about a prison is its ID, I figured
this would be the most granular way of handling this.

This commit makes the following changes:

- Adds tokenizing and parsing for the ``jail'' command line option
to the ipfw(8) userspace utility.
- Append O_JAIL to the ipfw opcode list.
- While I am here, add a comment informing others that if they
want to add additional opcodes, they should append them to the end
of the list to avoid ABI breakage.
- Add ``fw_prid'' to the ipfw ucred cache structure.
- When initializing ucred cache, if the process is jailed,
set fw_prid to the prison ID, otherwise set it to -1.
- Update man page to reflect these changes.

This change was a strong motivator behind the ucred caching
mechanism in ipfw.

A sample usage of this new functionality could be:

ipfw add count ip from any to any jail 2

It should be noted that because ucred based constraints
are only implemented for TCP and UDP packets, the same
applies for jail associations.

Conceptual head nod by: pjd
Reviewed by: rwatson
Approved by: bmilekic (mentor)


133591 12-Aug-2004 dwmalone

In tcp6_ctlinput, lock tcbinfo around the call to syncache_unreach
so that the locks held are the same as the IPv4 case.

Reviewed by: rwatson


133557 12-Aug-2004 andre

Fix two cases of incorrect IPQ_UNLOCK'ing in the merged ip_reass() function.
The first one was going to 'dropfrag', which unlocks the IPQ, before the lock
was acquired; the second one did an unlock and then a 'goto dropfrag', which
led to a double-unlock.

Tripped over by: des


133532 12-Aug-2004 rwatson

When udp_send() fails, make sure to free the control mbufs as well as
the data mbuf. This was done in most error cases, but not the case
where the inpcb pointer is surprisingly NULL.


133517 11-Aug-2004 andre

Backout removal of UMA_ZONE_NOFREE flag for all zones which are established
for structures with timers in them. A timer might fire
even when the associated structure has already been freed. Having
type-stable storage in this case is beneficial for graceful failure handling and
debugging.

Discussed with: bosko, tegge, rwatson


133509 11-Aug-2004 andre

Remove the UMA_ZONE_NOFREE flag from all uma_zcreate() calls in the IP and
TCP code. This flag would have prevented giving back excessive free slabs
to the global pool after a transient peak usage.


133497 11-Aug-2004 andre

Make use of in_localip() function and replace previous direct LIST_FOREACH
loops over INADDR_HASH.


133486 11-Aug-2004 andre

Add the function in_localip() which returns 1 if an internet address is for
the local host and configured on one of its interfaces.
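
The contract of such a function can be sketched as below. This is only an
illustration: the address table is a hypothetical fixed array searched
linearly, whereas the kernel keeps configured addresses in its own hashed
lists, and the sketch_ names are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical addresses configured on local interfaces (host byte order). */
static const uint32_t sketch_local_addrs[] = {
	0x7f000001u,	/* 127.0.0.1 */
	0xc0a80001u,	/* 192.168.0.1 */
};

/* Return 1 if addr is configured on one of our interfaces, 0 otherwise. */
int
sketch_in_localip(uint32_t addr)
{
	size_t n = sizeof(sketch_local_addrs) / sizeof(sketch_local_addrs[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (sketch_local_addrs[i] == addr)
			return (1);
	}
	return (0);
}
```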


133485 11-Aug-2004 andre

Only invoke verify_path() for verrevpath and versrcreach when we have an IP packet.


133482 11-Aug-2004 andre

Only check for local broadcast addresses if the mbuf is flagged with M_BCAST.


133481 11-Aug-2004 andre

Consistently use NULL for pointer comparisons.


133480 11-Aug-2004 andre

Make IP fastforwarding ALTQ-aware by adding the input traffic conditioner
check and disabling the early output interface queue length check.


133477 11-Aug-2004 andre

Correct the displayed bandwidth calculation for a readout via sysctl. The
saved value does not have to be scaled with HZ; it is already in bytes per
second. Only the multiply by eight remains to show bits per second (bps).
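
The remaining arithmetic is small enough to show directly. The function
name is hypothetical; the only fact taken from the text above is that the
stored value is already in bytes per second.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a stored rate in bytes per second to bits per second. */
uint64_t
sketch_bytes_to_bits_per_sec(uint64_t bytes_per_sec)
{
	return (bytes_per_sec * 8);
}
```

For instance, a stored rate of 125000 bytes per second is displayed as
1000000 bps.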


133469 11-Aug-2004 rwatson

Assert the locks of inpcbinfo's and inpcb's passed into in_pcbconnect()
and in_pcbconnect_setup(), since these functions frob the port and
address state of inpcbs.


133390 09-Aug-2004 andre

Make a comment that IP source routing is not SMP and PREEMPTION safe.


133389 09-Aug-2004 andre

Make a comment that "ipfw forward" is not SMP and PREEMPTION safe.


133387 09-Aug-2004 andre

New ipfw option "antispoof":

For incoming packets, the packet's source address is checked if it
belongs to a directly connected network. If the network is directly
connected, then the interface the packet came in on is compared to
the interface the network is connected to. When incoming interface
and directly connected interface are not the same, the packet does
not match.

Usage example:

ipfw add deny ip from any to any not antispoof in

Manpage education by: ru
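
The matching rule can be sketched as follows. The network table, the
interface indices and the sketch_ names are hypothetical. One assumption
goes beyond the text above: a source address that is not on any directly
connected network is treated as matching, so that only spoofed
"local-looking" sources are caught by a "not antispoof" rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_net {
	uint32_t net;		/* network address, host byte order */
	uint32_t mask;		/* netmask, host byte order */
	int	 ifindex;	/* interface the network is connected to */
};

/* Hypothetical directly connected networks. */
static const struct sketch_net sketch_connected[] = {
	{ 0xc0a80000u, 0xffffff00u, 1 },	/* 192.168.0.0/24 on if 1 */
	{ 0x0a000000u, 0xff000000u, 2 },	/* 10.0.0.0/8 on if 2 */
};

/* Return 1 if the packet matches "antispoof", 0 if it does not. */
int
sketch_antispoof_match(uint32_t src, int recv_ifindex)
{
	size_t n = sizeof(sketch_connected) / sizeof(sketch_connected[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if ((src & sketch_connected[i].mask) == sketch_connected[i].net)
			return (sketch_connected[i].ifindex == recv_ifindex);
	}
	return (1);	/* assumption: not directly connected, so it matches */
}
```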


133192 06-Aug-2004 rwatson

Pass pcbinfo structures to in6_pcbnotify() rather than pcbhead
structures, allowing in6_pcbnotify() to lock the pcbinfo and each
inpcb that it notifies of ICMPv6 events. This prevents inpcb
assertions from firing when IPv6 generates and delievers event
notifications for inpcbs.

Reported by: kuriyama
Tested by: kuriyama


133189 06-Aug-2004 rwatson

When iterating the UDP inpcb list processing an inbound broadcast
or multicast packet, we don't need to acquire the inpcb mutex
unless we are actually using inpcb fields other than the bound port
and address. Since we hold the pcbinfo lock already, these can't
change. Defer acquiring the inpcb mutex until we have a high
chance of a match. This avoids about 120 mutex operations per UDP
broadcast packet received on one of my work systems.

Reviewed by: sam


133128 04-Aug-2004 rwatson

Now that IPv6 performs basic in6pcb and inpcb locking, enable inpcb
lock assertions even if IPv6 is compiled into the kernel. Previously,
inclusion of IPv6 and locking assertions would result in a rapid
assertion failure as IPv6 was not properly locking inpcbs.


133121 04-Aug-2004 marcus

Fix Skinny and PPTP NAT'ing after the introduction of the {ip,tcp,udp}_next
functions. Basically, the ip_next() function was used to get the PPTP and
Skinny headers when tcp_next() should have been used instead. Symptoms of
this included a segfault in natd when trying to process a PPTP or Skinny
packet.

Approved by: des


133074 03-Aug-2004 andre

o Delayed checksums are now calculated in divert_packet() for diverted packets.
Remove the XXX-escaped code that did it in ip_output()'s IPHACK section.


133072 03-Aug-2004 andre

o Move the inflight sysctls to their own sub-tree under net.inet.tcp to be
more consistent with the other sysctls around it.


133069 03-Aug-2004 andre

o Move all parts of the IP reassembly process into the function ip_reass() to
make it fully self-contained.
o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len
including the IP header.
o Computation of the delayed checksum is moved into divert_packet().

Reviewed by: silby


133046 03-Aug-2004 hsu

Fix bug with tracking the previous element in a list.

Found by: edrt@citiz.net
Submitted by: pavlin@icir.org


132794 28-Jul-2004 yar

Disallow a particular kind of port theft described by the following scenario:

Alice is too lazy to write a server application in PF-independent
manner. Therefore she knocks up the server using PF_INET6 only
and allows the IPv6 socket to accept mapped IPv4 as well. An evil
hacker known on IRC as cheshire_cat has an account in the same
system. He starts a process listening on the same port as used
by Alice's server, but in PF_INET. As a consequence, cheshire_cat
will distract all IPv4 traffic supposed to go to Alice's server.

Such sort of port theft was initially enabled by copying the code that
implemented the RFC 2553 semantics on IPv4/6 sockets (see inet6(4)) for
the implied case of the same owner for both connections. After this
change, the above scenario will be impossible. In the same setting,
the user who attempts to start his server last will get EADDRINUSE.

Of course, using IPv4 mapped to IPv6 leads to security complications
in the first place, but there is no reason to make it even more unsafe.

This change doesn't apply to KAME since it affects a FreeBSD-specific
part of the code. It doesn't modify the out-of-box behaviour of the
TCP/IP stack either as long as mapping IPv4 to IPv6 is off by default.

MFC after: 1 month


132717 28-Jul-2004 jayanth

Fix a bug in the sack code that was causing data to be retransmitted
with the FIN bit set for all segments, if a FIN has already been sent before.
The fix will allow the FIN bit to be set for only the last segment, in case
it has to be retransmitted.

Fix another bug that would have caused snd_nxt to be pulled by len if
there was an error from ip_output. snd_nxt should not be touched
during sack retransmissions.


132676 26-Jul-2004 jayanth

Fix for a SACK bug where the very last segment retransmitted
from the SACK scoreboard could result in the next (untransmitted)
segment being skipped.


132675 26-Jul-2004 jmg

compare pointer against NULL, not 0

when inpcb is NULL, this is no longer invalid since jlemon added the
tcp_twstart function... this prevents close "failing" w/ EINVAL when it
really was successful...

Reviewed by: jeremy (NetBSD)


132653 26-Jul-2004 cperciva

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


132510 21-Jul-2004 andre

Extend versrcreach by checking against the rt_flags for RTF_REJECT and
RTF_BLACKHOLE as well.

To quote the submitter:

The uRPF loose-check implementation by the industry vendors, at least on Cisco
and possibly Juniper, will fail the check if the route of the source address
is pointed to Null0 (on Juniper, discard or reject route). What this means is,
even if uRPF Loose-check finds the route, if the route is pointed to blackhole,
uRPF loose-check must fail. This allows people to utilize uRPF loose-check mode
as a pseudo-packet-firewall without using any manual filtering configuration --
one can simply inject a IGP or BGP prefix with next-hop set to a static route
that directs to null/discard facility. This results in uRPF Loose-check failing
on all packets with source addresses that are within the range of the nullroute.

Submitted by: James Jun <james@towardex.com>
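
The extended check can be sketched like this. The flag values and the
structure are hypothetical stand-ins for the routing table entry and its
rt_flags, and the sketch leaves out every other condition the real
versrcreach applies.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; not the kernel's RTF_* values. */
#define SKETCH_RTF_REJECT	0x1
#define SKETCH_RTF_BLACKHOLE	0x2

struct sketch_route {
	int flags;
};

/*
 * Loose reverse-path check: pass only if a route to the source exists
 * and that route is neither a reject nor a blackhole route.
 * rt is NULL when the lookup found no route.
 */
int
sketch_versrcreach(const struct sketch_route *rt)
{
	if (rt == NULL)
		return (0);
	if (rt->flags & (SKETCH_RTF_REJECT | SKETCH_RTF_BLACKHOLE))
		return (0);
	return (1);
}
```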


132469 20-Jul-2004 rwatson

M_PREPEND() the IP header on to the front of an outgoing raw IP packet
using M_DONTWAIT rather than M_WAITOK to avoid sleeping on memory
while holding a mutex.


132418 19-Jul-2004 jayanth

Let IN_FASTRECOVERY macro decide if we are in recovery mode.

Nuke sackhole_limit for now. We need to add it back to limit the total
number of sack blocks in the system.


132417 19-Jul-2004 jayanth

Fix a potential panic in the SACK code that was causing
1) data to be sent to the right of snd_recover.
2) more data to be sent than what's in the send buffer.

The fix is to postpone sack retransmit to a subsequent recovery episode
if the current retransmit pointer is beyond snd_recover.

Thanks to Mohan Srinivasan for helping fix the bug.

Submitted by: Daniel Lang


132315 17-Jul-2004 dwmalone

Fix the !INET6 build.

Reported by: alc


132307 17-Jul-2004 dwmalone

The tcp syncache code was leaving the IPv6 flowlabel uninitialised
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arrange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabled) and by stashing the flowlabel in syncache.

Tested and Discovered by: Orla McGann <orly@cnri.dit.ie>
Approved by: ume, silby
MFC after: 1 month


132280 17-Jul-2004 mlaier

Define semantic of M_SKIP_FIREWALL more precisely, i.e. also pass associated
icmp_error() packets. While here retire PACKET_TAG_PF_GENERATED (which
served the same purpose) and use M_SKIP_FIREWALL in pf as well. This should
speed up things a bit as we get rid of the tag allocations.

Discussed with: juli


132274 17-Jul-2004 jmallett

Make M_SKIP_FIREWALL a global (and semantic) flag, preventing anything from
using M_PROTO6 and possibly shooting someone's foot, as well as allowing the
firewall to be used in multiple passes, or with a packet classifier frontend,
that may need to explicitly allow a certain packet. Presently this is handled
in the ipfw_chk code as before, though I have run with it moved to upper
layers, and possibly it should apply to ipfilter and pf as well, though this
has not been investigated.

Discussed with: luigi, rwatson


132259 16-Jul-2004 ume

when IN6P_AUTOFLOWLABEL is set, the flowlabel is not set on
outgoing tcp connections.

Reported by: Orla McGann <orly@cnri.dit.ie>
Reviewed by: Orla McGann <orly@cnri.dit.ie>
Obtained from: KAME


132199 15-Jul-2004 phk

Do a pass over all modules in the kernel and make them return EOPNOTSUPP
for unknown events.

A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".


132107 13-Jul-2004 stefanf

Remove erroneous semicolons.


132044 12-Jul-2004 rwatson

After each label in tcp_input(), assert the inpcbinfo and inpcb lock
state that we expect.


131840 08-Jul-2004 brian

Change the following environment variables to kernel options:

bootp -> BOOTP
bootp.nfsroot -> BOOTP_NFSROOT
bootp.nfsv3 -> BOOTP_NFSV3
bootp.compat -> BOOTP_COMPAT
bootp.wired_to -> BOOTP_WIRED_TO

- i.e. back out the previous commit. It's already possible to
pxeboot(8) with a GENERIC kernel.

Pointed out by: dwmalone


131814 08-Jul-2004 brian

Change the following kernel options to environment variables:

BOOTP -> bootp
BOOTP_NFSROOT -> bootp.nfsroot
BOOTP_NFSV3 -> bootp.nfsv3
BOOTP_COMPAT -> bootp.compat
BOOTP_WIRED_TO -> bootp.wired_to

This lets you PXE boot with a GENERIC kernel by putting this sort of thing
in loader.conf:

bootp="YES"
bootp.nfsroot="YES"
bootp.nfsv3="YES"
bootp.wired_to="bge1"

or even setting the variables manually from the OK prompt.


131700 06-Jul-2004 des

Push WARNS back up to 6, but define NO_WERROR; I want the warts out in the
open where people can see them and hopefully fix them.


131699 06-Jul-2004 des

Introduce inline {ip,udp,tcp}_next() functions which take a pointer to an
{ip,udp,tcp} header and return a void * pointing to the payload (i.e. the
first byte past the end of the header and any required padding). Use them
consistently throughout libalias to a) reduce code duplication, b) improve
code legibility, c) get rid of a bunch of alignment warnings.
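
A rough sketch of what such helpers compute, working on raw bytes instead
of the real header structures. The sketch_ names are hypothetical and this
is not the libalias code; it only relies on the standard header layouts
(IPv4 header length in the low nibble of byte 0, TCP data offset in the
high nibble of byte 12, both counted in 32-bit words).

```c
#include <assert.h>
#include <stdint.h>

/* Return a pointer to the first byte after an IPv4 header. */
void *
sketch_ip_next(void *iphdr)
{
	uint8_t *p = iphdr;

	return (p + ((p[0] & 0x0f) << 2));
}

/* Return a pointer to the first byte after a TCP header. */
void *
sketch_tcp_next(void *tcphdr)
{
	uint8_t *p = tcphdr;

	return (p + ((p[12] >> 4) << 2));
}
```

To reach application data carried over TCP, both steps are needed: the
result of the IP helper points at the TCP header, not at its payload.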


131693 06-Jul-2004 des

Rewrite twowords() to access its argument through a char pointer and not
a short pointer. The previous implementation seems to be in a gray zone
of the C standard, and GCC generates incorrect code for it at -O2 or
higher on some platforms.
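
The idea of reading the data through a character pointer can be sketched
as below. This is not the libalias implementation: the name is
hypothetical, and the sketch assumes the two 16-bit words are interpreted
in network (big-endian) byte order so the result does not depend on the
host.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sum two consecutive 16-bit words, reading them one byte at a time
 * through an unsigned char pointer. Byte-wise access has no alignment
 * requirement and avoids accessing the object through a short pointer.
 */
uint16_t
sketch_twowords(const void *p)
{
	const uint8_t *c = p;
	uint16_t w1 = (uint16_t)((c[0] << 8) | c[1]);
	uint16_t w2 = (uint16_t)((c[2] << 8) | c[3]);

	return ((uint16_t)(w1 + w2));
}
```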


131690 06-Jul-2004 des

Temporarily lower WARNS to 3 while I figure out the alignment issues on
alpha.


131614 05-Jul-2004 des

Make libalias WARNS?=6-clean. This mostly involves renaming variables
named link, foo_link or link_foo to lnk, foo_lnk or lnk_foo, fixing
signed / unsigned comparisons, and shoving unused function arguments
under the carpet.

I was hoping WARNS?=6 might reveal more serious problems, and perhaps
the source of the -O2 breakage, but found no smoking gun.


131613 05-Jul-2004 des

Parenthesize return values.


131612 05-Jul-2004 des

Mechanical whitespace cleanup.


131566 04-Jul-2004 phk

Add LibAliasOutTry() which checks a packet for a hit in the tables, but
does not create a new entry if none is found.


131504 02-Jul-2004 ru

Mechanically kill hard sentence breaks.


131427 01-Jul-2004 jayanth

On receiving 3 duplicate acknowledgements, SACK recovery was not being entered correctly.
Fix this problem by separating out the SACK and the newreno cases. Also, check
if we are in FASTRECOVERY for the sack case and if so, turn off dupacks.

Fix an issue where the congestion window was not being incremented by ssthresh.

Thanks to Mohan Srinivasan for finding this problem.


131420 01-Jul-2004 ru

Bumped document date.
Fixed markup.
Fixed examples to match the new API.


131208 27-Jun-2004 phk

Rwatson, write 100 times for tomorrow:

First unlock, then assign NULL to pointer.
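
The ordering can be sketched with a toy lock. The structure and function
names are hypothetical and stand in for the real inpcb and its mutex;
reversing the two statements in sketch_release() would pass a NULL pointer
to the unlock routine.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_pcb {
	int locked;	/* toy lock: 1 while held */
};

/* Release the toy lock; the pointer must still be valid here. */
void
sketch_unlock(struct sketch_pcb *p)
{
	assert(p != NULL);
	assert(p->locked);
	p->locked = 0;
}

/* Correct order: unlock through the pointer first, then clear it. */
void
sketch_release(struct sketch_pcb **pp)
{
	sketch_unlock(*pp);
	*pp = NULL;
}
```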


131178 27-Jun-2004 pjd

Those are unneeded too.


131177 27-Jun-2004 pjd

Add two missing includes and remove two unneeded ones.
This is quite a serious fix, because even with MAC framework compiled in,
MAC entry points in those two files were simply ignored.


131151 26-Jun-2004 rwatson

Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() when adjusting
back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduced mutex costs.


131147 26-Jun-2004 rwatson

Remove spl's from TCP protocol entry points. While not all locking
is merged here yet, this will ease the merge process by bringing the
locked and unlocked versions into sync.


131079 25-Jun-2004 ps

White space & spelling fixes

Submitted by: Xin LI <delphij@frontfree.net>


131078 25-Jun-2004 bms

Whitespace.


131018 24-Jun-2004 rwatson

Broaden scope of the socket buffer lock when processing an ACK so that
the read and write of sb_cc are atomic. Call sbdrop_locked() instead
of sbdrop() since we already hold the socket buffer lock.


131017 24-Jun-2004 rwatson

Protect so_oobmark with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations


131012 24-Jun-2004 rwatson

In ip_ctloutput(), acquire the inpcb lock around some of the basic
inpcb flag and status updates.


131011 24-Jun-2004 rwatson

When asserting non-Giant locks in the network stack, also assert
Giant if debug.mpsafenet=0, as any points that require synchronization
in the SMPng world also required it in the Giant-world:

- inpcb locks (including IPv6)
- inpcbinfo locks (including IPv6)
- dummynet subsystem lock
- ipfw2 subsystem lock


131006 24-Jun-2004 rwatson

Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic
read-modify-write on buffer sizes.
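
The locked/unlocked pairing described above follows a common pattern,
sketched here with a toy lock flag. The structure, field and function
names are hypothetical and simplified; the real routines take more
arguments and use the socket buffer mutex.

```c
#include <assert.h>

struct sketch_sockbuf {
	int		locked;	/* toy lock: 1 while held */
	unsigned long	hiwat;	/* buffer limit being adjusted */
};

/* Caller must already hold the lock; the assertion documents that. */
int
sketch_sbreserve_locked(struct sketch_sockbuf *sb, unsigned long cc)
{
	assert(sb->locked);
	sb->hiwat = cc;
	return (1);
}

/* Wrapper: take the lock, call the _locked variant, drop the lock. */
int
sketch_sbreserve(struct sketch_sockbuf *sb, unsigned long cc)
{
	int ok;

	sb->locked = 1;
	ok = sketch_sbreserve_locked(sb, cc);
	sb->locked = 0;
	return (ok);
}
```

A caller that needs several updates to be atomic holds the lock itself and
calls the _locked variant directly, instead of locking once per call.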


130993 23-Jun-2004 ps

Move the sack sysctl's under net.inet.tcp.sack

net.inet.tcp.do_sack -> net.inet.tcp.sack.enable
net.inet.tcp.sackhole_limit -> net.inet.tcp.sack.sackhole_limit

Requested by: wollman


130989 23-Jun-2004 ps

Add support for TCP Selective Acknowledgements. The work for this
originated on RELENG_4 and was ported to -CURRENT.

The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.

You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.

Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)


130901 22-Jun-2004 rwatson

Acquire socket lock around frobbing of socket state in divert sockets.


130900 22-Jun-2004 rwatson

Prefer use of the inpcb as a MAC label source for outgoing packets sent
via divert sockets, when available.


130821 20-Jun-2004 rwatson

If debug.mpsafenet is set, initialize TCP callouts as CALLOUT_MPSAFE.


130811 20-Jun-2004 rwatson

Assert the inpcb lock before letting MAC check whether we can deliver
to the inpcb in tcp_input().


130810 20-Jun-2004 rwatson

IP multicast code no longer needs to acquire Giant before appending
an mbuf onto a socket buffer. This is left over from debug.mpsafenet
affecting the forwarding/bridging plane only.


130701 18-Jun-2004 rwatson

In tcp_ctloutput(), don't hold the inpcb lock over a call to
ip_ctloutput(), as it may need to perform blocking memory allocations.
This also improves consistency with locking relative to other points
that call into ip_ctloutput().

Bumped into by: Grover Lines <grover@ceribus.net>


130685 18-Jun-2004 bms

Check that m->m_pkthdr.rcvif is not NULL before checking if a packet
was received on a broadcast address on the input path. Under certain
circumstances this could result in a panic, notably for locally-generated
packets which do not have m_pkthdr.rcvif set.

This is a similar situation to that which is solved by
src/sys/netinet/ip_icmp.c rev 1.66.

PR: kern/52935
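
The guard can be sketched with toy structures. The types, the flag value
and the function name are hypothetical; the only point shown is that the
receive interface pointer is tested before it is dereferenced, so a
locally generated packet with no receive interface is simply treated as
not received on a broadcast-capable interface.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IFF_BROADCAST	0x2	/* illustrative flag bit */

struct sketch_ifnet {
	int if_flags;
};

struct sketch_mbuf {
	struct sketch_ifnet *rcvif;	/* NULL for locally generated packets */
};

/* Return 1 only if a receive interface exists and supports broadcast. */
int
sketch_rcvif_is_broadcast(const struct sketch_mbuf *m)
{
	return (m->rcvif != NULL &&
	    (m->rcvif->if_flags & SKETCH_IFF_BROADCAST) != 0);
}
```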


130683 18-Jun-2004 bms

Appease GCC.


130666 18-Jun-2004 bms

If SO_DEBUG is enabled for a TCP socket, and a received segment is
encapsulated within an IPv6 datagram, do not abuse the 'ipov' pointer
when registering trace records. 'ipov' is specific to IPv4, and
will therefore be uninitialized.

[This fandango is only necessary in the first place because of our
host-byte-order IP field pessimization.]

PR: kern/60856
Submitted by: Galois Zheng


130664 18-Jun-2004 bms

Don't set FIN on a retransmitted segment after a FIN has been sent,
unless the segment really contains the last of the data for the stream.

PR: kern/34619
Obtained from: OpenBSD (tcp_output.c rev 1.47)
Noticed by: Joseph Ishac
Reviewed by: George Neville-Neil


130662 18-Jun-2004 bms

Ensure that dst is bzeroed before calling rtalloc_ign(), to avoid possible
routing table corruption.

PR: kern/40563, freebsd4/432 (KAME)
Obtained from: NetBSD (in_gif.c rev 1.26.10.1)
Requested by: Jean-Luc Richier


130613 16-Jun-2004 mlaier

Commit pf version 3.5 and link additional files to the kernel build.

Version 3.5 brings:
- Atomic commits of ruleset changes (reduce the chance of ending up in an
inconsistent state).
- A 30% reduction in the size of state table entries.
- Source-tracking (limit number of clients and states per client).
- Sticky-address (the flexibility of round-robin with the benefits of
source-hash).
- Significant improvements to interface handling.
- and many more ...


130609 16-Jun-2004 mlaier

Prepare for pf 3.5 import:
- Remove pflog and pfsync modules. Things will change in such a fashion
that there will be one module with pf+pflog that can be loaded into
GENERIC without problems (which is what most people want). pfsync is no
longer possible as a module.
- Add multicast address for in-kernel multicast pfsync protocol. Protocol
glue will follow once the import is done.
- Add one more mbuf tag


130590 16-Jun-2004 maxim

o connect(2): if there is no route to the destination
do not pick up the first local ip address for the source
ip address, return ENETUNREACH instead.

Submitted by: Gleb Smirnoff
Reviewed by: -current (silence)


130584 16-Jun-2004 bms

Fix build for IPSEC && !INET6

PR: kern/66125
Submitted by: Cyrille Lefevre


130583 16-Jun-2004 bms

Reverse a patch which has no effect on -CURRENT and should probably be
applied directly to -STABLE.

Noticed by: iedowse
Pointy hat to: bms


130581 16-Jun-2004 bms

In ip_forward(), when calculating the MTU in effect for an IPSEC transport
mode tunnel, take the per-route MTU into account, *if* and *only if* it
is non-zero (as found in struct rt_metrics/rt_metrics_lite).

PR: kern/42727
Obtained from: NetBSD (ip_input.c rev 1.151)
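
The selection rule can be sketched as follows. The function name and
parameters are hypothetical, and subtracting the IPsec header size from
the chosen MTU is an assumption about the surrounding calculation rather
than something stated above.

```c
#include <assert.h>

/*
 * Prefer the per-route MTU when it is set (non-zero), otherwise fall
 * back to the interface MTU, then leave room for the IPsec header.
 */
unsigned long
sketch_ipsec_mtu(unsigned long if_mtu, unsigned long route_mtu,
    unsigned long ipsec_hdrsiz)
{
	unsigned long mtu = (route_mtu != 0) ? route_mtu : if_mtu;

	return (mtu > ipsec_hdrsiz ? mtu - ipsec_hdrsiz : 0);
}
```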


130580 16-Jun-2004 bms

In ip_forward(), set m->m_pkthdr.len correctly such that the mbuf chain
is sane, and ipsec4_getpolicybyaddr() will therefore complete.

PR: kern/42727
Obtained from: KAME (kame/freebsd4/sys/netinet/ip_input.c rev 1.42)


130559 16-Jun-2004 bms

Disconnect a temporarily-connected UDP socket in out-of-mbufs case. This
fixes the problem of UDP sockets getting wedged in a connected state (and
bound to their destination) under heavy load.
Temporary bind/connect should probably be deleted in future
as an optimization, as described in "A Faster UDP" [Partridge/Pink 1993].

Notes:
- INP_LOCK() is already held in udp_output(). The connection is in effect
happening at a layer lower than the socket layer, therefore in theory
socket locking should not be needed.
- Inlining the in_pcbdisconnect() operation buys us nothing (in the case
of the current state of the code), as laddr is not part of the
inpcb hash or the udbinfo hash. Therefore there should be no need
to rehash after restoring laddr in the error case (this was a
concern of the original author of the patch).

PR: kern/41765
Requested by: gnn
Submitted by: Jinmei Tatuya (with cleanups)
Tested by: spray(8)


130555 16-Jun-2004 rwatson

Convert GIANT_REQUIRED to NET_ASSERT_GIANT for socket access.


130513 15-Jun-2004 rwatson

Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field. This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.


130480 14-Jun-2004 rwatson

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coalesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


130416 13-Jun-2004 mlaier

Link ALTQ to the build and break with ABI for struct ifnet. Please recompile
your (network) modules as well as any userland that might make sense of
sizeof(struct ifnet).
This does not change the queueing yet. These changes will follow in a
separate commit. Same with the driver changes, which need case by case
evaluation.

__FreeBSD_version bump will follow.

Tested-by: (i386)LINT


130407 13-Jun-2004 dfr

Add a new driver to support IP over firewire. This driver is intended to
conform to the rfc2734 and rfc3146 standard for IP over firewire and
should eventually supersede the fwe driver. Right now the broadcast
channel number is hardwired and we don't support MCAP for multicast
channel allocation - more infrastructure is required in the firewire
code itself to fix these problems.


130398 13-Jun-2004 rwatson

Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.


130387 12-Jun-2004 rwatson

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) in macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


130363 11-Jun-2004 csjp

Modify ip fw so that whenever UID or GID constraints exist in a
ruleset, the pcb is looked up once per ipfw_chk() activation.

This is done by extracting the required information out of the PCB
and caching it to the ipfw_chk() stack. This should greatly reduce
PCB looking contention and speed up the processing of UID/GID based
firewall rules (especially with large UID/GID rulesets).

Some very basic benchmarks were taken which compare the number
of in_pcblookup_hash(9) activations to the number of firewall
rules containing UID/GID based constraints before and after this patch.

The results can be viewed here:
o http://people.freebsd.org/~csjp/ip_fw_pcb.png

Reviewed by: andre, luigi, rwatson
Approved by: bmilekic (mentor)
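
The lookup-once idea above can be sketched as follows. This is a minimal
illustration under stated assumptions: struct ugid_cache,
expensive_pcb_lookup() and rule_matches_uid() are hypothetical names, and
the stub lookup merely stands in for the real PCB lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-packet cache kept on the caller's stack. */
struct ugid_cache {
	bool looked_up;		/* the lookup has already been attempted */
	bool found;		/* the lookup found a matching PCB */
	unsigned uid;
	unsigned gid;
};

static int pcb_lookups;		/* counts the expensive lookups performed */

/* Stand-in for the real PCB lookup; always "finds" uid 1001, gid 20. */
static bool
expensive_pcb_lookup(unsigned *uid, unsigned *gid)
{
	pcb_lookups++;
	*uid = 1001;
	*gid = 20;
	return (true);
}

/* Evaluate one UID rule, performing the lookup at most once per cache. */
static bool
rule_matches_uid(struct ugid_cache *c, unsigned want_uid)
{
	assert(c != NULL);
	if (!c->looked_up) {
		c->found = expensive_pcb_lookup(&c->uid, &c->gid);
		c->looked_up = true;
	}
	return (c->found && c->uid == want_uid);
}
```

With a zero-initialized cache per packet, any number of UID rules in the
ruleset costs a single lookup.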


130337 11-Jun-2004 rwatson

Remove unneeded Giant acquisition in divert_packet(), which is
left over from debug.mpsafenet affecting only the forwarding
plane. Giant is now acquired in the ithread/netisr or in the
system call code.


130333 11-Jun-2004 rwatson

Lock down parallel router_info list for tracking multicast IGMP
versions of various routers seen:

- Introduce igmp_mtx.
- Protect global variable 'router_info_head' and list fields
in struct router_info with this mutex, as well as
igmp_timers_are_running.
- find_rti() asserts that the caller acquires igmp_mtx.
- Annotate a failure to check the return value of
MALLOC(..., M_NOWAIT).


130311 10-Jun-2004 ru

init_tables() must be run after sys/net/route.c:route_init().


130281 09-Jun-2004 ru

Introduce a new feature to IPFW2: lookup tables. These are useful
for handling large sparse address sets. Initial implementation by
Vsevolod Lobko <seva@ip.net.ua>, refined by me.

MFC after: 1 week


130183 07-Jun-2004 ume

do not send icmp response if the original packet is encrypted.

Obtained from: KAME
MFC after: 1 week


130024 03-Jun-2004 bmilekic

Move the locking of the pcb into raw_output(). Organize code so
that m_prepend() is not called with possibility to wait while the
pcb lock is held. What still needs revisiting is whether the
ripcbinfo lock is really required here.

Discussed with: rwatson


129880 30-May-2004 phk

add missing #include <sys/module.h>


129876 30-May-2004 phk

Add some missing <sys/module.h> includes which are masked by the
one on death-row in <sys/kernel.h>


129720 25-May-2004 csjp

Add a super-user check to ipfw_ctl() to make sure that the calling
process is a non-prison root. The security.jail.allow_raw_sockets
sysctl variable is disabled by default, however if the user enables
raw sockets in prisons, prison-root should not be able to interact
with firewall rule sets.

Approved by: rwatson, bmilekic (mentor)


129465 20-May-2004 yar

When checking for possible port theft, skip over a TCP inpcb
unless it's in the closed or listening state (remote address
== INADDR_ANY).

If a TCP inpcb is in any other state, it's impossible to steal
its local port or use it for port theft. And if there are
both closed/listening and connected TCP inpcbs on the same
localIP:port couple, the call to in_pcblookup_local() will
find the former due to the design of that function.

No objections raised in: -net, -arch
MFC after: 1 month


129126 11-May-2004 maxim

o Calculate a number of bytes to copy (cnt) correctly:

+----+-+-+-+-+----+----+- - - - - - - - - - - - -+----+
| | |C| | | | | | |
| IP |N|O|L|P| | IP | | IP |
| #1 |O|D|E|T| | #2 | | #n |
| |P|E|N|R| | | | |
+----+-+-+-+-+----+----+- - - - - - - - - - - - -+----+
^ ^<---- cnt - (IPOPT_MINOFF - 1) ---->|
| |
src | +-- cp[IPOPT_OFF + 1] + sizeof(struct in_addr)
|
dst +-- cp[IPOPT_OFF + 1]

PR: kern/66386
Submitted by: Andrei Iltchenko
MFC after: 3 weeks


129019 07-May-2004 maxim

o IFNAMSIZ does include the trailing \0.

Approved by: andre

o Document net.inet.icmp.reply_src.


129017 06-May-2004 andre

Provide the sysctl net.inet.ip.process_options to control the processing
of IP options.

net.inet.ip.process_options=0 Ignore IP options and pass packets unmodified.
net.inet.ip.process_options=1 Process all IP options (default).
net.inet.ip.process_options=2 Reject all packets with IP options with ICMP
filter prohibited message.

This sysctl affects packets destined for the local host as well as those
only transiting through the host (routing).

IP options do not have any legitimate purpose anymore and are only used
to circumvent firewalls or to exploit certain behaviours or bugs in TCP/IP
stacks.

Reviewed by: sam (mentor)


128905 04-May-2004 rwatson

Switch to using the inpcb MAC label instead of socket MAC label when
labeling new mbufs created from sockets/inpcbs in IPv4. This helps avoid
the need for socket layer locking in the lower level network paths
where inpcb locks are already frequently held where needed. In
particular:

- Use the inpcb for label instead of socket in raw_append().
- Use the inpcb for label instead of socket in tcp_output().
- Use the inpcb for label instead of socket in tcp_respond().
- Use the inpcb for label instead of socket in tcp_twrespond().
- Use the inpcb for label instead of socket in syncache_respond().

While here, modify tcp_respond() to avoid assigning NULL to a stack
variable and centralize assertions about the inpcb when inp is
assigned.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


128904 04-May-2004 rwatson

Assert inpcb lock in udp_append().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


128903 04-May-2004 rwatson

Assert the inpcb lock on 'last' in udp_append(), since it's always
called with it, and also requires it.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


128880 03-May-2004 maxim

o Fix misindentation in the previous commit.


128877 03-May-2004 andre

Back out a change that slipped into the previous commit for which other
supporting parts have not yet been committed.

Remove premature IP options ignoring option.


128872 03-May-2004 andre

Optimize IP fastforwarding some more:

o New function ip_findroute() to reduce code duplication for the
route lookup cases. (luigi)

o Store ip_len in host byte order on the stack instead of using
it via indirection from the mbuf. This allows deferring the host
byte order conversion to a later point and makes for a quicker fallback to
normal ip_input() processing. (luigi)

o Check if route is dampened with RTF_REJECT flag and drop packet
already here when ARP is unable to resolve destination address.
An ICMP unreachable is sent to inform the sender.

o Check if interface output queue is full and drop packet already
here. No ICMP notification is sent because signalling source quench
is deprecated.

o Check if media_state is down (used for ethernet type interfaces)
and drop the packet already here. An ICMP unreachable is sent to
inform the sender.

o Do not account sent packets to the interface address counters. They
are only for packets with that 'ia' as source address.

o Update and clarify some comments.

Submitted by: luigi (most of it)


128829 02-May-2004 darrenr

Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.


128828 02-May-2004 darrenr

oops, I forgot this file in a prior commit (change was still sitting here,
uncommitted):

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h
alongside all of the other macros that work on mbufs and tags.


128816 02-May-2004 darrenr

Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg
(the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside
all of the other macros that work on mbufs and tags.


128664 26-Apr-2004 bmilekic

Give jail(8) the feature to allow raw sockets from within a
jail, which is less restrictive but allows for more flexible
jail usage (for those who are willing to make the sacrifice).
The default is off, but allowing raw sockets within jails can
now be accomplished by tuning security.jail.allow_raw_sockets
to 1.

Turning this on will allow you to use things like ping(8)
or traceroute(8) from within a jail.

The patch being committed is not identical to the patch
in the PR. The committed version is more friendly to
APIs which pjd is working on, so it should integrate
into his work quite nicely. This change has also been
presented and addressed on the freebsd-hackers mailing
list.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
PR: kern/65800


128653 26-Apr-2004 silby

Tighten up reset handling in order to make reset attacks as difficult as
possible while maintaining compatibility with the widest range of TCP stacks.

The algorithm is as follows:

---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.

For connections in all other states, a reset anywhere in the window
will cause the connection to be reset. All other segments will be
silently dropped.
---

The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.

Idea by: Darren Reed
Reviewed by: truckman, jlemon, others(?)
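
The algorithm above can be sketched as a single predicate. This is an
illustration only: struct conn, the state names, and the assumption that
the receive window starts at last_ack_sent are simplifications, not the
kernel's struct tcpcb or the actual tcp_input() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified connection state. */
enum conn_state { CONN_SYN_RECEIVED, CONN_ESTABLISHED, CONN_FIN_WAIT_1 };

struct conn {
	enum conn_state state;
	uint32_t last_ack_sent;	/* sequence number we last acknowledged */
	uint32_t rcv_wnd;	/* receive window size in bytes */
};

/* True if seq lies in [start, start + len), modulo 2^32. */
static bool
seq_in_window(uint32_t seq, uint32_t start, uint32_t len)
{
	return ((uint32_t)(seq - start) < len);
}

/* Decide whether a RST carrying sequence number seq resets the connection. */
static bool
rst_is_acceptable(const struct conn *c, uint32_t seq)
{
	assert(c != NULL);
	/* ESTABLISHED: only an exact match on last_ack_sent is honoured. */
	if (c->state == CONN_ESTABLISHED)
		return (seq == c->last_ack_sent);
	/* All other states: any in-window RST is honoured. */
	return (seq_in_window(seq, c->last_ack_sent, c->rcv_wnd));
}
```

A segment for which the predicate is false would be silently dropped.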


128645 25-Apr-2004 luigi

Another small set of changes to reduce diffs with the new arp code.


128642 25-Apr-2004 luigi

remove a stale comment on the behaviour of arpresolve


128641 25-Apr-2004 luigi

Start the arp timer at init time.
It runs so rarely that it makes no sense to wait until the first request.


128636 25-Apr-2004 luigi

This commit does two things:

1. rt_check() cleanup:
rt_check() is only necessary for some address families to gain access
to the corresponding arp entry, so call it only in/near the *resolve()
routines where it is actually used -- at the moment this is
arpresolve(), nd6_storelladdr() (the call is embedded here),
and atmresolve() (the call is just before atmresolve to reduce
the number of changes).
This change will make it a lot easier to decouple the arp table
from the routing table.

There is an extra call to rt_check() in if_iso88025subr.c to
determine the routing info length. I have left it alone for
the time being.

The interface of arpresolve() and nd6_storelladdr() now changes slightly:
+ the 'rtentry' parameter (really a hint from the upper level layer)
is now passed unchanged from *_output(), so it becomes the route
to the final destination and not to the gateway.
+ the routines will return 0 if resolution is possible, non-zero
otherwise.
+ arpresolve() returns EWOULDBLOCK in case the mbuf is being held
waiting for an arp reply -- in this case the error code is masked
in the caller so the upper layer protocol will not see a failure.

2. arpcom untangling
Where possible, use 'struct ifnet' instead of 'struct arpcom' variables,
and use the IFP2AC macro to access arpcom fields.
This mostly affects the netatalk code.

=== Detailed changes: ===
net/if_arcsubr.c
rt_check() cleanup, remove a useless variable

net/if_atmsubr.c
rt_check() cleanup

net/if_ethersubr.c
rt_check() cleanup, arpcom untangling

net/if_fddisubr.c
rt_check() cleanup, arpcom untangling

net/if_iso88025subr.c
rt_check() cleanup

netatalk/aarp.c
arpcom untangling, remove a block of duplicated code

netatalk/at_extern.h
arpcom untangling

netinet/if_ether.c
rt_check() cleanup (change arpresolve)

netinet6/nd6.c
rt_check() cleanup (change nd6_storelladdr)


128593 23-Apr-2004 silby

Wrap two long lines in the previous commit.


128592 23-Apr-2004 andre

Correct an edge case in tcp_mss() where the cached path MTU
from tcp_hostcache would have overridden a (now) lower MTU of
an interface or route that changed since first PMTU discovery.
The bug would have caused TCP to redo the PMTU discovery when
not strictly necessary.

Make a comment about already pre-initialized default values
more clear.

Reviewed by: sam


128575 23-Apr-2004 andre

Add the option versrcreach to verify that a valid route to the
source address of a packet exists in the routing table. The
default route is ignored because it would match everything and
render the check pointless.

This option is very useful for routers with a complete view of
the Internet (BGP) in the routing table to reject packets with
spoofed or unrouteable source addresses.

Example:

ipfw add 1000 deny ip from any to any not versrcreach

also known in Cisco-speak as:

ip verify unicast source reachable-via any

Reviewed by: luigi


128574 23-Apr-2004 andre

Fix a potential race when purging expired hostcache entries.

Spotted by: luigi


128548 22-Apr-2004 silby

Take out an unneeded variable I forgot to remove in the last commit,
and make two small whitespace fixes so that diffs vs rev 1.142 are minimal.


128547 22-Apr-2004 silby

Simplify random port allocation, and add net.inet.ip.portrange.randomized,
which can be used to turn off randomized port allocation if so desired.

Requested by: alfred
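
A minimal sketch of how such a toggle might select the first candidate
port, assuming the non-randomized mode falls back to sequential allocation
with wraparound. struct port_range and pick_start_port() are hypothetical
names, and rand() stands in for the kernel's random source; the caller
would still have to check that the chosen port is free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical ephemeral port range with the last port handed out. */
struct port_range {
	uint16_t first;
	uint16_t last;
	uint16_t lastport;
};

/* Pick the next candidate port, randomly or sequentially. */
static uint16_t
pick_start_port(struct port_range *r, int randomized)
{
	uint32_t count;

	assert(r->first <= r->last);
	count = (uint32_t)r->last - r->first + 1u;

	if (randomized) {
		/* Uniform-ish choice anywhere in [first, last]. */
		r->lastport = (uint16_t)(r->first + (uint32_t)rand() % count);
	} else if (r->lastport < r->first || r->lastport >= r->last) {
		/* Out of range or at the end: wrap to the start. */
		r->lastport = r->first;
	} else {
		r->lastport++;
	}
	return (r->lastport);
}
```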


128493 20-Apr-2004 bms

Fix a typo in a comment.


128453 20-Apr-2004 silby

Switch from using sequential to random ephemeral port allocation,
implementation taken directly from OpenBSD.

I've resisted committing this for quite some time because of concern over
TIME_WAIT recycling breakage (sequential allocation ensures that there is a
long time before ports are recycled), but recent testing has shown me that
my fears were unwarranted.


128452 20-Apr-2004 silby

Enhance our RFC1948 implementation to perform better in some pathological
TIME_WAIT recycling cases I was able to generate with http testing tools.

In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number. As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.

I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.

Except in such contrived benchmarking situations, this problem should never
come up in normal usage... until networks get faster.

No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.
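
The fix can be sketched as follows, assuming a time component that advances
by a fixed amount per clock tick and by a small random positive step when
called again within the same tick. The struct, the constants, and the use
of rand() are illustrative assumptions, not the values or random source
used by the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ISN_PER_TICK	4096u	/* assumed advance per clock tick */
#define ISN_MAX_STEP	256u	/* assumed largest random step */

/* Hypothetical state for the time-offset part of an ISN generator. */
struct isn_clock {
	uint32_t last_tick;	/* tick value seen on the previous call */
	uint32_t offset;	/* time component; grows on every call */
};

/* Return a time offset that increases on every call (modulo 2^32). */
static uint32_t
isn_time_offset(struct isn_clock *clk, uint32_t now_ticks)
{
	assert(clk != NULL);
	if (now_ticks != clk->last_tick) {
		/* The clock moved: advance by the elapsed ticks. */
		clk->offset += (now_ticks - clk->last_tick) * ISN_PER_TICK;
		clk->last_tick = now_ticks;
	} else {
		/* Same tick: add a random step in [1, ISN_MAX_STEP]. */
		clk->offset += 1u + (uint32_t)rand() % ISN_MAX_STEP;
	}
	return (clk->offset);
}
```

Two connections generated between timer ticks then receive different,
increasing time components rather than identical ones.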


128398 18-Apr-2004 luigi

Replace Bcopy with 'the real thing' as in the rest of the file.


128210 14-Apr-2004 luigi

In an effort to simplify the routing code, try to deprecate rtalloc()
in favour of rtalloc_ign(), which is what would end up being called
anyways.

There are 25 more instances of rtalloc() in net*/ and
about 10 instances of rtalloc_ign()


128019 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


128003 07-Apr-2004 ru

Fixed a bug in previous revision: compute the payload checksum before
we convert ip_len into a network byte order; in_delayed_cksum() still
expects it in host byte order.

The symptom was the ``in_cksum_skip: out of data by %d'' complaints
from the kernel.

To add to the previous commit log. These fixes make tcpdump(1) happy
by not complaining about UDP/TCP checksum being bad for looped back
IP multicast when multicast router is deactivated.

Reported by: Vsevolod Lobko


127936 06-Apr-2004 bde

Fixed misspelling of IPPORT_MAX as USHRT_MAX. Don't include <sys/limits.h>
to implement this mistake.

Fixed some nearby style bugs (initialization in declaration, misformatting
of this initialization, missing blank line after the declaration, and
comparison of the non-boolean result of the initialization with 0 using
"!". In KNF, "!" is not even used to compare booleans with 0).


127871 05-Apr-2004 rwatson

Two missed in previous commit -- compare pointer with NULL rather than
using it as a boolean.


127870 05-Apr-2004 rwatson

Prefer NULL to 0 when checking pointer values as integers or booleans.


127862 04-Apr-2004 pjd

Fix a panic possibility caused by returning without releasing locks.
It was fixed by moving problematic checks, as well as checks that
don't need locking, before locks are acquired.

Submitted by: Ryan Sommers <ryans@gamersimpact.com>
In co-operation with: cperciva, maxim, mlaier, sam
Tested by: submitter (previous patch), me (current patch)
Reviewed by: cperciva, mlaier (previous patch), sam (current patch)
Approved by: sam
Dedicated to: enough!


127828 04-Apr-2004 luigi

+ arpresolve(): remove an unused argument
+ struct ifnet: remove unused fields, move ipv6-related field close
to each other, add a pointer to l3<->l2 translation tables (arp,nd6,
etc.) for future use.

+ struct route: remove an unused field, move close to each
other some fields that might likely go away in the future


127757 02-Apr-2004 deischen

Unbreak natd.

Reported and submitted by: Sean McNeil (sean at mcneil.com)


127690 31-Mar-2004 des

Raise WARNS level to 2.


127689 31-Mar-2004 des

Deal with aliasing warnings.

Reviewed by: ru
Approved by: silence on the lists


127535 28-Mar-2004 rwatson

Invert the logic of NET_LOCK_GIANT(), and remove the one reference to it.
Previously, Giant would be grabbed at entry to the IP local delivery code
when debug.mpsafenet was set to true, as that implied Giant wouldn't be
grabbed in the driver path. Now, we will use this primitive to
conditionally grab Giant in the event the entire network stack isn't
running MPSAFE (debug.mpsafenet == 0).


127526 28-Mar-2004 pjd

Remove unused argument.


127505 27-Mar-2004 pjd

Reduce 'td' argument to 'cred' (struct ucred) argument in those functions:
- in_pcbbind(),
- in_pcbbind_setup(),
- in_pcbconnect(),
- in_pcbconnect_setup(),
- in6_pcbbind(),
- in6_pcbconnect(),
- in6_pcbsetport().
"It should simplify/clarify things a great deal." --rwatson

Requested by: rwatson
Reviewed by: rwatson, ume


127504 27-Mar-2004 pjd

Remove unused argument.

Reviewed by: ume


127463 26-Mar-2004 ume

Validate IPv6 socket options more carefully to avoid a panic.

PR: kern/61513
Reviewed by: cperciva, nectar


127408 25-Mar-2004 pjd

Remove unused function.
It was used in FreeBSD 4.x, but now we're using cr_canseesocket().


127396 25-Mar-2004 ru

Untangle IP multicast routing interaction with delayed payload checksums.

Compute the payload checksum for a locally originated IP multicast where
God intended, in ip_mloopback(), rather than doing it in ip_output() and
only when multicast router is active. This is more correct as we do not
fool ip_input() that the packet has the correct payload checksum when in
fact it does not (when multicast router is inactive). This is also more
efficient if we don't join the multicast group we send to, thus allowing
the hardware to checksum the payload.


127307 22-Mar-2004 rwatson

Lock down global variables in if_gre:

- Add gre_mtx to protect global softc list.
- Hold gre_mtx over various list operations (insert, delete).
- Centralize if_gre interface teardown in gre_destroy(), and call this
from modevent unload and gre_clone_destroy().
- Export gre_mtx to ip_gre.c, which walks the gre list to look up gre
interfaces during encapsulation. Add a wonking comment on how we need
some sort of drain/reference count mechanism to keep gre references
alive while in use and simultaneous destroy.

This commit does not lock down softc data, which follows in a future
commit.


127277 21-Mar-2004 mdodd

- Fix indentation lost by 'diff -b'.
- Un-wrap short line.


127261 21-Mar-2004 mdodd

Remove interface type specific code from arprequest(), and in_arpinput().

The AF_ARP case in the (*if_output)() routine will handle the interface type
specific bits.

Obtained from: NetBSD


127094 16-Mar-2004 des

Run through indent(1) so I can read the code without getting a headache.
The result isn't quite knf, but it's knfer than the original, and far
more consistent.


126936 14-Mar-2004 mdodd

De-register.


126792 10-Mar-2004 rwatson

Lock down IP-layer encapsulation library:

- Add encapmtx to protect ip_encap.c global variables (encapsulation
list).
- Unifdef #ifdef 0 pieces of encap_init() which was (and now really
is) basically a no-op.
- Lock encapmtx when walking encaptab, modifying it, comparing
entries, etc.
- Remove spl's.

Note that currently there's no facility to make sure outstanding
use of encapsulation methods on a table entry have drained before
we allow a table entry to be removed. As such, it's currently the
caller's responsibility to make sure that draining takes place.

Reviewed by: mlaier


126791 10-Mar-2004 rwatson

Scrub unused variable zeroin_addr.


126741 08-Mar-2004 hsu

To comply with the spec, do not copy the TOS from the outer IP
header to the inner IP header of the PIM Register if this is a PIM
Null-Register message.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


126740 08-Mar-2004 hsu

Include <sys/types.h> for autoconf/automake detection.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


126513 03-Mar-2004 mlaier

Add some missing DUMMYNET_UNLOCK() in config_pipe().

Noticed by: Simon Coggins
Approved by: bms(mentor)


126486 02-Mar-2004 mlaier

Two minor follow-ups on the MT_TAG removal:
ifp is now passed explicitly to ether_demux; no need to look it up again.
Make mtag a global var in ip_input.

Noticed by: rwatson
Approved by: bms(mentor)


126467 01-Mar-2004 rwatson

Rename NET_PICKUP_GIANT() to NET_LOCK_GIANT(), and NET_DROP_GIANT()
to NET_UNLOCK_GIANT(). While they are used in similar ways, the
semantics are quite different -- NET_LOCK_GIANT() and NET_UNLOCK_GIANT()
directly wrap mutex lock and unlock operations, whereas drop/pickup
special case the handling of Giant recursion. Add a comment saying
as much.

Add NET_ASSERT_GIANT(), which conditionally asserts Giant based
on the value of debug_mpsafenet.


126456 01-Mar-2004 ume

fix -O0 compilation without INET6.

Pointed out by: ru


126368 28-Feb-2004 rwatson

Remove unneeded {} originally used to hold local variables for dummynet
in a code block, as the variable is now gone.

Submitted by: sam


126351 28-Feb-2004 rwatson

Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These
were needed by the MAC Framework until inpcbs gained labels.

Submitted by: sam


126264 26-Feb-2004 mlaier

Bring eventhandler callbacks for pf.
This enables pf to track dynamic address changes on interfaces (dialup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.

Approved by: bms(mentor)


126263 26-Feb-2004 mlaier

Tweak existing header and other build infrastructure to be able to build
pf/pflog/pfsync as modules. Do not list them in NOTES or modules/Makefile
(i.e. do not connect it to any (automatic) builds - yet).

Approved by: bms(mentor)


126253 26-Feb-2004 truckman

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


126239 25-Feb-2004 mlaier

Re-remove MT_TAGs. The problems with dummynet have been fixed now.

Tested by: -current, bms(mentor), me
Approved by: bms(mentor), sam


126226 25-Feb-2004 bde

Fixed namespace pollution in rev.1.74. Implementation of the syncache
increased <netinet/tcp_var.h>'s already large set of prerequisites, and
this was handled badly. Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.

Approved by: jlemon (years ago, for a more invasive fix)


126225 25-Feb-2004 bde

Don't use the negatively-opaque type uma_zone_t or be chummy with
<vm/uma.h>'s idempotency identifier or its misspelling.


126220 25-Feb-2004 hsu

Relax a KASSERT condition to allow for a valid corner case where
the FIN on the last segment consumes an extra sequence number.

Spurious panic reported by Mike Silbersack <silby@silby.com>.


126193 24-Feb-2004 andre

Convert the tcp segment reassembly queue to UMA and limit the maximum
number of segments it will hold.

The following tunables and sysctls control the behaviour of the tcp
segment reassembly queue:

net.inet.tcp.reass.maxsegments (loader tunable)
specifies the maximum number of segments all tcp reassembly queues can
hold (defaults to 1/16 of nmbclusters).

net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).

net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.

net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.

Tested by: bms, silby
Reviewed by: bms, silby
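
How the four variables above interact can be sketched as an admission
check. The struct and function names are hypothetical, and any exceptions
the real code may make (for example for particular segments) are not
modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirrors of the tunables and counters listed above. */
struct reass_limits {
	int maxsegments;	/* global cap across all queues */
	int maxqlen;		/* cap for one session's queue */
};

struct reass_stats {
	int cursegments;	/* segments currently queued globally */
	int overflows;		/* times either limit was hit */
};

struct reass_queue {
	int qlen;		/* segments queued for this session */
};

/* Try to account for one more queued segment; false means drop it. */
static bool
reass_admit(const struct reass_limits *lim, struct reass_stats *st,
    struct reass_queue *q)
{
	assert(lim != NULL && st != NULL && q != NULL);
	if (st->cursegments >= lim->maxsegments || q->qlen >= lim->maxqlen) {
		st->overflows++;
		return (false);
	}
	st->cursegments++;
	q->qlen++;
	return (true);
}
```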


126002 19-Feb-2004 pjd

Fixed ucred structure leak.

Approved by: scottl (mentor)
PR: 54163
MFC after: 3 days


125952 18-Feb-2004 mlaier

Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is
not working properly with the patch in place.

Approved by: bms(mentor)


125941 17-Feb-2004 ume

IPSEC and FAST_IPSEC have the same internal API now;
so merge these (IPSEC has an extra ipsecstat)

Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>


125890 16-Feb-2004 bms

Shorten the name of the socket option used to enable TCP-MD5 packet
treatment.

Submitted by: Vincent Jardin


125875 16-Feb-2004 ume

don't update outgoing ifp, if ipsec tunnel mode encapsulation
was not made.

Obtained from: KAME


125870 16-Feb-2004 bms

Spell types consistently throughout this file. Do not use the __packed
attribute, as we are often #include'd from userland without <sys/cdefs.h>
in front of us, and it is not strictly necessary.

Noticed by: Sascha Blank


125819 14-Feb-2004 bms

Final brucification pass. Spell types consistently (u_int). Remove bogus
casts. Remove unnecessary parenthesis.

Submitted by: bde


125791 13-Feb-2004 mlaier

Do not expose ip_dn_find_rule inline function to userland and unbreak world.


125785 13-Feb-2004 mlaier

Do not check receive interface when pfil(9) hook changed address.

Approved by: bms(mentor)


125784 13-Feb-2004 mlaier

This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).

This is (mostly) work from: sam

Silence from: -arch
Approved by: bms(mentor), sam, rwatson


125783 13-Feb-2004 bms

Brucification.

Submitted by: bde


125776 13-Feb-2004 ume

supported IPV6_RECVPATHMTU socket option.

Obtained from: KAME


125742 12-Feb-2004 bms

Update the prototype for tcpsignature_apply() to reflect the spelling of
the types used by m_apply()'s callback function, f, as documented in mbuf(9).

Noticed by: njl


125741 12-Feb-2004 bms

style(9) pass; whitespace and comments.

Submitted by: njl


125740 12-Feb-2004 bms

Remove an unnecessary initialization that crept in from the code which
verifies TCP-MD5 digests.

Noticed by: njl


125698 11-Feb-2004 bms

Fix a typo; left out preprocessor conditional for sigoff variable, which
is only used by TCP_SIGNATURE code.

Noticed by: Roop Nanuwa


125680 11-Feb-2004 bms

Initial import of RFC 2385 (TCP-MD5) digest support.

This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.

For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.

Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.

There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.

Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.

This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.

Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.

Sponsored by: sentex.net


125396 03-Feb-2004 ume

pass pcb rather than so. it is expected that per socket policy
works again.


125360 02-Feb-2004 andre

Add sysctl net.inet.icmp.reply_src to specify the interface name
used for the ICMP reply source in response to packets which are not
directly addressed to us. By default continue with normal
source selection.

Reviewed by: bms


125349 02-Feb-2004 andre

More verbose description of the source ip address selection for ICMP replies.

Reviewed by: bms


125264 31-Jan-2004 phk

Introduce the SO_BINTIME option which takes a high-resolution timestamp
at packet arrival.

For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.

This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.


125226 30-Jan-2004 sobomax

Remove NetBSD'isms (add FreeBSD'isms?), which makes gre(4) working again.


125118 27-Jan-2004 ru

Correct the descriptions of the net.inet.{udp,raw}.recvspace sysctls.


125024 26-Jan-2004 sobomax

Add support for WCCPv2. It should be enabled manually using link2
ifconfig(8) flag since header for version 2 is the same but IP payload
is prepended with additional 4-bytes field.

Inspired by: Roman Synyuk <roman@univ.kiev.ua>
MFC after: 2 weeks


125020 26-Jan-2004 sobomax

(whitespace-only)

Kill trailing spaces.


124851 23-Jan-2004 andre

Remove leftover FREE() from changes in rev 1.50.

Noticed by: Jun Kuriyama <kuriyama@imgsrc.co.jp>


124849 22-Jan-2004 andre

Split the overloaded variable 'win' into two for their specific purposes:
recwin and sendwin. This removes a big source of confusion and makes
following the code much easier.

Reviewed by: sam (mentor)
Obtained from: DragonFlyBSD rev 1.6 (hsu)


124848 22-Jan-2004 andre

Move the reduction by one of the syncache limit after the zone has been
allocated.

Reviewed by: sam (mentor)
Obtained from: DragonFlyBSD rev 1.6 (hsu)


124847 22-Jan-2004 andre

Remove an unused variable and put the sockaddr_in6 onto the stack instead
of malloc'ing it.

Reviewed by: sam (mentor)
Obtained from: DragonFlyBSD rev 1.6 (hsu)


124761 20-Jan-2004 hsu

Merge from DragonFlyBSD rev 1.10:

date: 2003/09/02 10:04:47; author: hsu; state: Exp; lines: +5 -6
Account for when Limited Transmit is not congestion window limited.

Obtained from: DragonFlyBSD


124621 17-Jan-2004 phk

Mostly mechanical rework of libalias:

Makes it possible to have multiple packet aliasing instances in a
single process by moving all static and global variables into an
instance structure called "struct libalias".

Redefine a new API based on s/PacketAlias/LibAlias/g

Add new "instance" argument to all functions in the new API.

Implement old API in terms of the new API.


124464 13-Jan-2004 ume

do not deref freed pointer

Submitted by: "Bjoern A. Zeeb" <bzeeb+freebsd@zabbadoz.net>
Reviewed by: itojun


124437 12-Jan-2004 andre

Disable the minmssoverload connection drop by default until the detection
logic is refined.


124336 10-Jan-2004 truckman

Check that sa_len is the appropriate value in tcp_usr_bind(),
tcp6_usr_bind(), tcp_usr_connect(), and tcp6_usr_connect() before checking
to see whether the address is multicast so that the proper errno value
will be returned if sa_len is incorrect. The checks are identical to the
ones in in_pcbbind_setup(), in6_pcbbind(), and in6_pcbladdr(), which are
called after the multicast address check passes.

MFC after: 30 days


124290 09-Jan-2004 andre

Reduce TCP_MINMSS default to 216. The AX.25 protocol (packet radio)
is frequently used with an MTU of 256 because of slow speeds and a
high packet loss rate.


124258 08-Jan-2004 andre

Limiters and sanity checks for TCP MSS (maximum segment size)
resource exhaustion attacks.

For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.

The resource exhaustion works in two ways:

o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.

For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limits of network components along the path.

This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).

We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.

o the local host is receiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may choose to do what is described in the first attack and
send the data in packets with a TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.

For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).

This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.

We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is reset and dropped.
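
The figures quoted in the two attack descriptions can be reproduced with a short sketch. This is an illustration only, not the kernel code: the function names are hypothetical, and it assumes 52 bytes of per-packet overhead (20 bytes IP, 20 bytes TCP, 12 bytes of TCP options) and 1 Mbit = 1,000,000 bits, which is what makes the 1448, 12, 120 and 4716 values come out.

```c
#include <assert.h>

/* Assumed per-packet overhead: 20 bytes IP + 20 bytes TCP + 12 bytes of
 * TCP options. This is an assumption made for the illustration. */
#define HDR_OVERHEAD 52

/* TCP payload that fits in one packet for a given MTU. */
static unsigned payload_for_mtu(unsigned mtu)
{
    assert(mtu > HDR_OVERHEAD);
    return mtu - HDR_OVERHEAD;
}

/* How many times more packets are needed at small_mtu than at big_mtu
 * to move the same amount of data (integer division, rounded down). */
static unsigned packet_amplification(unsigned big_mtu, unsigned small_mtu)
{
    return payload_for_mtu(big_mtu) / payload_for_mtu(small_mtu);
}

/* Packets per second that fit into bits_per_sec when every packet
 * carries payload_bytes of data plus the assumed header overhead. */
static unsigned long packets_per_second(unsigned long bits_per_sec,
                                        unsigned payload_bytes)
{
    return (bits_per_sec / 8) / (HDR_OVERHEAD + payload_bytes);
}

/* Hypothetical clamp in the spirit of net.inet.tcp.minmss: never accept
 * a peer-offered value below the configured floor. */
static unsigned clamp_mss(unsigned offered, unsigned minmss)
{
    return offered < minmss ? minmss : offered;
}
```

Under these assumptions packet_amplification(1500, 64) gives the 120-fold increase from the first case, and packets_per_second(2000000, 1) gives the 4716 packets per second from the second.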

MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day


124248 08-Jan-2004 andre

If path mtu discovery is enabled set the DF bit in all cases we
send packets on a tcp connection.

PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)


124247 08-Jan-2004 andre

Do not set the ip_id to zero when DF is set on packet and
restore the general pre-randomid behaviour.

Setting the ip_id to zero causes several problems with
packet reassembly when a device along the path removes
the DF bit for some reason.

Other BSD and Linux have found and fixed the same issues.

PR: kern/60889
Tested by: Richard Wendland <richard@wendland.org.uk>
Approved by: re (scottl)


124199 06-Jan-2004 andre

Enable the following TCP options by default to give it more exposure:

rfc3042 Limited transmit
rfc3390 Increasing TCP's initial congestion Window
inflight TCP inflight bandwidth limiting

All my production servers have them enabled and there have been no
issues. I am confident about having them on by default and it gives
us better overall TCP performance.

Reviewed by: sam (mentor)


124198 06-Jan-2004 andre

According to RFC1812 we have to ignore ICMP redirects when we
are acting as router (ipforwarding enabled).

This doesn't fix the problem that host routes from ICMP redirects
are never removed from the kernel routing table but removes the
problem for machines doing packet forwarding.

Reviewed by: sam (mentor)


123998 30-Dec-2003 ru

Document the net.inet.ip.subnets_are_local sysctl.


123992 30-Dec-2003 sobomax

Sync with NetBSD:

if_gre.c rev.1.41-1.49

o Spell output with two ts.
o Remove assigned-to but not used variable.
o fix grammatical error in a diagnostic message.
o u_short -> u_int16_t.
o gi_len is ip_len, so it has to be network byteorder.

if_gre.h rev.1.11-1.13

o prototype must not have variable name.
o u_short -> u_int16_t.
o Spell address with two d's.

ip_gre.c rev.1.22-1.29

o KNF - return is not a function.
o The "osrc" variable in gre_mobile_input() is only ever set but not
referenced; remove it.
o correct (false) assumptions on mbuf chain. not sure if it really helps, but
anyways, it is necessary to perform m_pullup.
o correct arg to m_pullup (need to count IP header size as well).
o remove redundant adjustment of m->m_pkthdr.len.
o clear m_flags just for safety.
o tabify.
o u_short -> u_int16_t.

MFC after: 2 weeks


123922 28-Dec-2003 sam

o eliminate widespread on-stack mbuf use for bpf by introducing
a new bpf_mtap2 routine that does the right thing for an mbuf
and a variable-length chunk of data that should be prepended.
o while we're sweeping the drivers, use u_int32_t uniformly when
prepending the address family (several places were assuming
sizeof(int) was 4)
o return M_ASSERTVALID to BPF_MTAP* now that all stack-allocated
mbufs have been eliminated; this may better be moved to the bpf
routines

Reviewed by: arch@ and several others


123893 27-Dec-2003 maxim

o Fix a comment: softticks lives in sys/kern/kern_timeout.c.

PR: kern/60613
Submitted by: Gleb Smirnoff
MFC after: 3 days


123809 24-Dec-2003 ume

NULL is not 0.

Submitted by: "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>


123768 23-Dec-2003 ru

I didn't notice it right away, but check the right length too.


123765 23-Dec-2003 ru

Fix a problem introduced in revision 1.84: m_pullup() does not
necessarily return the same mbuf chain so we need to recompute
mtod() consumers after pulling up.


123740 23-Dec-2003 peter

Catch a few places where NULL (pointer) was used where 0 (integer) was
expected.


123690 20-Dec-2003 sam

o move mutex init/destroy logic to the module load/unload hooks;
otherwise they are initialized twice when the code is statically
configured in the kernel because the module load method gets
invoked before the user application calls ip_mrouter_init
o add a mutex to synchronize the module init/done operations; this
sort of was done using the value of ip_mroute but X_ip_mrouter_done
sets it to NULL very early on which can lead to a race against
ip_mrouter_init--using the additional mutex means this is safe now
o don't call ip_mrouter_reset from ip_mrouter_init; this now happens
once at module load and X_ip_mrouter_done does the appropriate
cleanup work to ensure the data structures are in a consistent
state so that a subsequent init operation inherits good state

Reviewed by: juli


123608 17-Dec-2003 jhb

Fix some becuase -> because typos.

Reported by: Marco Wertejuk <wertejuk@mwcis.com>


123607 17-Dec-2003 rwatson

Switch TCP over to using the inpcb label when responding in timed
wait, rather than the socket label. This avoids reaching up to
the socket layer during connection close, which requires locking
changes. To do this, introduce MAC Framework entry point
mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond()
instead of calling mac_create_mbuf_from_socket() or
mac_create_mbuf_netlayer(). Introduce MAC Policy entry point
mpo_create_mbuf_from_inpcb(), and implementations for various
policies, which generally just copy label data from the inpcb to
the mbuf. Assert the inpcb lock in the entry point since we
require consistency for the inpcb label reference.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


123572 16-Dec-2003 maxim

o IN_MULTICAST wants an address in host byte order.
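
As a minimal userland sketch of the point above (the helper name is hypothetical, and this is not the code touched by the commit): addresses held in a struct in_addr are in network byte order, so they need ntohl() before being handed to IN_MULTICAST.

```c
#include <assert.h>
#include <arpa/inet.h>   /* ntohl, htonl */
#include <netinet/in.h>  /* struct in_addr, IN_MULTICAST */

/* Return nonzero if addr (stored in network byte order, as usual for
 * struct in_addr) is an IPv4 multicast address. IN_MULTICAST expects a
 * host byte order value, so convert with ntohl() first. */
static int is_multicast(struct in_addr addr)
{
    return IN_MULTICAST(ntohl(addr.s_addr)) != 0;
}
```

Passing addr.s_addr directly would happen to work on big-endian machines and silently misclassify addresses on little-endian ones.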

PR: kern/60304
Submitted by: demon
MFC after: 1 week


123169 06-Dec-2003 emax

Do not panic when flushing dummynet firewall rules

Reviewed by: andre
Approved by: re (scottl)


123113 02-Dec-2003 andre

Swap destination and source arguments of two bcopy() calls.

Before committing the initial tcp_hostcache I changed them from memcpy()
to conform with FreeBSD style without realizing the difference in argument
definition.

This fixes hostcache operation for IPv6 (in general and explicitly IPv6
path mtu discovery) and T/TCP (RFC1644).
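
The argument-order difference behind this fix can be shown with a small userland sketch (the wrapper names are hypothetical): memcpy() takes the destination first, bcopy() takes the source first.

```c
#include <assert.h>
#include <string.h>   /* memcpy */
#include <strings.h>  /* bcopy */

/* Copy len bytes from src to dst with memcpy: destination comes first. */
static void copy_with_memcpy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
}

/* The same copy with bcopy: source comes first, so the first two
 * arguments are swapped relative to memcpy. */
static void copy_with_bcopy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    bcopy(src, dst, len);
}
```

Renaming memcpy to bcopy without swapping the first two arguments compiles cleanly but copies in the wrong direction, which matches the kind of mistake described above.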

Submitted by: Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Approved by: re (rwatson)


123096 02-Dec-2003 sam

Include opt_ipsec.h so IPSEC/FAST_IPSEC is defined and the appropriate
code is compiled in to support the O_IPSEC operator. Previously no
support was included and ipsec rules were always matching. Note that
we do not return an error when an ipsec rule is added and the kernel
does not have IPsec support compiled in; this is done intentionally
but we may want to revisit this (document this in the man page).

PR: 58899
Submitted by: Bjoern A. Zeeb
Approved by: re (rwatson)


123028 28-Nov-2003 andre

Fix an optimization where I made an ifdef'd out section too broad.

When the hostcache bucket limit is reached the last bucket wasn't
removed from the bucket row but inserted a few lines later at the
bucket row head again. This leads to infinite loop when the same
bucket row is accessed the next time for a lookup/insert or purge
action.

Tested by: imp, Matt Smith
Approved by: re (rwatson)


123000 27-Nov-2003 andre

Fix verify_rev_path() function. The author of this function tried to
cut corners which completely broke down when the routing table locking
was introduced.

Reviewed by: sam (mentor)
Approved by: re (rwatson)


122996 26-Nov-2003 andre

Make sure all uses of stack allocated struct route's are properly
zeroed. Doing a bzero on the entire struct route is not more
expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst.

Reviewed by: sam (mentor)
Approved by: re (scottl)


122991 26-Nov-2003 sam

Split the "inp" mutex class into separate classes for each of divert,
raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness
complaints.

Reviewed by: rwatson
Approved by: re (rwatson)


122987 25-Nov-2003 andre

Restructure a too broad ifdef which was disabling the setting of the
tcp flightsize sysctl value for local networks in the !INET6 case.

Approved by: re (scottl)


122971 24-Nov-2003 sam

Correct a problem where ipfw-generated packets were being returned
for ipfw processing w/o an indication the packets were generated
by ipfw--and so should not be processed (this manifested itself
as a LOR.) The flag bit in the mbuf that was used to mark the
packets was not listed in M_COPYFLAGS so if a packet had a header
prepended (as done by IPsec) the flag was lost. Correct this by
defining a new M_PROTO6 flag and use it to mark packets that need
this processing.

Reviewed by: bms
Approved by: re (rwatson)
MFC after: 2 weeks


122966 23-Nov-2003 sam

Use MPSAFE callouts only when debug.mpsafenet is 1. Both timer routines
potentially transmit packets that may enter KAME IPsec w/o Giant if the
callouts are marked MPSAFE.

Reviewed by: ume
Approved by: re (rwatson)


122960 23-Nov-2003 tmm

bzero() the sockaddr used for the destination address for
rtalloc_ign() in in_pcbconnect_setup() before it is filled out.
Otherwise, stack junk would be left in sin_zero, which could
cause host routes to be ignored because they failed the comparison
in rn_match().
This should fix the wrong source address selection for connect() to
127.0.0.1, among other things.
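
A userland sketch of why the zeroing matters (hypothetical helper names; the memcmp() here only stands in for the kernel's key comparison): two sockaddr_in structures holding the same address compare equal byte for byte only if the fields that are never assigned, such as sin_zero, were cleared first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>  /* AF_INET */
#include <arpa/inet.h>   /* htonl */
#include <netinet/in.h>  /* struct sockaddr_in */

/* Fill *sin for the given address, clearing the whole structure first
 * so sin_zero and any other padding hold no leftover contents. */
static void fill_sin_zeroed(struct sockaddr_in *sin, uint32_t addr_host_order)
{
    assert(sin != NULL);
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_addr.s_addr = htonl(addr_host_order);
}

/* Same assignments without clearing: whatever was in *sin before
 * remains in the fields that are not explicitly set. */
static void fill_sin_unzeroed(struct sockaddr_in *sin, uint32_t addr_host_order)
{
    assert(sin != NULL);
    sin->sin_family = AF_INET;
    sin->sin_port = 0;
    sin->sin_addr.s_addr = htonl(addr_host_order);
}

/* Byte-wise comparison over the full structure. */
static int sin_equal(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```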

Reviewed by: sam
Approved by: re (rwatson)


122922 20-Nov-2003 andre

Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


122921 20-Nov-2003 andre

Remove RTF_PRCLONING from routing table and adjust users of it
accordingly. The define is left intact for ABI compatibility
with userland.

This is a pre-step for the introduction of tcp_hostcache. The
network stack remains fully useable with this change.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)


122915 20-Nov-2003 maxim

Fix an arguments order in check_uidgid() call.

PR: kern/59314
Submitted by: Andrey V. Shytov
Approved by: re (rwatson, jhb)


122875 18-Nov-2003 rwatson

Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


122867 17-Nov-2003 cognet

In rip_abort(), unlock the inpcb if we didn't detach it, or we may
recurse on the lock before destroying the mutex.

Submitted by: sam


122828 17-Nov-2003 green

Fix a few cases where MT_TAG-type "fake mbufs" are created on the stack, but
do not have mh_nextpkt initialized. Sometimes what's there is "1", and the
ip_input() code pukes trying to m_free() it, rendering divert sockets and
such broken.
This really underscores the need to get rid of MT_TAG.

Reviewed by: rwatson


122797 16-Nov-2003 andre

Make two casts correct for all types of 64bit platforms.

Explained by: bde


122759 15-Nov-2003 andre

Correct a cast to make it compile on 64bit platforms (noticed by tinderbox)
and remove two unnecessary variable initializations.
Make the introduction comment more clear with regard which parts of
the packet are touched.

Requested by: luigi


122723 15-Nov-2003 andre

Make ipstealth global as we need it in ip_fastforward too.


122708 14-Nov-2003 andre

Remove the global one-level rtcache variable and associated
complex locking and rework ip_rtaddr() to do its own rtlookup.
Adapt all its callers to this and make ip_output() callable
with NULL rt pointer.

Reviewed by: sam (mentor)


122702 14-Nov-2003 andre

Introduce ip_fastforward and remove ip_flow.

Short description of ip_fastforward:

o adds full direct process-to-completion IPv4 forwarding code
o handles ip fragmentation incl. hw support (ip_flow did not)
o sends icmp needfrag to source if DF is set (ip_flow did not)
o supports ipfw and ipfilter (ip_flow did not)
o supports divert, ipfw fwd and ipfilter nat (ip_flow did not)
o returns anything it can't handle back to normal ip_input

Enable with sysctl -w net.inet.ip.fastforwarding=1

Reviewed by: sam (mentor)


122599 13-Nov-2003 sam

add missing inpcb lock before call to tcp_twclose (which reclaims the inpcb)

Supported by: FreeBSD Foundation


122598 13-Nov-2003 sam

o reorder some locking asserts to reflect the order of the locks
o correct a read-lock assert in in_pcblookup_local that should be
a write-lock assert (since time wait close cleanups may alter state)

Supported by: FreeBSD Foundation


122593 13-Nov-2003 andre

Move global variables for icmp_input() to its stack. With SMP or
preemption two CPUs can be in the same function at the same time
and clobber each others variables. Remove register declaration
from local variables.

Reviewed by: sam (mentor)


122588 12-Nov-2003 andre

Do not fragment a packet with hardware assistance if it has the DF
bit set.

Reviewed by: sam (mentor)


122579 12-Nov-2003 bms

Add a new sysctl knob, net.inet.udp.strict_mcast_mship, to the udp_input path.

This switch toggles between strict multicast delivery, and traditional
multicast delivery.

The traditional (default) behaviour is to deliver multicast datagrams to all
sockets which are members of that group, regardless of the network interface
where the datagrams were received.

The strict behaviour is to deliver multicast datagrams received on a
particular interface only to sockets whose membership is bound to that
interface.

Note that as a matter of course, multicast consumers specifying INADDR_ANY
for their interface get joined on the interface where the default route
happens to be bound. This switch has no effect if the interface which the
consumer specifies for IP_ADD_MEMBERSHIP is not UP and RUNNING.

The original patch has been cleaned up somewhat from that submitted. It has
been tested on a multihomed machine with multiple QuickTime RTP streams
running over the local switch, which doesn't do IGMP snooping.

PR: kern/58359
Submitted by: William A. Carrel
Reviewed by: rwatson
MFC after: 1 week


122576 12-Nov-2003 andre

dropwithreset is not needed in this case as tcp_drop() is already notifying
the other side. Before we were sending two RST packets.


122524 12-Nov-2003 rwatson

Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) due to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


122501 11-Nov-2003 sam

correct typos

Pointed out by: Mike Silbersack


122496 11-Nov-2003 sam

o add missing inpcb locking in tcp_respond
o replace spl's with lock assertions

Supported by: FreeBSD Foundation


122449 10-Nov-2003 sam

use Giant-less callouts when debug_mpsafenet is non-zero

Supported by: FreeBSD Foundation


122446 10-Nov-2003 iedowse

In in_pcbconnect_setup(), don't use the cached inp->inp_route unless
it is marked as RTF_UP. This appears to fix a crash that was sometimes
triggered when dhclient(8) tried to send a packet after an interface
had been detached.

Reviewed by: sam


122437 10-Nov-2003 hsu

Mark TCP syncache timer as not Giant-free ready yet.


122334 08-Nov-2003 sam

replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF
macros that expand to include assertions when the system is built
with INVARIANTS
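
A rough userland sketch of the idea, with hypothetical names (the real RT_ADDREF and RT_REMREF definitions are not reproduced here): wrapping the count changes in macros gives one place to check for a corrupted or underflowing reference count. In this sketch assert() plays the role of the INVARIANTS-only checks and disappears when NDEBUG is defined.

```c
#include <assert.h>

/* Stand-in for a routing table entry; only the reference count matters. */
struct rtentry_sketch {
    long rt_refcnt;
};

/* Take a reference, checking that the count is not already corrupt. */
#define SKETCH_RT_ADDREF(rt) do {           \
    assert((rt)->rt_refcnt >= 0);           \
    (rt)->rt_refcnt++;                      \
} while (0)

/* Drop a reference, checking that there is one to drop. */
#define SKETCH_RT_REMREF(rt) do {           \
    assert((rt)->rt_refcnt > 0);            \
    (rt)->rt_refcnt--;                      \
} while (0)
```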

Supported by: FreeBSD Foundation


122331 08-Nov-2003 sam

divert socket fixups:

o pickup Giant in divert_packet to protect sbappendaddr since it
can be entered through MPSAFE callouts or through ip_input when
mpsafenet is 1
o add missing locking on output
o add locking to abort and shutdown
o add a ctlinput handler to invalidate held routing table references
on an ICMP redirect (may not be needed)

Supported by: FreeBSD Foundation


122330 08-Nov-2003 sam

assert optional inpcb is passed in locked

Supported by: FreeBSD Foundation


122329 08-Nov-2003 sam

add locking assertions

Supported by: FreeBSD Foundation


122328 08-Nov-2003 sam

assert inpcb is locked in udp_output

Supported by: FreeBSD Foundation


122327 08-Nov-2003 sam

o correct locking problem: the inpcb must be held across tcp_respond
o add assertions in tcp_respond to validate inpcb locking assumptions
o use local variable instead of chasing pointers in tcp_respond

Supported by: FreeBSD Foundation


122326 08-Nov-2003 sam

use local values instead of chasing pointers

Supported by: FreeBSD Foundation


122325 08-Nov-2003 sam

replace mtx_assert by INP_LOCK_ASSERT

Supported by: FreeBSD Foundation


122324 08-Nov-2003 sam

add some missing locking

Supported by: FreeBSD Foundation


122323 08-Nov-2003 sam

the sbappendaddr call in socket_send must be protected by Giant
because it can happen from an MPSAFE callout

Supported by: FreeBSD Foundation


122322 08-Nov-2003 sam

add locking assertions that turn into noops if INET6 is configured;
this is necessary because the ipv6 code shares the in_pcb code with
ipv4 but (presently) lacks proper locking

Supported by: FreeBSD Foundation


122320 08-Nov-2003 sam

o add a flags parameter to netisr_register that is used to specify
whether or not the isr needs to hold Giant when running; Giant-less
operation is also controlled by the setting of debug_mpsafenet
o mark all netisr's except NETISR_IP as needing Giant
o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant
o pickup Giant (when debug_mpsafenet is 1) inside ip_input before
calling up with a packet
o change netisr handling so swi_net runs w/o Giant; instead we grab
Giant before invoking handlers based on whether the handler needs Giant
o change netisr handling so that netisr's that are marked MPSAFE may
have multiple instances active at a time
o add netisr statistics for packets dropped because the isr is inactive

Supported by: FreeBSD Foundation


122271 08-Nov-2003 sam

unbreak compilation of FAST_IPSEC

Supported by: FreeBSD Foundation


122267 07-Nov-2003 sam

MFp4: reminder that random id code is not reentrant

Supported by: FreeBSD Foundation


122265 07-Nov-2003 sam

Move uid/gid checking logic out of line and lock inpcb usage. This
has a LOR between IPFW inpcb locks but I'm committing it now as the
lesser of two evils (the other being unlocked use of in_pcblookup).

Supported by: FreeBSD Foundation


122242 07-Nov-2003 ume

use ipsec_getnhist() instead of obsoleted ipsec_gethist().

Submitted by: "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>
Reviewed by: Ari Suutari <ari@suutari.iki.fi> (ipfw@)


122179 07-Nov-2003 sam

Fix locking of the ip forwarding cache. We were holding a reference
to a routing table entry w/o bumping the reference count or locking
against the entry being free'd. This caused major havoc (for some
reason it appeared most frequently for folks running natd). Fix
is to bump the reference count whenever we copy the route cache
contents into a private copy so the entry cannot be reclaimed out
from under us. This is a short term fix as the forthcoming routing
table changes will eliminate this cache entirely.

Supported by: FreeBSD Foundation


122062 04-Nov-2003 ume

- cleanup SP refcnt issue.
- share policy-on-socket for listening socket.
- don't copy policy-on-socket at all. secpolicy no longer contain
spidx, which saves a lot of memory.
- deep-copy pcb policy if it is an ipsec policy. assign ID field to
all SPD entries. make it possible for racoon to grab SPD entry on
pcb.
- fixed the order of searching SA table for packets.
- fixed to get a security association header. a mode is always needed
to compare them.
- fixed that the incorrect time was set to
sadb_comb_{hard|soft}_usetime.
- disallow port spec for tunnel mode policy (as we don't reassemble).
- a user can define a policy-id.
- clear enc/auth key before freeing.
- fixed that the kernel crashed when key_spdacquire() was called
because key_spdacquire() had been implemented incompletely.
- preparation for 64bit sequence number.
- maintain ordered list of SA, based on SA id.
- cleanup secasvar management; refcnt is key.c responsibility;
alloc/free is keydb.c responsibility.
- cleanup, avoid double-loop.
- use hash for spi-based lookup.
- mark persistent SP "persistent".
XXX in theory refcnt should do the right thing, however, we have
"spdflush" which would touch all SPs. another solution would be to
de-register persistent SPs from sptree.
- u_short -> u_int16_t
- reduce kernel stack usage by auto variable secasindex.
- clarify function name confusion. ipsec_*_policy ->
ipsec_*_pcbpolicy.
- avoid variable name confusion.
(struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct
secpolicy *)
- count number of ipsec encapsulations on ipsec4_output, so that we
can tell ip_output() how to handle the packet further.
- When the value of the ul_proto is ICMP or ICMPV6, the port field in
"src" of the spidx specifies ICMP type, and the port field in "dst"
of the spidx specifies ICMP code.
- avoid applying IPsec transport mode to the packets when the
kernel forwards the packets.

Tested by: nork
Obtained from: KAME


121972 03-Nov-2003 rwatson

Note that when ip_output() is called from ip_forward(), it will already
have its options inserted, so the opt argument to ip_output() must be
NULL.


121971 03-Nov-2003 rwatson

Remove comment about desire for eventual explicit labeling of ICMP
header copy made on input path: this is now handled differently.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


121929 03-Nov-2003 sam

Remove bogus RTFREE that was added in rev 1.47. The rmx code operates
directly on the radix tree and does not hold any routing table references.
This fixes the reference counting problems that manifested themselves as a
panic during unmount of filesystems that were mounted by NFS over an
interface that had been removed.

Supported by: FreeBSD Foundation


121922 03-Nov-2003 sam

Correct rev 1.56 which (incorrectly) reversed the test used to
decide if in_pcbpurgeif0 should be invoked.

Supported by: FreeBSD Foundation


121884 02-Nov-2003 silby

Add an additional check to the tcp_twrecycleable function; I had
previously only considered the send sequence space. Unfortunately,
some OSes (Windows) still use a random positive increment scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.

The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
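
A minimal sketch of the check described here, assuming a 32-bit sequence space and whole-second ageing. The struct, field names, and the local ISN rate are illustrative assumptions, not the kernel's definitions; only the 250000 bytes/second figure comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed rate for our own ISN generator; not stated in the log. */
#define LOCAL_ISN_BYTES_PER_SEC	1048576u
/* Peer ISN growth rate quoted in the commit message. */
#define PEER_ISN_BYTES_PER_SEC	250000u

/* Hypothetical snapshot of a TIME_WAIT connection. */
struct tw_state {
	uint32_t iss;		/* our initial send sequence number */
	uint32_t irs;		/* peer's initial sequence number */
	uint32_t snd_nxt;	/* next sequence number we would have sent */
	uint32_t rcv_nxt;	/* next sequence number we expected */
	uint32_t age_sec;	/* seconds since the connection started */
};

/* True if a is later than b in sequence space (handles wraparound). */
bool
seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/*
 * Recycle only if both the ISN we would pick now and the ISN the peer
 * is expected to pick lie beyond what the old connection used.
 */
bool
tw_recycleable(const struct tw_state *tw)
{
	assert(tw != NULL);
	uint32_t new_iss = tw->iss + tw->age_sec * LOCAL_ISN_BYTES_PER_SEC;
	uint32_t new_irs = tw->irs + tw->age_sec * PEER_ISN_BYTES_PER_SEC;

	return seq_gt(new_iss, tw->snd_nxt) && seq_gt(new_irs, tw->rcv_nxt);
}
```

The signed-difference comparison is what keeps the test valid when sequence numbers wrap past 2^32.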


121850 01-Nov-2003 silby

- Add a new function tcp_twrecycleable, which tells us if the ISN which
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.

- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_wait sockets which are hogging local ports can be safely
freed.

This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.


121816 31-Oct-2003 brooks

Replace the if_name and if_unit members of struct ifnet with new members
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.

This change paves the way for interface renaming and enhanced pseudo
device creation and configuration semantics.

Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)


121770 30-Oct-2003 sam

Overhaul routing table entry cleanup by introducing a new rtexpunge
routine that takes a locked routing table reference and removes all
references to the entry in the various data structures. This
eliminates instances of recursive locking and also closes races
where the lock on the entry had to be dropped prior to calling
rtrequest(RTM_DELETE). This also cleans up confusion where the
caller held a reference to an entry that might have been reclaimed
(and in some cases used that reference).

Supported by: FreeBSD Foundation


121700 29-Oct-2003 sam

Potential fix for races shutting down callouts when unloading
the module. Previously we grabbed the mutex used by the callouts,
then stopped the callout with callout_stop, but if the callout
was already active and blocked by the mutex then it would continue
later and reference the mutex after it was destroyed. Instead
stop the callout first then lock.

Supported by: FreeBSD Foundation


121699 29-Oct-2003 sam

o add locking to protect routing table refcnt manipulations
o add some more debugging help for figuring out why folks are
getting complaints about releasing routing table entries with
a zero refcnt
o fix comment that talked about spl's
o remove duplicate define of DUMMYNET_DEBUG

Supported by: FreeBSD Foundation


121684 29-Oct-2003 ume

add ECN support in layer-3.
- implement the tunnel egress rule in ip_ecn_egress() in ip_ecn.c.
make ip{,6}_ecn_egress() return integer to tell the caller that
this packet should be dropped.
- handle ECN at fragment reassembly in ip_input.c and frag6.c.

Obtained from: KAME


121674 29-Oct-2003 ume

ip6_savecontrol() argument is redundant


121645 29-Oct-2003 sam

Introduce the notion of "persistent mbuf tags"; these are tags that stay
with an mbuf until it is reclaimed. This is in contrast to tags that
vanish when an mbuf chain passes through an interface. Persistent tags
are used, for example, by MAC labels.

Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp. This fixes problems
with "tag leakage".
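
The stripping step can be sketched on a plain singly linked list. This is an illustration only: `struct tag`, `TAG_PERSISTENT`, and both helper names are hypothetical stand-ins, not the kernel's m_tag definitions.

```c
#include <assert.h>
#include <stdlib.h>

#define TAG_PERSISTENT	0x1	/* hypothetical flag bit */

/* Hypothetical stand-in for a packet tag. */
struct tag {
	struct tag *next;
	int id;
	int flags;
};

/* Push a new tag; returns the new head, or the old head if malloc fails. */
struct tag *
tag_prepend(struct tag *head, int id, int flags)
{
	assert((flags & ~TAG_PERSISTENT) == 0);
	struct tag *t = malloc(sizeof(*t));

	if (t == NULL)
		return head;
	t->next = head;
	t->id = id;
	t->flags = flags;
	return t;
}

/* Unlink and free every tag lacking TAG_PERSISTENT; returns the new head. */
struct tag *
tags_delete_nonpersistent(struct tag *head)
{
	struct tag **pp = &head;

	while (*pp != NULL) {
		struct tag *t = *pp;

		if (t->flags & TAG_PERSISTENT) {
			pp = &t->next;		/* keep it, move on */
		} else {
			*pp = t->next;		/* unlink, then free */
			free(t);
		}
	}
	return head;
}
```

Walking with a pointer-to-pointer lets the head and interior nodes be unlinked by the same code path.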

Pointed out by: Jonathan Stone
Reviewed by: Robert Watson


121628 28-Oct-2003 sam

speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append
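
As a rough illustration of the idea (not the socket buffer code itself), keeping a tail pointer makes each append constant-time. The `struct buf` and `struct bufq` types below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer node and queue; not the kernel's mbuf or sockbuf. */
struct buf {
	struct buf *next;
	int len;
};

struct bufq {
	struct buf *head;
	struct buf *tail;	/* last node, so append never walks the list */
	int total;		/* bytes queued */
};

/* Append b in O(1) by linking it after the remembered tail. */
void
bufq_append(struct bufq *q, struct buf *b)
{
	assert(q != NULL && b != NULL);
	b->next = NULL;
	if (q->tail == NULL)
		q->head = b;		/* queue was empty */
	else
		q->tail->next = b;
	q->tail = b;
	q->total += b->len;
}
```

Without the tail pointer every append would traverse the whole chain, which is quadratic work for a long stream of small segments.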

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)


121499 25-Oct-2003 ume

revert following unwanted changes:
- __packed to __attribute__((__packed__))
- uintN_t back to u_intN_t

Reported by: bde


121498 25-Oct-2003 ume

correct namespace pollution.

Submitted by: bde


121478 24-Oct-2003 ume

remove the ip6r0_addr and ip6r0_slmap members from ip6_rthdr0{}
according to rfc2292bis.

Obtained from: KAME


121477 24-Oct-2003 ume

correct tab and order.


121472 24-Oct-2003 ume

Switch Advanced Sockets API for IPv6 from RFC2292 to RFC3542
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility against existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.

Obtained from: KAME


121453 24-Oct-2003 silby

Reduce the number of tcp time_wait structs to maxsockets / 5; this ensures
that at most 20% of sockets can be in time_wait at one time, ensuring
that time_wait sockets do not starve real connections from inpcb
structures.

No implementation change is needed, jlemon already implemented a nice
LRU-ish algorithm for tcp_tw structure recycling.

This should reduce the need for sysadmins to lower the default msl on
busy servers.


121446 24-Oct-2003 sam

o restructure initialization code so data structures are setup
when loaded as a module
o cleanup data structures on module unload when no application has
been started (i.e. kldload, kldunload w/o mrtd)
o remove extraneous unlocks immediately prior to destroying them

Supported by: FreeBSD Foundation


121307 21-Oct-2003 silby

Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.


121285 20-Oct-2003 ume

enclose IPv6 part with ifdef INET6.

Obtained from: KAME


121283 20-Oct-2003 ume

correct linkmtu handling.

Obtained from: KAME


121161 17-Oct-2003 ume

- add dom_if{attach,detach} framework.
- transition to use ifp->if_afdata.

Obtained from: KAME


121141 16-Oct-2003 sam

pfil hooks can modify packet contents so check if the destination
address has been changed when PFIL_HOOKS is enabled and, if it has,
arrange for the proper action by ip*_forward.

Supported by: FreeBSD Foundation
Submitted by: Pyun YongHyeon


121140 16-Oct-2003 sam

Drop dummynet lock when calling back into the network stack to deliver
packets. This eliminates a LOR with Giant that caused outbound pipes
to fail.

Supported by: FreeBSD Foundation


121123 16-Oct-2003 mckusick

Malloc buckets of size 128 have been having their 64-byte offset
trashed after being freed. This has caused several panics including
kern/42277 related to soft updates. Jim Kuhn tracked the problem
down to ipfw limit rule processing. In the expiry of dynamic rules,
it is possible for an O_LIMIT_PARENT rule to be removed when it still
has live children. When the children eventually do expire, a pointer
to the (long gone) parent is dereferenced and a count decremented.
Since this memory can, and is, allocated for other purposes (in the
case of kern/42277 an inodedep structure), chaos ensues. The offset
in question in inodedep is the offset of the 16 bit count field in
the ipfw2 ipfw_dyn_rule.
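
The log does not show the fix itself. One way to avoid the dangling parent pointer, sketched here with hypothetical types and names, is to refuse to expire a parent while its child count is non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum dyn_type { DYN_LIMIT, DYN_LIMIT_PARENT };

/* Hypothetical, much reduced dynamic rule. */
struct dyn_rule {
	enum dyn_type type;
	struct dyn_rule *parent;	/* set on DYN_LIMIT children */
	uint16_t count;			/* live children, on parents */
	uint32_t expire;		/* expiry time in seconds */
};

/* A parent with live children must outlive them, even if its timer ran out. */
bool
dyn_rule_can_remove(const struct dyn_rule *r, uint32_t now)
{
	assert(r != NULL);
	if (r->expire > now)
		return false;
	if (r->type == DYN_LIMIT_PARENT && r->count > 0)
		return false;
	return true;
}

/* Called when a child is removed; drops the parent's child count. */
void
dyn_child_removed(struct dyn_rule *child)
{
	assert(child != NULL);
	if (child->type == DYN_LIMIT && child->parent != NULL) {
		assert(child->parent->count > 0);
		child->parent->count--;
	}
}
```

With this guard the decrement in `dyn_child_removed()` always lands in memory that still belongs to the parent rule.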

Submitted by: Jim Kuhn <jkuhn@sandvine.com>
Reviewed by: "Evgueni V. Gavrilov" <aquatique@rusunix.org>
Reviewed by: Ben Pfountz <netprince@vt.edu>
MFC after: 1 week


121119 15-Oct-2003 sam

purge extraneous ';'s

Supported by: FreeBSD Foundation
Noticed by: bde


121093 14-Oct-2003 sam

Lock ip forwarding route cache. While we're at it, remove the global
variable ipforward_rt by introducing an ip_forward_cacheinval() call
to use to invalidate the cache.

Supported by: FreeBSD Foundation


121091 14-Oct-2003 sam

remove dangling ';'s that were harmless

Supported by: FreeBSD Foundation


120891 07-Oct-2003 ume

- fix typo in comment.
- style.

Obtained from: KAME


120887 07-Oct-2003 ume

nuke unused ICMPV6CTL_NAMES and KEYCTL_NAMES macros.


120885 07-Oct-2003 ume

return(code) -> return (code)

Obtained from: KAME


120727 04-Oct-2003 sam

Locking for updates to routing table entries. Each rtentry gets a mutex
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.

Other/related changes:

o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts

Notes:

1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do not know about mutexes. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.

Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)


120721 03-Oct-2003 sam

hookup ctlinput for fast ipsec versions of esp+ah protocols

Supported by: FreeBSD Foundation


120714 03-Oct-2003 sam

place some kernel-specific data structures under #ifdef _KERNEL

Sponsored by: FreeBSD Foundation


120699 03-Oct-2003 bms

Shorten 'bad gateway' AF_LINK message.

Submitted by: green


120698 03-Oct-2003 bms

Make arp_rtrequest()'s 'bad gateway' messages slightly more informative,
to aid me in tracking down LLINFO inconsistencies in the routing table.

Discussed with: fenner


120685 03-Oct-2003 bms

Only delete the route if arplookup() tried to create it. Do not delete
RTF_STATIC routes. Do not check for RTF_HOST so as to avoid being DoSed
when an RTF_GENMASK route exists in the table.

Add a more verbose comment about exactly what this code does.

Submitted by: ru


120626 01-Oct-2003 ru

By popular demand, added the "static ARP" per-interface option.


120435 25-Sep-2003 ume

add /*CONSTCOND*/ to reduce diffs against latest KAME.

Obtained from: KAME


120418 24-Sep-2003 bms

Fix a logic error in the check to see if arplookup() should free the route.

Noticed by: Mike Hogsett
Reviewed by: ru


120386 23-Sep-2003 sam

o update PFIL_HOOKS support to current API used by netbsd
o revamp IPv4+IPv6+bridge usage to match API changes
o remove pfil_head instances from protosw entries (no longer used)
o add locking
o bump FreeBSD version for 3rd party modules

Heavy lifting by: "Max Laier" <max@love2party.net>
Supported by: FreeBSD Foundation
Obtained from: NetBSD (bits of pfil.h and pfil.c)


120383 23-Sep-2003 bms

Fix a bug in arplookup(), whereby a hostile party on a locally
attached network could exhaust kernel memory, and cause a system
panic, by sending a flood of spoofed ARP requests.

Approved by: jake (mentor)
Reported by: Apple Product Security <product-security@apple.com>


120373 23-Sep-2003 marcus

Grrr...add the Skinny alias code forgotten in the last commit.


120372 23-Sep-2003 marcus

Add Cisco Skinny Station protocol support to libalias, natd, and ppp.
Skinny is the protocol used by Cisco IP phones to talk to Cisco Call
Managers. With this code, one can use a Cisco IP phone behind a FreeBSD
NAT gateway.

Currently, having the Call Manager behind the NAT gateway is not supported.
More information on enabling Skinny support in libalias, natd, and ppp
can be found in those applications' manpages.

PR: 55843
Reviewed by: ru
Approved by: ru
MFC after: 30 days


120182 17-Sep-2003 sam

Bandaid locking change: mark static rule mutex recursive so re-entry when
sending an ICMP packet doesn't cause a panic. A better solution is needed;
possibly deferring the transmit to a dedicated thread.

Observed by: "Aaron Wohl" <freebsd@soith.com>


120181 17-Sep-2003 sam

shuffle code so we don't "continue" and miss a needed unlock operation

Observed by: Wiktor Niesiobedzki <w@evip.pl>


120141 17-Sep-2003 sam

Add locking.

o change timeout to MPSAFE callout
o restructure rule deletion to deal with locking requirements
o replace static buffer used for ipfw control operations with malloc'd storage

Sponsored by: FreeBSD Foundation


120140 17-Sep-2003 sam

Minor fixups + add locking.

o change timeout to MPSAFE callout
o make debug printfs conditional on DUMMYNET_DEBUG and runtime controllable
by net.inet.ip.dummynet.debug
o make boot-time printf dependent on bootverbose

Sponsored by: FreeBSD Foundation


119995 11-Sep-2003 ru

Fix a bunch of off-by-one errors in the range checking code.


119932 09-Sep-2003 ru

Fixed -Wpointer-arith warning.

Submitted by: Stefan Farfeleder
PR: bin/56653


119893 08-Sep-2003 ru

mdoc(7): Use the new feature of the .In macro.


119792 06-Sep-2003 sam

Add locking.

Special thanks to Pavlin Radoslavov <pavlin@icir.org> for testing and
fixing numerous problems.

Sponsored by: FreeBSD Foundation
Reviewed by: Pavlin Radoslavov <pavlin@icir.org>


119753 05-Sep-2003 sam

lock ip fragment queues

Submitted by: Robert Watson <rwatson@freebsd.org>
Obtained from: BSD/OS


119752 05-Sep-2003 sam

o add locking
o move the global divsrc socket address to a local variable
instead of locking it

Sponsored by: FreeBSD Foundation


119705 03-Sep-2003 bms

PR: kern/56343
Reviewed by: tjr
Approved by: jake (mentor)


119644 01-Sep-2003 silby

Implement MBUF_STRESS_TEST mark II.

Changes from the original implementation:

- Fragmentation is handled by the function m_fragment, which can
be called from wherever fragmentation is needed. Note that this
function is wrapped in #ifdef MBUF_STRESS_TEST to discourage non-testing
use.

- m_fragment works slightly differently from the old fragmentation
code in that it allocates a separate mbuf cluster for each fragment.
This defeats dma_map_load_mbuf/buffer's feature of coalescing adjacent
fragments. While that is a nice feature in practice, it nerfed the
usefulness of mbuf_stress_test.

- Add two modes of random fragmentation. Chains with fragments all of
the same random length and chains with fragments that are each uniquely
random in length may now be requested.


119640 01-Sep-2003 sam

add locking

NB: There is a known LOR on the forwarding path; this needs to be resolved
together with a similar issue in the bridge. For the moment it is
believed to be benign.

Sponsored by: FreeBSD Foundation


119635 01-Sep-2003 sam

remove warning about use of old divert sockets; this was marked
for removal before 5.2

Reviewed by: silence on -net and -arch


119634 01-Sep-2003 sam

add locking

Sponsored by: FreeBSD Foundation


119541 28-Aug-2003 rwatson

Remove redundant initialization of rti; SLIST_FOREACH does that for
us.


119489 26-Aug-2003 rwatson

M_PREPEND() with an argument of M_TRYWAIT can fail, meaning the
returned mbuf can be NULL. Check for NULL in rip_output() when
prepending an IP header. This prevents mbuf exhaustion from
causing a local kernel panic when sending raw IP packets.

PR: kern/55886
Reported by: Pawel Malachowski <pawmal-posting@freebsd.lublin.pl>
MFC after: 3 days


119401 24-Aug-2003 hsu

Remove redundant bzero.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


119245 21-Aug-2003 rwatson

Introduce two new MAC Framework and MAC policy entry points:

mac_reflect_mbuf_icmp()
mac_reflect_mbuf_tcp()

These entry points permit MAC policies to do "update in place"
changes to the labels on ICMP and TCP mbuf headers when an ICMP or
TCP response is generated to a packet outside of the context of
an existing socket. For example, in response to a ping, or a RST
packet in response to a SYN on a closed port.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


119181 20-Aug-2003 rwatson

Before digging into IGMP locking, do a whitespace and prototype cleanup:
prefer tabs to 8 spaces, focus on consistent indentation, prefer modern
C function prototypes. Not all the way to style(9), but substantially
closer.


119180 20-Aug-2003 rwatson

Move from a custom-crafted singly-linked list to the SLIST_* macros
from queue(3).

Improve vertical compactness by using an IGMP_PRINTF() macro rather
than #ifdefing IGMP_DEBUG a large number of debugging printfs.

Reviewed by: mdodd (SLIST changes)


119178 20-Aug-2003 bms

Add the IP_ONESBCAST option, to enable undirected IP broadcasts to be sent on
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.

Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)


119137 19-Aug-2003 sam

Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter. This
does not change the behaviour; it just makes the intent more clear.


119134 19-Aug-2003 hsu

* Bug fix in bw_meter_process(): the periodically processed bins
of bw_meter entries were processed up to one second ahead.
After an inappropriate rescheduling of some of the bw_meter
entries, the upcalls weren't delivered.

* pim_register_prepare() uses the appropriate sw_csum flag to
call ip_fragment() so the IP checksum is computed properly.

* Modify pim_register_prepare() to take care of IP packets that
don't need fragmentation.

* Add-back in_delayed_cksum() to encap_send(), because it seems it
should be there.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


119132 19-Aug-2003 sam

add missing unlock when in_pcballoc returns an error


119071 18-Aug-2003 obrien

style.Makefile(5)


119017 17-Aug-2003 gordon

Stage 3 of dynamic root support. Make all the libraries needed to run
binaries in /bin and /sbin installed in /lib. Only the versioned files
reside in /lib, the .so symlink continues to live in /usr/lib so the
toolchain doesn't need to be modified.


118864 13-Aug-2003 harti

The syncache has made use of TCPDEBUG problematic, because the SYN
segments are lost for the application. This broke, for example,
ports/benchmarks/dbs which needs the SYN segment to filter the
contents of the trace buffer for the connection it is interested in.

This patch makes the SYN segments available again. Unfortunately they
are now associated with the listening socket instead of the new one, so
a change to applications is required, but without this patch it wouldn't
work altogether.

PR: kern/45966


118862 13-Aug-2003 harti

The tcp_trace call needs the length of the header. Unfortunately the
code has rotted a bit so that the header length is not correct at
the point when tcp_trace is called. Temporarily compute the correct
value before the call and restore the old value after. This makes
ports/benchmarks/dbs almost work.

This is a NOP unless you compile with TCPDEBUG.


118861 13-Aug-2003 harti

A number of patches in the last years have created new return paths
in tcp_input that leave the function before hitting the tcp_trace
function call for the TCPDEBUG option. This has made TCPDEBUG mostly
useless (and tools like ports/benchmarks/dbs not working). Add
tcp_trace calls to the return paths that could be identified in this
maze.

This is a NOP unless you compile with TCPDEBUG.


118823 12-Aug-2003 harti

Change the code that enables/disables the ATM channel to use the
new ATMIOCOPENVCC/CLOSEVCC. This allows us to not only use UBR channels
for IP over ATM, but also CBR, VBR and ABR. Change the format of the
link layer address to specify the channel characteristics. The old
format is still supported and opens UBR channels.


118623 07-Aug-2003 hsu

New PIM header files.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


118622 07-Aug-2003 hsu

1. Basic PIM kernel support
Disabled by default. To enable it, the new "options PIM" must be
added to the kernel configuration file (in addition to MROUTING):

options MROUTING # Multicast routing
options PIM # Protocol Independent Multicast

2. Add support for advanced multicast API setup/configuration and
extensibility.

3. Add support for kernel-level PIM Register encapsulation.
Disabled by default. Can be enabled by the advanced multicast API.

4. Implement a mechanism for "multicast bandwidth monitoring and upcalls".

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


118607 07-Aug-2003 jhb

Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort. In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by: bde (kern_ktrace.c)


118552 06-Aug-2003 harti

Oops. I forgot this one in the SIOCATMENA/SIOCATMDIS removal commit.

This change allows one to specify almost the complete traffic parameters
for IPoverATM channels through the routing table. Up to now we used
4 byte DL addresses (flag, vpi, vciH, vciL). This format is still allowed.
If the address is longer, however, the 5th byte is interpreted as the
traffic class (UBR, CBR, VBR or ABR) and the remaining bytes are the
parameters for this traffic class:

UBR: 0 byte or 3 byte PCR
CBR: 3 byte PCR
VBR: 3 byte PCR, 3 byte SCR, 3 byte MBS
ABR: 3 byte PCR, 3 byte MCR, 3 byte ICR, 3 byte TBE, 1 byte NRM,
1 byte TRM, 2 bytes ADTF, 1 byte RIF, 1 byte RDF and 1 byte CDF

A script to generate the corresponding 'route add' arguments will follow soon.
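
The byte counts listed above can be checked mechanically. The sketch below validates only the length of such an address; the numeric traffic-class codes and function names are assumptions made for illustration, since the log does not give the actual encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Traffic classes named in the log; the numeric codes are hypothetical. */
enum atm_class { ATM_UBR, ATM_CBR, ATM_VBR, ATM_ABR };

/* Returns 1 if param_len matches the parameter bytes listed for cls. */
int
atm_params_len_ok(enum atm_class cls, size_t param_len)
{
	switch (cls) {
	case ATM_UBR:
		return param_len == 0 || param_len == 3;	/* optional PCR */
	case ATM_CBR:
		return param_len == 3;				/* PCR */
	case ATM_VBR:
		return param_len == 3 * 3;			/* PCR, SCR, MBS */
	case ATM_ABR:
		/* PCR, MCR, ICR, TBE; NRM, TRM, ADTF (2), RIF, RDF, CDF */
		return param_len == 4 * 3 + 1 + 1 + 2 + 1 + 1 + 1;
	}
	return 0;
}

/* Accepts the old 4-byte form (flag, vpi, vciH, vciL) or the long form. */
int
atm_lladdr_len_ok(const unsigned char *addr, size_t len)
{
	assert(addr != NULL || len == 0);
	if (len == 4)
		return 1;
	if (len < 5 || addr[4] > ATM_ABR)
		return 0;
	return atm_params_len_ok((enum atm_class)addr[4], len - 5);
}
```

Under these counts a full ABR address is 5 + 19 = 24 bytes, and a CBR address is 8 bytes.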


118501 05-Aug-2003 hsu

* makes mfc[MFCTBLSIZ] and vif[MAXVIFS] tables accessible via
sysctl:
- sysctlbyname("net.inet.ip.mfctable", ...)
- sysctlbyname("net.inet.ip.viftable", ...)

This change is needed so netstat can use sysctlbyname() to read
the data from those tables.
Otherwise, in some cases "netstat -g" may fail to report the
multicast forwarding information (e.g., if we run a multicast
router on PicoBSD).

* Bug fix: when sending IGMPMSG_WRONGVIF upcall to the multicast
routing daemon, set properly "im->im_vif" to the receiving
incoming interface of the packet that triggered that upcall
rather than to the expected incoming interface of that packet.

* Bug fix: add missing increment of counter "mrtstat.mrts_upcalls"

* Few formatting nits (e.g., replace extra spaces with TABs)

Submitted by: Pavlin Radoslavov <pavlin@icir.org>


118499 05-Aug-2003 harti

When adding a channel for INET failed at the device level (ioctl) the
code used to call rtrequest(RTM_DELETE, ...). This is a problem, because
the function that just has called us (route_output)
is not really happy with the route it just is creating being ripped out
from under it. Unfortunately we also cannot return an error from
ifa_rtrequest. Therefore mark the route just as RTF_REJECT.


118497 05-Aug-2003 harti

Make this file conform more closely to style(9) before really touching it.


118259 31-Jul-2003 maxim

o Fix a typo in previous commit.


118008 25-Jul-2003 maxim

o Do not overwrite saved interrupt priority level by alloc_hash(),
use a separate variable.
o Restore interrupt priority level before return (no-op in HEAD).

Spotted by: Don Bowman <don@sandvine.com>
MFC after: 5 days


117897 22-Jul-2003 sam

add IPSEC_FILTERGIF support for FAST_IPSEC

PR: kern/51922
Submitted by: Eric Masson <e-masson@kisoft-services.com>
MFC after: 1 week


117765 19-Jul-2003 silby

Minor fix to the MBUF_STRESS_TEST code so that it keeps
pkthdr.len consistent at all times. (Some debugging
code I'm working on is tripped otherwise.)

MFC after: 3 days


117737 18-Jul-2003 rwatson

Add a comment above rip_ctloutput() documenting that the privilege
check for raw IP system management operations is often (although
not always) implicit due to the namespacing of raw IP sockets. I.e.,
you have to have privilege to get a raw IP socket, so much of the
management code sitting on raw IP sockets assumes that any requests
on the socket should be granted privilege.

Obtained from: TrustedBSD Project
Product of: France


117686 17-Jul-2003 hsu

Drop Giant around syncache timer processing.


117654 15-Jul-2003 luigi

Allow set 31 to be used for rules other than 65535.
Set 31 is still special because rules belonging to it are not deleted
by the "ipfw flush" command, but must be deleted explicitly with
"ipfw delete set 31" or by individual rule numbers.

This implements a flexible form of "persistent rules" which you might
want to have available even after an "ipfw flush".
Note that this change does not violate POLA, because you could not
use set 31 in a ruleset before this change.

sbin/ipfw changes to allow manipulation of set 31 will follow shortly.

Suggested by: Paul Richards


117650 15-Jul-2003 hsu

Unify the "send high" and "recover" variables as specified in the
latest rev of the spec. Use an explicit flag for Fast Recovery. [1]

Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]

Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>,
Sally Floyd <floyd@acm.org> [1]


117468 12-Jul-2003 luigi

Implement comments embedded into ipfw2 instructions.

Since we already had 'O_NOP' instructions which always match, all
I needed to do is allow the NOP command to have arbitrary length
(i.e. move its label in a different part of the switch() which
validates instructions).

The kernel must know nothing about comments, everything else is
done in userland (which will be described in the upcoming ipfw2.c
commit).


117327 08-Jul-2003 luigi

Merge the handlers of O_IP_SRC_MASK and O_IP_DST_MASK opcodes, and
support matching a list of addr/mask pairs so one can write
more efficient rulesets which were not possible before e.g.

add 100 skipto 1000 not src-ip 10.0.0.0/8,127.0.0.1/8,192.168.0.0/16

The change is fully backward compatible.
ipfw2 and manpage commit to follow.
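
Matching one address against a list of addr/mask pairs can be sketched as follows. The types and helper names are hypothetical, and addresses are held in host byte order for simplicity; this is not the ipfw2 opcode layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from four octets. */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical addr/mask pair. */
struct addr_mask {
	uint32_t addr;
	uint32_t mask;
};

/* Netmask for a prefix length of 0..32 bits. */
uint32_t
prefix_mask(unsigned bits)
{
	assert(bits <= 32);
	return bits == 0 ? 0 : 0xFFFFFFFFu << (32 - bits);
}

/* Returns 1 if ip falls inside any addr/mask pair in the list. */
int
match_addr_list(uint32_t ip, const struct addr_mask *list, size_t n)
{
	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if ((ip & list[i].mask) == (list[i].addr & list[i].mask))
			return 1;
	return 0;
}
```

A single rule that scans such a list replaces several rules with one pair each, which is where the efficiency gain in the ruleset comes from.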

MFC after: 3 days


117241 04-Jul-2003 luigi

Implement the 'ipsec' option to match packets coming out of an ipsec tunnel.
Should work with both regular and fast ipsec (mutually exclusive).
See manpage for more details.

Submitted by: Ari Suutari (ari.suutari@syncrontech.com)
Revised by: sam
MFC after: 1 week


117240 04-Jul-2003 luigi

Correct some comments, add opcode O_IPSEC to match packets
coming out of an ipsec tunnel.


116982 28-Jun-2003 luigi

Remove a stale comment, fix indentation.


116981 28-Jun-2003 luigi

whitespace fix


116778 24-Jun-2003 luigi

remove unused file (ipfw2 is the default in RELENG_5 and above; the old
ipfw1 has been unused and unmaintained for a long time).


116764 23-Jun-2003 luigi

Fix typo in a (commented out) debugging string.

Spotted by: diff


116763 23-Jun-2003 luigi

Remove whitespace at end of line.


116690 22-Jun-2003 luigi

Add support for multiple values and ranges for the "iplen", "ipttl",
"ipid" options. This feature has been requested by several users.
In passing, fix some minor bugs in the parser. This change is fully
backward compatible so if you have an old /sbin/ipfw and a new
kernel you are not in trouble (but you need to update /sbin/ipfw
if you want to use the new features).

Document the changes in the manpage.

Now you can write things like

ipfw add skipto 1000 iplen 0-500

which some people were asking to give preferential treatment to
short packets.
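
A minimal sketch of matching a 16-bit field such as the IP length against a list of inclusive ranges. The struct and function names are assumptions for illustration; a single value is expressed as a range whose bounds are equal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive range; lo == hi encodes a single value. */
struct u16_range {
	uint16_t lo;
	uint16_t hi;
};

/* Returns 1 if v lies inside any of the n ranges (bounds included). */
int
match_ranges(uint16_t v, const struct u16_range *r, size_t n)
{
	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		assert(r[i].lo <= r[i].hi);
		if (v >= r[i].lo && v <= r[i].hi)
			return 1;
	}
	return 0;
}
```

With the ranges {0, 500} and {1500, 1500}, a 500-byte packet matches and a 501-byte packet does not, so both bounds are inclusive.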

The 'MFC after' is just set as a reminder, because I still need
to merge the Alpha/Sparc64 fixes for ipfw2 (which unfortunately
change the size of certain kernel structures; not that it matters
a lot since ipfw2 is entirely optional and not the default...)

PR: bin/48015

MFC after: 1 week


116462 17-Jun-2003 silby

Map icmp time exceeded responses to EHOSTUNREACH rather than 0 (no error);
this makes connect act more sensibly in these cases.

PR: 50839
Submitted by: Barney Wolff <barney@pit.databus.com>
Patch delayed by laziness of: silby
MFC after: 1 week


116315 13-Jun-2003 ru

In the PKT_ALIAS_PROXY_ONLY mode, make sure to preserve the
original source IP address, as promised in the manual page.

Spotted by: Vaclav Petricek


116314 13-Jun-2003 ru

Removed a couple of .Xo/.Xc that are leftovers of the "ninth-argument
limit" mdoc(7) atavism.


116313 13-Jun-2003 ru

Clarify that original address and port when doing transparent proxying
are _destination_ address and port.


116312 13-Jun-2003 ru

Added myself to the AUTHORS section.


116020 08-Jun-2003 charnier

The .Fn function


115909 06-Jun-2003 rwatson

When setting fragment queue pointers to NULL, or comparing them with
NULL, use NULL rather than 0 to improve readability.


115824 04-Jun-2003 hsu

Compensate for decreasing the minimum retransmit timeout.

Reviewed by: jlemon


115793 04-Jun-2003 ticso

Change handling to support strong alignment architectures such as alpha and
sparc64.

PR: alpha/50658
Submitted by: rizzo
Tested on: alpha


115750 02-Jun-2003 kbyanc

Account for packets processed at layer-2 (i.e. net.link.ether.ipfw=1).

MFC after: 2 weeks


115650 01-Jun-2003 ru

A new API function PacketAliasRedirectDynamic() can be used
to mark a fully specified static link as dynamic; i.e. make
it a one-time link.


115648 01-Jun-2003 ru

Make the PacketAliasSetAddress() function call optional. If it
is not called, and no static rules match an outgoing packet, the
latter retains its source IP address. This is in support of the
"static NAT only" mode.


115612 01-Jun-2003 phk

Remove unused variables.

Found by: FlexeLint


115503 31-May-2003 phk

Add /* FALLTHROUGH */

Found by: FlexeLint


115471 31-May-2003 wollman

Don't generate an ip_id for packets with the DF bit set; ip_id is
only meaningful for fragments. Also don't bother to byte-swap the
ip_id when we do generate it; it is only used at the receiver as a
nonce. I tried several different permutations of this code with no
measurable difference to each other or to the unmodified version, so
I've settled on the one for which gcc seems to generate the best code.
(If anyone cares to microoptimize this differently for an architecture
where it actually matters, feel free.)
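
The decision can be sketched in a few lines. This assumes the flags word is in host byte order, a simple counter as the ID source, and 0 as the value left in the field when no ID is generated; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IP_DF	0x4000	/* don't-fragment flag in the IPv4 flags/offset word */

/*
 * Pick the IP ID for an outgoing packet. A packet that can never be
 * fragmented needs no ID, so the counter is left untouched for it.
 */
uint16_t
next_ip_id(uint16_t ip_off, uint16_t *counter)
{
	assert(counter != NULL);
	if (ip_off & IP_DF)
		return 0;		/* assumed placeholder value */
	return (*counter)++;		/* wraps naturally at 65535 */
}
```

Because the receiver only compares IDs for equality during reassembly, the value needs to be unique per fragmented datagram, not byte-swapped or ordered.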

Suggested by: Steve Bellovin's paper in IMW'02


114794 07-May-2003 rwatson

Correct a bug introduced with reduced TCP state handling; make
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.

Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works around this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this proved to
complicate things too much.

Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


114788 06-May-2003 rwatson

Trim a call to mac_create_mbuf_from_mbuf() since m_tag meta-data
copying for mbuf headers now works properly in m_dup_pkthdr(), so
we don't need to do an explicit copy.

Approved by: re (jhb)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


114259 29-Apr-2003 mdodd

Add definitions for IN6ADDR_LINKLOCAL_ALLMDNS_INIT and INADDR_ALLMDNS_GROUP.


114258 29-Apr-2003 mdodd

IP_RECVTTL socket option.

Reviewed by: Stuart Cheshire <cheshire@apple.com>


114216 29-Apr-2003 kan

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


113799 21-Apr-2003 obrien

Explicitly declare 'int' parameters.


113755 20-Apr-2003 obrien

style.Makefile(5)


113384 12-Apr-2003 silby

Rename MBUF_FRAG_TEST to MBUF_STRESS_TEST as it will be extended
to include more than just frag tests.


113345 10-Apr-2003 rwatson

Remove a potential panic condition introduced by reduced TCP wait
state. Those changes attempted to work around the changed invariant
that inp->inp_socket was sometimes now NULL, but the logic wasn't
quite right, meaning that inp->inp_socket would be dereferenced by
cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC
were in use. Attempt to clarify and correct the logic.

Note: the work-around originally introduced with the reduced TCP
wait state handling to use cr_cansee() instead of cr_canseesocket()
in this case isn't really right, although it "Does the right thing"
for most of the cases in the base system. We'll need to address
this at some point in the future.

Pointed out by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


113255 08-Apr-2003 des

Introduce an M_ASSERTPKTHDR() macro which performs the very common task
of asserting that an mbuf has a packet header. Use it instead of hand-
rolled versions wherever applicable.

Submitted by: Hiten Pandya <hiten@unixdaemons.com>


113074 04-Apr-2003 des

Replace memcpy() and ovbcopy() with bcopy(); ditch some caddr_t usage.


112985 02-Apr-2003 mdodd

Back out support for RFC3514.

RFC3514 poses an unacceptable risk to compliant systems.


112983 02-Apr-2003 mdodd

- Use the correct constant define.
- Add a missing break.


112973 02-Apr-2003 mdodd

Sync constant define with NetBSD.

Requested by: Tom Spindler <dogcow@babymeat.com>


112957 01-Apr-2003 hsu

Observe conservation of packets when entering Fast Recovery while
doing Limited Transmit. Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.

Approved in principle by: Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
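
The window arithmetic described above can be sketched as follows. The function name and parameters are illustrative assumptions, not the kernel's identifiers.

```c
#include <stdint.h>

/*
 * Congestion window on entering Fast Recovery.
 * dupacks:      duplicate ACKs seen so far (normally 3 at this point)
 * limited_sent: segments already sent by Limited Transmit (0 to 2)
 * Each segment sent early by Limited Transmit is subtracted from the
 * usual inflation so that the number of packets in flight is conserved.
 */
static uint32_t
fast_recovery_cwnd(uint32_t ssthresh, uint32_t mss,
    unsigned dupacks, unsigned limited_sent)
{
	return ssthresh + mss * (dupacks - limited_sent);
}
```

With three duplicate ACKs and two segments already sent by Limited Transmit, the window is inflated by one segment rather than three.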


112929 01-Apr-2003 mdodd

Implement support for RFC 3514 (The Security Flag in the IPv4 Header).
(See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt)

This fulfills the host requirements for userland support by
way of the setsockopt() IP_EVIL_INTENT message.

There are three sysctl tunables provided to govern system behavior.

net.inet.ip.rfc3514:

Enables support for rfc3514. As this is an
Informational RFC and support is not yet widespread
this option is disabled by default.

net.inet.ip.hear_no_evil

If set the host will discard all received evil packets.

net.inet.ip.speak_no_evil

If set the host will discard all transmitted evil packets.

The IP statistics counter 'ips_evil' (available via 'netstat') provides
information on the number of 'evil' packets received.

For reference, the '-E' option to 'ping' has been provided to demonstrate
and test the implementation.


112711 27-Mar-2003 maxim

Fix indentation.


112710 27-Mar-2003 maxim

o Protect set_fs_parms() by splimp(9).

Quote from kern/37573:

There is an obvious race in netinet/ip_dummynet.c:config_pipe().
Interrupts are not blocked when changing the params of an
existing pipe. The specific crash observed:

... -> config_pipe -> set_fs_parms -> config_red

malloc a new w_q_lookup table but take an interrupt before
initializing it, interrupt handler does:

... -> dummynet_io -> red_drops

red_drops dereferences the uninitialized (zeroed) w_q_lookup
table.

o Flush accumulated credits for idle pipes.
o Flush accumulated credits when changing pipe characteristics.
o Change dn_flow_queue.numbytes type to unsigned long.

Overflowing dn_flow_queue->numbytes in ready_event() makes
numbytes negative, and the SET_TICKS() macro then returns a very
big value. heap_insert() overflows dn_key again and inserts a
queue into the ready heap with a sched_time that points to the past.
That leads to an infinite loop.

PR: kern/33234, kern/37573, misc/42459, kern/43133,
kern/44045, kern/48099
Submitted by: Mike Hibler <mike@cs.utah.edu> (kern/37573)
MFC after: 6 weeks


112675 26-Mar-2003 rwatson

Modify the mac_init_ipq() MAC Framework entry point to accept an
additional flags argument to indicate blocking disposition, and
pass in M_NOWAIT from the IP reassembly code to indicate that
blocking is not OK when labeling a new IP fragment reassembly
queue. This should eliminate some of the WITNESS warnings that
have started popping up since fine-grained IP stack locking
started going in; if memory allocation fails, the creation of
the fragment queue will be aborted.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


112650 25-Mar-2003 mux

Try to make the MBUF_FRAG_TEST code work better.

- Don't try to fragment the packet if it's smaller than mbuf_frag_size.
- Preserve the size of the mbuf chain which is modified by m_split().
- Check that m_split() didn't return NULL.
- Make it so we don't end up with two M_PKTHDR mbuf in the chain.
- Use m->m_pkthdr.len instead of m->m_len so that we fragment the whole
chain and not just the first mbuf.
- Fix a nearby style bug and rework the logic of the loops so that it's
more clear.

This is still not quite right, because we're clearly abusing m_split() to
do something it was not designed for, but at least it works now. We
should probably move this code into a m_fragment() function when it's
correct.


112591 25-Mar-2003 silby

Add the MBUF_FRAG_TEST option. When compiled in, this option
allows you to tell ip_output to fragment all outgoing packets
into mbuf fragments of size net.inet.ip.mbuf_frag_size bytes.
This is an excellent way to test if network drivers can properly
handle long mbuf chains being passed to them.

net.inet.ip.mbuf_frag_size defaults to 0 (no fragmentation)
so that you can at least boot before your network driver dies. :)


112482 22-Mar-2003 mux

Use __packed instead of __attribute__((__packed__)).


112465 21-Mar-2003 mdodd

Add a sysctl node allowing the specification of an address mask to use
when replying to ICMP Address Mask Request packets.


112464 21-Mar-2003 mdodd

Add comments regarding the ICMP timestamp fields.


112250 15-Mar-2003 cjc

Add a 'verrevpath' option that verifies the interface that a packet
comes in on is the same interface that we would route out of to get to
the packet's source address. Essentially automates an anti-spoofing
check using the information in the routing table.

Experimental. The usage and rule format for the feature may still be
subject to change.
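
A user-space sketch of the reverse-path check described above, using a toy routing table. The structure, the sample table, and the function names are all hypothetical; the real option consults the kernel routing table.

```c
#include <stddef.h>
#include <stdint.h>

struct route_entry {
	uint32_t prefix;	/* network address, host byte order */
	uint32_t mask;		/* contiguous netmask, host byte order */
	int	 ifindex;	/* interface used to reach this network */
};

/* Hypothetical table: 10.0.0.0/8 via interface 1, default via interface 2. */
static const struct route_entry sample_routes[] = {
	{ 0x0A000000u, 0xFF000000u, 1 },
	{ 0x00000000u, 0x00000000u, 2 },
};
#define SAMPLE_NROUTES (sizeof(sample_routes) / sizeof(sample_routes[0]))

/* Longest-prefix match; returns the interface index, or -1 if no route. */
static int
route_ifindex(const struct route_entry *tbl, size_t n, uint32_t addr)
{
	int best = -1;
	uint32_t best_mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((addr & tbl[i].mask) != tbl[i].prefix)
			continue;
		if (best == -1 || tbl[i].mask > best_mask) {
			best = tbl[i].ifindex;
			best_mask = tbl[i].mask;
		}
	}
	return best;
}

/* Accept only if the route back to src leaves via the receiving interface. */
static int
verrevpath_ok(const struct route_entry *tbl, size_t n,
    uint32_t src, int recv_ifindex)
{
	int out = route_ifindex(tbl, n, src);

	return out != -1 && out == recv_ifindex;
}
```

A packet claiming a 10.x.x.x source but arriving on interface 2 fails the check, which is the anti-spoofing effect the commit describes.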


112191 13-Mar-2003 hsu

Greatly simplify the unlocking logic by holding the TCP protocol lock until
after FIN_WAIT_2 processing.

Helped with debugging: Doug Barton


112171 13-Mar-2003 hsu

Add support for RFC 3390, which allows for a variable-sized
initial congestion window.
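
For reference, RFC 3390 bounds the initial window at min(4*MSS, max(2*MSS, 4380 bytes)). A small sketch of that formula follows; the function name is an assumption and this is not the kernel code.

```c
#include <stdint.h>

/* Upper bound on the initial congestion window, in bytes, per RFC 3390. */
static uint32_t
rfc3390_initial_window(uint32_t mss)
{
	uint32_t lower = (2 * mss > 4380) ? 2 * mss : 4380;
	uint32_t upper = 4 * mss;

	return (upper < lower) ? upper : lower;
}
```

For a typical Ethernet MSS of 1460 bytes this yields 4380 bytes, that is, three full segments.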


112162 12-Mar-2003 hsu

Implement the Limited Transmit algorithm (RFC 3042).


112148 12-Mar-2003 sam

correct two more flag misuses; m_tag* use malloc flags


112010 08-Mar-2003 jlemon

Remove check for t_state == TCPS_TIME_WAIT and introduce the tw structure.

Sponsored by: DARPA, NAI Labs


112009 08-Mar-2003 jlemon

Remove a panic(); if the zone allocator can't provide more timewait
structures, reuse the oldest one. Also move the expiry timer from
a per-structure callout to the tcp slow timer.

Sponsored by: DARPA, NAI Labs


111926 05-Mar-2003 peter

Finish driving a stake through the heart of netns and the associated
ifdefs scattered around the place - it's dead, Jim!

The SMB stuff had stolen AF_NS, make it official.


111888 04-Mar-2003 jlemon

Update netisr handling; Each SWI now registers its queue, and all queue
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.

Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs


111748 02-Mar-2003 des

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


111560 26-Feb-2003 jlemon

In timewait state, if the incoming segment is a pure in-sequence ack
that matches snd_max, then do not respond with an ack, just drop the
segment. This fixes a problem where a simultaneous close results in
an ack loop between two time-wait states.

Test case supplied by: Tim Robbins <tjr@FreeBSD.ORG>
Sponsored by: DARPA, NAI Labs


111549 26-Feb-2003 jlemon

The TCP protocol lock may still be held if the reassembly queue dropped FIN.
Detect this case and drop the lock accordingly.

Sponsored by: DARPA, NAI Labs


111541 26-Feb-2003 silby

Fix a condition so that ip reassembly queues are emptied immediately
when maxfragpackets is dropped to 0.

Noticed by: bmah


111483 25-Feb-2003 rwatson

When generating a TCP response to a connection, not only test if the
tcpcb is NULL, but also its connected inpcb, since we now allow
elements of a TCP connection to hang around after other state, such
as the socket, has been recycled.

Tested by: dcs
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


111479 25-Feb-2003 maxim

style(9): join lines.


111478 25-Feb-2003 maxim

The IP reassembly queue structure has ipq_nfrags now. Count the number
of dropped IP fragments precisely.

Reviewed by: silby


111459 25-Feb-2003 hsu

Hold the TCP protocol lock while modifying the connection hash table.


111405 24-Feb-2003 silby

Fix a comment which didn't match the new cookie behavior.

Submitted by: Scott Renfro <scott@renfro.org>
MFC after: 1 day


111389 24-Feb-2003 hsu

tcp_twstart() needs to be called with the TCP protocol lock held to avoid
a race condition with the TCP timer routines.


111386 24-Feb-2003 hsu

Pass the right function to callout_reset() for a compressed
TIME-WAIT control block.


111338 23-Feb-2003 silby

Improve the security and performance of syncookies:

Security improvements:
- Increase the size of each syncookie secret from 32 to 128 bits
in order to make brute force attacks on the secrets much more
difficult.
- Always return the lowest order dword from the MD5 hash; this
allows us to expose 2 more bits of the cookie and makes ACK
floods which seek to guess the cookie value more difficult.

Performance improvements:
- Increase the lifetime of each syncookie from 4 seconds to 16
seconds. This increases the usefulness of syncookies during
an attack.
- From Yahoo!: Reduce the number of calls to MD5Update; this
results in a ~17% increase in cookie generation time here.

Reviewed by: hsu, jayanth, jlemon, nectar
MFC After: 15 seconds


111319 23-Feb-2003 jlemon

Yesterday just wasn't my day. Remove testing delta that crept into the diff.

Pointy hat provided by: sam


111275 23-Feb-2003 sam

Add a new config option IPSEC_FILTERGIF to control whether or not
packets coming out of a GIF tunnel are re-processed by ipfw, et al.
By default they are not reprocessed. With the option they are.

This reverts 1.214. Prior to that change packets were not re-processed.
After it they were, which caused problems because the packets have no
distinguishing characteristics (such as a special network interface) that
allow them to be filtered specially.

This is really a stopgap measure designed for immediate MFC so that
4.8 has consistent handling to what was in 4.7.

PR: 48159
Reviewed by: Guido van Rooij <guido@gvr.org>
MFC after: 1 day


111266 22-Feb-2003 jlemon

Check to see if the TF_DELACK flag is set before returning from
tcp_input(). This unbreaks delack handling, while still preserving
correct T/TCP behavior.

Tested by: maxim
Sponsored by: DARPA, NAI Labs


111244 22-Feb-2003 silby

Add the ability to limit the number of IP fragments allowed per packet,
and enable it by default, with a limit of 16.

At the same time, tweak maxfragpackets downward so that in the worst
possible case, IP reassembly can use only 1/2 of all mbuf clusters.

MFC after: 3 days
Reviewed by: hsu
Liked by: bmah
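
A toy sketch of a per-packet fragment cap. The structure and function names are hypothetical, and what the real code does with fragments already queued once the limit is hit is not modeled here.

```c
#define MAXFRAGSPERPACKET 16	/* default limit named in the commit message */

/* Hypothetical reassembly queue: only the fragment count matters here. */
struct reass_queue {
	int nfrags;
};

/* Returns 1 if the fragment is queued, 0 if it is refused. */
static int
reass_accept_fragment(struct reass_queue *q, int limit)
{
	if (q->nfrags >= limit)
		return 0;
	q->nfrags++;
	return 1;
}

/* Offer 'offered' fragments of one packet and count how many are queued. */
static int
count_accepted(int offered, int limit)
{
	struct reass_queue q = { 0 };
	int i, accepted = 0;

	for (i = 0; i < offered; i++)
		accepted += reass_accept_fragment(&q, limit);
	return accepted;
}
```

Capping the count per packet bounds how many mbufs a single unfinished datagram can pin.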


111231 21-Feb-2003 phk

- m = m_gethdr(M_NOWAIT, MT_HEADER);
+ m = m_gethdr(M_DONTWAIT, MT_HEADER);

'nuff said.


111205 21-Feb-2003 cjc

The ancient and outdated concept of "privileged ports" in UNIX-type
OSes has probably caused more problems than it ever solved. Allow the
user to retire the old behavior by specifying their own privileged
range with,

net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1
net.inet.ip.portrange.reservedlow default = 0

Now you can run that webserver without ever needing root at all. Or
just imagine, an ftpd that can really drop privileges, rather than
just set the euid, and still do PORT data transfers from 20/tcp.

Two edge cases to note,

# sysctl net.inet.ip.portrange.reservedhigh=0

Opens all ports to everyone, and,

# sysctl net.inet.ip.portrange.reservedhigh=65535

Locks all network activity to root only (which could actually have
been achieved before with ipfw(8), but is somewhat more
complicated).

For those who stick to the old religion that 0-1023 belong to root and
root alone, don't touch the knobs (or even lock them by raising
securelevel(8)), and nothing changes.
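
The range test implied by the two knobs can be sketched as below. The function name is an assumption, and the real check also involves the caller's credentials, which are omitted here.

```c
/*
 * Returns 1 if binding to 'port' would require privilege under the
 * given reserved range, 0 otherwise. Both bounds are inclusive.
 */
static int
port_is_reserved(int port, int reservedlow, int reservedhigh)
{
	return port >= reservedlow && port <= reservedhigh;
}
```

With the defaults (0 and 1023) port 80 is reserved and port 1024 is not. Setting the high bound to 0 leaves no usable port reserved, and setting it to 65535 reserves every port.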


111186 20-Feb-2003 jlemon

Remove unused variables in the IPSEC case.

Submitted by: Lars Eggert <larse@ISI.EDU>


111153 19-Feb-2003 jlemon

Unbreak non-IPV6 compilation.

Caught by: phk
Sponsored by: DARPA, NAI Labs


111145 19-Feb-2003 jlemon

Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs


111144 19-Feb-2003 jlemon

Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.

Sponsored by: DARPA, NAI Labs


111140 19-Feb-2003 jlemon

Correct comments.


111139 19-Feb-2003 jlemon

Clean up delayed acks and T/TCP interactions:
- delay acks for T/TCP regardless of delack setting
- fix bug where a single pass through tcp_input might not delay acks
- use callout_active() instead of callout_pending()

Sponsored by: DARPA, NAI Labs


111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


111037 17-Feb-2003 maxim

o Fix ipfw uid rules: socheckuid() returns 0 when uid matches a socket
cr_uid.

Note: we do not have socheckuid() in RELENG_4, ip_fw2.c uses its
own macro for a similar purpose that is why ipfw2 in RELENG_4 processes
uid rules correctly. I will MFC the diff for code consistency.

Reported by: Oleg Baranov <ol@csa.ru>
Reviewed by: luigi
MFC after: 1 month


110896 15-Feb-2003 hsu

Take advantage of pre-existing lock-free synchronization and type stable memory
to avoid acquiring SMP locks during expensive copyout process.


110830 13-Feb-2003 hsu

The protocol lock is always held in the dropafterack case, so we don't
need to check for it at runtime.


110775 12-Feb-2003 hsu

in_pcbnotifyall() requires an exclusive protocol lock for notify functions
which modify the connection list, namely, tcp_notify().


110737 12-Feb-2003 hsu

Properly document that syncache timer processing requires an
exclusive TCP protocol lock.


110683 11-Feb-2003 tanimura

s/IPSSEC/IPSEC/


110656 10-Feb-2003 hsu

Get cosmetic changes out of the way before I add routing table SMP locks.


110544 08-Feb-2003 orion

Avoid multiply for preemptive arp calculation since it hits every
ethernet packet sent.

Prompted by: Jeffrey Hsu <hsu@FreeBSD.org>


110308 04-Feb-2003 orion

MFS 1.64.2.22: Re-enable non pre-emptive ARP requests.

Submitted by: "Diomidis Spinellis" <dds@aueb.gr>
PR: kern/46116


110251 02-Feb-2003 cjc

Add the TCP flags to the log message whenever log_in_vain is 1, not
just when set to 2.

PR: kern/43348
MFC after: 5 days


110178 01-Feb-2003 silby

Move a comment and optimize the frag timeout code a slight bit.

Submitted by: maxim
MFC with: The previous two revisions


110074 30-Jan-2003 sam

FAST_IPSEC bandaid: act like KAME and ignore ENOENT error codes from
ipsec4_process_packet; they happen when a packet is dropped because
an SA acquire is initiated

Submitted by: Doug Ambrisko <ambrisko@verniernetworks.com>


110073 30-Jan-2003 sam

remove the restriction on building a kernel with FAST_IPSEC and INET6;
you still don't want to use the two together, but it's ok to have
them in the same kernel (the problem that initiated this bandaid
has long since been fixed)


110023 29-Jan-2003 silby

Fix a bug with syncookies; previously, the syncache's MSS size was not
initialized until after a syncookie was generated. As a result,
all connections resulting from a returned cookie would end up using
a MSS of ~512 bytes. Now larger packets will be used where possible.

MFC after: 5 days


110008 28-Jan-2003 phk

Check bounds for index before dereferencing memory past end of array.

Found by: FlexeLint


109996 28-Jan-2003 hsu

Avoid lock order reversal by expanding the scope of the
AF_INET radix tree lock to cover the ARP data structures.


109965 28-Jan-2003 silby

A few fixes to rev 1.221

- Honor the previous behavior of maxfragpackets = 0 or -1
- Take a better stab at fragment statistics
- Move / correct a comment

Suggested by: maxim@
MFC after: 7 days


109843 26-Jan-2003 silby

Merge the best parts of maxfragpackets and maxnipq together. (Both
functions implemented approximately the same limits on fragment memory
usage, but in different fashions.)

End user visible changes:
- Fragment reassembly queues are freed in a FIFO manner when maxfragpackets
has been reached, rather than all reassembly stopping.

MFC after: 5 days


109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


109569 20-Jan-2003 maxim

De-anonymize a couple of messages I missed in a previous sweep.
Move one of them under DEB macro.

Noticed by: Wiktor Niesiobedzki <w@evip.pl>


109566 20-Jan-2003 maxim

If the first action is O_LOG, adjust the pointer to the real one; this
unbreaks skipto + log rules.

Reported by: Wiktor Niesiobedzki <w@evip.pl>
MFC after: 1 week


109492 18-Jan-2003 hsu

Optimize away call to bzero() in the common case by directly checking
if a connection has any cached TAO information.


109451 18-Jan-2003 hsu

Fix long-standing bug predating FreeBSD where calling connect() twice
on a raw ip socket will crash the system with a null-dereference.


109409 17-Jan-2003 hsu

SMP locking for ARP.


109246 14-Jan-2003 dillon

Introduce the ability to flag a sysctl for operation at secure level 2 or 3
in addition to secure level 1. The mask supports up to a secure level of 8
but only add defines through CTLFLAG_SECURE3 for now.

As per the missive in the log entry for 1.11 of ip_fw2.c which added the
secure flag to the IPFW sysctl's in the first place, change the secure
level requirement from 1 to 3 now that we have support for it.

Reviewed by: imp
With Design Suggestions by: imp


109175 13-Jan-2003 hsu

Fix NewReno.

Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>


109035 10-Jan-2003 tmm

Clear the target hardware address field when generating an ARP request.

Reviewed by: nectar
MFC after: 1 week


108703 05-Jan-2003 hsu

Validate inp before de-referencing it.

Submitted by: pb


108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


108466 30-Dec-2002 sam

Correct mbuf packet header propagation. Previously, packet headers
were sometimes propagated using M_COPY_PKTHDR which actually did
something between a "move" and a "copy" operation. This is replaced
by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it
from the source mbuf) and m_dup_pkthdr which copies the packet
header contents including any m_tag chain. This corrects numerous
problems whereby mbuf tags could be lost during packet manipulations.

These changes also introduce arguments to m_tag_copy and m_tag_copy_chain
to specify if the tag copy work should potentially block. This
introduces an incompatibility with openbsd which we may want to revisit.

Note that move/dup of packet headers does not handle target mbufs
that have a cluster bound to them. We may want to support this;
for now we watch for it with an assert.

Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG|M_LASTFRAG.

Supported by: Vernier Networks
Reviewed by: Robert Watson <rwatson@FreeBSD.org>


108464 30-Dec-2002 dillon

Remove the PAWS ack-on-ack debugging printf().

Note that the original RFC 1323 (PAWS) says in 4.2.1 that the out of
order / reverse-time-indexed packet should be acknowledged as specified
in RFC-793 page 69 then dropped. The original PAWS code in FreeBSD (1994)
simply acknowledged the segment unconditionally, which is incorrect, and
was fixed in 1.183 (2002). At the moment we do not do checks for SYN or FIN
in addition to (tlen != 0), which may or may not be correct, but the
worst that ought to happen should be a retry by the sender.


108461 30-Dec-2002 sam

correct style bogons


108327 27-Dec-2002 iedowse

Bridged packets are supplied to the firewall with their IP header
in network byte order, but icmp_error() expects the IP header to
be in host order and the code here did not perform the necessary
swapping for the bridged case. This bug causes an "icmp_error: bad
length" panic when certain length IP packets (e.g. ip_len == 0x100)
are rejected by the firewall with an ICMP response.

MFC after: 3 days
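
To illustrate why the missing swap matters, here is a portable 16-bit byte swap. swap16() is a hypothetical helper for this sketch; kernel code would use ntohs() or an equivalent.

```c
#include <stdint.h>

/* Exchange the two bytes of a 16-bit value, independent of host order. */
static uint16_t
swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}
```

On a little-endian host, a length of 0x0100 (256 bytes) left in network order reads as 0x0001, which is shorter than any valid IP header. That is consistent with the "bad length" panic mentioned above, though the exact check that fires is not shown in the log.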


108265 24-Dec-2002 hsu

Validate inp to prevent a use-after-free.


108258 24-Dec-2002 maxim

o De-anonymize dummynet(4) and ipfw(4) messages; prefix them
with 'dummynet: ' and 'ipfw: '.

PR: kern/41609


108250 24-Dec-2002 hsu

SMP locking for radix nodes.


108180 22-Dec-2002 pb

Remove forgotten INP_UNLOCK(inp) in my previous commit.
Reported by: hsu


108160 21-Dec-2002 pb

In syncache_timer(), don't attempt to lock the inpcb structure
associated with the syncache entry: in case tcp_close() has been
called on the corresponding listening socket, the lock has been
destroyed as a side effect of in_pcbdetach(), causing a panic when
we attempt to lock on it.

Reviewed by: hsu


108144 21-Dec-2002 sam

replace the special-purpose rate-limiting code with the general facility
just added; this tries to maintain the same behaviour vis a vis printing
the rate-limiting messages but needs tweaking


108125 20-Dec-2002 hsu

Eliminate a goto.
Fix some line breaks.


108123 20-Dec-2002 hsu

Unravel a nested conditional.
Remove an unneeded local variable.


108112 20-Dec-2002 hsu

Expand scope of TCP protocol lock to cover syncache data structures.


108107 19-Dec-2002 bmilekic

o Untangle the confusion with the malloc flags {M_WAITOK, M_NOWAIT} and
the mbuf allocator flags {M_TRYWAIT, M_DONTWAIT}.
o Fix a bpf_compat issue where malloc() was defined to just call
bpf_alloc() and pass the 'canwait' flag(s) along. It's been changed
to call bpf_alloc() but pass the corresponding M_TRYWAIT or M_DONTWAIT
flag (and only one of those two).

Submitted by: Hiten Pandya <hiten@unixdaemons.com> (hiten->commit_count++)


108033 18-Dec-2002 hsu

Lock up ifaddr reference counts.


107983 17-Dec-2002 phk

Remove unused and incorrectly maintained variable "in_interfaces"


107961 17-Dec-2002 dillon

Fix syntax in last commit.


107900 15-Dec-2002 maxim

o Trim EOL whitespaces.

MFC after: 1 week


107899 15-Dec-2002 maxim

o s/if_name[16]/if_name[IFNAMSIZ]/

Reviewed by: luigi
MFC after: 1 week


107898 15-Dec-2002 maxim

o M_DONTWAIT is mbuf(9) flag: malloc(M_DONTWAIT) -> malloc(M_NOWAIT).
The bug does not affect anything because M_NOWAIT == M_DONTWAIT.

Reviewed by: luigi
MFC after: 1 week


107897 15-Dec-2002 maxim

o Fix byte order logging issue: sa.sin_port is already in host byte order.

PR: kern/45964
Submitted by: Sascha Blank <sblank@tiscali.de>
Reviewed by: luigi
MFC after: 1 week


107881 14-Dec-2002 dillon

Change tcp.inflight_min from 1024 to a production default of 6144. Create
a sysctl for the stabilization value for the bandwidth delay product (inflight)
algorithm and document it.

MFC after: 3 days


107854 14-Dec-2002 dillon

Bruce forwarded this tidbit from an analysis Van Jacobson did on an
apparent ack-on-ack problem with FreeBSD. Prof. Jacobson noticed a
case in our TCP stack which would acknowledge a received ack-only packet,
which is not legal in TCP.

Submitted by: Van Jacobson <van@packetdesign.com>,
bmah@packetdesign.com (Bruce A. Mah)
MFC after: 7 days


107670 07-Dec-2002 sobomax

MFS: recognize gre packets used in the WCCP protocol.

Approved by: re


107114 20-Nov-2002 luigi

Move fw_one_pass from ip_fw2.c to ip_input.c so that neither
bridge.c nor if_ethersubr.c depend on IPFIREWALL.
Restore the use of fw_one_pass in if_ethersubr.c

ipfw.8 will be updated with a separate commit.

Approved by: re


107113 20-Nov-2002 luigi

Back out some style changes. They are not urgent,
I will put them back in after 5.0 is out.

Requested by: sam
Approved by: re


107112 20-Nov-2002 luigi

Back out the ip_fragment() code -- it is not urgent to have it in now,
I will put it back in in a better form after 5.0 is out.

Requested by: sam, rwatson, luigi (on second thought)
Approved by: re


107081 19-Nov-2002 silby

Add a sysctl to control the generation of source quench packets,
and set it to 0 by default.

Partially obtained from: NetBSD
Suggested by: David Gilbert
MFC after: 5 days


107022 17-Nov-2002 luigi

Fix function headers and remove 'register' variable declarations.


107020 17-Nov-2002 luigi

Move the ip_fragment code from ip_output() to a separate function,
so that it can be reused elsewhere (there is a number of places
where it can be useful). This also trims some 200 lines from
the body of ip_output(), which helps readability a bit.

(This change was discussed a few weeks ago on the mailing lists,
Julian agreed, silence from others. It is not a functional change,
so i expect it to be ok to commit it now but i am happy to back it
out if there are objections).

While at it, fix some function headers and replace m_copy() with
m_copypacket() where applicable.

MFC after: 1 week


107018 17-Nov-2002 luigi

Minor documentation changes and indentation fix.

Replace m_copy() with m_copypacket() where applicable.

While at it, fix some function headers and remove 'register' from
variable declarations.


107017 17-Nov-2002 luigi

Cleanup some of the comments, and reformat long lines.

Replace m_copy() with m_copypacket() where applicable.

Replace "if (a.s_addr ...)" with "if (a.s_addr != INADDR_ANY ...)"
to make it clear what the code means.

While at it, fix some function headers and remove 'register' from
variable declarations.

MFC after: 3 days


106968 15-Nov-2002 luigi

Massive cleanup of the ip_mroute code.

No functional changes, but:

+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).

This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).

Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.

Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct

Most of the code is from Pavlin Radoslavov and the XORP project

Reviewed by: sam
MFC after: 1 week


106935 14-Nov-2002 sam

track changes to not strip the Ethernet header from input packets

Reviewed by: many
Approved by: re


106934 14-Nov-2002 sam

track bpf changes

Reviewed by: many
Approved by: re


106846 13-Nov-2002 maxim

Due to memory alignment, sizeof(struct ipfw_flow_id) is bigger than
the actual size of the ipfw_flow_id structure, and bcmp(3) may fail
to compare them properly. Compare members of these structures instead.

PR: kern/44078
Submitted by: Oleg Bulyzhin <oleg@rinet.ru>
Reviewed by: luigi
MFC after: 2 weeks
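
A sketch of the padding problem and the member-wise fix. struct flow_id below is a simplified, hypothetical stand-in for ipfw_flow_id, and the helper names are assumptions.

```c
#include <stdint.h>
#include <string.h>

struct flow_id {
	uint32_t dst_ip;
	uint32_t src_ip;
	uint16_t dst_port;
	uint16_t src_port;
	uint8_t	 proto;
	uint8_t	 flags;
	/* most ABIs append 2 bytes of padding here: 14 bytes of data, sizeof 16 */
};

/* Compare field by field so that padding bytes never take part. */
static int
flow_id_equal(const struct flow_id *a, const struct flow_id *b)
{
	return a->dst_ip == b->dst_ip && a->src_ip == b->src_ip &&
	    a->dst_port == b->dst_port && a->src_port == b->src_port &&
	    a->proto == b->proto && a->flags == b->flags;
}

/* Fill every byte, padding included, with 'junk', then set the real fields. */
static void
flow_id_fill(struct flow_id *f, unsigned char junk)
{
	memset(f, junk, sizeof(*f));
	f->dst_ip = 0x0A000001u;
	f->src_ip = 0x0A000002u;
	f->dst_port = 80;
	f->src_port = 12345;
	f->proto = 6;
	f->flags = 0;
}

static int
demo_equal_despite_padding(void)
{
	struct flow_id a, b;

	flow_id_fill(&a, 0x00);
	flow_id_fill(&b, 0xFF);
	/* A whole-struct memcmp()/bcmp() here may report a difference. */
	return flow_id_equal(&a, &b);
}
```

The two structures carry identical field values but may differ in their padding bytes, which is why only the member-wise comparison is reliable.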


106824 12-Nov-2002 hsu

Turn off duplicate lock checking for inp locks because udp_input()
intentionally locks two inp records simultaneously.


106736 10-Nov-2002 sam

a better solution to building FAST_IPSEC w/o INET6

Submitted by: Jeffrey Hsu <hsu@FreeBSD.org>


106696 09-Nov-2002 alfred

Fix instances of macros with improperly parenthesized arguments.

Verified by: md5
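
A generic illustration of the bug class, using made-up macros rather than any of the ones touched by this commit:

```c
/* Argument not parenthesized: DOUBLE_BAD(1 + 3) expands to (1 + 3 * 2). */
#define DOUBLE_BAD(x)	(x * 2)

/* Argument parenthesized: DOUBLE_GOOD(1 + 3) expands to ((1 + 3) * 2). */
#define DOUBLE_GOOD(x)	((x) * 2)
```

When every existing caller passes a simple expression, both forms compile to the same object code, which fits the "Verified by: md5" note above.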


106681 08-Nov-2002 sam

temporarily disallow FAST_IPSEC and INET6 to avoid potential panics;
will correct this before 5.0 release


106680 08-Nov-2002 sam

FAST_IPSEC fixups:

o fix #ifdef typo
o must use "bounce functions" when dispatched from the protosw table

don't know how this stuff was missed in my testing; must've committed
the wrong bits

Pointy hat: sam
Submitted by: "Doug Ambrisko" <ambrisko@verniernetworks.com>


106679 08-Nov-2002 sam

fixup FAST_IPSEC build w/o INET6


106678 08-Nov-2002 sam

correct fast ipsec logic: compare destination ip address against the
contents of the SA, not the SP

Submitted by: "Doug Ambrisko" <ambrisko@verniernetworks.com>


106625 08-Nov-2002 jhb

Cast a ptrdiff_t to an int to printf.


106271 31-Oct-2002 jeff

- Consistently update snd_wl1, snd_wl2, and rcv_up in the header
prediction code. Previously, 2GB worth of header predicted data
could leave these variables too far out of sequence which would cause
problems after receiving a packet that did not match the header
prediction.

Submitted by: Bill Baumann <bbaumann@isilon.com>
Sponsored by: Isilon Systems, Inc.
Reviewed by: hsu, pete@isilon.com, neal@isilon.com, aaronp@isilon.com
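
TCP sequence comparisons are done modulo 2^32, so they are only meaningful while the two values are less than 2^31 apart. A sketch of the usual "less than" test follows; seq_lt() is written here as a function for illustration.

```c
#include <stdint.h>

/* Returns 1 if sequence number a is "before" b in modulo-2^32 arithmetic. */
static int
seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}
```

A stale value that has fallen more than 2 GB behind no longer compares as older: seq_lt(0, 0x80000001) is false. This is the kind of failure the commit avoids by keeping snd_wl1, snd_wl2, and rcv_up current.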


106198 30-Oct-2002 hsu

Don't need to check if SO_OOBINLINE is defined.
Don't need to protect isipv6 conditional with INET6.
Fix leading indentation in 2 lines.


106152 29-Oct-2002 fenner

Renumber IPPROTO_DIVERT out of the range of valid IP protocol numbers.
This allows socket() to return an error when the kernel is not built
with IPDIVERT, and doesn't prevent future applications from using the
"borrowed" IP protocol number. The sysctl net.inet.raw.olddiverterror
controls whether opening a socket with the "borrowed" IP protocol
fails with an accompanying kernel printf; this code should last only a
couple of releases.

Approved by: re


106118 29-Oct-2002 maxim

Lower the priority of "session drop" messages.

Requested by: Eugene Grosbein <eugen@kuzbass.ru>
MFC after: 3 days


105899 24-Oct-2002 mux

Oops, forgot to commit this file. This is part of the fix
for ipfw2 panics on sparc64.


105887 24-Oct-2002 mux

Fix ipfw2 panics on 64-bit platforms.

Quoting luigi:

In order to make the userland code fully 64-bit clean it may
be necessary to commit other changes that may or may not cause
a minor change in the ABI.

Reviewed by: luigi


105886 24-Oct-2002 luigi

src and dst address were erroneously swapped in SRC_SET and DST_SET
commands. Use the correct one. Also affects ipfw2 in -stable.


105856 24-Oct-2002 mux

Fix kernel build on sparc64 in the IPDIVERT case.


105840 24-Oct-2002 iedowse

Unbreak the automatic remapping of an INADDR_ANY destination address
to the primary local IP address when doing a TCP connect(). The
tcp_connect() code was relying on in_pcbconnect (actually in_pcbladdr)
modifying the passed-in sockaddr, and I failed to notice this in
the recent change that added in_pcbconnect_setup(). As a result,
tcp_connect() was ending up using the unmodified sockaddr address
instead of the munged version.

There are two cases to handle: if in_pcbconnect_setup() succeeds,
then the PCB has already been updated with the correct destination
address as we pass it pointers to inp_faddr and inp_fport directly.
If in_pcbconnect_setup() fails due to an existing but dead connection,
then copy the destination address from the old connection.


105775 23-Oct-2002 maxim

Kill EOL spaces.

Approved by: luigi
MFC after: 1 week


105774 23-Oct-2002 maxim

Use syslog for messages about dropped sessions, do not flood a console.

Suggested by: Eugene Grosbein <eugen@kuzbass.ru>
Approved by: luigi
MFC after: 1 week


105748 22-Oct-2002 suz

fixed a kernel crash caused by "ifconfig stf0 inet 1.2.3.4"
MFC after: 1 week


105651 21-Oct-2002 iedowse

Implement a new IP_SENDSRCADDR ancillary message type that permits
a server process bound to a wildcard UDP socket to select the IP
address from which outgoing packets are sent on a per-datagram
basis. When combined with IP_RECVDSTADDR, such a server process can
guarantee to reply to an incoming request using the same source IP
address as the destination IP address of the request, without having
to open one socket per server IP address.

Discussed on: -net
Approved by: re


105649 21-Oct-2002 iedowse

Remove the "temporary connection" hack in udp_output(). In order
to send datagrams from an unconnected socket, we used to first block
input, then connect the socket to the sendmsg/sendto destination,
send the datagram, and finally disconnect the socket and unblock
input.

We now use in_pcbconnect_setup() to check if a connect() would have
succeeded, but we never record the connection in the PCB (local
anonymous port allocation is still recorded, though). The result
from in_pcbconnect_setup() authorises the sending of the datagram
and selects the local address and port to use, so we just construct
the header and call ip_output().

Discussed on: -net
Approved by: re


105629 21-Oct-2002 iedowse

Replace in_pcbladdr() with a more generic inner subroutine for
in_pcbconnect() called in_pcbconnect_setup(). This version performs
all of the functions of in_pcbconnect() except for the final
committing of changes to the PCB. In the case of an EADDRINUSE error
it can also provide to the caller the PCB of the duplicate connection,
avoiding an extra in_pcblookup_hash() lookup in tcp_connect().

This change will allow the "temporary connect" hack in udp_output()
to be removed and is part of the preparation for adding the
IP_SENDSRCADDR control message.

Discussed on: -net
Approved by: re


105586 20-Oct-2002 phk

Fix two instances of variant struct definitions in sys/netinet:

Remove the never completed _IP_VHL version, it has not caught on
anywhere and it would make us incompatible with other BSD netstacks
to retain this version.

Add a CTASSERT protecting sizeof(struct ip) == 20.

Don't let the size of struct ipq depend on the IPDIVERT option.

This is a functional no-op commit.

Approved by: re


105570 20-Oct-2002 rwatson

When a packet is multicast encapsulated, give labeled policies the
opportunity to preserve the label.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105565 20-Oct-2002 iedowse

Split out most of the logic from in_pcbbind() into a new function
called in_pcbbind_setup() that does everything except commit the
changes to the PCB. There should be no functional change here, but
in_pcbbind_setup() will be used by the soon-to-appear IP_SENDSRCADDR
control message implementation to check or allocate the source
address and port.

Discussed on: -net
Approved by: re


105440 19-Oct-2002 mux

Several malloc() calls were passing the M_DONTWAIT flag
which is an mbuf allocation flag. Use the correct
M_NOWAIT malloc() flag. Fortunately, both were defined
to 1, so this commit is a no-op.


105340 17-Oct-2002 ume

last arg of in6?_gif_output() is not used any more.

Obtained from: KAME
MFC after: 3 weeks


105301 16-Oct-2002 alfred

de-__P().


105295 16-Oct-2002 ume

use encapcheck.

Obtained from: KAME
MFC after: 3 weeks


105293 16-Oct-2002 ume

- after gif_set_tunnel(), psrc/pdst may be null. set IFF_RUNNING accordingly.
- set IFF_UP on SIOCSIFADDR. be consistent with others.
- set if_addrlen explicitly (just in case)
- multi destination mode is long gone.
- missing break statement
- add gif_set_tunnel(), so that we can set tunnel address from within the
kernel at ease.
- encap_attach/detach dynamically on ioctls
- move encap_attach() to dedicated function in in*_gif.c

Obtained from: KAME
MFC after: 3 weeks


105291 16-Oct-2002 dillon

Fix oops in my last commit, I was calculating a new length but then not
using it. (The code is already correct in -stable).

Found by: silby


105218 16-Oct-2002 guido

Get rid of checking for ip sec history. It is true that packets are not
supposed to be checked by the firewall rules twice. However, because the
various ipsec handlers never call ip_input(), this never happens anyway.

This fixes the situation where a gif tunnel is encrypted with IPsec. In
such a case, after IPsec processing, the unencrypted contents from the
GIF tunnel are fed back to the ipintrq and subsequently handled by
ip_input(). Yet, since there still is IPSec history attached, the
packets coming out from the gif device are never fed into the filtering
code.
This fix was sent to Itojun, and he pointed towards
http://www.netbsd.org/Documentation/network/ipsec/#ipf-interaction.
This patch actually implements what is stated there (specifically:
Packet came from tunnel devices (gif(4) and ipip(4)) will still
go through ipf(4). You may need to identify these packets by
using interface name directive in ipf.conf(5).)

Reviewed by: rwatson
MFC after: 3 weeks


105201 16-Oct-2002 sam

correct PCB locking in broadcast/multicast case that was exposed by change
to use udp_append

Reviewed by: hsu


105199 16-Oct-2002 sam

Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implementation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks


105194 16-Oct-2002 sam

Replace aux mbufs with packet tags:

o instead of a list of mbufs use a list of m_tag structures a la openbsd
o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit
ABI/module number cookie
o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and
use this in defining openbsd-compatible m_tag_find and m_tag_get routines
o rewrite KAME use of aux mbufs in terms of packet tags
o eliminate the most heavily used aux mbufs by adding an additional struct
inpcb parameter to ip_output and ip6_output to allow the IPsec code to
locate the security policy to apply to outbound packets
o bump __FreeBSD_version so code can be conditionalized
o fixup ipfilter's call to ip_output based on __FreeBSD_version

Reviewed by: julian, luigi (silent), -arch, -net, darren
Approved by: julian, silence from everyone else
Obtained from: openbsd (mostly)
MFC after: 1 month


104975 12-Oct-2002 seanc

Increase the max dummynet hash size from 1024 to 65536. Default is still
1024.

Silence on: -net, -ipfw 4weeks+
Reviewed by: dd
Approved by: knu (mentor)
MFC after: 3 weeks


104825 10-Oct-2002 dillon

turn off debugging by default if bandwidth delay product limiting is
turned on (it is already off in -stable).


104815 10-Oct-2002 dillon

Update various comments mainly related to retransmit/FIN that I
documented while working on a previous bug.

Fix a PERSIST bug. Properly account for a FIN sent during a PERSIST.

MFC after: 7 days


104774 10-Oct-2002 maxim

Fix IPOPT_TS processing: do not overwrite IP address by timestamp.

PR: misc/42121
Submitted by: Praveen Khurjekar <praveen@codito.com>
Reviewed by: silence on -net
MFC after: 1 month


104366 02-Oct-2002 sobomax

Since bpf is no longer an optional component, remove associated ifdef's.

Submitted by: don't quite remember - the name of the sender disappeared
with the rest of my inbox. :(


104343 02-Oct-2002 mike

Include <sys/cdefs.h> so the visibility conditionals are available.
(This should have been included with the previous revision.)


104342 02-Oct-2002 mike

Use visibility conditionals. Only TCP_NODELAY ends up being defined
in the standards case.


104226 30-Sep-2002 dillon

Guido found another bug. There is a situation with
timestamped TCP packets where FreeBSD will send DATA+FIN and
a W2K box will ack just the DATA portion. If this occurs
after FreeBSD has done a (NewReno) fast-retransmit and is
recovering it (dupacks > threshold) it triggers a case in
tcp_newreno_partial_ack() (tcp_newreno() in stable) where
tcp_output() is called with the expectation that the retransmit
timer will be reloaded. But tcp_output() falls through and
returns without doing anything, causing the persist timer to be
loaded instead. This causes the connection to hang until W2K gives up.
This occurs because in the case where only the FIN must be acked, the
'len' calculation in tcp_output() will be 0, a lot of checks will be
skipped, and the FIN check will also be skipped because it is designed
to handle FIN retransmits, not forced transmits from tcp_newreno().

The solution is to simply set TF_ACKNOW before calling tcp_output()
to absolutely guarantee that it will run the send code and reset the
retransmit timer. TF_ACKNOW is already used for this purpose in other
cases.

For some unknown reason this patch also seems to greatly reduce
the number of duplicate acks received when Guido runs his tests over
a lossy network. It is quite possible that there are other
tcp_newreno{_partial_ack()} cases which were not generating the expected
output which this patch also fixes.

X-MFC after: Will be MFC'd after the freeze is over


104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


104073 28-Sep-2002 peter

Zap now-unused SHLIB_MINOR


103852 23-Sep-2002 maxim

Slightly rearrange the code in rev. 1.164:

o Move len initialization closer to place of its first usage.
o Compare len with 0 to improve readability.
o Explicitly zero out phlen in ip_insertoptions() in failure case.

Suggested by: jhb
Reviewed by: jhb
MFC after: 2 weeks


103842 23-Sep-2002 alfred

s/__attribute__((__packed__))/__packed/g


103776 22-Sep-2002 silby

Fix issue where shutdown(socket, SHUT_RD) was effectively
ignored for TCP sockets.

NetBSD PR: 18185
Submitted by: Sean Boudreau <seanb@qnx.com>
MFC after: 3 days


103553 18-Sep-2002 phk

Use m_fixhdr() rather than roll our own.


103505 17-Sep-2002 dillon

Guido reported an interesting bug where an FTP connection between a
Windows 2000 box and a FreeBSD box could stall. The problem turned out
to be a timestamp reply bug in the W2K TCP stack. FreeBSD sends a
timestamp with the SYN, W2K returns a timestamp of 0 in the SYN+ACK
causing FreeBSD to calculate an insane SRTT and RTT, resulting in
a maximal retransmit timeout (60 seconds). If there is any packet
loss on the connection for the first six or so packets the retransmit
case may be hit (the window will still be too small for fast-retransmit),
causing a 60+ second pause. The W2K box gives up and closes the
connection.

This commit works around the W2K bug.

15:04:59.374588 FREEBSD.20 > W2K.1036: S 1420807004:1420807004(0) win 65535 <mss 1460,nop,wscale 2,nop,nop,timestamp 188297344 0> (DF) [tos 0x8]
15:04:59.377558 W2K.1036 > FREEBSD.20: S 4134611565:4134611565(0) ack 1420807005 win 17520 <mss 1460,nop,wscale 0,nop,nop,timestamp 0 0> (DF)

Bug reported by: Guido van Rooij <guido@gvr.org>


103481 17-Sep-2002 sobomax

Remove __RCSID().

Submitted by: bde


103479 17-Sep-2002 maxim

Explicitly clear M_FRAG flag on a mbuf with the last fragment to unbreak
ip fragments reassembling for loopback interface.

Discussed with: bde, jlemon
Reviewed by: silence on -net
MFC after: 2 weeks


103478 17-Sep-2002 maxim

In rare cases when there is no room for ip options ip_insertoptions()
can fail and corrupt a header length. Initialize len and check what
ip_insertoptions() returns.

Reviewed by: archie, silence on -net
MFC after: 5 days


103444 17-Sep-2002 jennifer

Temporary fix for inet6. The final fix is to change in6_pcbnotify to take pcbinfo instead
of pcbhead. It is on the way.


103176 10-Sep-2002 sobomax

Remove superfluous break.


103124 09-Sep-2002 sobomax

Since from now on encap_input() also catches IPPROTO_MOBILE and IPPROTO_GRE
packets in addition to IPPROTO_IPV4 and IPPROTO_IPV6, explicitly specify
IPPROTO_IPV4 or IPPROTO_IPV6 instead of -1 when calling encap_attach().

MFC after: 28 days
(along with other if_gre changes)


103032 06-Sep-2002 sobomax

Reduce namespace pollution by staticizing everything, which doesn't need to
be visible from outside of the module.


103026 06-Sep-2002 sobomax

Add a new gre(4) driver, which could be used to create GRE (RFC1701)
and MOBILE (RFC2004) IP tunnels.

Obtained from: NetBSD


102981 05-Sep-2002 bde

Fixed namespace pollution in uma changes:
- use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.
- don't include <sys/uma.h>.
Namespace pollution makes "opaque" types like uma_zone_t perfectly
non-opaque. Such types should never be used (see style(9)).

Fixed subsequently grown dependencies of this header on its own pollution:
- include <sys/_mutex.h> and its prerequisite <sys/_lock.h> instead of
depending on namespace pollution 2 layers deep in <sys/uma.h>.


102967 05-Sep-2002 bde

Include <sys/mutex.h> and its prerequisite <sys/lock.h> instead of depending
on namespace pollution 4 layers deep in <netinet/in_pcb.h>.

Removed unused includes. Sorted includes.


102925 04-Sep-2002 sobomax

Add in_hosteq() and in_nullhost() macros to make life of developers
porting NetBSD code a little bit easier.

Obtained from: NetBSD


102575 29-Aug-2002 darrenr

some ipfilter files that accidentally got imported here


102515 28-Aug-2002 darrenr

This commit was generated by cvs2svn to compensate for changes in r102514,
which included commits to RCS files with non-trunk default branches.


102412 25-Aug-2002 charnier

Replace various spellings with FALLTHROUGH which is lint()able


102397 25-Aug-2002 cjc

Lock the sysctl(8) knobs that turn ip{,6}fw(8) firewalling and
firewall logging on and off when at elevated securelevel(8). It would
be nice to be able to only lock these at securelevel >= 3, like rules
are, but there is no such functionality at present. I don't see reason
to be adding features to securelevel(8) with MAC being merged into 5.0.

PR: kern/39396
Reviewed by: luigi
MFC after: 1 week


102368 24-Aug-2002 dillon

Correct bug in t_bw_rtttime rollover, #undef USERTT


102291 22-Aug-2002 archie

Replace (ab)uses of "NULL" where "0" is really meant.


102227 21-Aug-2002 mike

o Merge <machine/ansi.h> and <machine/types.h> into a new header
called <machine/_types.h>.
o <machine/ansi.h> will continue to live so it can define MD clock
macros, which are only MD because of gratuitous differences between
architectures.
o Change all headers to make use of this. This mainly involves
changing:
#ifdef _BSD_FOO_T_
typedef _BSD_FOO_T_ foo_t;
#undef _BSD_FOO_T_
#endif
to:
#ifndef _FOO_T_DECLARED
typedef __foo_t foo_t;
#define _FOO_T_DECLARED
#endif

Concept by: bde
Reviewed by: jake, obrien


102218 21-Aug-2002 truckman

Create new functions in_sockaddr(), in6_sockaddr(), and
in6_v4mapsin6_sockaddr() which allocate the appropriate sockaddr_in*
structure and initialize it with the address and port information passed
as arguments. Use calls to these new functions to replace code that is
replicated multiple times in in_setsockaddr(), in_setpeeraddr(),
in6_setsockaddr(), in6_setpeeraddr(), in6_mapped_sockaddr(), and
in6_mapped_peeraddr(). Inline COMMON_END in tcp_usr_accept() so that
we can call in_sockaddr() with temporary copies of the address and port
after the PCB is unlocked.

Fix the lock violation in tcp6_usr_accept() (caused by calling MALLOC()
inside in6_mapped_peeraddr() while the PCB is locked) by changing
the implementation of tcp6_usr_accept() to match tcp_usr_accept().

Reviewed by: suz


102131 19-Aug-2002 jmallett

Enclose IPv6 addresses in brackets when they are displayed printable with a
TCP/UDP port separated by a colon. This is for the log_in_vain facility.

Pointed out by: Edward J. M. Brocklesby
Reviewed by: ume
MFC after: 2 weeks


102086 19-Aug-2002 luigi

Raise limit for port lists to 30 entries/ranges.

Remove a duplicate "logging" message, and identify the firewall
as ipfw2 in the boot message.


102017 17-Aug-2002 dillon

Implement TCP bandwidth delay product window limiting, similar to (but
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.

net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit

MFC after: 1 week
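
A minimal sketch of the general idea behind bandwidth delay product window
limiting: estimate how much data the path can hold (bandwidth times round-trip
time) and clamp it between a floor and a ceiling, as the inflight_min and
inflight_max knobs suggest. The function name, units, and sample numbers are
assumptions made for illustration; the committed algorithm is more involved
and is not reproduced here.

```python
def inflight_window(bandwidth_bytes_per_s, rtt_ms, inflight_min, inflight_max):
    """Clamp the bandwidth-delay product (in bytes) to [inflight_min, inflight_max]."""
    bdp = bandwidth_bytes_per_s * rtt_ms // 1000   # bytes the path can hold in flight
    return max(inflight_min, min(bdp, inflight_max))

# Hypothetical path: 10 Mbit/s (1,250,000 bytes/s) with a 40 ms round trip.
typical = inflight_window(1_250_000, 40, inflight_min=6_144, inflight_max=1 << 30)

# Very short RTT: the floor keeps the window from collapsing.
floored = inflight_window(1_250_000, 1, inflight_min=6_144, inflight_max=1 << 30)
```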


102002 17-Aug-2002 hsu

Cosmetic-only changes for readability.

Reviewed by: (early form passed by) bde
Approved by: itojun (from core@kame.net)


101978 16-Aug-2002 luigi

sys/netinet/ip_fw2.c:

Implement the M_SKIP_FIREWALL bit in m_flags to avoid loops
for firewall-generated packets (the constant has to go in sys/mbuf.h).

Better comments on keepalive generation, and enforce dyn_rst_lifetime
and dyn_fin_lifetime to be less than dyn_keepalive_period.

Enforce limits (up to 64k) on the number of dynamic buckets, and
retry allocation with smaller sizes.

Raise default number of dynamic rules to 4096.

Improved handling of set of rules -- now you can atomically
enable/disable multiple sets, move rules from one set to another,
and swap sets.

sbin/ipfw/ipfw2.c:

userland support for "noerror" pipe attribute.

userland support for sets of rules.

minor improvements on rule parsing and printing.

sbin/ipfw/ipfw.8:

more documentation on ipfw2 extensions, differences from ipfw1
(so we can use the same manpage for both), stateful rules,
and some additional examples.
Feedback and more examples needed here.


101975 16-Aug-2002 alfred

make the strings for tcptimers, tanames and prurequests const to silence
warnings.


101948 15-Aug-2002 rwatson

Code formatting sync to trustedbsd_mac: don't perform an assignment
in an if clause.


101934 15-Aug-2002 rwatson

Rename mac_check_socket_receive() to mac_check_socket_deliver() so that
we can use the names _receive() and _send() for the receive() and send()
checks. Rename related constants, policy implementations, etc.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101928 15-Aug-2002 hsu

Reset dupack count in header prediction.
Follow-on to rev 1.39.

Reviewed by: jayanth, Thomas R Henderson <thomas.r.henderson@boeing.com>, silby, dillon


101927 15-Aug-2002 luigi

Kernel support for a dummynet option:
When a pipe or queue has the "noerror" attribute, do not report
drops to the caller (ip_output() and friends).
(2 lines to implement it, 2 lines to document it.)

This will let you simulate losses on the sender side as if they
happened in the middle of the network, i.e. with no explicit feedback
to the sender.

manpage and ipfw2.c changes to follow shortly, together with other
ipfw2 changes.

Requested by: silby
MFC after: 3 days


101921 15-Aug-2002 rwatson

It's now sufficient to rely on a nested include of _label.h to make sure
all structures in ip_var.h are defined, so remove include of mac.h.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101920 15-Aug-2002 rwatson

Perform a nested include of _label.h if #ifdef _KERNEL. This will
satisfy consumers of ip_var.h that need a complete definition of
struct ipq and don't include mac.h.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101919 15-Aug-2002 rwatson

Add mac.h -- raw_ip.c was depending on nested inclusion of mac.h which
is no longer present.

Pointed out by: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101843 13-Aug-2002 phk

remove spurious printf


101713 12-Aug-2002 jennifer

Assert that the inpcb lock is held when calling tcp_output().

Approved by: hsu


101628 10-Aug-2002 luigi

One bugfix and one new feature.

The bugfix (ipfw2.c) makes the handling of port numbers with
a dash in the name, e.g. ftp-data, consistent with old ipfw:
use \\ before the - to consider it as part of the name and not
a range separator.

The new feature (all this description will go in the manpage):

each rule now belongs to one of 32 different sets, which can
be optionally specified in the following form:

ipfw add 100 set 23 allow ip from any to any

If "set N" is not specified, the rule belongs to set 0.

Individual sets can be disabled, enabled, and deleted with the commands:

ipfw disable set N
ipfw enable set N
ipfw delete set N

Enabling/disabling of a set is atomic. Rules belonging to a disabled
set are skipped during packet matching, and they are not listed
unless you use the '-S' flag in the show/list commands.
Note that dynamic rules, once created, are always active until
they expire or their parent rule is deleted.
Set 31 is reserved for the default rule and cannot be disabled.

All sets are enabled by default. The enable/disable status of the sets
can be shown with the command

ipfw show sets

Hopefully, this feature will make life easier for those who want to
have atomic ruleset addition/deletion/tests. Examples:

To add a set of rules atomically:

ipfw disable set 18
ipfw add ... set 18 ... # repeat as needed
ipfw enable set 18

To delete a set of rules atomically

ipfw disable set 18
ipfw delete set 18
ipfw enable set 18

To test a ruleset and disable it and regain control if something
goes wrong:

ipfw disable set 18
ipfw add ... set 18 ... # repeat as needed
ipfw enable set 18 ; echo "done "; sleep 30 && ipfw disable set 18

here if everything goes well, you press control-C before
the "sleep" terminates, and your ruleset will be left
active. Otherwise, e.g. if you cannot access your box,
the ruleset will be disabled after the sleep terminates.

I think there is only one more thing that one might want, namely
a command to assign all rules in set X to set Y, so one can
test a ruleset using the above mechanisms, and once it is
considered acceptable, make it part of an existing ruleset.
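
The set semantics described above can be modelled in a few lines. This is a
toy userland sketch, not the ip_fw2.c implementation: the class and method
names are hypothetical, and the single 32-bit enable mask is an assumption
chosen because it makes enable/disable a one-step (atomic) change for every
rule in a set.

```python
class RuleSets:
    """Toy model of ipfw2 rule sets: 32 sets tracked by one enable bitmask."""

    NUM_SETS = 32
    DEFAULT_SET = 31  # reserved for the default rule; cannot be disabled

    def __init__(self):
        self.enabled = (1 << self.NUM_SETS) - 1  # all sets enabled by default
        self.rules = []                           # (rule_number, set_id, body)

    def _check(self, set_id):
        if not 0 <= set_id < self.NUM_SETS:
            raise ValueError("set must be in 0..31")

    def add(self, number, body, set_id=0):
        self._check(set_id)
        self.rules.append((number, set_id, body))

    def disable(self, set_id):
        self._check(set_id)
        if set_id == self.DEFAULT_SET:
            raise ValueError("set 31 cannot be disabled")
        self.enabled &= ~(1 << set_id)  # one mask update hides every rule in the set

    def enable(self, set_id):
        self._check(set_id)
        self.enabled |= 1 << set_id

    def delete(self, set_id):
        self._check(set_id)
        self.rules = [r for r in self.rules if r[1] != set_id]

    def active(self):
        """Rules consulted during matching: those whose set is enabled."""
        return sorted(r for r in self.rules if (self.enabled >> r[1]) & 1)


# Stage rules in a disabled set, then switch them on together.
fw = RuleSets()
fw.add(100, "allow ip from any to any", set_id=23)
fw.disable(18)
fw.add(200, "hypothetical rule A", set_id=18)
fw.add(210, "hypothetical rule B", set_id=18)
staged = [r[0] for r in fw.active()]   # set 18 is still skipped
fw.enable(18)
live = [r[0] for r in fw.active()]     # both new rules appear at once
```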


101405 05-Aug-2002 silby

Handle PMTU discovery in syn-ack packets slightly differently;
rely on syncache flags instead of directly accessing the route
entry.

MFC after: 3 days


101335 04-Aug-2002 luigi

bugfix: move check for udp_blackhole before the one for icmp_bandlim.

MFC after: 3 days


101268 03-Aug-2002 luigi

Fix handling of packets which matched an "ipfw fwd" rule on the input side.


101239 02-Aug-2002 rwatson

When preserving the IP header in extra mbuf in the IP forwarding
case, also preserve the MAC label. Note that this mbuf allocation
is fairly non-optimal, but not my fault.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101233 02-Aug-2002 rwatson

Work to fix LINT build.

Reported by: phk


101185 01-Aug-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Add MAC support for the UDP protocol. Invoke appropriate MAC entry
points to label packets that are generated by local UDP sockets,
and to authorize delivery of mbufs to local sockets both in the
multicast/broadcast case and the unicast case.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101137 01-Aug-2002 rwatson

Document the undocumented assumption that at least one of the PCB
pointer and incoming mbuf pointer will be non-NULL in tcp_respond().
This is relied on by the MAC code for correctness, as well as
existing code.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101136 01-Aug-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Add support for labeling most out-going ICMP messages using an
appropriate MAC entry point. Currently, we do not explicitly
label packet reflect (timestamp, echo request) ICMP events,
implicitly using the originating packet label since the mbuf is
reused. This will be made explicit at some point.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101106 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the TCP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check socket and
mbuf labels before permitting delivery to a socket. Assign labels
to newly accepted connections when the syncache/cookie code has done
its business. Also set peer labels as convenient. Currently,
MAC policies cannot influence the PCB matching algorithm, so cannot
implement polyinstantiation. Note that there is at least one case
where a PCB is not available due to the TCP packet not being associated
with any socket, so we don't label in that case, but need to handle
it in a special manner.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101103 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the raw IP socket code for packet generation and delivery:
label outgoing mbufs with the label of the socket, and check the
socket and mbuf labels before permitting delivery to a socket,
permitting MAC policies to selectively allow delivery of raw IP mbufs
to various raw IP sockets that may be open. Restructure the policy
checking code to compose IPsec and MAC results in a more readable
manner.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101096 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

When fragmenting an IP datagram, invoke an appropriate MAC entry
point so that MAC labels may be copied (...) to the individual
IP fragment mbufs by MAC policies.

When IP options are inserted into an IP datagram when leaving a
host, preserve the label if we need to reallocate the mbuf for
alignment or size reasons.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101095 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument the code managing IP fragment reassembly queues (struct ipq)
to invoke appropriate MAC entry points to maintain a MAC label on
each queue. Permit MAC policies to associate information with a queue
based on the mbuf that caused it to be created, update that information
based on further mbufs accepted by the queue, influence the decision
making process by which mbufs are accepted to the queue, and set the
label of the mbuf holding the reassembled datagram following reassembly
completion.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101091 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

When generating an IGMP message, invoke a MAC entry point to permit
the MAC framework to label its mbuf appropriately for the target
interface.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101090 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

When generating an ARP query, invoke a MAC entry point to permit the
MAC framework to label its mbuf appropriately for the interface.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101088 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the MAC framework to label mbuf created using divert sockets.
These labels may later be used for access control on delivery to
another socket, or to an interface.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


100993 30-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Label IP fragment reassembly queues, permitting security features to
be maintained on those objects. ipq_label will be used to manage
the reassembly of fragments into IP datagrams using security
properties. This permits policies to deny the reassembly of fragments,
as well as influence the resulting label of a datagram following
reassembly.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


100871 29-Jul-2002 maxim

Use a common way to release locks before exit.

Reviewed by: hsu


100831 28-Jul-2002 truckman

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


100685 25-Jul-2002 ume

make setsockopt(IPV6_V6ONLY, 0) actually work for tcp6.

MFC after: 1 week


100683 25-Jul-2002 ume

cleanup usage of ip6_mapped_addr_on and ip6_v6only. now,
ip6_mapped_addr_on is unified into ip6_v6only.

MFC after: 1 week


100589 24-Jul-2002 luigi

Only log things if net.inet.ip.fw.verbose is set


100537 23-Jul-2002 ru

Don't forget to recalculate the IP checksum of the original
IP datagram embedded into ICMP error message.

Spotted by: tcpdump 3.7.1 (-vvv)
MFC after: 3 days


100534 22-Jul-2002 ru

Don't shrink socket buffers in tcp_mss(), application might have already
configured them with setsockopt(SO_*BUF), for RFC1323's scaled windows.

PR: kern/11966
MFC after: 1 week


100508 22-Jul-2002 ume

do not refer to IN6P_BINDV6ONLY anymore.

Obtained from: KAME
MFC after: 1 week


100420 20-Jul-2002 jdp

Fix overflows in intermediate calculations in sysctl_msec_to_ticks().
At hz values of 1000 and above the overflows caused net.inet.tcp.keepidle
to be reported as negative.

MFC after: 3 days
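
A sketch of how such an overflow can arise, assuming the reported value was
computed as ticks * 1000 / hz in a signed 32-bit int with the usual
two's-complement wraparound (the exact expression in the sysctl handler is not
quoted in this log, so treat it as an assumption). A keepidle of two hours is
used as the example value; the function names are hypothetical.

```python
def to_int32(x):
    """Wrap an integer to the range of a signed 32-bit C int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def c_div(a, b):
    """Integer division truncating toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def ticks_to_msec_32bit(ticks, hz):
    # The intermediate product is truncated to 32 bits before the division.
    return c_div(to_int32(ticks * 1000), hz)

def ticks_to_msec_wide(ticks, hz):
    # Same formula with an intermediate wide enough to hold the product.
    return ticks * 1000 // hz

two_hours_s = 7200
at_hz_100 = ticks_to_msec_32bit(two_hours_s * 100, 100)      # product fits in 32 bits
at_hz_1000 = ticks_to_msec_32bit(two_hours_s * 1000, 1000)   # product wraps negative
fixed = ticks_to_msec_wide(two_hours_s * 1000, 1000)
```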


100419 20-Jul-2002 rwatson

Don't export 'struct ipq' from kernel, instead #ifdef _KERNEL. As kernel
data structures pick up security and synchronization primitives, it
becomes increasingly desirable not to arbitrarily export them via
include files to userland, as the userland applications pick up new
#include dependencies.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


100373 19-Jul-2002 dillon

Add the tcps_sndrexmitbad statistic, keep track of late acks that caused
unnecessary retransmissions.


100335 18-Jul-2002 dillon

Introduce two new sysctl's:

net.inet.tcp.rexmit_min (default 3 ticks equiv)

This sysctl is the retransmit timer RTO minimum,
specified in milliseconds. This value is
designed for algorithmic stability only.

net.inet.tcp.rexmit_slop (default 200ms)

This sysctl is the retransmit timer RTO slop
which is added to every retransmit timeout and
is designed to handle protocol stack overheads
and delayed ack issues.

Note that the *original* code applied a 1-second
RTO minimum but never applied real slop to the RTO
calculation, so any RTO calculation over one second
would have no slop and thus not account for
protocol stack overheads (TCP timestamps are not
a measure of protocol turnaround!). Essentially,
the original code made the RTO calculation almost
completely irrelevant.

Please note that the 200ms slop is debatable.
This commit is not meant to be a line in the sand,
and if the community winds up deciding that increasing
it is the correct solution then it's easy to do.
Note that larger values will destroy performance
on lossy networks while smaller values may result in
a greater number of unnecessary retransmits.
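
The effect of the two knobs can be sketched as follows. This is an illustration only: the function name is hypothetical, all values are in milliseconds, and the order of operations (add the slop, then enforce the minimum) is an assumption rather than something the message states.

```c
#include <assert.h>

/*
 * Hypothetical helper, in milliseconds.  The order of operations (add
 * slop, then enforce the minimum) is an assumption for illustration.
 */
static int
rto_with_slop(int rto_calc_ms, int rexmit_min_ms, int rexmit_slop_ms)
{
	int rto;

	assert(rto_calc_ms >= 0 && rexmit_min_ms >= 0 && rexmit_slop_ms >= 0);
	rto = rto_calc_ms + rexmit_slop_ms;
	if (rto < rexmit_min_ms)
		rto = rexmit_min_ms;
	return (rto);
}
```

With a 30 ms minimum (three ticks at an assumed hz of 100) and 200 ms of slop, a calculated RTO of 50 ms becomes 250 ms. Under the old scheme, modelled here as a 1000 ms minimum with no slop, the same calculation is pinned at 1000 ms, while a 1500 ms calculation passes through with no slop at all.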


100288 18-Jul-2002 luigi

Move IPFW2 definition before including ip_fw.h

Make indentation of new parts consistent with the style used for this file.


100270 17-Jul-2002 dillon

I don't know how the minimum retransmit timeout managed to get set to
one second but it badly breaks throughput on networks with minor packet
loss.

Complaints by: at least two people tracked down to this.
MFC after: 3 days


100228 17-Jul-2002 luigi

Fix a panic when doing "ipfw add pipe 1 log ..."

Also synchronize ip_dummynet.c with the version in RELENG_4 to
ease MFC's.


100004 14-Jul-2002 luigi

Implement keepalives for dynamic rules, so they will not expire
just because you leave your session idle.

Also, put in a fix for 64-bit architectures (to be revised).

In detail:

ip_fw.h

* Reorder fields in struct ip_fw to avoid alignment problems on
64-bit machines. This only masks the problem, I am still not
sure whether I am doing something wrong in the code or there
is a problem elsewhere (e.g. different alignment of structures
between userland and kernel because of pragmas etc.)

* added fields in dyn_rule to store ack numbers, so we can
generate keepalives when the dynamic rule is about to expire

ip_fw2.c

* use a local function, send_pkt(), to generate TCP RST for Reset rules;

* save about 250 bytes by cleaning up the various snprintf()
in ipfw_log() ...

* ... and use twice as many bytes to implement keepalives
(this seems to be working, but i have not tested it extensively).

Keepalives are generated once every 5 seconds for the last 20 seconds
of the lifetime of a dynamic rule for an established TCP flow. The
packets are sent to both sides, so if at least one of the endpoints
is responding, the timeout is refreshed and the rule will not expire.

You can disable this feature with

sysctl net.inet.ip.fw.dyn_keepalive=0

(the default is 1, to have them enabled).

MFC after: 1 day

(just kidding... I will supply an updated version of ipfw2 for
RELENG_4 tomorrow).


99891 12-Jul-2002 luigi

Avoid dereferencing a null pointer in ro_rt.

This was always broken in HEAD (the offending statement was introduced
in rev. 1.123 for HEAD), while RELENG_4 included this fix (in rev.
1.99.2.12 for RELENG_4) and I inadvertently deleted it in 1.99.2.30.

So I am also restoring these two lines in RELENG_4 now.
We might need another few things from 1.99.2.30.


99869 12-Jul-2002 truckman

Back out the previous change, since it looks like locking udbinfo provides
sufficient protection.


99863 12-Jul-2002 truckman

Lock inp while we're accessing it.


99838 11-Jul-2002 truckman

Defer calling SYSCTL_OUT() until after the locks have been released.


99837 11-Jul-2002 truckman

Reduce the nesting level of a code block that doesn't need to be in
an else clause.


99642 09-Jul-2002 luigi

Change one variable to make it easier to switch between ipfw and ipfw2


99623 08-Jul-2002 luigi

Fix a bug caused by dereferencing an invalid pointer when
no punch_fw was used.
Fix another couple of bugs which prevented rules from being
installed properly.

On passing, use IPFW2 instead of NEW_IPFW to compile the new code,
and slightly simplify the instruction generation code.


99622 08-Jul-2002 luigi

No functional changes, but:

Following Darren's suggestion, make Dijkstra happy and rewrite the
ipfw_chk() main loop removing a lot of goto's and using instead a
variable to store match status.

Add a lot of comments to explain what instructions are supposed to
do and how -- this should ease auditing of the code and make people
more confident with it.

In terms of code size: the entire file takes about 12700 bytes of text,
about 3K of which are for the main function, ipfw_chk(), and 2K (ouch!)
for ipfw_log().


99621 08-Jul-2002 luigi

Remove one unused command name.


99620 08-Jul-2002 luigi

Forgot to update one field name in one of the latest commits.


99475 05-Jul-2002 luigi

Implement the last 2-3 missing instructions for ipfw,
now it should support all the instructions of the old ipfw.

Fix some bugs in the user interface, /sbin/ipfw.

Please check this code against your rulesets, so i can fix the
remaining bugs (if any, i think they will be mostly in /sbin/ipfw).

Once we have done a bit of testing, this code is ready to be MFC'ed,
together with a bunch of other changes (glue to ipfw, and also the
removal of some global variables) which have been in -current for
a couple of weeks now.

MFC after: 7 days


99207 01-Jul-2002 brian

Remove trailing whitespace


99156 30-Jun-2002 jesper

Extend the effect of the sysctl net.inet.tcp.icmp_may_rst
so that, if we receive an ICMP "time to live exceeded in transit"
(type 11, code 0) for a TCP connection in SYN-SENT state, we close
the connection.

MFC after: 2 weeks


98982 28-Jun-2002 jlemon

One possible code path for syncache_respond() is:

syncache_respond(A), ip_output(), ip_input(), tcp_input(), syncache_badack(B)

Which winds up deleting a different entry from the syncache. Handle
this by not utilizing the next entry in the timer chain until after
syncache_respond() completes. The case of A == B should not be possible.

Problem found by: Don Bowman <don@sandvine.com>


98965 28-Jun-2002 dfr

Fix warning.

Reviewed by: luigi


98943 27-Jun-2002 luigi

The new ipfw code.

This code makes use of variable-size kernel representation of rules
(exactly the same concept of BPF instructions, as used in the BSDI's
firewall), which makes firewall operation a lot faster, and the
code more readable and easier to extend and debug.

The interface with the rest of the system is unchanged, as witnessed
by this commit. The only extra kernel files that I am touching
are if_fw.h and ip_dummynet.c, which is quite tied to ipfw. In
userland I only had to touch those programs which manipulate the
internal representation of firewall rules.

The code is almost entirely new (and I believe I have written the
vast majority of those sections which were taken from the former
ip_fw.c), so rather than modifying the old ip_fw.c I decided to
create a new file, sys/netinet/ip_fw2.c . Same for the user
interface, which is in sbin/ipfw/ipfw2.c (it still compiles to
/sbin/ipfw). The old files are still there, and will be removed
in due time.

I have not renamed the header file because it would have required
making a one-line change to a number of kernel files.

In terms of user interface, the new "ipfw" is supposed to accept
the old syntax for ipfw rules (and produce the same output with
"ipfw show"). Only a couple of the old options (out of some 30 of
them) have not been implemented, but they will be soon.

On the other hand, the new code has some very powerful extensions.
First, you can put "or" connectives between match fields (and soon
also between options), and write things like

ipfw add allow ip from { 1.2.3.4/27 or 5.6.7.8/30 } 10-23,25,1024-3000 to any

This should make rulesets slightly more compact (and lines longer!),
by condensing 2 or more of the old rules into single ones.

Also, as an example of how easy the rules can be extended, I have
implemented an 'address set' match pattern, where you can specify
an IP address in a format like this:

10.20.30.0/26{18,44,33,22,9}

which will match the set of hosts listed in braces belonging to the
subnet 10.20.30.0/26 . The match is done using a bitmap, so it is
essentially a constant time operation requiring a handful of CPU
instructions (and a very small amount of memory -- for a full /24
subnet, the instruction only consumes 40 bytes).
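
The bitmap idea can be sketched as below. The structure layout and function names are hypothetical and are not the real ipfw instruction format; addresses are in host byte order, and the sketch only handles prefixes of /24 or longer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, not the real ipfw instruction format. */
struct addr_set {
	uint32_t base;		/* network address, host byte order */
	uint32_t nhosts;	/* number of addresses in the subnet */
	uint32_t bits[8];	/* one bit per host; 256 bits cover a /24 */
};

static void
addr_set_init(struct addr_set *s, uint32_t base, int prefixlen)
{
	assert(prefixlen >= 24 && prefixlen <= 32);
	memset(s, 0, sizeof(*s));
	s->base = base;
	s->nhosts = 1u << (32 - prefixlen);
}

static void
addr_set_add(struct addr_set *s, uint32_t host)
{
	assert(host < s->nhosts);
	s->bits[host / 32] |= 1u << (host % 32);
}

static int
addr_set_match(const struct addr_set *s, uint32_t addr)
{
	/* Unsigned subtraction wraps to a large value if addr < base. */
	uint32_t off = addr - s->base;

	if (off >= s->nhosts)
		return (0);
	return ((s->bits[off / 32] >> (off % 32)) & 1);
}
```

For 10.20.30.0/26{18,44,33,22,9} the base is 0x0A141E00 and five bits are set. A lookup is one subtraction, one bounds check and one bit test regardless of how many hosts are listed. On common ABIs this hypothetical structure is 8 bytes of header plus a 32-byte bitmap, which happens to agree with the 40-byte figure quoted for a full /24.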

Again, in this commit I have focused on functionality and tried
to minimize changes to the other parts of the system. Some performance
improvement can be achieved with minor changes to the interface of
ip_fw_chk_t. This will be done later when this code is settled.

The code is meant to compile unmodified on RELENG_4 (once the
PACKET_TAG_* changes have been merged), for this reason
you will see #ifdef __FreeBSD_version in a couple of places.
This should minimize errors when (hopefully soon) it will be time
to do the MFC.


98904 27-Jun-2002 mux

Warning fixes for 64 bits platforms. With this last fix,
I can build a GENERIC sparc64 kernel with -Werror.

Reviewed by: luigi


98894 26-Jun-2002 luigi

Just a comment on some additional consistency checks that could
be added here.


98849 26-Jun-2002 ken

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c: Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c: Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object_allocate() does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structure.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


98781 24-Jun-2002 hsu

Avoid unlocking the inp twice if badport_bandlim() returns -1.

Reported by: jlemon


98769 24-Jun-2002 hsu

Style bug: fix 4 space indentations that should have been tabs.

Submitted by: jlemon


98704 23-Jun-2002 luigi

Slightly restructure the #ifdef INET6 sections to make the code
more readable.

Remove the six "register" attributes from variables in tcp_output(); the
compiler surely knows well how to allocate them.


98703 23-Jun-2002 luigi

Move two global variables to automatic variables within the
only function where they are used (they are used with TCPDEBUG only).


98701 23-Jun-2002 luigi

Move some global variables in more appropriate places.

Add XXX comments to mark places which need to be taken care of
if we want to remove this part of the kernel from Giant.

Add a comment on a potential performance problem with ip_forward()


98666 23-Jun-2002 luigi

fix bad indentation and whitespace resulting from cut&paste


98665 23-Jun-2002 luigi

fix indentation of a comment


98664 23-Jun-2002 luigi

fix a typo in a comment


98663 23-Jun-2002 luigi

Remove ip_fw_fwd_addr (forgotten in previous commit)
remove some extra whitespace.


98613 22-Jun-2002 luigi

Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.

The variables removed by this change are:

ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet

Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().

On passing, remove a bug in divert handling of fragmented packets.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.

Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.

option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.

NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.

* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor style cleanup will likely be
necessary.

* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.

* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).

MFC after: 10 days


98598 21-Jun-2002 hsu

Fix logic which resulted in missing a call to INP_UNLOCK().

Submitted by: jlemon, mux


98596 21-Jun-2002 hsu

TCP notify functions can change the pcb list.


98459 20-Jun-2002 peter

Solve the 'unregistered netisr 18' information notice with a sledgehammer.
Register the ISR early, but do not actually kick off the timer until we
see some activity. This still saves us from running the arp timers on
a system with no network cards.


98385 18-Jun-2002 tanimura

Remove so*_locked(), which were backed out by mistake.


98211 14-Jun-2002 hsu

Notify functions can destroy the pcb, so they have to return an
indication of whether this happened so the calling function
knows whether or not to unlock the pcb.

Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)
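
The calling convention described above can be illustrated with a toy model. Everything here is hypothetical: struct pcb stands in for the real control block, the lock is a plain flag, and returning NULL is just one possible way for a notify routine to signal that it destroyed the pcb.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a protocol control block; not the kernel's struct. */
struct pcb {
	int locked;
	int dead;
};

/*
 * Hypothetical notify routine.  Returns the pcb if it survived, or NULL
 * if the routine tore it down (and thereby released its lock).
 */
static struct pcb *
notify_drop_on_error(struct pcb *p, int error)
{
	assert(p->locked);
	if (error != 0) {
		p->locked = 0;
		p->dead = 1;
		return (NULL);
	}
	return (p);
}

static void
deliver(struct pcb *p, int error)
{
	p->locked = 1;
	p = notify_drop_on_error(p, error);
	if (p != NULL)
		p->locked = 0;	/* unlock only if the pcb still exists */
}
```

The caller never touches the pcb after a NULL return, which is the property that avoids unlocking a control block that no longer exists.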


98204 14-Jun-2002 silby

Re-commit w/fix:

Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.

PR: 39141
MFC after: 2 weeks

This time, make sure that ipv4 specific code (aka all of the above)
is only run in the ipv4 case.


98203 14-Jun-2002 silby

Back out ip_tos/ip_ttl/DF "fix", it just panic'd my box. :)

Pointy-hat to: silby


98202 14-Jun-2002 silby

Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.

PR: 39141
MFC after: 2 weeks


98191 13-Jun-2002 hsu

Because we're holding an exclusive write lock on the head, references to
the new inp cannot leak out even though it has been placed on the head list.


98147 12-Jun-2002 hsu

The UDP head was unlocked too early in one unicast case.

Submitted by: bug reported by arr


98135 12-Jun-2002 hsu

Fix logic which resulted in missing a call to INP_UNLOCK().


98134 12-Jun-2002 hsu

Fix typo where INP_INFO_RLOCK should be INP_INFO_RUNLOCK.
Submitted by: tegge, jlemon

Prefer LIST_FOREACH macro.
Submitted by: jlemon


98115 11-Jun-2002 hsu

Remember to initialize the control block head mutex.


98114 11-Jun-2002 hsu

Fix typo.

Submitted by: Kyunghwan Kim <redjade@atropos.snu.ac.kr>


98108 10-Jun-2002 hsu

Every array elt is initialized in the following loop, so remove
unnecessary M_ZERO.


98102 10-Jun-2002 hsu

Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>


97658 31-May-2002 tanimura

Back out my last commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


97627 30-May-2002 wollman

Avoid unintentional trigraph.


97074 21-May-2002 arr

- Change the newly turned INVARIANTS #ifdef blocks (they were changed from
DIAGNOSTIC yesterday) into KASSERT()'s as these help to increase code
readability.


97020 20-May-2002 arr

- Turn a few DIAGNOSTIC into INVARIANTS since they are really sanity
checks.


97019 20-May-2002 arr

- Turn a DIAGNOSTIC into an INVARIANTS since it's a sanity check. Use
proper ``if'' statement style.


97018 20-May-2002 arr

- Turn a #ifdef DIAGNOSTIC to #ifdef INVARIANTS as the code from this line
through the #endif is really a sanity check.

Reviewed by: jake


96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each member in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


96624 15-May-2002 kbyanc

Reset token-ring source routing control field on receipt of ethernet frame
without source routing information. This restores the behaviour in this
scenario to that of prior to my last commit.


96602 14-May-2002 rwatson

Modify the arguments to syncache_socket() to include the mbuf (m) that
results in the syncache entry being turned into a socket. While it's
not used in the main tree, this is required in the MAC tree so that
labels can be propagated from the mbuf to the socket. This is also
useful if you're doing things like transparent IP connection hijacking
and you want to use the syncache/cookie mechanism, but we won't go
there.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


96511 13-May-2002 luigi

Add ipfw hooks to ether_demux() and ether_output_frame().
Ipfw processing of frames at layer 2 can be enabled by the sysctl variable

net.link.ether.ipfw=1

Consider this feature experimental, because right now, the firewall
is invoked in the places indicated below, and controlled by the
sysctl variables listed on the right. As a consequence, a packet
can be filtered from 1 to 4 times depending on the path it follows,
which might make a ruleset a bit hard to follow.

I will add an ipfw option to tell if we want a given rule to apply
to ether_demux() and ether_output_frame(), but we have run out of
flags in the struct ip_fw so i need to think a bit on how to implement
this.

to upper layers
| |
+----------->-----------+
^ V
[ip_input] [ip_output] net.inet.ip.fw.enable=1
| |
^ V
[ether_demux] [ether_output_frame] net.link.ether.ipfw=1
| |
+->- [bdg_forward]-->---+ net.link.ether.bridge_ipfw=1
^ V
| |
to devices


96509 13-May-2002 luigi

Remove custom definitions (IP_FW_TCPF_SYN etc.) of TCP header flags
which are the same as the original ones (TH_SYN etc.)


96474 12-May-2002 luigi

Add code to match MAC header fields (at the moment supported on
bridged packets only, soon to come also for packets on ordinary
ether_input() and ether_output() paths). The syntax is

ipfw add <action> MAC dst src type

where dst and src can be "any" or a MAC address optionally followed
by a mask, e.g.

10:20:30:40:50
10:20:30:40:50/32
10:20:30:40:50&ff:ff:ff:f0:ff:0f

and type can be a single ethernet type, a range, or a type followed by
a mask (values are always in hexadecimal) e.g.

0800
0800-0806
0800/8
0800&03ff

Note, I am still uncertain on what is the best format for inputting
these values, having the values in hexadecimal is convenient in most
cases but can be confusing sometimes. Suggestions welcome.
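
A rough sketch of how the three forms of the type field could be evaluated. The names are hypothetical and the semantics are assumptions: "0800/8" is read here as a prefix over the top 8 bits, and "0800&03ff" as a comparison under the given bit mask. The message itself does not spell these out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical matcher forms for an Ethernet type field. */
enum etype_kind { ETYPE_SINGLE, ETYPE_RANGE, ETYPE_MASK };

struct etype_match {
	enum etype_kind kind;
	uint16_t a;	/* value, or low end of the range */
	uint16_t b;	/* high end of the range, or bit mask */
};

/* Turn a prefix length such as the 8 in "0800/8" into a 16-bit mask. */
static uint16_t
etype_prefix_mask(int bits)
{
	assert(bits >= 0 && bits <= 16);
	return ((uint16_t)(0xffffu << (16 - bits)));
}

static int
etype_matches(const struct etype_match *m, uint16_t type)
{
	switch (m->kind) {
	case ETYPE_SINGLE:
		return (type == m->a);
	case ETYPE_RANGE:
		return (type >= m->a && type <= m->b);
	case ETYPE_MASK:
		/* Compare only the bits selected by the mask. */
		return ((type & m->b) == (m->a & m->b));
	}
	return (0);
}
```

Under these assumptions "0800-0806" accepts both IPv4 (0800) and ARP (0806) frames, and "0800/8" accepts any type whose high byte is 08.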

Implement suggestion from PR 37778 to allow "not me" on destination
and source IP. The code in the PR was slightly wrong and interfered
with the normal handling of IP addresses. This version hopefully is
correct.

Minor cleanup of the code, in some places moving the indentation to 4
spaces because the code was becoming too deep. Eventually, in a
separate commit, I will move the whole file to 4 space indent.


96432 12-May-2002 dd

s/demon/daemon/


96431 11-May-2002 mike

Remove some duplicate types that should have been removed as part of
the rearranging in the previous revision.

Pointy hat to: cvs update (merging), mike (for not noticing)


96245 09-May-2002 luigi

Cleanup the interface to ip_fw_chk, two of the input arguments
were totally useless and have been removed.

ip_input.c, ip_output.c:
Properly initialize the "ip" pointer in case the firewall does an
m_pullup() on the packet.

Remove some debugging code forgotten long ago.

ip_fw.[ch], bridge.c:
Prepare the grounds for matching MAC header fields in bridged packets,
so we can have 'etherfw' functionality without a lot of kernel and
userland bloat.


96184 07-May-2002 kbyanc

Move ISO88025 source routing information into sockaddr_dl's sdl_data
field. This returns the sdl_data field to a variable-length field. More
importantly, this prevents an easily reproducible data-corruption bug when
the interface name plus the hardware address exceed the sdl_data field's
original 12 byte limit. However, token-ring interfaces may still overflow
the new sdl_data field's 46 byte limit if the interface name exceeds 6
characters (since 6 characters for interface name plus 6 for hardware
address plus 34 for source routing = the size of sdl_data). Further
refinements could overcome this limitation but would break binary
compatibility; this commit only addresses fixing the bug for
commonly occurring cases without breaking binary compatibility with the
intention that the functionality can be MFC'ed to -stable.

See message ID's (both sent to -arch):
20020421013332.F87395-100000@gateway.posi.net
20020430181359.G11009-300000@gateway.posi.net
for a more thorough description of the bug addressed and how to
reproduce it.

Approved by: silence on -arch and -net
Sponsored by: NTT Multimedia Communications Labs
MFC after: 1 week


96116 06-May-2002 ume

Revised MLD-related definitions
- Used mld_xxx and MLD_xxx instead of mld6_xxx and MLD6_xxx according
to the official definitions in rfc2292bis
(macro definitions for backward compatibility were provided)
- Changed the first member of mld_hdr{} from mld_hdr to mld_icmp6_hdr
to avoid name space conflict in C++

This change makes ports/net/pchar compilable again under -CURRENT.

Obtained from: KAME


96077 05-May-2002 luigi

Indentation and comments cleanup, no functional change.

MFC after: 3 days


95883 01-May-2002 alfred

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


95867 01-May-2002 alfred

Fix some edge cases where bad string handling could occur.

Submitted by: ps


95865 01-May-2002 alfred

cleanup:
fix line wraps, add some comments, fix macro definitions, fix for(;;) loops.


95858 01-May-2002 cjc

Enlighten those who read the FINE POINTS of the documentation a bit
more on how ipfw(8) deals with tiny fragments. While we're at it, add
a quick log message to even let people know we dropped a packet. (Note
that the second FINE POINT is somewhat redundant given the first, but
since the code is there, leave the docs for it.)

MFC after: 1 day


95759 30-Apr-2002 tanimura

Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.


95552 27-Apr-2002 tanimura

Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.


95336 24-Apr-2002 mike

Rearrange <netinet/in.h> so that it is easier to conditionalize
sections for various standards. Conditionalize sections for various
standards. Use standards conforming spelling for types in the
sockaddr_in structure.


95099 20-Apr-2002 mike

Add sa_family_t type to <sys/_types.h> and typedefs to <netinet/in.h>
and <sys/socket.h>. Previously, sa_family_t was only typedef'd in
<sys/socket.h>.


95023 19-Apr-2002 suz

just merged cosmetic changes from KAME to ease sync between KAME and FreeBSD.
(based on freebsd4-snap-20020128)

Reviewed by: ume
MFC after: 1 week


94394 11-Apr-2002 suz

initialize local variable explicitly
Reviewed by: ume
Obtained from: Fujitsu guys
MFC after: 1 week


94390 10-Apr-2002 silby

Remove some ISN generation code which has been unused since the
syncache went in.

MFC after: 3 days


94379 10-Apr-2002 silby

Totally nuke IPPORT_USERRESERVED, it is no longer used anywhere, update
remaining comments to reflect new ephemeral port range.

Reminded by: Maxim Konovalov <maxim@macomnet.ru>
MFC after: 3 days


94357 10-Apr-2002 mike

Unconditionalize the definition of INET_ADDRSTRLEN and
INET6_ADDRSTRLEN. Doing this helps expose bogus redefinitions in 3rd
party software.


94327 10-Apr-2002 brian

Remove the code that masks an EEXIST returned from rtinit() when
calling ioctl(SIOC[AS]IFADDR).

This allows the following:

ifconfig xx0 inet 1.2.3.1 netmask 0xffffff00
ifconfig xx0 inet 1.2.3.17 netmask 0xfffffff0 alias
ifconfig xx0 inet 1.2.3.25 netmask 0xfffffff8 alias
ifconfig xx0 inet 1.2.3.26 netmask 0xffffffff alias

but would (given the above) reject this:

ifconfig xx0 inet 1.2.3.27 netmask 0xfffffff8 alias

due to the conflicting netmasks. I would assert that it's wrong
to mask the EEXIST returned from rtinit() as in the above scenario, the
deletion of the 1.2.3.25 address will leave the 1.2.3.27 address
as unroutable as it was in the first place.

Offered for review on: -arch, -net
Discussed with: stephen macmanus <stephenm@bayarea.net>
MFC after: 3 weeks
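
Why 1.2.3.27 with netmask 0xfffffff8 collides while the other aliases do not can be checked with a little mask arithmetic. This is a simplified model with hypothetical names: it treats the network route implied by an interface address as the pair (address AND mask, mask) and calls two routes conflicting when both parts are equal. The kernel's routing table is more involved than this.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical route key: the network prefix an interface address implies. */
struct net_route {
	uint32_t net;
	uint32_t mask;
};

/* Build a host-order IPv4 address from its four octets. */
static uint32_t
ipv4(unsigned a, unsigned b, unsigned c, unsigned d)
{
	assert(a < 256 && b < 256 && c < 256 && d < 256);
	return ((uint32_t)a << 24 | (uint32_t)b << 16 | (uint32_t)c << 8 | d);
}

static struct net_route
route_for(uint32_t addr, uint32_t mask)
{
	struct net_route r;

	/* Require a contiguous netmask (ones followed by zeros). */
	assert((mask & (~mask >> 1)) == 0);
	r.net = addr & mask;
	r.mask = mask;
	return (r);
}

static int
route_conflicts(struct net_route a, struct net_route b)
{
	return (a.net == b.net && a.mask == b.mask);
}
```

Both 1.2.3.25 and 1.2.3.27 mask down to 1.2.3.24 under 0xfffffff8, so in this model the second alias asks for a route that already exists. The /24, /28 and /32 aliases each yield a distinct (network, mask) pair and coexist.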


94326 10-Apr-2002 brian

Don't add host routes for interface addresses of 0.0.0.0/8 -> 0.255.255.255.

This change allows bootp to work with more than one interface, at the
expense of some rather ``wrong'' looking code. I plan to MFC this in
place of luigi's recent #ifdef BOOTP stuff that was committed to this
file in -stable, as that's slightly more wrong than this is.

Offered for review on: -arch, -net
MFC after: 2 weeks
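
The address test implied above amounts to checking the top octet. A minimal sketch, with hypothetical helper names and addresses in host byte order; it is not the code from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its four octets. */
static uint32_t
ipv4(unsigned a, unsigned b, unsigned c, unsigned d)
{
	assert(a < 256 && b < 256 && c < 256 && d < 256);
	return ((uint32_t)a << 24 | (uint32_t)b << 16 | (uint32_t)c << 8 | d);
}

/* True for any address in 0.0.0.0/8, i.e. 0.0.0.0 through 0.255.255.255. */
static int
in_zeronet(uint32_t addr)
{
	return ((addr & 0xff000000u) == 0);
}
```

An interface still waiting for its bootp reply would typically carry such an address, which is why skipping the host route for it lets several interfaces be probed at once.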


94304 09-Apr-2002 jhb

Change the first argument of prison_xinpcb() to be a thread pointer instead
of a proc pointer so that prison_xinpcb() can use td_ucred.


94291 09-Apr-2002 silby

Update comments to reflect the recent ephemeral port range
change.

Noticed by: ru
MFC After: 1 day


93904 05-Apr-2002 mdodd

Retire this copy; it now lives in sys/net/fddi.h.


93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


93514 01-Apr-2002 mike

o Implement <sys/_types.h>, a new header for storing types that are
MI, not required to be a fixed size, and used in multiple headers.
This will grow in time, as more things move here from <sys/types.h>
and <machine/ansi.h>.
o Add missing type definitions (uint16_t and uint32_t) to
<arpa/inet.h> and <netinet/in.h>.
o Reduce pollution in <sys/types.h> by using `#if _FOO_T_DECLARED'
widgets to avoid including <sys/stdint.h>.
o Add some missing type definitions to <unistd.h> and note the ones
that still need to be added.
o Make use of <sys/_types.h> primitives in <grp.h> and <sys/types.h>.

Reviewed by: bde


93085 24-Mar-2002 bde

Fixed some style bugs in the removal of __P(()). Continuation lines
were not outdented to preserve non-KNF lining up of code with parentheses.
Switch to KNF formatting.


92976 22-Mar-2002 rwatson

Merge from TrustedBSD MAC branch:

Move the network code from using cr_cansee() to check whether a
socket is visible to a requesting credential to using a new
function, cr_canseesocket(), which accepts a subject credential
and object socket. Implement cr_canseesocket() so that it does a
prison check, a uid check, and add a comment where shortly a MAC
hook will go. This will allow MAC policies to separately
instrument the visibility of sockets from the visibility of
processes.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


92960 22-Mar-2002 ru

Prevent icmp_reflect() from calling ip_output() with a NULL route
pointer which will then result in the allocated route's reference
count never being decremented. Just flood ping the localhost and
watch refcnt of the 127.0.0.1 route with netstat(1).

Submitted by: jayanth

Back out ip_output.c,v 1.143 and ip_mroute.c,v 1.69 that allowed
ip_output() to be called with a NULL route pointer. The previous
paragraph shows why this was a bad idea in the first place.

MFC after: 0 days


92926 22-Mar-2002 silby

Change the ephemeral port range from 1024-5000 to 49152-65535.
This increases the number of concurrent outgoing connections from ~4000
to ~16000. Other OSes (Solaris, OS X, NetBSD) and many other NAT
products have already made this change without ill effects, so we
should not run into any problems.

MFC after: 1 week
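
The sizes behind the "~4000" and "~16000" figures can be checked with a
short calculation. This is a minimal sketch only; the helper name
port_range_size() is hypothetical and is not part of the kernel.

```c
#include <stdint.h>

/* Hypothetical helper: number of ports in the inclusive range [first, last]. */
static uint32_t
port_range_size(uint16_t first, uint16_t last)
{
	return ((uint32_t)last - (uint32_t)first + 1);
}
```

port_range_size(1024, 5000) yields 3977 and port_range_size(49152, 65535)
yields 16384, which matches the rounded figures in the message.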


92802 20-Mar-2002 orion

Send periodic ARP requests when ARP entries for hosts we are sending
to are about to expire. This prevents high packet rate flows from
experiencing packet drops at the sender following ARP cache entry
timeout.

PR: kern/25517
Reviewed by: luigi
MFC after: 7 days


92760 20-Mar-2002 jeff

Switch vm_zone.h with uma.h. Change over to uma interfaces.


92723 19-Mar-2002 alfred

Remove __P.


92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


92275 14-Mar-2002 rwatson

NAI DBA update


91984 10-Mar-2002 mike

o Add INET_ADDRSTRLEN and INET6_ADDRSTRLEN defines to <arpa/inet.h>
for POSIX.1-2001 conformance.
o Add magic to <netinet/in.h> and <netinet6/in6.h> to prevent
redefining INET_ADDRSTRLEN and INET6_ADDRSTRLEN.
o Add a note about missing typedefs in <arpa/inet.h>.
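
As an illustration of what INET_ADDRSTRLEN is for, the sketch below sizes
a buffer for inet_ntop(3). The wrapper name format_ipv4() is an assumption
made for this example and does not come from the commit.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Format an IPv4 address; buf must hold at least INET_ADDRSTRLEN bytes. */
static const char *
format_ipv4(uint32_t addr_host, char *buf)
{
	struct in_addr in;

	in.s_addr = htonl(addr_host);	/* inet_ntop() expects network order */
	return (inet_ntop(AF_INET, &in, buf, INET_ADDRSTRLEN));
}
```

A caller would declare "char buf[INET_ADDRSTRLEN];" and pass it in, so the
buffer is always large enough for the longest dotted-quad string plus the
terminating NUL.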


91959 09-Mar-2002 mike

o Don't require long long support in bswap64() functions.
o In i386's <machine/endian.h>, macros have some advantages over
inlines, so change some inlines to macros.
o In i386's <machine/endian.h>, ungarbage collect word_swap_int()
(previously __uint16_swap_uint32), it has some uses on i386's with
PDP endianness.

Submitted by: bde

o Move a comment up in <machine/endian.h> that was accidentally moved
down a few revisions ago.
o Reenable userland's use of optimized inline-asm versions of
byteorder(3) functions.
o Fix ordering of prototypes vs. redefinition of byteorder(3)
functions, so that the non-GCC (libc asm) case has proper
prototypes.
o Add proper prototypes for byteorder(3) functions in <sys/param.h>.
o Prevent redundant duplicate prototypes by making use of the
_BYTEORDER_PROTOTYPED define.
o Move the bswap16(), bswap32(), bswap64() C functions into MD space
for platforms in which asm versions don't exist. This significantly
reduces the complexity of some things at the cost of duplicate code.

Reviewed by: bde


91492 28-Feb-2002 ume

- Set inc_isipv6 in tcp6_usr_connect().
- When making a pcb from a sync cache, do not forget to copy inc_isipv6.

Obtained from: KAME
MFC After: 1 week


91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


91374 27-Feb-2002 cjc

Change the wording of the inline comments from the previous commit.

Objection from: ru


91357 27-Feb-2002 alfred

More IPV6 const fixes.


91354 27-Feb-2002 dd

Introduce a version field to `struct xucred' in place of one of the
spares (the size of the field was changed from u_short to u_int to
reflect what it really ends up being). Accordingly, change users of
xucred to set and check this field as appropriate. In the kernel,
this is being done inside the new cru2x() routine which takes a
`struct ucred' and fills out a `struct xucred' according to the
former. This also has the pleasant side effect of removing some
duplicate code.

Reviewed by: rwatson


91324 26-Feb-2002 brooks

Staticize an extern that no one else used.


91271 26-Feb-2002 jedgar

Enforce inbound IPsec SPD

Reviewed by: fenner


91236 25-Feb-2002 alfred

Document what inpcb->inp_vflag is for.

Submitted by: Marco Molteni <molter@tin.it>


91234 25-Feb-2002 cjc

The TCP code did not do sufficient checks on whether incoming packets
were destined for a broadcast IP address. All TCP packets with a
broadcast destination must be ignored. The system only ignored packets
that were _link-layer_ broadcasts or multicast. We need to check the
IP address too since it is quite possible for a broadcast IP address
to come in with a unicast link-layer address.

Note that the check existed prior to CSRG revision 7.35, but was
removed. This commit effectively backs out that nine-year-old change.

PR: misc/35022
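
The kind of IP-level test described above can be sketched as follows.
This is not the kernel's code: the function name and its arguments are
hypothetical, addresses are in host byte order, and only a single
interface address is considered. /31 and /32 masks are excluded because
they have no broadcast address.

```c
#include <stdint.h>

/*
 * Hypothetical check: is dst a broadcast address as seen from an
 * interface configured with ifaddr/netmask?
 */
static int
is_ipv4_broadcast(uint32_t dst, uint32_t ifaddr, uint32_t netmask)
{
	if (dst == 0xffffffffu)			/* limited broadcast */
		return (1);
	if (netmask >= 0xfffffffeu)		/* /31 and /32: none */
		return (0);
	/* Directed broadcast: same network, all host bits set. */
	return ((dst & netmask) == (ifaddr & netmask) &&
	    (dst | netmask) == 0xffffffffu);
}
```

The point of the commit is that a test like this must be applied to the
IP destination itself, since the link-layer header may carry a unicast
address even when the IP destination is a broadcast.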


90988 20-Feb-2002 luigi

BUGFIX: make use of the pointer to the target of skipto rules,
so that after the first time we can follow the pointer instead
of having to scan the list.
This was the intended behaviour from day one.

PR: 34639
MFC-after: 3 days


90982 20-Feb-2002 jlemon

When expanding a syncache entry into a socket, inherit the socket options
from the current listen socket instead of the cached (and possibly stale)
TCB pointer.


90868 18-Feb-2002 mike

o Move NTOHL() and associated macros into <sys/param.h>. These are
deprecated in favor of the POSIX-defined lowercase variants.
o Change all occurrences of NTOHL() and associated macros in the
source tree to use the lowercase function variants.
o Add missing license bits to sparc64's <machine/endian.h>.
Approved by: jake
o Clean up <machine/endian.h> files.
o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>.
o Remove prototypes for non-existent bswapXX() functions.
o Include <machine/endian.h> in <arpa/inet.h> to define the
POSIX-required ntohl() family of functions.
o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>,
and <sys/param.h>.
o Prepend underscores to the ntohl() family to help deal with
complexities associated with having MD (asm and inline) versions, and
having to prevent exposure of these functions in other headers that
happen to make use of endian-specific defines.
o Create weak aliases to the canonical function name to help deal with
third-party software forgetting to include an appropriate header.
o Remove some now unneeded pollution from <sys/types.h>.
o Add missing <arpa/inet.h> includes in userland.

Tested on: alpha, i386
Reviewed by: bde, jake, tmm


90698 15-Feb-2002 ru

Moved the 127/8 check below so that IPF redirects have a chance of working.

MFC after: 1 day


90556 12-Feb-2002 jlemon

When a duplicate SYN arrives which matches an entry in the syncache,
update our lazy reference to the inpcb structure, as it may have changed.

Found by: dima


90493 10-Feb-2002 dd

Silence unused variable warning in the !KLD_MODULE case.

Submitted by: archie


90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
This is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embedded there,
but remove all the code that assumes that in preparation for the next commit,
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


90198 04-Feb-2002 ume

In tcp_respond(), correctly reset returned IPv6 header. This is essential
when the original packet contains an IPv6 extension header.

Obtained from: KAME
MFC after: 1 week


90137 03-Feb-2002 markm

WARNS=n and lint(1) silencer. Declare an array of (const) strings
as const char.


89809 26-Jan-2002 cjc

The ipfw(8) 'tee' action simply hasn't worked on incoming packets for
some time. _All_ packets, regardless of destination, were accepted by
the machine as if addressed to it.

Jump back to 'pass' processing for a teed packet instead of falling
through as if it was ours.

PR: kern/31130
Reviewed by: -net, luigi
MFC after: 2 weeks


89667 22-Jan-2002 jlemon

The ENDPTS_EQ macro was comparing one of the fports to itself. Fix.

Submitted by: emy@boostworks.com
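
To show the class of bug, here is a sketch of a field-by-field comparison
macro. The structure layout and the name ENDPTS_EQ_SKETCH are assumptions
for illustration; the real kernel definitions differ.

```c
#include <stdint.h>

/* Hypothetical endpoint pair; the real structure in the kernel differs. */
struct endpts {
	uint32_t faddr;		/* foreign address */
	uint32_t laddr;		/* local address */
	uint16_t fport;		/* foreign port */
	uint16_t lport;		/* local port */
};

/* Every field of (a) is compared against the same field of (b). */
#define	ENDPTS_EQ_SKETCH(a, b) (			\
	(a)->faddr == (b)->faddr &&			\
	(a)->laddr == (b)->laddr &&			\
	(a)->fport == (b)->fport &&			\
	(a)->lport == (b)->lport)
```

Writing "(a)->fport == (a)->fport" by mistake makes that term always
true, so two endpoint pairs that differ only in foreign port would
compare equal.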


89624 21-Jan-2002 ume

- Check the address family of the destination cached in a PCB.
- Clear the cached destination before getting another cached route.
Otherwise, garbage in the padding space (which might be filled in if it was
used for IPv4) could annoy rtalloc.

Obtained from: KAME


89614 21-Jan-2002 ru

RFC1122 requires that addresses of the form { 127, <any> } MUST NOT
appear outside a host.

PR: 30792, 33996
Obtained from: ip_input.c
MFC after: 1 week
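
The RFC1122 rule amounts to testing the top octet of the address. A
minimal sketch, assuming a hypothetical helper name and an address in
network byte order:

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Returns nonzero if addr_net (network byte order) is in 127.0.0.0/8. */
static int
is_loopback_net(uint32_t addr_net)
{
	return ((ntohl(addr_net) >> 24) == 127);
}
```

A packet whose source or destination matches this test should not be
seen on a non-loopback interface.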


89253 11-Jan-2002 ru

Fix a panic condition in icmp_reflect() introduced in rev. 1.61.
(We should be able to handle locally originated IP packets, and
these do not have m_pkthdr.rcvif set.)

PR: kern/32806, kern/33766
Reviewed by: luigi
Fix tested by: Maxim Konovalov <maxim@macomnet.ru>,
Erwin Lansing <erwin@lansing.dk>


89069 08-Jan-2002 msmith

Initialise the intrq_present fields at runtime, not link time. This allows
us to load protocols at runtime, and avoids the use of common variables.

Also fix the ip6_intrq assignment so that it works at all.


88991 07-Jan-2002 cjc

Fix a missing "ipfw:" in a syslog message.

MFC after: 1 day


88931 05-Jan-2002 fenner

Pre-calculate the checksum for multicast packets sourced on a
multicast router. This is overkill; it should be possible to
delay to hardware interfaces and only pre-calculate when forwarding
to a tunnel.


88884 04-Jan-2002 rwatson

o Spelling fix in comment: tcp_ouput -> tcp_output


88665 29-Dec-2001 yar

Don't reveal a router in the IPSTEALTH mode through IP options.
The following steps are involved:
a) the IP options related to routing (LSRR and SSRR) are processed
as though the router were a host,
b) the other IP options are processed as usual only if the packet
is destined for the router; otherwise they are ignored.

PR: kern/23123
Discussed in: freebsd-hackers


88593 28-Dec-2001 julian

Fix ipfw fwd so that it acts as the docs say
when forwarding an incoming packet to another machine.

Obtained from: Vicor Production tree
MFC after: 3 weeks


88359 21-Dec-2001 yar

Implement matching IP precedence in ipfw(4).

Submitted by: Igor Timkin <ivt@gamma.ru>


88331 21-Dec-2001 jlemon

Remove a change that snuck in from my private tree.


88330 21-Dec-2001 jlemon

If syncookies are disabled (net.inet.tcp.syncookies) then use the faster
arc4random() routine to generate ISNs instead of creating them with MD5().

Suggested by: silby


88195 19-Dec-2001 jlemon

When storing an int value in a void *, use intptr_t as the cast type
(instead of int) to keep the 64 bit platforms happy.
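
A sketch of the idiom, with hypothetical helper names. On platforms where
pointers are 64 bits wide and int is 32 bits, casting through intptr_t
avoids the size-mismatch warning that a direct int cast produces.

```c
#include <stdint.h>

/* Store an int in a pointer-sized slot by widening through intptr_t. */
static void *
int_to_ptr(int value)
{
	return ((void *)(intptr_t)value);
}

/* Recover the int that int_to_ptr() stored. */
static int
ptr_to_int(void *p)
{
	return ((int)(intptr_t)p);
}
```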


88190 19-Dec-2001 yar

Don't try to free a NULL route when doing IPFIREWALL_FORWARD.
An old route will be NULL at that point if a packet were initially
routed to an interface (using the IP_ROUTETOIF flag.)

Submitted by: Igor Timkin <ivt@gamma.ru>


88180 19-Dec-2001 jlemon

Extend the SYN DoS defense by adding syncookies to the syncache.
All TCP ISNs that are sent out are valid cookies, which allows entries
in the syncache to be dropped and still have the ACK accepted later.
As all entries pass through the syncache, there is no sudden switchover
from cache -> cookies when the cache is full; instead, syncache entries
simply have a reduced lifetime. More details may be found in the
"Resisting DoS attacks with a SYN cache" paper in the Usenix BSDCon 2002
conference proceedings.

Sponsored by: DARPA, NAI Labs


88132 18-Dec-2001 ru

Fixed the bug in transparent TCP proxying with the "encode_ip_hdr"
option -- TcpAliasOut() did not catch the IP header length change.

Submitted by: Stepachev Andrey <aka50@mail.ru>


87919 14-Dec-2001 rwatson

o Add IPOPT_ESO for the 'Extended Security' IP option (RFC1108)

Obtained from: TrustedBSD Project


87917 14-Dec-2001 rwatson

o Add definition for IPOPT_CIPSO, the commercial security IP option
number.

Submitted by: Ilmar S. Habibulin <ilmar@watson.org>
Obtained from: TrustedBSD Project


87916 14-Dec-2001 jlemon

whitespace and style fixes recovered from -stable.


87915 14-Dec-2001 jlemon

minor style and whitespace fixes.


87914 14-Dec-2001 jlemon

whitespace fixes.


87913 14-Dec-2001 jlemon

minor whitespace fixes.


87903 14-Dec-2001 silby

Reduce the local network slowstart flightsize from infinity to 4 packets.

Now that we've increased the size of our send / receive buffers, bursting
an entire window onto the network may cause congestion. As a result,
we will slow start beginning with a flightsize of 4 packets.

Problem reported by: Thomas Zenker <thz@Lennartz-electronic.de>

MFC after: 3 days


87780 13-Dec-2001 jlemon

Undo one of my last minute changes; move sc_iss up earlier so it
is initialized in case we take the T/TCP path.


87779 13-Dec-2001 jlemon

Fix up tabs from cut-n-paste.


87778 13-Dec-2001 jlemon

Fix up tabs in comments.


87777 13-Dec-2001 jlemon

Minor style fixes.


87776 13-Dec-2001 jlemon

Minor style fix.


87599 10-Dec-2001 obrien

Update to C99, s/__FUNCTION__/__func__/,
also don't use ANSI string concatenation.


87499 07-Dec-2001 rwatson

o Our current userland boot code (due to rc.conf and rc.network) always
enables TCP keepalives using the net.inet.tcp.always_keepalive by default.
Synchronize the kernel default with the userland default.


87410 05-Dec-2001 ru

Fixed remotely exploitable DoS in arpresolve().

Easily exploitable by flood pinging the target
host over an interface with the IFF_NOARP flag
set (all you need to know is the target host's
MAC address).

MFC after: 0 days


87275 03-Dec-2001 rwatson

o Introduce pr_mtx into struct prison, providing protection for the
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protecting sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.

Reviewed by: jhb


87193 02-Dec-2001 dillon

Fix a bug with transmitter restart after receiving a 0 window. The
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.

Propagate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).

Some cleanup. Identify additional issues in comments.

MFC after: 1 day


87167 01-Dec-2001 ru

Allow for ip_output() to be called with a NULL route pointer.
This fixes a panic I introduced yesterday in ip_icmp.c,v 1.64.


87158 01-Dec-2001 mike

o Stop abusing MD headers with non-MD types.
o Hide nonstandard functions and types in <netinet/in.h> when
_POSIX_SOURCE is defined.
o Add some missing types (required by POSIX.1-200x) to <netinet/in.h>.
o Restore vendor ID from Rev 1.1 in <netinet/in.h> and make use of new
__FBSDID() macro.
o Fix some miscellaneous issues in <arpa/inet.h>.
o Correct final argument for the inet_ntop() function (POSIX.1-200x).
o Get rid of the namespace pollution from <sys/types.h> in
<arpa/inet.h>.

Reviewed by: fenner
Partially submitted by: bde


87145 30-Nov-2001 dillon

The transmit burst limit for newreno completely breaks TCP's performance
if the receive side is using delayed acks. Temporarily remove it.

MFC after: 0 days


87124 30-Nov-2001 brian

During SIOCAIFADDR, if in_ifinit() fails and we've already added an
interface address, blow the address away again before returning the
error.

In in_ifinit(), if we get an error from rtinit() and we've also got
a destination address, return the error rather than masking EEXIST.
Failing to create a host route when configuring an interface should
be treated as an error.


87120 30-Nov-2001 ru

- Make ip_rtaddr() global, and use it to look up the correct source
address in icmp_reflect().
- Two new "struct icmpstat" members: icps_badaddr and icps_noroute.

PR: kern/31575
Obtained from: BSD/OS
MFC after: 1 week


87003 27-Nov-2001 dd

ipfw_modevent(): Don't use an unnatural block to define a variable
(fcp) that's already defined in the outer block and isn't used
anywhere else. This silences -Wunused.

Reviewed by: md5(1)


87002 27-Nov-2001 dd

Remove debugging printfs that weren't conditional on any debugging
options in handling MOD_{UN,}LOAD (they weren't very useful, anyway).


86999 27-Nov-2001 dd

In icmp_reflect(): If the packet was not addressed to us and was
received on an interface without an IP address, try to find a
non-loopback AF_INET address to use. If that fails, drop it.
Previously, we used the address at the top of the in_ifaddrhead list,
which didn't make much sense, and would cause a panic if there were no
AF_INET addresses configured on the system.

PR: 29337, 30524
Reviewed by: ru, jlemon
Obtained from: NetBSD


86991 27-Nov-2001 rwatson

Add include of net/route.h, as structures moved around due to the
syncache rely on 'struct route' being defined. This fixes the
LINT build some.


86958 27-Nov-2001 tanimura

Clear a new syncache entry first, followed by filling in values. This
fixes route breakage due to uncleared garbage on my box.


86953 27-Nov-2001 ru

When servicing an internal FTP server, punch ipfirewall(4) holes
for passive mode data connections (PASV/EPSV -> 227/229). Well,
the actual punching happens a bit later, when the aliasing link
becomes fully specified.

Prodded by: Danny Carroll <dannycarroll@hotmail.com>
MFC after: 1 week


86910 26-Nov-2001 ru

Restore the ability to use IP_FW_ADD with setsockopt(2) that got
broken in revision 1.86. This broke natd(8)'s -punch_fw option.

Reported by: Daniel Rock <D.Rock@t-online.de>,
setantae <setantae@submonkey.net>


86814 23-Nov-2001 bde

Fixed a buffer overrun. In my kernel configuration, tcp_syncache happens
to be followed by nfsnodehashtbl, so bzeroing callouts beyond the end of
tcp_syncache soon caused a null pointer panic when nfsnodehashtbl was
accessed.


86764 22-Nov-2001 jlemon

Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs


86744 21-Nov-2001 jlemon

Move initialization of snd_recover into tcp_sendseqinit().


86487 17-Nov-2001 dillon

Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.


86183 08-Nov-2001 rwatson

o Replace reference to 'struct proc' with 'struct thread' in 'struct
sysctl_req', which describes in-progress sysctl requests. This permits
sysctl handlers to have access to the current thread, permitting work
on implementing td->td_ucred, migration of suser() to using struct
thread to derive the appropriate ucred, and allowing struct thread to be
passed down to other code, such as network code where td is not currently
available (and curproc is used).

o Note: netncp and netsmb are not updated to reflect this change, as they
are not currently KSE-adapted.

Reviewed by: julian
Obtained from: TrustedBSD Project


86117 06-Nov-2001 arr

- Fixes non-zero'd out sin_zero field problem so that the padding
is used as it is supposed to be.

Inspired by: PR #31704
Approved by: jdp
Reviewed by: jhb, -net@
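
A common way to guarantee that sin_zero is cleared is to zero the whole
structure before filling it in. The helper below is a hypothetical
userland sketch, not the code touched by this commit; it leaves the
BSD-specific sin_len field alone so that it stays portable.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Fill in an IPv4 socket address from host-order address and port. */
static void
fill_sockaddr_in(struct sockaddr_in *sin, uint32_t addr_host,
    uint16_t port_host)
{
	memset(sin, 0, sizeof(*sin));	/* also clears the sin_zero padding */
	sin->sin_family = AF_INET;
	sin->sin_port = htons(port_host);
	sin->sin_addr.s_addr = htonl(addr_host);
}
```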


86106 05-Nov-2001 phk

3.5 years ago Wollman wrote:
"[...] and removes the hostcache code from standard kernels---the
code that depends on it is not going to happen any time soon,
I'm afraid."
Time to clean up.


86047 04-Nov-2001 luigi

MFS: sync the ipfw/dummynet/bridge code with the one recently merged
into stable (mostly, but not only, formatting and comments changes).
into stable (mostly , but not only, formatting and comments changes).


86031 04-Nov-2001 luigi

s/FREE/free/


85964 03-Nov-2001 brian

cmott@scientech.com -> cm@linktel.net

Requested by: Charles Mott <cmott@scientech.com>


85741 30-Oct-2001 wpaul

Fix a (long standing?) bug in ip_output(): if ip_insertoptions() is
called and ip_output() encounters an error and bails (i.e. host
unreachable), we will leak an mbuf. This is because the code calls
m_freem(m0) after jumping to the bad: label at the end of the function,
when it should be calling m_freem(m). (m0 is the original mbuf list
_without_ the options mbuf prepended.)

Obtained from: NetBSD


85740 30-Oct-2001 des

Make sure the netmask always has an address family. This fixes Linux
ifconfig, which expects the address returned by the SIOCGIFNETMASK ioctl
to have a valid sa_family. Similar changes may be necessary for IPv6.

While we're here, get rid of an unnecessary temp variable.

MFC after: 2 weeks


85732 30-Oct-2001 jlemon

When dropping a packet because there is no room in the queue (which itself
is somewhat bogus), update the statistics to indicate something was dropped.

PR: 13740


85689 29-Oct-2001 joe

A few more style changes picked up whilst working on an MFC to -stable.


85687 29-Oct-2001 joe

Fix some whitespace, and a comment that I missed in the last commit.


85665 29-Oct-2001 joe

Clean up the style of this header file.


85658 29-Oct-2001 dillon

fix int argument used in printf w/ %ld (cast to long)


85467 25-Oct-2001 jlemon

Don't use the ip_timestamp structure to access timestamp options, as the
compiler may cause an unaligned access to be generated in some cases.

PR: 30982
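
One general way to avoid unaligned loads when pulling a 32-bit field out
of packet data is to copy the bytes into an aligned local variable first.
This is a sketch of that idiom with a hypothetical helper name; it is not
necessarily how the commit itself addressed the problem.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/*
 * Read a 32-bit network-order value from an arbitrary byte offset.
 * memcpy() avoids dereferencing a possibly misaligned uint32_t pointer.
 */
static uint32_t
read_be32(const unsigned char *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return (ntohl(v));
}
```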


85466 25-Oct-2001 jlemon

If we are bridging, fall back to using any inet address in the system,
irrespective of receive interface, as a last resort.

Submitted by: ru


85465 25-Oct-2001 jlemon

Relocate the KASSERT for a null recvif to a location where it will
actually do some good.

Pointed out by: ru


85315 22-Oct-2001 ume

Restore the data of the IP header when the extended UDP header and data
checksum is calculated. This caused some trouble in code which assumes the
IP header is not modified; for example, inbound policy lookup failed.

Obtained from: KAME
MFC after: 1 week


85223 20-Oct-2001 jlemon

Only examine inet addresses of the interface. This was broken in r1.83,
with the result that the system would reply to an ARP request of 0.0.0.0


85074 17-Oct-2001 ru

Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2.

Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *''
as the argument. Pass rt_addrinfo all the way down to rtrequest1
and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now
``rt_addrinfo *'' instead of ``sockaddr *'' (almost no one is
using it anyways).

Benefit: the following command now works. Previously we needed
two route(8) invocations, "add" then "change".
# route add -inet6 default ::1 -ifp gif0

Remove unsafe typecast in rtrequest(), from ``rtentry *'' to
``sockaddr *''. It was introduced by 4.3BSD-Reno and never
corrected.

Obtained from: BSD/OS, NetBSD
MFC after: 1 month
PR: kern/28360


84931 14-Oct-2001 fjoe

bring in ARP support for variable length link level addresses

Reviewed by: jdp
Approved by: jdp
Obtained from: NetBSD
MFC after: 6 weeks


84736 09-Oct-2001 rwatson

- Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, described as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each other's UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.

Reviewed by: ps, billf
Obtained from: TrustedBSD Project


84564 05-Oct-2001 jayanth

Add a flag TF_LASTIDLE, that forces a previously idle connection
to send all its data, especially when the data is less than one MSS.
This fixes an issue where the stack was delaying the sending
of data, even though there was enough window to send all the data and
the sending of data was emptying the socket buffer.

Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp)

Submitted by: Jayanth Vijayaraghavan


84527 05-Oct-2001 ps

Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks


84516 05-Oct-2001 ps

Make it so dummynet and bridge can be loaded as modules.

Submitted by: billf


84317 01-Oct-2001 jlemon

in_ifinit apparently can be used to rewrite an ip address; recalculate
the correct hash bucket for the entry.

Submitted by: iedowse (with some munging by me)


84315 01-Oct-2001 luigi

Fix a problem with unnumbered rules introduced in latest commit.
Reported by: des


84306 01-Oct-2001 ru

mdoc(7) police: Use the new .In macro for #include statements.


84195 30-Sep-2001 dillon

Add __FBSDID's to libalias


84137 29-Sep-2001 jlemon

Nuke unused (and incorrect) #define of INADDR_HMASK.

Spotted by: ru


84109 29-Sep-2001 jlemon

Make the INADDR_TO_IFP macro use the IP address hash lookup instead of
walking the entire list of IP addresses.

Pointed out by: bfumerola


84102 29-Sep-2001 jlemon

Add a hash table that contains the list of internet addresses, and use
this in place of the in_ifaddr list when appropriate. This improves
performance on hosts which have a large number of IP aliases.
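
The idea can be sketched as a chained hash table keyed on the IPv4
address. Everything below (names, table size, hash function) is an
assumption for illustration and does not reproduce the kernel's data
structures.

```c
#include <stddef.h>
#include <stdint.h>

#define	ADDR_HASH_SIZE	64	/* hypothetical; must be a power of two */

struct addr_entry {
	uint32_t addr;			/* IPv4 address, host order */
	struct addr_entry *next;	/* chain within one bucket */
};

static struct addr_entry *addr_hash[ADDR_HASH_SIZE];

/* Fold the address into a bucket index. */
static unsigned
addr_hashval(uint32_t addr)
{
	return ((addr ^ (addr >> 16)) & (ADDR_HASH_SIZE - 1));
}

/* Link an entry at the head of its bucket. */
static void
addr_insert(struct addr_entry *e)
{
	unsigned h = addr_hashval(e->addr);

	e->next = addr_hash[h];
	addr_hash[h] = e;
}

/* Walk only the matching bucket instead of the full address list. */
static struct addr_entry *
addr_lookup(uint32_t addr)
{
	struct addr_entry *e;

	for (e = addr_hash[addr_hashval(addr)]; e != NULL; e = e->next)
		if (e->addr == addr)
			return (e);
	return (NULL);
}
```

With many aliases configured, a lookup touches one short chain rather
than every address on the system.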


84101 29-Sep-2001 jlemon

Centralize satosin(), sintosa() and ifatoia() macros in <netinet/in.h>
Remove local definitions.


84058 27-Sep-2001 luigi

Two main changes here:
+ implement "limit" rules, which make it possible to limit the number of sessions
between certain host pairs (according to masks). These are a special
type of stateful rules, which might be of interest in some cases.
See the ipfw manpage for details.

+ merge the list pointers and ipfw rule descriptors in the kernel, so
the code is smaller, faster and more readable. This patch basically
consists in replacing "foo->rule->bar" with "rule->bar" all over
the place.
I have been wanting to do this for ages!

MFC after: 1 week


84023 27-Sep-2001 luigi

Remove unused (and duplicate) struct ip_opts which is never used,
not referenced in Stevens, and does not compile with g++.
There is an equivalent structure, struct ipoption in ip_var.h
which is actually used in various parts of the kernel, and also referenced
in Stevens.

Bill Fenner also says:
... if you want the trivia, struct ip_opts was introduced
in in.h SCCS revision 7.9, on 6/28/1990, by Mike Karels.
struct ipoption was introduced in ip_var.h SCCS revision 6.5,
on 9/16/1985, by... Mike Karels.

MFC-after: 3 days


83994 26-Sep-2001 brooks

Include sys/proc.h for the definition of securelevel_ge().

Submitted by: LINT


83970 26-Sep-2001 rwatson

o Modify IPFW and DUMMYNET administrative setsockopt() calls to use
securelevel_gt() to check the securelevel, rather than direct access
to the securelevel variable.

Obtained from: TrustedBSD Project


83934 25-Sep-2001 brooks

Make faith loadable, unloadable, and clonable.


83873 24-Sep-2001 luigi

Fix a null pointer dereference introduced in the last commit, plus
remove a useless assignment and move a comment.

Submitted by: Thomas Moestl


83771 21-Sep-2001 ru

Fixed the bug that prevented communication with FTP servers behind
NAT in extended passive mode if the server's public IP address was
different from the main NAT address. This caused a wrong aliasing
link to be created that did not route the incoming packets back to
the original IP address of the server.

natd -v -n pub0 -redirect_address localFTP publicFTP

Note that even if localFTP == publicFTP, one still needs to supply
the -redirect_address directive. It is needed as a helper because
extended passive mode's 229 reply does not contain the IP address.

MFC after: 1 week


83742 20-Sep-2001 rwatson

o Rename u_cansee() to cr_cansee(), making the name more comprehensible
in the face of a rename of ucred to cred, and possibly generally.

Obtained from: TrustedBSD Project


83725 20-Sep-2001 luigi

A bunch of minor changes to the code (see below) for readability, code size
and speed. No new functionality added (yet) apart from a bugfix.
MFC will occur in due time and probably in stages.

BUGFIX: fix a problem in old code which prevented reallocation of
the hash table for dynamic rules (there is a PR on this).

OTHER CHANGES: minor changes to the internal struct for static and dynamic rules.
Requires rebuild of ipfw binary.

Add comments to show how data structures are linked together.
(It probably makes no sense to keep the chain pointers separate
from actual rule descriptors. They will be hopefully merged soon.)

keep a (sysctl-readable) counter for the number of static rules,
to speed up IP_FW_GET operations

initial support for a "grace time" for expired connections, so we
can set timeouts for closing connections to much shorter times.

merge zero_entry() and resetlog_entry(), they use basically the
same code.

clean up and reduce replication of code for removing rules,
both for readability and code size.

introduce a separate lifetime for dynamic UDP rules.

fix a problem in old code which prevented reallocation of
the hash table for dynamic rules (PR ...)

restructure dynamic rule descriptors

introduce some local variables to avoid multiple dereferencing of
pointer chains (reduces code size and hopefully increases speed).


83708 20-Sep-2001 sumikawa

Fixed comment: ipip_input -> mroute_encapcheck.

Reported by: bde


83615 18-Sep-2001 sumikawa

Removed ipip_input(). No code calls it anymore due to ip_encap.c's
encapsulation support.


83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


83188 07-Sep-2001 julian

Remove some un-needed code that was accidentally included in
the 2nd previous KAME patch.

Submitted by: SUMIKAWA Munechika <sumikawa@ebina.hitachi.co.jp>


83187 07-Sep-2001 julian

Patches from KAME to remove usage of Varargs in existing
IPV4 code. For now they will still have some in the developing stuff (IPv6)

Submitted by: Keiichi SHIMA / <keiichi@iij.ad.jp>
Obtained from: KAME


83130 06-Sep-2001 jlemon

Wrap array accesses in macros, which also happen to be lvalues:

ifnet_addrs[i - 1] -> ifaddr_byindex(i)
ifindex2ifnet[i] -> ifnet_byindex(i)

This is intended to ease the conversion to SMPng.
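
A minimal userland sketch of the idea, not the kernel's actual definitions: the struct layouts, MAX_IFS, and attach_interface() are hypothetical. It shows why an accessor macro that expands to an array element can still be assigned to, and how the macro hides the i - 1 offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the shape matters. */
struct ifnet { int if_index; };
struct ifaddr { int ifa_dummy; };

#define MAX_IFS 8	/* assumed table size for this sketch */

/* The raw tables that callers used to index directly. */
static struct ifnet *ifindex2ifnet[MAX_IFS + 1];
static struct ifaddr *ifnet_addrs[MAX_IFS];

/*
 * Accessor macros. Each expands to an array element expression, so the
 * result is an lvalue and may appear on the left side of an assignment.
 * The off-by-one in the address table is hidden inside the macro.
 */
#define ifnet_byindex(i)	(ifindex2ifnet[(i)])
#define ifaddr_byindex(i)	(ifnet_addrs[(i) - 1])

/* Register an interface and its address under a 1-based index. */
static void
attach_interface(int idx, struct ifnet *ifp, struct ifaddr *ifa)
{
	assert(idx >= 1 && idx <= MAX_IFS);
	ifp->if_index = idx;
	ifnet_byindex(idx) = ifp;	/* macro used as an lvalue */
	ifaddr_byindex(idx) = ifa;
}
```

Once every caller goes through the macros, the storage behind them can presumably be changed (or locked) in one place, which is the kind of conversion the commit message has in mind.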


82966 04-Sep-2001 alfred

Fix sysctl comment field, s/the the/then the

Pointed out by: ru


82893 03-Sep-2001 alfred

Allow disabling of "arp moved" messages.

Submitted by: Stephen Hurd <deuce@lordlegacy.org>


82892 03-Sep-2001 julian

I really hope this is the right answer.
call ip_input directly but take the offset off the
packet first if it's an IPV4 packet encapsulated.


82891 03-Sep-2001 julian

Call ip_input() instead of ipip_input()
when decoding encapsulated ipv4 packets.
(allows line to compile again)


82890 03-Sep-2001 julian

One caller of rip_input failed to be converted in the last commit.


82884 03-Sep-2001 julian

Patches from Keiichi SHIMA <keiichi@iij.ad.jp>
to make ip use the standard protosw structure again.

Obtained from: Well, KAME I guess.


82529 29-Aug-2001 jayanth

when newreno is turned on, if dupacks = 1 or dupacks = 2 and
new data is acknowledged, reset the dupacks to 0.
The problem was spotted when a connection had its send buffer full
because the congestion window was only 1 MSS and was not being incremented
because dupacks was not reset to 0.

Obtained from: Yahoo!
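
The rule described above can be sketched as follows. This is an illustration under stated assumptions, not the committed code: struct tcb_sketch and ack_received() are hypothetical, a duplicate ACK is simplified to "ACK number equals snd_una", and fast retransmit/recovery (three or more dupacks) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal slice of a TCP control block. */
struct tcb_sketch {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	int dupacks;		/* consecutive duplicate ACKs seen */
};

/* Sequence-space comparison that tolerates 32-bit wraparound. */
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/* Process the ACK field of an incoming segment. */
static void
ack_received(struct tcb_sketch *tp, uint32_t th_ack)
{
	assert(tp != NULL);
	if (SEQ_GT(th_ack, tp->snd_una)) {
		/*
		 * New data is acknowledged. If only one or two duplicate
		 * ACKs had been counted, forget them so that a stale count
		 * cannot hold back later congestion window growth.
		 */
		if (tp->dupacks == 1 || tp->dupacks == 2)
			tp->dupacks = 0;
		tp->snd_una = th_ack;
	} else if (th_ack == tp->snd_una) {
		tp->dupacks++;	/* simplified duplicate-ACK test */
	}
}
```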


82445 27-Aug-2001 jesper

When net.inet.tcp.icmp_may_rst is enabled, report ECONNREFUSED, not ENETRESET,
to the application, as a RST would; this way we're compatible with most
applications.

MFC candidate.

Submitted by: Scott Renfro <scott@renfro.org>
Reviewed by: Mike Silbersack <silby@silby.com>


82345 26-Aug-2001 billf

the IP_FW_GET code in ip_fw_ctl() sizes a buffer to hold information
about rules and dynamic rules. it later fills this buffer with these
rules.

it also takes the opportunity to compare the expiration of the dynamic
rules with the current time and either marks them for deletion or simply
charges the countdown.

unfortunately it does this all (the sizing, the buffer copying, and the
expiration GC) with no spl protection whatsoever. it was possible for
the dynamic rule(s) to be ripped out from under the request before it
had completed, resulting in corrupt memory dereferencing.

Reviewed by: ps
MFC before: 4.4-RELEASE, hopefully.


82238 23-Aug-2001 dd

Correct a typo in a comment: FIN_WAIT2 -> FIN_WAIT_2

PR: 29970
Submitted by: Joseph Mallett <jmallett@xMach.org>


82122 22-Aug-2001 silby

Much delayed but now present: RFC 1948 style sequence numbers

In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major cryptographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.

Reviewed by: jesper


82069 21-Aug-2001 ru

Added TFTP support.

Submitted by: Joe Clarke <marcus@marcuscom.com>
MFC after: 2 weeks


82050 21-Aug-2001 ru

Close the "IRC DCC" security breach reported recently on Bugtraq.

Submitted by: Makoto MATSUSHITA <matusita@jp.FreeBSD.org>


82001 20-Aug-2001 brian

Make the copyright consistent.

Previously approved by: Charles Mott <cmott@scientech.com>


81962 20-Aug-2001 brian

Handle snprintf() returning -1

MFC after: 2 weeks


81501 10-Aug-2001 julian

Make the protoswitch definitions checkable in the same way that
cdevsw entries have been for a long time.
Discover that we now have two versions of the same structure.
I will shoot one of them shortly when I figure out why someone thinks
they need it. (And I can prove they don't)
(netinet/ipprotosw.h should GO AWAY)


81251 07-Aug-2001 ru

mdoc(7) police:

Avoid using parenthesis enclosure macros (.Pq and .Po/.Pc) with plain text.
Not only does this slow down the mdoc(7) processing significantly, but it also
has an undesired (in this case) effect of disabling hyphenation within the
entire enclosed block.


81127 04-Aug-2001 ume

When an application has joined a multicast address and the network
card is then removed, imo_membership[].inm_ifp still refers to the
interface pointer of the removed interface.
When the application is killed, the socket and imo_membership are
released, and imo_membership uses the interface pointer that no
longer exists.
The kernel then panics.

PR: 29345
Submitted by: Inoue Yuichi <inoue@nd.net.fujitsu.co.jp>
Obtained from: KAME
MFC after: 3 days


81111 03-Aug-2001 dcs

MFS: Avoid dropping fragments in the absence of an interface address.

Noticed by: fenner
Submitted by: iedowse
Not committed to current by: iedowse ;-)


80429 27-Jul-2001 peter

Fix a warning.


80428 27-Jul-2001 peter

Patch up some style(9) stuff in tcp_new_isn()


80427 27-Jul-2001 peter

s/OpemBSD/OpenBSD/


80406 26-Jul-2001 ume

move ipsec security policy allocation into in_pcballoc, before
making pcbs available to the outside world. otherwise, we will see
inpcb without ipsec security policy attached (-> panic() in ipsec.c).

Obtained from: KAME
MFC after: 3 days


80354 25-Jul-2001 fenner

Somewhat modernize ip_mroute.c:
- Use sysctl to export stats
- Use ip_encap.c's encapsulation support
- Update lkm to kld (is 6 years a record for a broken module?)
- Remove some unused cruft


80211 23-Jul-2001 ru

Avoid a NULL pointer dereference introduced in rev. 1.129.

Problem noticed by: bde, gcc(1)
Panic caught by: mjacob
Patch tested by: mjacob


79934 19-Jul-2001 ru

Backout non-functional changes from revision 1.128.

Not objected to by: dcs


79830 17-Jul-2001 dcs

Skip the route checking in the case of multicast packets with known
interfaces.

Reviewed by: people at that channel
Approved by: silence on -net


79821 17-Jul-2001 ru

Backout damage to the INADDR_TO_IFP() macro in revision 1.7.

This macro was supposed to only match local IP addresses of
interfaces, and all consumers of this macro assume this as
well. (See IP_MULTICAST_IF and IP_ADD_MEMBERSHIP socket
options in the ip(4) manpage.)

This fixes a major security breach in IPFW-based firewalls
where the `me' keyword would match the other end of a P2P
link.

PR: kern/28567


79685 13-Jul-2001 obrien

Bump net.inet.tcp.sendspace to 32k and net.inet.tcp.recvspace to 65k.
This should help us in naive benchmark "tests".

It seems a wide number of people think 32k buffers would not cause major
issues, and is in fact in use by many other OS's at this time. The
receive buffers can be bumped higher as buffers are hardly used and several
research papers indicate that receive buffers rarely use much space at all.

Submitted by: Leo Bicknell <bicknell@ufp.org>
<20010713101107.B9559@ussenterprise.ufp.org>
Agreed to in principle by: dillon (at the 32k level)


79531 10-Jul-2001 ru

mdoc(7) police: removed HISTORY info from the .Os call.


79413 08-Jul-2001 silby

Temporary feature: Runtime tuneable tcp initial sequence number
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.

While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.

To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.

Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.

Reviewed by: jlemon
Tested by: numerous subscribers of -net


79106 02-Jul-2001 brooks

gif(4) and stf(4) modernization:

- Remove gif dependencies from stf.
- Make gif and stf into modules
- Make gif cloneable.

PR: kern/27983
Reviewed by: ru, ume
Obtained from: NetBSD
MFC after: 1 week


79092 02-Jul-2001 cjc

While in there fixing a fragment logging bug, fix it so we log
fragments "right." Log fragment information tcpdump(8)-style,

Jul 1 19:38:45 bubbles /boot/kernel/kernel: ipfw: 1000 Accept ICMP:8.0 192.168.64.60 192.168.64.20 in via ep0 (frag 53113:1480@0+)

That is, instead of the old,

... Fragment = <offset/8>

Do,

... (frag <IP ID>:<data len>@<offset>[+])

PR: kern/23446
Approved by: ru
MFC after: 1 week
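
A hedged sketch of how the new log suffix can be produced. format_frag() is a hypothetical helper, not the ipfw code; it assumes all header fields are already in host byte order, and the two SKETCH_ constants mirror IP_MF and IP_OFFMASK from netinet/ip.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_IP_MF		0x2000	/* "more fragments" flag */
#define SKETCH_IP_OFFMASK	0x1fff	/* fragment offset, 8-byte units */

/*
 * Write "(frag <IP ID>:<data len>@<offset>[+])" into buf.
 * ip_len is the total datagram length, hdr_len the IP header length in
 * bytes, ip_off the flags/offset field. Returns what snprintf() returns.
 */
static int
format_frag(char *buf, size_t len, uint16_t ip_id, uint16_t ip_len,
    unsigned hdr_len, uint16_t ip_off)
{
	assert(buf != NULL && len > 0 && ip_len >= hdr_len);
	return snprintf(buf, len, "(frag %u:%u@%u%s)",
	    (unsigned)ip_id,
	    (unsigned)ip_len - hdr_len,			/* payload bytes */
	    (unsigned)(ip_off & SKETCH_IP_OFFMASK) << 3,	/* byte offset */
	    (ip_off & SKETCH_IP_MF) ? "+" : "");
}
```

Called with ID 53113, a 1500-byte datagram, a 20-byte header and only the more-fragments flag set, this yields the "(frag 53113:1480@0+)" suffix seen in the sample log line.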


78964 29-Jun-2001 ru

Backout CSRG revision 7.22 to this file (if in_losing notices an
RTF_DYNAMIC route, it got freed twice). I am not sure what was
the actual problem in 1992, but the current behavior is memory
leak if PCB holds a reference to a dynamically created/modified
routing table entry. (rt_refcnt>0 and we don't call rtfree().)

My test bed was:

1. Set net.inet.tcp.msl to a low value (for test purposes), e.g.,
5 seconds, to speed up the transition of TCP connection to a
"closed" state.
2. Add a network route which causes ICMP redirect from the gateway.
3. ping(8) host H that matches this route; this creates RTF_DYNAMIC
RTF_HOST route to H. (I was forced to use ICMP to cause gateway
to generate ICMP host redirect, because gateway in question is a
4.2-STABLE system vulnerable to a problem that was fixed later in
ip_icmp.c,v 1.39.2.6, and TCP packets with DF bit set were
triggering this bug.)
4. telnet(1) to H
5. Block access to H with ipfw(8)
6. Send something in telnet(1) session; this causes EPERM, followed
by an in_losing() call in a few seconds.
7. Delete ipfw(8) rule blocking access to H, and wait for TCP
connection moving to a CLOSED state; PCB is freed.
8. Delete host route to H.
9. Watch with netstat(1) that `rttrash' increased.
10. Repeat steps 3-9, and watch `rttrash' increases.

PR: kern/25421
MFC after: 2 weeks


78886 27-Jun-2001 ru

Fixed the brain-o in rev. 1.10: the logic check was reversed.

Reported by: Bernd Fuerwitt <bf@fuerwitt.de>


78805 26-Jun-2001 ru

Bring in fix from NetBSD's revision 1.16:

Pass the correct destination address for the route-to-gateway case.

PR: kern/10607
MFC after: 2 weeks


78697 24-Jun-2001 dwmalone

Allow getcred sysctl to work in jailed root processes. Processes can
only do getcred calls for sockets which were created in the same jail.
This should allow the ident to work in a reasonable way within jails.

PR: 28107
Approved by: des, rwatson


78671 23-Jun-2001 jlemon

Replace bzero() of struct ip with explicit zeroing of structure members,
which is faster.


78667 23-Jun-2001 ru

Add netstat(1) knob to reset net.inet.{ip|icmp|tcp|udp|igmp}.stats.
For example, ``netstat -s -p ip -z'' will show and reset IP stats.

PR: bin/17338


78642 23-Jun-2001 silby

Eliminate the allocation of a tcp template structure for each
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.

Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.

Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks


78539 21-Jun-2001 sumikawa

- Renumber KAME local ICMP types and NDP option numbers because they
are duplicated by newly defined types/options in RFC3121
- We have no backward compatibility issue. There are no apps in our
distribution which use the above types/options.

Obtained from: KAME
MFC after: 2 weeks


78492 20-Jun-2001 ume

made sure to use the correct sa_len for rtalloc().
sizeof(ro_dst) is not necessarily the correct one.
this change would also fix the recent path MTU discovery problem for the
destination of an incoming TCP connection.

Submitted by: JINMEI Tatuya <jinmei@kame.net>
Obtained from: KAME
MFC after: 2 weeks


78295 15-Jun-2001 jlemon

Do not perform arp send/resolve on an interface marked NOARP.

PR: 25006
MFC after: 2 weeks


78243 15-Jun-2001 peter

Fix a stack of KAME netinet6/in6.h warnings:
592: warning: `struct mbuf' declared inside parameter list
595: warning: `struct ifnet' declared inside parameter list


78064 11-Jun-2001 ume

Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problems found after the snap was out were fixed.
There are many many changes since last KAME merge.

TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks


77969 10-Jun-2001 jesper

Make the default value of net.inet.ip.maxfragpackets and
net.inet6.ip6.maxfragpackets dependent on nmbclusters,
defaulting to nmbclusters / 4

Reviewed by: bde
MFC after: 1 week


77900 08-Jun-2001 peter

"Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment. This saves duplicating the same function
over and over again. This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.


77859 07-Jun-2001 jlemon

Move IPFilter into contrib.


77853 07-Jun-2001 peter

Back out part of my previous commit. This was a last minute change
and I botched testing. This is a perfect example of how NOT to do
this sort of thing. :-(


77843 06-Jun-2001 peter

Make the TUNABLE_*() macros look and behave more consistently like the
SYSCTL_*() macros. TUNABLE_INT_DECL() was an odd name because it didn't
actually declare the int, which is what the name suggests it would do.


77830 06-Jun-2001 jesper

Silby's take one on increasing FreeBSD's resistance to SYN floods:

One way we can reduce the amount of traffic we send in response to a SYN
flood is to eliminate the RST we send when removing a connection from
the listen queue. Since we are being flooded, we can assume that the
majority of connections in the queue are bogus. Our RST is unwanted
by these hosts, just as our SYN-ACK was. Genuine connection attempts
will result in hosts responding to our SYN-ACK with an ACK packet. We
will automatically return a RST response to their ACK when it gets to us
if the connection has been dropped, so the early RST doesn't serve the
genuine class of connections much. In summary, we can reduce the number
of packets we send by a factor of two without any loss in functionality
by ensuring that RST packets are not sent when dropping a connection
from the listen queue.

Submitted by: Mike Silbersack <silby@silby.com>
Reviewed by: jesper
MFC after: 2 weeks


77701 04-Jun-2001 brian

Add BSD-style copyright headers

Approved by: Charles Mott <cmott@scientech.com>


77696 04-Jun-2001 brian

Change to a standard BSD-style copyright

Approved by: Atsushi Murai <amurai@spec.co.jp>


77665 03-Jun-2001 jesper

Prevent denial of service using bogus fragmented IPv4 packets.

An attacker sending a lot of bogus fragmented packets to the target
(with different IPv4 identification field - ip_id), may be able
to put the target machine into mbuf starvation state.

By setting an upper limit on the number of reassembly queues we
prevent this situation.

This upper limit is controlled by the new sysctl
net.inet.ip.maxfragpackets which defaults to 200,
as in the IPv6 case. This should be sufficient for most
systems, but you might want to increase it if you have
lots of TCP sessions.
I'm working on making the default value dependent on
nmbclusters.

If you want old behaviour (no upper limit) set this sysctl
to a negative value.

If you don't want to accept any fragments (not recommended)
set the sysctl to 0 (zero).

Obtained from: NetBSD
MFC after: 1 week
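
The three settings of the sysctl described above can be sketched as a single predicate. This is only an illustration of the documented semantics; may_create_reass_queue() and its parameter names are hypothetical, and the real check in the reassembly code may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a fragment may start a new reassembly queue.
 * maxfragpackets: value of the sysctl (negative = no limit,
 *                 0 = accept no fragments, positive = upper limit).
 * nipq:           number of reassembly queues currently allocated.
 */
static bool
may_create_reass_queue(int maxfragpackets, int nipq)
{
	assert(nipq >= 0);
	if (maxfragpackets < 0)
		return true;		/* old behaviour: unlimited */
	if (maxfragpackets == 0)
		return false;		/* fragments are never accepted */
	return nipq < maxfragpackets;	/* room left under the limit */
}
```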


77574 01-Jun-2001 kris

Add ``options RANDOM_IP_ID'' which randomizes the ID field of IP packets.
This closes a minor information leak which allows a remote observer to
determine the rate at which the machine is generating packets, since the
default behaviour is to increment a counter for each packet sent.

Reviewed by: -net
Obtained from: OpenBSD


77572 01-Jun-2001 obrien

Back out jesper's 2001/05/31 14:58:11 PDT commit. It does not compile.


77545 31-May-2001 jesper

Prevent denial of service using bogus fragmented IPv4 packets.

An attacker sending a lot of bogus fragmented packets to the target
(with different IPv4 identification field - ip_id), may be able
to put the target machine into mbuf starvation state.

By setting an upper limit on the number of reassembly queues we
prevent this situation.

This upper limit is controlled by the new sysctl
net.inet.ip.maxfragpackets which defaults to NMBCLUSTERS/4

If you want old behaviour (no upper limit) set this sysctl
to a negative value.

If you don't want to accept any fragments (not recommended)
set the sysctl to 0 (zero)

Obtained from: NetBSD (partially)
MFC after: 1 week


77539 31-May-2001 jesper

Disable rfc1323 and rfc1644 TCP extensions if we haven't got
any response to our third SYN to work-around some broken
terminal servers (most of which have hopefully been retired)
that have bad VJ header compression code which trashes TCP
segments containing unknown-to-them TCP options.

PR: kern/1689
Submitted by: jesper
Reviewed by: wollman
MFC after: 2 weeks


77485 30-May-2001 ru

Add an integer field to keep protocol-specific flags with links.

For FTP control connection, keep the CRLF end-of-line termination
status in there.

Fixed the bug when the first FTP command in a session was ignored.

PR: 24048
MFC after: 1 week


77427 29-May-2001 jesper

Inline TCP_REASS() in the single location where it's used,
just as OpenBSD and NetBSD have done.

No functional difference.

MFC after: 2 weeks


77421 29-May-2001 jesper

properly delay acks in half-closed TCP connections

PR: 24962
Submitted by: Tony Finch <dot@dotat.at>
MFC after: 2 weeks


76469 11-May-2001 ru

In in_ifadown(), differentiate between whether the interface goes
down or interface address is deleted. Only delete static routes
in the latter case.

Reported by: Alexander Leidinger <Alexander@leidinger.net>


76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


75733 20-Apr-2001 jesper

Say goodbye to TCP_COMPAT_42

Reviewed by: wollman
Requested by: wollman


75619 17-Apr-2001 kris

Randomize the TCP initial sequence numbers more thoroughly.

Obtained from: OpenBSD
Reviewed by: jesper, peter, -developers


75262 06-Apr-2001 darrenr

fix security hole created by fragment cache


75255 06-Apr-2001 billf

pipe/queue are the only consumers of flow_id, so only set it in those cases


74937 28-Mar-2001 jesper

MFC candidate.

Change code from PRC_UNREACH_ADMIN_PROHIB to PRC_UNREACH_PORT for
ICMP_UNREACH_PROTOCOL and ICMP_UNREACH_PORT

And let TCP treat PRC_UNREACH_PORT like PRC_UNREACH_ADMIN_PROHIB

This should fix the case where port unreachables for udp returned
ENETRESET instead of ECONNREFUSED

Problem found by: Bill Fenner <fenner@research.att.com>
Reviewed by: jlemon


74870 27-Mar-2001 ru

MAN[1-9] -> MAN.


74851 27-Mar-2001 yar

Add a missing m_pullup() before a mtod() in in_arpinput().

PR: kern/22177
Reviewed by: wollman


74839 27-Mar-2001 simokawa

Replace dyn_fin_lifetime with dyn_ack_lifetime for half-closed state.
Half-closed state could last long for some connections and fin_lifetime
(default 20sec) is too short for that.

OK'ed by: luigi


74810 26-Mar-2001 phk

Send the remains (such as I have located) of "block major numbers" to
the bit-bucket.


74778 25-Mar-2001 brian

Make header files conform to style(9).

Reviewed by (*): bde

(*) alias_local.h only got a cursory glance.


74768 25-Mar-2001 brian

Remove an extraneous declaration.


74700 23-Mar-2001 ume

IPv4 address is not unsigned int. This change introduces in_addr_t.

PR: 9982
Adviced by: des
Reviewed by: -alpha and -net (no objection)
Obtained from: OpenBSD


74651 22-Mar-2001 brian

Remove (non-protected) variable names from function prototypes.


74551 21-Mar-2001 paul

Only flush rules that have a rule number above that set by a new
sysctl, net.inet.ip.fw.permanent_rules.

This allows you to install rules that are persistent across flushes,
which is very useful if you want a default set of rules that
maintains your access to remote machines while you're reconfiguring
the other rules.

Reviewed by: Mark Murray <markm@FreeBSD.org>


74494 19-Mar-2001 des

Axe TCP_RESTRICT_RST. It was never a particularly good idea except for a few
very specific scenarios, and now that we have had net.inet.tcp.blackhole for
quite some time there is really no reason to use it any more.

(last of three commits)


74454 19-Mar-2001 ru

Invalidate cached forwarding route (ipforward_rt) whenever a new route
is added to the routing table, otherwise we may end up using the wrong
route when forwarding.

PR: kern/10778
Reviewed by: silence on -net


74415 18-Mar-2001 ru

Make sure the cached forwarding route (ipforward_rt) is still up before
using it. Not checking this may have caused the wrong IP address to be
used when processing certain IP options (see example below). This also
caused the wrong route to be passed to ip_output() when forwarding, but
fortunately ip_output() is smart enough to detect this.

This example demonstrates the wrong behavior of the Record Route option
observed with this bug. Host ``freebsd'' is acting as the gateway for
the ``sysv''.

1. On the gateway, we add the route to the destination. The new route
will use the primary address of the loopback interface, 127.0.0.1:

: freebsd# route add 10.0.0.66 -iface lo0 -reject
: add host 10.0.0.66: gateway lo0

2. From the client, we ping the destination. We see the correct replies.
Please note that this also causes the relevant route on the ``freebsd''
gateway to be cached in ipforward_rt variable:

: sysv# ping -snv 10.0.0.66
: PING 10.0.0.66: 56 data bytes
: ICMP Host Unreachable from gateway 192.168.0.115
: ICMP Host Unreachable from gateway 192.168.0.115
: ICMP Host Unreachable from gateway 192.168.0.115
:
: ----10.0.0.66 PING Statistics----
: 3 packets transmitted, 0 packets received, 100% packet loss

3. On the gateway, we delete the route to the destination, thus making
the destination reachable through the `default' route:

: freebsd# route delete 10.0.0.66
: delete host 10.0.0.66

4. From the client, we ping destination again, now with the RR option
turned on. The surprise here is the 127.0.0.1 in the first reply.
This is caused by the bug in ip_rtaddr() not checking the cached
route is still up before use. The debug code also shows that the
wrong (down) route is further passed to ip_output(). The latter
detects that the route is down, and replaces the bogus route with
the valid one, so we see the correct replies (192.168.0.115) on
further probes:

: sysv# ping -snRv 10.0.0.66
: PING 10.0.0.66: 56 data bytes
: 64 bytes from 10.0.0.66: icmp_seq=0. time=10. ms
: IP options: <record route> 127.0.0.1, 10.0.0.65, 10.0.0.66,
: 192.168.0.65, 192.168.0.115, 192.168.0.120,
: 0.0.0.0(Current), 0.0.0.0, 0.0.0.0
: 64 bytes from 10.0.0.66: icmp_seq=1. time=0. ms
: IP options: <record route> 192.168.0.115, 10.0.0.65, 10.0.0.66,
: 192.168.0.65, 192.168.0.115, 192.168.0.120,
: 0.0.0.0(Current), 0.0.0.0, 0.0.0.0
: 64 bytes from 10.0.0.66: icmp_seq=2. time=0. ms
: IP options: <record route> 192.168.0.115, 10.0.0.65, 10.0.0.66,
: 192.168.0.65, 192.168.0.115, 192.168.0.120,
: 0.0.0.0(Current), 0.0.0.0, 0.0.0.0
:
: ----10.0.0.66 PING Statistics----
: 3 packets transmitted, 3 packets received, 0% packet loss
: round-trip (ms) min/avg/max = 0/3/10


74362 16-Mar-2001 phk

<sys/queue.h> makeover.


74361 16-Mar-2001 phk

Fix a style(9) nit.


74299 15-Mar-2001 ru

net/route.c:

A route generated from an RTF_CLONING route had the RTF_WASCLONED flag
set but did not have a reference to the parent route, as documented in
the rtentry(9) manpage. This prevented such routes from being deleted
when their parent route is deleted.

Now, for example, if you delete an IP address from a network interface,
all ARP entries that were cloned from this interface route are flushed.

This also has an impact on netstat(1) output. Previously, dynamically
created ARP cache entries (RTF_STATIC flag is unset) were displayed as
part of the routing table display (-r). Now, they are only printed if
the -a option is given.

netinet/in.c, netinet/in_rmx.c:

When address is removed from an interface, also delete all routes that
point to this interface and address. Previously, for example, if you
changed the address on an interface, outgoing IP datagrams might still
use the old address. The only solution was to delete and re-add some
routes. (The problem is easily observed with the route(8) command.)

Note, that if the socket was already bound to the local address before
this address is removed, new datagrams generated from this socket will
still be sent from the old address.

PR: kern/20785, kern/21914
Reviewed by: wollman (the idea)


74213 13-Mar-2001 ru

RFC768 (UDP) requires that "if the computed checksum is zero, it
is transmitted as all ones". This got broken after introduction
of delayed checksums as follows. Some guys (including Jonathan)
think that it is allowed to transmit all ones in place of a zero
checksum for TCP the same way as for UDP. (The discussion still
takes place on -net.) Thus, the 0 -> 0xffff checksum fixup was
first moved from udp_output() (see udp_usrreq.c, 1.64 -> 1.65)
to in_cksum_skip() (see sys/i386/i386/in_cksum.c, 1.17 -> 1.18,
INVERT expression). Besides that I disagree that it is valid for
TCP, there was no real problem until in_cksum.c,v 1.20, where the
in_cksum() was made just a special version of in_cksum_skip().
The side effect was that now every incoming IP datagram failed to
pass the checksum test (in_cksum() returned 0xffff when it should
actually return zero). It was fixed next day in revision 1.21,
by removing the INVERT expression. The latter also broke the
0 -> 0xffff fixup for UDP checksums.

Before this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 0000

After this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 ffff
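
The 0 -> 0xffff fixup can be checked by hand against the dump above. The following is a portable sketch, not the in_cksum() implementation: ones_sum() and udp_cksum_for_wire() are hypothetical helpers that work on 16-bit words already in host order, with the UDP pseudo-header supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of n 16-bit words, folded back to 16 bits. */
static uint16_t
ones_sum(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	assert(words != NULL);
	for (i = 0; i < n; i++)
		sum += words[i];
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Checksum as it should appear on the wire for UDP: the complement of
 * the sum, except that a computed zero is sent as all ones (RFC 768),
 * because zero in the field means "no checksum".
 */
static uint16_t
udp_cksum_for_wire(const uint16_t *words, size_t n)
{
	uint16_t ck = (uint16_t)~ones_sum(words, n);

	if (ck == 0)
		ck = 0xffff;
	return ck;
}
```

For the packet in the dump, the pseudo-header (7f00 0001, 7f00 0001, protocol 0011, length 0008) plus the UDP header with a zeroed checksum field (80ed 80ee 0008 0000) sums to 0xffff, so the complement is zero and the field must carry ffff, matching the "after" trace.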


74209 13-Mar-2001 ru

Count and show incoming UDP datagrams with no checksum.


74183 12-Mar-2001 phk

Correctly cleanup in case of failure to bind a pcb.

PR: 25751
Submitted by: <unicorn@Forest.Od.UA>


74134 12-Mar-2001 jlemon

Unbreak LINT.

Pointed out by: phk


74111 11-Mar-2001 iedowse

In ip_output(), initialise `ia' in the case where the packet has
come from a dummynet pipe. Without this, the code which increments
the per-ifaddr stats can dereference an uninitialised pointer. This
should make dummynet usable again.

Reported by: "Dmitry A. Yanko" <fm@astral.ntu-kpi.kiev.ua>
Reviewed by: luigi, joe


74024 09-Mar-2001 ru

Make it possible to use IP_TTL and IP_TOS setsockopt(2) options
on certain types of SOCK_RAW sockets. Also, use the ip.ttl MIB
variable instead of MAXTTL constant as the default time-to-live
value for outgoing IP packets all over the place, as we already
do this for TCP and UDP.

Reviewed by: wollman


74018 09-Mar-2001 jlemon

Push the test for a disconnected socket when accept()ing down to the
protocol layer. Not all protocols behave identically. This fixes the
brokenness observed with unix-domain sockets (and postfix)


74017 09-Mar-2001 jlemon

The TCP sequence number used for sending a RST with the ipfw reset rule
is already in host byte order, so do not swap it again.

Reviewed by: bfumerola


73996 08-Mar-2001 iedowse

It was possible for ip_forward() to supply to icmp_error()
an IP header with ip_len in network byte order. For certain
values of ip_len, this could cause icmp_error() to write
beyond the end of an mbuf, causing mbuf free-list corruption.
This problem was observed during generation of ICMP redirects.

We now make quite sure that the copy of the IP header kept
for icmp_error() is stored in a non-shared mbuf header so
that it will not be modified by ip_output().

Also:
- Calculate the correct number of bytes that need to be
retained for icmp_error(), instead of assuming that 64
is enough (it's not).
- In icmp_error(), use m_copydata instead of bcopy() to
copy from the supplied mbuf chain, in case the first 8
bytes of IP payload are not stored directly after the IP
header.
- Sanity-check ip_len in icmp_error(), and panic if it is
less than sizeof(struct ip). Incoming packets with bad
ip_len values are discarded in ip_input(), so this should
only be triggered by bugs in the code, not by bad packets.

This patch results from code and suggestions from Ruslan, Bosko,
Jonathan Lemon and Matt Dillon, with important testing by Mike
Tancsa, who could reproduce this problem at will.

Reported by: Mike Tancsa <mike@sentex.net>
Reviewed by: ru, bmilekic, jlemon, dillon
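
Why 64 bytes is not enough can be shown with a small calculation. This sketch assumes the usual ICMP error convention of quoting the offending IP header plus the first 8 bytes of its payload; icmp_quote_len() is a hypothetical helper, not the function changed by this commit.

```c
#include <assert.h>

/*
 * Number of bytes of the original datagram to keep for an ICMP error.
 * ip_hl  is the header length in 32-bit words (5..15),
 * ip_len is the total datagram length in host byte order.
 */
static unsigned
icmp_quote_len(unsigned ip_hl, unsigned ip_len)
{
	unsigned want;

	assert(ip_hl >= 5 && ip_hl <= 15);
	assert(ip_len >= 20);		/* at least a minimal IP header */
	want = (ip_hl << 2) + 8;	/* header plus 8 payload bytes */
	return want < ip_len ? want : ip_len;
}
```

With a maximal header (ip_hl of 15, i.e. 60 bytes) the result is 68, which already exceeds a fixed 64-byte copy.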


73791 05-Mar-2001 truckman

Modify the comments to more closely resemble the English language.


73626 05-Mar-2001 truckman

Move the loopback net check closer to the beginning of ip_input() so that
it doesn't block packets whose destination address has been translated to
the loopback net by ipnat.

Add warning comments about the ip_checkinterface feature.


73540 04-Mar-2001 bmilekic

During a flood, we don't call rtfree(), but we remove the entry ourselves.
However, if the RTF_DELCLONE and RTF_WASCLONED condition passes, but the ref
count is > 1, we won't decrement the count at all. This could lead to
route entries never being deleted.

Here, we call rtfree() not only if the initial two conditions fail, but
also if the ref count is > 1 (and we therefore don't immediately delete
the route, but let rtfree() handle it).

This is an urgent MFC candidate. Thanks go to Mike Silbersack for the
fix, once again. :-)

Submitted by: Mike Silbersack <silby@silby.com>


73402 04-Mar-2001 truckman

Disable interface checking for packets subject to "ipfw fwd".

Chris Johnson <cjohnson@palomine.net> tested this fix in -stable.


73399 04-Mar-2001 truckman

Disable interface checking when IP forwarding is engaged so that packets
addressed to the interface on the other side of the box follow their
historical path.

Explicitly block packets sent to the loopback network sent from the outside,
which is consistent with the behavior of the forwarding path between
interfaces as implemented in in_canforward().

Always check the arrival interface when matching the packet destination
against the interface broadcast addresses. This bug allowed TCP
connections to be made to the broadcast address of an interface on the
far side of the system because the M_BCAST flag was not set because the
packet was unicast to the interface on the near side. This was broken
when the directed broadcast code was removed from revision 1.32. If
the directed broadcast code was still present, the destination would not
have been recognized as local until the packet was forwarded to the output
interface and ether_output() looped a copy back to ip_input() with
M_BCAST set and the receive interface set to the output interface.

Optimize the order of the tests.

Reviewed by: jlemon


73357 02-Mar-2001 jlemon

Add a new sysctl net.inet.ip.check_interface, which will verify that
an incoming packet arrives on an interface that has an address matching
the packet's address. This is turned on by default.


73217 28-Feb-2001 phk

Fix jails.


73172 27-Feb-2001 jlemon

When iterating over our list of interface addresses in order to determine
if an arriving packet belongs to us, also check that the packet arrived
through the correct interface. Skip this check if the packet was locally
generated.


73142 27-Feb-2001 billf

The TCP header-specific section suffered a little bit of bitrot recently:

When we receive a fragmented TCP packet (other than the first) we can't
extract header information (we don't have state to reference). In a rather
inelegant fashion we just move on and assume a non-match.

Recent additions to the TCP header-specific section of the code neglected
to add the logic to the fragment code so in those cases the match was
assumed to be positive and those parts of the rule (which should have
resulted in a non-match/continue) were instead skipped (which means
the processing of the rule continued even though it had already not
matched).

Fault can be spread out over Rich Steenbergen (tcpoptions) and myself
(tcp{seq,ack,win}).

rwatson sent me a patch that got me thinking about this whole situation
(but what I'm committing / this description is mine so don't blame him).


73110 26-Feb-2001 jlemon

Use more aggressive retransmit timeouts for the initial SYN packet.
As we currently drop the connection after 4 retransmits + 2 ICMP errors,
this allows initial connection attempts to be dropped much faster.


73109 26-Feb-2001 jlemon

Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly.

For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.

Clean up some layering violations (access to tcp internals from in_pcb)
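
A sketch of a wraparound-safe sequence number range test, the kind of
comparison a window check like the one described above relies on. The helper
`seq_in_range()` is hypothetical, and the bounds the kernel actually compares
against are not stated in this log entry, so they are left as parameters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Modular comparisons on 32-bit TCP sequence numbers. */
#define SEQ_LT(a, b)  ((int32_t)((a) - (b)) < 0)
#define SEQ_GEQ(a, b) ((int32_t)((a) - (b)) >= 0)

/* True if seq lies in [lo, hi), with sequence-space wraparound handled. */
bool
seq_in_range(uint32_t seq, uint32_t lo, uint32_t hi)
{
    return SEQ_GEQ(seq, lo) && SEQ_LT(seq, hi);
}
```

An ICMP message whose echoed sequence number fails such a test would simply
be ignored.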


73103 26-Feb-2001 asmodai

Remove struct full_tcpiphdr{}.

This piece of code has not been referenced since it was put there
in 1995. Also did a code-based search of popular networking libraries
and third-party applications. This is an orphan.

Reviewed by: jesper


73102 26-Feb-2001 asmodai

Remove conditionals for vax support.
People who care much about this are welcome to try 2.11BSD. :)

Noticed by: luigi
Reviewed by: jesper


73036 25-Feb-2001 jesper

Remove tcp_drop_all_states, which is unneeded after jlemon removed it
from tcp_subr.c in rev 1.92


73031 25-Feb-2001 jlemon

Do not delay a new ack if there already is a delayed ack pending on the
connection, but send it immediately. Prior to this change, it was possible
to delay a delayed-ack multiple times, resulting in degraded TCP
behavior in certain corner cases.
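
The policy described above can be sketched as a two-state decision. This is
an illustrative model only: `struct conn`, `enum ack_action` and
`ack_policy()` are hypothetical names, not the kernel's code.

```c
#include <stdbool.h>

/* Hypothetical per-connection state: is a delayed ACK already queued? */
struct conn {
    bool delack_pending;
};

enum ack_action { ACK_DELAY, ACK_NOW };

/*
 * Decide what to do when new data needs acknowledging. A second
 * delayable ACK never extends the delay; it forces an immediate ACK.
 */
enum ack_action
ack_policy(struct conn *c)
{
    if (c->delack_pending) {
        c->delack_pending = false;  /* the pending ACK goes out now */
        return ACK_NOW;
    }
    c->delack_pending = true;       /* start the delayed-ACK timer */
    return ACK_DELAY;
}
```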


72960 23-Feb-2001 jlemon

When converting soft error into a hard error, drop the connection. The
error will be passed up to the user, who will close the connection, so
it does not appear to make sense to leave the connection open.

This also fixes a bug with kqueue, where the filter does not set EOF
on the connection, because the connection is still open.

Also remove calls to so{rw}wakeup, as we aren't doing anything with
them at the moment anyway.

Reviewed by: alfred, jesper


72959 23-Feb-2001 jlemon

Allow ICMP unreachables which map into PRC_UNREACH_ADMIN_PROHIB to
reset TCP connections which are in the SYN_SENT state, if the sequence
number in the echoed ICMP reply is correct. This behavior can be
controlled by the sysctl net.inet.tcp.icmp_may_rst.

Currently, only subtypes 2,3,10,11,12 are treated as such
(port, protocol and administrative unreachables).

Associate an error code with these resets which is reported to the
user application: ENETRESET.

Disallow resetting TCP sessions which are not in a SYN_SENT state.

Reviewed by: jesper, -net


72922 22-Feb-2001 jesper

Redo the security update done in rev 1.54 of src/sys/netinet/tcp_subr.c
and 1.84 of src/sys/netinet/udp_usrreq.c

The changes broken down:

- remove 0 as a wildcard for addresses and port numbers in
src/sys/netinet/in_pcb.c:in_pcbnotify()
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
all sessions with the specific remote address.
- change
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
to use in_pcbnotifyall() to notify multiple sessions, instead of
using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
udp_ctlinput() and tcp_ctlinput(), so they will call
in_pcbnotifyall() to notify all sessions with the specific
remote address.

Approved by: jlemon
Inspired by: NetBSD


72803 21-Feb-2001 jesper

Backout change in 1.153, as it violates rfc1122 section 3.2.1.3.

Requested by: jlemon,ru


72786 21-Feb-2001 rwatson

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritance to be a function of credential inheritance.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


72778 20-Feb-2001 jesper

Only call in_pcbnotify if the src port number != 0, as we
treat 0 as a wildcard in src/sys/netinet/in_pcb.c:in_pcbnotify()

It's sufficient to check for src|local port, as we'll have no
sessions with src|local port == 0

Without this an attacker sending ICMP messages, where the attached
IP header (+ 8 bytes) has the address and port numbers == 0, would
have the ICMP message applied to all sessions.

PR: kern/25195
Submitted by: originally by jesper, reimplemented on jlemon's advice
Reviewed by: jlemon
Approved by: jlemon


72775 20-Feb-2001 jesper

Send an ICMP unreachable instead of silently dropping the packet, if we
receive a packet not for us and forwarding is disabled.

PR: kern/24512
Reviewed by: jlemon
Approved by: jlemon


72774 20-Feb-2001 jesper

Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify

Forgotten by phk, when committing fix in kern/23986

PR: kern/23986
Reviewed by: phk
Approved by: phk


72650 18-Feb-2001 green

Switch to using a struct xucred instead of a struct ucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde


72638 18-Feb-2001 phk

Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify

Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h

Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input

In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB
or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG

Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst
to reflect the fact that we also react on ICMP unreachables that
are not administrative prohibited. Also update the comments to
reflect this.

In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat
PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different.

PR: 23986
Submitted by: Jesper Skriver <jesper@skriver.dk>


72631 18-Feb-2001 luigi

remove unused data structure definition, and corresponding macro into*()


72526 15-Feb-2001 jlemon

Clean up warning.


72486 14-Feb-2001 asmodai

Add definitions for IPPROTO numbers 55-57.


72440 13-Feb-2001 phk

Introduce a new feature in IPFW: Check if the source or destination
address is configured on an interface. This is useful for routers with
dynamic interfaces. It is now possible to say:

0100 allow tcp from any to any established
0200 skipto 1000 tcp from any to any
0300 allow ip from any to any
1000 allow tcp from 1.2.3.4 to me 22
1010 deny tcp from any to me 22
1020 allow tcp from any to any

and not have to worry about the behaviour if dynamic interfaces configure
new IP numbers later on.

The check is semi expensive (traverses the interface address list)
so it should be protected as in the above example if high performance
is a requirement.


72357 11-Feb-2001 bmilekic

Clean up RST ratelimiting. Previously, ratelimiting occurred before tests
were performed to determine if the received packet should be reset. This
created erroneous ratelimiting and false alarms in some cases. The code
has now been reorganized so that the checks for validity come before
the call to badport_bandlim. Additionally, a few changes in the symbolic
names of the bandlim types have been made, as well as a clarification of
exactly which type each RST case falls under.

Submitted by: Mike Silbersack <silby@silby.com>


72270 10-Feb-2001 luigi

Sync with the bridge/dummynet/ipfw code already tested in stable.

In ip_fw.[ch] change a couple of variable and field names to
avoid having types, variables and fields with the same name.


72091 06-Feb-2001 asmodai

Fix typo: seperate -> separate.

Seperate does not exist in the english language.


72084 06-Feb-2001 phk

Convert if_multiaddrs from LIST to TAILQ so that it can be traversed
backwards in the three drivers which want to do that.

Reviewed by: mikeh


72056 05-Feb-2001 julian

Fix bad patch from a few days ago. It broke some bridging.


72012 04-Feb-2001 phk

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


72010 04-Feb-2001 darrenr

fix duplicate rcsid


72006 04-Feb-2001 darrenr

fix conflicts


71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


71998 04-Feb-2001 phk

Use <sys/queue.h> macro API.


71963 03-Feb-2001 julian

Make the code act the same in the case of BRIDGE being defined, but not
turned on, and the case of it not being defined at all.
i.e. Disabling bridging re-enables some of the checks it disables.

Submitted by: "Rogier R. Mulhuijzen" <drwilco@drwilco.net>


71937 02-Feb-2001 jlemon

When turning off TCP_NOPUSH, call tcp_output to immediately flush
out any data pending in the buffer.

Submitted by: Tony Finch <dot@dotat.at>


71909 02-Feb-2001 luigi

MFS: bridge/ipfw/dummynet fixes (bridge.c will be committed separately)


71796 29-Jan-2001 brian

Add a few ``const''s to silence some -Wwrite-strings warnings


71763 29-Jan-2001 brian

Ignore leading whitespace in the string given to PacketAliasProxyRule().


71700 27-Jan-2001 luigi

Make sure we do not follow an invalid pointer in ipfw_report
when we get an incomplete packet or m_pullup fails.


71686 26-Jan-2001 luigi

Minor cleanups after yesterday's patch.
The code (bridging and dummynet) actually worked fine!


71667 26-Jan-2001 luigi

Bring dummynet in line with the code that now works in -STABLE.
It compiles, but I cannot test functionality yet.


71613 25-Jan-2001 luigi

Pass up errors returned by dummynet. The same should be done with
divert.


71594 24-Jan-2001 wollman

Correct a comment.


71415 23-Jan-2001 wes

When attempting to bind to an ephemeral port, if no such port is
available, the error return should be EADDRNOTAVAIL rather than
EAGAIN.

PR: 14181
Submitted by: Dima Dorfman <dima@unixfreak.org>
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


71395 22-Jan-2001 luigi

Change critical section protection for dummynet from splnet() to
splimp() -- we need it because dummynet can be invoked by the
bridging code at splimp().

This should cure the pipe "stalls" that several people have been
reporting on -stable while using bridging+dummynet (the problem
would not affect routers using dummynet).


71350 21-Jan-2001 des

First step towards an MP-safe zone allocator:
- have zalloc() and zfree() always lock the vm_zone.
- remove zalloci() and zfreei(), which are now redundant.

Reviewed by: bmilekic, jasone


71137 17-Jan-2001 luigi

Document data structures and operation on dummynet so next time
I or someone else browse through this code I do not have a hard
time understanding what is going on.


71133 16-Jan-2001 luigi

Some dummynet patches that I forgot to commit last summer.
One of them fixes a potential panic when bridging is used and
you run out of mbufs (though i have no idea if the bug has
ever hit anyone).


70951 12-Jan-2001 bmilekic

Prototype inet_ntoa_r and thereby silence a warning from GCC. The function
is prototyped immediately under inet_ntoa, which is also from libkern.


70854 09-Jan-2001 rwatson

o Minor style(9)ism to make consistent with -STABLE


70826 09-Jan-2001 rwatson

o IPFW incorrectly handled filtering in the presence of previously
reserved and now allocated TCP flags in incoming packets. This patch
stops overloading those bits in the IP firewall rules, and moves
colliding flags to a separate field, ipflg. The IPFW userland
management tool, ipfw(8), is updated to reflect this change. New TCP
flags related to ECN are now included in tcp.h for reference, although
we don't currently implement TCP+ECN.

o To use this fix without completely rebuilding, it is sufficient to copy
ip_fw.h and tcp.h into your appropriate include directory, then rebuild
the ipfw kernel module, and ipfw tool, and install both. Note that a
mismatch between module and userland tool will result in incorrect
installation of firewall rules that may have unexpected effects. This
is an MFC candidate, following shakedown. This bug does not appear
to affect ipfilter.

Reviewed by: security-officer, billf
Reported by: Aragon Gouveia <aragon@phat.za.net>


70699 06-Jan-2001 alfred

provide a sysctl 'net.link.ether.inet.log_arp_wrong_iface' to allow one
to suppress logging when ARP replies arrive on the wrong interface:
"/kernel: arp: 1.2.3.4 is on dc0 but got reply from 00:00:c5:79:d0:0c on dc1"

the default is to log just to give notice about possibly incorrectly
configured networks.


70643 03-Jan-2001 alfred

Fix incorrect logic that wouldn't disconnect incoming connections that had been
disconnected because they were not full.

Submitted by: David Filo


70391 27-Dec-2000 assar

include tcp header files to get the prototype for tcp_seq_vs_sess


70330 24-Dec-2000 phk

Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and
to be configurable with respect to acting only in SYN or in all TCP states.

PR: 23665
Submitted by: Jesper Skriver <jesper@skriver.dk>


70254 21-Dec-2000 bmilekic

* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT.
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.

* Fix a typo in a comment in mbuf.h

* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have become a big problem.


70105 16-Dec-2000 billf

Use getmicrotime() instead of microtime() when timestamping ICMP packets,
the former is quicker and accurate enough for use here.

Submitted by: Jason Slagle <raistlin@toledolink.com> (on IRC)
Reviewed by: phk


70103 16-Dec-2000 phk

We currently do not react to ICMP administratively prohibited
messages sent by routers when they deny our traffic. This causes
a timeout when trying to connect to TCP ports/services on a remote
host which is blocked by routers or firewalls.

rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requires that we treat such a message for a TCP session as if we
had received a RST.

quote begin.

A Destination Unreachable message that is received MUST be
reported to the transport layer. The transport layer SHOULD
use the information appropriately; for example, see Sections
4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol
that has its own mechanism for notifying the sender that a
port is unreachable (e.g., TCP, which sends RST segments)
MUST nevertheless accept an ICMP Port Unreachable for the
same purpose.

quote end.

I've written a small extension that implements this; it also creates
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.

When it's activated (set to 1) we'll treat an ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
session as if we received a TCP RST, but only if the TCP session
is in SYN_SENT state.

The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.

I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually has the sysctl set to 0 by
default.

PR: 23086
Submitted by: Jesper Skriver <jesper@skriver.dk>


70070 15-Dec-2000 bmilekic

Change the following:

1. ICMP ECHO and TSTAMP replies are now rate limited.
2. RSTs generated due to packets sent to open and unopen ports
are now limited by separate counters.
3. Each rate limiting queue now has its own description, as
follows:

Limiting icmp unreach response from 439 to 200 packets per second
Limiting closed port RST response from 283 to 200 packets per second
Limiting open port RST response from 18724 to 200 packets per second
Limiting icmp ping response from 211 to 200 packets per second
Limiting icmp tstamp response from 394 to 200 packets per second

Submitted by: Mike Silbersack <silby@silby.com>
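
A minimal sketch of per-type rate limiting with one counter and one
description per queue, as the list above describes. It is an illustration
under stated assumptions, not the kernel's badport_bandlim(): the names
`bandlim_allow()`, `enum bandlim_type` and `struct bandlim` are hypothetical,
and the caller supplies the current second and the limit. The description
strings are taken from the messages quoted above.

```c
#include <stdio.h>
#include <time.h>

enum bandlim_type {
    BL_UNREACH, BL_RST_CLOSED, BL_RST_OPEN, BL_ECHO, BL_TSTAMP, BL_MAX
};

static const char *bl_desc[BL_MAX] = {
    "icmp unreach response", "closed port RST response",
    "open port RST response", "icmp ping response", "icmp tstamp response",
};

/* One independent counter per reply type. */
struct bandlim {
    time_t second;  /* the one-second window being counted */
    int    count;   /* replies requested in that window */
};
static struct bandlim bl_state[BL_MAX];

/* Return 1 if the reply may be sent, 0 if it must be dropped. */
int
bandlim_allow(enum bandlim_type which, time_t now, int limit)
{
    struct bandlim *b = &bl_state[which];

    if (now != b->second) {
        /* New window: report the previous one if it overflowed. */
        if (b->count > limit)
            printf("Limiting %s from %d to %d packets per second\n",
                bl_desc[which], b->count, limit);
        b->second = now;
        b->count = 0;
    }
    return ++b->count <= limit;
}
```

Because each type has its own counter, a flood that exhausts the ping
budget does not suppress timestamp replies or RSTs in the same second.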


69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


69774 08-Dec-2000 phk

Staticize some malloc M_ instances.


69152 25-Nov-2000 jlemon

Lock down the network interface queues. The queue mutex must be obtained
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.

IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.

Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.


69147 25-Nov-2000 jlemon

Revert the last commit to the callout interface, and add a flag to
callout_init() indicating whether the callout is safe or not. Update
the callers of callout_init() to reflect the new interface.

Okayed by: Jake


69099 23-Nov-2000 bmilekic

Fixup (hopefully) bridging + ipfw + dummynet together...

* Some dummynet code incorrectly handled a malloc()-allocated pseudo-mbuf
header structure, called "pkt," and could consequently pollute the mbuf
free list if it was ever passed to m_freem(). The fix involved passing not
pkt, but essentially pkt->m_next (which is a real mbuf) to the mbuf
utility routines.

* Also, for dummynet, in bdg_forward(), made the code copy the ethernet header
back into the mbuf (prepended) because the dummynet code that follows expects
it to be there but it is, unfortunately for dummynet, passed to bdg_forward
as a separate argument.

PRs: kern/19551 ; misc/21534 ; kern/23010
Submitted by: Thomas Moestl <tmoestl@gmx.net>
Reviewed by: bmilekic
Approved by: luigi


69025 22-Nov-2000 ru

mdoc(7) police: use the new feature of the An macro.


68619 11-Nov-2000 bmilekic

While I'm here, get rid of (now useless) MCLISREFERENCED and use MEXT_IS_REF
instead.
Also, fix a small set of "avail." If we're setting `avail,' we shouldn't
be re-checking whether m_flags is M_EXT, because we know that it is, as if
it wasn't, we would have already returned several lines above.

Reviewed by: jlemon


68431 07-Nov-2000 ru

Fixed the security breach I introduced in rev 1.145.
Disallow getsockopt(IP_FW_ADD) if securelevel >= 3.

PR: 22600


68318 04-Nov-2000 jlemon

tp->snd_recover is part of the New Reno recovery algorithm, and should
only be checked if the system is currently performing New Reno style
fast recovery. However, this value was being checked regardless of the
NR state, with the end result being that the congestion window was never
opened.

Change the logic to check t_dupack instead; the only code path that
allows it to be nonzero at this point is NewReno, so if it is nonzero,
we are in fast recovery mode and should not touch the congestion window.

Tested by: phk


68231 02-Nov-2000 ru

Fixed the bug I have introduced in icmp_error() in revision 1.44.
The amount of data we copy from the original IP datagram into the
ICMP message was computed incorrectly for IP packets with payload
less than 8 bytes.
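
The corrected length computation can be sketched as follows, assuming the
usual rule that an ICMP error quotes the original IP header plus at most 8
bytes of its payload. The function `icmp_quote_len()` is a hypothetical
helper, and it assumes ip_len is not smaller than hlen.

```c
#include <stdint.h>

/*
 * Bytes of the offending datagram to quote in an ICMP error:
 * the IP header plus at most 8 bytes of its payload.
 * hlen is the IP header length, ip_len the total datagram length.
 */
uint16_t
icmp_quote_len(uint16_t hlen, uint16_t ip_len)
{
    uint16_t payload = (uint16_t)(ip_len - hlen);

    return (uint16_t)(hlen + (payload < 8 ? payload : 8));
}
```

A 24-byte datagram with a 20-byte header yields 24 quoted bytes rather than
28, which is the short-payload case the entry above refers to.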


68179 01-Nov-2000 ru

Wrong checksum may have been computed for certain UDP packets.

Reviewed by: jlemon


68169 01-Nov-2000 ru

Wrong checksum used for certain reassembled IP packets before diverting.


68150 01-Nov-2000 joe

It's no longer true that "nobody uses ia beyond here"; it's now
used to keep address based if_data statistics in.

Submitted by: ru


68056 31-Oct-2000 ru

Do not waste time saving a copy of IP header if we are certainly
not going to send an ICMP error message (net.inet.udp.blackhole=1).


67980 30-Oct-2000 ru

Added boolean argument to link searching functions, indicating
whether they should create a link if lookup has failed or not.


67966 30-Oct-2000 ru

A significant rewrite of PPTP aliasing code.

PPTP links are no longer dropped by simple (and inappropriate in this
case) "inactivity timeout" procedure, only when requested through the
control connection.

It is now possible to have multiple PPTP servers running behind NAT.
Just redirect the incoming TCP traffic to port 1723, everything else
is done transparently.

Problems were reported and the fix was tested by:
Michael Adler <Michael.Adler@compaq.com>,
David Andersen <dga@lcs.mit.edu>


67893 29-Oct-2000 phk

Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.


67882 29-Oct-2000 phk

Remove unneeded #include <sys/proc.h> lines.


67853 29-Oct-2000 darrenr

Fix conflicts created by import.


67833 29-Oct-2000 joe

Count per-address statistics for IP fragments.

Requested by: ru
Obtained from: BSD/OS


67711 27-Oct-2000 obrien

Include sys/param.h for `__FreeBSD_version' rather than the non-existent
osreldate.h.

Submitted by: dougb


67708 27-Oct-2000 phk

Convert all users of fldoff() to offsetof(). fldoff() is bad
because it only takes a struct tag which makes it impossible to
use unions, typedefs etc.

Define __offsetof() in <machine/ansi.h>

Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>

Remove myriad of local offsetof() definitions.

Remove includes of <stddef.h> in kernel code.

NB: Kernelcode should *never* include from /usr/include !

Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.

Deprecate <struct.h> with a warning. The warning turns into an error on
01-12-2000 and the file gets removed entirely on 01-01-2001.

Partial reviews by: various.
Significant brucifications by: bde
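
A small illustration of why offsetof() is the more general tool: it accepts
any type name, so it works on typedef'd structures without a tag and on
unions. The types `endpoint_t` and `word_t` and the accessor functions are
hypothetical examples.

```c
#include <stddef.h>
#include <stdint.h>

/* A typedef'd structure with no struct tag to name. */
typedef struct {
    uint32_t kind;
    uint32_t addr;
} endpoint_t;

/* A typedef'd union. */
typedef union {
    uint32_t word;
    uint8_t  bytes[4];
} word_t;

/* offsetof() takes a type name, so both typedefs can be used directly. */
size_t endpoint_kind_offset(void) { return offsetof(endpoint_t, kind); }
size_t endpoint_addr_offset(void) { return offsetof(endpoint_t, addr); }
size_t word_bytes_offset(void)    { return offsetof(word_t, bytes); }
```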


67692 27-Oct-2000 ru

Fetch the protocol header (TCP, UDP, ICMP) only from the first fragment
of IP datagram. This fixes the problem when firewall denied fragmented
packets whose last fragment was less than minimum protocol header size.

Found by: Harti Brandt <brandt@fokus.gmd.de>
PR: kern/22309


67620 26-Oct-2000 ru

RFC 791 says that IP_RF bit should always be zero, but nothing
in the code enforces this. So, do not check for and attempt a
false reassembly if only IP_RF is set.

Also, removed the dead code, since we no longer use dtom() on
return from ip_reass().
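
A sketch of a fragment test that ignores the reserved bit. The constants
mirror the conventional IPv4 flag layout of the ip_off field; the helper
`ip_needs_reassembly()` is hypothetical, and whether the kernel uses exactly
this mask is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag and offset bits of the IPv4 ip_off field (host byte order). */
#define IP_RF      0x8000  /* reserved, must be zero per RFC 791 */
#define IP_DF      0x4000  /* don't fragment */
#define IP_MF      0x2000  /* more fragments */
#define IP_OFFMASK 0x1fff  /* fragment offset, in 8-byte units */

/* A packet is a fragment only if MF is set or the offset is nonzero. */
bool
ip_needs_reassembly(uint16_t ip_off)
{
    return (ip_off & (IP_MF | IP_OFFMASK)) != 0;
}
```

A packet with only IP_RF (or only IP_DF) set is therefore passed through
without entering the reassembly path.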


67614 26-Oct-2000 darrenr

fix conflicts from rcsids


67609 26-Oct-2000 ru

Wrong header length used for certain reassembled IP packets.
This was first fixed in rev 1.82 but then broken in rev 1.125.

PR: 6177


67596 26-Oct-2000 luigi

Close PR22152 and PR19511 -- correct the naming of a variable


67564 25-Oct-2000 ru

We now keep the ip_id field in network byte order all the
time, so there is no need to make the distinction between
ip_output() and ip_input() cases.

Reviewed by: silence on freebsd-net


67456 23-Oct-2000 itojun

be careful on mbuf overrun on ctlinput.
short icmp6 packet may be able to panic the kernel.
sync with kame.


67375 20-Oct-2000 ru

Save a few CPU cycles in IP fragmentation code.


67334 19-Oct-2000 joe

Augment the 'ifaddr' structure with a 'struct if_data' to keep
statistics on a per network address basis.

Teach the IPv4 and IPv6 input/output routines to log packets/bytes
against the network address connected to the flow.

Teach netstat to display the per-address stats for IP protocols
when 'netstat -i' is invoked, instead of displaying the per-interface
stats.


67316 19-Oct-2000 ru

A failure to allocate memory for auxiliary TCP data is now fatal.
This fixes a null pointer dereference problem that is unlikely to
happen in normal circumstances.


67287 18-Oct-2000 ru

If we do not byte-swap the ip_id in the first place, don't do it in
the second. NetBSD (from where I've taken this originally) needs
to fix this too.


67026 12-Oct-2000 ru

Backout my wrong attempt to fix the compilation warning in ip_input.c
and instead reapply the revision 1.49 of mbuf.h, i.e.

Fixed regression of the type of the `header' member of struct pkthdr from
`void *' to caddr_t in rev.1.51. This mainly caused an annoying warning
for compiling ip_input.c.

Requested by: bde


67009 12-Oct-2000 ru

Fix the compilation warning.


67003 12-Oct-2000 ru

Allow for IP_FW_ADD to be used in getsockopt(2) incarnation as
well, in which case return the rule number back into userland.

PR: bin/18351
Reviewed by: archie, luigi


66798 07-Oct-2000 alfred

Remove headers not needed.

Pointed out by: phk


66744 06-Oct-2000 ru

As we now may check the TCP header window field, make sure we pullup
enough into the mbuf data area. Solve this problem once and for all
by pulling up the entire (standard) header for TCP and UDP, and four
bytes of header for ICMP (enough for type, code and cksum fields).


66582 03-Oct-2000 ru

Added the missing ntohs() conversion when matching IP packet with
the IP_FW_IF_IPID rule. (We have recently decided to keep the
ip_id field in network byte order inside the kernel, see revision
1.140 of src/sys/netinet/ip_input.c).

I did not like to have the conversion happen in userland, and I
think that the similar conversions for fw_tcp(seq|ack|win) should
be moved out of userland (src/sbin/ipfw/ipfw.c) into the kernel.
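
The missing conversion can be sketched like this. The helper
`ipid_matches()` is hypothetical; it assumes the header field is held in
network byte order and the rule value in host byte order, as the entry above
describes.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * ip_id_net is the header field as stored (network byte order);
 * rule_id is the value from the rule (host byte order).
 */
bool
ipid_matches(uint16_t ip_id_net, uint16_t rule_id)
{
    return ntohs(ip_id_net) == rule_id;
}
```

Comparing the raw field against the rule value would only work by accident
on big-endian machines, where ntohs() is a no-op.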


66552 02-Oct-2000 jlemon

If TCPDEBUG is defined, we could dereference a tp which was freed.


66545 02-Oct-2000 ru

A bit of indentation reformatting.


66523 02-Oct-2000 billf

Add new fields for more granularity:
IP: version, tos, ttl, len, id
TCP: seq#, ack#, window size

Reviewed by: silence on freebsd-{net,ipfw}


66521 02-Oct-2000 billf

Add new fields for more granularity:
IP: version, tos, ttl, len, id
TCP: seq#, ack#, window size

Reviewed by: silence on freebsd-{net,ipfw}


66445 29-Sep-2000 ru

Document that net.inet.ip.fw.one_pass only affects dummynet(4).

Noticed by: Peter Jeremy<peter.jeremy@alcatel.com.au>


66433 29-Sep-2000 kris

Use stronger random number generation for TCP_ISSINCR and tcp_iss.

Reviewed by: peter, jlemon


66376 25-Sep-2000 bmilekic

Finally make do_tcpdrain sysctl live under correct parent, _net_inet_tcp,
as opposed to _debug. Like before, default value remains 1.


66157 21-Sep-2000 ru

Fixed the calculations with UDP header length field.
The field is in network byte order and contains the
size of the header.

Reviewed by: brian


65986 17-Sep-2000 kjc

change the evaluation order of the rsvp socket in rsvp_input()
in favor of the new-style per-vif socket.

this does not affect the behavior of the ISI rsvpd but allows
another rsvp implementation (e.g., KOM rsvp) to take advantage
of the new style for particular sockets while using the old style
for others.

in the future, rsvp support should be replaced by more generic
router-alert support.

PR: kern/20984
Submitted by: Martin Karsten <Martin.Karsten@KOM.tu-darmstadt.de>
Reviewed by: kjc


65985 17-Sep-2000 phk

Properly jail UDP sockets. This is quite a bit more tricky than TCP.

This fixes a !root userland panic, and some cases where the wrong
interface was chosen for a jailed UDP socket.

PR: 20167, 19839, 20946


65984 17-Sep-2000 phk

Reverse last commit, a better fix has been found.


65976 17-Sep-2000 phk

Make sure UDP sockets are explicitly bind(2)'ed [sic] before we connect(2)
them.

PR: 20946
Isolated by: Aaron Gifford <agifford@infowest.com>


65906 16-Sep-2000 jlemon

It is possible for a TCP callout to be removed from the timing wheel,
but have a network interrupt arrive and deactivate the timeout before
the callout routine runs. Check for this case in the callout routine;
it should only run if the callout is active and not on the wheel.
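
The guard described above can be sketched with two state bits. The flag
names `CO_ACTIVE` and `CO_PENDING`, `struct co` and `callout_should_run()`
are hypothetical stand-ins for illustration, not the kernel's callout API.

```c
#include <stdbool.h>

#define CO_ACTIVE  0x1  /* armed and not deactivated (hypothetical flag) */
#define CO_PENDING 0x2  /* currently queued on the timing wheel */

struct co {
    int flags;
};

/*
 * The handler body should run only if the callout is still active and
 * is no longer on the wheel. If it was deactivated, or rescheduled after
 * being dequeued, the handler returns without doing anything.
 */
bool
callout_should_run(const struct co *c)
{
    return (c->flags & CO_ACTIVE) != 0 && (c->flags & CO_PENDING) == 0;
}
```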


65892 15-Sep-2000 ru

Add -Wmissing-prototypes.


65859 14-Sep-2000 jlemon

m_cat() can free its second argument, so collect the checksum information
from the fragment before calling m_cat().


65837 14-Sep-2000 ru

Follow BSD/OS and NetBSD, keep the ip_id field in network order all the time.

Requested by: wollman


65765 12-Sep-2000 billf

Fix screwup in previous commit.


65751 11-Sep-2000 archie

Don't do snd_nxt rollback optimization (rev. 1.46) for SYN packets.
It causes a panic when/if snd_una is incremented elsewhere (this
is a conservative change, because originally no rollback occurred
for any packets at all).

Submitted by: Vivek Sadananda Pai <vivek@imimic.com>


65643 09-Sep-2000 alfred

Forgot to include sysctl.h

Submitted by: des


65534 06-Sep-2000 alfred

Accept filter maintenance

Update copyrights.

Introduce a new sysctl node:
net.inet.accf

Although acceptfilters need refcounting to be properly (safely) unloaded
as a temporary hack allow them to be unloaded if the sysctl
net.inet.accf.unloadable is set, this is really for developers who want
to work on their own filters.

A near complete re-write of the accf_http filter:
1) Check whether the request is HTTP/1.0 or HTTP/1.1; if not, dump it
to the application.
Because of the performance implications of this there is a sysctl
'net.inet.accf.http.parsehttpversion' that when set to non-zero
parses the HTTP version.
The default is to parse the version.
2) Check if a socket has filled and dump to the listener
3) optimize the way that mbuf boundaries are handled using some voodoo
4) even though you'd expect accept filters to only be used on TCP
connections that don't use m_nextpkt I've fixed the accept filter
for socket connections that use this.

This rewrite of accf_http should allow someone to use them and maintain
full HTTP compliance as long as net.inet.accf.http.parsehttpversion is
set.


65504 06-Sep-2000 billf

1. IP_FW_F_{UID,GID} are _not_ commands, they are extras. The sanity checking
for them does not belong in the IP_FW_F_COMMAND switch, that mask doesn't even
apply to them(!).

2. You cannot add a uid/gid rule to something that isn't TCP, UDP, or IP.

XXX - this should be handled in ipfw(8) as well (for more diagnostic output),
but this at least protects bogus rules from being added.

Pointy hat: green


65332 01-Sep-2000 ru

Match IPPROTO_ICMP with IP protocol field of the original IP
datagram embedded into ICMP error message, not with protocol
field of ICMP message itself (which is always IPPROTO_ICMP).

Pointed by: Erik Salander <erik@whistle.com>


65327 01-Sep-2000 ru

Fixed broken ICMP error generation, unified conversion of IP header
fields between host and network byte order. The details:

o icmp_error() now does not add IP header length. This fixes the problem
when icmp_error() is called from ip_forward(). In this case the ip_len
of the original IP datagram returned with ICMP error was wrong.

o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host
byte order, so DTRT and convert these fields back to network byte order
before sending a message. This fixes the problem described in PR 16240
and PR 20877 (ip_id field was returned in host byte order).

o ip_ttl decrement operation in ip_forward() was moved down to make sure
that it does not corrupt the copy of original IP datagram passed later
to icmp_error().

o A copy of original IP datagram in ip_forward() was made a read-write,
independent copy. This fixes the problem I first reported to Garrett
Wollman and Bill Fenner and later put in audit trail of PR 16240:
ip_output() (not always) converts fields of original datagram to network
byte order, but because copy (mcopy) and its original (m) most likely
share the same mbuf cluster, ip_output()'s manipulations on original
also corrupted the copy.

o ip_output() now expects all three fields, ip_len, ip_off and (what is
significant) ip_id in host byte order. It was a headache for years that
ip_id was handled differently. The only compatibility issue here is the
raw IP socket interface with IP_HDRINCL socket option set and a non-zero
ip_id field, but ip.4 manual page was unclear on whether in this case
ip_id field should be in host or network byte order.


65317 01-Sep-2000 ru

Changed the way we handle outgoing ICMP error messages -- do
not alias `ip_src' unless it comes from the host an original
datagram that triggered this error message was destined for.

PR: 20712
Reviewed by: brian, Charles Mott <cmott@scientech.com>


65281 31-Aug-2000 ru

Grab ADJUST_CHECKSUM() macro from alias_local.h.


65280 31-Aug-2000 ru

Create aliasing links for incoming ICMP echo/timestamp requests.
This makes outgoing ICMP echo/timestamp replies to be de-aliased
with the right source IP, not exactly the primary aliasing IP.


65260 30-Aug-2000 ru

Fixed the bug that div_bind() always returned zero
even if there was an error (broken in rev 1.9).


65248 30-Aug-2000 ru

Backout the hack in rev 1.71, I am working on a better patch
that should cover almost all inconsistencies in ICMP error
generation.


65221 29-Aug-2000 ache

strtok -> strsep (no strtok allowed in libraries)
add unsigned char cast to ctype macro


65197 29-Aug-2000 darrenr

Apply appropriate patch.

PR: 20877
Submitted by: Frank Volf (volf@oasis.IAEhv.nl)


64902 22-Aug-2000 archie

Remove obsolete comment.


64853 19-Aug-2000 bde

Fixed a missing splx() in if_addmulti(). Was broken in rev.1.28.


64658 15-Aug-2000 itojun

repair endianness issue in IN_MULTICAST().
again, *BSD difference...

From: Nick Sayer <nsayer@quack.kfu.com>
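
The commit does not say where the byte-order mismatch was, so the following is only a generic sketch of the pitfall: a class D (multicast) test on the top four bits gives the right answer only if the address is in the byte order the test expects. Both function names are hypothetical and this is not the IN_MULTICAST() macro itself.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Class D test on a host-byte-order address: top four bits are 1110. */
int is_multicast_host(uint32_t addr_host)
{
    return (addr_host & 0xf0000000u) == 0xe0000000u;
}

/*
 * Same test for an address still in network byte order: convert first,
 * otherwise a little-endian machine would inspect the wrong byte.
 */
int is_multicast_net(uint32_t addr_net)
{
    return is_multicast_host(ntohl(addr_net));
}
```

Keeping the conversion in one wrapper makes it harder to hand the mask test an address in the wrong order.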


64644 14-Aug-2000 ru

Fixed PunchFW code segmentation violation bug.

Reported by: Christian Schade <chris@cube.sax.de>


64643 14-Aug-2000 ru

Use queue(3) LIST_* macros for doubly-linked lists.


64580 13-Aug-2000 darrenr

resolve conflicts


64452 09-Aug-2000 ru

- Do not modify Peer's Call ID in outgoing Incoming-Call-Connected
PPTP control messages.

- Cosmetics: replace `GRE link' with `PPTP link'.

Reviewed by: Erik Salander <erik@whistle.com>


64334 07-Aug-2000 ru

Adjust TCP checksum rather than compute it afresh.

Submitted by: Erik Salander <erik@whistle.com>
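
Adjusting a checksum instead of recomputing it can be sketched with the incremental update formula from RFC 1624, HC' = ~(~HC + ~m + m'). This is an illustration of the general technique, not the ADJUST_CHECKSUM() macro or the code changed by this commit; both function names are hypothetical and all words are treated as plain 16-bit values.

```c
#include <stdint.h>
#include <stddef.h>

/* Full Internet checksum over n 16-bit words (one's complement sum). */
uint16_t cksum_full(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += w[i];
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Incremental update when one word changes from old_word to new_word. */
uint16_t cksum_adjust(uint16_t hc, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~hc;

    sum += (uint16_t)~old_word;
    sum += new_word;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The adjusted value matches a full recomputation while touching only the changed word, which is the point when NAT rewrites a single address or port.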


64213 03-Aug-2000 archie

Improve performance in the case where ip_output() returns an error.
When this happens, we know for sure that the packet data was not
received by the peer. Therefore, back out any advancing of the
transmit sequence number so that we send the same data the next
time we transmit a packet, avoiding a guaranteed missed packet and
its resulting TCP transmit slowdown.

In most systems ip_output() probably never returns an error, and
so this problem is never seen. However, it is more likely to occur
with device drivers having short output queues (causing ENOBUFS to
be returned when they are full), not to mention low memory situations.

Moreover, because of this problem writers of slow devices were
required to make an unfortunate choice between (a) having a relatively
short output queue (with low latency but low TCP bandwidth because
of this problem) or (b) a long output queue (with high latency and
high TCP bandwidth). In my particular application (ISDN) it took
an output queue equal to ~5 seconds of transmission to avoid ENOBUFS.
A more reasonable output queue of 0.5 seconds resulted in only about
50% TCP throughput. With this patch full throughput was restored in
the latter case.

Reviewed by: freebsd-net
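
The rollback idea can be sketched as follows. This is a simplified model under stated assumptions, not the tcp_output() change itself: `struct conn`, `send_segment()` and the two stand-in output routines are hypothetical, and only the sequence number bookkeeping is modeled.

```c
#include <stdint.h>
#include <errno.h>

/* Hypothetical connection state; only the send sequence number is kept. */
struct conn {
    uint32_t snd_nxt;
};

typedef int (*output_fn)(uint32_t seq, uint32_t len);

/* Advance snd_nxt optimistically, then undo the advance on failure. */
int send_segment(struct conn *c, uint32_t len, output_fn output)
{
    uint32_t start = c->snd_nxt;
    int error;

    c->snd_nxt += len;
    error = output(start, len);
    if (error != 0)
        c->snd_nxt = start;   /* data never left: send it again next time */
    return error;
}

/* Stand-ins for ip_output(): one succeeds, one reports a full queue. */
int output_ok(uint32_t seq, uint32_t len)
{
    (void)seq; (void)len;
    return 0;
}

int output_full(uint32_t seq, uint32_t len)
{
    (void)seq; (void)len;
    return ENOBUFS;
}
```

After a failed send the next call starts from the same sequence number, so the peer never sees a gap that would have to be repaired by a retransmit timeout.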


64192 03-Aug-2000 ru

Make netstat(1) aware of divert(4) sockets.


64105 01-Aug-2000 roberto

Change __FreeBSD_Version into the proper __FreeBSD_version.

Submitted by: Alain.Thivillon@hsc.fr (Alain Thivillon) (for ip_fil.c)


64078 01-Aug-2000 ache

Add missing '0' to FreeBSD_version test: 50011 -> 500011


64075 31-Jul-2000 ache

Nonexistent <sys/pfil.h> -> <net/pfil.h>
Kernel 'make depend' fails otherwise


64061 31-Jul-2000 sheldonh

Whitespace only:

Fix an overlong line and trailing whitespace that crept in, in the
previous commit.


64060 31-Jul-2000 darrenr

activate pfil_hooks and convert ipfilter to use it


63899 26-Jul-2000 archie

Add address translation support for RTSP/RTP used by RealPlayer and
Quicktime streaming media applications.

Add a BUGS section to the man page.

Submitted by: Erik Salander <erik@whistle.com>


63745 21-Jul-2000 jayanth

When a connection is being dropped due to a listen queue overflow,
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.

Reviewed by: Peter Wemm,asmodai


63523 19-Jul-2000 darrenr

fix conflicts


63431 18-Jul-2000 sheldonh

Fix a comment which was broken in rev 1.36.

PR: 19947
Submitted by: Tetsuya Isaki <isaki@net.ipc.hiroshima-u.ac.jp>


63330 17-Jul-2000 luigi

close PR 19544 - ipfw pipe delete causes panic when no pipes defined

PR: 19544


63080 13-Jul-2000 dwmalone

Extra sanity check when arp proxyall is enabled. Don't send an arp
reply if the requesting machine isn't on the interface we believe
it should be. Prevents arp wars when you plug cables in the wrong
way around.

PR: 9848
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
Not objected to by: wollman


63048 12-Jul-2000 jayanth

re-enable the tcp newreno code.


63024 12-Jul-2000 itojun

remove m_pulldown statistics, which is highly experimental and does not
belong to *bsd-merged tree


62846 09-Jul-2000 itojun

be more cautious about tcp option length field. drop bogus ones earlier.
not sure if there is a real threat or not, but it seems that there's
possibility for overrun/underrun (like non-NOP option with optlen > cnt).
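
The kind of length check described here can be sketched as a small option walker. It is a minimal illustration, assuming the usual kind/length option encoding; the function name and the OPT_* constants are hypothetical and this is not the code from tcp_input.c.

```c
#include <stdint.h>

#define OPT_EOL 0   /* end of option list */
#define OPT_NOP 1   /* single-byte padding option */

/* Return 0 if every option fits inside cnt bytes, -1 if one is bogus. */
int tcp_opts_valid(const uint8_t *cp, int cnt)
{
    while (cnt > 0) {
        int opt = cp[0];
        int optlen;

        if (opt == OPT_EOL)
            break;
        if (opt == OPT_NOP) {
            optlen = 1;
        } else {
            if (cnt < 2)                 /* no room for a length byte */
                return -1;
            optlen = cp[1];
            if (optlen < 2 || optlen > cnt)
                return -1;               /* would underrun or overrun */
        }
        cp += optlen;
        cnt -= optlen;
    }
    return 0;
}
```

Rejecting the whole option block as soon as one length is inconsistent means the parser never reads past the end of the buffer and never loops on a zero-length option.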


62587 04-Jul-2000 itojun

sync with kame tree as of july00. tons of bug fixes/improvements.

API changes:
- additional IPv6 ioctls
- IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8).
(also syntax change)


62573 04-Jul-2000 phk

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


62454 03-Jul-2000 phk

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grok our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


62159 27-Jun-2000 ru

Fixed PunchFWHole():
- ipfw always rejected rule with `neither in nor out' diagnostics.
- number of src/dst ports was not set properly.


61865 20-Jun-2000 ru

- Removed PacketAliasPptp() API function.
- SHLIB_MAJOR++.


61861 20-Jun-2000 ru

Added true support for PPTP aliasing. Some nice features include:

- Multiple PPTP clients behind NAT to the same or different servers.

- Single PPTP server behind NAT -- you just need to redirect TCP
port 1723 to a local machine. Multiple servers behind NAT is
possible but would require a simple API change.

- No API changes!

For more information on how this works see comments at the start of
the alias_pptp.c.

PacketAliasPptp() is no longer necessary and will be removed soon.

Submitted by: Erik Salander <erik@whistle.com>
Reviewed by: ru
Rewritten by: ru
Reviewed by: Erik Salander <erik@whistle.com>


61837 20-Jun-2000 alfred

return of the accept filter part II

accept filters are now loadable as well as able to be compiled into
the kernel.

two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)

Reviewed by: jmg


61735 16-Jun-2000 ru

- Improved passive mode FTP support by aliasing 229 replies.
- Stricter checking of PORT/EPRT/227/229 messages format.
- Moved all security checks into one place.


61677 14-Jun-2000 ru

- Added support for passive mode FTP by aliasing 227 replies.
It does mean that it is now possible to run passive-mode FTP
server behind NAT.

- SECURITY: FTP aliasing engine now ensures that:
o the segment preceding a PORT/227 segment terminates with a \r\n;
o the IP address in the PORT/227 matches the source IP address of
the packet;
o the port number in the PORT command or 227 reply is greater than
or equal to 1024.

Submitted by: Erik Salander <erik@whistle.com>
Reviewed by: ru
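
Two of the checks listed above (address must match the packet source, port must be at least 1024) can be sketched as below. This is a hedged illustration only: `ftp_port_ok()` is a hypothetical helper, it takes the argument string of a PORT command and the packet source address in host byte order, and it does not model the \r\n check on the preceding segment.

```c
#include <stdio.h>
#include <stdint.h>

/* Return 1 if "h1,h2,h3,h4,p1,p2" names pkt_src and a port >= 1024. */
int ftp_port_ok(const char *args, uint32_t pkt_src)
{
    unsigned h1, h2, h3, h4, p1, p2;
    uint32_t addr;
    unsigned port;

    if (sscanf(args, "%u,%u,%u,%u,%u,%u",
               &h1, &h2, &h3, &h4, &p1, &p2) != 6)
        return 0;                          /* malformed argument list */
    if (h1 > 255 || h2 > 255 || h3 > 255 || h4 > 255 ||
        p1 > 255 || p2 > 255)
        return 0;                          /* each field is one octet */

    addr = (h1 << 24) | (h2 << 16) | (h3 << 8) | h4;
    port = (p1 << 8) | p2;
    return addr == pkt_src && port >= 1024;
}
```

Refusing low ports and foreign addresses keeps a client from using the aliasing engine to open holes toward privileged services or third-party hosts.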


61657 14-Jun-2000 luigi

Fix behaviour of "ipfw pipe show" -- previous code gave
ambiguous data to the userland program (kernel operation was
safe, anyways).


61420 08-Jun-2000 dan

Add tcpoptions to ipfw. This works much in the same way as ipoptions do.
It also squashes 99% of packet kiddie synflood orgies. For example, to
rate syn packets without MSS,

ipfw pipe 10 config 56Kbit/s queue 10Packets
ipfw add pipe 10 tcp from any to any in setup tcpoptions !mss

Submitted by: Richard A. Steenbergen <ras@e-gerbil.net>


61413 08-Jun-2000 luigi

Implement WF2Q+ in dummynet.


61183 02-Jun-2000 jlemon

Add boundary checks against IP options.

Obtained from: OpenBSD


61179 02-Jun-2000 jlemon

When attempting to transmit a packet, if the system fails to allocate
a mbuf, it may return without setting any timers. If no more data is
scheduled to be transmitted (this was a FIN) the system will sit in
LAST_ACK state forever.

Thus, when mbuf allocation fails, set the retransmit timer if neither
the retransmit nor persist timer is already pending.

Problem discovered by: Mike Silbersack (silby@silby.com)
Pushed for a fix by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: jayanth


60944 26-May-2000 darrenr

define CSUM_DELAY_DATA to match merge


60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


60925 25-May-2000 darrenr

fix up #ifdef jungle for FreeBSD


60923 25-May-2000 darrenr

remove duplicate prototypes


60910 25-May-2000 jlemon

Mark the checksum as complete when looping back multicast packets.

Submitted by: Jeff Gibbons <jgibbons@n2.net>


60889 24-May-2000 archie

Just need to pass the address family to if_simloop(), not the whole sockaddr.


60883 24-May-2000 darrenr

fix duplicate rcsid's


60872 24-May-2000 bde

Fixed some style bugs (mainly convoluted logic for blackhole processing).


60865 24-May-2000 peter

It would have been nice if this actually compiled. Close the header
comment */.


60857 24-May-2000 darrenr

fix up conflicts


60855 24-May-2000 darrenr

fix conflicts


60854 24-May-2000 darrenr

fix conflicts


60853 24-May-2000 darrenr

fix conflicts


60852 24-May-2000 darrenr

fix conflicts


60851 24-May-2000 darrenr

fix conflicts


60850 24-May-2000 darrenr

fix conflicts


60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


60798 22-May-2000 dan

sysctl'ize ICMP_BANDLIM and ICMP_BANDLIM_SUPPRESS_OUTPUT.

Suggested by: des/nbm


60797 22-May-2000 dan

Add option ICMP_BANDLIM_SUPPRESS_OUTPUT to the mix. With this option,
badport_bandlim() will not muck up your console with printf() messages.


60765 21-May-2000 jlemon

Compute the checksum before handing the packet off to IPFilter.

Tested by: Cy Schubert <Cy.Schubert@uumail.gov.bc.ca>


60690 19-May-2000 peter

Return ECONNRESET instead of EINVAL if the connection has been shot
down as a result of a reset. Returning EINVAL in that case makes no
sense at all and just confuses people as to what happened. It could be
argued that we should save the original address somewhere so that
getsockname() etc can tell us what it used to be so we know where the
problem connection attempts are coming from.


60687 18-May-2000 jayanth

snd_cwnd was updated twice in the tcp_newreno function.


60662 17-May-2000 jayanth

Sigh, fix a rookie patch merge error.

Also-missed-by: peter


60661 17-May-2000 jlemon

Cast sizeof() calls to be of type (int) when they appear in a signed
integer expression. Otherwise the sizeof() call will force the expression
to be evaluated as unsigned, which is not the intended behavior.

Obtained from: NetBSD (in a different form)
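
The promotion problem described above can be reproduced in a few lines of standard C. This is a minimal sketch, not the kernel code touched by the commit; `struct hdr` and both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical 20-byte header, used only for this illustration. */
struct hdr {
    char bytes[20];
};

/*
 * Unsigned comparison: sizeof() yields size_t, so len is converted to an
 * unsigned type before comparing. The (size_t) cast spells out the
 * conversion the compiler would otherwise apply implicitly. A negative
 * len becomes a huge value and the "too short" test is false.
 */
int too_short_unsigned(int len)
{
    return (size_t)len < sizeof(struct hdr);
}

/* Signed comparison: the cast keeps the whole expression in int. */
int too_short_signed(int len)
{
    return len < (int)sizeof(struct hdr);
}
```

With a length of -1 the unsigned version reports that the data is long enough, while the signed version correctly reports it as too short.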


60619 16-May-2000 jayanth

snd_una was being updated incorrectly, this resulted in the newreno
code retransmitting data from the wrong offset.

As a footnote, the newreno code was partially derived from NetBSD
and Tom Henderson <tomh@cs.berkeley.edu>


60612 15-May-2000 ru

Do not call icmp_error() if ipfirewall(4) denied packet.

PR: kern/10747, kern/18382


60536 14-May-2000 archie

Move code to handle BPF and bridging for incoming Ethernet packets out
of the individual drivers and into the common routine ether_input().
Also, remove the (incomplete) hack for matching ethernet headers
in the ip_fw code.

The good news: net result of 1016 lines removed, and this should make
bridging now work with *all* Ethernet drivers.

The bad news: it's nearly impossible to test every driver, especially
for bridging, and I was unable to get much testing help on the mailing
lists.

Reviewed by: freebsd-net


60408 11-May-2000 jayanth

Temporarily turn off the newreno flag until we can track down the known
data corruption problem.


60363 11-May-2000 brian

Revert the default behaviour for incoming connections so
that they (once again) go to the target machine rather than
the alias address.

PR: 18354
Submitted by: ru


60304 10-May-2000 itojun

correct more out-of-bounds memory access, if cnt == 1 and optlen > 1.
similar to recent fix to sys/netinet/ipf.c (by darren).


60295 09-May-2000 darrenr

Fix bug in dealing with "hlen == 1 and opt > 1"


60265 09-May-2000 ps

Add missing include machine/in_cksum.h.

Submitted by: n_hibma


60214 08-May-2000 ken

Include machine/in_cksum.h to unbreak options MROUTING.


60105 06-May-2000 jlemon

Add #include <machine/in_cksum.h>, in order to pick up the checksum
inline functions and prototypes.


60067 06-May-2000 jlemon

Implement TCP NewReno, as documented in RFC 2582. This allows
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".

Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>


59909 02-May-2000 paul

Force the address of the socket to be INADDR_ANY immediately before
calling in_pcbbind so that in_pcbbind sees a valid address if no
address was specified (since divert sockets ignore them).

PR: 17552
Reviewed by: Brian


59898 02-May-2000 luigi

Remove an unnecessary error message


59874 01-May-2000 peter

Add $FreeBSD$


59726 28-Apr-2000 ru

Replace PacketAliasRedirectPptp() (which had nothing specific
to PPTP) with more generic PacketAliasRedirectProto().

Major number is not bumped because it is believed that no one
has started using PacketAliasRedirectPptp() yet.


59704 27-Apr-2000 ru

Spell PacketAliasRedirectAddr() correctly.


59702 27-Apr-2000 ru

Load Sharing using IP Network Address Translation (RFC 2391, LSNAT).

LSNAT links are first created by either PacketAliasRedirectPort() or
PacketAliasRedirectAddress() and then set up by one or more calls to
PacketAliasAddServer().


59392 19-Apr-2000 shin

Initialize th_sum before in6_cksum(), again.
Without this fix, every IPv6 TCP RST packet has a wrong cksum value,
so an IPv6 connect() attempt to a 5.0 machine does not fail until the TCP
connect timeout, when it should fail immediately.

Thanks to haro@tk.kubota.co.jp (Munehiro Matsuda) for his much debugging
help and detailed info.


59391 19-Apr-2000 phk

Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>


59356 18-Apr-2000 ru

Add support for multiple PPTP sessions:

- new API function: PacketAliasRedirectPptp()
- new mode bit: PKT_ALIAS_DENY_PPTP

Please see manual page for details.


59334 17-Apr-2000 sumikawa

ND6_HINT() should not be called unless the connection status is
ESTABLISHED.

Obtained from: KAME Project


59237 14-Apr-2000 ru

Apply TCP_EXPIRE_CONNECTED (86400 seconds) timeout only to established
connections, after SYN packets were seen from both ends. Before this,
it would get applied right after the first SYN packet was seen (either
from client or server). With broken TCP connection attempts, when the
remote end does not respond with SYNACK nor with RST, this resulted in
having a useless (ie, no actual TCP connection associated with it) TCP
link with 86400 seconds TTL, wasting system memory. With high rate of
such broken connection attempts (for example, remote end simply blocks
these connection attempts with ipfw(8) without sending RST back), this
could result in a denial-of-service.

PR: bin/17963


59202 13-Apr-2000 ru

A complete reformatting of manual page.


59181 12-Apr-2000 ru

Make partially specified permanent links without `dst_addr'
but with `dst_port' work for outgoing packets.

This case was not handled properly when I first fixed this
in revision 1.17.

This change is also required for the upcoming improved PPTP
support patches -- that is how I found the problem.

Before this change:

# natd -v -a aliasIP \
-redirect_port tcp localIP:localPORT publicIP:publicPORT 0:remotePORT

Out [TCP] [TCP] localIP:localPORT -> remoteIP:remotePORT aliased to
[TCP] aliasIP:localPORT -> remoteIP:remotePORT

After this change:

# natd -v -a aliasIP \
-redirect_port tcp localIP:localPORT publicIP:publicPORT 0:remotePORT

Out [TCP] [TCP] localIP:localPORT -> remoteIP:remotePORT aliased to
[TCP] publicIP:publicPORT -> remoteIP:remotePORT


59143 11-Apr-2000 wes

PR: kern/17872
Submitted by: csg@waterspout.com (C. Stephen Gunn)


59075 06-Apr-2000 ru

- Add support for FTP EPRT (RFC 2428) command.
- Minor optimizations.
- Minor spelling fixes.

PR: 14305
Submitted by: ume
Rewritten by: ru


59047 05-Apr-2000 ru

- Remove unused includes.
- Minor spelling fixes.
- Make IcmpAliasOut2() really work.

Before this change:

# natd -v -n PUB_IFACE -p 12345 -redirect_address 192.168.1.1 P.P.P.P
natd[87923]: Aliasing to A.A.A.A, mtu 1500 bytes
In [UDP] [UDP] X.X.X.X:49562 -> P.P.P.P:50000 aliased to
[UDP] X.X.X.X:49562 -> 192.168.1.1:50000
Out [ICMP] [ICMP] 192.168.1.1 -> X.X.X.X 3(3) aliased to
[ICMP] A.A.A.A -> X.X.X.X 3(3)

# tcpdump -n -t -i PUB_IFACE host X.X.X.X and "(udp or icmp)"
tcpdump: listening on PUB_IFACE
X.X.X.X.49562 > P.P.P.P.50000: udp 3
A.A.A.A > X.X.X.X: icmp: A.A.A.A udp port 50000 unreachable

After this change:

# natd -v -n PUB_IFACE -p 12345 -redirect_address 192.168.1.1 P.P.P.P
natd[89360]: Aliasing to A.A.A.A, mtu 1500 bytes
In [UDP] [UDP] X.X.X.X:49563 -> P.P.P.P:50000 aliased to
[UDP] X.X.X.X:49563 -> 192.168.1.1:50000
Out [ICMP] [ICMP] 192.168.1.1 -> X.X.X.X 3(3) aliased to
[ICMP] P.P.P.P -> X.X.X.X 3(3)

# tcpdump -n -t -i PUB_IFACE host X.X.X.X and "(udp or icmp)"
tcpdump: listening on PUB_IFACE
X.X.X.X.49563 > P.P.P.P.50000: udp 3
P.P.P.P > X.X.X.X: icmp: P.P.P.P udp port 50000 unreachable


59046 05-Apr-2000 ru

- Moved NULL definition into private include file.
- Minor spelling fixes.


59031 05-Apr-2000 ru

Minor spelling fixes.


58943 02-Apr-2000 brian

Correct Charles Mott's email address

Requested by: Charles Mott <cmott@scientech.com>


58936 02-Apr-2000 shin

Move htons() ip_len to after the in_delayed_cksum() call.
This should stop cksum error messages on IPsec communication
which was reported on freebsd-current.

Reviewed by: jlemon


58911 02-Apr-2000 ps

Try and make the kernel build again without INET6.


58907 01-Apr-2000 shin

Support per socket based IPv4 mapped IPv6 addr enable/disable control.

Submitted by: ume


58895 01-Apr-2000 jlemon

Calculate any delayed checksums before handing an mbuf off to a
divert socket. This fixes a problem with ppp/natd.

Reviewed by: bsd (Brian Dean, gotta love that login name)


58877 31-Mar-2000 brian

Allow PacketAliasSetTarget() to be passed the following:
INADDR_NONE: Incoming packets go to the alias address (the default)
INADDR_ANY: Incoming packets are not NAT'd (direct access to the
internal network from outside)
anything else: Incoming packets go to the specified address

Change a few inaddr::s_addr == 0 to inaddr::s_addr == INADDR_ANY
while I'm there.
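
The three cases listed above amount to a small selection function. The sketch below only restates that list in code; `incoming_dest()` and its parameters are hypothetical and this is not the libalias implementation.

```c
#include <stdint.h>
#include <netinet/in.h>

/*
 * Pick the destination for an incoming packet that matched no explicit
 * redirect. target is the value previously given to
 * PacketAliasSetTarget().
 */
uint32_t incoming_dest(uint32_t target, uint32_t alias_addr,
                       uint32_t orig_dst)
{
    if (target == INADDR_NONE)
        return alias_addr;    /* default: deliver to the alias address */
    if (target == INADDR_ANY)
        return orig_dst;      /* leave the packet un-NAT'd */
    return target;            /* explicit target host */
}
```

Using the two reserved addresses as sentinels lets one in_addr argument carry all three behaviours without an extra flag.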


58866 31-Mar-2000 brian

When an incoming packet is received that is not specifically
redirected and when no target address has been specified, NAT
the destination address to the alias address rather than
allowing people direct access to your internal network from
outside.


58806 30-Mar-2000 jlemon

If `ipfw fwd' loops an mbuf back to ip_input from ip_output and the
mbuf is marked for delayed checksums, then additionally mark the
packet as having its checksums computed. This allows us to bypass
computing/checking the checksum entirely, which isn't really needed
as the packet has never hit the wire.

Reviewed by: green


58770 29-Mar-2000 joerg

Peter Johnson found another log() call without a trailing newline.
All three of them have been introduced in rev 1.64, so i guess i've
got all of them now. :)

Submitted by: Peter Johnson <locke@mcs.net>


58758 28-Mar-2000 joerg

Added two missing newlines in calls to log(9).

Reported in Usenet by: locke@mcs.net (Peter Johnson)

While i was at it, prepended a 0x to the %D output, to make it clear that
the printed value is in hex (i assume %D has been chosen over %#x to
obey network byte order).


58698 27-Mar-2000 jlemon

Add support for offloading IP/TCP/UDP checksums to NIC hardware which
supports them.


58499 23-Mar-2000 dillon

Fix parens in m_pullup() line in arp handling code. The code was
improperly doing the equivalent of (m = (function() == NULL)) instead
of ((m = function()) == NULL).

This fixes a NULL pointer dereference panic with runt arp packets.
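
The precedence issue is easy to show outside the kernel. In this sketch `struct buf`, `pullup()` and `have_header()` are hypothetical stand-ins for struct mbuf, m_pullup() and the ARP input path; only the parenthesization matters.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf; only the length matters here. */
struct buf {
    int len;
};

/* Stand-in for m_pullup(): returns NULL when the data is too short. */
static struct buf *pullup(struct buf *b, int need)
{
    return (b != NULL && b->len >= need) ? b : NULL;
}

/*
 * Correct grouping: the assignment is parenthesized, so m receives the
 * pointer and the pointer is then compared against NULL. Writing
 * m = (pullup(&b, need) == NULL) would instead store the 0/1 result of
 * the comparison in m.
 */
int have_header(int have, int need)
{
    struct buf b = { have };
    struct buf *m;

    if ((m = pullup(&b, need)) == NULL)
        return 0;   /* runt packet: nothing safe to dereference */
    return m->len;
}
```

With the wrong grouping a successful pullup leaves m equal to 0, so the very next dereference is a NULL pointer access, which matches the panic described above.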


58452 22-Mar-2000 green

in6_pcb.c:
Remove a bogus (redundant, just weird, etc.) key_freeso(so).
There are no consumers of it now, nor does it seem there
ever will be.

in6?_pcb.c:
Add an if (inp->in6?p_sp != NULL) before the call to
ipsec[46]_delete_pcbpolicy(inp). In low-memory conditions
this can cause a crash because in6?_sp can be NULL...


58313 19-Mar-2000 lile

o Replace most magic numbers related to token ring with #defines
from iso88025.h.

o Add minimal llc support to iso88025_input.

o Clean up most of the source routing code.

* Submitted by: Nikolai Saoukh <nms@otdel-1.org>


58279 19-Mar-2000 brian

Make _FindLinkIn() static and only define GetDestPort when
NO_FW_PUNCH isn't defined.


58057 14-Mar-2000 ru

Fix reporting of src and dst IP addresses for ICMP and generic IP packets.

PR: 17319
Submitted by: Mike Heffner <spock@techfour.net>


57920 11-Mar-2000 shin

Disable IPv4 over IPv4 tunnel on the 6to4 interface for better security.

Approved by: jkh


57903 11-Mar-2000 shin

IPv6 6to4 support.

The biggest problem with IPv6 right now is getting an IPv6 address
assignment.
6to4 solves the problem. A 6to4 address is defined as follows:

2002: 4byte v4 addr : 2byte SLA ID : 8byte interface ID

The most important point of the address format is that an IPv4 addr
is embedded in it. So any user who has an IPv4 addr can get an IPv6 address
block with 2 bytes of subnet space. Also, the IPv4 addr is used for
semi-automatic IPv6 over IPv4 tunneling.

With 6to4, getting an IPv6 addr becomes dramatically easier.
The attached patch enables the 6to4 extension, and was confirmed to work
between "Richard Seaman, Jr." <dick@tar.com> and me.

Approved by: jkh

Reviewed by: itojun
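
The address layout quoted in this commit can be written out byte by byte. The helper below is a hypothetical illustration of that layout only; it is not taken from the patch and does not touch any kernel interface.

```c
#include <stdint.h>
#include <string.h>

/*
 * Fill a 16-byte IPv6 address using the layout quoted in the commit:
 * 2002 : 4-byte IPv4 addr : 2-byte SLA ID : 8-byte interface ID.
 */
void make_6to4(uint8_t out[16], const uint8_t v4[4], uint16_t sla_id,
               const uint8_t ifid[8])
{
    out[0] = 0x20;                       /* fixed 6to4 prefix 2002::/16 */
    out[1] = 0x02;
    memcpy(&out[2], v4, 4);              /* embedded IPv4 address */
    out[6] = (uint8_t)(sla_id >> 8);     /* SLA (subnet) ID, big-endian */
    out[7] = (uint8_t)(sla_id & 0xff);
    memcpy(&out[8], ifid, 8);            /* interface ID */
}
```

For example, the IPv4 address 192.0.2.1 with SLA ID 1 yields an address starting 2002:c000:0201:0001, so the holder of one IPv4 address gets 65536 possible subnets.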


57900 11-Mar-2000 rwatson

The function arpintr() incorrectly checks m->m_len to detect incomplete
ARP packets. This can incorrectly reject complete frames since the frame
could be stored in more than one mbuf.

The following patches fix the length comparison, and add several
diagnostic log messages to the interrupt handler for out-of-the-norm ARP
packets. This should make ARP problems easier to detect, diagnose and
fix.

Submitted by: C. Stephen Gunn <csg@waterspout.com>
Approved by: jkh
Reviewed by: rwatson
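
Why the first buffer's length is the wrong thing to test can be shown with a toy buffer chain. `struct seg` and both functions are hypothetical; in the kernel the total is normally available in the packet header of the first mbuf rather than computed by walking the chain, and the commit does not spell out which form the fix took.

```c
#include <stddef.h>

/* Hypothetical stand-in for an mbuf chain. */
struct seg {
    int len;
    struct seg *next;
};

/* Total number of bytes held by the whole chain. */
int chain_len(const struct seg *s)
{
    int total = 0;

    for (; s != NULL; s = s->next)
        total += s->len;
    return total;
}

/* A frame is complete when the chain, not its first buffer, is long enough. */
int frame_complete(const struct seg *s, int need)
{
    return chain_len(s) >= need;
}
```

A 28-byte frame split as 8 + 20 bytes passes this test even though its first buffer alone is shorter than the required length.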


57855 09-Mar-2000 shin

Initialize mbuf pointer at getting ipsec policy.
Without this, kernel will panic at getsockopt() of IPSEC_POLICY.
Also make compilable libipsec/test-policy.c which tries getsockopt() of
IPSEC_POLICY.

Approved by: jkh

Submitted by: sakane@kame.net


57686 02-Mar-2000 sheldonh

Remove single-space hard sentence breaks. These degrade the quality
of the typeset output, tend to make diffs harder to read and provide
bad examples for new-comers to mdoc.


57631 29-Feb-2000 luigi

Fix panic when doing keep-state and "forward".
Removed a redundant check.
Also move check for expired rules before using them.
Sorry for the whitespace changes.

Approved-by: jordan


57576 28-Feb-2000 ps

Limit the maximum permissible TCP window size to 65535 octets if
window scaling is disabled.

PR: kern/16914
Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
Reviewed by: wollman
Approved by: jkh
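
The limit follows from the 16-bit window field in the TCP header: without a scale factor, nothing above 65535 can be advertised. A minimal sketch of such a clamp, with hypothetical names and not the code from this change:

```c
#define MAXWIN_UNSCALED 65535UL   /* largest value of the 16-bit window field */
#define MAX_WINSHIFT    14        /* largest shift allowed by RFC 1323 */

/* Clamp a requested window to what can actually be advertised. */
unsigned long clamp_window(unsigned long wanted, int scaling_enabled)
{
    unsigned long limit = MAXWIN_UNSCALED;

    if (scaling_enabled)
        limit <<= MAX_WINSHIFT;
    return wanted > limit ? limit : wanted;
}
```

With scaling disabled a request for 100000 bytes is reduced to 65535, while with scaling enabled it passes through unchanged.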


57544 28-Feb-2000 alfred

-it do, among other things, clear out any
+it does, amongst other things, clear out any

The old sentence didn't seem to make sense.


57401 23-Feb-2000 guido

Remove option IPFILTER_KLD. In case you wanted to kldload ipfilter,
the module would only work in kernels built with this option.

Approved by: jkh


57178 13-Feb-2000 peter

Clean up some loose ends in the network code, including the X.25 and ISO
#ifdefs. Clean out unused netisr's and leftover netisr linker set gunk.
Tested on x86 and alpha, including world.

Approved by: jkh


57140 11-Feb-2000 luigi

Forgot one line: don't try to match flags when looking for a flow.

Approved-by: jordan


57126 10-Feb-2000 guido

Re add rev 1.11 diffs to ip_fil.h Also discover that I did not undefine
CVS_FUBAR (which no longer exists) and thus forgot to add $FreeBSD's.
Add them.

Approved by: jkh (is part of ipfilter upgrade)


57120 10-Feb-2000 shin

Forbid include of some inet6 header files from wrong place

KAME put INET6 related stuff into sys/netinet6 dir, but IPv6
standard API(RFC2553) require following files to be under sys/netinet.
netinet/ip6.h
netinet/icmp6.h
Now those header files just include each following files.
netinet6/ip6.h
netinet6/icmp6.h

Also KAME has netinet6/in6.h for easy INET6 common defs
sharing between different BSDs, but RFC2553 requires only
netinet/in.h should be included from userland.
So netinet/in.h also includes netinet6/in6.h inside.

To keep apps portability, apps should not directly include
above files from netinet6 dir.
Ideally, all contents of,
netinet6/ip6.h
netinet6/icmp6.h
netinet6/in6.h
should be moved into
netinet/ip6.h
netinet/icmp6.h
netinet/in.h
but to avoid big changes in this stage, add some hack, that
-Put some special macro define into those files under netinet
-Let files under netinet6 cause error if it is included
from some apps, and, if the special macro define is not
defined.
(which should have been defined if files under netinet is
included)
-And let them print an error message which tells the
correct name of the include file to be included.

Also fix apps which includes invalid header files.

Approved by: jkh

Obtained from: KAME project


57117 10-Feb-2000 luigi

Move definition of fw_enable from ip_fw.c to ip_input.c
so we can compile kernels without IPFIREWALL .

Reported-by: Robert Watson
Approved-by: jordan


57116 10-Feb-2000 luigi

Whoops... forgot braces in a conditional

Revealed-by: diff with -STABLE version (the advantage of having
multiple lines of development...)
Approved-by: jordan


57114 10-Feb-2000 luigi

Support the net.inet.ip.fw.enable variable, part of
the recent ipfw modifications.

Approved-by: jordan


57113 10-Feb-2000 luigi

Support for stateful (dynamic) ipfw rules. They are very
similar to ipfilter's keep-state.

Look at the updated ipfw(8) manpage for details.

Approved-by: jordan


57096 09-Feb-2000 guido

Bring over ipfilter v3_3_8 kernel sources, including merging the
local modifications.
Also fix initializing fr_running in KLD case.
Rename ipl_inited to fr_running in mlfk_ipl

Approved by: jkh


57068 09-Feb-2000 shin

Avoid kernel panic when tcp rfc1323 and rfc1644 options are enabled
at the same time.

When rfc1323 and rfc1644 option are enabled by sysctl,
and tcp over IPv6 is tried, kernel panic happens by the
following check in tcp_output(), because now hdrlen is bigger
in such case than before.

/*#ifdef DIAGNOSTIC*/
if (max_linkhdr + hdrlen > MHLEN)
panic("tcphdr too big");
/*#endif*/

So change the above check to compare with MCLBYTES in #ifdef INET6 case.
Also, allocate a mbuf cluster for the header mbuf, in that case.

Bug reported at KAME environment.
Approved by: jkh

Reviewed by: sumikawa
Obtained from: KAME project


56991 04-Feb-2000 luigi

Fix a (mostly harmless) scheduling-in-the-past problem with
dummynet (already fixed in -stable, was waiting for Jordan's
approval due to the code freeze).

Reported-By: Mike Tancsa
Approved-By: Jordan


56968 02-Feb-2000 archie

The flags PKT_ALIAS_PUNCH_FW and PKT_ALIAS_PROXY_ONLY were both
being defined as 0x40. Change the former to be 0x100.

Submitted by: Erik Salander <erik@whistle.com>
Approved by: jkh


56967 02-Feb-2000 brian

Mention what PKT_ALIAS_PROXY_ONLY does.

Prompted by: archie


56801 29-Jan-2000 shin

Sorry for this commit just before the code freeze.
This is a fix to usr.sbin/trpt and tcp_debug.[ch].
I thought of putting this in after 4.0, but...

-There was a bug that when INET6 is defined,
IPv4 sockets are not traced by trpt.

-I received a request from a person who distributes a program
which uses the tcp_debug interface and prints performance statistics,
asking that we
-keep compatibility with old programs as much as possible
-use the same interface as other OSes

So, I talked with itojun, and synced the API with the NetBSD IPv6 extension.

makeworld check and kernel build check (including GENERIC) are done.

But if any problem turns up, please let me know and
I will back out this change promptly.


56724 28-Jan-2000 imp

Mitigate the stream.c attacks

o Drop all broadcast and multicast source addresses in tcp_input.
o Enable ICMP_BANDLIM in GENERIC.
o Change default to 200/s from 100/s. This will still stop the attack, but
is conservative enough to do this close to code freeze.

This is not the optimal patch for the problem, but is likely the least
intrusive patch that can be made for this.

Obtained from: Don Lewis and Matt Dillon.
Reviewed by: freebsd-security


56565 25-Jan-2000 shin

Avoid m_len and m_pkthdr.len inconsistency when changing m_len
for an mbuf whose M_PKTHDR is set.

PR: related to kern/15175
Reviewed by: archie


56564 25-Jan-2000 shin

Fix the bug where the IPv4 ttl is not initialized when an AF_INET6 socket is
used for IPv4 communication (IPv4-mapped IPv6 address).
Also removed the IPv6 hoplimit initialization because it is always done in
tcp_output.

Confirmed by: Bernd Walter <ticso@cicely5.cicely.de>


56555 24-Jan-2000 brian

Move the *intrq variables into net/intrq.c and unconditionally
include this in all kernels. Declare some const *intrq_present
variables that can be checked by a module prior to using *intrq
to queue data.

Make the if_tun module capable of processing atm, ip, ip6, ipx,
natm and netatalk packets when TUNSIFHEAD is ioctl()d on.

Review not required by: freebsd-hackers


56041 15-Jan-2000 shin

Fixed the problem that IPsec connection hangs when bigger data is sent.
-opt_ipsec.h was missing on some tcp files (sorry for basic mistake)
-made buildable as above fix
-also added some missing IPv4 mapped IPv6 addr consideration into
ipsec4_getpolicybysock


56039 15-Jan-2000 shin

Added missing 'else' for 'if (isipv6)' at IPv6 length setting in tcp_respond().
By this bug, IPv6 reset was not sent.
(I checked around same kind of bug, but no other found.)


56019 15-Jan-2000 shin

Removed wrong (unnecessary) & operators on pointers in ipsec_hdrsiz_tcp().
This must be one of the reasons why connections over IPsec hang for
bigger packets (which was reported on freebsd-current@freebsd.org).

But there still seems to be another bug and the problem is not yet fixed.


56016 15-Jan-2000 shin

add forward declarations, and small cosmetic changes.

Submitted by: bde


55990 14-Jan-2000 guido

Apply patches in rev 1.2 and 1.9 that I forgot

Pointed out by: bde


55955 14-Jan-2000 rgrimes

Replace beforeinstall target with new variables used by .mk system.

Reviewed by: marcel, and make world


55929 13-Jan-2000 guido

Bring over ipfilter kernel sources, including merging the local modifications.


55917 13-Jan-2000 shin

Change the struct sockaddr_storage member names, because the following change
is very likely to become the consensus, judging by recent ietf/ipng mailing
list discussion. The recent KAME repository and other KAME-patched BSDs
have also applied it.

s/__ss_family/ss_family/
s/__ss_len/ss_len/

Makeworld is confirmed, and no application should be affected by this change
yet.


55913 13-Jan-2000 shin

Clear rt after RTFREE. This might sometimes have caused a kernel panic at
rtfree() in INET6-enabled environments.
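
The general pattern behind this fix can be sketched as below. The
struct and function names are hypothetical stand-ins, not the kernel's
rtentry or RTFREE; the point is only that clearing the pointer after
releasing it turns a second release into a no-op.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for a route entry. */
struct sketch_route {
	int refcnt;
};

/* Drop one reference and clear the caller's pointer. */
static void
sketch_route_release(struct sketch_route **rtp)
{
	struct sketch_route *rt = *rtp;

	if (rt == NULL)
		return;		/* already released: nothing to do */
	if (--rt->refcnt <= 0)
		free(rt);
	*rtp = NULL;		/* no stale pointer is left behind */
}

/* Returns 1 when a second release of the same pointer is harmless. */
static int
sketch_double_release_is_safe(void)
{
	struct sketch_route *rt = malloc(sizeof(*rt));

	if (rt == NULL)
		return (0);
	rt->refcnt = 1;
	sketch_route_release(&rt);
	sketch_route_release(&rt);	/* no-op because rt is now NULL */
	return (rt == NULL);
}
```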


55875 13-Jan-2000 shin

add a comment for some possible? IPv4 option processing.


55874 13-Jan-2000 shin

removed incorrect ip6 length setting for IPv6 tcp reset packet.


55777 10-Jan-2000 ru

MGETHDR() does not initialize m_pkthdr.rcvif, do it here.

This fixes page fault panic observed when diverting packets
with IP options (e.g. ping -R remoteIP over natd).

PR: kern/8596, kern/11199


55679 09-Jan-2000 shin

tcp updates to support IPv6.
also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


55632 09-Jan-2000 shin

enable IPsec over DUMMYNET again

Submitted by: luigi
Reviewed by: luigi


55601 08-Jan-2000 shin

prevent kernel panic which happens when either of IPSEC and IPDIVERT
is enabled.

Confirmed by: Eugene M. Kim <ab@astralblue.com>


55599 08-Jan-2000 luigi

Add ipfw hooks for the new dummynet features.

Support masks on TCP/UDP ports.

Minor cleanup of ip_fw_chk() to avoid repeated calls to PULLUP_TO
at each rule.


55598 08-Jan-2000 luigi

Cleanup dummynet call interface so it should now work on the Alpha
as well. Also (probably) fix a bug introduced during the IPv6 import.


55597 08-Jan-2000 luigi

Implement per-flow queueing. Using a single pipe config rule,
now you can dynamically create rate-limited queues for different
flows using masks on dst/src IP, port and protocols.
Read the ipfw(8) manpage for details and examples.

Restructure the internals of the traffic shaper to use heaps,
so that it efficiently manages large numbers of queues.

Fix a bug present in previous versions which could, under
certain infrequent conditions, cause very large bursts of
traffic to be sent out.

All in all, this new code is much cleaner than the previous one and
should also perform better.

Work supported by Akamba Corp.


55460 05-Jan-2000 eivind

KERNEL -> _KERNEL


55205 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistent with the other
BSDs, who made this change quite some time ago. More commits to come.


55198 28-Dec-1999 msmith

Make tcp_drain() actually do something. When invoked (usually as a
desperation measure in low-memory situations), walk the tcpbs and
flush the reassembly queues.

This behaviour is currently controlled by the debug.do_tcpdrain sysctl
(defaults to on).

Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: wollman


55009 22-Dec-1999 shin

IPSEC support in the kernel.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


54952 21-Dec-1999 eivind

Change incorrect NULLs to 0s


54892 20-Dec-1999 peter

The ipfilter module name wasn't exactly conventional..


54799 19-Dec-1999 green

M_PREPEND-related cleanups (unregisterifying struct mbuf *s).


54601 14-Dec-1999 jlemon

Use SEQ_* macros for comparing sequence space numbers.

Reviewed by: truckman
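
A sketch of wraparound-safe sequence comparison, modelled on the idea
behind the SEQ_* macros. The function names are made up for this
example, and the code assumes the usual two's-complement conversion
from uint32_t to int32_t.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare 32-bit sequence numbers modulo 2^32: the sign of the
 * difference says which one is "earlier", even across a wrap.
 */
static bool
sketch_seq_lt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}

static bool
sketch_seq_leq(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) <= 0);
}

static bool
sketch_seq_gt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) > 0);
}
```

A plain `<` would rank 0xfffffff0 after 0x10, while the helpers above
treat 0x10 as the later value, since it lies 0x20 bytes past the wrap.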


54526 13-Dec-1999 shin

Always set INP_IPV4 flag for IPv4 pcb entries, because netstat needs it
to print out protocol specific pcb info.

A patch submitted by guido@gvr.org, and asmodai@wxs.nl also reported
the problem.
Thanks and sorry for your troubles.

Submitted by: guido@gvr.org
Reviewed by: shin


54421 11-Dec-1999 jlemon

According to RFC 793, a reset should be honored if the sequence number
is within the receive window. Follow this behavior, instead of only
allowing resets at last_ack_sent.

Pointed out by: jayanth@yahoo-inc.com
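
The window test described here can be sketched as follows. The helper
name is hypothetical and the zero-window special case is deliberately
left out; this is not the tcp_input() code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept a reset whose sequence number satisfies
 * rcv_nxt <= seq < rcv_nxt + rcv_wnd, evaluated modulo 2^32.
 * (seq - rcv_nxt) is the offset of the segment into the window.
 */
static bool
sketch_rst_in_window(uint32_t seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	return ((uint32_t)(seq - rcv_nxt) < rcv_wnd);
}
```

Because the offset is computed with unsigned subtraction, the test also
behaves when the window straddles the 32-bit wrap.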


54415 10-Dec-1999 archie

Fix a '&&' that should have been a '&'.

Submitted by: Erik Salander <erik@whistle.com>
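
For readers unfamiliar with this class of bug, a small illustration of
how '&&' and '&' differ when testing a flag. The function names are
invented for the example and do not come from the fixed file.

```c
/* Logical AND: true whenever both operands are non-zero, whatever bits they hold. */
static int
sketch_logical_and(unsigned int value, unsigned int flag)
{
	return (value && flag);
}

/* Bitwise AND: true only when the flag's bit is actually set in value. */
static int
sketch_bitwise_test(unsigned int value, unsigned int flag)
{
	return ((value & flag) != 0);
}
```

With value 0x01 and flag 0x04 the logical form reports a match although
the bit is not set, which is the kind of mistake a one-character fix
like this one removes.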


54376 09-Dec-1999 archie

Fix several typos.

Submitted by: Erik Salander <erik@whistle.com>


54304 08-Dec-1999 shin

Make this buildable with MROUTING defined.

Specified by: eivind, phk


54263 07-Dec-1999 shin

udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translator daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


54228 06-Dec-1999 guido

Last minute patch that I forgot to apply: check return code of iplattach()


54221 06-Dec-1999 guido

Revive mlfk_ipl here. This version is slightly changed from
the old one: an unnecessary define (KLD_MODULE) has been deleted and
the initialisation of the module is done after domaininit was called
to be sure inet is running.

Some slight changes were made to ip_auth.c and ip_state.c in order
to ensure that sys/systm.h is included in case we make a kld.

Make sure ip_fil does not include osreldate in kernel mode.

Remove mlfk_ipl.c from here: no sources allowed in these directories!


54175 06-Dec-1999 archie

Miscellaneous fixes/cleanups relating to ipfw and divert(4):

- Implement 'ipfw tee' (finally)
- Divert packets by calling new function divert_packet() directly instead
of going through protosw[].
- Replace kludgey global variable 'ip_divert_port' with a function parameter
to divert_packet()
- Replace kludgey global variable 'frag_divert_port' with a function parameter
to ip_reass()
- style(9) fixes

Reviewed by: julian, green


54018 02-Dec-1999 jlemon

Change the delayed ack time from 200ms to 100ms.

This results in closer behavior to earlier versions, where the fixed
200ms timer actually resulted in a delay anywhere from 1..200ms, with
the average delay being 100ms.

Pointed out by: dg


53716 26-Nov-1999 luigi

RTFREE the correct route entry in dummynet_io(). The previous
code failed in handling things like "forward" actions.

Reported-and-tested-by: Jean-Hugues ROYER jhroyer@joher.com


53645 23-Nov-1999 guido

Get rid of useless osreldate include for KLD/LKM modules (sys/param.h
already carries what is needed).
This is needed for the KLD support.


53642 23-Nov-1999 guido

Add kernel parts of revived ipfilter (3.3.3.)


53541 22-Nov-1999 shin

KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assign IPv6 addresses automatically, and can reply
to IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project


53353 18-Nov-1999 peter

Fix a warning and a potential panic if TCPDEBUG is active. (tp is
a wild pointer and used by TCPDEBUG2())


53295 17-Nov-1999 phk

The logic for blackhole processing does not free mbufs if the
blackhole flag is set.

PR: 14958
Submitted by: Larry Baird <lab@gta.com>
Reviewed by: phk


53187 15-Nov-1999 jmb

add two more codes to ICMP error 12 (Parameter Problem).
these two are detailed in RFC1700.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


53038 09-Nov-1999 phantom

Restore sub-chapters order.

PR: docs/14766
Submitted by: Kazutoshi Kubota <kazu@iworks.co.jp>


52952 07-Nov-1999 jlemon

Undo rev 1.10, which took out TH_FIN from the CLOSING state. This
breaks simultaneous closes.


52904 05-Nov-1999 shin

KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project


52377 18-Oct-1999 sheldonh

Append missing newline to log() message for permanent ARP modification
attempt warning, which was added in rev 1.48 .

PR: 14371
Submitted by: sec@pi.musin.de (Stefan `Sec` Zehl)


52089 10-Oct-1999 peter

Nuke the old antique copy of ipfilter from the tree. This is old enough
to be dangerous. It will better serve us as a port building a KLD,
ala SKIP.

The hooks are staying although it would be better to port and use
the NetBSD pfil interface rather than have custom hooks.


52070 09-Oct-1999 green

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


51727 27-Sep-1999 ru

Properly handle the case when either the aliasing or source address of
the link are equal to the default aliasing address. Do not zero them!

This will fix the problem with non-working links added with the source
and/or aliasing address equal to the default aliasing address, but the
default aliasing address is set later, after the link has been set up,
like both natd(8) and ppp(8) do (for objective reasons).

Reviewed by: Brian Somers <brian@FreeBSD.org>,
Eivind Eklund <eivind@FreeBSD.org>,
Charles Mott <cmott@srv.net>


51658 25-Sep-1999 phk

Remove five now unused fields from struct cdevsw. They should never
have been there in the first place. A GENERIC kernel shrinks almost 1k.

Add a slightly different safetybelt under nostop for tty drivers.

Add some missing FreeBSD tags


51550 22-Sep-1999 ru

ReLink() partial links in FindLinkOut() in the same manner as we do it
in FindLinkIn(). This will make TcpMonitorIn()/TcpMonitorOut() happy.

Reviewed by: eivind


51506 21-Sep-1999 ru

Restore previous version of FindLinkIn().

Instead, natd(8) should be fixed to call PacketAliasSetAddress()
as part of initialization, as required by libalias(3).


51494 21-Sep-1999 ru

- Make partially specified permanent links (without `dst_addr' and/or
`dst_port') work for outgoing packets.

- Make permanent links whose `alias_addr' matches the primary aliasing
address `aliasAddress' work for incoming packets.

- Typo fixes.

Reviewed by: brian, eivind


51491 21-Sep-1999 brian

sys/errno.h -> errno.h


51381 19-Sep-1999 green

Change so_cred's type to a ucred, not a pcred. This makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.


51320 16-Sep-1999 lile

Re-arrange the arp code so that fddi arps work properly.


51282 14-Sep-1999 des

Reorder.


51279 14-Sep-1999 des

Fix some more disordering, as well as the description string for the
net.inet.tcp.drop_synfin sysctl, which for some mysterious reason said
"Drop TCP packets with FIN+ACK set" (instead of "...with SYN+FIN set")


51209 12-Sep-1999 des

Add the net.inet.tcp.restrict_rst and net.inet.tcp.drop_synfin sysctl
variables, conditional on the TCP_RESTRICT_RST and TCP_DROP_SYNFIN kernel
options, respectively. See the comments in LINT for details.


51125 10-Sep-1999 ru

- Optimization to the previous (rev 1.15) commit.

Requested by: eivind
Discussed with: eivind
Reviewed by: brian, eivind


51107 09-Sep-1999 ru

Handle TCP reset sequence properly.

In the words of originator:
:If an incoming connection is initiated through natd and deny_incoming is
:not set, then a new alias_link structure is created to handle the link.
:If there is nothing listening for the incoming connection, then the kernel
:responds with a RST for the connection. However, this is not processed
:correctly in libalias/alias.c:TcpMonitor{In,Out} and
:libalias/alias_db.c:SetState{In,Out} as it thinks a connection
:has been established and therefore applies a timeout of 86400 seconds
:to the link.
:
:If many of these half-connections are initiated (during, for example, a
:port scan of the host), then many thousands of unnecessary links are
:created and the resident size of natd balloons to 20MB or more.

PR: 13639
Reviewed by: brian


51091 08-Sep-1999 ru

Fix typo.


50705 31-Aug-1999 jlemon

Simplify, and return an error if the user attempts to set a TCP
time value which results in < 1 tick.

Suggested by: bde
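
A minimal sketch of the validation described in this entry, assuming a
hypothetical setter that receives the value in milliseconds. The names
and the module-level variable are illustrative only; they are not the
actual sysctl handler.

```c
#include <errno.h>

/* Hypothetical stored timer value, kept in ticks. */
static int sketch_timer_ticks;

/*
 * Convert a user-supplied value in milliseconds to ticks and reject
 * anything that would round down to less than one tick.
 */
static int
sketch_set_timer_ms(int ms, int hz)
{
	long long ticks = ((long long)ms * hz) / 1000;

	if (ticks < 1)
		return (EINVAL);
	sketch_timer_ticks = (int)ticks;
	return (0);
}
```

At hz = 100, a request for 5 ms is refused, while 30 ms is stored as
3 ticks.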


50704 31-Aug-1999 jlemon

Remove conversion macros that were used during development.


50682 31-Aug-1999 jlemon

Add a SYSCTL_PROC so that TCP timer values are now expressed to
the user in ms, while they are stored internally as ticks. Note
that there probably are rounding bogons here, especially on the
alpha.
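
The rounding issue mentioned above can be illustrated with a pair of
conversion helpers. These are hypothetical functions using truncating
integer division; the real handler may round differently.

```c
/* Milliseconds to ticks, truncating toward zero. */
static int
sketch_ms_to_ticks(int ms, int hz)
{
	return ((int)(((long long)ms * hz) / 1000));
}

/* Ticks back to milliseconds, truncating toward zero. */
static int
sketch_ticks_to_ms(int ticks, int hz)
{
	return ((int)(((long long)ticks * 1000) / hz));
}
```

With hz = 100 a value of 200 ms survives the round trip (20 ticks, then
200 ms). With hz = 1024 the same value becomes 204 ticks and reads back
as 199 ms, which is the sort of discrepancy the entry warns about.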


50673 30-Aug-1999 jlemon

Restructure TCP timeout handling:

- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.

Submitted by: jlemon, wollman


50597 29-Aug-1999 billf

Add $FreeBSD$ and spell Eklund properly.

Approved by: brian (well, he approved adding $Id$)


50596 29-Aug-1999 obrien

Remove extra indenting of `break' statements introduced in rev 1.89,
plus wrap some long lines from that revision.

While here, wrap some other long lines.


50561 29-Aug-1999 des

Include the correct header for the IPSTEALTH option.


50556 29-Aug-1999 bde

Oops, I missed a cast in rev.1.119.


50512 28-Aug-1999 lile

It is much easier to arp if you don't truncate your arp-reply's.
[affects token-ring only]


50496 28-Aug-1999 green

Also make the "other" packets counter resettable.


50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


50476 28-Aug-1999 peter

$Id$ -> $FreeBSD$


50474 27-Aug-1999 green

Correction: uid -> gid (comment)


50426 26-Aug-1999 jlemon

Add readonly OID ``net.inet.tcp.tcbhashsize'' so it is possible to
discover the size of the TCB hashtable on a running system.


50273 24-Aug-1999 bde

Cast pointers to [u]intptr_t instead of casting them to [u_]long. Don't
depend on gcc's feature of casting lvalues, especially for direct
assignment where it doesn't even simplify the syntax. Cosmetic.
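
A short sketch of the portable idiom this entry refers to: converting a
pointer to uintptr_t when it has to be treated as an integer. The helper
names and the sample object are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sample object used only to have an address to work with. */
static long sketch_object;

/* Check alignment by converting the pointer to uintptr_t, not to long. */
static bool
sketch_is_aligned(const void *p, size_t alignment)
{
	return (((uintptr_t)p % alignment) == 0);
}

/* Convert a pointer to an integer and back through an ordinary variable. */
static void *
sketch_round_trip(void *p)
{
	uintptr_t bits = (uintptr_t)p;

	return ((void *)bits);
}
```

uintptr_t is defined to be wide enough to hold a pointer, whereas the
width of long relative to a pointer is platform dependent.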


50194 22-Aug-1999 brian

Allow ppp to work with Nortel Networks Extranet Switch
product and Windows NT tunneling.

Submitted by: Chain Lee <chain@nortelnetworks.com>


50175 22-Aug-1999 hoek

Typo: 102 => 192 (PR: docs/13310 - Maxim Sobolev <sobomax@altavista.net>)


50129 21-Aug-1999 green

To christen the brand new security category for syslog, we get IPFW
using syslog(3) (log(9)) for its various purposes! This long-awaited
change also includes such nice things as:
* macros expanding into _two_ comma-delimited arguments!
* snprintf!
* more snprintf!
* linting and criticism by more people than you can shake a stick at!
* a slightly more uniform message style than before!
and last but not least
* no less than 5 rewrites!

Reviewed by: committers


50043 19-Aug-1999 csgr

Fix breakage if blackhole=1 and tiflags & TH_SYN, plus
style(9) fixes

Submitted by: Jonathan Lemon


50015 18-Aug-1999 csgr

Slight tweak to tcp.blackhole to add optional behaviour to
drop any segment arriving at a closed port.
tcp.blackhole=1 - only drop SYN without RST
tcp.blackhole=2 - drop everything without RST
tcp.blackhole=0 - always send RST - default behaviour

This confuses nmap -sF or -sX or -sN quite badly.
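
The three settings listed above amount to a small decision table, which
can be sketched as follows. The function is a hypothetical summary of
the described behaviour, not the code in tcp_input().

```c
#include <stdbool.h>

/*
 * Hypothetical helper: should a RST be sent for a segment that arrived
 * at a closed port, given the tcp.blackhole setting?
 *   0 - always send RST (default)
 *   1 - drop SYN segments silently, answer everything else
 *   2 - drop everything silently
 */
static bool
sketch_should_send_rst(int blackhole, bool is_syn)
{
	switch (blackhole) {
	case 1:
		return (!is_syn);
	case 2:
		return (false);
	default:
		return (true);
	}
}
```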


49988 17-Aug-1999 billf

Fix a printf() formatter to match its variable.

Reviewed by: bde, luigi


49968 17-Aug-1999 csgr

Add net.inet.tcp.blackhole and net.inet.udp.blackhole
sysctl knobs.

With these knobs on, refused connection attempts are dropped
without sending a RST, or Port unreachable in the UDP case.
In the TCP case, sending of RST is inhibited iff the incoming
segment was a SYN.

Docs and rc.conf settings to follow.


49828 15-Aug-1999 mpp

Various man page cleanup:

- Sort xrefs
- FreeBSD.ORG -> FreeBSD.org
- Be consistent with section names as outlines in mdoc(7)
- Other misc mdoc cleanup.

PR: doc/13144
Submitted by: Alexy M. Zelkin <phantom@cris.net>


49630 11-Aug-1999 luigi

Implement probabilistic rule match in ipfw. Each rule can be associated
with a match probability to achieve non-deterministic behaviour of
the firewall. This can be extremely useful for testing purposes
such as simulating random packet drop without having to use dummynet
(which already does the same thing), and simulating multipath effects
and the associated out-of-order delivery (this time in conjunction
with dummynet).

The overhead on normal rules is just one comparison with 0.

Since it would have been trivial to implement this by just adding
a field to the ip_fw structure, I decided to do it in a
backward-compatible way (i.e. struct ip_fw is unchanged, and as a
consequence you don't need to recompile ipfw if you don't want to
use this feature), since this was also useful for -STABLE.

When, at some point, someone decides to change struct ip_fw, please
add a length field and a version number at the beginning, so userland
apps can keep working even if they are out of sync with the kernel.
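
One way to picture the "one comparison with 0" overhead is the sketch
below. How the probability is actually stored is not stated in this
entry, so the encoding (0 meaning "always match", otherwise a threshold
compared against a random value) is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: match_prob == 0 marks an ordinary rule that
 * always matches; otherwise the rule matches only when the supplied
 * random value falls below the threshold.
 */
static bool
sketch_rule_matches(uint32_t match_prob, uint32_t random_value)
{
	if (match_prob == 0)
		return (true);	/* the only cost paid by normal rules */
	return (random_value < match_prob);
}
```

Passing the random value in as a parameter keeps the sketch
deterministic; a real implementation would draw it per packet.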


49628 11-Aug-1999 luigi

Add spl() protection to prevent the timer from being invoked multiple
times, which resulted in higher bandwidth and lower delays.
Reported-by: Jamshid Madhavi


49603 10-Aug-1999 des

Add net.inet.icmp.log_redirect and net.inet.icmp.drop_redirect, for
respectively logging and dropping ICMP REDIRECT packets.

Note that there is no rate limiting on the log messages, so log_redirect
should be used with caution (preferably only for debugging purposes).


49350 01-Aug-1999 green

Make ipfw's logging more dynamic. Now, log will use the default limit
_or_ you may specify "log logamount number" to set logging specifically
for the rule.
In addition, "ipfw resetlog" has been added, which will reset the
logging counters on any/all rule(s). ipfw resetlog does not affect
the packet/byte counters (as ipfw reset does), and is the only "set"
command that can be run at securelevel >= 3.
This should address complaints about not being able to set logging
amounts, not being able to restart logging at a high securelevel,
and not being able to just reset logging without resetting all of the
counters in a rule.


49194 28-Jul-1999 green

8 -> NBBY


49193 28-Jul-1999 green

Correct a really gross comment format.


48886 18-Jul-1999 jmb

fix comment re: RST received in TIME_WAIT to match the code.


48788 12-Jul-1999 green

Correct a mistake in so_cred changes. In practice, I don't think that it
would make a difference. However, my previous diff _did_ change the
behavior in some way (not necessarily break it), so I'm fixing it.

Found by: bde
Submitted by: bde


48758 11-Jul-1999 green

Two new sysctls: net.inet.tcp.getcred and net.inet.udp.getcred. These take
a sockaddr_in[2] (local, then remote) and return a struct ucred. Example
code for these is at:
http://www.FreeBSD.org/~green/inetd_ident.patch
http://www.FreeBSD.org/~green/freebsd4.c (for pidentd)

Reviewed by: bde


48578 05-Jul-1999 msmith

Use the new tunable macros for the net.inet.tcp.tcbhashsize tunable.


48224 25-Jun-1999 pb

In in_pcbconnect(), check the return value from in_pcbbind() and
exit on errors.

If we don't, in_pcbrehash() is called without a preceding
in_pcbinshash(), causing a crash.

There are apparently several conditions that could cause the crash;
PR misc/12256 is only one of these.

PR: misc/12256


48102 22-Jun-1999 brian

Don't get caught in an infinite recursion when PKT_ALIAS_REVERSE
is set.
Document PKT_ALIAS_REVERSE.

Pointed out by: Jonathan Hanna <jh@cr1003333-a.crdva1.bc.home.com>
PR: 12304


48023 19-Jun-1999 green

This is the much-awaited cleaned up version of IPFW [ug]id support.
All relevant changes have been made (including ipfw.8).


48015 19-Jun-1999 green

Add RCS strings to kernel ipfilter files.


48013 19-Jun-1999 green

This should fix ipfilter for everyone it was broken for. CDEV_MAJOR is _not_
-1.

Noticed by: users on freebsd-current


47992 17-Jun-1999 green

Reviewed by: the cast of thousands

This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.


47960 16-Jun-1999 tegge

Close a race window where a tcp socket is closed while tcp_pcblist is
copying out tcp socket info, causing a NULL pointer to be dereferenced.


47877 11-Jun-1999 ru

Don't accept divert/tee/pipe rules without corresponding option.

PR: 10324
Reviewed by: luigi


47720 04-Jun-1999 peter

Plug a mbuf leak in tcp_usr_send(). pru_send() routines are expected
to either enqueue or free their mbuf chains, but tcp_usr_send() was
dropping them on the floor if the tcpcb/inpcb has been torn down in the
middle of a send/write attempt. This has been responsible for a wide
variety of mbuf leak patterns, ranging from slow gradual leakage to rather
rapid exhaustion. This has been a problem since before 2.2 was branched
and appears to have been fixed in rev 1.16 and lost in 1.23/1.28.

Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking
(extensively) into this on a live production 2.2.x system and that it
was the actual cause of the leak and looks like it fixes it. The machine
in question was losing (from memory) about 150 mbufs per hour under
load and a change similar to this stopped it. (Don't blame Jayanth
for this patch though)

An alternative approach to this would be to recheck SS_CANTSENDMORE etc
inside the splnet() right before calling pru_send() after all the potential
sleeps, interrupts and delays have happened. However, this would mean
exposing knowledge of the tcp stack's reset handling and removal of the
pcb to the generic code. There are other things that call pru_send()
directly though.

Problem originally noted by: John Plevyak <jplevyak@inktomi.com>


47640 31-May-1999 phk

Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print a message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.


47625 30-May-1999 phk

This commit should be an extensive NO-OP:

Reformat and initialize correctly all "struct cdevsw".

Initialize the d_maj and d_bmaj fields.

The d_reset field was not removed, although it is never used.

I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.

Vinum and i4b not modified, patches emailed to respective authors.


47547 27-May-1999 dg

Added net.inet.tcp.path_mtu_discovery variable which when set to 0
(default 1) disables PMTUD globally. Although PMTUD can be disabled in
the standard case by locking the MTU on a static route (including the
default route), this method doesn't work in the face of dynamic routing
protocols like gated.


47546 27-May-1999 dg

Made net.inet.ip.intr_queue_maxlen writeable.


47455 24-May-1999 luigi

close pr 10889:
+ add a missing call to dn_rule_delete() when flushing firewall
rules, thus preventing possible panics due to dangling pointers
(this was already done for single rule deletes).
+ improve "usage" output in ipfw(8)
+ add a few checks to ipfw pipe parameters and make it a bit more
tolerant of common mistakes (such as specifying kbit instead of Kbit)

PR: kern/10889
Submitted by: Ruslan Ermilov


47427 23-May-1999 brian

brucify
Mentioned by: sprice@hiwaay.net


47344 20-May-1999 eivind

Make incoming packets work as keepalives, too. This should fix problems
for some games.

Notified of problem by: tim@turbinegames.com


47023 11-May-1999 peter

"fix" warning. This still needs to be kld-ified some day (or removed).


46696 08-May-1999 peter

Pre-declare struct proc to avoid 'inside param list' warnings.


46594 06-May-1999 peter

Fix two warnings; and note a problem where a pointer is stored in an
int variable - this can't work on an Alpha.


46568 06-May-1999 peter

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


46420 04-May-1999 luigi

Free the dummynet descriptor in ip_dummynet, not in the called
routines. The descriptor contains parameters which could be used
within those routines (eg. ip_output() ).

On passing, add IPPROTO_PGM entry to netinet/in.h


46395 04-May-1999 brian

Add missing ``.''.


46393 04-May-1999 luigi

forgot passing the right pointer to dst to dummynet_io().
(-stable and releng2 were already safe).
Debugged-By: phk


46385 04-May-1999 luigi

assorted dummynet cleanup:
+ plug an mbuf leak when dummynet used with bridging
+ make prototype of dummynet_io consistent with usage
+ code cleanup so that now bandwidth regulation is precise to the
bit/s and not to (8*HZ) bit/s as before.


46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


46155 28-Apr-1999 phk

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no provisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


46153 28-Apr-1999 dt

s/static foo_devsw_installed = 0;/static int foo_devsw_installed;/.
(Edited automatically)


46112 27-Apr-1999 phk

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


46095 26-Apr-1999 luigi

Make one pass through the firewall the default.
Multiple pass (which only affects dummynet) is too confusing.


46016 24-Apr-1999 ache

so_linger is in seconds, not in 1/HZ

PR: 11252
Submitted by: Martin Kammerhofer <dada@sbox.tu-graz.ac.at>


45998 24-Apr-1999 dt

Use pointer arithmetic as appropriate.


45997 24-Apr-1999 luigi

postpone the sending of IGMP LEAVE msg to after deleting the
mc address from the address list. The latter operation on some
hardware resets the card, potentially canceling the pending LEAVE
pkt.


45926 21-Apr-1999 luoqi

Work around an egcs optimizer bug (i386). This should fix the active ftp
hang problem. A bug report has been sent to cygnus.


45871 20-Apr-1999 peter

s/IPFIREWALL_MODULE/KLD_MODULE/


45869 20-Apr-1999 peter

Tidy up some stray / unused stuff in the IPFW package and friends.
- unifdef -DCOMPAT_IPFW (this was on by default already)
- remove traces of in-kernel ip_nat package, it was never committed.
- Make IPFW and DUMMYNET initialize themselves rather than depend on
compiled-in hooks in ip_init(). This means they initialize the same
way both in-kernel and as kld modules. (IPFW initializes now :-)


45822 19-Apr-1999 peter

Zap LKM option and support. Farewell old friend.


45743 17-Apr-1999 peter

Convert the dummynet lkm code to be kld aware (this isn't actually used
anywhere that I can see).


45740 17-Apr-1999 peter

Oops, forgot this part of lkm code that's been replaced with kld.


45705 15-Apr-1999 eivind

Better handling for ARP/source routing on Token Ring

Submitted by: Larry Lile <lile@stdio.com>


45573 11-Apr-1999 eivind

Staticize.


45439 07-Apr-1999 julian

Two cosmetic changes, one a typo and the other, a clarification.


45165 30-Mar-1999 nsayer

Merge from RELENG_2_2, per luigi. Fixes the ntoh?() issue for the
firewall code when called from the bridge code.

PR: 10818
Submitted by: nsayer
Obtained from: luigi


45048 26-Mar-1999 luigi

Use the correct length from the mbuf header instead of the one from
the IP header (this would not work for bridged packets).
This has been fixed long ago in the 2.2 branch.

Problem noticed by: a few people
Fix suggested by: Remy Nonnenmacher


45025 25-Mar-1999 brian

PacketAliasProxyRule takes a const char *
Reminded by: bde


45008 24-Mar-1999 brian

Add a ``const'' and remove some inconsistent prototype args.


44993 24-Mar-1999 luigi

add missing #include "opt_bdg.h"


44979 23-Mar-1999 billf

Remove duplicate line.

Reviewed by: eivind


44797 16-Mar-1999 luigi

Fix a dummynet bug caused by passing a bad next hop address (the
symptom was the msg "arp failure -- host is not on local network" that
some users have seen on multihomed machines).
Bug tracked down by Emmanuel Duros


44677 12-Mar-1999 julian

Fix the 'fwd' option to ipfw when asked to divert to another machine.
also rely less on other modules clearing static values, and clear them
in a few cases we missed before.
Submitted by: Matthew Reimer <mreimer@vpop.net>


44627 10-Mar-1999 julian

Submitted by: Larry Lile
Move the Olicom token ring driver to the officially sanctioned location of
/sys/contrib. Also fix some brokenness in the generic token ring support.

Be warned that if_dl.h has been changed and SOME programs might
like recompilation.


44616 09-Mar-1999 brian

Remove all diagnostics to stdout/stderr with #ifdef DEBUG
Statify functions in alias_nbt.c


44556 07-Mar-1999 brian

Document PacketAliasPptp() and allow it to be disabled
by passing INADDR_NONE.


44548 07-Mar-1999 brian

Remove unused function stubs.


44546 07-Mar-1999 brian

Mention that PacketAliasProxyRule() doesn't accept host names,
just IP numbers.


44528 06-Mar-1999 archie

When an incoming packet is reflected back as an ICMP reply, make sure we
zero "m->m_pkthdr.rcvif", otherwise ipfw may wrongly match the outgoing packet.
PR: kern/9723
Submitted by: David Malone <dwmalone@maths.tcd.ie>


44526 06-Mar-1999 brian

Document PacketAliasProxyRule() and fix a typo.


44511 06-Mar-1999 wollman

Move kernel-only declaration inside #ifdef KERNEL section.


44456 04-Mar-1999 wpaul

arprequest() allocates an mbuf with m_gethdr() but does not initialize
m->m_pkthdr.rcvif to NULL. Bad arprequest(). No biscuit.


44307 27-Feb-1999 brian

Version 3.0: January 1, 1999
- Transparent proxying support added.
- PPTP redirecting support added based on patches
contributed by Dru Nelson <dnelson@redwoodsoft.com>.

Submitted by: Charles Mott <cmott@srv.net>


44219 22-Feb-1999 des

Add support for stealth forwarding (forwarding packets without touching
their ttl). This can be used - in combination with the proper ipfw
incantations - to make a firewall or router invisible to traceroute
and other exploration tools.

This behaviour is controlled by a sysctl variable (net.inet.ip.stealth)
and hidden behind a kernel option (IPSTEALTH).

Reviewed by: eivind, bde


44165 20-Feb-1999 julian

World, I'd like you to meet the first FreeBSD token Ring driver.
This is for various Olicom cards. An IBM driver is following.
This patch also adds support to tcpdump to decode packets on tokenring.
Congratulations to the proud father.. (below)

Submitted by: Larry Lile <lile@stdio.com>


44154 19-Feb-1999 luigi

avoid panic with pkts larger than MTU and DF set coming out of a pipe.


44078 16-Feb-1999 dfr

* Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)


43802 09-Feb-1999 wollman

After wading in the cesspool of ip_input for an hour, I have managed to
convince myself that nothing will break if we permit IP input while
interface addresses are unconfigured. (At worst, they will hit some
ULP's PCB scan and fail if nobody is listening.) So, remove the restriction
that addresses must be configured before packets can be input. Assume
that any unicast packet we receive while unconfigured is potentially ours.


43764 08-Feb-1999 julian

remove leftover garbage line.


43763 08-Feb-1999 julian

Fix for PR 9309.
Divert was not feeding clean data to ifa_ifwithaddr() so it was
giving bad results.
Submitted by: kseel <kseel@utcorp.com>, Ruslan Ermilov <ru@ucb.crimea.ua>


43691 06-Feb-1999 fenner

Use snd_nxt, not rcv_nxt, when calculating the ISS during TIME_WAIT.
This was missed in the 4.4-Lite2 merge.

Noticed by: Mohan Parthasarathy <Mohan.Parthasarathy@eng.Sun.COM> and
jayanth@loc201.tandem.com (vijayaraghavan_jayanth)
on the tcp-impl mailing list.


43576 04-Feb-1999 msmith

Nuke all the stupid ffs() stuff and use powerof2() instead.
Submitted by: Bruce Evans <bde@zeta.org.au>
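
For reference, a power-of-two test can be written as a single mask
operation. The function below is a stand-alone sketch with a
hypothetical name; it is not the kernel's powerof2() macro, and unlike
a bare mask test it rejects zero explicitly.

```c
/*
 * A nonzero x is a power of two exactly when clearing its lowest set
 * bit (x & (x - 1)) leaves nothing behind.
 */
static int
is_power_of_2(unsigned int x)
{
    return (x != 0 && (x & (x - 1)) == 0);
}
```

A hash size of 512 passes this check, while 500 does not.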


43575 04-Feb-1999 msmith

Fix power-of-2 check for the TCB hash size.

Submitted by: Brian Feldman <green@unixhelp.org>


43562 03-Feb-1999 msmith

Make TCBHASHSIZE a boot-time tunable as well, taking its value from the
variable net.inet.tcp.tcbhashsize.

Requested by: David Filo <filo@yahoo-inc.com>


43305 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


43112 23-Jan-1999 archie

Move kernel-only declarations to within #ifdef KERNEL
Prompted by: gcc warnings when compiling /sbin/ipfw


43066 22-Jan-1999 wollman

Don't forward unicast packets received via link-layer multicast.

Suggested by: fenner
Original complaint: Shiva Shenoy <Shiva.Shenoy@yagosys.com>


42902 20-Jan-1999 fenner

Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.

Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.


42866 19-Jan-1999 fenner

Fix bug in last commit (la was used uninitialized if no route was passed in).


42777 18-Jan-1999 fenner

Use dynamic memory allocation instead of mbuf's for multicast routing
state.

Note: this requires a recompilation of netstat (but netstat has been
broken since rev 1.52 of ip_mroute.c anyway)

Obtained from: Significantly based on Steve McCanne's
<mccanne@cs.berkeley.edu> work for BSD/OS


42776 18-Jan-1999 fenner

Rename igmp's MALLOC; it doesn't have anything to do with multicast routing.


42775 18-Jan-1999 fenner

If arpresolve() gets passed a route with a null llinfo, call
arplookup() to try again. This gets rid of at least one user's
"arpresolve: can't allocate llinfo" errors, and arplookup() gives
better error messages to help track down the problem if there really
is a problem with the routing table.


42592 12-Jan-1999 eivind

... _and_ the (void*) casts for %p. Next, I'll forget my own name :-(


42591 12-Jan-1999 eivind

Avoid unnecessary GCCism - I hadn't noticed the __unused macro.


42578 12-Jan-1999 eivind

* Print pointers using the correct type (%p) instead of %x.
* Use the correct type for timeout function.
* Add missing #include.


42574 12-Jan-1999 eivind

Add #ifdef's to avoid unused label warning in some cases.


42572 12-Jan-1999 eivind

Remove unused statics.


42516 11-Jan-1999 luigi

Add a missing bzero which could be the source of instability
problems reported recently (the rtentry pointer in the dummynet
queue was not initialized in all cases, resulting in spurious
rt_refcnt decreases in the lucky cases, and memory trashing in
other cases).


42486 10-Jan-1999 luigi

Remove check from where arp replies are coming from -- when doing bridging,
interfaces are used in clusters so the check does not apply.


42454 10-Jan-1999 brian

If we can't open alias.log, don't try to write to the
resulting NULL FILE *.
PR: 9403


42194 31-Dec-1998 luigi

Partial fix for when ipfw is used with bridging. Bridged packets
have all fields in network order, whereas ipfw expects some to be
in host order. This resulted in some incorrect matching, e.g. some
packets being identified as fragments, or bandwidth not being
correctly enforced.
NOTE: this only affects bridge+ipfw; normal ipfw usage was already
correct.

Reported-By: Dave Alden and others.
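
The fragment misidentification can be sketched in user space. The
helper name and MOCK_IP_OFFMASK are illustrative; the mask value 0x1fff
corresponds to the 13-bit fragment offset of the IPv4 header. The point
is only that a field still in network order must be converted before it
is masked.

```c
#include <stdint.h>
#include <arpa/inet.h>

#define MOCK_IP_OFFMASK 0x1fff /* low 13 bits: fragment offset */

/*
 * off_net is the 16-bit flags/fragment-offset word exactly as it
 * appears on the wire (network byte order). Returns 1 if the packet is
 * a non-first fragment, i.e. its fragment offset is nonzero.
 */
static int
is_fragment_tail(uint16_t off_net)
{
    return ((ntohs(off_net) & MOCK_IP_OFFMASK) != 0);
}
```

On a little-endian host, masking the raw wire value of a packet that
only has the DF bit set (0x4000) would see the bytes swapped and report
a nonzero offset, which matches the symptom described above.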


42193 31-Dec-1998 luigi

Remove some unused variables.


42019 22-Dec-1998 luigi

'ip_fw_head' and 'M_IPFW' are also used in ip_dummynet so cannot be
static...
Reported by: Dave Alden


41993 21-Dec-1998 luigi

Recover from previous dummynet screwup


41990 21-Dec-1998 luigi

Restore 1.82->1.83 change deleted by mistake, per Bruce's suggestion


41878 16-Dec-1998 fenner

Add missing "break"s to allow multicast routing to work.

Submitted by: Amancio Hasty <hasty@rah.star-gate.com>


41793 14-Dec-1998 luigi

Last bits (i think) of dummynet for -current.


41759 14-Dec-1998 dillon

Reviewed by: freebsd-current

Add bounds checking to netbios NS packet resolving code. This should
prevent natd from crashing on badly formed netbios packets (as might be
heard when the machine is sitting on a cable modem or certain DSL
networks), and also closes potential security holes that might have
exploited the lack of bounds checking in the previous version of the
code.


41702 12-Dec-1998 dillon

PR: kern/8990

If timer calculation results in degenerate value (0), force it to 1
to avoid divide-by-zero panic later on in calls to IGMP_RANDOM_DELAY().
I considered simply adding 1 to the timer calculation, but was unsure
if the calculation was part of the IGMP standard or not so did not want
to mess with it for all cases.
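
A sketch of the guard, assuming the delay helper reduces a random
number modulo the timer value. random_delay() and safe_random_delay()
are hypothetical stand-ins, not the kernel's IGMP_RANDOM_DELAY().

```c
#include <stdlib.h>

/* Stand-in for a random-delay helper that divides by its argument. */
static int
random_delay(int timer)
{
    return (rand() % timer + 1); /* timer == 0 would divide by zero */
}

/* Force a degenerate timer value of 0 up to 1 before it is used. */
static int
safe_random_delay(int timer)
{
    if (timer == 0)
        timer = 1;
    return (random_delay(timer));
}
```

Only the degenerate case is changed; every other timer value reaches
the helper unmodified.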


41591 07-Dec-1998 archie

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


41575 07-Dec-1998 eivind

Clean up some pointer usage.


41514 04-Dec-1998 archie

Examine all occurrences of sprintf(), strcat(), and str[n]cpy()
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.

These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>
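
The two patterns described above can be illustrated in user space as
follows. The function names and the buffer contents are made up for the
example; snprintf() and strncpy() behave as specified by standard C.

```c
#include <stdio.h>
#include <string.h>

/*
 * Format "<name><unit>" into buf without writing past buflen bytes.
 * Returns 1 if the whole string fit, 0 if it was truncated.
 */
static int
format_ifname(char *buf, size_t buflen, const char *name, int unit)
{
    /* snprintf() NUL-terminates (for buflen > 0) and returns the
     * length the complete string would have needed. */
    int needed = snprintf(buf, buflen, "%s%d", name, unit);

    return (needed >= 0 && (size_t)needed < buflen);
}

/* strncpy() does not terminate on truncation, so add the NUL by hand. */
static void
copy_name(char *dst, size_t dstlen, const char *src)
{
    strncpy(dst, src, dstlen - 1);
    dst[dstlen - 1] = '\0';
}
```

Callers pass sizeof(buf) rather than a literal such as 16, so the bound
follows the buffer if its declaration changes.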


41497 04-Dec-1998 dillon

Cleanup icmp_var.h, make icmp bandlim sysctl permanent but if ICMP_BANDLIM
option not defined the sysctl int value is set to -1 and read-only.

#ifdef KERNEL's added appropriately to wall off visibility of kernel
routines from user code.


41496 04-Dec-1998 dillon

Obtained from: "Andrey A. Chernov" <ache@nagual.pp.ru>

Quick add #ifdef KERNEL for ICMP_BANDLIM option so userland program
can #include icmp_var.h


41487 03-Dec-1998 dillon

Reviewed by: freebsd-current

Add ICMP_BANDLIM option and 'net.inet.icmp.icmplim' sysctl. If option
is specified in kernel config, icmplim defaults to 100 pps. Setting it
to 0 will disable the feature. This feature limits ICMP error responses
for packets sent to bad tcp or udp ports, which does a lot to help the
machine handle network D.O.S. attacks.

The kernel will report packet rates that exceed the limit at a rate of
one kernel printf per second. There is one issue in regards to the
'tail end' of an attack... the kernel will not output the last report
until some unrelated and valid icmp error packet is returned at some
point after the attack is over. This is a minor reporting issue only.
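
A per-second limiter of the kind described can be sketched as below.
This is an illustration under stated assumptions, not the kernel's
implementation: the struct, its field names and bandlim_allow() are
hypothetical, and time is passed in as whole seconds.

```c
/* Hypothetical limiter state. */
struct bandlim {
    long window_start; /* second in which the current window began */
    int  count;        /* responses requested in the current window */
    int  limit;        /* max responses per second; 0 disables limiting */
};

/* Return 1 if a response may be sent at time `now` (seconds), else 0. */
static int
bandlim_allow(struct bandlim *bl, long now)
{
    if (bl->limit == 0)
        return (1);
    if (now != bl->window_start) { /* a new one-second window begins */
        bl->window_start = now;
        bl->count = 0;
    }
    return (++bl->count <= bl->limit);
}
```

Because the counter is only examined when a packet arrives, a window
that ends an attack is not closed out until some later packet shows up,
which mirrors the reporting issue noted above.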


41363 26-Nov-1998 eivind

Staticize some more.


41252 19-Nov-1998 jdp

Fix a couple of typos.


41208 17-Nov-1998 dfr

Remove stale references to ih_next and ih_prev.

Pointed out by: Roman V. Palagin <romanp@wuppy.rcs.ru>


41201 16-Nov-1998 dfr

Make the previous fix more portable.

Requested by: bde


41187 15-Nov-1998 guido

The below patch helps to reduce the leakage of internal socket information
when a TCP "stealth" scan is directed at a *BSD box by ensuring the window
is 0 for all RST packets generated through tcp_respond()
Reviewed by: Don Lewis <Don.Lewis@tsc.tdk.com>
Obtained from: Bugtraq (from: Darren Reed <avalon@COOMBS.ANU.EDU.AU>)


41177 15-Nov-1998 dfr

Fix printf format errors on alpha.


41173 15-Nov-1998 bde

Finished updating module event handlers to be compatible with
modeventhand_t.


41096 11-Nov-1998 dg

Be sure to pullup entire IP header when dealing with fragment packets.


41059 10-Nov-1998 peter

add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()


40670 27-Oct-1998 dfr

Some optimisations to the fragment reassembly code.

Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>


40669 27-Oct-1998 dfr

Fix a bug in the new fragment reassembly code which was tickled by receiving
a fragment which wholly overlapped one or more existing fragments.

Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>


40435 16-Oct-1998 peter

*gulp*. Jordan specifically OK'ed this..

This is the bulk of the support for doing kld modules. Two linker_sets
were replaced by SYSINIT()'s. VFS's and exec handlers are self registered.
kld is now a superset of lkm. I have converted most of them, they will
follow as a separate commit as samples.
This all still works as a static a.out kernel using LKM's.


39681 26-Sep-1998 dfr

Dike out some obsolete defines which referenced ih_next and ih_prev from
struct ipovly (they don't exist anymore because they don't work when
pointers are 64bit).


39426 17-Sep-1998 fenner

Fix the bind security fix introduced in rev 1.38 to work with multicast:
- Don't bother checking for conflicting sockets if we're binding to a
multicast address.
- Don't return an error if we're binding to INADDR_ANY, the conflicting
socket is bound to INADDR_ANY, and the conflicting socket has SO_REUSEPORT
set.

PR: kern/7713


39389 17-Sep-1998 fenner

Prevent modification of permanent ARP entries (PR kern/7649)
Ignore ARP replies from the wrong interface (discussion on mailing list)
Add interface name to a couple of error messages


39267 15-Sep-1998 jkoshy

Turn off replies to ICMP echo requests for broadcast and multicast
addresses by default.

Add a knob "icmp_bmcastecho" to "rc.network" to allow this
behaviour to be controlled from "rc.conf".

Document the controlling sysctl variable "net.inet.icmp.bmcastecho"
in sysctl(3).

Reviewed by: dg, jkh
Reminded on -hackers by: Steinar Haug <sthaug@nethelp.no>


39119 12-Sep-1998 luigi

Bring in new files for dummynet support


39078 11-Sep-1998 wollman

Fix RST validation.

PR: 7892
Submitted by: Don.Lewis@tsc.tdk.com


39043 10-Sep-1998 dfr

Ensure that m_nextpkt is set to NULL after reassembling fragments.


38875 06-Sep-1998 phk

RFC 1644 has the status "Experimental Protocol", which means:

4.1.4. Experimental Protocol

A system should not implement an experimental protocol unless it
is participating in the experiment and has coordinated its use of
the protocol with the developer of the protocol.

Pointed out by: Steinar Haug <sthaug@nethelp.no>


38760 02-Sep-1998 phk

Widen and change the layout of the IPFW structures flag element.

This will allow us to add dummynet to 3.0

Recompile /sbin/ipfw AND your kernel.


38754 02-Sep-1998 wollman

Properly fragment multicast packets.

PR: 7802
Submitted by: Steve McCanne <mccanne@cs.berkeley.edu>


38681 31-Aug-1998 brian

Remove OpenBSD build support - let the Makefile vary per
OS rather than making it a mess and potentially screwing
up cross builds.
Suggested by: bde

Add Id keyword.


38663 30-Aug-1998 brian

Add OpenBSD build support


38513 24-Aug-1998 dfr

Re-implement tcp and ip fragment reassembly to not store pointers in the
ip header which can't work on alpha since pointers are too big.

Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


38482 23-Aug-1998 wollman

Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.


38373 17-Aug-1998 bde

Fixed printf format errors.


38342 15-Aug-1998 bde

Made some disgusting ifdefs even more disgusting to enable the support
for `u_long cmd' ioctl args if __FreeBSD_version >= 300003. Some ioctls
were broken on machines with 32-bit ints and 64-bit longs.


38249 11-Aug-1998 bde

Fixed printf format errors (ntohl() returns in_addr_t = u_int32_t != long
on some 64-bit systems). print_ip() should use inet_ntoa() instead of
bloated inline code with 4 ntohl()s.


38128 05-Aug-1998 bde

Converted the last instance of hzto() to tvtohz().


38057 03-Aug-1998 dfr

Use explicitly sized types when digging through packet headers.

Reviewed by: Julian Elischer <julian@whistle.com>


37996 01-Aug-1998 peter

Fix a compile error if IPFIREWALL_FORWARD active without IPDIVERT.


37939 29-Jul-1998 kjc

update ATM driver. (base version: midway.c 1.67 --> 1.68)

several new features are added:
- support vc/vp shaping
- support pvc shadow interface

code cleanup:
- remove WMAYBE related code. ENI WMAYBE DMA doesn't work.
- remove updating if_lastchange for every packet.
- BPF related code is moved to midway.c as it should be.
(bpfwrite should work if atm_pseudohdr and LLC/SNAP are
prepended.)
- BPF link type is changed to DLT_ATM_RFC1483.
BPF now understands only LLC/SNAP!! (because bpf can't
handle variable link header length.)
It is recommended to use LLC/SNAP instead of NULL
encapsulation for various reasons. (BPF, IPv6,
interoperability, etc.)

the code has been used for months in ALTQ and KAME IPv6.

OKed by phk long time ago.


37745 18-Jul-1998 alex

Don't log ICMP type and subtype for non-zero offset packet fragments.


37625 13-Jul-1998 bde

Removed a bogus forward struct declaration.

Cleaned up ifdefs.


37624 13-Jul-1998 bde

Fixed some longs that should have been fixed-sized types.


37623 13-Jul-1998 bde

Fixed overflow and sign extension bugs in
`len = min(so->so_snd.sb_cc, win) - off;'. min() has type u_int
and `off' has type int, so when min() is 0 and `off' is 1, the RHS
overflows to 0U - 1 = UINT_MAX. `len' has type long, so when
sizeof(long) == sizeof(int), the LHS normally overflows to the
correct value of -1, but when sizeof(long) > sizeof(int), the LHS
is UINT_MAX.

Fixed some u_long's that should have been fixed-sized types.
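
The overflow can be reproduced in user space with fixed-width types,
assuming int is 32 bits wide. The function names are illustrative, and
len_fixed() shows one possible way to avoid the problem, not
necessarily the change made in this commit.

```c
#include <stdint.h>

/* Stand-in for the kernel's unsigned min(). */
static uint32_t
min_u32(uint32_t a, uint32_t b)
{
    return (a < b ? a : b);
}

/*
 * Buggy form: the subtraction happens in unsigned 32-bit arithmetic
 * and is widened afterwards, so 0U - 1 arrives as 4294967295 in a
 * 64-bit result.
 */
static int64_t
len_buggy(uint32_t sb_cc, uint32_t win, int32_t off)
{
    return (min_u32(sb_cc, win) - off);
}

/* Widen to a signed 64-bit type first, then subtract. */
static int64_t
len_fixed(uint32_t sb_cc, uint32_t win, int32_t off)
{
    return ((int64_t)min_u32(sb_cc, win) - off);
}
```

With an empty send buffer and off == 1, the first form yields UINT_MAX
where the second yields -1.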


37622 13-Jul-1998 bde

Declare tcp_seq and tcp_cc as fixed-size types. Half fixed type
mismatches exposed by this (the prototype for tcp_respond() didn't
match the function definition lexically, and still depends on a
gcc feature to match if ints have more than 32 bits).


37621 13-Jul-1998 bde

Declare id_mask as a fixed-size type.


37620 13-Jul-1998 bde

Declare n_short, n_long and n_time as fixed-sized types. Don't ifdef
n_long or n_short specially for alphas.


37498 08-Jul-1998 dg

When not acting as a router (ipforwarding=0), silently discard source
routed packets that aren't destined for us, as required by RFC-1122.
PR: 7191


37434 06-Jul-1998 julian

oops ended comment before the comment ended..


37433 06-Jul-1998 julian

Bring back some slight cleanups from 2.2


37413 06-Jul-1998 julian

Don't expect the new code to be used without the right option file being
included.


37412 06-Jul-1998 julian

Fix braino in switching to TAILQ macro.


37409 06-Jul-1998 julian

Support for IPFW based transparent forwarding.
Any packet that can be matched by a ipfw rule can be redirected
transparently to another port or machine. Redirection to another port
mostly makes sense with tcp, where a session can be set up
between a proxy and an unsuspecting client. Redirection to another machine
requires that the other machine also be expecting to receive the forwarded
packets, as their headers will not have been modified.

/sbin/ipfw must be recompiled!!!

Reviewed by: Peter Wemm <peter@freebsd.org>
Submitted by: Chrisy Luke <chrisy@flix.net>


37334 02-Jul-1998 julian

Remove out of date comment.


37332 02-Jul-1998 julian

Remove the option to keep IPFW diversion backwards compatible
WRT diversion reinjection. No-one has been bitten by the new behaviour
that I know of.


37288 30-Jun-1998 phk

Byte count statistics of multicast vifs are invalid.
The problem is caused by a wrong endianness in the sum.

PR: 7115
Submitted by: Joao Carlos Mendes Luis <jonny@jonny.eng.br>


37183 27-Jun-1998 jhay

Only make struct xtcpcb visible if _NETINET_IN_PCB_H_ and _SYS_SOCKETVAR_H_
are defined.
Reviewed by: bde


37131 24-Jun-1998 brian

Add CUSEEME support. This has *not* been tested, nor
could I find anyone to test it, so please report any
problems to me.


37094 21-Jun-1998 bde

Removed unused includes.


37077 20-Jun-1998 peter

Merge ipfilter 3.2.3 -> 3.2.7 changes onto mainline.


37072 20-Jun-1998 peter

This commit was generated by cvs2svn to compensate for changes in r37071,
which included commits to RCS files with non-trunk default branches.


36995 15-Jun-1998 julian

fix another typo


36992 14-Jun-1998 julian

Try to narrow down the culprit sending undefined packet types through the loopback


36933 12-Jun-1998 julian

Remove 3 occurrences of __FUNCTION__


36908 12-Jun-1998 julian

Go through the loopback code with a broom..
Remove lots'o'hacks.
looutput is now static.

Other callers who want to use loopback to allow shortcutting
should call the special entrypoint for this, if_simloop(), which is
specifically designed for this purpose. Using looutput for this purpose
was problematic, particularly with bpf and trying to keep track
of whether one should be using the characteristics of the loopback interface
or the interface (e.g. if_ethersubr.c) that was requesting the loopback.
There was a whole class of errors due to this mis-use each of which had
hacks to cover them up.

Consists largely of hack removal :-)


36906 12-Jun-1998 julian

include opt_ipdivert.h so we get correct options


36903 12-Jun-1998 julian

Allow diverted packets from the transmit side to remember if they
had a recv interface and allow that state to be available
after re-injection for further tests.


36834 10-Jun-1998 brian

Quieten gcc 2.8.1


36767 08-Jun-1998 bde

Fixed pedantic semantics errors (bitfields not of type int, signed int
or unsigned int (this doesn't change the struct layout, size or
alignment in any of the files changed in this commit, at least for
gcc on i386's. Using bitfields of type u_char may affect size and
alignment but not packing)).


36752 08-Jun-1998 bde

ip_fil.h has 9 separate declarations of iplioctl() in a disgusting
ifdef tangle. The previous commit to ip_fil.h didn't change the
one that actually applies to the current FreeBSD kernel, of course.
Fixed.

Fixed style bugs in previous commit to ip_fil.h.


36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
in line with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


36725 07-Jun-1998 bde

Fixed pedantic semantics errors (bitfields not of type int, signed int
or unsigned int).


36711 06-Jun-1998 brian

Don't call PunchFWHole() ifdef NO_FW_PUNCH
Pointed out by: "Steve Sims" <SimsS@IBM.Net>


36710 06-Jun-1998 julian

Make sure the default value of a dummy variable is 0
so that it doesn't do anything.


36708 06-Jun-1998 julian

Fix wrong data type for a pointer.


36707 06-Jun-1998 julian

clean up the changes made to ipfw over the last weeks
(should make the ipfw lkm work again)


36692 06-Jun-1998 jkoshy

Spelling corrections.

PR: 6868
Submitted by: Josh Gilliam <josh@quick.net>


36681 05-Jun-1998 julian

Reviewed by: Kirk Mckusick (mckusick@mckusick.com)
Submitted by: luoqi Chen
fix a type in fsck.
(also add a comment that got picked up by mistake but is worth adding)


36678 05-Jun-1998 julian

Reverse the default sense of the IPFW/DIVERT reinjection code
so that the new behaviour is now default.
Solves the "infinite loop in diversion" problem when more than one diversion
is active.
Man page changes follow.

The new code is in -stable as the NON default option.


36529 31-May-1998 peter

Let the sowwakeup macro decide when to call sowakeup rather than have
tcp "know" about it. A pending upcall would be missed, eg: used by NFS.

Obtained from: NetBSD


36393 26-May-1998 dg

Fixed logic in the test to drop ICMP echo and timestamp packets when
net.inet.icmp.bmcastecho = 0 by removing the extra check for the
address being a multicast address. The test now relies on the link
layer flags that indicate it was received via multicast. The previous
logic was broken and replied to ICMP echo/timestamp broadcasts even
when the sysctl option disallowed them.
Reviewed by: wollman


36369 25-May-1998 julian

Add optional code to change the way that divert and ipfw work together.
Prior to this change, Accidental recursion protection was done by
the diverted daemon feeding back the divert port number it got
the packet on, as the port number on a sendto(). IPFW knew not to
redivert a packet to this port (again). Processing of the ruleset
started at the beginning again, skipping that divert port.

The new semantic (which is how we should have done it the first time)
is that the port number in the sendto() is the rule number AFTER which
processing should restart, and on a recvfrom(), the port number is the
rule number which caused the diversion. This is much more flexible,
and also more intuitive. If the user uses the same sockaddr received
when resending, processing resumes at the rule number following that
that caused the diversion. The user can however select to resume rule
processing at any rule. (0 is restart at the beginning)

To enable the new code use

option IPFW_DIVERT_RESTART

This should become the default as soon as people have looked at it a bit


36364 25-May-1998 julian

Hide the interface name in the sin_zero section of the sockaddr_in
passed to the user process for incoming packets. When the sockaddr_in
is passed back to the divert socket later, use this as the primary
interface lookup and only revert to the IP address when the name fails.
This solves a long standing bug with divert sockets:
When two interfaces had the same address (P2P for example) the interface
"assigned" to the reinjected packet was sometimes incorrect.
Probably we should define a "sockaddr_div" to officially hold this
extended information in the same manner as sockaddr_dl.


36363 25-May-1998 julian

Take the user's "IGNORE_DIVERT" argument from where the user put it
and not from the PCB which HAPPENS to contain the same number most
of the time, but not always.


36335 24-May-1998 fenner

Take IP options into account when calculating the allowable length
of the TCP payload. See RFC1122 section 4.2.2.6 . This allows
Path MTU discovery to be used along with IP options.

PR: problem discovered by Kevin Lahey <kml@nas.nasa.gov>


36330 24-May-1998 dg

The ipt_ptr field is 1-based (see TCP/IP Illustrated, Vol. 1, pp. 91-95),
so it must be adjusted (minus 1) before using it to do the length check.
I'm not sure who to give the credit to, but the bug was reported by
Jennifer Dawn Myers <jdm@enteract.com>, who also supplied a patch. It
was also fixed in OpenBSD previously by andreas.gunnarsson@emw.ericsson.se,
and of course I did the homework to verify that the fix was correct per
the specification.
PR: 6738
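
The off-by-one can be sketched as a stand-alone length check. The
function and parameter names are illustrative, not the kernel's; the
4-byte timestamp size and the minimum pointer value of 5 follow the IP
timestamp option layout in RFC 791.

```c
#include <stdint.h>

/*
 * Return 1 if one more 4-byte timestamp fits in the option.
 * opt_len: total length of the option in bytes.
 * opt_ptr: 1-based offset of the next free byte within the option.
 */
static int
ts_room(uint8_t opt_len, uint8_t opt_ptr)
{
    unsigned int used;

    if (opt_ptr < 5)    /* smallest legal pointer value */
        return (0);
    used = opt_ptr - 1; /* convert the 1-based pointer to a byte count */
    return (used + 4 <= opt_len);
}
```

For an 8-byte option with the pointer at 5, the adjusted check finds
room for exactly one timestamp, whereas comparing 5 + 4 against 8
without the adjustment would wrongly reject it.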


36321 24-May-1998 amurai

Primary version of NetBIOS over TCP/IP. Now you can connect Windows
DOMAIN as DOMAIN user through NAT function. See also RFC1002 for
further detail of SMB structure.

Submitted by: Atsushi Murai <amurai@spec.co.jp>


36308 23-May-1998 phk

Get more details on the "arpresolve: can't allocate llinfo" bogon.

PR: 2570
Reviewed by: phk
Submitted by: fenner


36196 19-May-1998 jdp

Fix a typo-bug in ipflow_reap that could cause a NULL pointer
dereference. I have also sent this fix to Matt Thomas.


36194 19-May-1998 pb

Move (private) struct ipflow out of ip_var.h, to reduce dependencies
(for ipfw for example) on internal implementation details.
Add $Id$ where missing.


36193 19-May-1998 dg

Moved #define of IPFLOW_HASHBITS to ip_flow.c where I think it belongs.


36192 19-May-1998 dg

Added fast IP forwarding code by Matt Thomas <matt@3am-software.com> via
NetBSD, ported to FreeBSD by Pierre Beyssac <pb@fasterix.freenix.org> and
minorly tweaked by me.
This is a standard part of FreeBSD, but must be enabled with:
"sysctl -w net.inet.ip.fastforwarding=1" ...and of course forwarding must
also be enabled. This should probably be modified to use the zone
allocator for speed and space efficiency. The current algorithm also
appears to lose if the number of active paths exceeds IPFLOW_MAX (256),
in which case it wastes lots of time trying to figure out which cache
entry to drop.


36161 18-May-1998 guido

Grumble...It seems I'm suffering from some mental disease. Do it correct now.


36159 18-May-1998 guido

Add some parentheses for clarity and fix a bug
Pointed out by: Garrett Wollman


36079 15-May-1998 wollman

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for information about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


35919 10-May-1998 jb

Treat all internet addresses as u_int32_t.


35823 07-May-1998 msmith

In the words of the submitter:

---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.

Quality testing was done with testvn, and lat_fs from the lmbench suite.

Some NFS client testing courtesy of Patrik Kudo.

vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------

Submitted by: Michael Hancock <michaelh@cet.co.jp>


35698 04-May-1998 guido

Refuse accelerated opens on listening sockets that have not set
the TCP_NOPUSH socket option.
This disables TAO for those services that do not know about T/TCP.

Reviewed by: Garrett Wollman
Submitted by: Peter Wemm


35421 24-Apr-1998 dg

At the request of Garrett, changed sysctl:

net.inet.tcp.delack_enabled -> net.inet.tcp.delayed_ack


35419 24-Apr-1998 dg

Ensure that TCP_REXMTVAL doesn't return a value less than t_rttmin. This
is believed to have been broken with the Brakmo/Peterson srtt
calculation changes. The result of this bug is that TCP connections
could time out extremely quickly (in 12 seconds).
Also backed out jdp's partial fix for this problem in rev 1.17 of
tcp_timer.c as it is obsoleted by this commit.
Bug was pointed out by Kevin Lehey <kml@roller.nas.nasa.gov>.

PR: 6068
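
The clamp described above can be sketched as follows. This is only an
illustration: the struct and function names are hypothetical, and the real
TCP_REXMTVAL is a macro that derives the value from the smoothed RTT state
rather than taking it as an argument.

```c
/* Hypothetical subset of the TCP control block; t_rttmin is named in the commit. */
struct tcpcb_sketch {
	int t_rttmin;	/* minimum allowed retransmit value */
};

/*
 * Return the retransmit value, never less than t_rttmin.
 * "computed" stands in for the srtt/rttvar-based calculation.
 */
static int
rexmtval_sketch(const struct tcpcb_sketch *tp, int computed)
{
	return (computed < tp->t_rttmin ? tp->t_rttmin : computed);
}
```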


35370 21-Apr-1998 julian

Remove the artificial limit on the size of the ipfw filter structure.
This allows the addition of extra fields if we need them (I have plans).


35314 19-Apr-1998 brian

o Support a compile-time -DNO_FW_PUNCH for portability
(and those of us that don't want the functionality).
o Don't assume sizeof(long) == 4.
Ok'd by: Charles Mott <cmott@srv.net>


35304 19-Apr-1998 phk

According to:

ftp://ftp.isi.edu/in-notes/iana/assignments/port-numbers

port numbers are divided into three ranges:

0 - 1023 Well Known Ports
1024 - 49151 Registered Ports
49152 - 65535 Dynamic and/or Private Ports

This patch changes the "local port range" from 40000-44999
to the range shown above (plus fix the comment in in_pcb.c).

WARNING: This may have an impact on firewall configurations!

PR: 5402
Reviewed by: phk
Submitted by: Stephen J. Roznowski <sjr@home.net>
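
The three ranges quoted above can be written as a small classifier. This is
an illustrative sketch only; the enum and function names are hypothetical and
are not taken from in_pcb.c.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not taken from in_pcb.c. */
enum port_class { PORT_WELL_KNOWN, PORT_REGISTERED, PORT_DYNAMIC };

/* Classify a port according to the ranges quoted in the commit message. */
static enum port_class
classify_port(uint16_t port)
{
	if (port <= 1023)
		return (PORT_WELL_KNOWN);	/* 0 - 1023 */
	if (port <= 49151)
		return (PORT_REGISTERED);	/* 1024 - 49151 */
	return (PORT_DYNAMIC);			/* 49152 - 65535 */
}
```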


35256 17-Apr-1998 des

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


35210 15-Apr-1998 bde

Support compiling with `gcc -ansi'.


35174 13-Apr-1998 phk

Wrong header length used for certain reassembled IP packets.
PR: 6177
Reviewed by: phk, wollman
Submitted by: Eric Sprinkle <eric@ennovatenetworks.com>


35065 06-Apr-1998 phk

Use read_random()


35056 06-Apr-1998 phk

Remove the last traces of TUBA.

Inspired by: PR kern/3317


34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't an atomic variable, so splfoo() protection was needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now use the new variable time_second instead.

gettime() changed to getmicrotime().

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeared time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passed &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


34923 28-Mar-1998 bde

Fixed style bugs (mostly) in previous commit.


34922 28-Mar-1998 bde

Get socket and locking stuff by including <sys/socket.h> and <sys/lock.h>,
not by including <sys/mount.h> and depending on namespace pollution in it.


34916 27-Mar-1998 peter

When building it into the kernel rather than as an LKM, don't compile
all the LKM load/unload junk, and don't forget to register the SYSINIT
so that the cdevsw entry is attached.

BTW: I think the way it builds its /dev nodes on the fly as an LKM with
vnode ops is kinda cute - I guess that'd be one way to solve the devfs
persistence problems.. :-) (ie: have the drivers make the nodes in /dev
on disk directly if they are missing, but leave them alone if present).


34915 27-Mar-1998 peter

allow open on all minors


34914 27-Mar-1998 peter

A fix for a link down route cleanup panic, when the route cleanup
pulls the rug out from underneath itself.

Obtained from: wollman (a few months ago, I've been using this for ages)


34881 24-Mar-1998 wollman

Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
use.


34815 23-Mar-1998 bde

FixedSpellingErrorInAFunctionname.


34756 21-Mar-1998 peter

Make it compile.. missing "opt_ipfilter.h" and missing <sys/malloc.h>


34751 21-Mar-1998 peter

Some patchups for when this code is compiled in userland (!).


34747 21-Mar-1998 peter

replaced by FreeBSD specific version


34746 21-Mar-1998 peter

Make this compile.. There are some unpleasing hacks in here.
A major unifdef session is sorely tempting but would destroy any remaining
chance of tracking the original sources.


34745 21-Mar-1998 peter

Merge vendor changes from 3.2.1 -> 3.2.3 onto mainline


34743 21-Mar-1998 peter

This commit was generated by cvs2svn to compensate for changes in r34742,
which included commits to RCS files with non-trunk default branches.


34697 20-Mar-1998 fenner

Remove the check for SYN in SYN_RECEIVED state; it breaks simultaneous
connect. This check was added as part of the defense against the "land"
attack, to prevent attacks which guess the ISS from going into ESTABLISHED.
The "src == dst" check will still prevent the single-homed case of the
"land" attack, and guessing ISS's should be hard anyway.

Submitted by: David Borman <dab@bsdi.com>


34586 15-Mar-1998 alex

Allow ICMP unreachable messages to be sent in response to ICMP query
packets (as per Stevens volume 1 section 6.2).


33955 01-Mar-1998 guido

Make sure that you can only bind a more specific address when it is
done by the same uid.
Obtained from: OpenBSD


33897 27-Feb-1998 brian

1) in CleanupAliasData, don't nullify entry in linkTableOut
since there might be permanent entries still left after
calls to DeleteLink (it will be nullified by DeleteLink
if all entries are deleted, won't it ?)

2) in PacketAliasSetAddress, set the aliasing address
even when PKT_ALIAS_RESET_ON_ADDR_CHANGE is in effect.
Just don't clean up links in this case.

Submitted by: Ari Suutari <ari@suutari.iki.fi>
via: Charles Mott <cmott@srv.net>
PR: 5041


33851 26-Feb-1998 dima

NetBSD PR# 2772

Reviewed by: David Greenman


33846 26-Feb-1998 dg

Changes to support the addition of a new sysctl variable:
net.inet.tcp.delack_enabled
Which defaults to 1 and can be set to 0 to disable TCP delayed-ack
processing (i.e. all acks are immediate).


33814 25-Feb-1998 julian

OOPs typo TCF, not TCP....


33804 25-Feb-1998 julian

Bring our in.h up to date with respect to allocated
IP protocol numbers. It is possible that the names may require tuning,
but the numbers represent what is in rfc1700 which is the present
active RFC.


33678 20-Feb-1998 bde

Don't depend on "implicit int".


33440 16-Feb-1998 guido

Add new sysctl variable: net.inet.ip.accept_sourceroute
It controls if the system is to accept source routed packets.
It used to be such that, no matter the setting of net.inet.ip.sourceroute,
source routed packets destined for us would be accepted. Now it is
controllable, with the default set to NOT accept those.


33268 12-Feb-1998 ache

Replace non-existent ip_forwarding with ipforwarding
(compilation error)


33260 12-Feb-1998 alex

Alter ipfw's behavior with respect to fragmented packets when the packet
offset is non-zero:

- Do not match fragmented packets if the rule specifies a port or
TCP flags
- Match fragmented packets if the rule specifies neither a port nor
TCP flags

Since ipfw cannot examine port numbers or TCP flags for such packets,
it is now illegal to specify the 'frag' option with either ports or
tcpflags. Both kernel and ipfw userland utility will reject rules
containing a combination of these options.

BEWARE: packets that were previously passed may now be rejected, and
vice versa.

Reviewed by: Archie Cobbs <archie@whistle.com>
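
The matching rule described above can be sketched as a predicate. This is a
simplified illustration: struct rule_sketch and its fields are hypothetical
and do not reflect the real struct ip_fw layout.

```c
#include <stdbool.h>

/* Hypothetical, simplified rule; the real struct ip_fw differs. */
struct rule_sketch {
	bool has_ports;		/* rule names a source or destination port */
	bool has_tcpflags;	/* rule names TCP flags */
};

/*
 * Decide whether a rule may match a packet with the given fragment offset.
 * Fragments with a non-zero offset carry no TCP/UDP header, so a rule that
 * needs ports or TCP flags cannot be evaluated against them.
 */
static bool
rule_may_match_fragment(const struct rule_sketch *r, unsigned frag_offset)
{
	if (frag_offset == 0)
		return (true);	/* header present; normal matching applies */
	return (!r->has_ports && !r->has_tcpflags);
}
```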


33249 11-Feb-1998 guido

Only forward source routed packets when ip_forwarding is set to 1.
This means that a FreeBSD system will only forward source routed packets
when both net.inet.ip.forwarding and net.inet.ip.sourceroute are set
to 1.

You can hit me now ;-)
Submitted by: Thomas Ptacek


33181 09-Feb-1998 eivind

Staticize.


33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


33130 06-Feb-1998 alex

Don't attempt to display information which we don't have: specifically,
TCP and UDP port numbers in fragmented packets when IP offset != 0.

2.2.6 candidate.

Discovered by: Marc Slemko <marcs@znep.com>
Submitted by: Archie Cobbs <archie@whistle.com> w/fix from me


33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


33067 04-Feb-1998 eivind

Add #include "opt_devfs.h"


33058 03-Feb-1998 bde

Added #include of <sys/queue.h> so that this file is more self-sufficient.


33054 03-Feb-1998 bde

Forward declare some structs so that this file is more self-sufficient.


32925 31-Jan-1998 eivind

Make POWERFAIL_NMI, PPS_SYNC and NATM new style options.

This also fixes a couple of defunct options; submitted by bde.


32920 31-Jan-1998 eivind

Add #include "opt_devfs.h".


32821 27-Jan-1998 dg

Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).


32773 25-Jan-1998 steve

Fix a couple of operator precedence bugs.

PR: 5450
Submitted by: Sakari Jalovaara <sja@tekla.fi>


32752 25-Jan-1998 eivind

Make TCP_COMPAT_42 a new style option.


32662 21-Jan-1998 fenner

A more complete fix for the "land" attack, removing the "quick fix" from
rev 1.66. This fix contains both belt and suspenders.

Belt: ignore packets where src == dst and srcport == dstport in TCPS_LISTEN.
These packets can only legitimately occur when connecting a socket to itself,
which doesn't go through TCPS_LISTEN (it goes CLOSED->SYN_SENT->SYN_RCVD->
ESTABLISHED). This prevents the "standard" "land" attack, although doesn't
prevent the multi-homed variation.

Suspenders: send a RST in response to a SYN/ACK in SYN_RECEIVED state.
The only packets we should get in SYN_RECEIVED are
1. A retransmitted SYN, or
2. An ack of our SYN/ACK.
The "land" attack depends on us accepting our own SYN/ACK as an ACK
in SYN_RECEIVED state; this should prevent all "land" attacks.

We also move up the sequence number check for the ACK in SYN_RECEIVED.
This neither helps nor hurts with respect to the "land" attack, but
puts more of the validation checking in one spot.

PR: kern/5103
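
The "belt" check above amounts to a comparison on the segment's 4-tuple. A
minimal sketch, assuming a hypothetical struct tuple_sketch in place of the
fields the kernel reads from the IP and TCP headers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 4-tuple; the kernel reads these from the IP and TCP headers. */
struct tuple_sketch {
	uint32_t src_addr, dst_addr;
	uint16_t src_port, dst_port;
};

/*
 * "Belt": in TCPS_LISTEN, ignore a segment whose source address and port
 * equal its destination address and port.
 */
static bool
listen_should_ignore(const struct tuple_sketch *t)
{
	return (t->src_addr == t->dst_addr && t->src_port == t->dst_port);
}
```

A segment with matching addresses but different ports is not ignored by this
check, which is consistent with the commit's note that the multi-homed
variation is left to the "suspenders" part.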


32561 16-Jan-1998 bde

Fixed a missing #include in the synopsis.
Fixed some wrong prototypes.
Fixed a misspelled function name.

The owner of this file should add a copyright and an Id.


32560 16-Jan-1998 bde

Added prototypes for functions that were documented in libalias.3
but not prototyped here.


32498 14-Jan-1998 brian

Remove __libalias_version. Ppp no longer uses it.


32443 11-Jan-1998 eivind

Remove use of <osreldate.h>.

Screwed up by: myself


32398 10-Jan-1998 steve

Put back __libalias_version so ppp(8) builds again.


32396 10-Jan-1998 alex

Sync with ipfw interface change: fw_pts is now part of a union (a
necessary evil due to the 108 byte setsockopt() limit).


32392 10-Jan-1998 jkh

include <net/if.h> and restore this to sanity.


32377 09-Jan-1998 eivind

Teach libalias to work with IPFW firewalls (controlled by a flag).

Obtained from: Yes development tree (+ 10 lines of patches from
Charles Mott, original libalias author)


32358 09-Jan-1998 eivind

Make the BOOTP family new-style options (in opt_bootp.h)


32350 08-Jan-1998 eivind

Make INET a proper option.

This will not change any of the object files that LINT creates; there
might be differences with INET disabled, but hardly anything compiled
before without INET anyway. Now the 'obvious' things will give a
proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The
only thing that _should_ work (but can't be made to compile reasonably
easily) is sppp :-(

This commit moves struct arpcom from <netinet/if_ether.h> to
<net/if_arp.h>.


32330 08-Jan-1998 alex

Bump up packet and byte counters to 64-bit unsigned ints. As a
consequence, ipfw's list command now adjusts its output at runtime
based on the largest packet/byte counter values.

NOTE:
o The ipfw struct has changed requiring a recompile of both kernel
and userland ipfw utility.

o This probably should not be brought into 2.2.

PR: 3738


32264 05-Jan-1998 alex

Use LIST_FIRST/LIST_NEXT macros instead of accessing the fields lh_first
and le_next.


32260 05-Jan-1998 alex

Added missing parens from previous commit.


32257 05-Jan-1998 alex

Bound the ICMP type bitmap now that it doesn't cover all possible
ICMP type values.
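
A bounded lookup in a 32-bit type bitmap might look like the sketch below.
The macro and function names are hypothetical, the 32-bit size is taken from
r27981 in this log, and treating an out-of-range type as "not set" is an
assumption about the intended behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

#define ICMP_BITMAP_BITS 32	/* hypothetical name; size from r27981 */

/* Test whether an ICMP type is set in the bitmap, bounding the type first. */
static bool
icmp_type_in_bitmap(uint32_t bitmap, unsigned type)
{
	if (type >= ICMP_BITMAP_BITS)
		return (false);	/* out of range: avoid an undefined shift */
	return ((bitmap & (UINT32_C(1) << type)) != 0);
}
```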


32254 04-Jan-1998 alex

Reduce the amount of time that network interrupts are blocked while
zeroing & deleting rules.

Return EINVAL when zeroing a nonexistent entry.


32022 27-Dec-1997 alex

Bring back part of rev 1.44 which was commented out by rev 1.58.

Reviewed by: nate


31987 25-Dec-1997 dg

The spl fixes in in_setsockaddr and in_setpeeraddr that were meant to
fix PR#3618 weren't sufficient since malloc() can block - allowing the
net interrupts in and leading to the same problem mentioned in the
PR (a panic). The order of operations has been changed so that this
is no longer a problem.
Needs to be brought into the 2.2.x branch.
PR: 3618


31941 23-Dec-1997 alex

Removed unnecessary setting of 'error' -- binding to a privileged port
by a non-root user always returns EACCES.


31884 20-Dec-1997 bde

Fixed gratuitous ANSIisms.


31882 19-Dec-1997 bde

Don't use ANSI string concatenation to misformat a string.


31881 19-Dec-1997 bde

Removed a stale comment. (We don't declare ip_len and ip_offset as
short. I guess we depend on bogus ANSI value-preserving extension
of u_short to int to avoid unsigned comparison bugs.)


31848 19-Dec-1997 julian

Fix an incredibly horrible bug in the ipfw code
where if you are using the "reset tcp" firewall command,
the kernel would write ethernet headers onto random kernel stack locations.

Fought to the death by: terry, julian, archie.
fix valid for 2.2 series as well.


31840 18-Dec-1997 dg

Fixed a missing splx(s) bug in tcp_usr_send().


31838 18-Dec-1997 dg

Call in_pcballoc() at splnet(). As near as I can tell, this won't fix
any instability problems, but it was wrong nonetheless and will be
required in an upcoming round of PCB changes.


31742 15-Dec-1997 eivind

Throw options IPX, IPXIP and IPTUNNEL into opt_ipx.h.

The #ifdef IPXIP in netipx/ipx_if.h is OK (used from ipx_usrreq.c and
ifconfig.c only).

I also fixed a typo IPXTUNNEL -> IPTUNNEL (and #ifdef'ed out the code
inside, as it never could have compiled - doh.)


31323 20-Nov-1997 wollman

Add Matt Dillon's quick fix hack for the self-connect DoS.

PR: 5103


31188 16-Nov-1997 peter

This commit was generated by cvs2svn to compensate for changes in r31187,
which included commits to RCS files with non-trunk default branches.


31163 13-Nov-1997 julian

Submitted by: Archie cobbs (IPDIVERT author)
close small security hole where an attacker could send packets with
IPDIVERT protocol, and select how it would be diverted thus bypassing
the ipfirewall. Discovered by inspection rather than attack.
(you'd have to know how the firewall was configured (EXACTLY) to
make use of this but..)


31017 07-Nov-1997 phk

Rename some local variables to avoid shadowing other local variables.

Found by: -Wshadow


31016 07-Nov-1997 phk

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


30966 05-Nov-1997 joerg

Make IPDIVERT a supported option. Alas, in_var.h depends on it; I
hope I've found all the files that actually depend on this dependency.
IMHO, it's not very good practice to change the size of internal
structs depending on kernel options.


30948 05-Nov-1997 julian

Return the entire if info, rather than just the index number. (at least try)
Interface index numbers are an abomination that should go away
(at least in that form)


30816 28-Oct-1997 guido

Fix bugs from my previous commit
Submitted by: Bruce Evans


30813 28-Oct-1997 bde

Removed unused #includes.


30790 27-Oct-1997 guido

When dosourcerouting is set do not sourceroute....


30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of something :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


30309 11-Oct-1997 phk

Distribute and staticize a lot of the malloc M_* types.

Substantial input from: bde


30209 07-Oct-1997 fenner

Don't allow the window to be increased beyond what is possible to
represent in the TCP header. The old code did effectively:
win = min(win, MAX_ALLOWED);
win = max(win, what_i_think_i_advertised_last_time);
so if what_i_think_i_advertised_last_time is bigger than can be
represented in the header (e.g. large buffers and no window scaling)
then we stuff a too-big number into a short. This fix reverses the
order of the comparisons.

PR: kern/4712
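
The effect of reversing the two comparisons can be seen in a small sketch.
MAX_ALLOWED is set to 65535 here on the assumption of a 16-bit window field
with no window scaling; the function names are hypothetical.

```c
/* Assumption: largest window representable in 16 bits without scaling. */
#define MAX_ALLOWED 65535L

static long lmin_sketch(long a, long b) { return (a < b ? a : b); }
static long lmax_sketch(long a, long b) { return (a > b ? a : b); }

/* Old order: the final max() can push win back above MAX_ALLOWED. */
static long
window_old_order(long win, long last_advertised)
{
	win = lmin_sketch(win, MAX_ALLOWED);
	win = lmax_sketch(win, last_advertised);
	return (win);
}

/* New order: clamp last, so the result always fits in the header. */
static long
window_new_order(long win, long last_advertised)
{
	win = lmax_sketch(win, last_advertised);
	win = lmin_sketch(win, MAX_ALLOWED);
	return (win);
}
```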


30052 02-Oct-1997 dg

Killed the SYN_RECEIVED addition from rev 1.52. It results in legitimate
RST's being ignored, keeping a connection around until it times out, and
thus has the opposite effect of what was intended (which is to make the
system more robust to DoS attacks).


30005 30-Sep-1997 fenner

Don't consider a SYN/ACK with CC but no CCECHO a proper T/TCP
handshake.

Reviewed by: Rich Stevens <rstevens@kohala.com>


29838 25-Sep-1997 wollman

Export ipstat via sysctl. Don't understand why this wasn't done before.


29681 21-Sep-1997 gibbs

Update for new callout interface.


29514 16-Sep-1997 joerg

Make TCPDEBUG a new-style option.


29506 16-Sep-1997 bde

Fixed gratuitous ANSIisms.


29480 15-Sep-1997 ache

Prevent overflow with fragmented packets
Reviewed by: wollman


29366 14-Sep-1997 peter

Update network code to use poll support.


29327 13-Sep-1997 peter

Some mbuf -> sockaddr changes seem to have been missed here.


29268 10-Sep-1997 peter

Allow a compile-time override of the ipfw deny rule. For a 'firewall'
you don't want this (and the documentation explains why), but if you
use ipfw as an as-needed casual filter which normally runs as
'allow all' then having the kernel and /sbin/ipfw get out of sync is a
*MAJOR* pain in the behind.

PR: 4141
Submitted by: Heikki Suonsivu <hsu@mail.clinet.fi>


29179 07-Sep-1997 bde

Some staticized variables were still declared to be extern.


29162 06-Sep-1997 brian

Upgrade to 2.4 (Fix -PKT_ALIAS_UNREGISTERED_ONLY)
Submitted by: Charles Mott <cmott@srv.net>

Add __libalias_version so that ppp can derive the
correct library name for dlopen()


29024 02-Sep-1997 bde

Added used #include - don't depend on <sys/mbuf.h> including
<sys/malloc.h> (unless we only use the bogusly shared M*WAIT flags).


28723 25-Aug-1997 wollman

ICMP Timestamp Request messages could have harbored the same sort of
problem as Echo Requests when broad/multicast. When multicast echo responses
are disabled, also do the same for timestamp responses.


28683 25-Aug-1997 wollman

Configurably don't reply to broadcast or multicast echos. There are still
potential problems with other automatic-reply ICMPs, but some of them may
depend on broadcast/multicast to operate. (This code can simply be
moved to the `reflect' label to generalize it.)


28616 23-Aug-1997 alex

Fixed logging of verbose limited packets.

PR: 4351
Submitted by: Ron Bickers <rbickers@intercenter.net>


28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to now malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


28084 11-Aug-1997 brian

Fix file descriptor leak.

Submitted by: Charles Mott <cmott@srv.net>
Identified by: Gordon Burditt


27981 08-Aug-1997 alex

Support interface names up to 15 characters in length. In order to
accommodate the expanded name, the ICMP types bitmap has been
reduced from 256 bits to 32.

A recompile of kernel and user level ipfw is required.

To be merged into 2.2 after a brief period in -current.

PR: bin/4209
Reviewed by: Archie Cobbs <archie@whistle.com>


27926 06-Aug-1997 alex

Ensure that the interface name is terminated.


27864 03-Aug-1997 brian

Update to version 2.2. Only the PacketAlias*()
functions should now be used. The old 2.1 stuff is
there for backwards compatibility.
Submitted by: Charles Mott <cmott@snake.srv.net>


27845 02-Aug-1997 bde

Removed unused #includes.


27669 25-Jul-1997 brian

Recalculate ip_sum before passing a
re-assembled packet to a divert port.
Pointed-out by: Ari Suutari <ari@suutari.iki.fi>


27529 19-Jul-1997 fenner

Remove crufty LBL ifdef that only applies to Suns.

Submitted by: Craig Leres <leres@ee.lbl.gov>


27135 01-Jul-1997 jdp

Fix a bug (apparently very old) that can cause a TCP connection to
be dropped when it has an unusual traffic pattern. For full details
as well as a test case that demonstrates the failure, see the
referenced PR.

Under certain circumstances involving the persist state, it is
possible for the receive side's tp->rcv_nxt to advance beyond its
tp->rcv_adv. This causes (tp->rcv_adv - tp->rcv_nxt) to become
negative. However, in the code affected by this fix, that difference
was interpreted as an unsigned number by max(). Since it was
negative, it was taken as a huge unsigned number. The effect was
to cause the receiver to believe that its receive window had negative
size, thereby rejecting all received segments including ACKs. As
the test case shows, this led to fruitless retransmissions and
eventually to a dropped connection. Even connections using the
loopback interface could be dropped. The fix substitutes the signed
imax() for the unsigned max() function.

PR: closes kern/3998
Reviewed by: davidg, fenner, wollman
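
The signed/unsigned difference described above can be reproduced in
isolation. The helpers below are local stand-ins written for this sketch, not
the kernel's own max() and imax() definitions.

```c
/* Stand-in for an unsigned max(): arguments are compared as unsigned. */
static unsigned int
max_unsigned_sketch(unsigned int a, unsigned int b)
{
	return (a > b ? a : b);
}

/* Stand-in for a signed imax(): a negative difference stays negative. */
static int
imax_signed_sketch(int a, int b)
{
	return (a > b ? a : b);
}
```

With a difference of -100 (rcv_nxt ahead of rcv_adv), the unsigned helper sees
a very large value and selects it, while the signed helper selects the other
operand.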


26706 18-Jun-1997 wollman

Add for public examination the beginnings of the per-host cache support
which will form the basis of RTF_PRCLONING's more efficient, better-
designed replacement.


26451 04-Jun-1997 julian

make it compile with -Wall
Submitted by: Archi Cobbs, archie@whistle.com


26359 02-Jun-1997 julian

Submitted by: Whistle Communications (archie Cobbs)

these are quite extensive additions to the ipfw code.
they include a change to the API because the old method was
broken, but the user view is kept the same.

The new code allows a particular match to skip forward to a particular
line number, so that blocks of rules can be
used without checking all the intervening rules.
There are also many more ways of rejecting
connections especially TCP related, and
many many more ...

see the man page for a complete description.


26345 01-Jun-1997 peter

typo fix, s/imp/inp/; move lookup call inside splnet since there were
comments on it being outside.


26147 26-May-1997 peter

Uninitialised inp variable in div_bind().

Submitted by: Åge Røbekk <aagero@aage.priv.no>


26125 25-May-1997 darrenr

This commit was generated by cvs2svn to compensate for changes in r26124,
which included commits to RCS files with non-trunk default branches.


26113 25-May-1997 peter

Connect the ipdivert div_usrreqs struct to the ip proto switch table


26096 24-May-1997 peter

Attempt to convert the ip_divert code to use the new-style protocol request
switch. I needed 'LINT' to compile for other reasons so I kinda got the
blood on my hands. Note: I don't know how to test this, I don't know if
it works correctly.


26079 23-May-1997 julian

submitted by: archie@whistle.com

Don't search for interface addresses matching interface "NULL"
it's likely to cause a page fault..
this can be triggered by the ipfw code rejecting a locally generated
packet (e.g. you decide to make some network unreachable by local users)


26026 23-May-1997 brian

Create the alias library. This is currently only used by
ppp (or will be shortly). Natd can now be updated to use
this library rather than carrying its own version of the code.

Submitted by: Charles Mott <cmott@srv.net>


26008 22-May-1997 fenner

Disallow writing raw IP packets shorter than the IP header.


25907 19-May-1997 tegge

Break apart initialization of s and inp from the declarations in
in_setsockaddr and in_setpeeraddr.
Suggested by: Justin T. Gibbs <gibbs@plutotech.com>


25904 19-May-1997 tegge

Disallow network interrupts while the address is found and copied in
in_setsockaddr and in_setpeeraddr.
Handle the case where the socket was disconnected before the network
interrupts were disabled.
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


25822 14-May-1997 tegge

Don't send arp request for the ip address 0.0.0.0.


25723 11-May-1997 tegge

Bring in some kernel bootp support. This removes the need for netboot
to fill in the nfs_diskless structure, at the cost of some kernel
bloat. The advantage is that this code works on a wider range of
network adapters than netboot. Several new kernel options are
documented in LINT.
Obtained from: parts of the code come from NetBSD.


25604 09-May-1997 kjc

This commit was generated by cvs2svn to compensate for changes in r25603,
which included commits to RCS files with non-trunk default branches.


25516 06-May-1997 fenner

Pull up the IP header in ip_mloopback(). This makes sure that the
operations on the header inside ip_mloopback() are performed on
a private copy instead of a shared cluster.

PR: kern/3410


25502 06-May-1997 alex

Create the default rule with flags IP_FW_F_IN | IP_FW_F_OUT.
Closes PR#3100.


25201 27-Apr-1997 wollman

The long-awaited mega-massive-network-code-cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so that they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


24674 06-Apr-1997 dufault

Make MOD_* macros almost consistent:

Use the name argument almost the same in all LKM types. Maintain
the current behavior for the external (e.g., modstat) name for DEV,
EXEC, and MISC types being #name ## "_mod" and SYSCALL and VFS only
#name. This is a candidate for change and I vote just the name without
the "_mod".

Change the DISPATCH macro to MOD_DISPATCH for consistency with the
other macros.

Add an LKM_ANON #define to eliminate the magic -1 and associated
signed/unsigned warnings.

Add MOD_PRIVATE to support wcd.c's poking around in the lkm structure.

Change source in tree to use the new interface.

Reviewed by: Bruce Evans


24590 03-Apr-1997 darrenr

Resolve conflicts created by import.


24587 03-Apr-1997 darrenr

This commit was generated by cvs2svn to compensate for changes in r24586,
which included commits to RCS files with non-trunk default branches.


24570 03-Apr-1997 dg

Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman


24204 24-Mar-1997 bde

Don't include <sys/ioctl.h> in the kernel. Stage 2: include
<sys/sockio.h> instead of <sys/ioctl.h> in network files.


24203 24-Mar-1997 bde

Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include
it when it is not used. In most cases, the reasons for including it
went away when the special ioctl headers became self-sufficient.


23324 03-Mar-1997 dg

Improved performance of hash algorithm while (hopefully) not reducing
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>


23286 02-Mar-1997 peter

This commit was generated by cvs2svn to compensate for changes in r23285,
which included commits to RCS files with non-trunk default branches.


23283 02-Mar-1997 peter

This commit was generated by cvs2svn to compensate for changes in r23282,
which included commits to RCS files with non-trunk default branches.


23221 28-Feb-1997 fenner

Fix a comment and some commented-out code in ip_mloopback to
reflect how multicast loopback really works.


23082 24-Feb-1997 wollman

Fix #include order.


22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


22967 21-Feb-1997 wollman

Properly notice error returns from if_allmulti().


22962 21-Feb-1997 wollman

Fix potential crash where a user attempts to perform an implied
connect in TCP while sending urgent data. It is not clear what
purpose is served by doing this, but there's no good reason why it
shouldn't work.

Submitted by: tjevans@raleigh.ibm.com via wpaul


22952 20-Feb-1997 wollman

Fix the parameters of a call to in_setsockaddr().


22927 19-Feb-1997 darrenr

change IP Filter hooks to match new 3.1.8 patches for FreeBSD


22900 18-Feb-1997 wollman

Convert raw IP from mondo-switch-statement-from-Hell to
pr_usrreqs. Collapse duplicates with udp_usrreq.c and
tcp_usrreq.c (calling the generic routines in uipc_socket2.c and
in_pcb.c). Calling sockaddr() or peeraddr() on a detached
socket now traps, rather than harmlessly returning an error; this
should never happen. Allow the raw IP buffer sizes to be
controlled via sysctl.


22719 14-Feb-1997 wollman

Fix the mechanism for choosing whether to save the slow-start threshold
in the route. This allows us to remove the unconditional setting of the
pipesize in the route, which should mean that SO_SNDBUF and SO_RCVBUF
should actually work again. While we're at it:

- Convert udp_usrreq from `mondo switch statement from Hell' to new-style.
- Delete old TCP mondo switch statement from Hell, which had previously
been diked out.


22672 13-Feb-1997 wollman

Provide PRC_IFDOWN and PRC_IFUP support for IP. Now, when an interface
is administratively downed, all routes to that interface (including the
interface route itself) which are not static will be deleted. When
it comes back up, any addresses remaining will have their interface routes
re-added. This solves the problem where, for example, an Ethernet interface
is downed but traffic continues to flow by way of ARP entries.


22531 10-Feb-1997 darrenr

Add IP Filter hooks (from patches).


22333 06-Feb-1997 brian

Don't zero ip->ip_sum during sum validation. This should only
affect programs that sit on top of divert(4) sockets. The
multicast routing code already unconditionally zeros the sum
before recalculating.

Any code that unconditionally sums a packet without first zeroing
the sum (assuming that it's already zero'd) will break. No such
code seems to exist.
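
The validation point above can be sketched as follows. Summing every 16-bit word of a header, stored checksum included, gives 0xffff for an intact header, so the checksum field never has to be cleared just to verify it. The function names are hypothetical and this is not the kernel's in_cksum(); it assumes the header is already available as an array of 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 16-bit words, folded back into 16 bits. */
static uint16_t
ones_sum16(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < nwords; i++)
        sum += words[i];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Validation: the header is read only. An intact header, including its
 * stored checksum word, sums to 0xffff.
 */
static int
header_sum_ok(const uint16_t *words, size_t nwords)
{
    return ones_sum16(words, nwords) == 0xffff;
}

/*
 * Generation: the caller must have set the checksum word to zero first;
 * this is the case the commit message warns about.
 */
static uint16_t
header_sum_compute(const uint16_t *words, size_t nwords)
{
    return (uint16_t)~ones_sum16(words, nwords);
}
```

Only the generating side depends on the field being zero beforehand; the validating side works on the packet as received.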


22212 02-Feb-1997 brian

Reset ip_divert_ignore to zero immediately after use - also,
set it in the first place, independent of whether sin->sin_port
is set.

The result is that diverted packets that are being forwarded
will be diverted once and only once on the way in (ip_input())
and again, once and only once on the way out (ip_output()) -
twice in total. ICMP packets that don't contain a port will
now also be diverted.


21932 21-Jan-1997 wollman

Count multicast packets received for groups of which we are not
a member separately from generic ``can't forward'' packets. This
would have helped me find the previous bug much faster.


21929 21-Jan-1997 wollman

Who had the conical hat? Correct a typo, hidden by a bad cast,
which prevented IP multicast reception from happening.


21830 17-Jan-1997 joerg

This mega-merge brings Matt Thomas' 960801 FDDI driver (almost) up
to -current.

Thanks goes to Ulrike Nitzsche <ulrike@ifw-dresden.de> for giving me
a chance to test this. Only the PCI driver is tested though.

One final patch will follow in a separate commit. This is so that
everything up to here can be dragged into 2.2, if we decide so.

Reviewed by: joerg
Submitted by: Matt Thomas <matt@3am-software.com>


21785 16-Jan-1997 adam

implement "not" keyword for inverting the address logic


21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


21666 13-Jan-1997 wollman

Use the new if_multiaddrs list for multicast addresses rather than the
previous hackery involving struct in_ifaddr and arpcom. Get rid of the
abominable multi_kludge. Update all network interfaces to use the
new mechanism. Distressingly few Ethernet drivers program the multicast
filter properly (assuming the hardware has one, which it usually does).


21261 03-Jan-1997 wollman

Expose more of these structures to the user so that netstat
doesn't walk around with its KERNEL exposed.

More commits to follow...


21260 03-Jan-1997 wollman

Move the ethertypes from <netinet/if_ether.h> to <net/ethernet.h>.
Many programs need the numbers but don't need the internals of ARP.

More commits to follow...


21098 30-Dec-1996 peter

Add INADDR_LOOPBACK, moved from <rpc/rpc.h>


20532 15-Dec-1996 wollman

Some days, it just doesn't pay to get out of bed. Fix another broken
reference to the now-dead-for-real-this-time ia_next field.

Reminded by: Russell Vincent


20527 15-Dec-1996 wollman

Somehow the removal of ia_next didn't make it in the last time. Hope
it makes it in this time, and remember not to commit changes next time
late on a Friday evening!


20525 15-Dec-1996 bde

Attempt to complete the fix in the previous revision. This version
fixes the problem reported by max.


20448 14-Dec-1996 dyson

Missing TAILQ mod.


20407 13-Dec-1996 wollman

Convert the interface address and IP interface address structures
to TAILQs. Fix places which referenced these for no good reason
that I can see (the references remain, but were fixed to compile
again; they are still questionable).


20337 11-Dec-1996 wollman

Use queue macros for the list of interfaces. Next stop: ifaddrs!


20330 11-Dec-1996 wollman

Include <net/if_arp.h> in the one header that requires it,
<netinet/if_ether.h>, rather than in <net/if.h>, most of whose callers
have no need of it.

Pointed-out-by: bde


20308 11-Dec-1996 dg

Only pay attention to the offset and the IP_MF flag in ip_off. Pointed
out by Nathaniel D. Daw (daw@panix.com), but fixed differently by me.
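
One way to express "only the offset and the IP_MF flag" is a mask over ip_off. The log does not show the committed code, so this is a sketch and not necessarily what went in. The EX_ constants are local stand-ins mirroring the usual IPv4 flag layout, and is_fragment() is a hypothetical helper.

```c
#include <stdint.h>

/* Local stand-ins for the IPv4 flags/offset layout of ip_off (host order). */
#define EX_IP_RF      0x8000u   /* reserved bit */
#define EX_IP_DF      0x4000u   /* don't fragment */
#define EX_IP_MF      0x2000u   /* more fragments follow */
#define EX_IP_OFFMASK 0x1fffu   /* fragment offset, in 8-byte units */

/*
 * A packet is a fragment if more fragments follow or its offset is
 * nonzero. The DF and reserved bits are deliberately ignored.
 */
static int
is_fragment(uint16_t ip_off)
{
    return (ip_off & (EX_IP_MF | EX_IP_OFFMASK)) != 0;
}
```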


19940 23-Nov-1996 fenner

Allocate a header mbuf for the start of the encapsulated packet.
The rest of the code was treating it as a header mbuf, but it was
allocated as a normal mbuf.

This fixes the panic: ip_output no HDR when you have a multicast
tunnel configured.


19794 15-Nov-1996 fenner

Reword two messages:

duplicate ip address 204.162.228.7! sent from ethernet address: 08:00:20:09:7b:1d
changed to
arp: 08:00:20:09:7b:1d is using my IP address 204.162.228.7!

and

arp info overwritten for 204.162.228.2 by 08:00:20:09:7b:1d
changed to
arp: 204.162.228.2 moved from 08:00:20:07:b6:a0 to 08:00:20:09:7b:1d

I think the new wordings are more clear and could save some support
questions.


19669 12-Nov-1996 bde

Forward-declare `struct inpcb' so that including this file doesn't cause
lots of warnings.

Should be in 2.2. Previous version shouldn't have been in 2.2.


19622 11-Nov-1996 fenner

Add the IP_RECVIF socket option, which supplies a packet's incoming interface
using a sockaddr_dl.

Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).


19597 10-Nov-1996 fenner

Re-enable the TCP SYN-attack protection code. I was the one who didn't
understand the socket state flag.

2.2 candidate.


19262 30-Oct-1996 peter

Fix braino on my part. When we have three different port ranges (default,
"high" and "secure"), we can't use a single variable to track the most
recently used port in all three ranges.. :-] This caused the next
transient port to be allocated from the start of the range more often than
it should.
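
The problem described above can be sketched with one "most recently used" cursor per range, so that allocating from one range does not reset the position in another. The structure, names, and range bounds are assumptions for illustration only; the sketch also skips the check for ports already in use.

```c
#include <stdint.h>

/* Hypothetical per-range state: bounds plus its own cursor. */
struct ex_port_range {
    uint16_t first;     /* lowest port in the range */
    uint16_t last;      /* highest port in the range */
    uint16_t lastport;  /* most recently handed-out port, 0 if none yet */
};

/*
 * Return the next candidate port in this range, wrapping to the start
 * only when this range's own cursor runs past the end.
 */
static uint16_t
ex_next_port(struct ex_port_range *r)
{
    if (r->lastport < r->first || r->lastport >= r->last)
        r->lastport = r->first;
    else
        r->lastport++;
    return r->lastport;
}
```

With a single shared cursor, alternating requests between two ranges would push the cursor out of bounds each time and restart every allocation at the first port of its range.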


19183 25-Oct-1996 fenner

Don't allow reassembly to create packets bigger than IP_MAXPACKET, and count
attempts to do so.
Don't allow users to source packets bigger than IP_MAXPACKET.
Make UDP length and ipovly's protocol length unsigned short.

Reviewed by: wollman
Submitted by: (partly by) kml@nas.nasa.gov (Kevin Lahey)
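
The reassembly limit can be illustrated with a small bounds check. This is a sketch of the idea, not the committed code; EX_IP_MAXPACKET and fragment_fits() are hypothetical names, and the limit is taken to be 65535 bytes, the largest value a 16-bit IP length field can hold.

```c
#include <stdint.h>

#define EX_IP_MAXPACKET 65535u  /* largest total length a 16-bit field holds */

/*
 * Would accepting this fragment let the reassembled datagram exceed the
 * maximum packet size? The offset is in 8-byte units, as on the wire.
 * 32-bit arithmetic avoids wrapping while the sum is formed.
 */
static int
fragment_fits(uint32_t hdr_len, uint32_t frag_off_units, uint32_t frag_payload_len)
{
    uint32_t end = hdr_len + frag_off_units * 8u + frag_payload_len;

    return end <= EX_IP_MAXPACKET;
}
```

A fragment with the maximum offset (8191 units, 65528 bytes) and a full-sized payload fails this check, which is the kind of attempt the new counter records.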


19136 23-Oct-1996 wollman

Give ip_len and ip_off more natural, unsigned types.


19113 22-Oct-1996 sos

Changed args to the nat functions.


19035 19-Oct-1996 alex

Reword two comments.


18940 15-Oct-1996 bde

Forward-declared `struct route' for the KERNEL case so that <net/route.h>
isn't a prerequisite.

Fixed style of ifdefs.


18892 12-Oct-1996 bde

Removed nested include of <sys/socket.h> from <net/if.h> and
<net/if_arp.h> and fixed the things that depended on it. The nested
include just allowed unportable programs to compile and made my
simple #include checking program report that networking code doesn't
need to include <sys/socket.h>.


18891 12-Oct-1996 alex

Log the interface name which received the packet.

Suggested by: Hal Snyder <hsndyer@thoughtport.com>


18874 11-Oct-1996 pst

Fix two bugs I accidentally put into the syn code at the last minute
(yes I had tested the hell out of this).

I've also temporarily disabled the code so that it behaves as it previously
did (tail-drops the SYNs) pending discussion with fenner about some socket
state flags that I don't fully understand.

Submitted by: fenner


18797 07-Oct-1996 wollman

All three files: make COMPAT_IPFW==0 case work again.
ip_input.c:
- delete some dusty code
- _IP_VHL
- use fast inline header checksum when possible


18795 07-Oct-1996 dg

Improved in_pcblookuphash() to support wildcarding, and changed relevant
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be
done).

Reviewed by: wollman


18787 07-Oct-1996 pst

Increase robustness of FreeBSD against high-rate connection attempt
denial of service attacks.

Reviewed by: bde,wollman,olah
Inspired by: vjs@sgi.com


18437 21-Sep-1996 pst

I don't understand, I committed this fix (move a counter and fixed a typo)
this evening.

I think I'm going insane.


18436 21-Sep-1996 ache

Syntax error: so_incom -> so_incomp


18431 20-Sep-1996 pst

If the incomplete listen queue for a given socket is full,
drop the oldest entry in the queue.

There was a fair bit of discussion as to whether or not the
proper action is to drop a random entry in the queue. It's
my conclusion that a random drop is better than a head drop,
however profiling this section of code (done by John Capo)
shows that a head-drop results in a significant performance
increase.

There are scenarios where a random drop is more appropriate.
If I find one in reality, I'll add the random drop code under
a conditional.

Obtained from: discussions and code done by Vernon Schryver (vjs@sgi.com).


18416 20-Sep-1996 pst

Handle ICMP codes defined in RFC1812 more appropriately


18281 13-Sep-1996 pst

Move TCPCTL_KEEPINIT to end of MIB list (sigh)


18280 13-Sep-1996 pst

Make the misnamed tcp initial keepalive timer value (which is really the
time, in seconds, that state for non-established TCP sessions stays about)
a sysctl-modifiable variable.

[part 1 of two commits, I just realized I can't play with the indices as
I was typing this commit message.]


18278 13-Sep-1996 pst

Receipt of two SYNs is sufficient to set the t_timer[TCPT_KEEP]
to "keepidle". This should not occur unless the connection has
been established via the 3-way handshake, which requires an ACK.

Submitted by: jmb
Obtained from: problem discussed in Stevens vol. 3


18193 09-Sep-1996 wollman

Set subnetsarelocal to false. In a classless world, the other case
is almost never useful. (This is only a quick hack; someone should
go back and delete the entire subnetsarelocal==1 code path.)


18160 08-Sep-1996 dg

Dequeue mbuf before freeing it. Fixes mbuf leak and a potential crash when
handling IP fragments.

Submitted by: Darren Reed <avalon@coombs.anu.edu.au>


17977 31-Aug-1996 alex

Fix the visibility of the sysctl variables.

Submitted by: phk


17851 27-Aug-1996 sos

Oops, send the operation type, not the name to the NAT code...


17795 23-Aug-1996 phk

Mark sockets for which the kernel chose the port number.
This can be used by netstat to behave more intelligently.


17758 21-Aug-1996 sos

Add hooks for an IP NAT module, much like the firewall stuff...
Move the sockopt definitions for the firewall code from
ip_fw.h to in.h where it belongs.


17720 20-Aug-1996 fenner

Add #define's for RFC1716/RFC1812 new ICMP UNREACHABLE types.

Obtained from: LBL's tcpdump distribution


17587 13-Aug-1996 pst

Completely rewrite handling of protocol field for firewalls, things are
now completely consistent across all IP protocols and should be quite a
bit faster.

Discussed with: fenner & alex


17541 12-Aug-1996 peter

Add two more portrange sysctls, which control the area below
IPPORT_RESERVED that is used for selection when bind() is told to allocate
a reserved port.

Also, implement simple sanity checking for all the addresses set, to make
it a little harder for a user/sysadmin to shoot themselves in the feet.


17455 06-Aug-1996 phk

Megacommit to straighten out ETHER_ mess.

I'm pretty convinced after looking at this that the majority of our
drivers are confused about the in/exclusion of ETHER_CRC_LEN :-(


17440 05-Aug-1996 alex

Filter by IP protocol.

Submitted by: fenner (with modifications by me)

Use a common prefix string for all warning messages generated during
ip_fw_ctl.


17269 24-Jul-1996 wollman

Eliminate some more references to separate ip_v and ip_hl fields.


17227 20-Jul-1996 alex

Removed extraneous return.


17172 14-Jul-1996 alex

Switch back to logging accepted packets with the text "Allow" instead
of "Accept"


17138 12-Jul-1996 dg

Fixed two bugs in previous commit: be sure to include tcp_debug.h when
TCPDEBUG is defined, and fix typo in TCPDEBUG2() macro.


17137 12-Jul-1996 fenner

Fix braino in rev 1.30 fix; m_copy() the mbuf that has the header
pulled up already. This bug can cause the first packet from a source
to a group to be corrupted when it is delivered to a process listening
on the mrouter.


17108 12-Jul-1996 bde

Don't use NULL in non-pointer contexts.


17096 11-Jul-1996 wollman

Modify the kernel to use the new pr_usrreqs interface rather than the old
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant to -current and doesn't look at all like my current code.


17072 10-Jul-1996 julian

Adding changes to ipfw and the kernel to support ip packet diversion..
This stuff should not be too destructive if the IPDIVERT is not compiled in..
be aware that this changes the size of the ip_fw struct
so ipfw needs to be recompiled to use it.. more changes coming to clean this up.


17048 09-Jul-1996 nate

Functionality for IPFIREWALL_VERBOSE logging:
- State when we've reached the limit on a particular rule in the kernel logfile
- State when a rule or all rules have been zero'd.

This gives a log of all actions that occur with regard to the firewall,
and can explain why a particular break-in attempt might not
get logged due to the limit being reached.

Reviewed by: alex


16827 29-Jun-1996 alex

Reject rules which try to mix ports with incompatible protocols.


16678 25-Jun-1996 alex

Allow fragment checking to work with specific protocols.
Reviewed by: phk

Reject the addition of rules that will never match (for example,
1.2.3.4:255.255.255.0). User level utilities specify the policy by either
masking the IP address for the user (as ipfw(8) does) or rejecting the
entry with an error. In either case, the kernel should not modify chain
entries to make them work.


16619 23-Jun-1996 bde

Use IPFIREWALL_MODULE instead of ACTUALLY_LKM_NOT_KERNEL to indicate
LKM'ness. ACTUALLY_LKM_NOT_KERNEL is supposed to be so ugly that it
only gets used until <machine/conf.h> goes away. bsd.kmod.mk should
define a better-named general macro for this. Some places use
PSEUDO_LKM. This is another bad name.

Makefile:
Added IPFIREWALL_VERBOSE_LIMIT option (commented out).


16576 21-Jun-1996 peter

Set the rmx.rmx_expire to 0 when creating fake ethernet addresses for the
broadcast and multicast routes, otherwise they will be expired by
arptimeout after a few minutes, reverting to " (incomplete)". This makes
the work done by rev 1.27 stay around until the route itself is deleted.
This is mainly cosmetic for 'arp' and 'netstat -r'.


16557 20-Jun-1996 fenner

Use the route that's guaranteed to exist when picking a source address
for ARP requests.

The NetBSD version of this patch (see NetBSD PR kern/2381) has this change
already. This should close our PR kern/1140 .

Although it's not quite what he submitted, I got the idea from him so
Submitted by: Jin Guojun <jin@george.lbl.gov>


16548 20-Jun-1996 fenner

Remove one last rip_output from inetsw (gpalmer missed it in rev 1.30)


16542 20-Jun-1996 nate

Put the 'debug' messages of the type:
/kernel: in_rtqtimo: adjusted rtq_reallyold to 1066
/kernel: in_rtqtimo: adjusted rtq_reallyold to 710
inside of #ifdef DIAGNOSTIC to avoid the support questions from folks
asking what this means.


16413 17-Jun-1996 alex

Fix chain numbering bug when the highest line number installed >= 65435
and the rule being added has no explicit line number set.

Submitted by: Archie Cobbs <archie@whistle.com>


16367 14-Jun-1996 wollman

Better selection of initial retransmit timeout when no cached
RTT information is available.

Submitted by: kbracey@art.acorn.co.uk (Kevin Bracey)
(slightly modified by me)


16349 13-Jun-1996 gpalmer

Don't try to include opt_ipfw.h in LKMs

Submitted by: Ollivier Robert <roberto@keltia.freenix.fr>


16341 13-Jun-1996 dg

Keep ether_type in network order for BPF to be consistent with other
systems.

Submitted by: Ted Lemon, Matt Thomas, and others. Retrofitted for
-current by me.


16333 12-Jun-1996 gpalmer

Convert ipfw to use opt_ipfw.h


16322 12-Jun-1996 gpalmer

Clean up -Wunused warnings.

Reviewed by: bde


16266 09-Jun-1996 alex

Big sweep over ipfw, picking up where Poul left off:

- Log ICMP type during verbose output.
- Added IPFIREWALL_VERBOSE_LIMIT option to prevent denial of service
attacks via syslog flooding.
- Filter based on ICMP type.
- Timestamp chain entries when they are matched.
- Interfaces can now be matched with a wildcard specification (i.e.
will match any interface unit for a given name).
- Prevent the firewall chain from being manipulated when securelevel
is greater than 2.
- Fixed bug that allowed the default policy to be deleted.
- Ability to zero individual accounting entries.
- Remove definitions of old_chk_ptr and old_ctl_ptr when compiling
ipfw as a lkm.
- Remove some redundant code shared between ip_fw_init and ipfw_load.

Closes PRs: 1192, 1219, and 1267.


16206 08-Jun-1996 bde

Changed some memcpy()'s back to bcopy()'s.

gcc only inlines memcpy()'s whose count is constant and didn't inline
these. I want memcpy() in the kernel go away so that it's obvious that
it doesn't need to be optimized. Now it is only used for one struct
copy in si.c.


16143 05-Jun-1996 wollman

Instrument UDP PCB hashing to see how often the hash lookup is effective
for incoming packets.


16141 05-Jun-1996 wollman

Correct formula for TCP RTO calculation. Also try to do a better job in
filling in a new PCB's rttvar (but this is not the last word on the subject).
And get rid of `#ifdef RTV_RTT', it's been true for four years now...


16099 03-Jun-1996 jdp

Fix a bug in the handling of the "persist" state which, under certain
circumstances, caused perfectly good connections to be dropped. This
happened for connections over a LAN, where the retransmit timer
calculation TCP_REXMTVAL(tp) returned 0. If sending was blocked by flow
control for long enough, the old code dropped the connection, even
though timely replies were being received for all window probes.

Reviewed by: W. Richard Stevens <rstevens@noao.edu>


16065 02-Jun-1996 gpalmer

Correct spelling error in comment


16035 31-May-1996 peter

More closely preserve the original operation of rresvport() when using
IP_PORTRANGE_LOW.


15869 22-May-1996 wollman

Conditionalize calls to IPFW code on COMPAT_IPFW. This is done slightly
unconventionally:
If COMPAT_IPFW is not defined, or if it is defined to 1, enable;
otherwise, disable.

This means that these changes actually have no effect on anyone at the
moment. (It just makes it easier for me to keep my code in sync.)
In the future, the `not defined' part of the hack should be eliminated,
but doing this now would require everyone to change their config files.

The same conditionals need to be made in ip_input.c as well for this to
have any useful effect, but I'm not ready to do that right now.
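
The "not defined, or defined to 1" rule can be written as a preprocessor test. This is a sketch of the rule as stated in the message; EX_IPFW_HOOKS_ENABLED and ipfw_hooks_enabled() are hypothetical names, not identifiers from the tree.

```c
/*
 * Enable when COMPAT_IPFW is absent or defined to 1; disable for any
 * other value (for example, when built with -DCOMPAT_IPFW=0).
 */
#if !defined(COMPAT_IPFW) || (COMPAT_IPFW == 1)
#define EX_IPFW_HOOKS_ENABLED 1
#else
#define EX_IPFW_HOOKS_ENABLED 0
#endif

/* Reports which way the conditional resolved for this build. */
static int
ipfw_hooks_enabled(void)
{
    return EX_IPFW_HOOKS_ENABLED;
}
```

Because the undefined case counts as enabled, existing kernel config files keep their behavior, which matches the stated reason for the unconventional test.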


15850 21-May-1996 peter

Fix an embarrassing error on my part that made the IP_PORTRANGE options
return a failure code (even though it worked).
This commit brought to you by the 'C' keyword "break".. :-)


15701 09-May-1996 wollman

Make it possible to return more than one piece of control information
(PR #1178).
Define a new SO_TIMESTAMP socket option for datagram sockets to return
packet-arrival timestamps as control information (PR #1179).

Submitted by: Louis Mamakos <loiue@TransSys.com>


15681 08-May-1996 gpalmer

Remove useless entries from the inetsw structure initialiser which
only produced compile-time warnings.

Reviewed/Tested by: Bill Fenner <fenner@parc.xerox.com>


15680 08-May-1996 gpalmer

Clean up various compiler warnings. Most (if not all) were benign

Reviewed by: bde


15653 06-May-1996 phk

Several locations in sys/netinet/ip_fw.c lack or incorrectly
use spl() functions.

Reviewed by: phk
Submitted by: Alex Nash <alex@zen.nash.org>


15652 06-May-1996 wollman

Add three new route flags to help determine what sort of address
the destination represents. For IP:

- Iff it is a host route, RTF_LOCAL and RTF_BROADCAST indicate local
(belongs to this host) and broadcast addresses, respectively.

- For all routes, RTF_MULTICAST is set if the destination is multicast.

The RTF_BROADCAST flag is used by ip_output() to eliminate a call to
in_broadcast() in a common case; this gives about 1% in our packet-generation
experiments. All three flags might be used (although they aren't now)
to determine whether a packet can be forwarded; a given host route can
represent a forwardable address if:

(rt->rt_flags & (RTF_HOST | RTF_LOCAL | RTF_BROADCAST | RTF_MULTICAST))
== RTF_HOST

Obviously, one still has to do all the work if a host route is not present,
but this code allows one to cache the results of such a lookup if rtalloc1()
is called without masking RTF_PRCLONING.
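
The forwardability test quoted above can be wrapped in a small predicate. The flag bit values below are illustrative placeholders, not the real RTF_* values from <net/route.h>, and host_route_is_forwardable() is a hypothetical name.

```c
/* Placeholder bit values for illustration only. */
#define EX_RTF_HOST      0x0001u
#define EX_RTF_LOCAL     0x0002u
#define EX_RTF_BROADCAST 0x0004u
#define EX_RTF_MULTICAST 0x0008u

/*
 * A host route is forwardable when it is a host route and is neither
 * local, broadcast, nor multicast. Other flag bits are ignored.
 */
static int
host_route_is_forwardable(unsigned int rt_flags)
{
    unsigned int mask = EX_RTF_HOST | EX_RTF_LOCAL |
        EX_RTF_BROADCAST | EX_RTF_MULTICAST;

    return (rt_flags & mask) == EX_RTF_HOST;
}
```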


15525 02-May-1996 fenner

Back out my stupid braino; I was thinking strlen and not sizeof.
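
The arithmetic behind this back-out, as a short sketch: sizeof applied to a string literal counts the terminating NUL, so sizeof "123" is 4 and the original buf[4*sizeof "123"] was already 16 bytes, enough for a 15-character dotted quad plus its NUL. The names below are hypothetical.

```c
#include <string.h>

/* sizeof "123" is 4 (three digits plus NUL), so this is 16. */
enum { EX_ADDR_BUF_LEN = 4 * sizeof "123" };

/* Bytes needed for the longest dotted quad: 15 characters plus NUL. */
static size_t
longest_dotted_quad_size(void)
{
    return strlen("255.255.255.255") + 1;
}
```

Had the expression used strlen-style counting (3 per octet), the buffer would have been 12 bytes and genuinely too short.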


15524 02-May-1996 fenner

Size temp var correctly; buf[4*sizeof "123"] is not long enough
to store "192.252.119.189\0".


15414 27-Apr-1996 ache

inet_ntoa buffer was evaluated twice in log_in_vain, fix it.
Thanx to: jdp


15396 26-Apr-1996 wollman

Delete #ifdef notdef blocks containing old method of srtt calculation.

Requested by: davidg


15395 26-Apr-1996 wollman

Delete #if 0 block containing remnants of pre-MTU discovery rmx_mtu
initialization.


15394 26-Apr-1996 wollman

Delete #if 0 block containing unused definitions for ARPANET/DDN IMP
and HYPERchannel link layers.


15335 21-Apr-1996 bde

Fixed in-line IP header checksumming. It was performed on the wrong header
in one case.


15295 18-Apr-1996 wollman

Three speed-ups in the output path (two small, one substantial):

1) Require all callers to pass a valid route pointer to ip_output()
so that we don't have to check and allocate one off the stack
as was done before. This eliminates one test and some stack
bloat from the common (UDP and TCP) case.

2) Perform the IP header checksum in-line if it's of the usual length.
This results in about a 5% speed-up in my packet-generation test.

3) Use ip_vhl field rather than ip_v and ip_hl bitfields.


15294 18-Apr-1996 wollman

Define a few macros useful in the _IP_VHL case.
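
The log does not list the macros, so the following is only a sketch of what helpers for a combined version/header-length byte might look like. The EX_ names are hypothetical; the layout (version in the high nibble, header length in 32-bit words in the low nibble) is the standard IPv4 one.

```c
#include <stdint.h>

/* Pack and unpack the first byte of an IPv4 header. */
#define EX_MAKE_VHL(v, hl) ((uint8_t)(((v) << 4) | ((hl) & 0x0f)))
#define EX_VHL_V(vhl)      ((vhl) >> 4)     /* IP version */
#define EX_VHL_HL(vhl)     ((vhl) & 0x0f)   /* header length, 32-bit words */

/*
 * One byte comparison checks both "version 4" and "no IP options"
 * (header length of five words, i.e. 20 bytes).
 */
static int
is_plain_ipv4_header(uint8_t vhl)
{
    return vhl == EX_MAKE_VHL(4, 5);
}
```

Treating the two fields as one byte lets the common case be tested with a single compare instead of two bitfield extractions.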


15293 18-Apr-1996 wollman

Fix a warning by not referencing ip_output() as a pr_output() member.


15292 18-Apr-1996 wollman

Always call ip_output() with a valid route pointer. For igmp, also get the
multicast option structure off the stack rather than malloc.


15262 15-Apr-1996 dg

Two fixes from Rich Stevens:

1) Set the persist timer to help time-out connections in the CLOSING state.
2) Honor the keep-alive timer in the CLOSING state.

This fixes problems with connections getting "stuck" due to incompletion
of the final connection shutdown which can be a BIG problem on busy WWW
servers.


15238 13-Apr-1996 bde

Eliminated sloppy common-style declarations. Now there are no duplicated
common labels for LINT. There are still some common declarations for the
!KERNEL case in tcp_debug.h and spx_debug.h. trpt depends on the ones in
tcp_debug.h.


15211 12-Apr-1996 phk

Fix a bogon I introduced with my last change.

Submitted by: Andreas Klemm <andreas@knobel.gun.de>


15154 09-Apr-1996 pst

Logging UDP and TCP connection attempts should not be enabled by default.
It's trivial to create a denial of service attack on a box so enabled.

These messages, if enabled at all, must be rate-limited. (!)


15092 07-Apr-1996 dg

Added proper splnet protection while modifying the interface address list.
This fixes a panic that occurs when ifconfig ioctl(s) were interrupted
by IP traffic at the wrong time - resulting in a NULL pointer dereference.
This was originally noticed on a FreeBSD 1.0 system, but the problem still
exists in current sources.


15039 04-Apr-1996 phk

Add a sysctl (net.inet.tcp.always_keepalive: 0) that when set will force
keepalive on all tcp sessions. Setsockopt(2) cannot override this setting.
Maybe another one is needed that just changes the default for SO_KEEPALIVE ?
Requested by: Joe Greco <jgreco@brasil.moneng.mei.com>


15038 04-Apr-1996 phk

Log TCP syn packets for ports we don't listen on.
Controlled by: sysctl net.inet.tcp.log_in_vain: 1

Log UDP packets for ports we don't listen on.
Controlled by: sysctl net.inet.udp.log_in_vain: 1

Suggested by: Warren Toomey <wkt@cs.adfa.oz.au>


15028 03-Apr-1996 wollman

Always pass a route structure when calling ip_output().


15026 03-Apr-1996 phk

Add feature for tcp "established".
Change interface between netinet and ip_fw to be more general, and thus
hopefully also support other ip filtering implementations.


14998 02-Apr-1996 phk

Fix two cases where ia->ia_ifp could be NULL.


14841 27-Mar-1996 wollman

In tcp_respond(), check that ro->ro_rt is non-null before RTFREEing
it.


14824 26-Mar-1996 fenner

Make rip_input() take the header length
Move ipip_input() and rsvp_input() prototypes to ip_var.h
Remove unused prototype for rip_ip_input() from ip_var.h
Remove unused variable *opts from rip_output()


14823 26-Mar-1996 fenner

Add missing splx(s) in IP_MULTICAST_IF

Submitted by: Jim Binkley <jrb@cs.pdx.edu>


14819 25-Mar-1996 wollman

Slight modification of RTO floor calculation.


14817 25-Mar-1996 phk

Check the validity of ia->ia_ifp before we dereference it.


14761 23-Mar-1996 fenner

Send ARP's for aliased subnets with the proper source address.
Get rid of ac->ac_ipaddr and arpwhohas() since they assume that
an interface has only one address.

Obtained from: BSD/OS 2.1, via Rich Stevens <rstevens@noao.edu>


14754 22-Mar-1996 wollman

Make sure tcp_respond() always calls ip_output() with a valid
route pointer. This has no effect in the current ip_output(),
but my version requires that ip_output() always be passed a route.


14753 22-Mar-1996 wollman

A number of performance-reducing flaws fixed based on comments
from Larry Peterson &co. at Arizona:

- Header prediction for ACKs did not exclude Fast Retransmit/Recovery.
- srtt calculation tended to get ``stuck'' and could never decrease
when below 8. It still can't, but the scaling factors are adjusted
so that this artifact does not cause as bad an effect on the RTO
value as it used to.

The paper also points out the incr/8 error that has been long since fixed,
and the problems with ACKing frequency resulting from the use of options
which I suspect to be fixed already as well (as part of the T/TCP work).

Obtained from: Brakmo & Peterson, ``Performance Problems in BSD4.4 TCP''


14632 15-Mar-1996 fenner

Allow SIOCGIFBRDADDR and SIOCGIFNETMASK to return information about
aliases, if the alias address was passed in the struct ifreq.
Default to first address on the list, for backwards compatibility.


14622 14-Mar-1996 fenner

IGMPv2 routines rewritten, to be more compact and to fully comply
with the IGMPv2 Internet Draft (including Router Alert IP option)


14611 13-Mar-1996 pst

Fix ip option processing for raw IP sockets. This whole thing is a compromise
between ignoring options specified in the setsockopt call if IP_HDRINCL is set
(the UCB choice when VJ's code was brought in) vs allowing them (what everyone
else did, and what is assumed by programs everywhere...sigh).

Also perform some checking of the passed down packet to avoid running off
the end of a mbuf chain.

Reviewed by: fenner


14549 11-Mar-1996 fenner

Cleaned up uninitialized 'rt' warning properly
Make a copy of the header of a packet that gets queued due to
lack of forwarding cache entry, so that nobody else can step
on it. Thanks to Mike Karels <karels@bsdi.com> for pointing
this one out.


14546 11-Mar-1996 dg

Move or add #include <queue.h> in preparation for upcoming struct socket
changes.


14328 02-Mar-1996 peter

Add more options into the conf/options and i386/conf/options.i386 files
and the #include hooks so that 'make depend' is more useful. This
covers most of the options I regularly use (but not all) and some other
easy ones.


14293 28-Feb-1996 phk

Forgot to remove this file.


14281 27-Feb-1996 bde

Spell tcp_listendrop consistently so that tcp_input.c and netstat compile.


14268 26-Feb-1996 guido

Add a counter for the number of times the listen queue was overflowed to
the tcpstat structure. (netstat -s)
Reviewed by: wollman
Obtained from: Stevens, TCP/IP Illustrated, vol. 3, page 189


14266 26-Feb-1996 phk

Fix wrong logic, certain rules never matched.


14232 24-Feb-1996 phk

Make getsockopt() capable of handling more than one mbuf worth of data.
Use this to read rules out of ipfw.
Add the lkm code to ipfw.c


14230 24-Feb-1996 phk

The new firewall functionality:
Filter on the direction (in/out).
Filter on fragment/not fragment.


14226 23-Feb-1996 phk

I overlooked this one.


14209 23-Feb-1996 phk

Big sweep over the IPFIREWALL and IPACCT code.

Close the ip-fragment hole.
Waste less memory.
Rewrite to contemporary more readable style.
Kill separate IPACCT facility, use "accept" rules in IPFIREWALL.
Filter incoming >and< outgoing packets.
Replace "policy" by sticky "deny all" rule.
Rules have numbers used for ordering and deletion.
Remove "reorder" code entirely.
Count packet & bytecount matches for rules.

Code in -current & -stable is now the same.


14195 22-Feb-1996 peter

Make the default behavior of local port assignment match traditional
systems (my last change did not mix well with some firewall
configurations). As much as I dislike firewalls, this is one thing
I was not prepared to break by default.. :-)

Allow the user to nominate one of three ranges of port numbers as
candidates for selecting a local address to replace a zero port number.
The ranges are selected via a setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &arg)
call. The three ranges are: default, high (to bypass firewalls) and
low (to get a port below 1024).

The default and high port ranges are sysctl settable under sysctl
net.inet.ip.portrange.*

This code also fixes a potential deadlock if the system accidentally ran out
of local port addresses. It'd drop into an infinite while loop.

The secure port selection (for root) should reduce overheads and increase
reliability of rlogin/rlogind/rsh/rshd if they are modified to take
advantage of it.

Partly suggested by: pst
Reviewed by: wollman


14181 22-Feb-1996 dg

Fixed bug in Path MTU Discovery that caused the system to have to re-
discover the Path MTU for each connection if the connecting host didn't
offer an initial MSS.

Submitted by: davidg & olah


14163 20-Feb-1996 fenner

Make the "arpresolve: can't allocate llinfo" error message
more useful by printing out the IP address it was trying to
resolve, since we're seeing so many complaints about this
error.


13971 08-Feb-1996 wollman

#if out unsupported IMP code.


13929 05-Feb-1996 wollman

Provide a direct entry point for IP input. This actually results
in a slight decrease in performance, but will lead to better
performance later.


13926 05-Feb-1996 wollman

Fill in the corresponding ether address of multicast and broadcast
pseudo-``ARP entries'' so arp(8) doesn't show them as `unresolved'.


13879 03-Feb-1996 phk

Make the sorting of IPFW rules an option. You don't want it to sort them.
>>>WARNING<<< you may have to revisit your firewall setup.


13779 31-Jan-1996 olah

Fix a bug related to the interworking of T/TCP and window scaling:
when a connection enters the ESTABLISHED state using T/TCP, then window
scaling wasn't properly handled. The fix is twofold.

1) When the 3WHS completes, make sure that we update our window
scaling state variables.

2) When setting the `virtual advertized window', then make sure
that we do not try to offer a window that is larger than the maximum
window without scaling (TCP_MAXWIN).

Reviewed by: davidg
Reported by: Jerry Chen <chen@Ipsilon.COM>
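
Point 2 above amounts to a clamp on the window being offered. The sketch below is one way to picture it, not the committed code: the function name clamp_offered_window is hypothetical, and the rcv_scale parameter is an assumption added to show the general case (with rcv_scale of 0, meaning no scaling, the limit is plain TCP_MAXWIN).

```c
#include <assert.h>
#include <stdint.h>

#define TCP_MAXWIN        65535 /* largest window without scaling */
#define TCP_MAX_WINSHIFT  14    /* largest permitted scale shift */

/*
 * Hypothetical helper: limit the window we are about to advertise to
 * what the 16-bit header field can express after shifting by rcv_scale.
 */
uint32_t
clamp_offered_window(uint32_t win, unsigned rcv_scale)
{
    uint32_t limit;

    assert(rcv_scale <= TCP_MAX_WINSHIFT);
    limit = (uint32_t)TCP_MAXWIN << rcv_scale;
    return (win > limit ? limit : win);
}
```

For example, a 100000-byte receive buffer on a connection without window scaling would be offered as 65535 bytes.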


13765 30-Jan-1996 mpp

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


13638 26-Jan-1996 phk

The last part of the ether_sprint -> %6D change.
Sorry for the delay.
(%D is for hexdumping.)


13619 24-Jan-1996 phk

Use new printf features rather than local kludges.


13581 23-Jan-1996 fenner

First piece of fixing ppp/proxy arp problem:

If an attempt to add a route fails because an "ARP table" entry is in
the way, remove the ARP entry and retry the add.

Reviewed by: nate


13492 19-Jan-1996 peter

remove tcp_lastport - it has not been used for quite a while (at least
since the hashed pcb's I think).


13491 19-Jan-1996 peter

Change the default local address range for IP from 1024 through 5000
to 20000 through 30000. These numbers are used for local IP port numbers
when an explicit address is not specified.

The values are sysctl modifiable under: net.inet.ip.port_{first|last}_auto

These numbers do not overlap with any known server addresses, without going
above 32768 which are "negative" on some other implementations.

20000 through 30000 is 2.5 times larger than the old range, but some have
suggested even that may not be enough... (gasp!) Setting a low address
of 10000 should be plenty.. :-)


13486 19-Jan-1996 fenner

Add definitions for ICMP router discovery.

Reviewed by: wollman


13475 17-Jan-1996 olah

Be more conservative when T/TCP extensions are disabled. In particular,
do not send data and/or FIN on SYN segments in this case.


13357 09-Jan-1996 dg

Fix logic bug (!= should be ==) in recent P2P/multicast kludge.

Reviewed by: Bill Fenner <fenner@parc.xerox.com>
Submitted by: Dave Marquardt <marquard@austin.ibm.com>


13351 08-Jan-1996 guido

Fix a bug where having a process listening to both an INADDR_ANY and a
local address, that was assigned with ifconfig alias and netmask
0xffffffff, would receive duplicate udp packets.
This behaviour can easily be seen by having named run, and using the alias
address as the name server.
This solution is not the prettiest one, but after a talk with Garrett, it
is seen as the easiest one.


13266 05-Jan-1996 wollman

Finally demolished the last, tottering remnants of GATEWAY. If you want
to enable IP forwarding, use sysctl(8). Also did the same for IPX,
which involved inventing a completely new MIB from whole cloth (which
I may not quite have correct); be aware of this if you use IPX forwarding.
(The two should never have been controlled by the same option anyway.)


13229 04-Jan-1996 olah

Reverse the modification which caused the annoying m_copydata crash: set
the TF_ACKNOW flag when the REXMT timer goes off to force a
retransmission. In certain situations pulling snd_nxt back to snd_una
is not sufficient.


13200 03-Jan-1996 wollman

Try to make multicast routing work correctly over point-to-point
links (which was broken previously by the support for half-routers).

Submitted by: Bill Fenner <fenner@parc.xerox.com>


13091 29-Dec-1995 dg

Remove some bogus externs.


12956 21-Dec-1995 wollman

If _IP_VHL is defined, declare a single ip_vhl member in struct ip rather
than separate ip_v and ip_hl members. Should have no effect on current code,
but I'd eventually like to get rid of those obnoxious bitfields completely.


12955 21-Dec-1995 wollman

Delete old-style-broadcast-address compatibility cruft in IP input path.
If users want to use the old-style broadcast addresses, they will have to
correctly configure their systems.


12942 20-Dec-1995 wollman

in_proto.c: spell ``Internet'' right and put whitespace after commas.

others: start to populate the link-layer branch of the net mib, by
moving ARP to its proper place. (ARP is not a protocol family, it's an
interface layer between a medium-access layer and a protocol family.)
sysctl(8) needs to be taught about the structure of this branch, unless
Poul-Henning implements dynamic MIB exploration soon.


12940 20-Dec-1995 wollman

Demolish DIRECTED_BROADCAST. It was always a bad idea, and nobody uses it.


12939 20-Dec-1995 wollman

Fix a nagging divide-by-zero error resulting from the MTU discovery code
getting triggered at a bad time.


12934 19-Dec-1995 wollman

Added a comment about why trying to make a one-behind cache for
the route in ip_output() is a bad idea.


12933 19-Dec-1995 wollman

Actually call in_rtqdrain() as was originally intended.


12881 16-Dec-1995 bde

Uniformized pr_ctlinput protosw functions. The third arg is now `void
*' instead of caddr_t and it isn't optional (it never was). Most of the
netipx (and netns) pr_ctlinput functions abuse the second arg instead of
using the third arg but fixing this is beyond the scope of this round
of changes.


12877 16-Dec-1995 bde

Added a prototype.


12820 14-Dec-1995 phk

Another mega commit to staticize things.


12704 09-Dec-1995 phk

Staticize.


12693 09-Dec-1995 phk

Remove old ballast, clean up a little bit, staticize.
Add six sysctl variables that you should probably never tweak.
net.arp.t_prune: 300
net.arp.t_keep: 1200
net.arp.t_down: 20
net.arp.maxtries: 5
net.arp.useloopback: 1
net.arp.proxyall: 0

(It's net.arp because arp isn't limited to inet, though our present
implementation surely is).


12676 08-Dec-1995 wollman

Added a conditionalized printf for debugging MTU discovery.


12657 06-Dec-1995 bde

Removed unnecessary #includes of vm stuff. Most of them were once
prerequisites for <sys/sysctl.h>.

subr_prof.c:
Also replaced #include of <sys/user.h> by #include of <sys/resourcevar.h>.


12644 05-Dec-1995 bde

Added explicit include of <sys/queue.h>. Currently, some things only
compile because <vm/vm.h> happens to be gratuitously included before
<netinet/in_pcb.h> and <vm/vm.h> happens to include <sys/queue.h>.


12635 05-Dec-1995 wollman

Path MTU Discovery is now standard.


12628 05-Dec-1995 dg

all:
Removed ifnet.if_init and ifnet.if_reset as they are generally unused.
Change the parameter passed to if_watchdog to be a ifnet * rather than
a unit number. All of this is an attempt to move toward not needing an
array of softc pointers (which is usually static in size) to point to
the driver softc.

if_ed.c:
Changed some of the argument passing to some functions to make a little
more sense.

if_ep.c, if_vx.c:
Killed completely bogus use of if_timer. It was being set in such a way
that the interface was being reset once per second (blech!).


12579 02-Dec-1995 bde

Completed function declarations and/or added prototypes.


12426 20-Nov-1995 phk

fix #includes & warnings.


12376 18-Nov-1995 bde

Fixed the type of a function pointer.


12325 16-Nov-1995 bde

Fixed recent staticizations. Some prototypes for static functions were
left in headers and not staticized.


12296 14-Nov-1995 phk

New style sysctl & staticize a lot of stuff.


12172 09-Nov-1995 phk

Start adding new style sysctl here too.


12047 03-Nov-1995 olah

Cosmetic changes to processing of segments in the SYN_SENT state:
- remove a redundant condition;
- complete all validity checks on segment before calling
soisconnected(so).

Reviewed by: Richard Stevens, davidg, wollman


12046 03-Nov-1995 olah

Setting the TF_ACKNOW flag was redundant in the REXMT timeout because
tcp_output() checks for the condition snd_nxt == snd_una.

Reviewed by: davidg, wollman, olah
Suggested by: Richard Stevens


12045 03-Nov-1995 olah

Fix a logical error in T/TCP: when we actively open a connection, we
have to decide whether to send a CC or CCnew option in our SYN segment
depending on the contents of our TAO cache. This decision has to be
made once when the connection starts. The earlier code delayed this
decision until the segment was assembled in tcp_output() and
retransmitted SYN segments could have different CC options.

Reviewed by: Richard Stevens, davidg, wollman


12003 01-Nov-1995 wollman

Instrument the IP input queue with two new read-only MIB entries:
net.inet.ip.intr-queue-maxlen (=== ipintrq.ifq_maxlen)
and net.inet.ip.intr-queue-drops (=== ipintrq.ifq_drops)

There should probably be a standard way of getting the same information
going the other way.


11928 29-Oct-1995 olah

Start the 2MSL timer when the socket is closed and the TCP connection is
in the FIN_WAIT_2 state in order to prevent the conn. hanging there
forever.

Reviewed by: davidg, olah
Submitted by: Arne Henrik Juul <arnej@imf.unit.no>
Obtained from: bugs@netbsd.org


11921 29-Oct-1995 phk

Second batch of cleanup changes.
This time mostly making a lot of things static and some unused
variables here and there.


11819 26-Oct-1995 julian

Reviewed by: julian and jhay@mikom.csir.co.za
Submitted by: Mike Mitchell, supervisor@alb.asctmd.com

This is a bulk import of Mike's IPX/SPX protocol stacks and all the
related gunf that goes with it..
it is not guaranteed to work 100% correctly at this time
but as we had several people trying to work on it
I figured it would be better to get it checked in so
they could all get the same thing to work on..

Mike's been using it for a year or so
but on 2.0

more changes and stuff will be merged in from other developers now that this is in.

Mike Mitchell, Network Engineer
AMTECH Systems Corporation, Technology and Manufacturing
8600 Jefferson Street, Albuquerque, New Mexico 87113 (505) 856-8000
supervisor@alb.asctmd.com


11706 23-Oct-1995 ugen

Support all the tcpflag options in firewall.
Add reading options from a file: ipfw <filename> now reads commands
line by line from the file, in the same form as on the command line.


11680 22-Oct-1995 phk

Remove the last trace of arptnew()


11603 21-Oct-1995 dg

Fix panic caused by PRU_CONTROL not being dealt with properly. Bug pointed
out by David Maltz <dmaltz@orval.mach.cs.cmu.edu>, but this fix is by me.


11537 16-Oct-1995 wollman

The ability to administratively change the MTU of an interface presents
a few new wrinkles for MTU discovery which tcp_output() had better
be prepared to handle. ip_output() is also modified to do something
helpful in this case, since it has already calculated the information
we need.


11458 13-Oct-1995 wollman

Routes can be asymmetric. Always offer to /accept/ an MSS of up to the
capacity of the link, even if the route's MTU indicates that we cannot
send that much in their direction. (This might actually make it possible
to test Path MTU discovery in a useful variety of cases.)


11450 12-Oct-1995 wollman

The additional checks involving sequence numbers in MTU discovery resends
turned out not to be necessary; simply watching for MTU decreases (which
we already did) automagically eliminates all the cases we were trying to
protect against.


11415 10-Oct-1995 wollman

More MTU discovery: avoid over-retransmission if route changes in the
middle of a fully-open window. Also, keep track of how many retransmits
we do as a result of MTU discovery. This may actually do more work than
necessary, but it's an unusual condition...

Suggested by: Janey Hoe <janey@lcs.mit.edu>


11284 06-Oct-1995 wollman

Put newline at end of log()ed messages so syslog can't fill up your
/var quite as fast.


11225 05-Oct-1995 wollman

Convert ARP to use queue.h macros rather than insque/remque. While
we're at it, eliminate obsolete exposure of `struct llinfo_arp' to
the world. (This dates back to when ARP entries were not stored in
the routing table, and there was no other way for the `arp' program
to read the whole table than to grovel around in /dev/kmem.)


11187 04-Oct-1995 wollman

Make a whole bunch of PCB variables ints rather than shorts. There appear
to be no ill effects, and so far as I know none of the variables in
question depend on 16-bit wraparound behavior. (The sizes are in
many cases relics from when a PCB had to fit inside a 128-byte mbuf. PCBs
are no longer stored in that way, and the old structure would not have
fit, either.)


11150 03-Oct-1995 wollman

Finish 4.4-Lite-2 merge: randomize TCP initial sequence numbers
to make ISS-guessing spoofing attacks harder.


11119 01-Oct-1995 ugen

Well..finally..this is the first part..it should take care of
matching IP options..Check and test this - I made only a couple
of rough tests and this could be buggy.. Ipaccounting can't use
IP Options (and I don't see any need to count packets with specific
options either..)
More to come...


10965 22-Sep-1995 wollman

Merge 4.4-Lite-2: update version number (we already have the same fixes).

Obtained from: 4.4BSD-Lite-2


10961 22-Sep-1995 wollman

Merge 4.4-Lite-2: always check the UDP checksum if it is present, even
if we are not generating checksums. (Save a test in the input path.)


10956 22-Sep-1995 wollman

Correct spelling error in MTUDISC code.


10950 22-Sep-1995 peter

Remove duplicate definition for tcps_persistdrop, as added by davidg some
time ago. I left in Garrett's one, because his was in the 4.4-Lite-2
location, making any diffs just that little bit smaller.

I presume this choice means that netstat needs to be recompiled before
"netstat -s" will give a meaningful answer on tcp stats.


10944 21-Sep-1995 wollman

Merge with 4.4-Lite-2: fix bug that caused getsockopt of IP_HDRINCL
to fail.

Obtained from: 4.4BSD-Lite-2


10942 21-Sep-1995 wollman

Merge 4.4-Lite-2 by updating the version number.

Obtained from: 4.4BSD-Lite-2


10941 21-Sep-1995 wollman

Merge 4.4-Lite-2: update some declarations that we don't support anyway.

Obtained from: 4.4BSD-Lite-2


10940 21-Sep-1995 wollman

Merge 4.4-Lite-2: use M_NOWAIT in in_pcballoc(), and return EACCES rather
than EPERM on illegal attempt to bind a reserved port.

Obtained from: 4.4BSD-Lite-2


10939 21-Sep-1995 wollman

Merge with 4.4-Lite-2. This is actually a 64-bit fix; the second parameter
to in_control() is sometimes a pointer, and sometimes an integer, so use
u_long rather than int.

Obtained from: 4.4BSD-Lite-2


10938 21-Sep-1995 wollman

Merge with 4.4-Lite-2. This involves changing the version number and
moving a declaration around.

Obtained from: 4.4BSD-Lite-2


10937 21-Sep-1995 wollman

Merge with 4.4-Lite-2. This just adds a couple of tcpstat entries which
we don't currently set, but might in the future.


10930 20-Sep-1995 wollman

Add support in TCP for Path MTU discovery. This is highly experimental
and gated on `options MTUDISC' in the source. It is also practically
untested because (sniff!) I don't have easy access to a network with
an MTU of less than an Ethernet. If you have a small MTU network,
please try it and tell me if it works!


10881 18-Sep-1995 wollman

Initial back-end support for IP MTU discovery, gated on MTUDISC. The support
for TCP has yet to be written.


10714 13-Sep-1995 wollman

Don't leak mbufs in an unusual error case in tcp_usrreq().

Reviewed by: Andras Olah <olah@freebsd.org>
Obtained from: Lite-2


10712 13-Sep-1995 wollman

If tcp_output() is unable to allocate space for a copy of the data waiting
to be sent, just clean up and return ENOBUFS rather than silently
proceeding without sending any of the data. This makes it consistent
with the `#ifdef notyet' case immediately above.

Reviewed by: Andras Olah <olah@freebsd.org>
Obtained from: Lite-2


10421 29-Aug-1995 wollman

Fix long-standing bug in ICMPPRINTFS code where NTOHL was used instead
of ntohl for printing IP addresses, by instead substituting inet_ntoa()
to produce human-readable output.

Obtained from: 4.4-Lite-2


10203 23-Aug-1995 wollman

Fix some problems with multicast forwarding:

Garrett,

Here are some patches for the rate limiting code. It should be faster,
and in particular it doesn't leak malloc'd memory any more when rate_limit'ing
a phyint.

It now uses an mbuf chain at each vif, instead of the static queue array.
This means that the MAXQSIZE is now variable per vif (although there is no
interface to change it other than a debugger); this is an area for more
experimentation.

Bill

Submitted by: Bill Fenner <fenner@parc.xerox.com>


10095 17-Aug-1995 olah

Add a sanity check for the UDP length field in order to prevent
malformed UDP packets from panicking the kernel.
Reviewed by: davidg, wollman
Obtained from: dab@berserkly.cray.com (David A. Borman) via end2end list
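
The commit message does not say exactly what the check looks like. As a minimal sketch of what such a sanity check typically involves, assuming the claimed UDP length is compared against the bytes actually delivered after the IP header (the function name udp_length_ok and its parameters are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

#define UDP_HDR_LEN 8   /* size of the fixed UDP header in bytes */

/*
 * Hypothetical helper: return 1 if the length claimed in the UDP header
 * is plausible for a datagram whose IP payload is ip_payload_len bytes,
 * 0 if the packet should be dropped.
 */
int
udp_length_ok(uint16_t udp_len, size_t ip_payload_len)
{
    if (udp_len < UDP_HDR_LEN)
        return (0);     /* cannot even hold its own header */
    if (udp_len > ip_payload_len)
        return (0);     /* claims more data than was received */
    return (1);
}
```

A claimed length shorter than the IP payload is accepted here, since the excess can simply be trimmed; whether the real code does so is not stated in the log.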


9820 31-Jul-1995 gpalmer

Try to make the `syn' blocking code act a bit more sensibly - don't
block `syn' packets that have `ack' set.


9818 31-Jul-1995 olah

Remove a redundant `if' from tcp_reass().

Correct a typo in a comment (SEND_SYN -> NEEDSYN).

Reviewed by: David Greenman


9773 29-Jul-1995 dg

Add connection drop capability for persist timeouts.

Reviewed by: Andras Olah
Obtained from: 4.4BSD-lite2 via W. Richard Stevens


9728 26-Jul-1995 wollman

Fix test for determining when RSVP is inactive in a router. (In this
case, multicast options are not passed to ip_mforward().) The previous
version had a wrong test, thus causing RSVP mrouters to forward RSVP messages
in violation of the spec.


9682 24-Jul-1995 wollman

Declare rsvp_input() to take the correct set of arguments and figure out
the receipt interface in the correct way.


9680 24-Jul-1995 wollman

Completely turn off RSVP intercept when a socket being used for that purpose
is PRU_DETACHed. This solves the problem that RSVP would not come up in
raw mode if previously killed.


9661 23-Jul-1995 dg

Added $Id$.


9575 18-Jul-1995 peter

Change the compile-time option of DIRECTED_BROADCAST into a sysctl
variable underneath ip, "directed-broadcast".
Reviewed by: David Greenman
Obtained from: NetBSD, by Darren Reed.


9563 17-Jul-1995 wollman

Return EDESTADDRREQ rather than EADDRNOTAVAIL if the user attempts to
half-configure a point-to-point interface.

Submitted by: Jonathan M. Bresler <jmb@kryten.atinc.com>


9472 10-Jul-1995 wollman

ICMP messages received from broken hosts which reply to multicast packets
were mistakenly delivered, rather than getting thrown out, which caused
substantial lossage.

Submitted by: Bill Fenner <fenner@parc.xerox.com>


9470 10-Jul-1995 wollman

tcp_input.c - keep track of how many times a route contained a cached rtt
or ssthresh that we were able to use

tcp_var.h - declare tcpstat entries for above; declare tcp_{send,recv}space

in_rmx.c - fill in the MTU and pipe sizes with the defaults TCP would have
used anyway in the absence of values here


9460 09-Jul-1995 dg

Fixed panic that occurs on certain firewall rejected packets that was
caused by dtom() being used on an mbuf cluster. The fix involves passing
around the mbuf pointer.

Submitted by: Bill Fenner


9392 04-Jul-1995 dg

Added some spaces for KNF. Moved some zero-initialized pointers into the
kernel's .bss.


9391 04-Jul-1995 dg

This is the end result of about a dozen passes through this code to fix
incorrect indents, a variety of poor coding practices such as comparing
pointers to constants ('0'), poor code structuring, etc, etc. This brings
the code up to the minimum standards for inclusion in FreeBSD.


9390 04-Jul-1995 dg

Define TRUE and FALSE.


9389 04-Jul-1995 dg

1) Removed bogus #include
2) Rewrote "bad_packet" code to be less buggy and more readable.
3) Removed a pile of goto's; the code is now somewhat less reminiscent
of a certain Italian pasta.
4) Changed all boolean returns of "0" and "1" to FALSE/TRUE.


9386 02-Jul-1995 joerg

Slightly modify my previous change to return EINVAL instead of
EFAULT.

Submitted by: Peter Wemm


9383 01-Jul-1995 joerg

I saw a very low-key commit message on the netbsd mailing lists and
figured out what the problem was.. Anyway, I rate it as "highly
serious".

Submitted by: peter@haywire.DIALix.COM (Peter Wemm)


9373 29-Jun-1995 wollman

Keep track of the number of samples through the srtt filter so that we
know better when to cache values in the route, rather than relying on a
heuristic involving sequence numbers that broke when tcp_sendspace
was increased to 16k.


9359 28-Jun-1995 gpalmer

Add a missing `goto' statement so that this compiles yet again.


9347 28-Jun-1995 dg

Added function prototypes for ip_rsvp_vif_init, ip_rsvp_vif_done, and
ip_rsvp_force_done.


9339 27-Jun-1995 wollman

Delete obsolete #if 0 block.


9338 27-Jun-1995 guido

reject option in ip_fw used to panic the system. This fixes it.

-Guido


9334 26-Jun-1995 wollman

From Bill Fenner:

> Also, I don't remember if I sent you this; it affects PIM assert processing.

Submitted by: Bill Fenner <fenner@parc.xerox.com>


9333 26-Jun-1995 wollman

Corrected a bug that caused protocol-4 tunnels (used for multicast
forwarding between networks that aren't directly connected) not to work
by intercepting the wrong protocol number. This should fix a bug reported
previously by someone I don't remember.


9279 21-Jun-1995 wollman

Fix an error in the comparison direction of the ap->updating case of
in_rtqkill().

Submitted by: W. Richard Stevens


9266 19-Jun-1995 wollman

Fix a resource allocation bug where multicast forwarding would leak mbufs
in certain cases when allocation of another mbuf has already failed.

Submitted by: Bill Fenner <fenner@parc.xerox.com>


9263 19-Jun-1995 wollman

Now that we've gone to all sorts of effort to allow TCP to cache some of
its connection parameters, we want to keep statistics on how often this
actually happens to see whether there is any work that needs to be done in
TCP itself.

Suggested by: John Wroclawski <jtw@lcs.mit.edu>


9209 13-Jun-1995 wollman

Kernel side of 3.5 multicast routing code, based on work by Bill Fenner
and other work done here. The LKM support is probably broken, but it
still compiles and will be fixed later.


9202 11-Jun-1995 rgrimes

Merge RELENG_2_0_5 into HEAD


8876 30-May-1995 rgrimes

Remove trailing whitespace.


8546 16-May-1995 dg

These diffs modify the behaviour of multicast clients to conform with the
IGMPv2 spec. This fixes the following bugs:

o ntohs() on a char provides silly results
o timer needs to be scaled to units of PR_FASTHZ; this was being done
inconsistently so now it gets done when it is initialized.

Reviewed by: Garrett Wollman
Submitted by: Bill Fenner <fenner@parc.xerox.com>
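
To illustrate the two bugs listed above: a one-byte field has no byte order, so passing it through ntohs() on a little-endian host moves it into the high byte (100 becomes 25600). The timer scaling can be sketched as below. The helper name is hypothetical, and the constant values (PR_FASTHZ of 5, response time carried in tenths of a second) are assumptions that should be checked against <sys/protosw.h> and <netinet/igmp.h>.

```c
#define PR_FASTHZ         5   /* assumed: fast protocol timeouts per second */
#define IGMP_TIMER_SCALE  10  /* assumed: field is in units of 1/10 second */

/*
 * Hypothetical helper: convert the one-byte response-time code from an
 * IGMP query into fast-timeout ticks. No byte swapping is applied.
 * Integer division truncates, so very small codes yield 0 ticks.
 */
unsigned
igmp_code_to_fast_ticks(unsigned char code)
{
    return ((unsigned)code * PR_FASTHZ / IGMP_TIMER_SCALE);
}
```

With these values a code of 100 (ten seconds) maps to 50 ticks.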


8483 12-May-1995 ache

Fix getsockopt(IP_ACCT_*) to not panic kernel
Submitted by: Bill Fenner <fenner@parc.xerox.com>


8456 11-May-1995 rgrimes

Fix -Wformat warnings from LINT kernel.


8429 11-May-1995 dg

#ifdef'd my Nagle/ACK hack with "TCP_ACK_HACK", disabled by default. I'm
currently considering reducing the TCP fasttimo to 100ms to help improve
things, but this would be done as a separate step at some point in the
future.
This was done because it was causing some sometimes serious performance
problems with T/TCP.


8426 11-May-1995 wollman

Make networking domains drop-ins, through the magic of GNU ld. (Some day,
there may even be LKMs.) Also, change the internal name of `unixdomain'
to `localdomain' since AF_LOCAL is now the preferred name of this family.
Declare netisr correctly and in the right place.


8384 09-May-1995 dg

Replaced some bcopy()'s with memcpy()'s so that gcc will inline/optimize.


8377 09-May-1995 olah

Fix a misspelled constant in tcp_input.c.

On Tue, 09 May 1995 04:35:27 PDT, Richard Stevens wrote:
> In tcp_dooptions() under the case TCPOPT_CC there is an assignment
>
> to->to_flag |= TCPOPT_CC;
>
> that should be
>
> to->to_flag |= TOF_CC;
>
> I haven't thought through the ramifications of what's been happening ...
>
> Rich Stevens

Submitted by: rstevens@noao.edu (Richard Stevens)
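
A small sketch of why the misspelled constant matters: TCPOPT_CC is an option kind number from the wire format, while TOF_CC is a single bit in a flags word, so OR-ing in the former sets several unrelated bits. The numeric values below are assumptions believed to match headers of that era (check <netinet/tcp.h> and <netinet/tcp_var.h>), and the function names are hypothetical.

```c
#include <stdint.h>

#define TCPOPT_CC   11      /* assumed: option kind on the wire */
#define TOF_TS      0x0001  /* assumed flag bits */
#define TOF_CC      0x0002
#define TOF_CCNEW   0x0004
#define TOF_CCECHO  0x0008

/* What the misspelled line did: 11 is binary 1011, i.e. three flag bits. */
uint32_t
to_flag_buggy(uint32_t to_flag)
{
    return (to_flag | TCPOPT_CC);
}

/* What was intended: record only that a CC option was seen. */
uint32_t
to_flag_fixed(uint32_t to_flag)
{
    return (to_flag | TOF_CC);
}
```

Under these assumed values the buggy form would also mark a timestamp and a CCecho option as present.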


8293 05-May-1995 ache

Add IPTOS_MINCOST according to RFC 1349
Change IPTOS_PREC_ROUTINE to 0 (was conflict with IPTOS_LOWDELAY) according
to RFC 791 (unchanged since it) and BSDI 2.0 style
Submitted by: Igor Sviridov <siac@ua.net>


8235 03-May-1995 dg

Changed in_pcblookuphash() to not automatically call in_pcblookup() if
the lookup fails. Updated callers to deal with this. Call in_pcblookuphash
instead of in_pcblookup() in in_pcbconnect; this improves performance of
UDP output by about 17% in the standard case.


8090 26-Apr-1995 pst

Cleanup loopback interface support.
Reviewed by: wollman


8071 25-Apr-1995 wollman

Disallow half-configured point-to-point interfaces. It's still possible to
get into a half-configured state by using the old-style ioctls; this
may be a feature.


7933 19-Apr-1995 olah

Include <sys/queue.h> because <netinet/in_pcb.h> (also included
later in tcp_debug.c) requires it due to the pcb changes of DavidG.


7770 12-Apr-1995 dg

Fixed bug I introduced when changing PCB list to use 4.4BSD style queue
macros. Basically, detect 'tp' going away differently.


7738 10-Apr-1995 dg

Further satisfy my paranoia by making sure that the ACKNOW is only
set when ti_len is non-zero.


7737 10-Apr-1995 dg

Fixed bug I introduced with my Nagle hack which caused tcp_input and
tcp_output to loop endlessly. This was freefall's problem during the past
day.


7735 10-Apr-1995 dg

Added splnet protections for PCB list manipulations and traversals.


7728 10-Apr-1995 dg

Backed out Jordan's #include of queue.h


7720 09-Apr-1995 jkh

#include <sys/queue.h> or die horribly.


7684 09-Apr-1995 dg

Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.


7634 05-Apr-1995 olah

Fix a bug in tcp_input reported by Rick Jones <raj@hpisrdq.cup.hp.com>.

If a goto findpcb occurred during the processing of a segment, the TCP and
IP headers were dropped twice from the mbuf which resulted in data acked
by TCP but not delivered to the user.
Reviewed by: davidg


7593 02-Apr-1995 bde

Remove redundant declarations.


7575 02-Apr-1995 wpaul

Add declaration for struct ether_addr (this is where Sun documents
it to go).


7504 30-Mar-1995 dg

Backed out changes in rev 1.5 that prevent sending FIN if in CLOSING
state. This causes an infinite loop in some rare cases (probably caused
by some other, much more difficult to find bug).


7417 27-Mar-1995 dg

Re-apply my "breakage" to the Nagle congestion avoidance. This version
differs slightly in the logic from the previous version; packets are now
acked immediately if the sender set PUSH.


7280 23-Mar-1995 wollman

in_var.h: in_multi structures now form a queue(3)-style LIST structure
in.c: when an interface address is deleted, keep its multicast membership
. records (attached to a struct multi_kludge) for attachment to the
. next address on the same interface. Also, in_multi structures now
. gain a reference to the ifaddr so that they won't point off into
. freed memory if an interface goes away and doesn't come back before
. the last socket reference drops. This is analogous to how it is
. done for routes, and seems to make the most sense.


7191 20-Mar-1995 wollman

This should be splimp() rather than splnet() since ifaddrs might go away
as a result of link-layer processing.


7190 20-Mar-1995 wollman

Fix race conditions involved in setting IP multicast options. This should
fix Dennis Fortin's problem for good, if I've got it figured out right.

(The problem was that a `struct ifaddr' could get deleted out from under
the current requester, thus leaving him with an invalid interface pointer
and causing even more bogus accesses.)


7170 19-Mar-1995 dg

Removed redundant newlines that were in some panic strings.


7091 16-Mar-1995 wollman

Reject source routes unless configured on by administrator.


7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


7088 16-Mar-1995 wollman

Add inet_ntoa() and replace ARP's private routine with same.


7083 16-Mar-1995 wollman

This set of patches enables IP multicasting to work under FreeBSD. I am
submitting them as context diffs for the following files:

sys/netinet/ip_mroute.c
sys/netinet/ip_var.h
sys/netinet/raw_ip.c
usr.sbin/mrouted/igmp.c
usr.sbin/mrouted/prune.c

The routine rip_ip_input in raw_ip.c is suggested by Mark Tinguely
(tinguely@plains.nodak.edu). I have been running mrouted with these patches
for over a week and nothing has seemed seriously wrong. It is being run in
two places on our network as a tunnel on one and a subnet querier on the
other. The only problem I have run into is that mrouted on the tunnel must
start up last or the pruning isn't done correctly and multicast packets
flood your subnets.

Submitted by: Soochon Radee <slr@mitre.org>


7060 14-Mar-1995 dg

pcb allocations are not always done on behalf of a process; it is not
okay to wait.


7055 14-Mar-1995 dg

Added support for generic FDDI and the DEC DEFEA and DEFPA FDDI adapters.

Submitted by: Matt Thomas


7035 12-Mar-1995 ugen

Allocate memory as M_IPFW, now we can watch firewall memory usage
in vmstat..


6922 06-Mar-1995 nate

Removed unnecessary define for TCPOUTFLAGS since they are not used.


6835 02-Mar-1995 dg

Move exact match pcb's to the head of the list to improve lookup
performance.


6690 24-Feb-1995 ugen

Allow "via" to be specified either as IP address or
as interface name/unit...


6616 22-Feb-1995 bde

Fix benign type mismatch.


6568 20-Feb-1995 dg

Added missing newlines to calls to log().


6510 17-Feb-1995 wollman

Include missing <sys/kernel.h> for `hz'.

Submitted by: David Greenman, Rod Grimes, Christoph Kukulies


6483 16-Feb-1995 wollman

Don't need to retransmit FIN bit in CLOSING state.

Obtained from: Stevens, vol. 2, exercise 29.5 (solution p. 1090)


6482 16-Feb-1995 wollman

spl back down in unusual out-of-memory condition in udp_output().

Obtained from: Stevens, vol. 2, exercise 23.4 (solution p. 1083)


6481 16-Feb-1995 wollman

Correctly initialize so_linger in ticks (not seconds).

Obtained from: Stevens, vol. 2, p. 1010


6480 16-Feb-1995 wollman

Avoid deadlock situation described by Stevens using his suggested replacement
code.

Obtained from: Stevens, vol. 2, pp. 959-960


6479 16-Feb-1995 wollman

Don't add back in the IP header length to ip_len; icmp_error will do it
for us.

Obtained from: Stevens, vol. 2, p. 774


6475 16-Feb-1995 wollman

Transaction TCP support now standard. Hack away!


6472 16-Feb-1995 wollman

Add lots of useful MIB variables and a few not-so-useful ones for
completeness.


6400 14-Feb-1995 wollman

After dynamically reducing rtq_reallyold, have in_rtqkill() reduce the
expiration timer of anything which would expire later than that. (There
should be a way to call this from ip_sysctl() as well, but there currently
isn't.)


6399 14-Feb-1995 wollman

Attempt to make the host route cache a bit smarter under conditions of
high load:

1) If there ever get to be more than net.inet.ip.rtmaxcache entries
in the cache, in_rtqtimo() will reduce net.inet.ip.rtexpire by
1/3 and do another round, unless net.inet.ip.rtexpire is less than
net.inet.ip.rtminexpire, and never more than once in ten minutes
(rtq_timeout).

2) If net.inet.ip.rtexpire is set to zero, don't bother to cache
anything.
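
Rule 1 above can be read as the following decision procedure. This is a sketch based only on the wording of the message, not the committed code: the struct, its field names, and rtq_maybe_reduce are hypothetical, and the extra cleanup round that follows a reduction is not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define RTQ_ADJUST_INTERVAL 600 /* at most one reduction per ten minutes */

/* Hypothetical model of the tunables named in the message. */
struct rtq_state {
    long expire;        /* models net.inet.ip.rtexpire, in seconds */
    long min_expire;    /* models net.inet.ip.rtminexpire, in seconds */
    long max_cache;     /* models net.inet.ip.rtmaxcache, in entries */
    long last_adjust;   /* time of the previous reduction, in seconds */
};

/* Returns 1 if the expiry time was cut by one third, 0 otherwise. */
int
rtq_maybe_reduce(struct rtq_state *st, long cached_entries, long now)
{
    assert(st != NULL);
    if (cached_entries <= st->max_cache)
        return (0);     /* cache is not over its limit */
    if (st->expire < st->min_expire)
        return (0);     /* already below the floor */
    if (now - st->last_adjust < RTQ_ADJUST_INTERVAL)
        return (0);     /* reduced too recently */
    st->expire -= st->expire / 3;
    st->last_adjust = now;
    return (1);
}
```

Starting from a one-hour expiry, two reductions ten minutes apart would bring it to 2400 and then 1600 seconds.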


6363 14-Feb-1995 phk

YFfix.


6362 14-Feb-1995 phk

YPfix


6348 14-Feb-1995 wollman

Get rid of some unneeded #ifdef TTCP lines. Also, get rid of some
bogus commons declared in header files.


6283 09-Feb-1995 wollman

Merge Transaction TCP, courtesy of Andras Olah <olah@cs.utwente.nl> and
Bob Braden <braden@isi.edu>.

NB: This has not had David's TCP ACK hack re-integrated. It is not clear
what the correct solution to this problem is, if any. If a better solution
doesn't pop up in response to this message, I'll put David's code back in
(or he's welcome to do so himself).


6257 09-Feb-1995 dg

Fixed another TTCP ifdef problem...there isn't any tcp_sysctl field in
!TTCP.


6256 09-Feb-1995 dg

Fix/#ifdef prototype for tcp_mss...apparently overlooked by Garrett.


6248 08-Feb-1995 wollman

T/TCP changes to generic IP code. This is all ifdefed TTCP so should
have no effect on most users for now. (Eventually, once this code is
fully tested, the ifdefs will go away.)


6247 08-Feb-1995 wollman

Merge in T/TCP TCP header file changes.


6237 07-Feb-1995 gpalmer

Remove a possible loophole - previously the code wouldn't pass packets destined
to the loopback address to the packet filter.

Reviewed by: "Ugen J.S.Antsilevich" <ugen@netvision.net.il>


6224 07-Feb-1995 wollman

Make sure to disable RSVP intercept when the socket is closed.


5941 26-Jan-1995 wollman

Correct long-standing error in the RSVP hooks (would initialize but never
return success).


5936 26-Jan-1995 ugen

ip_fwdef.c was missing some assignments, which caused the bug where
the firewall code did not work when configured into the kernel and
worked only as an LKM.
This should now be fixed... Sorry guys.


5919 26-Jan-1995 dg

Kill previous commit as it isn't necessary.


5835 24-Jan-1995 dg

Extended the previous change to cover the non-options case, too.


5802 23-Jan-1995 dg

Applied fix from Andreas Schulz with a different comment by me. Fixes a
bug where TCP connections are closed prematurely.

Submitted by: Andreas Schulz


5792 23-Jan-1995 wollman

Change caching strategy somewhat:
1) Don't clone routes to multicast destinations; there is nothing useful
to be gained in this case.
2) Reduce default expiration timer to one hour. Busy sites will still
likely want to reduce this, but for ordinary users this is a reasonable
value to use.


5543 12-Jan-1995 ugen

Actual firewall change.
1) The firewall is no longer subdivided into forwarding / blocking
chains. Only one chain is left: the one that was the blocking chain.
2) LKM support. ip_fwdef.c holds the function pointer definitions and
goes into the kernel along with all the INET stuff.


5534 12-Jan-1995 dg

Fixed mbuf lossage when level != IPPROTO_IP. Problem reported by Robert
Dobbs, hint from Charles Hannum, fix by me.


5196 22-Dec-1994 wollman

Make arp_rtrequest() static since nobody needs to reference it any more.


5195 22-Dec-1994 wollman

Move ARP interface initialization into if_ether.c:arp_ifinit().


5180 21-Dec-1994 wollman

Avoid a serious race by blocking netisrs while walking the route tree.
(IWBRNI we could just block IP netisrs...)


5179 21-Dec-1994 wollman

Correct sysctl info so that net.inet.ip.rtexpire is actually accessible.


5112 15-Dec-1994 wollman

Fix PR 59: don't allow TCP connections with multicast addresses at either
end.


5109 14-Dec-1994 wollman

Make rtq_reallyold user-configurable via sysctl.


5105 13-Dec-1994 wollman

Call rtalloc_ign() so that protocol cloning will not occur at the IP layer.


5101 13-Dec-1994 wollman

Update calls to rtalloc1(). Also merge rt_prflags with rt_flags.


5089 13-Dec-1994 ugen

Add a control to clear a single accounting entry.
Structure fields changed to look more standard.


5086 12-Dec-1994 ugen

Late patch for delete control..


5085 12-Dec-1994 ugen

Add matching by the interface on which the packet arrived (via).
Handle fragmented packets correctly. Remove the checking option
from the kernel.


5045 11-Dec-1994 wollman

Advanced route cache management is now an official part of IP support.


4909 02-Dec-1994 wollman

Delete old, confusing comment.


4896 02-Dec-1994 wollman

Add a check to make sure that we don't fiddle with the NFS routing tables
as well (bleah!). Also, increase the interval to the real-life value and
eliminate debugging printfs. This will be standard once tested by others.


4893 01-Dec-1994 wollman

Add latest version of ``advanced route metric management'' :-)
As before, this is currently conditionalized on options IN_RMX until
I'm sure it's working.


4849 28-Nov-1994 ugen

Added: ICMP reply, TCP SYN check, logging.


4523 16-Nov-1994 jkh

Ugen J.S.Antsilevich's latest, happiest, IP firewall code.
Poul: Please take this into BETA. It's non-intrusive, and a rather
substantial improvement over what was there before.


4286 08-Nov-1994 jkh

Ugen makes it in with 10 seconds to spare with a one-char diff. Some
people are born lucky..
Submitted by: ugen


4277 08-Nov-1994 jkh

Almost 12th hour (the 11th hour was almost an hour ago :-) patches
from Ugen.


4234 07-Nov-1994 jkh

2 11th-hour fixes from Ugen (not Uben, sorry!) J.S.Antsilevich.
I think it's time for Ugen to get a freefall account, just so I can
direct mail at him directly and let him drop off patches for us here. Ugen?
Done!
Submitted by: ugen


4127 03-Nov-1994 wollman

Fix off-by-one error reported to NetBSD by Karl Fox in
<9411031449.AA11102@gefilte.MorningStar.Com>.


4105 03-Nov-1994 wollman

Completely replace JTW's idea with my (incompletely implemented) original
idea. This is less likely to crash your machine. As before, this code is only
enabled under `options IN_RMX'.


4074 02-Nov-1994 wollman

This is the file that actually implements the smarter behavior.


4073 02-Nov-1994 wollman

Add code to be a bit smarter about IP routes, conditioned on the option
IN_RMX. (Eventually this will be standard, but I just wrote the code today
and don't want to break anyone.)


4069 02-Nov-1994 wollman

Clean up ARP error messages: format IP addresses, explain arplookup()
failures in English.


4036 31-Oct-1994 jkh

Latest changes from Uben.
Submitted by: uben


4028 31-Oct-1994 pst

Detect old-style multicast routers and interoperate properly


3969 28-Oct-1994 jkh

IP Firewall code from Daniel Boulet and J.S.Antsilevich
Submitted by: danny ugen


3865 25-Oct-1994 swallace

Patch for proper multicast support on point-to-point links.
Submitted by: apg@demos.su (Paul Antonov) - patch020


3747 21-Oct-1994 wollman

Bug fixes from John Brezak.


3571 13-Oct-1994 wollman

Fix some endianness and packet header bugs found in BSDi's port of this code.
(From mbone mailing-list.)


3561 13-Oct-1994 wollman

As suggested by Sally Floyd, don't add the ``small fraction of the window
size'' when doing congestion avoidance.

Submitted by: Mark Andrews


3514 11-Oct-1994 wollman

Fix a bug which caused panics when attempting to change just the flags of
a route. (This still doesn't work, but it doesn't panic now.) It looks
like there may be a number of incipient bugs in this code.

Also, get ready for the time when all IP gateway routes are cloning, which
is necessary to keep proper TCP statistics.


3497 10-Oct-1994 phk

Cosmetics. Silence gcc -Wall.


3444 08-Oct-1994 phk

Cosmetics: silences gcc -Wall.


3311 02-Oct-1994 phk

GCC cleanup.


3282 01-Oct-1994 wollman

Implement full proxy ARP, gated on option ARP_PROXYALL. This allows
a FreeBSD box to do proxy ARP as easily as most commercial routers do,
without messing around with (potentially variable) Ethernet addresses.
This code is really quite simple; I'm not at all sure why it wasn't
implemented in 4.4.

It might be worth stealing an interface flag (maybe IFF_LINK1) to use for
finer-grained control over which interfaces get proxy treatment. For the
moment, it's all or nothing.


2822 16-Sep-1994 phk

Made the kernel compile even without "ether".


2788 15-Sep-1994 dg

Made TCPDEBUG truly optional. Based on changes I made in FreeBSD 1.1.5.
Fixed somebody's idea of a joke - about the first half of the lines in
in_proto.c were spaced over by one space.


2763 14-Sep-1994 wollman

Add code to make multicast routing be an LKM.


2754 14-Sep-1994 wollman

Shuffle some functions and variables around to make it possible for
multicast routing to be implemented as an LKM. (There's still a bit of
work to do in this area.)


2628 09-Sep-1994 wollman

Disable IPMULTICAST_VIF socket option when MROUTING is not defined,
since it doesn't make any sense for non-routers.


2531 06-Sep-1994 wollman

Initial get-the-easy-case-working upgrade of the multicast code
to something more recent than the ancient 1.2 release contained in
4.4. This code has the following advantages as compared to
previous versions (culled from the README file for the SunOS release):

- True multicast delivery
- Configurable rate-limiting of forwarded multicast traffic on each
physical interface or tunnel, using a token-bucket limiter.
- Simplistic classification of packets for prioritized dropping.
- Administrative scoping of multicast address ranges.
- Faster detection of hosts leaving groups.
- Support for multicast traceroute (code not yet available).
- Support for RSVP, the Resource Reservation Protocol.
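
For readers unfamiliar with the token-bucket limiter mentioned in the list
above, the general technique looks roughly like this. It is a generic sketch
of a token bucket, not the ip_mroute.c implementation: the struct, the
function names, and the one-second time granularity are all assumptions made
for illustration.

```c
/* Hypothetical per-interface limiter state; tokens are counted in bytes. */
struct tbucket {
    long tokens; /* bytes that may be sent right now */
    long depth;  /* bucket capacity, which bounds the burst size */
    long rate;   /* bytes added per second */
    long last;   /* time of the last refill, in seconds */
};

/* Credit the bucket for the time elapsed since the last refill. */
static void
tb_refill(struct tbucket *tb, long now)
{
    tb->tokens += (now - tb->last) * tb->rate;
    if (tb->tokens > tb->depth)
        tb->tokens = tb->depth;   /* never exceed the bucket depth */
    tb->last = now;
}

/* Return 1 and charge the bucket if the packet may be forwarded, else 0. */
static int
tb_admit(struct tbucket *tb, long pktlen, long now)
{
    tb_refill(tb, now);
    if (pktlen > tb->tokens)
        return 0;                 /* over the limit: drop or queue */
    tb->tokens -= pktlen;
    return 1;
}
```

The bucket depth sets how large a burst can pass at once, while the refill
rate sets the long-term average that forwarded traffic is held to.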

What still needs to be done:

- The multicast forwarder needs testing.
- The multicast routing daemon needs to be ported.
- Network interface drivers need to have the `#ifdef MULTICAST' goop ripped
out of them.
- The IGMP code should probably be bogon-tested.

Some notes about the porting process:

In some cases, the Berkeley people decided to incorporate functionality from
later releases of the multicast code, but then had to do things differently.
As a result, if you look at Deering's patches, and then look at
our code, it is not always obvious whether the patch even applies. Let
the reader beware.

I ran ip_mroute.c through several passes of `unifdef' to get rid of
useless grot, and to permanently enable the RSVP support, which we will
include as standard.

Ported by: Garrett Wollman
Submitted by: Steve Deering and Ajit Thyagarajan (among others)


2304 26-Aug-1994 wollman

Obey RFC 793, section 3.4:

Several examples of connection initiation follow. Although these
examples do not show connection synchronization using data-carrying
segments, this is perfectly legitimate, so long as the receiving TCP
doesn't deliver the data to the user until it is clear the data is
valid (i.e., the data must be buffered at the receiver until the
connection reaches the ESTABLISHED state).


2169 21-Aug-1994 paul

Made idempotent.

Submitted by: Paul


2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


1817 02-Aug-1994 dg

Added $Id$


1813 01-Aug-1994 dg

Fixed bug where large amounts of unidirectional UDP traffic would fill
the interface output queue and further UDP packets would be fragmented
and only partially sent, keeping the output queue full and jamming the
network without getting any real work done (you can't send just part of
a UDP packet: if you fragment it, you must send the whole thing). The
fix adds a check to make sure that the output queue has sufficient space
for all of the fragments.
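
The check can be sketched as follows. This is an illustration under stated
assumptions rather than the committed code: ip_fragment_count() and
queue_has_room() are hypothetical helpers, and the fragment count is computed
exactly, with every fragment but the last carrying a payload that is a
multiple of 8 bytes as IP fragmentation requires.

```c
/*
 * Number of fragments needed to carry payload_len bytes of data over a
 * link with the given MTU, where hlen is the IP header length.
 * Returns -1 if the MTU leaves no room for payload.
 */
static int
ip_fragment_count(int payload_len, int mtu, int hlen)
{
    int per_frag = (mtu - hlen) & ~7;   /* round down to a multiple of 8 */

    if (per_frag <= 0)
        return -1;
    return (payload_len + per_frag - 1) / per_frag;
}

/*
 * Only start sending when every fragment fits in the output queue;
 * a partially queued datagram is useless to the receiver.
 */
static int
queue_has_room(int qlen, int qmaxlen, int nfrags)
{
    return nfrags >= 0 && qmaxlen - qlen >= nfrags;
}
```

For example, 4000 bytes of data over a 1500-byte MTU with a 20-byte header
needs three fragments, so a 50-slot queue already holding 48 packets would
refuse the datagram outright instead of queuing two fragments and dropping
the third.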


1812 01-Aug-1994 dg

Fixed bug with Nagle Congestion Avoidance where a TCP connection would
stall unnecessarily - always send an ACK when a packet len of < mss is
received.


1621 29-May-1994 dg

Increased tcp_send/recvspace to 16k, and added TCP_SMALLSPACE ifdef
to set it to 4k.


1565 26-May-1994 dg

Added missing ntohl()'s that are needed before calling IN_MULTICAST in
a couple of places.
Submitted by: Johannes Helander


1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources