Cross Reference: /freebsd-current/lib/libc/locale/utf8.c

History log of /freebsd-current/lib/libc/locale/utf8.c
Revision	Date	Author	Comments
# 1d386b48	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# 4d846d26	10-May-2023	Warner Losh <imp@FreeBSD.org>	spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
# 5b5fa75a	04-Aug-2022	Ed Maste <emaste@FreeBSD.org>	libc: drop "All rights reserved" from Foundation copyrights This has already been done for most files that have the Foundation as the only listed copyright holder. Do it now for files that list multiple copyright holders, but have the Foundation copyright in its own section. Sponsored by: The FreeBSD Foundation
# d915a14e	25-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	libc: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.
# 57c69b14	25-Aug-2015	Ed Schouten <ed@FreeBSD.org>	Make UTF-8 parsing and generation more strict. - in mbrtowc() we need to disallow codepoints above 0x10ffff. - In wcrtomb() we need to disallow codepoints between 0xd800 and 0xdfff. Reviewed by: bapt Differential Revision: https://reviews.freebsd.org/D3399
# 81eb7d7e	09-Aug-2015	Baptiste Daroussin <bapt@FreeBSD.org>	Readd checking utf16 surrogates that are invalid in utf8
# 8bb93485	08-Aug-2015	Baptiste Daroussin <bapt@FreeBSD.org>	Remove 5 and 6 bytes sequences which are illegal in UTF-8 space. (part2) Per rfc3629 value greater than 0x10ffff should be rejected Suggested by: jilles
# c9d24bcf	08-Aug-2015	Baptiste Daroussin <bapt@FreeBSD.org>	Remove 5 and 6 bytes sequences which are illegal in UTF-8 space. Per rfc3629 value greater than 0x10ffff should be rejected Suggested by: jilles
# 7b247341	08-Aug-2015	Baptiste Daroussin <bapt@FreeBSD.org>	Revamp CTYPE support (from Illumos & Dragonfly) Obtained from: Dragonfly
# 0716c0ff	04-Jul-2014	Pedro F. Giffuni <pfg@FreeBSD.org>	minor perf enhancement for UTF-8 Reduce some duplicate code. Reference: https://www.illumos.org/issues/628 Obtained from: Illumos MFC after: 1 week
# 0f5132cd	30-Apr-2014	Pedro F. Giffuni <pfg@FreeBSD.org>	citrus: Avoid invalid code points. From the OpenBSD log: The UTF-8 decoder should not accept byte sequences which decode to unicode code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 http://unicode.org/faq/utf_bom.html#utf8-4 Reported by: Stefan Sperling Obtained from: OpenBSD MFC after: 5 days
# 97ecaa89	29-Apr-2014	Pedro F. Giffuni <pfg@FreeBSD.org>	citrus: Avoid invalid code points. From the OpenBSD log: The UTF-8 decoder should not accept byte sequences which decode to unicode code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 http://unicode.org/faq/utf_bom.html#utf8-4 Reported by: Stefan Sperling Obtained from: OpenBSD MFC after: 5 days
# 3c87aa1d	20-Nov-2011	David Chisnall <theraven@FreeBSD.org>	Implement xlocale APIs from Darwin, mainly for use by libc++. This adds a load of _l suffixed versions of various standard library functions that use the global locale, making them take an explicit locale parameter. Also adds support for per-thread locales. This work was funded by the FreeBSD Foundation. Please test any code you have that uses the C standard locale functions! Reviewed by: das (gdtoa changes) Approved by: dim (mentor)
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# fe0506d7	09-Mar-2010	Marcel Moolenaar <marcel@FreeBSD.org>	Create the altix project branch. The altix project will add support for the SGI Altix 350 to FreeBSD/ia64. The hardware used for porting is a two-module system, consisting of a base compute module and a CPU expansion module. SGI's NUMAFlex architecture can be an excellent platform to test CPU affinity and NUMA-aware features in FreeBSD.
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 4932c895	15-Oct-2007	Andrey A. Chernov <ache@FreeBSD.org>	Add comment explaining __mb_sb_limit trick here.
# 367ed4e1	13-Oct-2007	Andrey A. Chernov <ache@FreeBSD.org>	The problem is: currently our single byte ctype(3) functions are broken for wide character locales in the argument range >= 0x80 - they may return false positives. Example 1: for UTF-8 locale we currently have: iswspace(0xA0)==1 and isspace(0xA0)==1 (because iswspace() and isspace() are the same code) but must have iswspace(0xA0)==1 and isspace(0xA0)==0 (because there is no such character and all others in the range 0x80..0xff for the UTF-8 locale, it keeps ASCII only in the single byte range because our internal wchar_t representation for UTF-8 is UCS-4). Example 2: for all wide character locales isalpha(arg) when arg > 0xFF may return false positives (must be 0). (because iswalpha() and isalpha() are the same code) This change addresses this issue by separating single byte and wide ctype and also fixes iswascii() (currently iswascii() is broken for arguments > 0xFF). This change is 100% binary compatible with old binaries. Reviewed by: i18n@
# 639dab22	30-Mar-2006	Tom Rhodes <trhodes@FreeBSD.org>	Fix a bug where, for 6-byte sequences, the top 6 bits get compared to 111111 rather than the top 7 bits being compared against 1111110 causing illegal bytes fe and ff being treated the same as legal bytes fc and fd.
# e94c6cb4	27-Feb-2005	Alexey Zelkin <phantom@FreeBSD.org>	. Static'ize functions exported via function reference variables only. . Replace inclusion of sys/param.h to sys/cdefs.h and sys/types.h where appropriate. . move __init() prototypes to mblocal.h, and remove these prototypes from .c files . use _none_init() in __setrunelocale() instead of duplicating code . move __mb variables from table.c to none.c allowing us to not to export _none_*() externs, and appropriately remove them from mblocal.h Ok'ed by: tjr
# 610b5a1f	12-Feb-2005	Stefan Farfeleder <stefanf@FreeBSD.org>	Fix comparisons that test if an unsigned value is < 0. Reviewed by: tjr
# ea9a9a37	27-Jul-2004	Tim J. Robbins <tjr@FreeBSD.org>	Add UTF-8-specific implementations of mbsnrtowcs() and wcsnrtombs(). These convert plain ASCII characters in-line, making them only slightly slower than the single-byte ("NONE" encoding) version when processing ASCII strings.
# 550473de	09-Jul-2004	Tim J. Robbins <tjr@FreeBSD.org>	Add fast paths for conversion of plain ASCII characters.
# 5e44d7eb	16-May-2004	Tim J. Robbins <tjr@FreeBSD.org>	Use conversion state objects to store the accumulated wide character, low bound, and the number of bytes remaining instead of storing the raw byte sequence and deriving them every time mbrtowc() is called. This is much faster -- about twice as fast in some crude benchmarks.
# 2051a8f2	12-May-2004	Tim J. Robbins <tjr@FreeBSD.org>	Move prototypes of various encoding-related functions into a new header file to avoid extern'ing them all over the place.
# fc813796	12-Apr-2004	Tim J. Robbins <tjr@FreeBSD.org>	Perform some basic validation of multibyte conversion state objects.
# fa02ee78	09-Apr-2004	Tim J. Robbins <tjr@FreeBSD.org>	Don't cast away const qualifiers. Spotted by: bde
# ca2dae42	07-Apr-2004	Tim J. Robbins <tjr@FreeBSD.org>	Allow partial multibyte characters to accumulate in conversion state objects passed to mbrtowc(), mbsrtowcs(), and mbrlen(), as required by C99.
# b1c572ad	11-Nov-2003	Tim J. Robbins <tjr@FreeBSD.org>	Fix a typo that caused mbrtowc() to always return 0.
# 02f4f60a	02-Nov-2003	Tim J. Robbins <tjr@FreeBSD.org>	Convert the Big5, EUC, MSKanji and UTF-8 encoding methods to implement mbrtowc() and wcrtomb() directly. GB18030, GBK and UTF2 are left unconverted; GB18030 will be done eventually, but GBK and UTF2 may just be removed, as they are subsets of GB18030 and UTF-8 respectively.
# 6d7bd75a	18-Feb-2003	Jacques Vidrine <nectar@FreeBSD.org>	Whack 28 unused variables.
# 972baa37	10-Oct-2002	Tim J. Robbins <tjr@FreeBSD.org>	Add a UTF-8 encoding method, which will eventually replace the antique "UTF2" method. Although UTF-8 and the old UTF2 encoding are compatible for 16-bit characters, the new UTF-8 implementation is much more strict about rejecting malformed input and also handles the full 31 bit range of characters.
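
Several entries above describe validation rules rather than show them: the conversion state holding the accumulated wide character, low bound, and bytes remaining (5e44d7eb), the ASCII fast path (550473de), the lead-byte mask fix (639dab22), and the rejection of 5- and 6-byte sequences, surrogates, and values above 0x10ffff (c9d24bcf, 81eb7d7e, 57c69b14). The following is a minimal sketch of how those rules can fit together. It is not the code in utf8.c; every name in it (u8_state, u8_feed, u8_encode, six_byte_lead_buggy, six_byte_lead_fixed) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state, loosely following the log entry for 5e44d7eb. */
struct u8_state {
	uint32_t ch;		/* wide character accumulated so far */
	uint32_t lobound;	/* smallest value this sequence length may encode */
	int	 want;		/* continuation bytes still expected; 0 = initial */
};

enum u8_result { U8_DONE, U8_MORE, U8_BAD };

/* Feed one byte; on U8_DONE the decoded code point is stored in *out. */
static enum u8_result
u8_feed(struct u8_state *st, unsigned char b, uint32_t *out)
{
	assert(st != NULL && out != NULL);

	if (st->want == 0) {
		if (b < 0x80) {			/* fast path for plain ASCII */
			*out = b;
			return (U8_DONE);
		}
		if ((b & 0xe0) == 0xc0) {
			st->ch = b & 0x1f; st->lobound = 0x80; st->want = 1;
		} else if ((b & 0xf0) == 0xe0) {
			st->ch = b & 0x0f; st->lobound = 0x800; st->want = 2;
		} else if ((b & 0xf8) == 0xf0) {
			st->ch = b & 0x07; st->lobound = 0x10000; st->want = 3;
		} else {
			/* Stray continuation byte, 5/6-byte lead, 0xfe or 0xff. */
			return (U8_BAD);
		}
		return (U8_MORE);
	}
	if ((b & 0xc0) != 0x80) {		/* not a continuation byte */
		st->want = 0;
		return (U8_BAD);
	}
	st->ch = (st->ch << 6) | (b & 0x3f);
	if (--st->want > 0)
		return (U8_MORE);
	if (st->ch < st->lobound)		/* overlong encoding */
		return (U8_BAD);
	if (st->ch >= 0xd800 && st->ch <= 0xdfff)	/* UTF-16 surrogate */
		return (U8_BAD);
	if (st->ch > 0x10ffff)			/* beyond the RFC 3629 range */
		return (U8_BAD);
	*out = st->ch;
	return (U8_DONE);
}

/* Encode one code point into buf (at least 4 bytes); returns 0 if invalid. */
static size_t
u8_encode(uint32_t ch, unsigned char *buf)
{
	if (ch < 0x80) {
		buf[0] = (unsigned char)ch;
		return (1);
	}
	if (ch < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (ch >> 6));
		buf[1] = (unsigned char)(0x80 | (ch & 0x3f));
		return (2);
	}
	if (ch >= 0xd800 && ch <= 0xdfff)	/* surrogates are not encodable */
		return (0);
	if (ch < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (ch >> 12));
		buf[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (ch & 0x3f));
		return (3);
	}
	if (ch > 0x10ffff)
		return (0);
	buf[0] = (unsigned char)(0xf0 | (ch >> 18));
	buf[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (ch & 0x3f));
	return (4);
}

/* The mask problem described for 639dab22, reduced to two predicates. */
static int
six_byte_lead_buggy(unsigned char b)
{
	return ((b & 0xfc) == 0xfc);	/* top 6 bits: also matches 0xfe, 0xff */
}

static int
six_byte_lead_fixed(unsigned char b)
{
	return ((b & 0xfe) == 0xfc);	/* top 7 bits: matches only 0xfc, 0xfd */
}
```

Because the state carries the partial character between calls, a caller can pass input one byte at a time, which is the behavior the ca2dae42 entry describes for mbrtowc(). The two six_byte_lead predicates exist only to show the mask difference; the decoder itself rejects all 5- and 6-byte lead bytes, in line with the later entries.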