History log of /freebsd-current/lib/libc/locale/utf8.c
Revision Date Author Comments
# 1d386b48 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 5b5fa75a 04-Aug-2022 Ed Maste <emaste@FreeBSD.org>

libc: drop "All rights reserved" from Foundation copyrights

This has already been done for most files that have the Foundation as
the only listed copyright holder. Do it now for files that list
multiple copyright holders, but have the Foundation copyright in its own
section.

Sponsored by: The FreeBSD Foundation


# d915a14e 25-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

libc: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using mis-identified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.


# 57c69b14 25-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Make UTF-8 parsing and generation more strict.

- In mbrtowc() we need to disallow codepoints above 0x10ffff.
- In wcrtomb() we need to disallow codepoints between 0xd800 and 0xdfff.

Reviewed by: bapt
Differential Revision: https://reviews.freebsd.org/D3399
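
The two rules in this entry can be illustrated with a small stand-alone sketch. It is not the code from utf8.c: the names utf8_valid_scalar() and utf8_encode() are hypothetical, and the encoder works on uint32_t values rather than going through wcrtomb().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if cp may be encoded as UTF-8. */
static int
utf8_valid_scalar(uint32_t cp)
{
	if (cp > 0x10ffff)
		return (0);	/* above the last Unicode code point */
	if (cp >= 0xd800 && cp <= 0xdfff)
		return (0);	/* UTF-16 surrogate */
	return (1);
}

/*
 * Hypothetical encoder: writes cp into buf (at least 4 bytes) and
 * returns the number of bytes written, or 0 if cp is rejected.
 */
static size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
	assert(buf != NULL);
	if (!utf8_valid_scalar(cp))
		return (0);
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return (1);
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return (2);
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return (3);
	}
	buf[0] = (unsigned char)(0xf0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
	return (4);
}
```

In this sketch, utf8_encode(0xd800, buf) and utf8_encode(0x110000, buf) both return 0, while U+10FFFF still encodes to four bytes.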


# 81eb7d7e 09-Aug-2015 Baptiste Daroussin <bapt@FreeBSD.org>

Re-add the check for UTF-16 surrogates, which are invalid in UTF-8


# 8bb93485 08-Aug-2015 Baptiste Daroussin <bapt@FreeBSD.org>

Remove 5- and 6-byte sequences, which are illegal in UTF-8. (part2)

Per RFC 3629, values greater than 0x10ffff should be rejected

Suggested by: jilles
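
A minimal sketch of what dropping the longer forms means for lead-byte classification, assuming a hypothetical helper utf8_seq_len() that is not taken from utf8.c: lead bytes 0xf8 and above no longer start any sequence.

```c
/*
 * Hypothetical helper: given the first byte of a sequence, return the
 * total sequence length (1..4), or 0 if the byte cannot start one.
 * A length alone does not prove validity; the decoded value still has
 * to be checked against 0x10ffff (lead bytes 0xf5..0xf7 pass this
 * test but always decode to something larger).
 */
static int
utf8_seq_len(unsigned char lead)
{
	if ((lead & 0x80) == 0x00)
		return (1);	/* 0xxxxxxx: ASCII */
	if ((lead & 0xe0) == 0xc0)
		return (2);	/* 110xxxxx */
	if ((lead & 0xf0) == 0xe0)
		return (3);	/* 1110xxxx */
	if ((lead & 0xf8) == 0xf0)
		return (4);	/* 11110xxx */
	/*
	 * Continuation bytes (10xxxxxx), the former 5- and 6-byte lead
	 * bytes 0xf8..0xfd, and 0xfe/0xff all end up here.
	 */
	return (0);
}
```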


# c9d24bcf 08-Aug-2015 Baptiste Daroussin <bapt@FreeBSD.org>

Remove 5- and 6-byte sequences, which are illegal in UTF-8.

Per RFC 3629, values greater than 0x10ffff should be rejected

Suggested by: jilles


# 7b247341 08-Aug-2015 Baptiste Daroussin <bapt@FreeBSD.org>

Revamp CTYPE support (from Illumos & Dragonfly)

Obtained from: Dragonfly


# 0716c0ff 04-Jul-2014 Pedro F. Giffuni <pfg@FreeBSD.org>

minor perf enhancement for UTF-8

Reduce some duplicate code.

Reference:
https://www.illumos.org/issues/628

Obtained from: Illumos
MFC after: 1 week


# 0f5132cd 30-Apr-2014 Pedro F. Giffuni <pfg@FreeBSD.org>

citrus: Avoid invalid code points.

From the OpenBSD log:
The UTF-8 decoder should not accept byte sequences which decode to unicode
code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://unicode.org/faq/utf_bom.html#utf8-4

Reported by: Stefan Sperling
Obtained from: OpenBSD
MFC after: 5 days


# 97ecaa89 29-Apr-2014 Pedro F. Giffuni <pfg@FreeBSD.org>

citrus: Avoid invalid code points.

From the OpenBSD log:
The UTF-8 decoder should not accept byte sequences which decode to unicode
code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://unicode.org/faq/utf_bom.html#utf8-4

Reported by: Stefan Sperling
Obtained from: OpenBSD
MFC after: 5 days


# 3c87aa1d 20-Nov-2011 David Chisnall <theraven@FreeBSD.org>

Implement xlocale APIs from Darwin, mainly for use by libc++. This adds a
load of _l suffixed versions of various standard library functions that use
the global locale, making them take an explicit locale parameter. Also
adds support for per-thread locales. This work was funded by the FreeBSD
Foundation.

Please test any code you have that uses the C standard locale functions!

Reviewed by: das (gdtoa changes)
Approved by: dim (mentor)


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# fe0506d7 09-Mar-2010 Marcel Moolenaar <marcel@FreeBSD.org>

Create the altix project branch. The altix project will add support
for the SGI Altix 350 to FreeBSD/ia64. The hardware used for porting
is a two-module system, consisting of a base compute module and a
CPU expansion module. SGI's NUMAFlex architecture can be an excellent
platform to test CPU affinity and NUMA-aware features in FreeBSD.


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 4932c895 15-Oct-2007 Andrey A. Chernov <ache@FreeBSD.org>

Add comment explaining __mb_sb_limit trick here.


# 367ed4e1 13-Oct-2007 Andrey A. Chernov <ache@FreeBSD.org>

The problem is: currently our single byte ctype(3) functions are broken
for wide character locales in the argument range >= 0x80 - they may
return false positives.

Example 1: for UTF-8 locale we currently have:
iswspace(0xA0)==1 and isspace(0xA0)==1
(because iswspace() and isspace() are the same code)
but must have
iswspace(0xA0)==1 and isspace(0xA0)==0
(because in the UTF-8 locale there is no such single-byte character, nor
any other in the range 0x80..0xff; only ASCII is kept in the single byte
range because our internal wchar_t representation for UTF-8 is UCS-4).

Example 2: for all wide character locales isalpha(arg) when arg > 0xFF may
return false positives (must be 0).
(because iswalpha() and isalpha() are the same code)

This change addresses this issue by separating single byte and wide ctype
and also fixes iswascii() (currently iswascii() is broken for
arguments > 0xFF).
This change is 100% binary compatible with old binaries.

Reviewed by: i18n@
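
The separation described in this entry can be sketched with a per-locale limit on the single-byte queries. Everything below is hypothetical: struct ctype_sketch, SB_SPACE and both functions are made up for illustration, and the sb_limit field is only loosely named after the __mb_sb_limit variable mentioned in 4932c895. How libc actually uses that variable is an assumption here.

```c
#define SB_SPACE 0x1	/* hypothetical "is space" class bit */

/* Hypothetical stand-in for a locale's character class data. */
struct ctype_sketch {
	int		sb_limit;	/* first value the single-byte query rejects */
	unsigned char	classes[256];	/* class bits indexed by code point */
};

/* Single-byte query: values at or above sb_limit never match. */
static int
sketch_isspace(const struct ctype_sketch *l, int c)
{
	if (c < 0 || c >= l->sb_limit)
		return (0);
	return ((l->classes[c] & SB_SPACE) != 0);
}

/* Wide query: consults the table without the single-byte limit. */
static int
sketch_iswspace(const struct ctype_sketch *l, unsigned long wc)
{
	if (wc >= 256)
		return (0);	/* the sketch only tabulates 256 code points */
	return ((l->classes[wc] & SB_SPACE) != 0);
}
```

With sb_limit set to 128 to model a UTF-8 locale, sketch_isspace() returns 0 for 0xA0 while sketch_iswspace() still returns 1, which is the behavior Example 1 asks for. With sb_limit set to 256 to model a single-byte locale, both queries agree.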


# 639dab22 30-Mar-2006 Tom Rhodes <trhodes@FreeBSD.org>

Fix a bug where, for 6-byte sequences, the top 6 bits get compared to
111111 rather than the top 7 bits being compared against 1111110 causing
illegal bytes fe and ff being treated the same as legal bytes fc and fd.
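
The difference between the two comparisons can be shown with bit masks. The function names are hypothetical and the code is a sketch of the described bug, not the original source. Later entries in this log remove 5- and 6-byte sequences altogether.

```c
/* Buggy test from the description: only the top 6 bits are examined. */
static int
is_6byte_lead_buggy(unsigned char ch)
{
	return ((ch & 0xfc) == 0xfc);	/* matches fc, fd, fe and ff */
}

/* Fixed test: the top 7 bits must equal 1111110. */
static int
is_6byte_lead_fixed(unsigned char ch)
{
	return ((ch & 0xfe) == 0xfc);	/* matches fc and fd only */
}
```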


# e94c6cb4 27-Feb-2005 Alexey Zelkin <phantom@FreeBSD.org>

. Static'ize functions exported via function reference variables only.
. Replace inclusion of sys/param.h with sys/cdefs.h and sys/types.h where
appropriate.
. move _*_init() prototypes to mblocal.h, and remove these prototypes
from .c files
. use _none_init() in __setrunelocale() instead of duplicating code
. move __mb* variables from table.c to none.c allowing us not to
export _none_*() externs, and appropriately remove them from mblocal.h

Ok'ed by: tjr


# 610b5a1f 12-Feb-2005 Stefan Farfeleder <stefanf@FreeBSD.org>

Fix comparisons that test if an unsigned value is < 0.

Reviewed by: tjr


# ea9a9a37 27-Jul-2004 Tim J. Robbins <tjr@FreeBSD.org>

Add UTF-8-specific implementations of mbsnrtowcs() and wcsnrtombs().
These convert plain ASCII characters in-line, making them only slightly
slower than the single-byte ("NONE" encoding) version when processing
ASCII strings.
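
A rough sketch of the in-line ASCII idea, under several assumptions: decode_one() and utf8_to_wide() are hypothetical names, the output type is uint32_t rather than wchar_t, and conversion state, NUL termination and the other details of the real mbsnrtowcs() interface are left out. The slow path applies the modern 4-byte limit, which is stricter than what the code accepted in 2004.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical slow path: decode one multi-byte sequence at s (n >= 1
 * bytes available) into *out. Returns the number of bytes consumed, or
 * 0 on malformed or truncated input.
 */
static size_t
decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
	uint32_t cp, lbound;
	size_t len, i;

	if ((s[0] & 0xe0) == 0xc0) {
		len = 2; cp = s[0] & 0x1f; lbound = 0x80;
	} else if ((s[0] & 0xf0) == 0xe0) {
		len = 3; cp = s[0] & 0x0f; lbound = 0x800;
	} else if ((s[0] & 0xf8) == 0xf0) {
		len = 4; cp = s[0] & 0x07; lbound = 0x10000;
	} else
		return (0);
	if (n < len)
		return (0);
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return (0);	/* not a continuation byte */
		cp = (cp << 6) | (s[i] & 0x3f);
	}
	/* Reject overlong forms, out-of-range values and surrogates. */
	if (cp < lbound || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
		return (0);
	*out = cp;
	return (len);
}

/*
 * Convert up to nms bytes of src into at most len code points in dst.
 * Returns the number of code points stored, or (size_t)-1 on bad input.
 */
static size_t
utf8_to_wide(uint32_t *dst, size_t len, const unsigned char *src, size_t nms)
{
	size_t nchr = 0, used;

	while (nchr < len && nms > 0) {
		if (*src < 0x80) {
			/* Fast path: an ASCII byte maps to itself. */
			dst[nchr++] = *src++;
			nms--;
			continue;
		}
		used = decode_one(src, nms, &dst[nchr]);
		if (used == 0)
			return ((size_t)-1);
		src += used;
		nms -= used;
		nchr++;
	}
	return (nchr);
}
```

For pure ASCII input the loop never leaves the fast path, so the per-character cost is one comparison and one store.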


# 550473de 09-Jul-2004 Tim J. Robbins <tjr@FreeBSD.org>

Add fast paths for conversion of plain ASCII characters.


# 5e44d7eb 16-May-2004 Tim J. Robbins <tjr@FreeBSD.org>

Use conversion state objects to store the accumulated wide character,
low bound, and the number of bytes remaining instead of storing the
raw byte sequence and deriving them every time mbrtowc() is called.
This is much faster -- about twice as fast in some crude benchmarks.
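
The three fields named in this entry can be sketched as a byte-at-a-time decoder. struct utf8_state and utf8_feed() are hypothetical and do not reproduce the libc state layout or the mbrtowc() interface. The sketch also uses the modern 4-byte limit rather than the longer forms accepted at the time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state holding the three items from the log. */
struct utf8_state {
	uint32_t ch;		/* wide character accumulated so far */
	uint32_t lbound;	/* smallest value this length may encode */
	int	 want;		/* continuation bytes still expected */
};

/*
 * Feed one byte. Returns 1 when *out holds a complete character, 0 when
 * more bytes are needed, and -1 on malformed input.
 */
static int
utf8_feed(struct utf8_state *st, unsigned char b, uint32_t *out)
{
	assert(st != NULL && out != NULL);
	if (st->want == 0) {
		if (b < 0x80) {
			*out = b;
			return (1);
		}
		if ((b & 0xe0) == 0xc0) {
			st->ch = b & 0x1f; st->want = 1; st->lbound = 0x80;
		} else if ((b & 0xf0) == 0xe0) {
			st->ch = b & 0x0f; st->want = 2; st->lbound = 0x800;
		} else if ((b & 0xf8) == 0xf0) {
			st->ch = b & 0x07; st->want = 3; st->lbound = 0x10000;
		} else
			return (-1);
		return (0);
	}
	if ((b & 0xc0) != 0x80) {
		st->want = 0;		/* reset so the caller can resync */
		return (-1);
	}
	st->ch = (st->ch << 6) | (b & 0x3f);
	if (--st->want > 0)
		return (0);
	/* Sequence complete: reject overlong, too-large and surrogate values. */
	if (st->ch < st->lbound || st->ch > 0x10ffff ||
	    (st->ch >= 0xd800 && st->ch <= 0xdfff))
		return (-1);
	*out = st->ch;
	return (1);
}
```

Because the state already holds the partial value, the lower bound and the remaining count, each call does a constant amount of work regardless of how many bytes of the sequence were seen earlier. Nothing is re-derived from a stored copy of the raw bytes.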


# 2051a8f2 12-May-2004 Tim J. Robbins <tjr@FreeBSD.org>

Move prototypes of various encoding-related functions into a new header
file to avoid extern'ing them all over the place.


# fc813796 12-Apr-2004 Tim J. Robbins <tjr@FreeBSD.org>

Perform some basic validation of multibyte conversion state objects.


# fa02ee78 09-Apr-2004 Tim J. Robbins <tjr@FreeBSD.org>

Don't cast away const qualifiers.

Spotted by: bde


# ca2dae42 07-Apr-2004 Tim J. Robbins <tjr@FreeBSD.org>

Allow partial multibyte characters to accumulate in conversion state
objects passed to mbrtowc(), mbsrtowcs(), and mbrlen(), as required
by C99.


# b1c572ad 11-Nov-2003 Tim J. Robbins <tjr@FreeBSD.org>

Fix a typo that caused mbrtowc() to always return 0.


# 02f4f60a 02-Nov-2003 Tim J. Robbins <tjr@FreeBSD.org>

Convert the Big5, EUC, MSKanji and UTF-8 encoding methods to implement
mbrtowc() and wcrtomb() directly. GB18030, GBK and UTF2 are left
unconverted; GB18030 will be done eventually, but GBK and UTF2 may just
be removed, as they are subsets of GB18030 and UTF-8 respectively.


# 6d7bd75a 18-Feb-2003 Jacques Vidrine <nectar@FreeBSD.org>

Whack 28 unused variables.


# 972baa37 10-Oct-2002 Tim J. Robbins <tjr@FreeBSD.org>

Add a UTF-8 encoding method, which will eventually replace the antique
"UTF2" method. Although UTF-8 and the old UTF2 encoding are compatible
for 16-bit characters, the new UTF-8 implementation is much more strict
about rejecting malformed input and also handles the full 31 bit range
of characters.