                                                        Dec 5, 2000
                                                        Dave Steck
                                                        Novell, Inc.

                    UTF-8 Conversion Functions


1.  Strings in the LDAP C SDK should be encoded in UTF-8 format.
    However, most platforms do not provide APIs for converting to
    this format.  Where such APIs exist, they are platform-specific.

    As a result, most applications (knowingly or not) use local strings
    with LDAP functions.  This works fine for 7-bit ASCII characters,
    but will fail with 8-bit European characters, Asian characters, etc.

    We propose adding the following platform-independent conversion functions
    to the OpenLDAP SDK.  There are 4 functions for converting between UTF-8
    and wide characters, and 4 functions for converting between UTF-8 and
    multibyte characters.

    For multibyte to UTF-8 conversions, charset translation is necessary.
    While a full charset translator is not practical or appropriate for the
    LDAP SDK, we can pass the translator function in as an argument.
    Passing NULL for this argument selects the ANSI C functions mbtowc,
    mbstowcs, wctomb, and wcstombs.

2.  UTF-8 <--> Wide Character conversions

The following new conversion routines will be added, following the pattern of
the ANSI C conversion routines (mbtowc, mbstowcs, etc.).  These routines use
the wchar_t type.  wchar_t is 2 bytes on some systems and 4 bytes on others.
However, the advantage of using wchar_t is that all the standard wide character
string functions may be used on these strings:  wcslen, wcscpy, etc.

   int ldap_x_utf8_to_wc   -  Convert a single UTF-8 encoded character to a wide character.
   int ldap_x_utf8s_to_wcs -  Convert a UTF-8 string to a wide character string.
   int ldap_x_wc_to_utf8   -  Convert a single wide character to a UTF-8 sequence.
   int ldap_x_wcs_to_utf8s -  Convert a wide character string to a UTF-8 string.


2.1  ldap_x_utf8_to_wc  -  Convert a single UTF-8 encoded character to a wide character.

int ldap_x_utf8_to_wc ( wchar_t *wchar, const char *utf8char )

  wchar (OUT)     Points to a wide character code to receive the
                  converted character.

  utf8char (IN)   Address of the UTF-8 sequence of bytes.

Return Value:
        If successful, the function returns the length in
        bytes of the UTF-8 input character.

        If utf8char is NULL or points to an empty string, the
        function returns 1 and a null wide character is written to wchar.

        If utf8char contains an invalid UTF-8 sequence, -1 is returned.


2.2  ldap_x_utf8s_to_wcs  -  Convert a UTF-8 string to a wide character string.

int ldap_x_utf8s_to_wcs (wchar_t *wcstr, const char *utf8str, size_t count)

  wcstr (OUT)     Points to a wide char buffer to receive the
                  converted wide char string.  The output string will be
                  null terminated if there is space for it in the
                  buffer.

  utf8str (IN)    Address of the null-terminated UTF-8 string to convert.

  count (IN)      The number of UTF-8 characters to convert, or
                  equivalently, the size of the output buffer in wide
                  characters.

Return Value:
        If successful, the function returns the number of wide
        characters written to wcstr, excluding the null termination
        character, if any.

        If wcstr is NULL, the function returns the number of wide
        characters required to contain the converted string,
        excluding the null termination character.

        If an invalid UTF-8 sequence is encountered, the
        function returns -1.

        If the return value equals count, there was not enough space to fit the
        string and the null terminator in the buffer.


2.3  ldap_x_wc_to_utf8  -  Convert a single wide character to a UTF-8 sequence.

int ldap_x_wc_to_utf8 ( char *utf8char, wchar_t wchar, size_t count )

  utf8char (OUT)  Points to a byte array to receive the converted UTF-8
                  string.

  wchar (IN)      The wide character to convert.

  count (IN)      The maximum number of bytes to write to the output
                  buffer.  Normally set this to LDAP_MAX_UTF8_LEN, which
                  is defined as 3 or 6 depending on the size of wchar_t.
                  A partial character will not be written.

Return Value:
        If successful, the function returns the length in bytes of
        the converted UTF-8 output character.

        If wchar is the null character (0), the function returns 1 and
        a null byte is written to utf8char.

        If wchar cannot be converted to a UTF-8 character, the
        function returns -1.


2.4  ldap_x_wcs_to_utf8s  -  Convert a wide character string to a UTF-8 string.

int ldap_x_wcs_to_utf8s (char *utf8str, const wchar_t *wcstr, size_t count)

  utf8str (OUT)   Points to a byte array to receive the converted
                  UTF-8 string.  The output string will be null
                  terminated if there is space for it in the
                  buffer.

  wcstr (IN)      Address of the null-terminated wide char string to convert.

  count (IN)      The size of the output buffer in bytes.

Return Value:
        If successful, the function returns the number of bytes
        written to utf8str, excluding the null termination
        character, if any.

        If utf8str is NULL, the function returns the number of
        bytes required to contain the converted string, excluding
        the null termination character.  The 'count' parameter is ignored.

        If the function encounters a wide character that cannot
        be mapped to a UTF-8 sequence, the function returns -1.

        If the return value equals count, there was not enough space to fit
        the string and the null terminator in the buffer.



3.  Multi-byte <--> UTF-8 Conversions

These functions convert the string in a two-step process, from multibyte
to wide, then from wide to UTF-8, or vice versa.  This conversion requires a
charset translation routine, which is passed in as an argument.

   ldap_x_mb_to_utf8   -  Convert a multi-byte character to a UTF-8 character.
   ldap_x_mbs_to_utf8s -  Convert a multi-byte string to a UTF-8 string.
   ldap_x_utf8_to_mb   -  Convert a UTF-8 character to a multi-byte character.
   ldap_x_utf8s_to_mbs -  Convert a UTF-8 string to a multi-byte string.

3.1  ldap_x_mb_to_utf8  -  Convert a multi-byte character to a UTF-8 character.

int ldap_x_mb_to_utf8 ( char *utf8char, const char *mbchar, size_t mbsize,
        int (*f_mbtowc)(wchar_t *wchar, const char *mbchar, size_t count) )

  utf8char (OUT)  Points to a byte buffer to receive the converted
                  UTF-8 character.  May be NULL.  The output is not
                  null-terminated.

  mbchar (IN)     Address of a sequence of bytes forming a multibyte character.

  mbsize (IN)     The maximum number of bytes of the mbchar argument to
                  check.  This should normally be MB_CUR_MAX.

  f_mbtowc (IN)   The function to use for converting a multibyte
                  character to a wide character.  If NULL, the local
                  ANSI C routine mbtowc is used.

Return Value:
        If successful, the function returns the length in bytes of
        the UTF-8 output character.

        If utf8char is NULL, count is ignored and the function
        returns the number of bytes that would be written to the
        output char.

        If count is zero, 0 is returned and nothing is written to
        utf8char.

        If mbchar is NULL or points to an empty string, the
        function returns 1 and a null byte is written to utf8char.

        If mbchar contains an invalid multi-byte character, -1 is returned.


3.2  ldap_x_mbs_to_utf8s  -  Convert a multi-byte string to a UTF-8 string.

int ldap_x_mbs_to_utf8s (char *utf8str, const char *mbstr, size_t count,
        size_t (*f_mbstowcs)(wchar_t *wcstr, const char *mbstr, size_t count))

  utf8str (OUT)   Points to a buffer to receive the converted UTF-8 string.
                  May be NULL.

  mbstr (IN)      Address of the null-terminated multi-byte input string.

  count (IN)      The size of the output buffer in bytes.

  f_mbstowcs (IN) The function to use for converting a multibyte string
                  to a wide character string.  If NULL, the local ANSI
                  C routine mbstowcs is used.

Return Value:
        If successful, the function returns the length in
        bytes of the UTF-8 output string, excluding the null
        terminator, if present.

        If utf8str is NULL, count is ignored and the function
        returns the number of bytes required for the output string,
        excluding the null terminator.

        If count is zero, 0 is returned and nothing is written to utf8str.

        If mbstr is NULL or points to an empty string, the
        function returns 1 and a null byte is written to utf8str.

        If mbstr contains an invalid multi-byte character, -1 is returned.

        If the returned value is equal to count, the entire null-terminated
        string would not fit in the output buffer.


3.3  ldap_x_utf8_to_mb  -  Convert a UTF-8 character to a multi-byte character.

int ldap_x_utf8_to_mb ( char *mbchar, const char *utf8char,
        int (*f_wctomb)(char *mbchar, wchar_t wchar) )

  mbchar (OUT)    Points to a byte buffer to receive the converted multi-byte
                  character.  May be NULL.

  utf8char (IN)   Address of the UTF-8 character sequence.

  f_wctomb (IN)   The function to use for converting a wide character
                  to a multibyte character.  If NULL, the local
                  ANSI C routine wctomb is used.


Return Value:
        If successful, the function returns the length in
        bytes of the multi-byte output character.

        If utf8char is NULL or points to an empty string, the
        function returns 1 and a null byte is written to mbchar.

        If utf8char contains an invalid UTF-8 sequence, -1 is returned.


3.4  ldap_x_utf8s_to_mbs  -  Convert a UTF-8 string to a multi-byte string.


int ldap_x_utf8s_to_mbs ( char *mbstr, const char *utf8str, size_t count,
        size_t (*f_wcstombs)(char *mbstr, const wchar_t *wcstr, size_t count) )

  mbstr (OUT)     Points to a byte buffer to receive the converted
                  multi-byte string.  May be NULL.

  utf8str (IN)    Address of the null-terminated UTF-8 string to convert.

  count (IN)      The size of the output buffer in bytes.

  f_wcstombs (IN) The function to use for converting a wide character
                  string to a multibyte string.  If NULL, the local
                  ANSI C routine wcstombs is used.

Return Value:
        If successful, the function returns the number of bytes
        written to mbstr, excluding the null termination
        character, if any.

        If mbstr is NULL, count is ignored and the function
        returns the number of bytes required for the output string,
        excluding the null terminator.

        If count is zero, 0 is returned and nothing is written to
        mbstr.

        If utf8str is NULL or points to an empty string, the
        function returns 1 and a null byte is written to mbstr.

        If an invalid UTF-8 character is encountered, the
        function returns -1.

The output string will be null terminated if there is space for it in
the output buffer.
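
The return-value rules of sections 2.1 and 2.2 can be illustrated with a
small portable C sketch.  This is not the SDK's implementation.  The names
sketch_utf8_to_wc and sketch_utf8s_to_wcs are hypothetical stand-ins, and the
sketch makes assumptions the text above does not state: it accepts only
sequences of 1 to 4 bytes (code points up to U+10FFFF), rejects overlong and
surrogate encodings, and returns -1 for a code point that does not fit in
wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for ldap_x_utf8_to_wc (section 2.1).
 * Assumption: only 1- to 4-byte sequences (up to U+10FFFF) are accepted. */
int sketch_utf8_to_wc(wchar_t *wchar, const char *utf8char)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)utf8char;
    unsigned long cp;
    int len, i;

    if (s == NULL || s[0] == 0) {       /* NULL or empty input: returns 1 */
        if (wchar != NULL) *wchar = L'\0';
        return 1;
    }

    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return -1;                     /* invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < min_cp[len]) return -1;                    /* overlong encoding */
    if (cp > 0x10FFFF) return -1;                       /* outside Unicode */
    if (cp >= 0xD800 && cp <= 0xDFFF) return -1;        /* surrogate */
    if (sizeof(wchar_t) < 4 && cp > 0xFFFF) return -1;  /* too wide for wchar_t */

    if (wchar != NULL) *wchar = (wchar_t)cp;
    return len;
}

/* Hypothetical stand-in for ldap_x_utf8s_to_wcs (section 2.2). */
int sketch_utf8s_to_wcs(wchar_t *wcstr, const char *utf8str, size_t count)
{
    size_t n = 0;

    if (utf8str == NULL) utf8str = "";

    /* When wcstr is NULL, count is ignored and the string is only measured. */
    while (*utf8str != '\0' && (wcstr == NULL || n < count)) {
        wchar_t wc;
        int len = sketch_utf8_to_wc(&wc, utf8str);
        if (len < 0) return -1;         /* invalid UTF-8 sequence */
        assert(len >= 1);               /* non-empty input consumes >= 1 byte */
        if (wcstr != NULL) wcstr[n] = wc;
        utf8str += len;
        n++;
    }

    /* Terminate only if there is room; otherwise the result equals count. */
    if (wcstr != NULL && n < count) wcstr[n] = L'\0';
    return (int)n;
}
```

A caller following this pattern would first pass NULL for wcstr to learn the
required length, allocate that many wide characters plus one for the
terminator, and then call the function again with the real buffer.  A return
value equal to count signals that the terminator did not fit.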