                                                                Dec 5, 2000
                                                                Dave Steck
                                                                Novell, Inc.

                    UTF-8 Conversion Functions


1.  Strings in the LDAP C SDK should be encoded in UTF-8 format.
    However, most platforms do not provide APIs for converting to
    this format, and where such APIs exist they are platform-specific.

    As a result, most applications (knowingly or not) pass strings in the
    local character set to LDAP functions.  This works for 7-bit ASCII
    characters, but fails for 8-bit European characters, Asian
    characters, etc.

    We propose adding the following platform-independent conversion
    functions to the OpenLDAP SDK.  There are 4 functions for converting
    between UTF-8 and wide characters, and 4 functions for converting
    between UTF-8 and multibyte characters.

    For multibyte to UTF-8 conversions, charset translation is necessary.
    While a full charset translator is not practical or appropriate for the
    LDAP SDK, we can pass the translator function in as an argument.
    Passing NULL for this argument selects the ANSI C functions mbtowc,
    mbstowcs, wctomb, and wcstombs.

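The NULL-means-default convention for the translator argument can be
sketched roughly as follows.  The names mbtowc_fn, pick_mbtowc and
convert_one are hypothetical and exist only for this illustration; they
are not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>   /* wchar_t, size_t */
#include <stdlib.h>   /* mbtowc, MB_CUR_MAX */

/* Hypothetical name for the translator callback type.  The proposal
 * spells this type out inline in each prototype instead. */
typedef int (*mbtowc_fn)(wchar_t *wchar, const char *mbchar, size_t count);

/* Return the caller's translator, or the ANSI C mbtowc when NULL is given. */
static mbtowc_fn pick_mbtowc(mbtowc_fn f_mbtowc)
{
    return f_mbtowc != NULL ? f_mbtowc : mbtowc;
}

/* Convert one multibyte character using whichever translator was selected. */
static int convert_one(wchar_t *out, const char *mbchar, mbtowc_fn f_mbtowc)
{
    assert(out != NULL && mbchar != NULL);
    return pick_mbtowc(f_mbtowc)(out, mbchar, MB_CUR_MAX);
}
```

A caller that already has a charset library would pass its own function
in place of NULL; everything else in the call stays the same.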
2.  UTF-8 <--> Wide Character Conversions

The following new conversion routines will be added, following the pattern of
the ANSI C conversion routines (mbtowc, mbstowcs, etc).  These routines use
the wchar_t type.  wchar_t is 2 bytes on some systems and 4 bytes on others.
However, the advantage of using wchar_t is that all the standard wide character
string functions may be used on these strings:  wcslen, wcscpy, etc.

   ldap_x_utf8_to_wc  -  Convert a single UTF-8 encoded character to a wide character.
   ldap_x_utf8s_to_wcs  -  Convert a UTF-8 string to a wide character string.
   ldap_x_wc_to_utf8  -  Convert a single wide character to a UTF-8 sequence.
   ldap_x_wcs_to_utf8s  -  Convert a wide character string to a UTF-8 string.


2.1  ldap_x_utf8_to_wc  -  Convert a single UTF-8 encoded character to a wide character.

int ldap_x_utf8_to_wc ( wchar_t *wchar, const char *utf8char )

  wchar     (OUT)   Points to a wide character code to receive the
                    converted character.

  utf8char  (IN)    Address of the UTF-8 sequence of bytes.

Return Value:
        If successful, the function returns the length in
        bytes of the UTF-8 input character.

        If utf8char is NULL or points to an empty string, the
        function returns 1 and a null wide character is written to wchar.

        If utf8char contains an invalid UTF-8 sequence, -1 is returned.

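The return-value rules above can be illustrated with a minimal decoder.
This is only a sketch of the described contract, not the SDK
implementation: the name sketch_utf8_to_wc is hypothetical, sequences
longer than 4 bytes are not handled, and overlong encodings are not
rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the contract described for ldap_x_utf8_to_wc.
 * Handles 1- to 3-byte sequences, plus 4-byte ones when wchar_t is
 * at least 4 bytes wide. */
static int sketch_utf8_to_wc(wchar_t *wchar, const char *utf8char)
{
    const unsigned char *p = (const unsigned char *)utf8char;
    unsigned long cp;
    int len, i;

    assert(wchar != NULL);

    if (p == NULL || p[0] == 0) {           /* NULL or empty input */
        *wchar = 0;
        return 1;
    }

    if (p[0] < 0x80)                { len = 1; cp = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0 && sizeof(wchar_t) >= 4)
                                    { len = 4; cp = p[0] & 0x07; }
    else return -1;             /* bad lead byte, or too wide for wchar_t */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)          /* continuation is 10xxxxxx */
            return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    *wchar = (wchar_t)cp;
    return len;
}
```

Because the return value is the number of input bytes consumed, a caller
can step through a string by adding it to the input pointer after each
successful call.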

2.2  ldap_x_utf8s_to_wcs  -  Convert a UTF-8 string to a wide character string.

int ldap_x_utf8s_to_wcs ( wchar_t *wcstr, const char *utf8str, size_t count )

  wcstr     (OUT)   Points to a wide char buffer to receive the
                    converted wide char string.  The output string will be
                    null terminated if there is space for it in the
                    buffer.

  utf8str   (IN)    Address of the null-terminated UTF-8 string to convert.

  count     (IN)    The number of UTF-8 characters to convert, or
                    equivalently, the size of the output buffer in wide
                    characters.

Return Value:
        If successful, the function returns the number of wide
        characters written to wcstr, excluding the null termination
        character, if any.

        If wcstr is NULL, the function returns the number of wide
        characters required to contain the converted string,
        excluding the null termination character.

        If an invalid UTF-8 sequence is encountered, the
        function returns -1.

        If the return value equals count, there was not enough space to fit
        the string and the null terminator in the buffer.

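The NULL-buffer query and the "return value equals count" rule suggest a
two-call pattern: ask for the length, allocate, then convert.  The sketch
below assumes a converter that follows the contract above.  Both
sketch_utf8s_to_wcs (a shortened stand-in limited to 1- to 3-byte
sequences) and utf8_to_wcs_alloc are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in following the contract described for ldap_x_utf8s_to_wcs.
 * Not the SDK code; 4-byte and longer sequences are rejected. */
static int sketch_utf8s_to_wcs(wchar_t *wcstr, const char *utf8str, size_t count)
{
    const unsigned char *p = (const unsigned char *)utf8str;
    size_t n = 0;

    assert(utf8str != NULL);

    /* With a NULL output buffer, count is ignored and we only measure. */
    while (*p != 0 && (wcstr == NULL || n < count)) {
        unsigned long cp;
        int len, i;

        if (p[0] < 0x80)                { len = 1; cp = p[0]; }
        else if ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; }
        else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; }
        else return -1;
        for (i = 1; i < len; i++) {
            if ((p[i] & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (p[i] & 0x3F);
        }
        if (wcstr != NULL)
            wcstr[n] = (wchar_t)cp;
        n++;
        p += len;
    }
    if (wcstr != NULL && n < count)
        wcstr[n] = 0;                   /* terminate only if there is room */
    return (int)n;
}

/* Two-call pattern: query the length, allocate, then convert.
 * Returns a malloc'd string that the caller frees, or NULL on error. */
static wchar_t *utf8_to_wcs_alloc(const char *utf8str)
{
    int need = sketch_utf8s_to_wcs(NULL, utf8str, 0);
    wchar_t *buf;

    if (need < 0)
        return NULL;                    /* invalid UTF-8 */
    buf = malloc(((size_t)need + 1) * sizeof *buf);
    if (buf == NULL)
        return NULL;
    /* count = need + 1 leaves room for the terminator, so a return value
     * equal to need (strictly less than count) means nothing was cut off. */
    if (sketch_utf8s_to_wcs(buf, utf8str, (size_t)need + 1) != need) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

With a fixed-size buffer, the same check applies in reverse: a return
value equal to count means the result was truncated or left without a
terminator.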

2.3  ldap_x_wc_to_utf8  -  Convert a single wide character to a UTF-8 sequence.

int ldap_x_wc_to_utf8 ( char *utf8char, wchar_t wchar, size_t count )

  utf8char  (OUT)   Points to a byte array to receive the converted UTF-8
                    sequence.

  wchar     (IN)    The wide character to convert.

  count     (IN)    The maximum number of bytes to write to the output
                    buffer.  Normally set this to LDAP_MAX_UTF8_LEN, which
                    is defined as 3 or 6 depending on the size of wchar_t.
                    A partial character will not be written.

Return Value:
        If successful, the function returns the length in bytes of
        the converted UTF-8 output character.

        If wchar is the null wide character, the function returns 1 and
        a null byte is written to utf8char.

        If wchar cannot be converted to a UTF-8 character, the
        function returns -1.

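A minimal encoder shows how the length of the output depends on the value
of the wide character.  This is a sketch under stated assumptions, not the
SDK code: sketch_wc_to_utf8 is a hypothetical name, only code points up to
U+10FFFF (1 to 4 bytes) are covered, and returning 0 when count is too
small is an assumption, since the text above only says that a partial
character is not written.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the contract described for ldap_x_wc_to_utf8. */
static int sketch_wc_to_utf8(char *utf8char, wchar_t wchar, size_t count)
{
    unsigned long cp = (unsigned long)wchar;   /* negative values become huge */
    unsigned char *out = (unsigned char *)utf8char;
    int len;

    assert(utf8char != NULL);

    if (cp < 0x80)            len = 1;
    else if (cp < 0x800)      len = 2;
    else if (cp < 0x10000)    len = 3;
    else if (cp <= 0x10FFFF)  len = 4;
    else return -1;                     /* cannot be converted */

    if ((size_t)len > count)
        return 0;                       /* assumption: write nothing, report 0 */

    switch (len) {
    case 1:
        out[0] = (unsigned char)cp;     /* a null wchar yields a null byte */
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    }
    return len;
}
```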

2.4  ldap_x_wcs_to_utf8s  -  Convert a wide character string to a UTF-8 string.

int ldap_x_wcs_to_utf8s ( char *utf8str, const wchar_t *wcstr, size_t count )

  utf8str   (OUT)   Points to a byte array to receive the converted
                    UTF-8 string.  The output string will be null
                    terminated if there is space for it in the
                    buffer.

  wcstr     (IN)    Address of the null-terminated wide char string to convert.

  count     (IN)    The size of the output buffer in bytes.

Return Value:
        If successful, the function returns the number of bytes
        written to utf8str, excluding the null termination
        character, if any.

        If utf8str is NULL, the function returns the number of
        bytes required to contain the converted string, excluding
        the null termination character.  The 'count' parameter is ignored.

        If the function encounters a wide character that cannot
        be mapped to a UTF-8 sequence, the function returns -1.

        If the return value equals count, there was not enough space to fit
        the string and the null terminator in the buffer.



3.  Multibyte <--> UTF-8 Conversions

These functions convert the string in a two-step process, from multibyte
to wide, then from wide to UTF-8, or vice versa.  This conversion requires a
charset translation routine, which is passed in as an argument.

   ldap_x_mb_to_utf8  -  Convert a multibyte character to a UTF-8 character.
   ldap_x_mbs_to_utf8s  -  Convert a multibyte string to a UTF-8 string.
   ldap_x_utf8_to_mb  -  Convert a UTF-8 character to a multibyte character.
   ldap_x_utf8s_to_mbs  -  Convert a UTF-8 string to a multibyte string.

3.1  ldap_x_mb_to_utf8  -  Convert a multibyte character to a UTF-8 character.

int ldap_x_mb_to_utf8 ( char *utf8char, const char *mbchar, size_t mbsize,
        int (*f_mbtowc)(wchar_t *wchar, const char *mbchar, size_t count) )

  utf8char  (OUT)   Points to a byte buffer to receive the converted
                    UTF-8 character.  May be NULL.  The output is not
                    null-terminated.

  mbchar    (IN)    Address of a sequence of bytes forming a multibyte character.

  mbsize    (IN)    The maximum number of bytes of the mbchar argument to
                    check.  This should normally be MB_CUR_MAX.

  f_mbtowc  (IN)    The function to use for converting a multibyte
                    character to a wide character.  If NULL, the local
                    ANSI C routine mbtowc is used.

Return Value:
        If successful, the function returns the length in bytes of
        the UTF-8 output character.

        If utf8char is NULL, count is ignored and the function
        returns the number of bytes that would be written to the
        output char.

        If count is zero, 0 is returned and nothing is written to
        utf8char.

        If mbchar is NULL or points to an empty string, the
        function returns 1 and a null byte is written to utf8char.

        If mbchar contains an invalid multibyte character, -1 is returned.

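The two-step flow (multibyte to wide, then wide to UTF-8) for a single
character might look roughly like this.  It is a sketch only:
sketch_mb_to_utf8 and encode_bmp are hypothetical names, the encoder is
cut down to code points up to U+FFFF, and the zero-size case described
above is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>   /* mbtowc, MB_CUR_MAX */

/* Step 2 helper: encode one wide character (up to U+FFFF here) as UTF-8.
 * A shortened stand-in for ldap_x_wc_to_utf8. */
static int encode_bmp(char *out, wchar_t wc)
{
    unsigned long cp = (unsigned long)wc;
    unsigned char *o = (unsigned char *)out;

    assert(out != NULL);

    if (cp < 0x80) {
        o[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        o[0] = (unsigned char)(0xC0 | (cp >> 6));
        o[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        o[0] = (unsigned char)(0xE0 | (cp >> 12));
        o[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        o[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return -1;                          /* outside this sketch's range */
}

/* Sketch of the two-step flow described for ldap_x_mb_to_utf8. */
static int sketch_mb_to_utf8(char *utf8char, const char *mbchar, size_t mbsize,
        int (*f_mbtowc)(wchar_t *wchar, const char *mbchar, size_t count))
{
    wchar_t wc;
    char tmp[3];                        /* scratch space for length queries */

    if (f_mbtowc == NULL)
        f_mbtowc = mbtowc;              /* default to the ANSI C routine */

    if (mbchar == NULL || *mbchar == '\0') {
        if (utf8char != NULL)
            *utf8char = '\0';
        return 1;
    }

    if (f_mbtowc(&wc, mbchar, mbsize) < 0)      /* step 1: multibyte -> wide */
        return -1;

    /* step 2: wide -> UTF-8; with a NULL output only the length is reported */
    return encode_bmp(utf8char != NULL ? utf8char : tmp, wc);
}
```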

3.2  ldap_x_mbs_to_utf8s  -  Convert a multibyte string to a UTF-8 string.

int ldap_x_mbs_to_utf8s ( char *utf8str, const char *mbstr, size_t count,
        size_t (*f_mbstowcs)(wchar_t *wcstr, const char *mbstr, size_t count) )

  utf8str    (OUT)  Points to a buffer to receive the converted UTF-8 string.
                    May be NULL.

  mbstr      (IN)   Address of the null-terminated multibyte input string.

  count      (IN)   The size of the output buffer in bytes.

  f_mbstowcs (IN)   The function to use for converting a multibyte string
                    to a wide character string.  If NULL, the local ANSI
                    C routine mbstowcs is used.

Return Value:
        If successful, the function returns the length in
        bytes of the UTF-8 output string, excluding the null
        terminator, if present.

        If utf8str is NULL, count is ignored and the function
        returns the number of bytes required for the output string,
        excluding the null terminator.

        If count is zero, 0 is returned and nothing is written to utf8str.

        If mbstr is NULL or points to an empty string, the
        function returns 1 and a null byte is written to utf8str.

        If mbstr contains an invalid multibyte character, -1 is returned.

        If the returned value is equal to count, the entire null-terminated
        string would not fit in the output buffer.

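As an example of a caller-supplied translator, the sketch below has the
same shape as mbstowcs and treats its input as ISO 8859-1, where each byte
maps to the Unicode code point of the same value.  The name
latin1_mbstowcs is hypothetical, and the choice of ISO 8859-1 is just an
assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical translator for ISO 8859-1 input, shaped like mbstowcs so
 * that it could be passed as the f_mbstowcs argument. */
static size_t latin1_mbstowcs(wchar_t *wcstr, const char *mbstr, size_t count)
{
    const unsigned char *p = (const unsigned char *)mbstr;
    size_t n = 0;

    assert(mbstr != NULL);

    if (wcstr == NULL) {                /* length query: count is ignored */
        while (p[n] != 0)
            n++;
        return n;
    }

    while (n < count && p[n] != 0) {
        wcstr[n] = (wchar_t)p[n];       /* byte value == code point */
        n++;
    }
    if (n < count)
        wcstr[n] = 0;                   /* terminate only if there is room */
    return n;
}
```

Under this proposal such a function would be handed over as the last
argument, for example
ldap_x_mbs_to_utf8s(utf8buf, input, sizeof utf8buf, latin1_mbstowcs),
where utf8buf and input are the caller's own buffers.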

3.3  ldap_x_utf8_to_mb  -  Convert a UTF-8 character to a multibyte character.

int ldap_x_utf8_to_mb ( char *mbchar, const char *utf8char,
        int (*f_wctomb)(char *mbchar, wchar_t wchar) )

  mbchar    (OUT)   Points to a byte buffer to receive the converted
                    multibyte character.  May be NULL.

  utf8char  (IN)    Address of the UTF-8 character sequence.

  f_wctomb  (IN)    The function to use for converting a wide character
                    to a multibyte character.  If NULL, the local
                    ANSI C routine wctomb is used.

Return Value:
        If successful, the function returns the length in
        bytes of the multibyte output character.

        If utf8char is NULL or points to an empty string, the
        function returns 1 and a null byte is written to mbchar.

        If utf8char contains an invalid UTF-8 sequence, -1 is returned.

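A translator for the opposite direction has the shape of wctomb.  The
sketch below targets ISO 8859-1 and reports -1 for any character that has
no single-byte equivalent.  The name latin1_wctomb is hypothetical, and
unlike wctomb this sketch does not accept a NULL output pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical translator targeting ISO 8859-1, shaped like wctomb so
 * that it could be passed as the f_wctomb argument. */
static int latin1_wctomb(char *mbchar, wchar_t wchar)
{
    unsigned long cp = (unsigned long)wchar;   /* negative values become huge */

    assert(mbchar != NULL);

    if (cp > 0xFF)
        return -1;                      /* no Latin-1 equivalent */
    *(unsigned char *)mbchar = (unsigned char)cp;
    return 1;                           /* always one byte in this charset */
}
```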

3.4  ldap_x_utf8s_to_mbs  -  Convert a UTF-8 string to a multibyte string.

int ldap_x_utf8s_to_mbs ( char *mbstr, const char *utf8str, size_t count,
        size_t (*f_wcstombs)(char *mbstr, const wchar_t *wcstr, size_t count) )

  mbstr      (OUT)  Points to a byte buffer to receive the converted
                    multibyte string.  May be NULL.

  utf8str    (IN)   Address of the null-terminated UTF-8 string to convert.

  count      (IN)   The size of the output buffer in bytes.

  f_wcstombs (IN)   The function to use for converting a wide character
                    string to a multibyte string.  If NULL, the local
                    ANSI C routine wcstombs is used.

Return Value:
        If successful, the function returns the number of bytes
        written to mbstr, excluding the null termination
        character, if any.

        If mbstr is NULL, count is ignored and the function
        returns the number of bytes required for the output string,
        excluding the null terminator.

        If count is zero, 0 is returned and nothing is written to
        mbstr.

        If utf8str is NULL or points to an empty string, the
        function returns 1 and a null byte is written to mbstr.

        If an invalid UTF-8 character is encountered, the
        function returns -1.

The output string will be null terminated if there is space for it in
the output buffer.