mandoc_escape.3 (274880) mandoc_escape.3 (275432)
-1.\" $Id: mandoc_escape.3,v 1.1 2014/08/05 05:48:56 schwarze Exp $
+1.\" $Id: mandoc_escape.3,v 1.2 2014/10/28 14:06:31 schwarze Exp $
2.\"
3.\" Copyright (c) 2014 Ingo Schwarze <schwarze@openbsd.org>
4.\"
5.\" Permission to use, copy, modify, and distribute this software for any
6.\" purpose with or without fee is hereby granted, provided that the above
7.\" copyright notice and this permission notice appear in all copies.
8.\"
9.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
10.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
11.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
12.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
13.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
14.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
15.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
16.\"
-17.Dd $Mdocdate: August 5 2014 $
+17.Dd $Mdocdate: October 28 2014 $
18.Dt MANDOC_ESCAPE 3
19.Os
20.Sh NAME
21.Nm mandoc_escape
22.Nd parse roff escape sequences
23.Sh LIBRARY
24.Lb libmandoc
25.Sh SYNOPSIS
26.In sys/types.h
27.In mandoc.h
28.Ft "enum mandoc_esc"
29.Fo mandoc_escape
30.Fa "const char **end"
31.Fa "const char **start"
32.Fa "int *sz"
33.Fc
34.Sh DESCRIPTION
35This function scans a
36.Xr roff 7
37escape sequence.
38.Pp
39An escape sequence consists of
40.Bl -dash -compact -width 2n
41.It
42an initial backslash character
43.Pq Sq \e ,
44.It
45a single ASCII character called the escape sequence identifier,
46.It
47and, with only a few exceptions, an argument.
48.El
49.Pp
50Arguments can be given in the following forms; some escape sequence
51identifiers only accept some of these forms as specified below.
52The first three forms are called the standard forms.
53.Bl -tag -width 2n
54.It \&In brackets: Ic \&[ Ns Ar argument Ns Ic \&]
55The argument starts after the initial
56.Sq \&[ ,
57ends before the final
58.Sq \&] ,
59and the escape sequence ends with the final
60.Sq \&] .
61.It Two-character argument short form: Ic \&( Ns Ar ar
62This form can only be used for arguments
63consisting of exactly two characters.
64It has the same effect as
65.Ic \&[ Ns Ar ar Ns Ic \&] .
66.It One-character argument short form: Ar a
67This form can only be used for arguments
68consisting of exactly one character.
69It has the same effect as
70.Ic \&[ Ns Ar a Ns Ic \&] .
71.It Delimited form: Ar C Ns Ar argument Ns Ar C
72The argument starts after the initial delimiter character
73.Ar C ,
74ends before the next occurrence of the delimiter character
75.Ar C ,
76and the escape sequence ends with that second
77.Ar C .
78Some escape sequences allow arbitrary characters
79.Ar C
80as quoting characters, some restrict the range of characters
81that can be used as quoting characters.
82.El
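The three standard forms described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the code in mandoc.c: the helper name std_arg and its return convention are hypothetical, and the sketch takes the first closing bracket it finds as the end of a bracketed argument.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libmandoc: locate an argument given
 * in one of the three standard forms.  On entry, *cp points to the
 * first character after the escape sequence identifier.  On success,
 * *start and *sz describe the argument, *cp points to the character
 * after the end of the sequence, and 1 is returned.  A return value
 * of 0 means the input ended before the argument was complete.
 */
int
std_arg(const char **cp, const char **start, int *sz)
{
	const char *p = *cp;
	const char *close;

	switch (*p) {
	case '\0':
		return 0;
	case '[':		/* In brackets: [argument] */
		/* Assumption: the first ']' closes the argument. */
		close = strchr(p + 1, ']');
		if (close == NULL)
			return 0;
		*start = p + 1;
		*sz = (int)(close - (p + 1));
		*cp = close + 1;
		return 1;
	case '(':		/* Two-character short form: (ar */
		if (p[1] == '\0' || p[2] == '\0')
			return 0;
		*start = p + 1;
		*sz = 2;
		*cp = p + 3;
		return 1;
	default:		/* One-character short form: a */
		*start = p;
		*sz = 1;
		*cp = p + 1;
		return 1;
	}
}
```

The delimited form is deliberately left out, because which delimiter characters are acceptable depends on the escape sequence identifier.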
83.Pp
84Upon function entry,
85.Fa end
86is expected to point to the escape sequence identifier.
87The values passed in as
88.Fa start
89and
90.Fa sz
91are ignored and overwritten.
92.Pp
93By design, this function cannot handle those
94.Xr roff 7
95escape sequences that require in-place expansion, in particular
96user-defined strings
97.Ic \e* ,
98number registers
99.Ic \en ,
100width measurements
101.Ic \ew ,
102and numerical expression control
103.Ic \eB .
104These are handled by
105.Fn roff_res ,
106a private preprocessor function called from
107.Fn roff_parseln ,
108see the file
109.Pa roff.c .
110.Pp
111The function
112.Fn mandoc_escape
113is used
114.Bl -dash -compact -width 2n
115.It
116recursively by itself, because some escape sequence arguments can
117in turn contain other escape sequences,
118.It
119for error detection internally by the
120.Xr roff 7
121parser part of the
122.Lb libmandoc ,
123see the file
124.Pa roff.c ,
125.It
126above all externally by the
127.Xr mandoc 1
128formatting modules, in particular
129.Fl Tascii
130and
131.Fl Thtml ,
132for formatting purposes, see the files
133.Pa term.c
134and
135.Pa html.c ,
136.It
137and rarely externally by high-level utilities using the mandoc library,
138for example
139.Xr makewhatis 8 ,
140to purge escape sequences from text.
141.El
142.Sh RETURN VALUES
143Upon function return, the pointer
144.Fa end
145is set to the character after the end of the escape sequence,
146such that the calling higher-level parser can easily continue.
147.Pp
148For escape sequences taking an argument, the pointer
149.Fa start
150is set to the beginning of the argument and
151.Fa sz
152is set to the length of the argument.
153For escape sequences not taking an argument,
154.Fa start
155is set to the character after the end of the sequence and
156.Fa sz
157is set to 0.
158Both
159.Fa start
160and
161.Fa sz
162may be
163.Dv NULL ;
164in that case, the argument and the length are not returned.
165.Pp
166For sequences taking an argument, the function
167.Fn mandoc_escape
168returns one of the following values:
169.Bl -tag -width 2n
170.It Dv ESCAPE_FONT
171The escape sequence
172.Ic \ef
173taking an argument in standard form:
174.Ic \ef[ , \ef( , \ef Ns Ar a .
175Two-character arguments starting with the character
176.Sq C
177are reduced to one-character arguments by skipping the
178.Sq C .
179More specific values are returned for the most commonly used arguments:
180.Bl -column "argument" "ESCAPE_FONTITALIC"
181.It argument Ta return value
182.It Cm R No or Cm 1 Ta Dv ESCAPE_FONTROMAN
183.It Cm I No or Cm 2 Ta Dv ESCAPE_FONTITALIC
184.It Cm B No or Cm 3 Ta Dv ESCAPE_FONTBOLD
185.It Cm P Ta Dv ESCAPE_FONTPREV
186.It Cm BI Ta Dv ESCAPE_FONTBI
187.El
188.It Dv ESCAPE_SPECIAL
189The escape sequence
190.Ic \eC
191taking an argument delimited with the single quote character
192and, as a special exception, the escape sequences
193.Em not
194having an identifier, that is, those where the argument, in standard
195form, directly follows the initial backslash:
196.Ic \eC' , \e[ , \e( , \e Ns Ar a .
197Note that the one-character argument short form can only be used for
198argument characters that do not clash with escape sequence identifiers.
199.Pp
-200If the argument consists of more than one character
-201and starts with the character
-202.Sq u ,
-203.Dv ESCAPE_UNICODE
-204is returned as described below.
-205If the argument is just the single character
-206.Sq u ,
-207.Dv ESCAPE_ERROR
-208is returned.
+200If the argument matches one of the forms described below under
+201.Dv ESCAPE_UNICODE ,
+202that value is returned instead.
209.Pp
210The
211.Dv ESCAPE_SPECIAL
212special character escape sequences can be rendered using the functions
213.Fn mchars_spec2cp
214and
215.Fn mchars_spec2str
216described in the
217.Xr mchars_alloc 3
218manual.
219.It Dv ESCAPE_UNICODE
220Escape sequences of the same format as described above under
221.Dv ESCAPE_SPECIAL ,
-222but with an argument starting with the character
-223.Sq u :
+216but with an argument of the forms
+217.Ic u Ns Ar XXXX ,
+218.Ic u Ns Ar YXXXX ,
+219or
+220.Ic u10 Ns Ar XXXX
+221where
+222.Ar X
+223and
+224.Ar Y
+225are hexadecimal digits and
+226.Ar Y
+227is not zero:
224.Ic \eC'u , \e[u .
225As a special exception,
226.Fa start
227is set to the character after the
-228.Sq u ,
+232.Ic u ,
229and the
230.Fa sz
231return value does not include the
-232.Sq u
+236.Ic u
233either.
234.Pp
235Such Unicode character escape sequences can be rendered using the function
236.Fn mchars_num2uc
237described in the
238.Xr mchars_alloc 3
239manual.
240.It Dv ESCAPE_NUMBERED
241The escape sequence
242.Ic \eN
243followed by a delimited argument.
244The delimiter character is arbitrary except that digits cannot be used.
245If a digit is encountered instead of the opening delimiter, that
246digit is considered to be the argument and the end of the sequence, and
247.Dv ESCAPE_IGNORE
248is returned.
249.Pp
250Such ASCII character escape sequences can be rendered using the function
251.Fn mchars_num2char
252described in the
253.Xr mchars_alloc 3
254manual.
255.It Dv ESCAPE_IGNORE
256.Bl -bullet -width 2n
257.It
258The escape sequence
259.Ic \es
260followed by an argument in standard form or by an argument delimited
261by the single quote character:
262.Ic \es' , \es[ , \es( , \es Ns Ar a .
263As a special exception, an optional
264.Sq +
265or
266.Sq \-
267character is allowed after the
268.Sq s
269for all forms.
270.It
271The escape sequences
272.Ic \eF ,
273.Ic \eg ,
274.Ic \ek ,
275.Ic \eM ,
276.Ic \em ,
277.Ic \en ,
278.Ic \eV ,
279and
280.Ic \eY
281followed by an argument in standard form.
282.It
283The escape sequences
284.Ic \eA ,
285.Ic \eb ,
286.Ic \eD ,
287.Ic \eo ,
288.Ic \eR ,
289.Ic \eX ,
290and
291.Ic \eZ
292followed by an argument delimited by an arbitrary character.
293.It
294The escape sequences
295.Ic \eH ,
296.Ic \eh ,
297.Ic \eL ,
298.Ic \el ,
299.Ic \eS ,
300.Ic \ev ,
301and
302.Ic \ex
303followed by an argument delimited by a character that cannot occur
304in numerical expressions.
305However, if any character that can occur in numerical expressions
306is found instead of a delimiter, the sequence is considered to end
307with that character, and
308.Dv ESCAPE_ERROR
309is returned.
310.El
311.It Dv ESCAPE_ERROR
312Escape sequences taking an argument but not matching any of the above patterns.
313In particular, that happens if the end of the logical input line
314is reached before the end of the argument.
315.El
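The argument forms listed under ESCAPE_UNICODE in the newer revision can be checked with a few lines of C. The sketch below is an assumption-laden illustration, not the test used in mandoc.c: the helper name is_unicode_arg is hypothetical, the argument is taken to include the leading u, and both uppercase and lowercase hexadecimal digits are accepted, which the text above does not specify.

```c
#include <ctype.h>

/*
 * Hypothetical helper, not part of libmandoc: return 1 if the argument
 * of length sz starting at arg has one of the forms uXXXX, uYXXXX, or
 * u10XXXX, where X and Y are hexadecimal digits and Y is not zero.
 * Return 0 otherwise.
 */
int
is_unicode_arg(const char *arg, int sz)
{
	int i;

	/* The leading 'u' plus four to six hexadecimal digits. */
	if (sz < 5 || sz > 7 || arg[0] != 'u')
		return 0;
	for (i = 1; i < sz; i++)
		if (!isxdigit((unsigned char)arg[i]))
			return 0;

	switch (sz) {
	case 5:			/* uXXXX */
		return 1;
	case 6:			/* uYXXXX, Y not zero */
		return arg[1] != '0';
	default:		/* u10XXXX */
		return arg[1] == '1' && arg[2] == '0';
	}
}
```

Under the older revision, any argument longer than one character and starting with u was classified as Unicode, so a check like this one is stricter.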
316.Pp
317For sequences that do not take an argument, the function
318.Fn mandoc_escape
319returns one of the following values:
320.Bl -tag -width 2n
321.It Dv ESCAPE_SKIPCHAR
322The escape sequence
323.Qq \ez .
324.It Dv ESCAPE_NOSPACE
325The escape sequence
326.Qq \ec .
327.It Dv ESCAPE_IGNORE
328The escape sequences
329.Qq \ed
330and
331.Qq \eu .
332.El
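The font argument table given under ESCAPE_FONT above amounts to a small lookup. The following sketch shows one way to write it. The enum font_kind and the function classify_font are hypothetical stand-ins so that the example compiles without mandoc.h; they are not the library's own declarations.

```c
#include <string.h>

/* Hypothetical stand-ins for the ESCAPE_FONT* values in mandoc.h. */
enum font_kind {
	FONT_OTHER,	/* any other argument, plain ESCAPE_FONT */
	FONT_ROMAN,
	FONT_ITALIC,
	FONT_BOLD,
	FONT_PREV,
	FONT_BI
};

/* Map a font argument of length sz to the more specific value. */
enum font_kind
classify_font(const char *arg, int sz)
{
	/* A two-character argument starting with 'C' loses the 'C'. */
	if (sz == 2 && arg[0] == 'C') {
		arg++;
		sz--;
	}
	if (sz == 2 && strncmp(arg, "BI", 2) == 0)
		return FONT_BI;
	if (sz != 1)
		return FONT_OTHER;

	switch (arg[0]) {
	case 'R':
	case '1':
		return FONT_ROMAN;
	case 'I':
	case '2':
		return FONT_ITALIC;
	case 'B':
	case '3':
		return FONT_BOLD;
	case 'P':
		return FONT_PREV;
	default:
		return FONT_OTHER;
	}
}
```

A caller would pass the start and sz values obtained for the font escape sequence, for example classify_font(start, sz).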
333.Sh FILES
334This function is implemented in
335.Pa mandoc.c .
336.Sh SEE ALSO
337.Xr mchars_alloc 3 ,
338.Xr mandoc_char 7 ,
339.Xr roff 7
340.Sh HISTORY
341This function has been available since mandoc 1.11.2.
342.Sh AUTHORS
343.An Kristaps Dzonsons Aq Mt kristaps@bsd.lv
344.An Ingo Schwarze Aq Mt schwarze@openbsd.org
345.Sh BUGS
346The function doesn't cleanly distinguish between sequences that are
347valid and supported, valid and ignored, valid and unsupported,
348syntactically invalid, or undefined.
349For sequences that are ignored or unsupported, it doesn't tell
350whether that deficiency is likely to cause major formatting problems
351and/or loss of document content.
352The function is already rather complicated and still parses some
353sequences incorrectly.
354.
355.ig
356For these sequences, the list given below specifies a starting string
357and either the length of the argument or an ending character.
358The argument starts after the starting string.
359In the former case, the sequence ends with the end of the argument.
360In the latter case, the argument ends before the ending character,
361and the sequence ends with the ending character.
362..