1<h1>TRE Regexp Syntax</h1>
2
3<p>
4This document describes the POSIX 1003.2 extended RE (ERE) syntax and
5the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
6to the ERE syntax.  A simple Extended Backus-Naur Form (EBNF) style
7notation is used to describe the grammar.
8</p>
9
10<h2>ERE Syntax</h2>
11
12<h3>Alternation operator</h3>
13<a name="alternation"></a>
14<a name="extended-regexp"></a>
15
16<table bgcolor="#e0e0f0" cellpadding="10">
17<tr><td>
18<pre>
19<i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
20                |   <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
21</pre>
22</td></tr>
23</table>
24<p>
25An extended regexp (ERE) is one or more <i>branches</i>, separated by
26<tt>|</tt>.  An ERE matches anything that matches one or more of the
27branches.
28</p>
29
30<h3>Catenation of REs</h3>
31<a name="catenation"></a>
32<a name="branch"></a>
33
34<table bgcolor="#e0e0f0" cellpadding="10">
35<tr><td>
36<pre>
37<i>branch</i> ::= <i>piece</i>
38       |   <i>branch</i> <i>piece</i>
39</pre>
40</td></tr>
41</table>
42<p>
43A branch is one or more <i>pieces</i> concatenated.  It matches a
44match for the first piece, followed by a match for the second piece,
45and so on.
46</p>
47
48
49<table bgcolor="#e0e0f0" cellpadding="10">
50<tr><td>
51<pre>
52<i>piece</i> ::= <i>atom</i>
53      |   <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
54      |   <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
55</pre>
56</td></tr>
57</table>
58<p>
59A piece is an <i>atom</i> possibly followed by a repeat operator or an
60expression controlling approximate matching parameters for the <i>atom</i>.
61</p>
62
63
64<table bgcolor="#e0e0f0" cellpadding="10">
65<tr><td>
66<pre>
67<i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
68     |   <a href="#bracket-expression"><i>bracket-expression</i></a>
69     |   <b>"."</b>
70     |   <a href="#assertion"><i>assertion</i></a>
71     |   <a href="#literal"><i>literal</i></a>
72     |   <a href="#backref"><i>back-reference</i></a>
73     |   <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
74     |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
75     |   <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
76</pre>
77</td></tr>
78</table>
79<p>
80An atom is either an ERE enclosed in parenthesis, a bracket
81expression, a <tt>.</tt> (period), an assertion, or a literal.
82</p>
83
84<p>
85The dot (<tt>.</tt>) matches any single character.
86If the <code>REG_NEWLINE</code> compilation flag (see <a
87href="api.html">API manual</a>) is specified, the newline
88character is not matched.
89</p>
90
91<p>
92<tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is
93completely ignored by the regex parser and it used solely for readability purposes.
94</p>
95
96<h3>Repeat operators</h3>
97<a name="repeat-operator"></a>
98
99<table bgcolor="#e0e0f0" cellpadding="10">
100<tr><td>
101<pre>
102<i>repeat-operator</i> ::= <b>"*"</b>
103                |   <b>"+"</b>
104                |   <b>"?"</b>
105                |   <i>bound</i>
106                |   <b>"*?"</b>
107                |   <b>"+?"</b>
108                |   <b>"??"</b>
109                |   <i>bound</i> <b>?</b>
110</pre>
111</td></tr>
112</table>
113
114<p>
115An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
116of the atom.  <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
117of 1 or more matches of the atom.  An atom followed by <tt>?</tt>
118matches a sequence of 0 or 1 matches of the atom.
119</p>
120
121<p>
122A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i>
123are unsigned decimal integers between <tt>0</tt> and
124<tt>RE_DUP_MAX</tt>:
125</p>
126
127<ol>
128<li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
129<li><tt>{</tt><i>m</i><tt>,}</tt></li>
130<li><tt>{</tt><i>m</i><tt>}</tt></li>
131</ol>
132
133<p>
134An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
135(inclusive) matches of the atom.  An atom followed by [2]
136matches a sequence of <i>m</i> or more matches of the atom.  An atom
137followed by [3] matches a sequence of exactly <i>m</i> matches of the
138atom.
139</p>
140
141
142<p>
143Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or
144non-greedy.  Normally a repeated expression is greedy, that is, it matches as
145many characters as possible.  A non-greedy subexpression matches as few
146characters as possible.  Note that this does not (always) mean the same thing
147as matching as many or few repetitions as possible.  Also note
148that <strong>minimal repetitions are not currently supported for approximate
149matching</strong>.
150</p>
151
152<h3>Approximate matching settings</h3>
153<a name="approx-settings"></a>
154
155<table bgcolor="#e0e0f0" cellpadding="10">
156<tr><td>
157<pre>
158<i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>
159
160<i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
161             |   <b>"-"</b> <i>number</i>?
162             |   <b>"#"</b> <i>number</i>?
163             |   <b>"~"</b> <i>number</i>?
164
165<i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"&lt;"</b> <i>number</i>
166
167<i>cost-term</i> ::= <i>number</i> <b>"i"</b>
168          |   <i>number</i> <b>"d"</b>
169          |   <i>number</i> <b>"s"</b>
170
171</pre>
172</td></tr>
173</table>
174
175<p>
176The approximate matching settings for a subpattern can be changed
177by appending <i>approx-settings</i> to the subpattern.  Limits for
178the number of errors can be set and an expression for specifying and
179limiting the costs can be given.
180</p>
181
182<p>
183The <i>count-limits</i> can be used to set limits for the number of
184insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
185(<tt>#</tt>), and total number of errors (<tt>~</tt>).  If the
186<i>number</i> part is omitted, the specified error count will be
187unlimited.
188</p>
189
190<p>
191The <i>cost-equation</i> can be thought of as a mathematical equation,
192where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
193insertions, deletions, and substitutions, respectively.  The equation
194can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
195<tt>s</tt>.  The multiplier is the cost of the error, and the number
196after <tt>&lt;</tt> is the maximum allowed cost of a match.  Spaces
197and pluses can be inserted to make the equation readable.  In fact, when
198specifying only a cost equation, adding a space after the opening <tt>{</tt>
199is <strong>required</strong>.
200</p>
201
202<p>
203Examples:
204<dl>
205<dt><tt>{~}</tt></dt>
206<dd>Sets the maximum number of errors to unlimited.</dd>
207<dt><tt>{~3}</tt></dt>
208<dd>Sets the maximum number of errors to three.</dd>
209<dt><tt>{+2~5}</tt></dt>
210<dd>Sets the maximum number of errors to five, and the maximum number
211of insertions to two.</dd>
212<dt><tt>{&lt;3}</tt></dt>
213<dd>Sets the maximum cost to three.
214<dt><tt>{ 2i + 1d + 2s &lt; 5 }</tt></dt>
215<dd>Sets the cost of an insertion to two, a deletion to one, a
216substitution to two, and the maximum cost to five.
217</dl>
218
219
220<h3>Bracket expressions</h3>
221<a name="bracket-expression"></a>
222
223<table bgcolor="#e0e0f0" cellpadding="10">
224<tr><td>
225<pre>
226<i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
227                   |   <b>"[^"</b> <i>item</i>+ <b>"]"</b>
228</pre>
229</td></tr>
230</table>
231
232<p>
233A bracket expression specifies a set of characters by enclosing a
234nonempty list of items in brackets.  Normally anything matching any
235item in the list is matched.  If the list begins with <tt>^</tt> the
236meaning is negated; any character matching no item in the list is
237matched.
238</p>
239
240<p>
241An item is any of the following:
242</p>
243<ul>
244<li>A single character, matching that character.</li>
245<li>Two characters separated by <tt>-</tt>.  This is shorthand for the
246full range of characters  between those two (inclusive) in the
247collating sequence.  For example, <tt>[0-9]</tt> in ASCII matches any
248decimal digit.</li>
249<li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
250matching the collating element.  This can be used to include a literal
251<tt>-</tt> or a multi-character collating element in the list.</li>
252<li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
253equivalence class), matching all collating elements with the same
254primary collation weight as that element, including the element
255itself.</li>
256<li>The name of a character class enclosed in <tt>[:</tt> and
257<tt>:]</tt>, matching any character belonging to the class.  The set
258of valid names depends on the <code>LC_CTYPE</code> category of the
259current locale, but the following names are valid in all locales:
260<ul>
261<li><tt>alnum</tt> - alphanumeric characters</li>
262<li><tt>alpha</tt> - alphabetic characters</li>
263<li><tt>blank</tt> - blank characters</li>
264<li><tt>cntrl</tt> - control characters</li>
265<li><tt>digit</tt> - decimal digits (0 through 9)</li>
266<li><tt>graph</tt> - all printable characters except space</li>
267<li><tt>lower</tt> - lower-case letters</li>
268<li><tt>print</tt> - printable characters including space</li>
269<li><tt>punct</tt> - printable characters not space or alphanumeric</li>
270<li><tt>space</tt> - white-space characters</li>
271<li><tt>upper</tt> - upper case letters</li>
272<li><tt>xdigit</tt> - hexadecimal digits</li>
273</ul>
274</ul>
275<p>
276To include a literal <tt>-</tt> in the list, make it either the first
277or last item, the second endpoint of a range, or enclose it in
278<tt>[.</tt> and <tt>.]</tt> to make it a collating element.  To
279include a literal <tt>]</tt> in the list, make it either the first
280item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
281<tt>.]</tt>.  To use a literal <tt>-</tt> as the first
282endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
283</p>
284
285
286<h3>Assertions</h3>
287<a name="assertion"></a>
288
289<table bgcolor="#e0e0f0" cellpadding="10">
290<tr><td>
291<pre>
292<i>assertion</i> ::= <b>"^"</b>
293          |   <b>"$"</b>
294          |   <b>"\"</b> <i>assertion-character</i>
295</pre>
296</td></tr>
297</table>
298
299<p>
300The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
301"right anchor", respectively.  The left anchor matches the empty
302string at the beginning of the string.  The right anchor matches the
303empty string at the end of the string.  The behaviour of both anchors
304can be varied by specifying certain execution and compilation flags;
305see the <a href="api.html">API manual</a>.
306</p>
307
308<p>
309An assertion-character can be any of the following:
310</p>
311
312<ul>
313<li><tt>&lt;</tt> - Beginning of word
314<li><tt>&gt;</tt> - End of word
315<li><tt>b</tt> - Word boundary
316<li><tt>B</tt> - Non-word boundary
317<li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
318<li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
319<li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
320<li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
321<li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
322<li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
323</ul>
324
325
326<h3>Literals</h3>
327<a name="literal"></a>
328
329<table bgcolor="#e0e0f0" cellpadding="10">
330<tr><td>
331<pre>
332<i>literal</i> ::= <i>ordinary-character</i>
333        |   <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
334        |   <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
335        |   <b>"\"</b> <i>character</i>
336</pre>
337</td></tr>
338</table>
339<p>
340A literal is either an ordinary character (a character that has no
341other significance in the context), an 8 bit hexadecimal encoded
342character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character
343(e.g. <tt>\x{263a}</tt>), or an escaped character.  An escaped
344character is a <tt>\</tt> followed by any character, and matches that
345character.  Escaping can be used to match characters which have a
346special meaning in regexp syntax.  A <tt>\</tt> cannot be the last
347character of an ERE.  Escaping also allows you to include a few
348non-printable characters in the regular expression.  These special
349escape sequences include:
350</p>
351
352<ul>
353<li><tt>\a</tt> - Bell character (ASCII code 7)
354<li><tt>\e</tt> - Escape character (ASCII code 27)
355<li><tt>\f</tt> - Form-feed character (ASCII code 12)
356<li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)
357<li><tt>\r</tt> - Carriage return character (ASCII code 13)
358<li><tt>\t</tt> - Horizontal tab character (ASCII code 9)
359</ul>
360
361<p>
362An ordinary character is just a single character with no other
363significance, and matches that character.  A <tt>{</tt> followed by
364something else than a digit is considered an ordinary character.
365</p>
366
367
368<h3>Back references</h3>
369<a name="backref"></a>
370
371<table bgcolor="#e0e0f0" cellpadding="10">
372<tr><td>
373<pre>
374<i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
375</pre>
376</td></tr>
377</table>
378<p>
379A back reference is a backslash followed by a single non-zero decimal
380digit <i>d</i>.  It matches the same sequence of characters
381matched by the <i>d</i>th parenthesized subexpression.
382</p>
383
384<p>
385Back references are not defined for POSIX EREs (for BREs they are),
386but many matchers, including TRE, implement back references for both
387EREs and BREs.
388</p>
389
390<h3>Options</h3>
391<a name="options"></a>
392<table bgcolor="#e0e0f0" cellpadding="10">
393<tr><td>
394<pre>
395<i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
396</pre>
397</td></tr>
398</table>
399
400Options allow compile time options to be turned on/off for particular parts of the
401regular expression. The options equate to several compile time options specified to
402the regcomp API function. If the option is specified in the first section, it is
403turned on. If it is specified in the second section (after the <tt>-</tt>), it is
404turned off.
405<ul>
406<li>i - Case insensitive.
407<li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in
408the <a href="tre-api.html">API Manual</a>.
409<li>r - Causes the regex to be matched in a right associative manner rather than the normal
410left associative manner.
411<li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended.
412</ul>
413<h2>BRE Syntax</h2>
414
415<p>
416The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
417follows:
418</p>
419
420<ul>
421<li><tt>|</tt> is an ordinary character, and there is no equivalent
422for its functionality.  <tt>+</tt>, and <tt>?</tt> are ordinary
423characters.</li>
424<li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
425<tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
426<li>The parentheses for nested subexpressions are <tt>\(</tt> and
427<tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
428characters.</li>
429<li><tt>^</tt> is an ordinary character except at the beginning of the
430RE or the beginning of a parenthesized subexpression.  Similarly,
431<tt>$</tt> is an ordinary character except at the end of the
432RE or the end of a parenthesized subexpression.</li>
433</ul>
434