1<h1>TRE Regexp Syntax</h1> 2 3<p> 4This document describes the POSIX 1003.2 extended RE (ERE) syntax and 5the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions 6to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style 7notation is used to describe the grammar. 8</p> 9 10<h2>ERE Syntax</h2> 11 12<h3>Alternation operator</h3> 13<a name="alternation"></a> 14<a name="extended-regexp"></a> 15 16<table bgcolor="#e0e0f0" cellpadding="10"> 17<tr><td> 18<pre> 19<i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a> 20 | <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a> 21</pre> 22</td></tr> 23</table> 24<p> 25An extended regexp (ERE) is one or more <i>branches</i>, separated by 26<tt>|</tt>. An ERE matches anything that matches one or more of the 27branches. 28</p> 29 30<h3>Catenation of REs</h3> 31<a name="catenation"></a> 32<a name="branch"></a> 33 34<table bgcolor="#e0e0f0" cellpadding="10"> 35<tr><td> 36<pre> 37<i>branch</i> ::= <i>piece</i> 38 | <i>branch</i> <i>piece</i> 39</pre> 40</td></tr> 41</table> 42<p> 43A branch is one or more <i>pieces</i> concatenated. It matches a 44match for the first piece, followed by a match for the second piece, 45and so on. 46</p> 47 48 49<table bgcolor="#e0e0f0" cellpadding="10"> 50<tr><td> 51<pre> 52<i>piece</i> ::= <i>atom</i> 53 | <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a> 54 | <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a> 55</pre> 56</td></tr> 57</table> 58<p> 59A piece is an <i>atom</i> possibly followed by a repeat operator or an 60expression controlling approximate matching parameters for the <i>atom</i>. 61</p> 62 63 64<table bgcolor="#e0e0f0" cellpadding="10"> 65<tr><td> 66<pre> 67<i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b> 68 | <a href="#bracket-expression"><i>bracket-expression</i></a> 69 | <b>"."</b> 70 | <a href="#assertion"><i>assertion</i></a> 71 | <a href="#literal"><i>literal</i></a> 72 | <a href="#backref"><i>back-reference</i></a> 73 | <b>"(?#"</b> <i>comment-text</i> <b>")"</b> 74 | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i> 75 | <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b> 76</pre> 77</td></tr> 78</table> 79<p> 80An atom is either an ERE enclosed in parenthesis, a bracket 81expression, a <tt>.</tt> (period), an assertion, or a literal. 82</p> 83 84<p> 85The dot (<tt>.</tt>) matches any single character. 86If the <code>REG_NEWLINE</code> compilation flag (see <a 87href="api.html">API manual</a>) is specified, the newline 88character is not matched. 89</p> 90 91<p> 92<tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is 93completely ignored by the regex parser and it used solely for readability purposes. 94</p> 95 96<h3>Repeat operators</h3> 97<a name="repeat-operator"></a> 98 99<table bgcolor="#e0e0f0" cellpadding="10"> 100<tr><td> 101<pre> 102<i>repeat-operator</i> ::= <b>"*"</b> 103 | <b>"+"</b> 104 | <b>"?"</b> 105 | <i>bound</i> 106 | <b>"*?"</b> 107 | <b>"+?"</b> 108 | <b>"??"</b> 109 | <i>bound</i> <b>?</b> 110</pre> 111</td></tr> 112</table> 113 114<p> 115An atom followed by <tt>*</tt> matches a sequence of 0 or more matches 116of the atom. <tt>+</tt> is similar to <tt>*</tt>, matching a sequence 117of 1 or more matches of the atom. An atom followed by <tt>?</tt> 118matches a sequence of 0 or 1 matches of the atom. 119</p> 120 121<p> 122A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i> 123are unsigned decimal integers between <tt>0</tt> and 124<tt>RE_DUP_MAX</tt>: 125</p> 126 127<ol> 128<li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li> 129<li><tt>{</tt><i>m</i><tt>,}</tt></li> 130<li><tt>{</tt><i>m</i><tt>}</tt></li> 131</ol> 132 133<p> 134An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i> 135(inclusive) matches of the atom. An atom followed by [2] 136matches a sequence of <i>m</i> or more matches of the atom. An atom 137followed by [3] matches a sequence of exactly <i>m</i> matches of the 138atom. 139</p> 140 141 142<p> 143Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or 144non-greedy. Normally a repeated expression is greedy, that is, it matches as 145many characters as possible. A non-greedy subexpression matches as few 146characters as possible. Note that this does not (always) mean the same thing 147as matching as many or few repetitions as possible. Also note 148that <strong>minimal repetitions are not currently supported for approximate 149matching</strong>. 150</p> 151 152<h3>Approximate matching settings</h3> 153<a name="approx-settings"></a> 154 155<table bgcolor="#e0e0f0" cellpadding="10"> 156<tr><td> 157<pre> 158<i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b> 159 160<i>count-limits</i> ::= <b>"+"</b> <i>number</i>? 161 | <b>"-"</b> <i>number</i>? 162 | <b>"#"</b> <i>number</i>? 163 | <b>"~"</b> <i>number</i>? 164 165<i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"<"</b> <i>number</i> 166 167<i>cost-term</i> ::= <i>number</i> <b>"i"</b> 168 | <i>number</i> <b>"d"</b> 169 | <i>number</i> <b>"s"</b> 170 171</pre> 172</td></tr> 173</table> 174 175<p> 176The approximate matching settings for a subpattern can be changed 177by appending <i>approx-settings</i> to the subpattern. Limits for 178the number of errors can be set and an expression for specifying and 179limiting the costs can be given. 180</p> 181 182<p> 183The <i>count-limits</i> can be used to set limits for the number of 184insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions 185(<tt>#</tt>), and total number of errors (<tt>~</tt>). If the 186<i>number</i> part is omitted, the specified error count will be 187unlimited. 188</p> 189 190<p> 191The <i>cost-equation</i> can be thought of as a mathematical equation, 192where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of 193insertions, deletions, and substitutions, respectively. The equation 194can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and 195<tt>s</tt>. The multiplier is the cost of the error, and the number 196after <tt><</tt> is the maximum allowed cost of a match. Spaces 197and pluses can be inserted to make the equation readable. In fact, when 198specifying only a cost equation, adding a space after the opening <tt>{</tt> 199is <strong>required</strong>. 200</p> 201 202<p> 203Examples: 204<dl> 205<dt><tt>{~}</tt></dt> 206<dd>Sets the maximum number of errors to unlimited.</dd> 207<dt><tt>{~3}</tt></dt> 208<dd>Sets the maximum number of errors to three.</dd> 209<dt><tt>{+2~5}</tt></dt> 210<dd>Sets the maximum number of errors to five, and the maximum number 211of insertions to two.</dd> 212<dt><tt>{<3}</tt></dt> 213<dd>Sets the maximum cost to three. 214<dt><tt>{ 2i + 1d + 2s < 5 }</tt></dt> 215<dd>Sets the cost of an insertion to two, a deletion to one, a 216substitution to two, and the maximum cost to five. 217</dl> 218 219 220<h3>Bracket expressions</h3> 221<a name="bracket-expression"></a> 222 223<table bgcolor="#e0e0f0" cellpadding="10"> 224<tr><td> 225<pre> 226<i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b> 227 | <b>"[^"</b> <i>item</i>+ <b>"]"</b> 228</pre> 229</td></tr> 230</table> 231 232<p> 233A bracket expression specifies a set of characters by enclosing a 234nonempty list of items in brackets. Normally anything matching any 235item in the list is matched. If the list begins with <tt>^</tt> the 236meaning is negated; any character matching no item in the list is 237matched. 238</p> 239 240<p> 241An item is any of the following: 242</p> 243<ul> 244<li>A single character, matching that character.</li> 245<li>Two characters separated by <tt>-</tt>. This is shorthand for the 246full range of characters between those two (inclusive) in the 247collating sequence. For example, <tt>[0-9]</tt> in ASCII matches any 248decimal digit.</li> 249<li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>, 250matching the collating element. This can be used to include a literal 251<tt>-</tt> or a multi-character collating element in the list.</li> 252<li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an 253equivalence class), matching all collating elements with the same 254primary collation weight as that element, including the element 255itself.</li> 256<li>The name of a character class enclosed in <tt>[:</tt> and 257<tt>:]</tt>, matching any character belonging to the class. The set 258of valid names depends on the <code>LC_CTYPE</code> category of the 259current locale, but the following names are valid in all locales: 260<ul> 261<li><tt>alnum</tt> - alphanumeric characters</li> 262<li><tt>alpha</tt> - alphabetic characters</li> 263<li><tt>blank</tt> - blank characters</li> 264<li><tt>cntrl</tt> - control characters</li> 265<li><tt>digit</tt> - decimal digits (0 through 9)</li> 266<li><tt>graph</tt> - all printable characters except space</li> 267<li><tt>lower</tt> - lower-case letters</li> 268<li><tt>print</tt> - printable characters including space</li> 269<li><tt>punct</tt> - printable characters not space or alphanumeric</li> 270<li><tt>space</tt> - white-space characters</li> 271<li><tt>upper</tt> - upper case letters</li> 272<li><tt>xdigit</tt> - hexadecimal digits</li> 273</ul> 274</ul> 275<p> 276To include a literal <tt>-</tt> in the list, make it either the first 277or last item, the second endpoint of a range, or enclose it in 278<tt>[.</tt> and <tt>.]</tt> to make it a collating element. To 279include a literal <tt>]</tt> in the list, make it either the first 280item, the second endpoint of a range, or enclose it in <tt>[.</tt> and 281<tt>.]</tt>. To use a literal <tt>-</tt> as the first 282endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>. 283</p> 284 285 286<h3>Assertions</h3> 287<a name="assertion"></a> 288 289<table bgcolor="#e0e0f0" cellpadding="10"> 290<tr><td> 291<pre> 292<i>assertion</i> ::= <b>"^"</b> 293 | <b>"$"</b> 294 | <b>"\"</b> <i>assertion-character</i> 295</pre> 296</td></tr> 297</table> 298 299<p> 300The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and 301"right anchor", respectively. The left anchor matches the empty 302string at the beginning of the string. The right anchor matches the 303empty string at the end of the string. The behaviour of both anchors 304can be varied by specifying certain execution and compilation flags; 305see the <a href="api.html">API manual</a>. 306</p> 307 308<p> 309An assertion-character can be any of the following: 310</p> 311 312<ul> 313<li><tt><</tt> - Beginning of word 314<li><tt>></tt> - End of word 315<li><tt>b</tt> - Word boundary 316<li><tt>B</tt> - Non-word boundary 317<li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li> 318<li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li> 319<li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li> 320<li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li> 321<li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li> 322<li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li> 323</ul> 324 325 326<h3>Literals</h3> 327<a name="literal"></a> 328 329<table bgcolor="#e0e0f0" cellpadding="10"> 330<tr><td> 331<pre> 332<i>literal</i> ::= <i>ordinary-character</i> 333 | <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2} 334 | <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b> 335 | <b>"\"</b> <i>character</i> 336</pre> 337</td></tr> 338</table> 339<p> 340A literal is either an ordinary character (a character that has no 341other significance in the context), an 8 bit hexadecimal encoded 342character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character 343(e.g. <tt>\x{263a}</tt>), or an escaped character. An escaped 344character is a <tt>\</tt> followed by any character, and matches that 345character. Escaping can be used to match characters which have a 346special meaning in regexp syntax. A <tt>\</tt> cannot be the last 347character of an ERE. Escaping also allows you to include a few 348non-printable characters in the regular expression. These special 349escape sequences include: 350</p> 351 352<ul> 353<li><tt>\a</tt> - Bell character (ASCII code 7) 354<li><tt>\e</tt> - Escape character (ASCII code 27) 355<li><tt>\f</tt> - Form-feed character (ASCII code 12) 356<li><tt>\n</tt> - New-line/line-feed character (ASCII code 10) 357<li><tt>\r</tt> - Carriage return character (ASCII code 13) 358<li><tt>\t</tt> - Horizontal tab character (ASCII code 9) 359</ul> 360 361<p> 362An ordinary character is just a single character with no other 363significance, and matches that character. A <tt>{</tt> followed by 364something else than a digit is considered an ordinary character. 365</p> 366 367 368<h3>Back references</h3> 369<a name="backref"></a> 370 371<table bgcolor="#e0e0f0" cellpadding="10"> 372<tr><td> 373<pre> 374<i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>] 375</pre> 376</td></tr> 377</table> 378<p> 379A back reference is a backslash followed by a single non-zero decimal 380digit <i>d</i>. It matches the same sequence of characters 381matched by the <i>d</i>th parenthesized subexpression. 382</p> 383 384<p> 385Back references are not defined for POSIX EREs (for BREs they are), 386but many matchers, including TRE, implement back references for both 387EREs and BREs. 388</p> 389 390<h3>Options</h3> 391<a name="options"></a> 392<table bgcolor="#e0e0f0" cellpadding="10"> 393<tr><td> 394<pre> 395<i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)? 396</pre> 397</td></tr> 398</table> 399 400Options allow compile time options to be turned on/off for particular parts of the 401regular expression. The options equate to several compile time options specified to 402the regcomp API function. If the option is specified in the first section, it is 403turned on. If it is specified in the second section (after the <tt>-</tt>), it is 404turned off. 405<ul> 406<li>i - Case insensitive. 407<li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in 408the <a href="tre-api.html">API Manual</a>. 409<li>r - Causes the regex to be matched in a right associative manner rather than the normal 410left associative manner. 411<li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended. 412</ul> 413<h2>BRE Syntax</h2> 414 415<p> 416The obsolete basic regexp (BRE) syntax differs from the ERE syntax as 417follows: 418</p> 419 420<ul> 421<li><tt>|</tt> is an ordinary character, and there is no equivalent 422for its functionality. <tt>+</tt>, and <tt>?</tt> are ordinary 423characters.</li> 424<li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with 425<tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li> 426<li>The parentheses for nested subexpressions are <tt>\(</tt> and 427<tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary 428characters.</li> 429<li><tt>^</tt> is an ordinary character except at the beginning of the 430RE or the beginning of a parenthesized subexpression. Similarly, 431<tt>$</tt> is an ordinary character except at the end of the 432RE or the end of a parenthesized subexpression.</li> 433</ul> 434