<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTY CODES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC7" href="#SEC7">CHARACTER CLASSES</a>
<li><a name="TOC8" href="#SEC8">QUANTIFIERS</a>
<li><a name="TOC9" href="#SEC9">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC10" href="#SEC10">MATCH POINT RESET</a>
<li><a name="TOC11" href="#SEC11">ALTERNATION</a>
<li><a name="TOC12" href="#SEC12">CAPTURING</a>
<li><a name="TOC13" href="#SEC13">ATOMIC GROUPS</a>
<li><a name="TOC14" href="#SEC14">COMMENT</a>
<li><a name="TOC15" href="#SEC15">OPTION SETTING</a>
<li><a name="TOC16" href="#SEC16">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC17" href="#SEC17">BACKREFERENCES</a>
<li><a name="TOC18" href="#SEC18">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC19" href="#SEC19">CONDITIONAL PATTERNS</a>
<li><a name="TOC20" href="#SEC20">BACKTRACKING CONTROL</a>
<li><a name="TOC21" href="#SEC21">NEWLINE CONVENTIONS</a>
<li><a name="TOC22" href="#SEC22">WHAT \R MATCHES</a>
<li><a name="TOC23" href="#SEC23">CALLOUTS</a>
<li><a name="TOC24" href="#SEC24">SEE ALSO</a>
<li><a name="TOC25" href="#SEC25">AUTHOR</a>
<li><a name="TOC26" href="#SEC26">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains just a quick-reference summary of the
syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
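<P>
As an illustration of backslash quoting, here is a minimal sketch in Python.
It assumes Python's <b>re</b> module, which is not PCRE but treats a backslash
before a non-alphanumeric character the same way. Python has no \Q...\E, so
the sketch uses <b>re.escape()</b> as a stand-in for quoting a whole literal
string; that substitution is an assumption of this example, not PCRE behaviour.
</P>

```python
import re

# A backslash before a non-alphanumeric character makes it a literal.
literal_dot_star = re.compile(r"a\.b\*")
escaped_ok = literal_dot_star.fullmatch("a.b*") is not None      # True
unescaped_meaning = literal_dot_star.fullmatch("aXbbb") is None  # True

# Stand-in for \Q...\E (not available in Python's re): escape every
# special character in a literal string before using it as a pattern.
quoted = re.escape("1+1=2?")
matches = re.findall(quoted, "is 1+1=2? yes, 1+1=2?")
```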
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</PRE>
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one byte, even in UTF-8 mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence
</pre>
In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
</P>
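<P>
The ASCII-only behaviour of \d, \s, and \w can be sketched in Python. This
assumes Python's <b>re</b> module with the <b>re.ASCII</b> flag, which is used
here to approximate the PCRE behaviour described above; it is not PCRE itself,
and escapes such as \h, \R, and \X are not available there.
</P>

```python
import re

text = "room 42\tfloor \u0663"  # U+0663 is ARABIC-INDIC DIGIT THREE

# With re.ASCII, \d and \w match only ASCII characters,
# so the non-ASCII digit is ignored by both.
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)  # ['42']
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)   # ['room', '42', 'floor']

# \s matches any whitespace character; \D matches any non-digit.
fields = re.split(r"\s+", "a b\t\tc\nd")                 # ['a', 'b', 'c', 'd']
digits_only = re.sub(r"\D", "", "tel: 555-0100")         # '5550100'
```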
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTY CODES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC7" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters. You can use
\Q...\E inside a character class.
</P>
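<P>
A small sketch of positive classes, negative classes, and ranges, written for
Python's <b>re</b> module as an assumption of this example. Python's
<b>re</b> does not support the POSIX named sets listed above, so the sketch
spells out the ASCII ranges that [[:xdigit:]] and [[:alnum:]] would cover.
</P>

```python
import re

# Positive class with ranges: runs of upper-case hexadecimal digits.
hex_runs = re.findall(r"[0-9A-F]+", "id: 7F3A, key: zz")  # ['7F3A']

# Negative class: delete everything that is not a lower-case ASCII letter.
letters_only = re.sub(r"[^a-z]", "", "a1b2-c3")           # 'abc'

# Spelled-out ASCII equivalent of the POSIX set "alnum".
is_alnum = re.fullmatch(r"[A-Za-z0-9]+", "abc123") is not None
not_alnum = re.fullmatch(r"[A-Za-z0-9]+", "abc 123") is None
```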
<br><a name="SEC8" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
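<P>
The difference between greedy and lazy quantifiers is easiest to see on a
small subject string. The sketch below assumes Python's <b>re</b> module,
whose greedy and lazy quantifiers use the same syntax as the table above;
possessive quantifiers are left out because older Python versions do not
accept them.
</P>

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to find the last ">".
greedy = re.match(r"<.+>", subject).group()          # '<a><b>'

# Lazy: .+? takes as little as it can, stopping at the first ">".
lazy = re.match(r"<.+?>", subject).group()           # '<a>'

# Bounded repetition: greedy prefers the upper bound, lazy the lower bound.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")    # ['22', '333', '444']
bounded_lazy = re.findall(r"\d{2,3}?", "4444")       # ['44', '44']

# "?" makes the preceding item optional.
optional = re.fullmatch(r"colou?r", "color") is not None
```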
<br><a name="SEC9" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary (only ASCII letters recognized)
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
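<P>
The effect of multiline mode on ^ and $ can be sketched as follows. The
example assumes Python's <b>re</b> module, where ^, $, \A, and \b behave as
described above. \Z, \z, and \G are deliberately omitted: Python's \Z
corresponds to PCRE's \z, and Python has no \G.
</P>

```python
import re

subject = "one two\nthree four\n"

# Without multiline mode, ^ matches only at the start of the subject.
first_default = re.findall(r"^\w+", subject)                        # ['one']

# In multiline mode, ^ also matches after each internal newline.
first_multiline = re.findall(r"^\w+", subject, flags=re.MULTILINE)  # ['one', 'three']

# $ matches before a newline that ends the subject.
dollar_before_final_newline = re.search(r"four$", subject) is not None

# \A matches only at the very start, whatever the mode.
a_is_strict = re.search(r"\Athree", subject, flags=re.MULTILINE) is None

# \b restricts the match to whole words.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")             # ['cat', 'cat']
```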
<br><a name="SEC10" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
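<P>
The following sketch shows how numbered, named, and non-capturing groups
differ in what they record. It assumes Python's <b>re</b> module and
therefore uses only the Python-style named group; the Perl-style forms and
the (?|...) group are not accepted there.
</P>

```python
import re

# One named group, one numbered group, and one non-capturing group.
date = re.match(r"(?P<year>\d{4})-(\d{2})-(?:\d{2})", "2010-03-01")

year = date.group("year")   # '2010', also available as group 1
month = date.group(2)       # '03'
captured = date.groups()    # ('2010', '03'): the day was matched, not captured
```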
<br><a name="SEC13" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC14" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following is recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*UTF8)         set UTF-8 mode
</PRE>
</P>
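<P>
Inline option settings can be tried out with the sketch below. It assumes
Python's <b>re</b> module, which accepts (?i), (?m), (?s), and (?x) at the
start of a pattern; (?J), (?U), and (*UTF8) are PCRE-specific and are not
shown.
</P>

```python
import re

# (?i): letters match regardless of case.
caseless = re.fullmatch(r"(?i)pcre", "PCRE") is not None

# (?s): dot also matches a newline; without it, the match fails.
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None
no_dotall = re.fullmatch(r"a.b", "a\nb") is None

# (?m): ^ and $ match at internal line boundaries.
lines = re.findall(r"(?m)^\w+$", "alpha\nbeta")      # ['alpha', 'beta']

# (?x): white space in the pattern is ignored and "#" starts a comment.
extended = re.fullmatch(r"(?x) \d{4} - \d{2}  # year, then month", "2010-03")
```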
<br><a name="SEC16" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
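<P>
The four assertions can be compared side by side in a short sketch. It
assumes Python's <b>re</b> module, which uses the same syntax; note that
Python requires the whole look behind to be of fixed length, which is
stricter than the per-branch rule stated above.
</P>

```python
import re

prices = "10 USD, 20 EUR, 30 USD"

# Positive look ahead: numbers that are followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)              # ['10', '30']

# Negative look ahead: whole numbers that are not followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)      # ['20']

# Positive look behind: words that are preceded by "#".
tags = re.findall(r"(?<=#)\w+", "#pcre and #regex")   # ['pcre', 'regex']

# Negative look behind: numbers that are not preceded by "$".
plain = re.findall(r"(?<!\$)\b\d+", "$5 and 7")       # ['7']
```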
<br><a name="SEC17" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
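<P>
A backreference matches the same text that the referenced group captured,
not the group's pattern. The sketch below assumes Python's <b>re</b> module,
so it is limited to the \n and (?P=name) forms; the \g and \k forms are not
accepted in Python patterns.
</P>

```python
import re

# \1 must repeat exactly what group 1 captured: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_word = doubled.group(1)                       # 'is'

# (?P=q) must match the same quote character that opened the string.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()  # "'hi'"

# Mismatched quotes do not satisfy the backreference.
mismatched = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi\" now") is None
```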
<br><a name="SEC18" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
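<P>
As a minimal sketch of the absolute reference condition, the pattern below
requires a closing angle bracket only if an opening one was captured. It
assumes Python's <b>re</b> module, which supports the (?(n)...) form with a
group number or name; the recursion, DEFINE, and assertion conditions are
PCRE features that are not shown.
</P>

```python
import re

# Group 1 optionally captures "<"; (?(1)>) demands ">" only if it did.
tag = re.compile(r"(<)?\w+(?(1)>)")

bracketed = tag.fullmatch("<tag>") is not None   # both brackets present
bare = tag.fullmatch("tag") is not None          # no brackets at all
unclosed = tag.fullmatch("<tag") is None         # opened but never closed
unopened = tag.fullmatch("tag>") is None         # closed but never opened
```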
<br><a name="SEC20" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance start to current matching position
  (*THEN)         local failure, backtrack to next alternation
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC25" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC26" href="#TOC1">REVISION</a><br>
<P>
Last updated: 01 March 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>