1<html>
2<head>
3<title>pcrematching specification</title>
4</head>
5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6<h1>pcrematching man page</h1>
7<p>
8Return to the <a href="index.html">PCRE index page</a>.
9</p>
10<p>
11This page is part of the PCRE HTML documentation. It was generated automatically
12from the original man page. If there is any nonsense in it, please consult the
13man page, in case the conversion went wrong.
14<br>
15<ul>
16<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a>
17<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
18<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
19<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
20<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
21<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
22<li><a name="TOC7" href="#SEC7">AUTHOR</a>
23<li><a name="TOC8" href="#SEC8">REVISION</a>
24</ul>
25<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br>
26<P>
27This document describes the two different algorithms that are available in PCRE
28for matching a compiled regular expression against a given subject string. The
29"standard" algorithm is the one provided by the <b>pcre_exec()</b> and
30<b>pcre16_exec()</b> functions. These work in the same was as Perl's matching
31function, and provide a Perl-compatible matching operation. The just-in-time
32(JIT) optimization that is described in the
33<a href="pcrejit.html"><b>pcrejit</b></a>
34documentation is compatible with these functions.
35</P>
36<P>
37An alternative algorithm is provided by the <b>pcre_dfa_exec()</b> and
38<b>pcre16_dfa_exec()</b> functions; they operate in a different way, and are not
39Perl-compatible. This alternative has advantages and disadvantages compared
40with the standard algorithm, and these are described below.
41</P>
42<P>
43When there is only one possible way in which a given subject string can match a
44pattern, the two algorithms give the same answer. A difference arises, however,
45when there are multiple possibilities. For example, if the pattern
46<pre>
47  ^&#60;.*&#62;
48</pre>
49is matched against the string
50<pre>
51  &#60;something&#62; &#60;something else&#62; &#60;something further&#62;
52</pre>
53there are three possible answers. The standard algorithm finds only one of
54them, whereas the alternative algorithm finds all three.
55</P>
56<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
57<P>
58The set of strings that are matched by a regular expression can be represented
59as a tree structure. An unlimited repetition in the pattern makes the tree of
60infinite size, but it is still a tree. Matching the pattern to a given subject
61string (from a given starting point) can be thought of as a search of the tree.
62There are two ways to search a tree: depth-first and breadth-first, and these
63correspond to the two matching algorithms provided by PCRE.
64</P>
65<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
66<P>
67In the terminology of Jeffrey Friedl's book "Mastering Regular
68Expressions", the standard algorithm is an "NFA algorithm". It conducts a
69depth-first search of the pattern tree. That is, it proceeds along a single
70path through the tree, checking that the subject matches what is required. When
71there is a mismatch, the algorithm tries any alternatives at the current point,
72and if they all fail, it backs up to the previous branch point in the tree, and
73tries the next alternative branch at that level. This often involves backing up
74(moving to the left) in the subject string as well. The order in which
75repetition branches are tried is controlled by the greedy or ungreedy nature of
76the quantifier.
77</P>
78<P>
79If a leaf node is reached, a matching string has been found, and at that point
80the algorithm stops. Thus, if there is more than one possible match, this
81algorithm returns the first one that it finds. Whether this is the shortest,
82the longest, or some intermediate length depends on the way the greedy and
83ungreedy repetition quantifiers are specified in the pattern.
84</P>
85<P>
86Because it ends up with a single path through the tree, it is relatively
87straightforward for this algorithm to keep track of the substrings that are
88matched by portions of the pattern in parentheses. This provides support for
89capturing parentheses and back references.
90</P>
91<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
92<P>
93This algorithm conducts a breadth-first search of the tree. Starting from the
94first matching point in the subject, it scans the subject string from left to
95right, once, character by character, and as it does this, it remembers all the
96paths through the tree that represent valid matches. In Friedl's terminology,
97this is a kind of "DFA algorithm", though it is not implemented as a
98traditional finite state machine (it keeps multiple states active
99simultaneously).
100</P>
101<P>
102Although the general principle of this matching algorithm is that it scans the
103subject string only once, without backtracking, there is one exception: when a
104lookaround assertion is encountered, the characters following or preceding the
105current point have to be independently inspected.
106</P>
107<P>
108The scan continues until either the end of the subject is reached, or there are
109no more unterminated paths. At this point, terminated paths represent the
110different matching possibilities (if there are none, the match has failed).
111Thus, if there is more than one possible match, this algorithm finds all of
112them, and in particular, it finds the longest. The matches are returned in
113decreasing order of length. There is an option to stop the algorithm after the
114first match (which is necessarily the shortest) is found.
115</P>
116<P>
117Note that all the matches that are found start at the same point in the
118subject. If the pattern
119<pre>
120  cat(er(pillar)?)?
121</pre>
122is matched against the string "the caterpillar catchment", the result will be
123the three strings "caterpillar", "cater", and "cat" that start at the fifth
124character of the subject. The algorithm does not automatically move on to find
125matches that start at later positions.
126</P>
127<P>
128There are a number of features of PCRE regular expressions that are not
129supported by the alternative matching algorithm. They are as follows:
130</P>
131<P>
1321. Because the algorithm finds all possible matches, the greedy or ungreedy
133nature of repetition quantifiers is not relevant. Greedy and ungreedy
134quantifiers are treated in exactly the same way. However, possessive
135quantifiers can make a difference when what follows could also match what is
136quantified, for example in a pattern like this:
137<pre>
138  ^a++\w!
139</pre>
140This pattern matches "aaab!" but not "aaa!", which would be matched by a
141non-possessive quantifier. Similarly, if an atomic group is present, it is
142matched as if it were a standalone pattern at the current point, and the
143longest match is then "locked in" for the rest of the overall pattern.
144</P>
145<P>
1462. When dealing with multiple paths through the tree simultaneously, it is not
147straightforward to keep track of captured substrings for the different matching
148possibilities, and PCRE's implementation of this algorithm does not attempt to
149do this. This means that no captured substrings are available.
150</P>
151<P>
1523. Because no substrings are captured, back references within the pattern are
153not supported, and cause errors if encountered.
154</P>
155<P>
1564. For the same reason, conditional expressions that use a backreference as the
157condition or test for a specific group recursion are not supported.
158</P>
159<P>
1605. Because many paths through the tree may be active, the \K escape sequence,
161which resets the start of the match when encountered (but may be on some paths
162and not on others), is not supported. It causes an error if encountered.
163</P>
164<P>
1656. Callouts are supported, but the value of the <i>capture_top</i> field is
166always 1, and the value of the <i>capture_last</i> field is always -1.
167</P>
168<P>
1697. The \C escape sequence, which (in the standard algorithm) always matches a
170single data unit, even in UTF-8 or UTF-16 modes, is not supported in these
171modes, because the alternative algorithm moves through the subject string one
172character (not data unit) at a time, for all active paths through the tree.
173</P>
174<P>
1758. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
176supported. (*FAIL) is supported, and behaves like a failing negative assertion.
177</P>
178<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
179<P>
180Using the alternative matching algorithm provides the following advantages:
181</P>
182<P>
1831. All possible matches (at a single point in the subject) are automatically
184found, and in particular, the longest match is found. To find more than one
185match using the standard algorithm, you have to do kludgy things with
186callouts.
187</P>
188<P>
1892. Because the alternative algorithm scans the subject string just once, and
190never needs to backtrack (except for lookbehinds), it is possible to pass very
191long subject strings to the matching function in several pieces, checking for
192partial matching each time. Although it is possible to do multi-segment
193matching using the standard algorithm by retaining partially matched
194substrings, it is more complicated. The
195<a href="pcrepartial.html"><b>pcrepartial</b></a>
196documentation gives details of partial matching and discusses multi-segment
197matching.
198</P>
199<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
200<P>
201The alternative algorithm suffers from a number of disadvantages:
202</P>
203<P>
2041. It is substantially slower than the standard algorithm. This is partly
205because it has to search for all possible matches, but is also because it is
206less susceptible to optimization.
207</P>
208<P>
2092. Capturing parentheses and back references are not supported.
210</P>
211<P>
2123. Although atomic groups are supported, their use does not provide the
213performance advantage that it does for the standard algorithm.
214</P>
215<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
216<P>
217Philip Hazel
218<br>
219University Computing Service
220<br>
221Cambridge CB2 3QH, England.
222<br>
223</P>
224<br><a name="SEC8" href="#TOC1">REVISION</a><br>
225<P>
226Last updated: 08 January 2012
227<br>
228Copyright &copy; 1997-2012 University of Cambridge.
229<br>
230<p>
231Return to the <a href="index.html">PCRE index page</a>.
232</p>
233