1<html> 2<head> 3<title>pcrematching specification</title> 4</head> 5<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> 6<h1>pcrematching man page</h1> 7<p> 8Return to the <a href="index.html">PCRE index page</a>. 9</p> 10<p> 11This page is part of the PCRE HTML documentation. It was generated automatically 12from the original man page. If there is any nonsense in it, please consult the 13man page, in case the conversion went wrong. 14<br> 15<ul> 16<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a> 17<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> 18<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> 19<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a> 20<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 21<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 22<li><a name="TOC7" href="#SEC7">AUTHOR</a> 23<li><a name="TOC8" href="#SEC8">REVISION</a> 24</ul> 25<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br> 26<P> 27This document describes the two different algorithms that are available in PCRE 28for matching a compiled regular expression against a given subject string. The 29"standard" algorithm is the one provided by the <b>pcre_exec()</b> and 30<b>pcre16_exec()</b> functions. These work in the same was as Perl's matching 31function, and provide a Perl-compatible matching operation. The just-in-time 32(JIT) optimization that is described in the 33<a href="pcrejit.html"><b>pcrejit</b></a> 34documentation is compatible with these functions. 35</P> 36<P> 37An alternative algorithm is provided by the <b>pcre_dfa_exec()</b> and 38<b>pcre16_dfa_exec()</b> functions; they operate in a different way, and are not 39Perl-compatible. This alternative has advantages and disadvantages compared 40with the standard algorithm, and these are described below. 41</P> 42<P> 43When there is only one possible way in which a given subject string can match a 44pattern, the two algorithms give the same answer. A difference arises, however, 45when there are multiple possibilities. For example, if the pattern 46<pre> 47 ^<.*> 48</pre> 49is matched against the string 50<pre> 51 <something> <something else> <something further> 52</pre> 53there are three possible answers. The standard algorithm finds only one of 54them, whereas the alternative algorithm finds all three. 55</P> 56<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br> 57<P> 58The set of strings that are matched by a regular expression can be represented 59as a tree structure. An unlimited repetition in the pattern makes the tree of 60infinite size, but it is still a tree. Matching the pattern to a given subject 61string (from a given starting point) can be thought of as a search of the tree. 62There are two ways to search a tree: depth-first and breadth-first, and these 63correspond to the two matching algorithms provided by PCRE. 64</P> 65<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br> 66<P> 67In the terminology of Jeffrey Friedl's book "Mastering Regular 68Expressions", the standard algorithm is an "NFA algorithm". It conducts a 69depth-first search of the pattern tree. That is, it proceeds along a single 70path through the tree, checking that the subject matches what is required. When 71there is a mismatch, the algorithm tries any alternatives at the current point, 72and if they all fail, it backs up to the previous branch point in the tree, and 73tries the next alternative branch at that level. This often involves backing up 74(moving to the left) in the subject string as well. The order in which 75repetition branches are tried is controlled by the greedy or ungreedy nature of 76the quantifier. 77</P> 78<P> 79If a leaf node is reached, a matching string has been found, and at that point 80the algorithm stops. Thus, if there is more than one possible match, this 81algorithm returns the first one that it finds. Whether this is the shortest, 82the longest, or some intermediate length depends on the way the greedy and 83ungreedy repetition quantifiers are specified in the pattern. 84</P> 85<P> 86Because it ends up with a single path through the tree, it is relatively 87straightforward for this algorithm to keep track of the substrings that are 88matched by portions of the pattern in parentheses. This provides support for 89capturing parentheses and back references. 90</P> 91<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br> 92<P> 93This algorithm conducts a breadth-first search of the tree. Starting from the 94first matching point in the subject, it scans the subject string from left to 95right, once, character by character, and as it does this, it remembers all the 96paths through the tree that represent valid matches. In Friedl's terminology, 97this is a kind of "DFA algorithm", though it is not implemented as a 98traditional finite state machine (it keeps multiple states active 99simultaneously). 100</P> 101<P> 102Although the general principle of this matching algorithm is that it scans the 103subject string only once, without backtracking, there is one exception: when a 104lookaround assertion is encountered, the characters following or preceding the 105current point have to be independently inspected. 106</P> 107<P> 108The scan continues until either the end of the subject is reached, or there are 109no more unterminated paths. At this point, terminated paths represent the 110different matching possibilities (if there are none, the match has failed). 111Thus, if there is more than one possible match, this algorithm finds all of 112them, and in particular, it finds the longest. The matches are returned in 113decreasing order of length. There is an option to stop the algorithm after the 114first match (which is necessarily the shortest) is found. 115</P> 116<P> 117Note that all the matches that are found start at the same point in the 118subject. If the pattern 119<pre> 120 cat(er(pillar)?)? 121</pre> 122is matched against the string "the caterpillar catchment", the result will be 123the three strings "caterpillar", "cater", and "cat" that start at the fifth 124character of the subject. The algorithm does not automatically move on to find 125matches that start at later positions. 126</P> 127<P> 128There are a number of features of PCRE regular expressions that are not 129supported by the alternative matching algorithm. They are as follows: 130</P> 131<P> 1321. Because the algorithm finds all possible matches, the greedy or ungreedy 133nature of repetition quantifiers is not relevant. Greedy and ungreedy 134quantifiers are treated in exactly the same way. However, possessive 135quantifiers can make a difference when what follows could also match what is 136quantified, for example in a pattern like this: 137<pre> 138 ^a++\w! 139</pre> 140This pattern matches "aaab!" but not "aaa!", which would be matched by a 141non-possessive quantifier. Similarly, if an atomic group is present, it is 142matched as if it were a standalone pattern at the current point, and the 143longest match is then "locked in" for the rest of the overall pattern. 144</P> 145<P> 1462. When dealing with multiple paths through the tree simultaneously, it is not 147straightforward to keep track of captured substrings for the different matching 148possibilities, and PCRE's implementation of this algorithm does not attempt to 149do this. This means that no captured substrings are available. 150</P> 151<P> 1523. Because no substrings are captured, back references within the pattern are 153not supported, and cause errors if encountered. 154</P> 155<P> 1564. For the same reason, conditional expressions that use a backreference as the 157condition or test for a specific group recursion are not supported. 158</P> 159<P> 1605. Because many paths through the tree may be active, the \K escape sequence, 161which resets the start of the match when encountered (but may be on some paths 162and not on others), is not supported. It causes an error if encountered. 163</P> 164<P> 1656. Callouts are supported, but the value of the <i>capture_top</i> field is 166always 1, and the value of the <i>capture_last</i> field is always -1. 167</P> 168<P> 1697. The \C escape sequence, which (in the standard algorithm) always matches a 170single data unit, even in UTF-8 or UTF-16 modes, is not supported in these 171modes, because the alternative algorithm moves through the subject string one 172character (not data unit) at a time, for all active paths through the tree. 173</P> 174<P> 1758. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not 176supported. (*FAIL) is supported, and behaves like a failing negative assertion. 177</P> 178<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 179<P> 180Using the alternative matching algorithm provides the following advantages: 181</P> 182<P> 1831. All possible matches (at a single point in the subject) are automatically 184found, and in particular, the longest match is found. To find more than one 185match using the standard algorithm, you have to do kludgy things with 186callouts. 187</P> 188<P> 1892. Because the alternative algorithm scans the subject string just once, and 190never needs to backtrack (except for lookbehinds), it is possible to pass very 191long subject strings to the matching function in several pieces, checking for 192partial matching each time. Although it is possible to do multi-segment 193matching using the standard algorithm by retaining partially matched 194substrings, it is more complicated. The 195<a href="pcrepartial.html"><b>pcrepartial</b></a> 196documentation gives details of partial matching and discusses multi-segment 197matching. 198</P> 199<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 200<P> 201The alternative algorithm suffers from a number of disadvantages: 202</P> 203<P> 2041. It is substantially slower than the standard algorithm. This is partly 205because it has to search for all possible matches, but is also because it is 206less susceptible to optimization. 207</P> 208<P> 2092. Capturing parentheses and back references are not supported. 210</P> 211<P> 2123. Although atomic groups are supported, their use does not provide the 213performance advantage that it does for the standard algorithm. 214</P> 215<br><a name="SEC7" href="#TOC1">AUTHOR</a><br> 216<P> 217Philip Hazel 218<br> 219University Computing Service 220<br> 221Cambridge CB2 3QH, England. 222<br> 223</P> 224<br><a name="SEC8" href="#TOC1">REVISION</a><br> 225<P> 226Last updated: 08 January 2012 227<br> 228Copyright © 1997-2012 University of Cambridge. 229<br> 230<p> 231Return to the <a href="index.html">PCRE index page</a>. 232</p> 233