<html>
<head>
<title>pcrepartial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrepartial man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec()</a>
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec()</a>
<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
<li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
<li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
<li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec()</a>
<li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
<li><a name="TOC10" href="#SEC10">AUTHOR</a>
<li><a name="TOC11" href="#SEC11">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
<P>
In normal use of PCRE, if the subject string that is passed to
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
</P>
<P>
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form <i>ddmmmyy</i>, defined by this pattern:
<pre>
  ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
</pre>
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered. Partial matching can also sometimes be useful when the subject string
is very long and is not all available at once.
</P>
<P>
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or
<b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
for PCRE_PARTIAL_SOFT. The essential difference between the two options is
whether or not a partial match is preferred to an alternative complete match,
though the details differ between the two matching functions. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
</P>
<P>
Setting a partial matching option disables two of PCRE's optimizations. PCRE
remembers the last literal byte in a pattern, and abandons matching immediately
if such a byte is not present in the subject string. This optimization cannot
be used for a subject string that might match only partially. If the pattern
was studied, PCRE knows the minimum length of a matching string, and does not
bother to run the matching function on shorter strings. This optimization is
also disabled for partial matching.
</P>
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
<P>
A partial match occurs during a call to <b>pcre_exec()</b> whenever the end of
the subject string is reached successfully, but matching cannot continue
because more characters are needed. However, at least one character must have
been matched. (In other words, a partial match can never be an empty string.)
</P>
<P>
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
continues as normal, and other alternatives in the pattern are tried. If no
complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL
instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
vector, the first of them is set to the offset of the earliest character that
was inspected when the partial match was found. For convenience, the second
offset points to the end of the string so that a substring can easily be
identified.
</P>
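<P>
As a rough illustration of how an application might react to the value returned
by <b>pcre_exec()</b> when a partial matching option is set, the sketch below
maps the return code onto a state for an input field. The type
<i>field_state</i> and the function <i>classify_result()</i> are hypothetical
names invented for this example. The two error codes are given the values they
have in <b>pcre.h</b>, and are defined locally only so that the fragment
compiles without the PCRE headers.
</P>

```c
#include <assert.h>

/* Return codes as defined in pcre.h; repeated here only so that this
   sketch compiles without the PCRE headers. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH  (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL  (-12)
#endif

/* Hypothetical application-level states for an input field. */
typedef enum {
    FIELD_COMPLETE,    /* the pattern matched in full */
    FIELD_INCOMPLETE,  /* a partial match: more input is needed */
    FIELD_INVALID,     /* no match and no partial match */
    FIELD_ERROR        /* any other negative return code */
} field_state;

/* Map the value returned by pcre_exec() onto a field state.
   A return of 0 means the offsets vector was too small, but the
   match itself succeeded, so it is treated as a complete match. */
field_state classify_result(int rc)
{
    if (rc >= 0) return FIELD_COMPLETE;
    if (rc == PCRE_ERROR_PARTIAL) return FIELD_INCOMPLETE;
    if (rc == PCRE_ERROR_NOMATCH) return FIELD_INVALID;
    return FIELD_ERROR;
}
```

<P>
In the keystroke example, an application could accept the character just typed
for FIELD_COMPLETE and FIELD_INCOMPLETE, and reject it for FIELD_INVALID.
</P>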
<P>
For the majority of patterns, the first offset identifies the start of the
partially matched string. However, for patterns that contain lookbehind
assertions, or \K, or begin with \b or \B, earlier characters have been
inspected while carrying out the match. For example:
<pre>
  /(?&#60;=abc)123/
</pre>
This pattern matches "123", but only if it is preceded by "abc". If the subject
string is "xyzabc12", the offsets after a partial match are for the substring
"abc12", because all these characters are needed if another match is tried
with extra characters added.
</P>
<P>
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
<pre>
  /123\w+X|dogY/
</pre>
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached during
matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
offsets are set to 3 and 9, identifying "123dog" as the first partial match
that was found. (In this example, there are two partial matches, because "dog"
on its own partially matches the second alternative.)
</P>
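<P>
The two offsets can be turned into a substring with a few lines of C. The
helper below is a minimal sketch. <i>copy_partial()</i> is a hypothetical name,
not part of the PCRE API, and it assumes the caller passes the same offsets
vector that <b>pcre_exec()</b> filled in when it returned PCRE_ERROR_PARTIAL.
</P>

```c
#include <assert.h>
#include <string.h>

/* Copy the substring identified by ovector[0]..ovector[1] into out
   and terminate it with a zero byte. Returns the number of bytes
   copied, or -1 if out is too small to hold the substring. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t outsize)
{
    assert(ovector[0] >= 0 && ovector[1] >= ovector[0]);
    size_t len = (size_t)(ovector[1] - ovector[0]);
    if (len + 1 > outsize) return -1;
    memcpy(out, subject + ovector[0], len);
    out[len] = '\0';
    return (int)len;
}
```

<P>
With the subject "abc123dog" and offsets 3 and 9 this yields "123dog". With
"xyzabc12" and offsets 3 and 8 it yields "abc12", which includes the
characters inspected by the lookbehind.
</P>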
<P>
If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
search for possible complete matches. The difference between the two options
can be illustrated by a pattern such as:
<pre>
  /dog(sbody)?/
</pre>
This matches either "dog" or "dogsbody", greedily (that is, it prefers the
longer string if possible). If it is matched against the string "dog" with
PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
if the pattern is made ungreedy the result is different:
<pre>
  /dog(sbody)??/
</pre>
In this case the result is always a complete match because <b>pcre_exec()</b>
finds that first, and it never continues after finding a match. It might be
easier to follow this explanation by thinking of the two patterns like this:
<pre>
  /dog(sbody)?/    is the same as  /dogsbody|dog/
  /dog(sbody)??/   is the same as  /dog|dogsbody/
</pre>
The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
used, because it will always find the shorter match first.
</P>
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br>
<P>
The <b>pcre_dfa_exec()</b> function moves along the subject string character by
character, without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of the
pattern, there is the possibility of a partial match, again provided that at
least one character has matched.
</P>
<P>
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
complete matches. The portion of the string that was inspected when the longest
partial match was found is set as the first matching string, provided there are
at least two slots in the offsets vector.
</P>
<P>
Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
there is no difference between greedy and ungreedy repetition, its behaviour is
different from <b>pcre_exec()</b> when PCRE_PARTIAL_HARD is set. Consider the
string "dog" matched against the ungreedy pattern shown above:
<pre>
  /dog(sbody)??/
</pre>
Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
"dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
</P>
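<P>
The precedence rule for <b>pcre_dfa_exec()</b> described in this section can be
written down as a small decision function. This is only a model of the
documented behaviour, not PCRE code. The names <i>dfa_result</i> and
<i>dfa_outcome()</i> are invented for the illustration, and the two flags stand
for "at least one complete match was found" and "a partial match was found at
the end of the subject".
</P>

```c
#include <assert.h>

/* Hypothetical result categories for this model. */
typedef enum { RESULT_NOMATCH, RESULT_PARTIAL, RESULT_COMPLETE } dfa_result;

/* Model of which result is reported:
   - with PCRE_PARTIAL_HARD (hard != 0) a partial match wins over
     any complete matches;
   - with PCRE_PARTIAL_SOFT (hard == 0) a partial match is reported
     only when there are no complete matches. */
dfa_result dfa_outcome(int have_complete, int have_partial, int hard)
{
    if (hard && have_partial) return RESULT_PARTIAL;
    if (have_complete) return RESULT_COMPLETE;
    if (have_partial) return RESULT_PARTIAL;
    return RESULT_NOMATCH;
}
```

<P>
For the string "dog" and the pattern /dog(sbody)??/, both flags are set, so
the model gives a complete match for the soft option and a partial match for
the hard option.
</P>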
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
<P>
If a pattern ends with one of the sequences \b or \B, which test for word
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
results. Consider this pattern:
<pre>
  /\bcat\b/
</pre>
This matches "cat", provided there is a word boundary at either end. If the
subject string is "the cat", the comparison of the final "t" with a following
character cannot take place, so a partial match is found. However,
<b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
of the subject when the last character is a letter, thus finding a complete
match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match.
</P>
<P>
Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
then the partial match takes precedence.
</P>
<br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
<P>
For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the <b>pcre_exec()</b> function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
all patterns. From release 8.00 onwards, the restrictions no longer apply, and
partial matching with <b>pcre_exec()</b> can be requested for any pattern.
</P>
<P>
Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
conform to the restrictions, <b>pcre_exec()</b> returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out if a compiled
pattern can be used for partial matching now always returns 1.
</P>
<br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
<P>
If the escape sequence \P is present in a <b>pcretest</b> data line, the
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
that uses the date example quoted above:
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 25jun04\P
   0: 25jun04
   1: jun
  data&#62; 25dec3\P
  Partial match: 25dec3
  data&#62; 3ju\P
  Partial match: 3ju
  data&#62; 3juj\P
  No match
  data&#62; j\P
  No match
</pre>
The first data string is matched completely, so <b>pcretest</b> shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. Similar output is obtained
when <b>pcre_dfa_exec()</b> is used.
</P>
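<P>
To make the three outcomes concrete without depending on the library, the
following sketch classifies a string against this one date pattern by hand. It
is a stand-in written for illustration only: it does not call PCRE and says
nothing about how PCRE works internally. The names <i>outcome</i>,
<i>try_with_day_digits()</i> and <i>classify_date()</i> are hypothetical. The
two ways of reading \d?\d (one or two day digits) are tried separately, and the
better result is kept, which corresponds to the PCRE_PARTIAL_SOFT preference
for a complete match over a partial one.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Ordered so that a larger value is the preferred result. */
typedef enum { NO_MATCH, PARTIAL_MATCH, COMPLETE_MATCH } outcome;

static const char *const months[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Classify s as: ndigits day digits, a month name, two year digits. */
static outcome try_with_day_digits(const char *s, size_t ndigits)
{
    size_t len = strlen(s), i = 0;
    assert(ndigits == 1 || ndigits == 2);

    /* Day digits. Running out of input counts as a partial match
       only if at least one character has been matched. */
    for (size_t k = 0; k < ndigits; k++, i++) {
        if (i == len) return i > 0 ? PARTIAL_MATCH : NO_MATCH;
        if (!isdigit((unsigned char)s[i])) return NO_MATCH;
    }

    /* Month name: compare as many characters as are available. */
    size_t rest = len - i;
    size_t cmp = rest < 3 ? rest : 3;
    int month_ok = 0;
    for (size_t m = 0; m < 12; m++)
        if (strncmp(s + i, months[m], cmp) == 0) month_ok = 1;
    if (!month_ok) return NO_MATCH;
    if (rest < 3) return PARTIAL_MATCH;
    i += 3;

    /* Two year digits, then the end of the string. */
    for (size_t k = 0; k < 2; k++, i++) {
        if (i == len) return PARTIAL_MATCH;
        if (!isdigit((unsigned char)s[i])) return NO_MATCH;
    }
    return i == len ? COMPLETE_MATCH : NO_MATCH;
}

/* Try both readings of \d?\d and keep the better result. */
outcome classify_date(const char *s)
{
    outcome two = try_with_day_digits(s, 2);
    outcome one = try_with_day_digits(s, 1);
    return two > one ? two : one;
}
```

<P>
Applied to the five data lines in the <b>pcretest</b> run, this sketch reports
a complete match for "25jun04", partial matches for "25dec3" and "3ju", and no
match for "3juj" and "j".
</P>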
<P>
If the escape sequence \P is present more than once in a <b>pcretest</b> data
line, the PCRE_PARTIAL_HARD option is set for the match.
</P>
<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
<P>
When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
to continue the match by providing additional subject data and calling
<b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
time setting the PCRE_DFA_RESTART option. You must pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is an example using <b>pcretest</b>, using the \R escape
sequence to set the PCRE_DFA_RESTART option (\D specifies the use of
<b>pcre_dfa_exec()</b>):
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 23ja\P\D
  Partial match: 23ja
  data&#62; n05\R\D
   0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
</P>
<P>
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments. This
facility can be used to pass very long subject strings to
<b>pcre_dfa_exec()</b>.
</P>
<br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br>
<P>
From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
previous match with a new segment of data. Instead, new data must be added to
the previous subject string, and the entire match re-run, starting from the
point where the partial match occurred. Earlier data can be discarded.
Consider an unanchored pattern that matches dates:
<pre>
    re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data&#62; The date is 23ja\P
  Partial match: 23ja
</pre>
At this stage, an application could discard the text preceding "23ja", add on
text from the next segment, and call <b>pcre_exec()</b> again. Unlike
<b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
the complete matching process occurs for each call, so more memory and more
processing time are needed.
</P>
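<P>
The "discard and append" step can be sketched as a small buffer helper. The
function <i>carry_over()</i> and its parameters are hypothetical names chosen
for this example, and <i>keep_from</i> is assumed to be the first offset that
<b>pcre_exec()</b> stored when it returned PCRE_ERROR_PARTIAL. The helper only
rearranges the buffer. The caller is still responsible for calling
<b>pcre_exec()</b> again on the result.
</P>

```c
#include <assert.h>
#include <string.h>

/* Drop everything in buf before keep_from, then append the next
   segment. buf holds len bytes and has room for cap bytes in total.
   Returns the new length, or (size_t)-1 if the retained text plus
   the new segment would not fit (buf is then left unchanged). */
size_t carry_over(char *buf, size_t len, size_t keep_from,
                  const char *next, size_t next_len, size_t cap)
{
    assert(keep_from <= len && len <= cap);
    size_t kept = len - keep_from;
    if (kept + next_len > cap) return (size_t)-1;
    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, next_len);
    return kept + next_len;
}
```

<P>
For the subject "The date is 23ja" the partial match starts at offset 12, so
after appending a next segment that begins "n05" the buffer starts with
"23jan05", and a second call to <b>pcre_exec()</b> can find the complete date.
</P>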
<P>
<b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match will include
characters that precede the partially matched string itself, because these must
be retained when adding on more characters for a subsequent matching attempt.
</P>
<br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
<P>
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
</P>
<P>
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
</P>
<P>
2. Lookbehind assertions at the start of a pattern are catered for in the
offsets that are returned for a partial match. However, in theory, a lookbehind
assertion later in the pattern could require even earlier characters to be
inspected, and it might not have been reached when a partial match occurs. This
is probably an extremely unlikely case; you could guard against it to a certain
extent by always including extra characters at the start.
</P>
<P>
3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\b or \B. Another kind of difference may occur when there are multiple
matching possibilities, because a partial match result is given only when there
are no completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider again this <b>pcretest</b> example:
<pre>
    re&#62; /dog(sbody)?/
  data&#62; dogsb\P
   0: dog
  data&#62; do\P\D
  Partial match: do
  data&#62; gsb\R\P\D
   0: g
  data&#62; dogsbody\D
   0: dogsbody
   1: dog
</pre>
The first data line passes the string "dogsb" to <b>pcre_exec()</b>, setting the
PCRE_PARTIAL_SOFT option. Although the string is a partial match for
"dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
"dog" is a complete match. Similarly, when the subject is presented to
<b>pcre_dfa_exec()</b> in several parts ("do" and "gsb" being the first two) the
match stops when "dog" has been found, and it is not possible to continue. On
the other hand, if "dogsbody" is presented as a single string,
<b>pcre_dfa_exec()</b> finds both matches.
</P>
<P>
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when
matching multi-segment data. The example above then behaves differently:
<pre>
    re&#62; /dog(sbody)?/
  data&#62; dogsb\P\P
  Partial match: dogsb
  data&#62; do\P\D
  Partial match: do
  data&#62; gsb\R\P\P\D
  Partial match: gsb

</PRE>
</P>
<P>
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE_DFA_RESTART is used with <b>pcre_dfa_exec()</b>. For example, consider this
pattern:
<pre>
  1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
<pre>
  1234|ABCD
</pre>
where no string can be a partial match for both alternatives. This is not a
problem if <b>pcre_exec()</b> is used, because the entire match has to be rerun
each time:
<pre>
    re&#62; /1234|3789/
  data&#62; ABC123\P
  Partial match: 123
  data&#62; 1237890
   0: 3789
</pre>
Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
the entire match can also be used with <b>pcre_dfa_exec()</b>. Another
possibility is to work with two buffers. If a partial match at offset <i>n</i>
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
the second buffer, you can then try a new match starting at offset <i>n+1</i> in
the first buffer.
</P>
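<P>
Returning to item 1 in the list above, the choice of PCRE_NOTBOL and
PCRE_NOTEOL for each segment can be kept in one place. The helper below is a
sketch under stated assumptions: <i>segment_options()</i> is a hypothetical
name, the caller is assumed to know whether the segment begins at the start of
a line and ends at the end of one, and the two option bits are given the values
they have in <b>pcre.h</b>, defined locally only so that the fragment compiles
without the PCRE headers.
</P>

```c
#include <assert.h>

/* Option bits as defined in pcre.h; repeated here only so that this
   sketch compiles without the PCRE headers. */
#ifndef PCRE_NOTBOL
#define PCRE_NOTBOL  0x00000080
#endif
#ifndef PCRE_NOTEOL
#define PCRE_NOTEOL  0x00000100
#endif

/* Add PCRE_NOTBOL when the segment does not begin at the start of a
   line, and PCRE_NOTEOL when it does not end at the end of a line.
   base_options carries whatever else the caller wants to pass, such
   as a partial matching option. */
int segment_options(int base_options, int starts_line, int ends_line)
{
    int options = base_options;
    if (!starts_line) options |= PCRE_NOTBOL;
    if (!ends_line)   options |= PCRE_NOTEOL;
    return options;
}
```

<P>
The returned value would be passed as the <i>options</i> argument of
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> for that segment.
</P>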
<br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC11" href="#TOC1">REVISION</a><br>
<P>
Last updated: 19 October 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>