1-----------------------------------------------------------------------------
2This file contains a concatenation of the PCRE man pages, converted to plain
3text format for ease of searching with a text editor, or for use on systems
4that do not have a man page processor. The small individual files that give
5synopses of each function in the library have not been included. Neither has
6the pcredemo program. There are separate text files for the pcregrep and
7pcretest commands.
8-----------------------------------------------------------------------------
9
10
11PCRE(3)                                                                PCRE(3)
12
13
14NAME
15       PCRE - Perl-compatible regular expressions
16
17
18INTRODUCTION
19
20       The  PCRE  library is a set of functions that implement regular expres-
21       sion pattern matching using the same syntax and semantics as Perl, with
22       just  a few differences. Some features that appeared in Python and PCRE
23       before they appeared in Perl are also available using the  Python  syn-
24       tax,  there  is  some  support for one or two .NET and Oniguruma syntax
25       items, and there is an option for requesting some  minor  changes  that
26       give better JavaScript compatibility.
27
28       Starting with release 8.30, it is possible to compile two separate PCRE
29       libraries:  the  original,  which  supports  8-bit  character   strings
30       (including  UTF-8  strings),  and a second library that supports 16-bit
31       character strings (including UTF-16 strings). The build process  allows
32       either  one  or both to be built. The majority of the work to make this
33       possible was done by Zoltan Herczeg.
34
35       The two libraries contain identical sets of functions, except that  the
36       names  in  the  16-bit  library start with pcre16_ instead of pcre_. To
37       avoid over-complication and reduce the documentation maintenance  load,
38       most of the documentation describes the 8-bit library, with the differ-
39       ences for the 16-bit library described separately in the  pcre16  page.
40       References  to  functions or structures of the form pcre[16]_xxx should
41       be  read  as  meaning  "pcre_xxx  when  using  the  8-bit  library  and
42       pcre16_xxx when using the 16-bit library".
43
44       The  current implementation of PCRE corresponds approximately with Perl
45       5.12, including support for UTF-8/16 encoded strings and  Unicode  gen-
46       eral  category properties. However, UTF-8/16 and Unicode support has to
47       be explicitly enabled; it is not the default. The Unicode tables corre-
48       spond to Unicode release 6.0.0.
49
50       In  addition to the Perl-compatible matching function, PCRE contains an
51       alternative function that matches the same compiled patterns in a  dif-
52       ferent way. In certain circumstances, the alternative function has some
53       advantages.  For a discussion of the two matching algorithms,  see  the
54       pcrematching page.
55
56       PCRE  is  written  in C and released as a C library. A number of people
57       have written wrappers and interfaces of various kinds.  In  particular,
58       Google  Inc.   have  provided a comprehensive C++ wrapper for the 8-bit
59       library. This is now included as part of  the  PCRE  distribution.  The
60       pcrecpp  page  has  details of this interface. Other people's contribu-
61       tions can be found in the Contrib directory at the  primary  FTP  site,
62       which is:
63
64       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
65
66       Details  of  exactly which Perl regular expression features are and are
67       not supported by PCRE are given in separate documents. See the pcrepat-
68       tern  and pcrecompat pages. There is a syntax summary in the pcresyntax
69       page.
70
71       Some features of PCRE can be included, excluded, or  changed  when  the
72       library  is  built.  The pcre_config() function makes it possible for a
73       client to discover which features are  available.  The  features  them-
74       selves  are described in the pcrebuild page. Documentation about build-
75       ing PCRE for various operating systems can be found in the  README  and
76       NON-UNIX-USE files in the source distribution.
77
78       The  libraries  contain a number of undocumented internal functions and
79       data tables that are used by more than one  of  the  exported  external
80       functions,  but  which  are  not  intended for use by external callers.
81       Their names all begin with "_pcre_" or "_pcre16_", which hopefully will
82       not  provoke  any name clashes. In some environments, it is possible to
83       control which external symbols are exported when a  shared  library  is
84       built, and in these cases the undocumented symbols are not exported.
85
86
87USER DOCUMENTATION
88
89       The  user  documentation  for PCRE comprises a number of different sec-
90       tions. In the "man" format, each of these is a separate "man page".  In
91       the  HTML  format, each is a separate page, linked from the index page.
92       In the plain text format, all the sections, except  the  pcredemo  sec-
93       tion, are concatenated, for ease of searching. The sections are as fol-
94       lows:
95
96         pcre              this document
97         pcre16            details of the 16-bit library
98         pcre-config       show PCRE installation configuration information
99         pcreapi           details of PCRE's native C API
100         pcrebuild         options for building PCRE
101         pcrecallout       details of the callout feature
102         pcrecompat        discussion of Perl compatibility
103         pcrecpp           details of the C++ wrapper for the 8-bit library
104         pcredemo          a demonstration C program that uses PCRE
105         pcregrep          description of the pcregrep command (8-bit only)
106         pcrejit           discussion of the just-in-time optimization support
107         pcrelimits        details of size and other limits
108         pcrematching      discussion of the two matching algorithms
109         pcrepartial       details of the partial matching facility
110         pcrepattern       syntax and semantics of supported
111                             regular expressions
112         pcreperform       discussion of performance issues
113         pcreposix         the POSIX-compatible C API for the 8-bit library
114         pcreprecompile    details of saving and re-using precompiled patterns
115         pcresample        discussion of the pcredemo program
116         pcrestack         discussion of stack usage
117         pcresyntax        quick syntax reference
118         pcretest          description of the pcretest testing command
119         pcreunicode       discussion of Unicode and UTF-8/16 support
120
121       In addition, in the "man" and HTML formats, there is a short  page  for
122       each 8-bit C library function, listing its arguments and results.
123
124
125AUTHOR
126
127       Philip Hazel
128       University Computing Service
129       Cambridge CB2 3QH, England.
130
131       Putting  an actual email address here seems to have been a spam magnet,
132       so I've taken it away. If you want to email me, use  my  two  initials,
133       followed by the two digits 10, at the domain cam.ac.uk.
134
135
136REVISION
137
138       Last updated: 10 January 2012
139       Copyright (c) 1997-2012 University of Cambridge.
140------------------------------------------------------------------------------
141
142
143PCRE(3)                                                                PCRE(3)
144
145
146NAME
147       PCRE - Perl-compatible regular expressions
148
149       #include <pcre.h>
150
151
152PCRE 16-BIT API BASIC FUNCTIONS
153
154       pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
155            const char **errptr, int *erroffset,
156            const unsigned char *tableptr);
157
158       pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
159            int *errorcodeptr,
160            const char **errptr, int *erroffset,
161            const unsigned char *tableptr);
162
163       pcre16_extra *pcre16_study(const pcre16 *code, int options,
164            const char **errptr);
165
166       void pcre16_free_study(pcre16_extra *extra);
167
168       int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
169            PCRE_SPTR16 subject, int length, int startoffset,
170            int options, int *ovector, int ovecsize);
171
172       int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
173            PCRE_SPTR16 subject, int length, int startoffset,
174            int options, int *ovector, int ovecsize,
175            int *workspace, int wscount);
176
177
178PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
179
180       int pcre16_copy_named_substring(const pcre16 *code,
181            PCRE_SPTR16 subject, int *ovector,
182            int stringcount, PCRE_SPTR16 stringname,
183            PCRE_UCHAR16 *buffer, int buffersize);
184
185       int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
186            int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
187            int buffersize);
188
189       int pcre16_get_named_substring(const pcre16 *code,
190            PCRE_SPTR16 subject, int *ovector,
191            int stringcount, PCRE_SPTR16 stringname,
192            PCRE_SPTR16 *stringptr);
193
194       int pcre16_get_stringnumber(const pcre16 *code,
195            PCRE_SPTR16 name);
196
197       int pcre16_get_stringtable_entries(const pcre16 *code,
198            PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
199
200       int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
201            int stringcount, int stringnumber,
202            PCRE_SPTR16 *stringptr);
203
204       int pcre16_get_substring_list(PCRE_SPTR16 subject,
205            int *ovector, int stringcount, PCRE_SPTR16 **listptr);
206
207       void pcre16_free_substring(PCRE_SPTR16 stringptr);
208
209       void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
210
211
212PCRE 16-BIT API AUXILIARY FUNCTIONS
213
214       pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
215
216       void pcre16_jit_stack_free(pcre16_jit_stack *stack);
217
218       void pcre16_assign_jit_stack(pcre16_extra *extra,
219            pcre16_jit_callback callback, void *data);
220
221       const unsigned char *pcre16_maketables(void);
222
223       int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
224            int what, void *where);
225
226       int pcre16_refcount(pcre16 *code, int adjust);
227
228       int pcre16_config(int what, void *where);
229
230       const char *pcre16_version(void);
231
232       int pcre16_pattern_to_host_byte_order(pcre16 *code,
233            pcre16_extra *extra, const unsigned char *tables);
234
235
236PCRE 16-BIT API INDIRECTED FUNCTIONS
237
238       void *(*pcre16_malloc)(size_t);
239
240       void (*pcre16_free)(void *);
241
242       void *(*pcre16_stack_malloc)(size_t);
243
244       void (*pcre16_stack_free)(void *);
245
246       int (*pcre16_callout)(pcre16_callout_block *);
247
248
249PCRE 16-BIT API 16-BIT-ONLY FUNCTION
250
251       int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
252            PCRE_SPTR16 input, int length, int *byte_order,
253            int keep_boms);
254
255
256THE PCRE 16-BIT LIBRARY
257
258       Starting  with  release  8.30, it is possible to compile a PCRE library
259       that supports 16-bit character strings, including  UTF-16  strings,  as
260       well  as  or instead of the original 8-bit library. The majority of the
261       work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
262       libraries contain identical sets of functions, used in exactly the same
263       way. Only the names of the functions and the data types of their  argu-
264       ments  and results are different. To avoid over-complication and reduce
265       the documentation maintenance load,  most  of  the  PCRE  documentation
266       describes  the  8-bit  library,  with only occasional references to the
267       16-bit library. This page describes what is different when you use  the
268       16-bit library.
269
270       WARNING:  A  single  application can be linked with both libraries, but
271       you must take care when processing any particular pattern to use  func-
272       tions  from  just one library. For example, if you want to study a pat-
273       tern that was compiled with  pcre16_compile(),  you  must  do  so  with
274       pcre16_study(), not pcre_study(), and you must free the study data with
275       pcre16_free_study().
276
277
278THE HEADER FILE
279
280       There is only one header file, pcre.h. It contains prototypes  for  all
281       the  functions  in  both  libraries,  as  well as definitions of flags,
282       structures, error codes, etc.
283
284
285THE LIBRARY NAME
286
287       In Unix-like systems, the 16-bit library is called libpcre16,  and  can
288       normally be accessed by adding -lpcre16 to the command for  linking  an
289       application that uses PCRE.
290
291
292STRING TYPES
293
294       In the 8-bit library, strings are passed to PCRE library  functions  as
295       vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
296       strings are passed as vectors of unsigned 16-bit quantities. The  macro
297       PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
298       defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
299       int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
300       as "unsigned short int", but checks that it really is  a  16-bit  data
301       type.  If it is not, the build fails with an error message telling the
302       maintainer to modify the definition appropriately.
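
       The width check and the widening of an ordinary C string into a  vector
       of  16-bit  units  might  be sketched as follows. This is an illustration
       only: unit16 is a stand-in for PCRE_UCHAR16, and  ascii_to_units16()  is
       a hypothetical helper that is not part of PCRE.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Stand-in for PCRE_UCHAR16 (an assumption made for this sketch). */
typedef unsigned short unit16;

/* Compile-time check: the array size becomes negative, and compilation
   fails, if unit16 is not exactly 16 bits wide. */
typedef char unit16_must_be_16_bits[(sizeof(unit16) * CHAR_BIT == 16) ? 1 : -1];

/* Hypothetical helper: widen a zero-terminated byte string into 16-bit
   units. Returns the number of units written, not counting the terminating
   zero, or -1 if the output vector (outsize units) is too small. */
int ascii_to_units16(const char *in, unit16 *out, size_t outsize)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; in[i] != '\0'; i++) {
        if (i + 1 >= outsize) return -1;   /* keep room for the terminator */
        out[i] = (unit16)(unsigned char)in[i];
    }
    if (i >= outsize) return -1;
    out[i] = 0;
    return (int)i;
}
```

       A vector filled in this way can be passed wherever a PCRE_SPTR16  argu-
       ment is expected, provided that PCRE_UCHAR16 has the same width.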
303
304
305STRUCTURE TYPES
306
307       The types of the opaque structures that are used  for  compiled  16-bit
308       patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
309       The  type  of  the  user-accessible  structure  that  is  returned   by
310       pcre16_study()  is  pcre16_extra, and the type of the structure that is
311       used for passing data to a callout  function  is  pcre16_callout_block.
312       These structures contain the same fields, with the same names, as their
313       8-bit counterparts. The only difference is that pointers  to  character
314       strings are 16-bit instead of 8-bit types.
315
316
31716-BIT FUNCTIONS
318
319       For  every function in the 8-bit library there is a corresponding func-
320       tion in the 16-bit library with a name that starts with pcre16_ instead
321       of  pcre_.  The  prototypes are listed above. In addition, there is one
322       extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
323       function  that converts a UTF-16 character string to host byte order if
324       necessary. The other 16-bit  functions  expect  the  strings  they  are
325       passed to be in host byte order.
326
327       The input and output arguments of pcre16_utf16_to_host_byte_order() may
328       point to the same address, that is, conversion in place  is  supported.
329       The output buffer must be at least as long as the input.
330
331       The  length  argument  specifies the number of 16-bit data units in the
332       input string; a negative value specifies a zero-terminated string.
333
334       If byte_order is NULL, it is assumed that the string starts off in host
335       byte  order. This may be changed by byte-order marks (BOMs) anywhere in
336       the string (commonly as the first character).
337
338       If byte_order is not NULL, a non-zero value of the integer to which  it
339       points  means  that  the input starts off in host byte order, otherwise
340       the opposite order is assumed. Again, BOMs in  the  string  can  change
341       this. The final byte order is passed back at the end of processing.
342
343       If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
344       copied into the output string. Otherwise they are discarded.
345
346       The result of the function is the number of 16-bit  units  placed  into
347       the  output  buffer,  including  the  zero terminator if the string was
348       zero-terminated.
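
       The behaviour described above can be modelled by the following  stand-
       alone  sketch.  It  is not the library's source code; the function name
       utf16_to_host_sketch() and the unit16 type are assumptions made for  il-
       lustration, and only the documented semantics are reproduced.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short unit16;   /* stand-in for PCRE_UCHAR16 */

/* Sketch of the documented semantics: convert to host byte order, track
   byte-order changes signalled by BOMs, and optionally drop the BOMs.
   Returns the number of 16-bit units written to output. */
int utf16_to_host_sketch(unit16 *output, const unit16 *input, int length,
                         int *byte_order, int keep_boms)
{
    /* NULL byte_order means "assume host order at the start". */
    int host_order = (byte_order != NULL) ? *byte_order : 1;
    int written = 0;
    int i;

    assert(output != NULL && input != NULL);

    if (length < 0) {                 /* zero-terminated input */
        length = 0;
        while (input[length] != 0) length++;
        length++;                     /* include the terminator */
    }

    for (i = 0; i < length; i++) {
        unit16 c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            /* 0xfeff: data is in host order; 0xfffe: it is byte-swapped. */
            host_order = (c == 0xfeff);
            if (keep_boms) output[written++] = 0xfeff;
        } else {
            output[written++] =
                host_order ? c : (unit16)((c >> 8) | (c << 8));
        }
    }

    if (byte_order != NULL) *byte_order = host_order;
    return written;
}
```

       Because each unit is read before anything is written at or  beyond  its
       position,  the  sketch also works when output and input point to the same
       vector.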
349
350
351SUBJECT STRING OFFSETS
352
353       The offsets within subject strings that are returned  by  the  matching
354       functions are in 16-bit units rather than bytes.
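
       As a small illustration, the helpers below turn one pair of offsets into
       a  length  in 16-bit units and in bytes. They assume the usual layout of
       the output vector (a start offset followed by an  end  offset  for  each
       substring, as described in the pcreapi page); the helper names and the
       unit16 type are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short unit16;   /* stand-in for PCRE_UCHAR16 */

/* Length, in 16-bit units, of the substring described by one offset pair. */
int match_length_units(const int *ovector, int pair)
{
    assert(ovector != NULL && pair >= 0);
    return ovector[2 * pair + 1] - ovector[2 * pair];
}

/* The same length expressed in bytes, for example for use with memcpy(). */
size_t match_length_bytes(const int *ovector, int pair)
{
    return (size_t)match_length_units(ovector, pair) * sizeof(unit16);
}
```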
355
356
357NAMED SUBPATTERNS
358
359       The  name-to-number translation table that is maintained for named sub-
360       patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
361       function returns the length of each entry in the table as the number of
362       16-bit data units.
363
364
365OPTION NAMES
366
367       There   are   two   new   general   option   names,   PCRE_UTF16    and
368       PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
369       PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
370       define  the  same bits in the options word. There is a discussion about
371       the validity of UTF-16 strings in the pcreunicode page.
372
373       For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
374       that  returns  1  if UTF-16 support is configured, otherwise 0. If this
375       option is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option  is
376       given to pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.
377
378
379CHARACTER CODES
380
381       In  16-bit  mode,  when  PCRE_UTF16  is  not  set, character values are
382       treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
383       that  they  can  range from 0 to 0xffff instead of 0 to 0xff. Character
384       types for characters less than 0xff can therefore be influenced by  the
385       locale  in  the  same way as before.  Characters greater than 0xff have
386       only one case, and no "type" (such as letter or digit).
387
388       In UTF-16 mode, the character code  is  Unicode,  in  the  range  0  to
389       0x10ffff,  with  the  exception of values in the range 0xd800 to 0xdfff
390       because those are "surrogate" values that are used in pairs  to  encode
391       values greater than 0xffff.
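
       The surrogate arithmetic is defined by the UTF-16 encoding  itself  and
       can  be  sketched  as  below.  The function names are hypothetical; they
       are not PCRE functions.

```c
#include <assert.h>

/* Combine a high surrogate (0xd800-0xdbff) and a low surrogate
   (0xdc00-0xdfff) into the code point that the pair encodes. */
unsigned long utf16_combine(unsigned int high, unsigned int low)
{
    assert(high >= 0xd800 && high <= 0xdbff);
    assert(low >= 0xdc00 && low <= 0xdfff);
    return 0x10000UL +
           (((unsigned long)(high - 0xd800) << 10) | (low - 0xdc00));
}

/* Split a code point greater than 0xffff into a surrogate pair. */
void utf16_split(unsigned long cp, unsigned int *high, unsigned int *low)
{
    assert(cp > 0xffff && cp <= 0x10ffff);
    assert(high != 0 && low != 0);
    cp -= 0x10000UL;
    *high = 0xd800 + (unsigned int)(cp >> 10);     /* top 10 bits */
    *low  = 0xdc00 + (unsigned int)(cp & 0x3ff);   /* bottom 10 bits */
}
```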
392
393       A UTF-16 string can indicate its endianness by a special code known as a
394       byte-order mark (BOM). The PCRE functions do not handle this, expecting
395       strings   to   be  in  host  byte  order.  A  utility  function  called
396       pcre16_utf16_to_host_byte_order() is provided to help  with  this  (see
397       above).
398
399
400ERROR NAMES
401
402       The  errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
403       spond to their 8-bit  counterparts.  The  error  PCRE_ERROR_BADMODE  is
404       given  when  a  compiled pattern is passed to a function that processes
405       patterns in the other mode, for example, if  a  pattern  compiled  with
406       pcre_compile() is passed to pcre16_exec().
407
408       There  are  new  error  codes whose names begin with PCRE_UTF16_ERR for
409       invalid UTF-16 strings, corresponding to the  PCRE_UTF8_ERR  codes  for
410       UTF-8  strings that are described in the section entitled "Reason codes
411       for invalid UTF-8 strings" in the main pcreapi page. The UTF-16  errors
412       are:
413
414         PCRE_UTF16_ERR1  Missing low surrogate at end of string
415         PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
416         PCRE_UTF16_ERR3  Isolated low surrogate
417         PCRE_UTF16_ERR4  Invalid character 0xfffe
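
       The four conditions listed above could be detected by a check  such  as
       the  following.  This is a sketch, not PCRE's own validity function; the
       U16_* names are local stand-ins numbered to follow  the  list,  and  the
       real PCRE_UTF16_ERRn values are those defined in pcre.h.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins; the numbering follows the list of reasons above. */
enum { U16_OK = 0, U16_ERR1 = 1, U16_ERR2 = 2, U16_ERR3 = 3, U16_ERR4 = 4 };

/* Scan length 16-bit units and report the first problem found, if any. */
int utf16_check_sketch(const unsigned short *s, size_t length)
{
    size_t i;

    assert(s != NULL || length == 0);
    for (i = 0; i < length; i++) {
        unsigned int c = s[i];
        if ((c & 0xf800) != 0xd800) {           /* not a surrogate */
            if (c == 0xfffe) return U16_ERR4;   /* invalid character */
        } else if ((c & 0x0400) == 0) {         /* high surrogate */
            if (i + 1 == length) return U16_ERR1;
            i++;
            if ((s[i] & 0xfc00) != 0xdc00) return U16_ERR2;
        } else {
            return U16_ERR3;                    /* isolated low surrogate */
        }
    }
    return U16_OK;
}
```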
418
419
420ERROR TEXTS
421
422       If  there is an error while compiling a pattern, the error text that is
423       passed back by pcre16_compile() or pcre16_compile2() is still an  8-bit
424       character string, zero-terminated.
425
426
427CALLOUTS
428
429       The  subject  and  mark fields in the callout block that is passed to a
430       callout function point to 16-bit vectors.
431
432
433TESTING
434
435       The pcretest program continues to operate with 8-bit input  and  output
436       files,  but it can be used for testing the 16-bit library. If it is run
437       with the command line option -16, patterns and subject strings are con-
438       verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
439       library functions are used instead of the 8-bit ones.  Returned  16-bit
440       strings are converted to 8-bit for output. If the 8-bit library was not
441       compiled, pcretest defaults to 16-bit and the -16 option is ignored.
442
443       When PCRE is being built, the RunTest script that is  called  by  "make
444       check"  uses  the pcretest -C option to discover which of the 8-bit and
445       16-bit libraries has been built, and runs the tests appropriately.
446
447
448NOT SUPPORTED IN 16-BIT MODE
449
450       Not all the features of the 8-bit library are available with the 16-bit
451       library.  The  C++  and  POSIX wrapper functions support only the 8-bit
452       library, and the pcregrep program is at present 8-bit only.
453
454
455AUTHOR
456
457       Philip Hazel
458       University Computing Service
459       Cambridge CB2 3QH, England.
460
461
462REVISION
463
464       Last updated: 14 April 2012
465       Copyright (c) 1997-2012 University of Cambridge.
466------------------------------------------------------------------------------
467
468
469PCREBUILD(3)                                                      PCREBUILD(3)
470
471
472NAME
473       PCRE - Perl-compatible regular expressions
474
475
476PCRE BUILD-TIME OPTIONS
477
478       This  document  describes  the  optional  features  of PCRE that can be
479       selected when the library is compiled. It assumes use of the  configure
480       script,  where the optional features are selected or deselected by pro-
481       viding options to configure before running the make  command.  However,
482       the  same  options  can be selected in both Unix-like and non-Unix-like
483       environments using the GUI facility of cmake-gui if you are using CMake
484       instead of configure to build PCRE.
485
486       There  is  a  lot more information about building PCRE in non-Unix-like
487       environments in the file called NON-UNIX-USE, which is part of the PCRE
488       distribution.  You  should consult this file as well as the README file
489       if you are building in a non-Unix-like environment.
490
491       The complete list of options for configure (which includes the standard
492       ones  such  as  the  selection  of  the  installation directory) can be
493       obtained by running
494
495         ./configure --help
496
497       The following sections include  descriptions  of  options  whose  names
498       begin with --enable or --disable. These settings specify changes to the
499       defaults for the configure command. Because of the way  that  configure
500       works,  --enable  and --disable always come in pairs, so the complemen-
501       tary option always exists as well, but as it specifies the default,  it
502       is not described.
503
504
505BUILDING 8-BIT and 16-BIT LIBRARIES
506
507       By  default,  a  library  called libpcre is built, containing functions
508       that take string arguments contained in vectors  of  bytes,  either  as
509       single-byte  characters,  or interpreted as UTF-8 strings. You can also
510       build a separate library, called libpcre16, in which strings  are  con-
511       tained  in  vectors of 16-bit data units and interpreted either as sin-
512       gle-unit characters or UTF-16 strings, by adding
513
514         --enable-pcre16
515
516       to the configure command. If you do not want the 8-bit library, add
517
518         --disable-pcre8
519
520       as well. At least one of the two libraries must be built. Note that the
521       C++  and  POSIX wrappers are for the 8-bit library only, and that pcre-
522       grep is an 8-bit program. None of these are built if  you  select  only
523       the 16-bit library.
524
525
526BUILDING SHARED AND STATIC LIBRARIES
527
528       The  PCRE building process uses libtool to build both shared and static
529       Unix libraries by default. You can suppress one of these by adding  one
530       of
531
532         --disable-shared
533         --disable-static
534
535       to the configure command, as required.
536
537
538C++ SUPPORT
539
540       By  default,  if the 8-bit library is being built, the configure script
541       will search for a C++ compiler and C++ header files. If it finds  them,
542       it  automatically  builds  the C++ wrapper library (which supports only
543       8-bit strings). You can disable this by adding
544
545         --disable-cpp
546
547       to the configure command.
548
549
550UTF-8 and UTF-16 SUPPORT
551
552       To build PCRE with support for UTF Unicode character strings, add
553
554         --enable-utf
555
556       to the configure command.  This  setting  applies  to  both  libraries,
557       adding support for UTF-8 to the 8-bit library and support for UTF-16 to
558       the 16-bit library. There are no separate options  for  enabling  UTF-8
559       and  UTF-16  independently because that would allow ridiculous settings
560       such as  requesting  UTF-16  support  while  building  only  the  8-bit
561       library.  It  is not possible to build one library with UTF support and
562       the other without in the same configuration. (For backwards compatibil-
563       ity, --enable-utf8 is a synonym of --enable-utf.)
564
565       Of  itself,  this  setting does not make PCRE treat strings as UTF-8 or
566       UTF-16. As well as compiling PCRE with this option, you  also  have
567       to set the PCRE_UTF8 or PCRE_UTF16 option when you call one of the pat-
568       tern compiling functions.
569
570       If you set --enable-utf when compiling in an EBCDIC  environment,  PCRE
571       expects  its  input  to be either ASCII or UTF-8 (depending on the run-
572       time option). It is not possible to support both EBCDIC and UTF-8 codes
573       in  the  same  version  of  the library. Consequently, --enable-utf and
574       --enable-ebcdic are mutually exclusive.
575
576
577UNICODE CHARACTER PROPERTY SUPPORT
578
579       UTF support allows the libraries to process character codepoints up  to
580       0x10ffff  in the strings that they handle. On its own, however, it does
581       not provide any facilities for accessing the properties of such charac-
582       ters. If you want to be able to use the pattern escapes \P, \p, and \X,
583       which refer to Unicode character properties, you must add
584
585         --enable-unicode-properties
586
587       to the configure command. This implies UTF support, even  if  you  have
588       not explicitly requested it.
589
590       Including  Unicode  property  support  adds around 30K of tables to the
591       PCRE library. Only the general category properties such as  Lu  and  Nd
592       are supported. Details are given in the pcrepattern documentation.
593
594
595JUST-IN-TIME COMPILER SUPPORT
596
597       Just-in-time compiler support is included in the build by specifying
598
599         --enable-jit
600
601       This  support  is available only for certain hardware architectures. If
602       this option is set for an  unsupported  architecture,  a  compile-time
603       error  occurs.   See  the pcrejit documentation for a discussion of JIT
604       usage. When JIT support is enabled, pcregrep automatically makes use of
605       it, unless you add
606
607         --disable-pcregrep-jit
608
609       to the configure command.
610
611
612CODE VALUE OF NEWLINE
613
614       By  default,  PCRE interprets the linefeed (LF) character as indicating
615       the end of a line. This is the normal newline  character  on  Unix-like
616       systems.  You  can compile PCRE to use carriage return (CR) instead, by
617       adding
618
619         --enable-newline-is-cr
620
621       to the  configure  command.  There  is  also  a  --enable-newline-is-lf
622       option, which explicitly specifies linefeed as the newline character.
623
624       Alternatively, you can specify that line endings are to be indicated by
625       the two character sequence CRLF. If you want this, add
626
627         --enable-newline-is-crlf
628
629       to the configure command. There is a fourth option, specified by
630
631         --enable-newline-is-anycrlf
632
633       which causes PCRE to recognize any of the three sequences  CR,  LF,  or
634       CRLF as indicating a line ending. Finally, a fifth option, specified by
635
636         --enable-newline-is-any
637
638       causes PCRE to recognize any Unicode newline sequence.
639
640       Whatever  line  ending convention is selected when PCRE is built can be
641       overridden when the library functions are called. At build time  it  is
642       conventional to use the standard for your operating system.
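
       To make the difference between the conventions concrete,  the  following
       sketch  reports  how many characters form a line ending at a given posi-
       tion in an 8-bit string. It covers the CR, LF, CRLF, and  ANYCRLF  cases
       only;  the  "any  Unicode newline" case is omitted. The enum and function
       names are hypothetical and do not correspond to PCRE identifiers.

```c
#include <assert.h>
#include <stddef.h>

enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Returns the length (0, 1, or 2) of the line ending that starts at
   s[pos] under the given convention; 0 means "no line ending here". */
int newline_length(const char *s, size_t len, size_t pos,
                   enum nl_convention nl)
{
    int cr, lf, crlf;

    assert(s != NULL && pos <= len);
    if (pos == len) return 0;

    cr = (s[pos] == '\r');
    lf = (s[pos] == '\n');
    crlf = cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (nl) {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : ((cr || lf) ? 1 : 0);
    }
    return 0;
}
```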
643
644
645WHAT \R MATCHES
646
647       By  default,  the  sequence \R in a pattern matches any Unicode newline
648       sequence, whatever has been selected as the line  ending  sequence.  If
649       you specify
650
651         --enable-bsr-anycrlf
652
653       the  default  is changed so that \R matches only CR, LF, or CRLF. What-
654       ever is selected when PCRE is built can be overridden when the  library
655       functions are called.
656
657
658POSIX MALLOC USAGE
659
660       When  the  8-bit library is called through the POSIX interface (see the
661       pcreposix documentation), additional working storage  is  required  for
662       holding  the  pointers  to  capturing substrings, because PCRE requires
663       three integers per substring, whereas the POSIX interface provides only
664       two.  If  the number of expected substrings is small, the wrapper func-
665       tion uses space on the stack, because this is faster  than  using  mal-
666       loc()  for each call. The default threshold above which the stack is no
667       longer used is 10; it can be changed by adding a setting such as
668
669         --with-posix-malloc-threshold=20
670
671       to the configure command.
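
       The stack-or-heap decision can be pictured with  the  following  sketch.
       It  is  not  the wrapper's actual code: MALLOC_THRESHOLD mirrors the doc-
       umented default of 10, and with_offset_vector() and its fill callback are
       hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MALLOC_THRESHOLD 10   /* mirrors the documented default */

/* Provide a vector of three ints per expected substring, on the stack
   when nmatch is small and from malloc() otherwise, and hand it to fill().
   *used_heap reports which was chosen. Returns 0, or -1 if malloc() fails. */
int with_offset_vector(size_t nmatch, void (*fill)(int *, size_t),
                       int *used_heap)
{
    int small_vector[MALLOC_THRESHOLD * 3];
    int *vector = small_vector;

    assert(used_heap != NULL);
    *used_heap = 0;

    if (nmatch > MALLOC_THRESHOLD) {
        vector = malloc(sizeof(int) * nmatch * 3);
        if (vector == NULL) return -1;
        *used_heap = 1;
    }

    if (fill != NULL) fill(vector, nmatch * 3);

    if (*used_heap) free(vector);
    return 0;
}
```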
672
673
674HANDLING VERY LARGE PATTERNS
675
676       Within a compiled pattern, offset values are used  to  point  from  one
677       part  to another (for example, from an opening parenthesis to an alter-
678       nation metacharacter). By default, two-byte values are used  for  these
679       offsets,  leading  to  a  maximum size for a compiled pattern of around
680       64K. This is sufficient to handle all but the most  gigantic  patterns.
681       Nevertheless,  some  people do want to process truly enormous patterns,
682       so it is possible to compile PCRE to use three-byte or  four-byte  off-
683       sets by adding a setting such as
684
685         --with-link-size=3
686
687       to  the  configure command. The value given must be 2, 3, or 4. For the
688       16-bit library, a value of 3 is rounded up to 4. Using  longer  offsets
689       slows down the operation of PCRE because it has to load additional data
690       when handling them.
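
       The arithmetic behind these figures is sketched below. The first  func-
       tion  gives  only  the number of distinct values that an offset field of
       the given width can hold, which is an upper bound; the  limits  actually
       enforced by PCRE may be lower. Both function names are hypothetical.

```c
#include <assert.h>

/* Number of distinct values an offset field of link_size bytes can hold. */
unsigned long long offset_field_range(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return 1ULL << (8 * link_size);
}

/* Link size used by the 16-bit library for a requested value: offsets
   occupy whole 16-bit units, so a request for 3 is rounded up to 4. */
int link_size_16bit(int requested)
{
    assert(requested >= 2 && requested <= 4);
    return (requested == 3) ? 4 : requested;
}
```

       With the default link size of 2, offset_field_range() yields 65536, which
       is the "around 64K" figure mentioned above.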
691
692
693AVOIDING EXCESSIVE STACK USAGE
694
695       When matching with the pcre_exec() function, PCRE implements backtrack-
696       ing  by  making recursive calls to an internal function called match().
697       In environments where the size of the stack is limited,  this  can  se-
698       verely  limit  PCRE's operation. (The Unix environment does not usually
699       suffer from this problem, but it may sometimes be necessary to increase
700       the  maximum  stack size.  There is a discussion in the pcrestack docu-
701       mentation.) An alternative approach to recursion that uses memory  from
702       the  heap  to remember data, instead of using recursive function calls,
703       has been implemented to work round the problem of limited  stack  size.
704       If you want to build a version of PCRE that works this way, add
705
706         --disable-stack-for-recursion
707
708       to  the  configure  command. With this configuration, PCRE will use the
709       pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
710       ment  functions. By default these point to malloc() and free(), but you
711       can replace the pointers so that your own functions are used instead.
712
713       Separate functions are  provided  rather  than  using  pcre_malloc  and
714       pcre_free  because  the  usage  is  very  predictable:  the block sizes
715       requested are always the same, and  the  blocks  are  always  freed  in
716       reverse  order.  A calling program might be able to implement optimized
717       functions that perform better  than  malloc()  and  free().  PCRE  runs
718       noticeably more slowly when built in this way. This option affects only
719       the pcre_exec() function; it is not relevant for pcre_dfa_exec().
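
       Because the blocks are all the same size and are freed in reverse
       order, a replacement pair of functions can keep freed blocks on a
       small stack of its own and hand them straight back. The following is
       a minimal sketch of that idea, not code from PCRE: the names
       cached_stack_malloc(), cached_stack_free(), and
       cached_stack_reuses() are hypothetical, the cache size of 32 is
       arbitrary, and the code is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_MAX 32          /* arbitrary number of blocks to keep */

/* Header stored in front of each block. Only size is used; the other
   members force suitable alignment for the memory that follows. */
typedef union block_header {
    size_t size;
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} block_header;

static block_header *cache[CACHE_MAX];   /* freed blocks, last in on top */
static int cache_count = 0;
static long reuse_count = 0;

void *cached_stack_malloc(size_t size)
{
    block_header *h;
    if (cache_count > 0 && cache[cache_count - 1]->size == size) {
        h = cache[--cache_count];         /* reuse the most recent block */
        reuse_count++;
    } else {
        h = malloc(sizeof(block_header) + size);
        if (h == NULL) return NULL;
        h->size = size;
    }
    return h + 1;                         /* memory after the header */
}

void cached_stack_free(void *block)
{
    block_header *h;
    if (block == NULL) return;
    h = (block_header *)block - 1;
    if (cache_count < CACHE_MAX) cache[cache_count++] = h;
    else free(h);
}

long cached_stack_reuses(void) { return reuse_count; }

/* In a program linked with a PCRE library built with
   --disable-stack-for-recursion, the functions would be installed by
   assigning them to pcre_stack_malloc and pcre_stack_free. */
```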
720
721
722LIMITING PCRE RESOURCE USAGE
723
724       Internally, PCRE has a function called match(), which it calls  repeat-
725       edly   (sometimes   recursively)  when  matching  a  pattern  with  the
726       pcre_exec() function. By controlling the maximum number of  times  this
727       function  may be called during a single matching operation, a limit can
728       be placed on the resources used by a single call  to  pcre_exec().  The
729       limit  can be changed at run time, as described in the pcreapi documen-
730       tation. The default is 10 million, but this can be changed by adding  a
731       setting such as
732
733         --with-match-limit=500000
734
735       to   the   configure  command.  This  setting  has  no  effect  on  the
736       pcre_dfa_exec() matching function.
737
738       In some environments it is desirable to limit the  depth  of  recursive
739       calls of match() more strictly than the total number of calls, in order
740       to restrict the maximum amount of stack (or heap,  if  --disable-stack-
741       for-recursion is specified) that is used. A second limit controls this;
742       it defaults to the value that  is  set  for  --with-match-limit,  which
743       imposes  no  additional constraints. However, you can set a lower limit
744       by adding, for example,
745
746         --with-match-limit-recursion=10000
747
748       to the configure command. This value can  also  be  overridden  at  run
749       time.
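
       The effect of a call limit can be seen in a toy backtracking matcher.
       The sketch below is not PCRE's match() function: it is a hypothetical
       matcher (toy_match(), toy_exec(), and the TOY_ constants are invented
       for this illustration) that supports only literal characters, "."
       and a postfix "*", and that gives up once the number of calls exceeds
       a limit supplied by the caller.

```c
#define TOY_NOMATCH     0
#define TOY_MATCH       1
#define TOY_LIMIT_HIT (-1)

typedef struct toy_limits {
    unsigned long match_limit;   /* maximum number of toy_match() calls */
    unsigned long calls;         /* calls made so far */
} toy_limits;

/* Match pattern p at the start of text s, depth first with backtracking. */
static int toy_match(const char *p, const char *s, toy_limits *lim)
{
    if (++lim->calls > lim->match_limit) return TOY_LIMIT_HIT;
    if (*p == '\0') return TOY_MATCH;

    if (p[1] == '*') {
        /* Greedy repeat: take as many characters as possible, then give
           them back one at a time until the rest of the pattern matches. */
        const char *t = s;
        while (*t != '\0' && (*p == '.' || *p == *t)) t++;
        for (;;) {
            int rc = toy_match(p + 2, t, lim);
            if (rc != TOY_NOMATCH) return rc;   /* match or limit reached */
            if (t == s) return TOY_NOMATCH;
            t--;
        }
    }

    if (*s != '\0' && (*p == '.' || *p == *s))
        return toy_match(p + 1, s + 1, lim);
    return TOY_NOMATCH;
}

int toy_exec(const char *pattern, const char *subject,
             unsigned long match_limit)
{
    toy_limits lim;
    lim.match_limit = match_limit;
    lim.calls = 0;
    return toy_match(pattern, subject, &lim);
}
```

       In this sketch, matching "a*a*a*b" against twenty "a" characters
       needs a couple of thousand calls before it can report failure, so a
       limit of 50 stops it early, whereas "a*b" against "aaab" succeeds
       after three calls.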
750
751
752CREATING CHARACTER TABLES AT BUILD TIME
753
754       PCRE  uses fixed tables for processing characters whose code values are
755       less than 256. By default, PCRE is built with a set of tables that  are
756       distributed  in  the  file pcre_chartables.c.dist. These tables are for
757       ASCII codes only. If you add
758
759         --enable-rebuild-chartables
760
761       to the configure command, the distributed tables are  no  longer  used.
762       Instead,  a  program  called dftables is compiled and run. This outputs
       the source for a new set of tables, created in the default locale of your
764       C  run-time  system. (This method of replacing the tables does not work
765       if you are cross compiling, because dftables is run on the local  host.
766       If you need to create alternative tables when cross compiling, you will
767       have to do so "by hand".)
768
769
770USING EBCDIC CODE
771
772       PCRE assumes by default that it will run in an  environment  where  the
773       character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
774       This is the case for most computer operating systems.  PCRE  can,  how-
775       ever, be compiled to run in an EBCDIC environment by adding
776
777         --enable-ebcdic
778
779       to the configure command. This setting implies --enable-rebuild-charta-
780       bles. You should only use it if you know that  you  are  in  an  EBCDIC
781       environment  (for  example,  an  IBM  mainframe  operating system). The
782       --enable-ebcdic option is incompatible with --enable-utf.
783
784
785PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
786
787       By default, pcregrep reads all files as plain text. You can build it so
788       that it recognizes files whose names end in .gz or .bz2, and reads them
789       with libz or libbz2, respectively, by adding one or both of
790
791         --enable-pcregrep-libz
792         --enable-pcregrep-libbz2
793
794       to the configure command. These options naturally require that the rel-
795       evant  libraries  are installed on your system. Configuration will fail
796       if they are not.
797
798
799PCREGREP BUFFER SIZE
800
801       pcregrep uses an internal buffer to hold a "window" on the file  it  is
802       scanning, in order to be able to output "before" and "after" lines when
803       it finds a match. The size of the buffer is controlled by  a  parameter
804       whose default value is 20K. The buffer itself is three times this size,
805       but because of the way it is used for holding "before" lines, the long-
806       est  line  that  is guaranteed to be processable is the parameter size.
807       You can change the default parameter value by adding, for example,
808
809         --with-pcregrep-bufsize=50K
810
811       to the configure command. The caller of pcregrep can, however, override
812       this value by specifying a run-time option.
813
814
815PCRETEST OPTION FOR LIBREADLINE SUPPORT
816
817       If you add
818
819         --enable-pcretest-libreadline
820
821       to  the  configure  command,  pcretest  is  linked with the libreadline
822       library, and when its input is from a terminal, it reads it  using  the
823       readline() function. This provides line-editing and history facilities.
824       Note that libreadline is GPL-licensed, so if you distribute a binary of
825       pcretest linked in this way, there may be licensing issues.
826
827       Setting  this  option  causes  the -lreadline option to be added to the
       pcretest build. In many operating environments with a  system-installed
829       libreadline this is sufficient. However, in some environments (e.g.  if
830       an unmodified distribution version of readline is in use),  some  extra
831       configuration  may  be necessary. The INSTALL file for libreadline says
832       this:
833
834         "Readline uses the termcap functions, but does not link with the
835         termcap or curses library itself, allowing applications which link
836         with readline the to choose an appropriate library."
837
838       If your environment has not been set up so that an appropriate  library
839       is automatically included, you may need to add something like
840
         LIBS="-lncurses"
842
843       immediately before the configure command.
844
845
846SEE ALSO
847
       pcreapi(3), pcre16(3), pcre_config(3).
849
850
851AUTHOR
852
853       Philip Hazel
854       University Computing Service
855       Cambridge CB2 3QH, England.
856
857
858REVISION
859
860       Last updated: 07 January 2012
861       Copyright (c) 1997-2012 University of Cambridge.
862------------------------------------------------------------------------------
863
864
865PCREMATCHING(3)                                                PCREMATCHING(3)
866
867
868NAME
869       PCRE - Perl-compatible regular expressions
870
871
872PCRE MATCHING ALGORITHMS
873
874       This document describes the two different algorithms that are available
875       in PCRE for matching a compiled regular expression against a given sub-
876       ject  string.  The  "standard"  algorithm  is  the  one provided by the
       pcre_exec() and pcre16_exec() functions. These work in the same way  as
878       Perl's matching function, and provide a Perl-compatible matching opera-
879       tion. The just-in-time (JIT) optimization  that  is  described  in  the
880       pcrejit documentation is compatible with these functions.
881
882       An  alternative  algorithm  is  provided  by  the  pcre_dfa_exec()  and
883       pcre16_dfa_exec() functions; they operate in a different way,  and  are
884       not  Perl-compatible. This alternative has advantages and disadvantages
885       compared with the standard algorithm, and these are described below.
886
887       When there is only one possible way in which a given subject string can
888       match  a pattern, the two algorithms give the same answer. A difference
889       arises, however, when there are multiple possibilities. For example, if
890       the pattern
891
892         ^<.*>
893
894       is matched against the string
895
896         <something> <something else> <something further>
897
898       there are three possible answers. The standard algorithm finds only one
899       of them, whereas the alternative algorithm finds all three.
900
901
902REGULAR EXPRESSIONS AS TREES
903
904       The set of strings that are matched by a regular expression can be rep-
905       resented  as  a  tree structure. An unlimited repetition in the pattern
906       makes the tree of infinite size, but it is still a tree.  Matching  the
907       pattern  to a given subject string (from a given starting point) can be
908       thought of as a search of the tree.  There are two  ways  to  search  a
909       tree:  depth-first  and  breadth-first, and these correspond to the two
910       matching algorithms provided by PCRE.
911
912
913THE STANDARD MATCHING ALGORITHM
914
915       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
916       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
917       depth-first search of the pattern tree. That is, it  proceeds  along  a
918       single path through the tree, checking that the subject matches what is
919       required. When there is a mismatch, the algorithm  tries  any  alterna-
920       tives  at  the  current point, and if they all fail, it backs up to the
921       previous branch point in the  tree,  and  tries  the  next  alternative
922       branch  at  that  level.  This often involves backing up (moving to the
923       left) in the subject string as well.  The  order  in  which  repetition
924       branches  are  tried  is controlled by the greedy or ungreedy nature of
925       the quantifier.
926
927       If a leaf node is reached, a matching string has  been  found,  and  at
928       that  point the algorithm stops. Thus, if there is more than one possi-
929       ble match, this algorithm returns the first one that it finds.  Whether
930       this  is the shortest, the longest, or some intermediate length depends
931       on the way the greedy and ungreedy repetition quantifiers are specified
932       in the pattern.
933
934       Because  it  ends  up  with a single path through the tree, it is rela-
935       tively straightforward for this algorithm to keep  track  of  the  sub-
936       strings  that  are  matched  by portions of the pattern in parentheses.
937       This provides support for capturing parentheses and back references.
938
939
940THE ALTERNATIVE MATCHING ALGORITHM
941
942       This algorithm conducts a breadth-first search of  the  tree.  Starting
943       from  the  first  matching  point  in the subject, it scans the subject
944       string from left to right, once, character by character, and as it does
945       this,  it remembers all the paths through the tree that represent valid
946       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
947       though  it is not implemented as a traditional finite state machine (it
948       keeps multiple states active simultaneously).
949
950       Although the general principle of this matching algorithm  is  that  it
951       scans  the subject string only once, without backtracking, there is one
952       exception: when a lookaround assertion is encountered,  the  characters
953       following  or  preceding  the  current  point  have to be independently
954       inspected.
955
956       The scan continues until either the end of the subject is  reached,  or
957       there  are  no more unterminated paths. At this point, terminated paths
958       represent the different matching possibilities (if there are none,  the
959       match  has  failed).   Thus,  if there is more than one possible match,
960       this algorithm finds all of them, and in particular, it finds the long-
961       est.  The  matches are returned in decreasing order of length. There is
962       an option to stop the algorithm after the first match (which is  neces-
963       sarily the shortest) is found.
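
       The single left-to-right scan with several active paths can be
       illustrated with a toy matcher. The sketch below is not how
       pcre_dfa_exec() is implemented: toy_all_matches() and its helper are
       hypothetical, the pattern language is limited to literal characters,
       "." and a postfix "*", and the pattern is assumed to be well formed
       and anchored at the start of the subject.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PATTERN 64

/* Mark pattern position i as active, together with every position that
   can be reached from it without consuming input (skipping "x*" items). */
static void add_state(const char *p, size_t plen, size_t i,
                      unsigned char *set)
{
    while (!set[i]) {
        set[i] = 1;
        if (i + 1 < plen && p[i + 1] == '*') i += 2; else break;
    }
}

/* Scan subject once from its start and record the length of every match.
   Up to max lengths are stored, longest first; if there are more than max
   matches, only the max shortest are kept. Returns the number stored. */
size_t toy_all_matches(const char *p, const char *subject,
                       size_t *lengths, size_t max)
{
    unsigned char cur[TOY_MAX_PATTERN + 1], next[TOY_MAX_PATTERN + 1];
    size_t plen = strlen(p), pos, i, count = 0;

    if (plen > TOY_MAX_PATTERN) return 0;
    memset(cur, 0, sizeof cur);
    add_state(p, plen, 0, cur);

    for (pos = 0; ; pos++) {
        int any = 0;
        /* Position plen active means a path has terminated here. */
        if (cur[plen] && count < max) lengths[count++] = pos;
        if (subject[pos] == '\0') break;

        memset(next, 0, sizeof next);
        for (i = 0; i < plen; i++) {
            if (!cur[i]) continue;
            if (p[i] == '.' || p[i] == subject[pos]) {
                int starred = (i + 1 < plen && p[i + 1] == '*');
                add_state(p, plen, starred ? i : i + 1, next);
                any = 1;
            }
        }
        if (!any) break;              /* no unterminated paths remain */
        memcpy(cur, next, sizeof cur);
    }

    /* Matches were found shortest first; reverse to report longest first. */
    for (i = 0; i < count / 2; i++) {
        size_t tmp = lengths[i];
        lengths[i] = lengths[count - 1 - i];
        lengths[count - 1 - i] = tmp;
    }
    return count;
}
```

       Applied to the toy pattern "<.*>" and the subject "<something>
       <something else> <something further>" from the introduction, this
       sketch reports three matches, of lengths 48, 28, and 11.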
964
965       Note that all the matches that are found start at the same point in the
966       subject. If the pattern
967
968         cat(er(pillar)?)?
969
970       is matched against the string "the caterpillar catchment",  the  result
971       will  be the three strings "caterpillar", "cater", and "cat" that start
972       at the fifth character of the subject. The algorithm does not automati-
973       cally move on to find matches that start at later positions.
974
975       There are a number of features of PCRE regular expressions that are not
976       supported by the alternative matching algorithm. They are as follows:
977
978       1. Because the algorithm finds all  possible  matches,  the  greedy  or
979       ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
980       ungreedy quantifiers are treated in exactly the same way. However, pos-
981       sessive  quantifiers can make a difference when what follows could also
982       match what is quantified, for example in a pattern like this:
983
984         ^a++\w!
985
986       This pattern matches "aaab!" but not "aaa!", which would be matched  by
987       a  non-possessive quantifier. Similarly, if an atomic group is present,
988       it is matched as if it were a standalone pattern at the current  point,
989       and  the  longest match is then "locked in" for the rest of the overall
990       pattern.
991
992       2. When dealing with multiple paths through the tree simultaneously, it
993       is  not  straightforward  to  keep track of captured substrings for the
994       different matching possibilities, and  PCRE's  implementation  of  this
995       algorithm does not attempt to do this. This means that no captured sub-
996       strings are available.
997
998       3. Because no substrings are captured, back references within the  pat-
999       tern are not supported, and cause errors if encountered.
1000
1001       4.  For  the same reason, conditional expressions that use a backrefer-
1002       ence as the condition or test for a specific group  recursion  are  not
1003       supported.
1004
1005       5.  Because  many  paths  through the tree may be active, the \K escape
1006       sequence, which resets the start of the match when encountered (but may
1007       be  on  some  paths  and not on others), is not supported. It causes an
1008       error if encountered.
1009
1010       6. Callouts are supported, but the value of the  capture_top  field  is
1011       always 1, and the value of the capture_last field is always -1.
1012
1013       7.  The  \C  escape  sequence, which (in the standard algorithm) always
1014       matches a single data unit, even in UTF-8 or UTF-16 modes, is not  sup-
1015       ported  in these modes, because the alternative algorithm moves through
1016       the subject string one character (not data unit) at  a  time,  for  all
1017       active paths through the tree.
1018
1019       8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
1020       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
1021       negative assertion.
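
       The difference that a possessive quantifier makes (item 1 above) can
       be shown with two hand-coded matchers, one for ^a++\w! and one for
       ^a+\w!. This is only an illustration: the function names are
       hypothetical, and \w is approximated by isalnum() plus underscore in
       the default C locale.

```c
#include <ctype.h>
#include <stddef.h>

static int is_word(unsigned char c) { return isalnum(c) || c == '_'; }

/* ^a++\w!  The run of "a" characters is never shortened once matched. */
int match_possessive(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a') n++;
    if (n == 0) return 0;
    return is_word((unsigned char)s[n]) && s[n + 1] == '!';
}

/* ^a+\w!  The run of "a" characters may give back one character at a
   time, so that \w can match the last "a". */
int match_backtracking(const char *s)
{
    size_t n = 0, k;
    while (s[n] == 'a') n++;
    for (k = n; k >= 1; k--) {
        if (is_word((unsigned char)s[k]) && s[k + 1] == '!') return 1;
    }
    return 0;
}
```

       Both functions accept "aaab!", but only match_backtracking() accepts
       "aaa!".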
1022
1023
1024ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1025
1026       Using  the alternative matching algorithm provides the following advan-
1027       tages:
1028
1029       1. All possible matches (at a single point in the subject) are automat-
1030       ically  found,  and  in particular, the longest match is found. To find
1031       more than one match using the standard algorithm, you have to do kludgy
1032       things with callouts.
1033
1034       2.  Because  the  alternative  algorithm  scans the subject string just
1035       once, and never needs to backtrack (except for lookbehinds), it is pos-
1036       sible  to  pass  very  long subject strings to the matching function in
1037       several pieces, checking for partial matching each time. Although it is
1038       possible  to  do multi-segment matching using the standard algorithm by
1039       retaining partially matched substrings, it  is  more  complicated.  The
1040       pcrepartial  documentation  gives  details of partial matching and dis-
1041       cusses multi-segment matching.
1042
1043
1044DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
1045
1046       The alternative algorithm suffers from a number of disadvantages:
1047
1048       1. It is substantially slower than  the  standard  algorithm.  This  is
1049       partly  because  it has to search for all possible matches, but is also
1050       because it is less susceptible to optimization.
1051
1052       2. Capturing parentheses and back references are not supported.
1053
1054       3. Although atomic groups are supported, their use does not provide the
1055       performance advantage that it does for the standard algorithm.
1056
1057
1058AUTHOR
1059
1060       Philip Hazel
1061       University Computing Service
1062       Cambridge CB2 3QH, England.
1063
1064
1065REVISION
1066
1067       Last updated: 08 January 2012
1068       Copyright (c) 1997-2012 University of Cambridge.
1069------------------------------------------------------------------------------
1070
1071
1072PCREAPI(3)                                                          PCREAPI(3)
1073
1074
1075NAME
1076       PCRE - Perl-compatible regular expressions
1077
1078       #include <pcre.h>
1079
1080
1081PCRE NATIVE API BASIC FUNCTIONS
1082
1083       pcre *pcre_compile(const char *pattern, int options,
1084            const char **errptr, int *erroffset,
1085            const unsigned char *tableptr);
1086
1087       pcre *pcre_compile2(const char *pattern, int options,
1088            int *errorcodeptr,
1089            const char **errptr, int *erroffset,
1090            const unsigned char *tableptr);
1091
1092       pcre_extra *pcre_study(const pcre *code, int options,
1093            const char **errptr);
1094
1095       void pcre_free_study(pcre_extra *extra);
1096
1097       int pcre_exec(const pcre *code, const pcre_extra *extra,
1098            const char *subject, int length, int startoffset,
1099            int options, int *ovector, int ovecsize);
1100
1101       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1102            const char *subject, int length, int startoffset,
1103            int options, int *ovector, int ovecsize,
1104            int *workspace, int wscount);
1105
1106
1107PCRE NATIVE API STRING EXTRACTION FUNCTIONS
1108
1109       int pcre_copy_named_substring(const pcre *code,
1110            const char *subject, int *ovector,
1111            int stringcount, const char *stringname,
1112            char *buffer, int buffersize);
1113
1114       int pcre_copy_substring(const char *subject, int *ovector,
1115            int stringcount, int stringnumber, char *buffer,
1116            int buffersize);
1117
1118       int pcre_get_named_substring(const pcre *code,
1119            const char *subject, int *ovector,
1120            int stringcount, const char *stringname,
1121            const char **stringptr);
1122
1123       int pcre_get_stringnumber(const pcre *code,
1124            const char *name);
1125
1126       int pcre_get_stringtable_entries(const pcre *code,
1127            const char *name, char **first, char **last);
1128
1129       int pcre_get_substring(const char *subject, int *ovector,
1130            int stringcount, int stringnumber,
1131            const char **stringptr);
1132
1133       int pcre_get_substring_list(const char *subject,
1134            int *ovector, int stringcount, const char ***listptr);
1135
1136       void pcre_free_substring(const char *stringptr);
1137
1138       void pcre_free_substring_list(const char **stringptr);
1139
1140
1141PCRE NATIVE API AUXILIARY FUNCTIONS
1142
1143       pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
1144
1145       void pcre_jit_stack_free(pcre_jit_stack *stack);
1146
1147       void pcre_assign_jit_stack(pcre_extra *extra,
1148            pcre_jit_callback callback, void *data);
1149
1150       const unsigned char *pcre_maketables(void);
1151
1152       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1153            int what, void *where);
1154
1155       int pcre_refcount(pcre *code, int adjust);
1156
1157       int pcre_config(int what, void *where);
1158
1159       const char *pcre_version(void);
1160
1161       int pcre_pattern_to_host_byte_order(pcre *code,
1162            pcre_extra *extra, const unsigned char *tables);
1163
1164
1165PCRE NATIVE API INDIRECTED FUNCTIONS
1166
1167       void *(*pcre_malloc)(size_t);
1168
1169       void (*pcre_free)(void *);
1170
1171       void *(*pcre_stack_malloc)(size_t);
1172
1173       void (*pcre_stack_free)(void *);
1174
1175       int (*pcre_callout)(pcre_callout_block *);
1176
1177
1178PCRE 8-BIT AND 16-BIT LIBRARIES
1179
1180       From  release  8.30,  PCRE  can  be  compiled as a library for handling
1181       16-bit character strings as  well  as,  or  instead  of,  the  original
1182       library that handles 8-bit character strings. To avoid too much compli-
1183       cation, this document describes the 8-bit versions  of  the  functions,
1184       with only occasional references to the 16-bit library.
1185
1186       The  16-bit  functions  operate in the same way as their 8-bit counter-
1187       parts; they just use different  data  types  for  their  arguments  and
1188       results, and their names start with pcre16_ instead of pcre_. For every
1189       option that has UTF8 in its name (for example, PCRE_UTF8), there  is  a
1190       corresponding 16-bit name with UTF8 replaced by UTF16. This facility is
1191       in fact just cosmetic; the 16-bit option names define the same bit val-
1192       ues.
1193
1194       References to bytes and UTF-8 in this document should be read as refer-
1195       ences to 16-bit data  quantities  and  UTF-16  when  using  the  16-bit
1196       library,  unless specified otherwise. More details of the specific dif-
1197       ferences for the 16-bit library are given in the pcre16 page.
1198
1199
1200PCRE API OVERVIEW
1201
1202       PCRE has its own native API, which is described in this document. There
1203       are  also some wrapper functions (for the 8-bit library only) that cor-
1204       respond to the POSIX regular expression  API,  but  they  do  not  give
1205       access  to  all  the functionality. They are described in the pcreposix
1206       documentation. Both of these APIs define a set of C function  calls.  A
1207       C++ wrapper (again for the 8-bit library only) is also distributed with
1208       PCRE. It is documented in the pcrecpp page.
1209
1210       The native API C function prototypes are defined  in  the  header  file
1211       pcre.h,  and  on Unix-like systems the (8-bit) library itself is called
1212       libpcre. It can normally be accessed by adding -lpcre  to  the  command
1213       for  linking an application that uses PCRE. The header file defines the
1214       macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1215       numbers  for the library. Applications can use these to include support
1216       for different releases of PCRE.
1217
1218       In a Windows environment, if you want to statically link an application
1219       program  against  a  non-dll  pcre.a  file, you must define PCRE_STATIC
1220       before including pcre.h or pcrecpp.h, because otherwise  the  pcre_mal-
1221       loc()   and   pcre_free()   exported   functions   will   be   declared
1222       __declspec(dllimport), with unwanted results.
1223
1224       The  functions  pcre_compile(),  pcre_compile2(),   pcre_study(),   and
1225       pcre_exec()  are used for compiling and matching regular expressions in
1226       a Perl-compatible manner. A sample program that demonstrates  the  sim-
1227       plest  way  of  using them is provided in the file called pcredemo.c in
1228       the PCRE source distribution. A listing of this program is given in the
1229       pcredemo  documentation, and the pcresample documentation describes how
1230       to compile and run it.
1231
1232       Just-in-time compiler support is an optional feature of PCRE  that  can
1233       be built in appropriate hardware environments. It greatly speeds up the
1234       matching performance of  many  patterns.  Simple  programs  can  easily
1235       request  that  it  be  used  if available, by setting an option that is
1236       ignored when it is not relevant. More complicated programs  might  need
1237       to     make    use    of    the    functions    pcre_jit_stack_alloc(),
1238       pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to  control
1239       the  JIT  code's  memory  usage.   These functions are discussed in the
1240       pcrejit documentation.
1241
1242       A second matching function, pcre_dfa_exec(), which is not Perl-compati-
1243       ble,  is  also provided. This uses a different algorithm for the match-
1244       ing. The alternative algorithm finds all possible matches (at  a  given
1245       point  in  the  subject), and scans the subject just once (unless there
1246       are lookbehind assertions). However, this  algorithm  does  not  return
1247       captured  substrings.  A description of the two matching algorithms and
1248       their advantages and disadvantages is given in the  pcrematching  docu-
1249       mentation.
1250
1251       In  addition  to  the  main compiling and matching functions, there are
1252       convenience functions for extracting captured substrings from a subject
1253       string that is matched by pcre_exec(). They are:
1254
1255         pcre_copy_substring()
1256         pcre_copy_named_substring()
1257         pcre_get_substring()
1258         pcre_get_named_substring()
1259         pcre_get_substring_list()
1260         pcre_get_stringnumber()
1261         pcre_get_stringtable_entries()
1262
1263       pcre_free_substring() and pcre_free_substring_list() are also provided,
1264       to free the memory used for extracted strings.
1265
1266       The function pcre_maketables() is used to  build  a  set  of  character
1267       tables   in   the   current   locale  for  passing  to  pcre_compile(),
1268       pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
1269       provided  for  specialist  use.  Most  commonly,  no special tables are
1270       passed, in which case internal tables that are generated when  PCRE  is
1271       built are used.
1272
1273       The  function  pcre_fullinfo()  is used to find out information about a
1274       compiled pattern. The function pcre_version() returns a  pointer  to  a
1275       string containing the version of PCRE and its date of release.
1276
1277       The  function  pcre_refcount()  maintains  a  reference count in a data
1278       block containing a compiled pattern. This is provided for  the  benefit
1279       of object-oriented applications.
1280
1281       The  global  variables  pcre_malloc and pcre_free initially contain the
1282       entry points of the standard malloc()  and  free()  functions,  respec-
1283       tively. PCRE calls the memory management functions via these variables,
1284       so a calling program can replace them if it  wishes  to  intercept  the
1285       calls. This should be done before calling any PCRE functions.
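
       As a minimal sketch of such interception, the following pair of
       functions has the same signatures as the pcre_malloc and pcre_free
       variables and counts the blocks that are currently allocated. The
       names counting_malloc(), counting_free(), and
       counting_live_blocks() are hypothetical, and the counter is not
       protected against concurrent use.

```c
#include <stddef.h>
#include <stdlib.h>

static long live_blocks = 0;   /* blocks obtained and not yet freed */

void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL) live_blocks++;
    return block;
}

void counting_free(void *block)
{
    if (block != NULL) live_blocks--;
    free(block);
}

long counting_live_blocks(void) { return live_blocks; }

/* In a program that includes pcre.h, the functions would be installed
   before any other PCRE call by assigning them to the two variables:
   pcre_malloc = counting_malloc; pcre_free = counting_free; */
```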
1286
1287       The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
1288       indirections to memory management functions.  These  special  functions
1289       are  used  only  when  PCRE is compiled to use the heap for remembering
1290       data, instead of recursive function calls, when running the pcre_exec()
1291       function.  See  the  pcrebuild  documentation  for details of how to do
1292       this. It is a non-standard way of building PCRE, for  use  in  environ-
1293       ments  that  have  limited stacks. Because of the greater use of memory
1294       management, it runs more slowly. Separate  functions  are  provided  so
1295       that  special-purpose  external  code  can  be used for this case. When
1296       used, these functions are always called in a  stack-like  manner  (last
1297       obtained,  first freed), and always for memory blocks of the same size.
1298       There is a discussion about PCRE's stack usage in the  pcrestack  docu-
1299       mentation.
1300
1301       The global variable pcre_callout initially contains NULL. It can be set
1302       by the caller to a "callout" function, which PCRE  will  then  call  at
1303       specified  points during a matching operation. Details are given in the
1304       pcrecallout documentation.
1305
1306
1307NEWLINES
1308
1309       PCRE supports five different conventions for indicating line breaks  in
1310       strings:  a  single  CR (carriage return) character, a single LF (line-
1311       feed) character, the two-character sequence CRLF, any of the three pre-
1312       ceding,  or any Unicode newline sequence. The Unicode newline sequences
1313       are the three just mentioned, plus the single characters  VT  (vertical
1314       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1315       separator, U+2028), and PS (paragraph separator, U+2029).
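
       The "any Unicode newline sequence" convention can be expressed as a
       small test on code points. The sketch below is an illustration only,
       not PCRE's own code; the function name any_newline_length() is
       hypothetical, and the subject is assumed to be already decoded into
       an array of code points.

```c
#include <stddef.h>

/* Return the length, in code points, of the Unicode newline sequence at
   the start of cp (2 for CRLF, 1 for a single newline character), or 0
   if there is none. remaining is the number of code points available. */
size_t any_newline_length(const unsigned int *cp, size_t remaining)
{
    if (remaining == 0) return 0;
    switch (cp[0]) {
    case 0x000D:                         /* CR, possibly followed by LF */
        return (remaining > 1 && cp[1] == 0x000A) ? 2 : 1;
    case 0x000A:                         /* LF */
    case 0x000B:                         /* VT */
    case 0x000C:                         /* FF */
    case 0x0085:                         /* NEL */
    case 0x2028:                         /* LS */
    case 0x2029:                         /* PS */
        return 1;
    default:
        return 0;
    }
}
```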
1316
       Each of the first three conventions is used by at least  one  operating
       system  as its standard newline sequence. When PCRE is built, a default
       can be specified; if none is given, the default is LF, which is the Unix
       standard. When PCRE is run, the default can be overridden, either when a
       pattern is compiled, or when it is matched.
1322
1323       At compile time, the newline convention can be specified by the options
1324       argument  of  pcre_compile(), or it can be specified by special text at
1325       the start of the pattern itself; this overrides any other settings. See
1326       the pcrepattern page for details of the special character sequences.
1327
1328       In the PCRE documentation the word "newline" is used to mean "the char-
1329       acter or pair of characters that indicate a line break". The choice  of
1330       newline  convention  affects  the  handling of the dot, circumflex, and
1331       dollar metacharacters, the handling of #-comments in /x mode, and, when
1332       CRLF  is a recognized line ending sequence, the match position advance-
1333       ment for a non-anchored pattern. There is more detail about this in the
1334       section on pcre_exec() options below.
1335
1336       The  choice of newline convention does not affect the interpretation of
1337       the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
1338       which is controlled in a similar way, but by separate options.
1339
1340
1341MULTITHREADING
1342
1343       The  PCRE  functions  can be used in multi-threading applications, with
1344       the  proviso  that  the  memory  management  functions  pointed  to  by
1345       pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1346       callout function pointed to by pcre_callout, are shared by all threads.
1347
1348       The compiled form of a regular expression is not altered during  match-
1349       ing, so the same compiled pattern can safely be used by several threads
1350       at once.
1351
1352       If the just-in-time optimization feature is being used, it needs  sepa-
1353       rate  memory stack areas for each thread. See the pcrejit documentation
1354       for more details.
1355
1356
1357SAVING PRECOMPILED PATTERNS FOR LATER USE
1358
1359       The compiled form of a regular expression can be saved and re-used at a
1360       later  time,  possibly by a different program, and even on a host other
1361       than the one on which  it  was  compiled.  Details  are  given  in  the
1362       pcreprecompile  documentation,  which  includes  a  description  of the
1363       pcre_pattern_to_host_byte_order() function. However, compiling a  regu-
1364       lar  expression  with one version of PCRE for use with a different ver-
1365       sion is not guaranteed to work and may cause crashes.
1366
1367
1368CHECKING BUILD-TIME OPTIONS
1369
1370       int pcre_config(int what, void *where);
1371
1372       The function pcre_config() makes it possible for a PCRE client to  dis-
1373       cover which optional features have been compiled into the PCRE library.
1374       The pcrebuild documentation has more details about these optional  fea-
1375       tures.
1376
1377       The  first  argument  for pcre_config() is an integer, specifying which
1378       information is required; the second argument is a pointer to a variable
1379       into  which  the  information  is placed. The returned value is zero on
1380       success, or the negative error code PCRE_ERROR_BADOPTION if  the  value
1381       in  the  first argument is not recognized. The following information is
1382       available:
1383
1384         PCRE_CONFIG_UTF8
1385
1386       The output is an integer that is set to one if UTF-8 support is  avail-
1387       able;  otherwise  it  is  set  to  zero. If this option is given to the
1388       16-bit  version  of  this  function,  pcre16_config(),  the  result  is
1389       PCRE_ERROR_BADOPTION.
1390
1391         PCRE_CONFIG_UTF16
1392
1393       The output is an integer that is set to one if UTF-16 support is avail-
1394       able; otherwise it is set to zero. This value should normally be  given
1395       to the 16-bit version of this function, pcre16_config(). If it is given
1396       to the 8-bit version of this function, the result is  PCRE_ERROR_BADOP-
1397       TION.
1398
1399         PCRE_CONFIG_UNICODE_PROPERTIES
1400
1401       The  output  is  an  integer  that is set to one if support for Unicode
1402       character properties is available; otherwise it is set to zero.
1403
1404         PCRE_CONFIG_JIT
1405
1406       The output is an integer that is set to one if support for just-in-time
1407       compiling is available; otherwise it is set to zero.
1408
1409         PCRE_CONFIG_JITTARGET
1410
1411       The  output is a pointer to a zero-terminated "const char *" string. If
1412       JIT support is available, the string contains the name of the architec-
1413       ture  for  which the JIT compiler is configured, for example "x86 32bit
1414       (little endian + unaligned)". If JIT  support  is  not  available,  the
1415       result is NULL.
1416
1417         PCRE_CONFIG_NEWLINE
1418
       The output is an integer whose value specifies the default character
       sequence that is recognized as meaning "newline". The five values that
       are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
       and -1 for ANY. Though they are derived from ASCII, the same values
       are returned in EBCDIC environments. The default should normally
       correspond to the standard sequence for your operating system.
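
       The values listed above can be decoded with a small switch. The
       sketch below assumes that the integer has already been obtained from
       pcre_config(); the helper name newline_config_name() is hypothetical
       and is not part of the PCRE API. Note that 3338 is 13 * 256 + 10,
       that is, the CR and LF codes packed into one number.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: maps the integer that
   pcre_config() stores for PCRE_CONFIG_NEWLINE to a readable name.
   Returns NULL for any value not listed in the text above. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:              return "LF";
    case 13:              return "CR";
    case (13 << 8) | 10:  return "CRLF";     /* 3338 */
    case -2:              return "ANYCRLF";
    case -1:              return "ANY";
    default:              return NULL;
    }
}
```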
1425
1426         PCRE_CONFIG_BSR
1427
1428       The output is an integer whose value indicates what character sequences
1429       the  \R  escape sequence matches by default. A value of 0 means that \R
1430       matches any Unicode line ending sequence; a value of 1  means  that  \R
1431       matches only CR, LF, or CRLF. The default can be overridden when a pat-
1432       tern is compiled or matched.
1433
1434         PCRE_CONFIG_LINK_SIZE
1435
1436       The output is an integer that contains the number  of  bytes  used  for
1437       internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
1438       library, the value can be 2, 3, or 4. For the 16-bit library, the value
1439       is either 2 or 4 and is still a number of bytes. The default value of 2
1440       is sufficient for all but the most massive patterns,  since  it  allows
1441       the  compiled  pattern  to  be  up to 64K in size.  Larger values allow
1442       larger regular expressions to be compiled, at  the  expense  of  slower
1443       matching.
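
       The 64K figure follows from the width of the link field. The sketch
       below is arithmetic only: it computes how many distinct offsets an
       unsigned field of a given number of bytes can encode. The function
       name link_offset_range() is hypothetical. The text above confirms
       only the figure for a link size of 2; a particular build may cap
       larger sizes below the raw range computed here.

```c
/* Arithmetic illustration only, not a PCRE function: the number of
   distinct offsets that an unsigned field of link_size bytes can encode.
   Returns 0 for sizes outside the range 2 to 4. */
unsigned long long link_offset_range(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return 1ULL << (8 * link_size);    /* 2 bytes give 65536, i.e. 64K */
}
```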
1444
1445         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1446
1447       The  output  is  an integer that contains the threshold above which the
1448       POSIX interface uses malloc() for output vectors. Further  details  are
1449       given in the pcreposix documentation.
1450
1451         PCRE_CONFIG_MATCH_LIMIT
1452
1453       The  output is a long integer that gives the default limit for the num-
1454       ber of internal matching function calls  in  a  pcre_exec()  execution.
1455       Further details are given with pcre_exec() below.
1456
1457         PCRE_CONFIG_MATCH_LIMIT_RECURSION
1458
1459       The output is a long integer that gives the default limit for the depth
1460       of  recursion  when  calling  the  internal  matching  function  in   a
1461       pcre_exec()  execution.  Further  details  are  given  with pcre_exec()
1462       below.
1463
1464         PCRE_CONFIG_STACKRECURSE
1465
1466       The output is an integer that is set to one if internal recursion  when
1467       running pcre_exec() is implemented by recursive function calls that use
1468       the stack to remember their state. This is the usual way that  PCRE  is
1469       compiled. The output is zero if PCRE was compiled to use blocks of data
1470       on the  heap  instead  of  recursive  function  calls.  In  this  case,
1471       pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
1472       blocks on the heap, thus avoiding the use of the stack.
1473
1474
1475COMPILING A PATTERN
1476
1477       pcre *pcre_compile(const char *pattern, int options,
1478            const char **errptr, int *erroffset,
1479            const unsigned char *tableptr);
1480
1481       pcre *pcre_compile2(const char *pattern, int options,
1482            int *errorcodeptr,
1483            const char **errptr, int *erroffset,
1484            const unsigned char *tableptr);
1485
1486       Either of the functions pcre_compile() or pcre_compile2() can be called
1487       to compile a pattern into an internal form. The only difference between
1488       the two interfaces is that pcre_compile2() has an additional  argument,
1489       errorcodeptr,  via  which  a  numerical  error code can be returned. To
1490       avoid too much repetition, we refer just to pcre_compile()  below,  but
1491       the information applies equally to pcre_compile2().
1492
1493       The pattern is a C string terminated by a binary zero, and is passed in
1494       the pattern argument. A pointer to a single block  of  memory  that  is
1495       obtained  via  pcre_malloc is returned. This contains the compiled code
1496       and related data. The pcre type is defined for the returned block; this
1497       is a typedef for a structure whose contents are not externally defined.
1498       It is up to the caller to free the memory (via pcre_free) when it is no
1499       longer required.
1500
1501       Although  the compiled code of a PCRE regex is relocatable, that is, it
1502       does not depend on memory location, the complete pcre data block is not
1503       fully  relocatable, because it may contain a copy of the tableptr argu-
1504       ment, which is an address (see below).
1505
       The options argument contains various bit settings that affect the
       compilation. It should be zero if no options are required. The
       available options are described below. Some of them (in particular,
       those that are compatible with Perl, but some others as well) can also
       be set and unset from within the pattern (see the detailed description
       in the pcrepattern documentation). For those options that can be
       different in different parts of the pattern, the contents of the
       options argument specify their settings at the start of compilation
       and execution. The PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx,
       PCRE_NO_UTF8_CHECK, and PCRE_NO_START_OPTIMIZE options can be set at
       the time of matching as well as at compile time.
1517
1518       If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1519       if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1520       sets the variable pointed to by errptr to point to a textual error mes-
1521       sage. This is a static string that is part of the library. You must not
1522       try to free it. Normally, the offset from the start of the  pattern  to
1523       the  byte  that  was  being  processed when the error was discovered is
1524       placed in the variable pointed to by erroffset, which must not be  NULL
1525       (if  it is, an immediate error is given). However, for an invalid UTF-8
1526       string, the offset is that of the first byte of the failing character.
1527
1528       Some errors are not detected until the whole pattern has been  scanned;
1529       in  these  cases,  the offset passed back is the length of the pattern.
1530       Note that the offset is in bytes, not characters, even in  UTF-8  mode.
1531       It may sometimes point into the middle of a UTF-8 character.
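
       Because the offset is a byte count, a program that reports errors in
       terms of characters has to convert it. The following sketch shows one
       way of doing so, assuming that the pattern itself is valid UTF-8 and
       that the offset lies within it. Both helper names are hypothetical;
       neither is part of the PCRE API.

```c
/* Hypothetical helpers, not part of PCRE. Both assume that pattern is
   valid UTF-8 and that 0 <= byte_offset <= strlen(pattern). */

/* Non-zero if byte_offset lands on a UTF-8 continuation byte (10xxxxxx),
   that is, in the middle of a multi-byte character. */
int offset_is_mid_character(const char *pattern, int byte_offset)
{
    return ((unsigned char)pattern[byte_offset] & 0xC0) == 0x80;
}

/* Number of characters that start before byte_offset. */
int offset_to_char_index(const char *pattern, int byte_offset)
{
    int i, chars = 0;

    for (i = 0; i < byte_offset; i++) {
        if (((unsigned char)pattern[i] & 0xC0) != 0x80)
            chars++;                   /* count lead and single bytes only */
    }
    return chars;
}
```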
1532
1533       If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1534       codeptr argument is not NULL, a non-zero error code number is  returned
1535       via  this argument in the event of an error. This is in addition to the
1536       textual error message. Error codes and messages are listed below.
1537
1538       If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
1539       character  tables  that  are  built  when  PCRE  is compiled, using the
1540       default C locale. Otherwise, tableptr must be an address  that  is  the
1541       result  of  a  call to pcre_maketables(). This value is stored with the
1542       compiled pattern, and used again by pcre_exec(), unless  another  table
1543       pointer is passed to it. For more discussion, see the section on locale
1544       support below.
1545
1546       This code fragment shows a typical straightforward  call  to  pcre_com-
1547       pile():
1548
         pcre *re;
         const char *error;
         int erroffset;
         re = pcre_compile(
           "^A.*Z",          /* the pattern */
           0,                /* default options */
           &error,           /* for error message */
           &erroffset,       /* for error offset */
           NULL);            /* use default character tables */
         if (re == NULL)     /* compilation failed */
           printf("PCRE compilation failed at offset %d: %s\n",
             erroffset, error);
1558
1559       The  following  names  for option bits are defined in the pcre.h header
1560       file:
1561
1562         PCRE_ANCHORED
1563
1564       If this bit is set, the pattern is forced to be "anchored", that is, it
1565       is  constrained to match only at the first matching point in the string
1566       that is being searched (the "subject string"). This effect can also  be
1567       achieved  by appropriate constructs in the pattern itself, which is the
1568       only way to do it in Perl.
1569
1570         PCRE_AUTO_CALLOUT
1571
1572       If this bit is set, pcre_compile() automatically inserts callout items,
1573       all  with  number  255, before each pattern item. For discussion of the
1574       callout facility, see the pcrecallout documentation.
1575
1576         PCRE_BSR_ANYCRLF
1577         PCRE_BSR_UNICODE
1578
1579       These options (which are mutually exclusive) control what the \R escape
1580       sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1581       or to match any Unicode newline sequence. The default is specified when
1582       PCRE is built. It can be overridden from within the pattern, or by set-
1583       ting an option when a compiled pattern is matched.
1584
1585         PCRE_CASELESS
1586
1587       If this bit is set, letters in the pattern match both upper  and  lower
1588       case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
1589       changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
1590       always  understands the concept of case for characters whose values are
1591       less than 128, so caseless matching is always possible. For  characters
1592       with  higher  values,  the concept of case is supported if PCRE is com-
1593       piled with Unicode property support, but not otherwise. If you want  to
1594       use  caseless  matching  for  characters 128 and above, you must ensure
1595       that PCRE is compiled with Unicode property support  as  well  as  with
1596       UTF-8 support.
1597
1598         PCRE_DOLLAR_ENDONLY
1599
1600       If  this bit is set, a dollar metacharacter in the pattern matches only
1601       at the end of the subject string. Without this option,  a  dollar  also
1602       matches  immediately before a newline at the end of the string (but not
1603       before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
1604       if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
1605       Perl, and no way to set it within a pattern.
1606
1607         PCRE_DOTALL
1608
1609       If this bit is set, a dot metacharacter in the pattern matches a  char-
1610       acter of any value, including one that indicates a newline. However, it
1611       only ever matches one character, even if newlines are  coded  as  CRLF.
1612       Without  this option, a dot does not match when the current position is
1613       at a newline. This option is equivalent to Perl's /s option, and it can
1614       be  changed within a pattern by a (?s) option setting. A negative class
1615       such as [^a] always matches newline characters, independent of the set-
1616       ting of this option.
1617
1618         PCRE_DUPNAMES
1619
1620       If  this  bit is set, names used to identify capturing subpatterns need
1621       not be unique. This can be helpful for certain types of pattern when it
1622       is  known  that  only  one instance of the named subpattern can ever be
1623       matched. There are more details of named subpatterns  below;  see  also
1624       the pcrepattern documentation.
1625
1626         PCRE_EXTENDED
1627
1628       If  this  bit  is  set,  white space data characters in the pattern are
1629       totally ignored except when escaped or inside a character class.  White
1630       space does not include the VT character (code 11). In addition, charac-
1631       ters between an unescaped # outside a character class and the next new-
1632       line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
1633       option, and it can be changed within a pattern by a  (?x)  option  set-
1634       ting.
1635
1636       Which  characters  are  interpreted  as  newlines  is controlled by the
1637       options passed to pcre_compile() or by a special sequence at the  start
1638       of  the  pattern, as described in the section entitled "Newline conven-
1639       tions" in the pcrepattern documentation. Note that the end of this type
1640       of  comment  is  a  literal  newline  sequence  in  the pattern; escape
1641       sequences that happen to represent a newline do not count.
1642
1643       This option makes it possible to include  comments  inside  complicated
1644       patterns.   Note,  however,  that this applies only to data characters.
1645       White space  characters  may  never  appear  within  special  character
1646       sequences in a pattern, for example within the sequence (?( that intro-
1647       duces a conditional subpattern.
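
       The rules above can be modelled by a small filter that removes what
       PCRE_EXTENDED would ignore. This is a simplified sketch, not PCRE's
       parser: it assumes that LF is the newline sequence, and it does not
       handle \Q...\E quoting or POSIX class names such as [:alpha:]. The
       names strip_extended() and is_pattern_space() are hypothetical.

```c
#include <stddef.h>

/* White space as described above: the C isspace() set minus VT (code 11). */
static int is_pattern_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Hypothetical helper, not part of PCRE. Copies pattern to out, dropping
   the characters that PCRE_EXTENDED would ignore. out must have room for
   strlen(pattern) + 1 bytes. Returns the length of the result. */
size_t strip_extended(const char *pattern, char *out)
{
    const char *p = pattern;
    size_t n = 0;
    int in_class = 0;

    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;
        if (c == '\\' && p[1] != '\0') {
            out[n++] = *p++;           /* keep the backslash ...           */
            out[n++] = *p++;           /* ... and the character it escapes */
        } else if (in_class) {
            if (c == ']')
                in_class = 0;          /* end of the character class */
            out[n++] = *p++;
        } else if (c == '[') {
            in_class = 1;
            out[n++] = *p++;
            if (*p == '^')
                out[n++] = *p++;       /* negation marker */
            if (*p == ']')
                out[n++] = *p++;       /* a leading ] is a literal */
        } else if (is_pattern_space(c)) {
            p++;                       /* ignored white space */
        } else if (c == '#') {
            while (*p != '\0' && *p != '\n')
                p++;                   /* skip the comment text */
            if (*p == '\n')
                p++;                   /* the newline is ignored too */
        } else {
            out[n++] = *p++;
        }
    }
    out[n] = '\0';
    return n;
}
```

       For example, the pattern text "a b # note", a newline, and " c"
       reduces to "abc", while white space and # inside [ #] are kept.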
1648
1649         PCRE_EXTRA
1650
1651       This option was invented in order to turn on  additional  functionality
1652       of  PCRE  that  is  incompatible with Perl, but it is currently of very
1653       little use. When set, any backslash in a pattern that is followed by  a
1654       letter  that  has  no  special  meaning causes an error, thus reserving
1655       these combinations for future expansion. By  default,  as  in  Perl,  a
1656       backslash  followed by a letter with no special meaning is treated as a
1657       literal. (Perl can, however, be persuaded to give an error for this, by
1658       running  it with the -w option.) There are at present no other features
1659       controlled by this option. It can also be set by a (?X) option  setting
1660       within a pattern.
1661
1662         PCRE_FIRSTLINE
1663
1664       If  this  option  is  set,  an  unanchored pattern is required to match
1665       before or at the first  newline  in  the  subject  string,  though  the
1666       matched text may continue over the newline.
1667
1668         PCRE_JAVASCRIPT_COMPAT
1669
1670       If this option is set, PCRE's behaviour is changed in some ways so that
1671       it is compatible with JavaScript rather than Perl. The changes  are  as
1672       follows:
1673
1674       (1)  A  lone  closing square bracket in a pattern causes a compile-time
1675       error, because this is illegal in JavaScript (by default it is  treated
1676       as a data character). Thus, the pattern AB]CD becomes illegal when this
1677       option is set.
1678
1679       (2) At run time, a back reference to an unset subpattern group  matches
1680       an  empty  string (by default this causes the current matching alterna-
1681       tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
1682       set  (assuming  it can find an "a" in the subject), whereas it fails by
1683       default, for Perl compatibility.
1684
1685       (3) \U matches an upper case "U" character; by default \U causes a com-
1686       pile time error (Perl uses \U to upper case subsequent characters).
1687
1688       (4) \u matches a lower case "u" character unless it is followed by four
1689       hexadecimal digits, in which case the hexadecimal  number  defines  the
1690       code  point  to match. By default, \u causes a compile time error (Perl
1691       uses it to upper case the following character).
1692
1693       (5) \x matches a lower case "x" character unless it is followed by  two
1694       hexadecimal  digits,  in  which case the hexadecimal number defines the
1695       code point to match. By default, as in Perl, a  hexadecimal  number  is
1696       always expected after \x, but it may have zero, one, or two digits (so,
1697       for example, \xz matches a binary zero character followed by z).
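
       The difference described in item (5) can be made concrete with a
       short sketch. It models only the rule stated there: the \x{...} form
       is not handled, and the function name x_escape_value() is
       hypothetical rather than part of PCRE.

```c
#include <ctype.h>

/* Hypothetical helper, not part of PCRE. text points just past a "\x" in
   a pattern. Stores the number of characters consumed after the x in
   *consumed and returns the character value the escape stands for. The
   \x{...} form is not handled. */
int x_escape_value(const char *text, int javascript_compat, int *consumed)
{
    int value = 0, digits = 0;

    while (digits < 2 && isxdigit((unsigned char)text[digits])) {
        int c = tolower((unsigned char)text[digits]);
        value = value * 16 + (isdigit(c) ? c - '0' : c - 'a' + 10);
        digits++;
    }
    if (javascript_compat && digits < 2) {
        *consumed = 0;                 /* \x is just a literal "x" */
        return 'x';
    }
    *consumed = digits;                /* zero, one, or two hex digits */
    return value;
}
```

       With javascript_compat set to zero, the text "z" yields the value 0
       with nothing consumed, which corresponds to the \xz case mentioned
       above.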
1698
1699         PCRE_MULTILINE
1700
1701       By default, PCRE treats the subject string as consisting  of  a  single
1702       line  of characters (even if it actually contains newlines). The "start
1703       of line" metacharacter (^) matches only at the  start  of  the  string,
1704       while  the  "end  of line" metacharacter ($) matches only at the end of
1705       the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1706       is set). This is the same as Perl.
1707
       When PCRE_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. If there are no
       newlines in a subject string, or no occurrences of ^ or $ in a
       pattern, setting PCRE_MULTILINE has no effect.
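
       The positions at which ^ and $ can match, as described here and under
       PCRE_DOLLAR_ENDONLY, can be modelled as follows. This is a sketch
       that assumes LF is the only newline sequence; it is not how PCRE
       implements the assertions, and both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical model, not PCRE code. Assumes that LF is the only newline
   and that 0 <= pos <= len. */

/* Would ^ match at offset pos? */
int circumflex_matches(const char *subject, size_t len, size_t pos,
                       int multiline)
{
    if (pos == 0)
        return 1;
    /* After an internal newline, that is, one with data following it. */
    return multiline && pos < len && subject[pos - 1] == '\n';
}

/* Would $ match at offset pos? */
int dollar_matches(const char *subject, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos == len)
        return 1;
    if (multiline)                     /* dollar_endonly is ignored */
        return subject[pos] == '\n';
    /* Before a newline that is the last character of the subject. */
    return !dollar_endonly && pos + 1 == len && subject[pos] == '\n';
}
```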
1715
1716         PCRE_NEWLINE_CR
1717         PCRE_NEWLINE_LF
1718         PCRE_NEWLINE_CRLF
1719         PCRE_NEWLINE_ANYCRLF
1720         PCRE_NEWLINE_ANY
1721
1722       These  options  override the default newline definition that was chosen
1723       when PCRE was built. Setting the first or the second specifies  that  a
1724       newline  is  indicated  by a single character (CR or LF, respectively).
1725       Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1726       two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1727       that any of the three preceding sequences should be recognized. Setting
1728       PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1729       recognized. The Unicode newline sequences are the three just mentioned,
1730       plus  the  single  characters VT (vertical tab, U+000B), FF (form feed,
1731       U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1732       (paragraph  separator, U+2029). For the 8-bit library, the last two are
1733       recognized only in UTF-8 mode.
1734
1735       The newline setting in the  options  word  uses  three  bits  that  are
1736       treated as a number, giving eight possibilities. Currently only six are
1737       used (default plus the five values above). This means that if  you  set
1738       more  than one newline option, the combination may or may not be sensi-
1739       ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1740       PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1741       cause an error.
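
       The three-bit encoding can be checked with a few lines of C. The
       numeric values below are believed to match the pcre.h header of PCRE
       8.x, but they are copied here under EX_ names as an assumption so
       that the sketch compiles on its own; real code should include pcre.h
       and use the PCRE_NEWLINE_xxx macros. The mask name and the function
       newline_bits_in_use() are hypothetical.

```c
/* Assumed copies of the pcre.h values (see the note above). */
#define EX_NEWLINE_CR       0x00100000
#define EX_NEWLINE_LF       0x00200000
#define EX_NEWLINE_CRLF     0x00300000
#define EX_NEWLINE_ANY      0x00400000
#define EX_NEWLINE_ANYCRLF  0x00500000
#define EX_NEWLINE_MASK     0x00700000   /* the three newline bits */

/* Hypothetical helper: non-zero if the newline bits in options form one
   of the six numbers in use (0 for "default", 1 to 5 for the options). */
int newline_bits_in_use(int options)
{
    int number = (options & EX_NEWLINE_MASK) >> 20;
    return number <= 5;
}
```

       Under these assumed values, EX_NEWLINE_CR | EX_NEWLINE_LF is equal to
       EX_NEWLINE_CRLF, whereas EX_NEWLINE_LF | EX_NEWLINE_ANY gives the
       number 6, which is one of the two unused values.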
1742
1743       The only time that a line break in a pattern  is  specially  recognized
1744       when  compiling is when PCRE_EXTENDED is set. CR and LF are white space
1745       characters, and so are ignored in this mode. Also, an unescaped #  out-
1746       side  a  character class indicates a comment that lasts until after the
1747       next line break sequence. In other circumstances, line break  sequences
1748       in patterns are treated as literal data.
1749
1750       The newline option that is set at compile time becomes the default that
1751       is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1752
1753         PCRE_NO_AUTO_CAPTURE
1754
1755       If this option is set, it disables the use of numbered capturing paren-
1756       theses  in the pattern. Any opening parenthesis that is not followed by
1757       ? behaves as if it were followed by ?: but named parentheses can  still
1758       be  used  for  capturing  (and  they acquire numbers in the usual way).
1759       There is no equivalent of this option in Perl.
1760
         PCRE_NO_START_OPTIMIZE
1762
1763       This is an option that acts at matching time; that is, it is really  an
1764       option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
1765       time, it is remembered with the compiled pattern and assumed at  match-
1766       ing  time.  For  details  see  the discussion of PCRE_NO_START_OPTIMIZE
1767       below.
1768
1769         PCRE_UCP
1770
1771       This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,
1772       \w,  and  some  of  the POSIX character classes. By default, only ASCII
1773       characters are recognized, but if PCRE_UCP is set,  Unicode  properties
1774       are  used instead to classify characters. More details are given in the
1775       section on generic character types in the pcrepattern page. If you  set
1776       PCRE_UCP,  matching  one of the items it affects takes much longer. The
1777       option is available only if PCRE has been compiled with  Unicode  prop-
1778       erty support.
1779
1780         PCRE_UNGREEDY
1781
1782       This  option  inverts  the "greediness" of the quantifiers so that they
1783       are not greedy by default, but become greedy if followed by "?". It  is
1784       not  compatible  with Perl. It can also be set by a (?U) option setting
1785       within the pattern.
1786
1787         PCRE_UTF8
1788
1789       This option causes PCRE to regard both the pattern and the  subject  as
1790       strings of UTF-8 characters instead of single-byte strings. However, it
1791       is available only when PCRE is built to include UTF  support.  If  not,
1792       the  use  of  this option provokes an error. Details of how this option
1793       changes the behaviour of PCRE are given in the pcreunicode page.
1794
1795         PCRE_NO_UTF8_CHECK
1796
1797       When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1798       automatically  checked.  There  is  a  discussion about the validity of
1799       UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence  is
1800       found,  pcre_compile()  returns an error. If you already know that your
1801       pattern is valid, and you want to skip this check for performance  rea-
1802       sons,  you  can set the PCRE_NO_UTF8_CHECK option.  When it is set, the
1803       effect of passing an invalid UTF-8 string as a pattern is undefined. It
1804       may  cause  your  program  to  crash. Note that this option can also be
1805       passed to pcre_exec() and pcre_dfa_exec(),  to  suppress  the  validity
1806       checking of subject strings.
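
       A program that sets PCRE_NO_UTF8_CHECK takes on the job of
       guaranteeing validity itself. One possible pre-check is sketched
       below. It follows the RFC 3629 rules (no overlong forms, no surrogate
       code points, nothing above U+10FFFF), which is an assumption about
       what "valid" should mean here; the rules that PCRE itself applies
       are those given in the pcreunicode page. The function name
       utf8_is_valid() is hypothetical.

```c
#include <stddef.h>

/* Hypothetical pre-check, not part of PCRE. Returns 1 if the len bytes at
   s are well-formed UTF-8 by the RFC 3629 rules, and 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {                /* single-byte character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                  /* stray continuation or bad lead */
        }
        if (len - i <= extra)
            return 0;                  /* character is truncated */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;              /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)
            return 0;                  /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                  /* surrogate code point */
        i += extra + 1;
    }
    return 1;
}
```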
1807
1808
1809COMPILATION ERROR CODES
1810
       The following table lists the error codes that may be returned by
       pcre_compile2(), along with the error messages that may be returned by
       both compiling functions. Note that error messages are always 8-bit
       ASCII strings, even in 16-bit mode. As PCRE has developed, some error
       codes have fallen out of use. To avoid confusion, they have not been
       re-used.
1817
1818          0  no error
1819          1  \ at end of pattern
1820          2  \c at end of pattern
1821          3  unrecognized character follows \
1822          4  numbers out of order in {} quantifier
1823          5  number too big in {} quantifier
1824          6  missing terminating ] for character class
1825          7  invalid escape sequence in character class
1826          8  range out of order in character class
1827          9  nothing to repeat
1828         10  [this code is not in use]
1829         11  internal error: unexpected repeat
1830         12  unrecognized character after (? or (?-
1831         13  POSIX named classes are supported only within a class
1832         14  missing )
1833         15  reference to non-existent subpattern
1834         16  erroffset passed as NULL
1835         17  unknown option bit(s) set
1836         18  missing ) after comment
1837         19  [this code is not in use]
1838         20  regular expression is too large
1839         21  failed to get memory
1840         22  unmatched parentheses
1841         23  internal error: code overflow
1842         24  unrecognized character after (?<
1843         25  lookbehind assertion is not fixed length
1844         26  malformed number or name after (?(
1845         27  conditional group contains more than two branches
1846         28  assertion expected after (?(
1847         29  (?R or (?[+-]digits must be followed by )
1848         30  unknown POSIX class name
1849         31  POSIX collating elements are not supported
1850         32  this version of PCRE is compiled without UTF support
1851         33  [this code is not in use]
1852         34  character value in \x{...} sequence is too large
1853         35  invalid condition (?(0)
1854         36  \C not allowed in lookbehind assertion
1855         37  PCRE does not support \L, \l, \N{name}, \U, or \u
1856         38  number after (?C is > 255
1857         39  closing ) for (?C expected
1858         40  recursive call could loop indefinitely
1859         41  unrecognized character after (?P
1860         42  syntax error in subpattern name (missing terminator)
1861         43  two named subpatterns have the same name
1862         44  invalid UTF-8 string (specifically UTF-8)
1863         45  support for \P, \p, and \X has not been compiled
1864         46  malformed \P or \p sequence
1865         47  unknown property name after \P or \p
1866         48  subpattern name is too long (maximum 32 characters)
1867         49  too many named subpatterns (maximum 10000)
1868         50  [this code is not in use]
1869         51  octal value is greater than \377 in 8-bit non-UTF-8 mode
1870         52  internal error: overran compiling workspace
1871         53  internal error: previously-checked referenced subpattern
1872               not found
1873         54  DEFINE group contains more than one branch
1874         55  repeating a DEFINE group is not allowed
1875         56  inconsistent NEWLINE options
1876         57  \g is not followed by a braced, angle-bracketed, or quoted
1877               name/number or by a plain number
1878         58  a numbered reference must not be zero
1879         59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
1880         60  (*VERB) not recognized
1881         61  number is too big
1882         62  subpattern name expected
1883         63  digit expected after (?+
1884         64  ] is an invalid data character in JavaScript compatibility mode
1885         65  different names for subpatterns of the same number are
1886               not allowed
1887         66  (*MARK) must have an argument
1888         67  this version of PCRE is not compiled with Unicode property
1889               support
1890         68  \c must be followed by an ASCII character
1891         69  \k is not followed by a braced, angle-bracketed, or quoted name
1892         70  internal error: unknown opcode in find_fixedlength()
1893         71  \N is not supported in a class
1894         72  too many forward references
1895         73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
1896         74  invalid UTF-16 string (specifically UTF-16)
1897         75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
1898         76  character value in \u.... sequence is too large
1899
1900       The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1901       values may be used if the limits were changed when PCRE was built.
1902
1903
1904STUDYING A PATTERN
1905
       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);
1908
1909       If  a  compiled  pattern is going to be used several times, it is worth
1910       spending more time analyzing it in order to speed up the time taken for
1911       matching.  The function pcre_study() takes a pointer to a compiled pat-
1912       tern as its first argument. If studying the pattern produces additional
1913       information  that  will  help speed up matching, pcre_study() returns a
1914       pointer to a pcre_extra block, in which the study_data field points  to
1915       the results of the study.
1916
1917       The  returned  value  from  pcre_study()  can  be  passed  directly  to
1918       pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-
1919       tains  other  fields  that can be set by the caller before the block is
1920       passed; these are described below in the section on matching a pattern.
1921
1922       If studying the  pattern  does  not  produce  any  useful  information,
1923       pcre_study() returns NULL. In that circumstance, if the calling program
1924       wants  to  pass  any  of   the   other   fields   to   pcre_exec()   or
1925       pcre_dfa_exec(), it must set up its own pcre_extra block.
1926
1927       The  second  argument  of  pcre_study() contains option bits. There are
1928       three options:
1929
1930         PCRE_STUDY_JIT_COMPILE
1931         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
1932         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
1933
1934       If any of these are set, and the just-in-time  compiler  is  available,
1935       the  pattern  is  further compiled into machine code that executes much
1936       faster than the pcre_exec()  interpretive  matching  function.  If  the
1937       just-in-time  compiler is not available, these options are ignored. All
1938       other bits in the options argument must be zero.
1939
1940       JIT compilation is a heavyweight optimization. It can  take  some  time
1941       for  patterns  to  be analyzed, and for one-off matches and simple pat-
1942       terns the benefit of faster execution might be offset by a much  slower
1943       study time.  Not all patterns can be optimized by the JIT compiler. For
1944       those that cannot be handled, matching automatically falls back to  the
1945       pcre_exec()  interpreter.  For more details, see the pcrejit documenta-
1946       tion.
1947
1948       The third argument for pcre_study() is a pointer for an error  message.
1949       If  studying  succeeds  (even  if no data is returned), the variable it
1950       points to is set to NULL. Otherwise it is set to  point  to  a  textual
1951       error message. This is a static string that is part of the library. You
1952       must not try to free it. You should test the  error  pointer  for  NULL
1953       after calling pcre_study(), to be sure that it has run successfully.
1954
1955       When  you are finished with a pattern, you can free the memory used for
1956       the study data by calling pcre_free_study(). This function was added to
1957       the  API  for  release  8.20. For earlier versions, the memory could be
1958       freed with pcre_free(), just like the pattern itself. This  will  still
1959       work  in  cases where JIT optimization is not used, but it is advisable
1960       to change to the new function when convenient.
1961
1962       This is a typical way in which pcre_study() is used (except that  in  a
1963       real application there should be tests for errors):
1964
         int rc;
         int erroroffset;
         int ovector[30];
         const char *error;
         pcre *re;
         pcre_extra *sd;
         re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
         sd = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options */
           &error);        /* set to NULL or points to a message */
         rc = pcre_exec(   /* see below for details of pcre_exec() options */
           re, sd, "subject", 7, 0, 0, ovector, 30);
         ...
         pcre_free_study(sd);
         pcre_free(re);
1978
       Studying a pattern does two things. First, it computes a lower bound
       for the length of subject string that is needed to match the pattern.
       This does not mean that there are any strings of that length that
       match, but it does guarantee that no shorter strings match. The value
       is used by pcre_exec() and pcre_dfa_exec() to avoid wasting time by
       trying to match strings that are shorter than the lower bound. You can
       find out the value in a calling program via the pcre_fullinfo()
       function, using PCRE_INFO_MINLENGTH.
1986
1987       Studying a pattern is also useful for non-anchored patterns that do not
1988       have a single fixed starting character. A bitmap of  possible  starting
1989       bytes  is  created. This speeds up finding a position in the subject at
1990       which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
1991       values less than 256.)
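
       The idea behind the starting-byte bitmap can be pictured with a small
       sketch in C. This is not PCRE's internal code; the type start_map, the
       three functions, and the bit layout (bit c & 7 of byte c / 8) are all
       assumptions made for illustration. A 256-bit map records which byte
       values can begin a match, and the scan skips every subject position
       whose byte is not in the map:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 256-bit map: one bit per possible byte value. */
typedef struct { unsigned char bits[32]; } start_map;

/* Clear the map so that no byte value is a possible starting byte. */
void start_map_init(start_map *m)
{
  assert(m != NULL);
  memset(m->bits, 0, sizeof m->bits);
}

/* Record byte value c as a possible first byte of a match. */
void start_map_add(start_map *m, unsigned char c)
{
  assert(m != NULL);
  m->bits[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Return the first offset at or after 'from' whose byte is in the map,
   or -1 if no remaining position in the subject can start a match. */
long first_candidate(const start_map *m, const char *subject,
                     size_t length, size_t from)
{
  size_t i;
  assert(m != NULL && subject != NULL);
  for (i = from; i < length; i++) {
    unsigned char c = (unsigned char)subject[i];
    if (m->bits[c / 8] & (1u << (c & 7))) return (long)i;
  }
  return -1;
}
```

       For a pattern such as (cat|dog) the map would hold "c" and "d", so a
       scan of "xxdog" would skip straight to offset 2 before any matching is
       attempted.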
1992
1993       These  two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
1994       and the information is also used by the JIT  compiler.   The  optimiza-
1995       tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
1996       calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT execu-
1997       tion  is  also disabled. You might want to do this if your pattern con-
1998       tains callouts or (*MARK) and you want to make use of these  facilities
1999       in    cases    where    matching   fails.   See   the   discussion   of
2000       PCRE_NO_START_OPTIMIZE below.
2001
2002
LOCALE SUPPORT
2004
2005       PCRE handles caseless matching, and determines whether  characters  are
2006       letters,  digits, or whatever, by reference to a set of tables, indexed
2007       by character value. When running in UTF-8 mode, this  applies  only  to
2008       characters  with  codes  less than 128. By default, higher-valued codes
2009       never match escapes such as \w or \d, but they can be tested with \p if
2010       PCRE  is  built with Unicode character property support. Alternatively,
2011       the PCRE_UCP option can be set at compile  time;  this  causes  \w  and
2012       friends to use Unicode property support instead of built-in tables. The
       use of locales with Unicode is discouraged. If you are handling
       characters with codes of 128 or greater, you should either use UTF-8
       and Unicode, or use locales, but not try to mix the two.
2016
2017       PCRE contains an internal set of tables that are used  when  the  final
2018       argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
2019       applications.  Normally, the internal tables recognize only ASCII char-
2020       acters. However, when PCRE is built, it is possible to cause the inter-
2021       nal tables to be rebuilt in the default "C" locale of the local system,
2022       which may cause them to be different.
2023
2024       The  internal tables can always be overridden by tables supplied by the
2025       application that calls PCRE. These may be created in a different locale
2026       from  the  default.  As more and more applications change to using Uni-
2027       code, the need for this locale support is expected to die away.
2028
2029       External tables are built by calling  the  pcre_maketables()  function,
2030       which  has no arguments, in the relevant locale. The result can then be
2031       passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
2032       example,  to  build  and use tables that are appropriate for the French
2033       locale (where accented characters with  values  greater  than  128  are
2034       treated as letters), the following code could be used:
2035
         setlocale(LC_CTYPE, "fr_FR");
         tables = pcre_maketables();
         re = pcre_compile(..., tables);
2039
2040       The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
2041       if you are using Windows, the name for the French locale is "french".
2042
2043       When pcre_maketables() runs, the tables are built  in  memory  that  is
2044       obtained  via  pcre_malloc. It is the caller's responsibility to ensure
2045       that the memory containing the tables remains available for as long  as
2046       it is needed.
2047
2048       The pointer that is passed to pcre_compile() is saved with the compiled
2049       pattern, and the same tables are used via this pointer by  pcre_study()
2050       and normally also by pcre_exec(). Thus, by default, for any single pat-
2051       tern, compilation, studying and matching all happen in the same locale,
2052       but different patterns can be compiled in different locales.
2053
2054       It  is  possible to pass a table pointer or NULL (indicating the use of
2055       the internal tables) to pcre_exec(). Although  not  intended  for  this
2056       purpose,  this facility could be used to match a pattern in a different
2057       locale from the one in which it was compiled. Passing table pointers at
2058       run time is discussed below in the section on matching a pattern.
2059
2060
INFORMATION ABOUT A PATTERN
2062
2063       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
2064            int what, void *where);
2065
2066       The  pcre_fullinfo() function returns information about a compiled pat-
2067       tern. It replaces the pcre_info() function, which was removed from  the
2068       library at version 8.30, after more than 10 years of obsolescence.
2069
2070       The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
2071       pattern. The second argument is the result of pcre_study(), or NULL  if
2072       the  pattern  was not studied. The third argument specifies which piece
2073       of information is required, and the fourth argument is a pointer  to  a
2074       variable  to  receive  the  data. The yield of the function is zero for
2075       success, or one of the following negative numbers:
2076
2077         PCRE_ERROR_NULL           the argument code was NULL
2078                                   the argument where was NULL
2079         PCRE_ERROR_BADMAGIC       the "magic number" was not found
2080         PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
2081                                   endianness
2082         PCRE_ERROR_BADOPTION      the value of what was invalid
2083
       The "magic number" is placed at the start of each compiled pattern as a
       simple check against passing an arbitrary memory pointer. The
       endianness error can occur if a compiled pattern is saved and reloaded
       on a different host. Here is a typical call of pcre_fullinfo(), to
       obtain the length of the compiled pattern:
2089
         int rc;
         size_t length;
         rc = pcre_fullinfo(
           re,               /* result of pcre_compile() */
           sd,               /* result of pcre_study(), or NULL */
           PCRE_INFO_SIZE,   /* what is required */
           &length);         /* where to put the data */
2097
2098       The possible values for the third argument are defined in  pcre.h,  and
2099       are as follows:
2100
2101         PCRE_INFO_BACKREFMAX
2102
2103       Return  the  number  of  the highest back reference in the pattern. The
2104       fourth argument should point to an int variable. Zero  is  returned  if
2105       there are no back references.
2106
2107         PCRE_INFO_CAPTURECOUNT
2108
2109       Return  the  number of capturing subpatterns in the pattern. The fourth
2110       argument should point to an int variable.
2111
2112         PCRE_INFO_DEFAULT_TABLES
2113
2114       Return a pointer to the internal default character tables within  PCRE.
2115       The  fourth  argument should point to an unsigned char * variable. This
2116       information call is provided for internal use by the pcre_study() func-
2117       tion.  External  callers  can  cause PCRE to use its internal tables by
2118       passing a NULL table pointer.
2119
2120         PCRE_INFO_FIRSTBYTE
2121
2122       Return information about the first data unit of any matched string, for
2123       a  non-anchored  pattern.  (The name of this option refers to the 8-bit
2124       library, where data units are bytes.) The fourth argument should  point
2125       to an int variable.
2126
2127       If  there  is  a  fixed first value, for example, the letter "c" from a
2128       pattern such as (cat|cow|coyote), its value is returned. In  the  8-bit
2129       library,  the  value is always less than 256; in the 16-bit library the
2130       value can be up to 0xffff.
2131
2132       If there is no fixed first value, and if either
2133
2134       (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
2135       branch starts with "^", or
2136
2137       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2138       set (if it were set, the pattern would be anchored),
2139
2140       -1 is returned, indicating that the pattern matches only at  the  start
2141       of  a  subject string or after any newline within the string. Otherwise
2142       -2 is returned. For anchored patterns, -2 is returned.
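
       The three kinds of result can be told apart with a simple test on the
       returned int. The following is a minimal sketch; the enum first_kind
       and the function classify_firstbyte() are hypothetical names, and the
       value is assumed to come from a successful pcre_fullinfo() call with
       PCRE_INFO_FIRSTBYTE:

```c
#include <assert.h>

/* Hypothetical labels for the documented PCRE_INFO_FIRSTBYTE results. */
typedef enum {
  FIRST_FIXED,        /* value >= 0: a fixed first data unit           */
  FIRST_LINE_START,   /* -1: matches only at start or after a newline  */
  FIRST_NONE,         /* -2: no fixed first value, or pattern anchored */
  FIRST_UNEXPECTED    /* anything else: not a documented result        */
} first_kind;

/* Map the int stored by pcre_fullinfo() to one of the labels above. */
first_kind classify_firstbyte(int value)
{
  if (value >= 0)  return FIRST_FIXED;
  if (value == -1) return FIRST_LINE_START;
  if (value == -2) return FIRST_NONE;
  return FIRST_UNEXPECTED;
}
```

       For the pattern (cat|cow|coyote) mentioned above, the value would be
       the character code of "c", which this sketch labels FIRST_FIXED.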
2143
2144         PCRE_INFO_FIRSTTABLE
2145
2146       If the pattern was studied, and this resulted in the construction of  a
2147       256-bit  table indicating a fixed set of values for the first data unit
2148       in any matching string, a pointer to the table is  returned.  Otherwise
2149       NULL  is returned. The fourth argument should point to an unsigned char
2150       * variable.
2151
2152         PCRE_INFO_HASCRORLF
2153
2154       Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
2155       characters,  otherwise  0.  The  fourth argument should point to an int
2156       variable. An explicit match is either a literal CR or LF character,  or
2157       \r or \n.
2158
2159         PCRE_INFO_JCHANGED
2160
2161       Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
2162       otherwise 0. The fourth argument should point to an int variable.  (?J)
2163       and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2164
2165         PCRE_INFO_JIT
2166
2167       Return  1  if  the pattern was studied with one of the JIT options, and
2168       just-in-time compiling was successful. The fourth argument should point
2169       to  an  int variable. A return value of 0 means that JIT support is not
2170       available in this version of PCRE, or that the pattern was not  studied
2171       with  a JIT option, or that the JIT compiler could not handle this par-
2172       ticular pattern. See the pcrejit documentation for details of what  can
2173       and cannot be handled.
2174
2175         PCRE_INFO_JITSIZE
2176
2177       If  the  pattern was successfully studied with a JIT option, return the
2178       size of the JIT compiled code, otherwise return zero. The fourth  argu-
2179       ment should point to a size_t variable.
2180
2181         PCRE_INFO_LASTLITERAL
2182
2183       Return  the value of the rightmost literal data unit that must exist in
2184       any matched string, other than at its start, if such a value  has  been
2185       recorded. The fourth argument should point to an int variable. If there
2186       is no such value, -1 is returned. For anchored patterns, a last literal
2187       value  is recorded only if it follows something of variable length. For
2188       example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2189       /^a\dz\d/ the returned value is -1.
2190
2191         PCRE_INFO_MAXLOOKBEHIND
2192
2193       Return  the  number of characters (NB not bytes) in the longest lookbe-
2194       hind assertion in the pattern. Note that the simple assertions  \b  and
2195       \B  require a one-character lookbehind. This information is useful when
2196       doing multi-segment matching using the partial matching facilities.
2197
2198         PCRE_INFO_MINLENGTH
2199
2200       If the pattern was studied and a minimum length  for  matching  subject
2201       strings  was  computed,  its  value is returned. Otherwise the returned
2202       value is -1. The value is a number of characters, which in  UTF-8  mode
2203       may  be  different from the number of bytes. The fourth argument should
2204       point to an int variable. A non-negative value is a lower bound to  the
2205       length  of  any  matching  string. There may not be any strings of that
2206       length that do actually match, but every string that does match  is  at
2207       least that long.
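
       Because the bound is counted in characters, a caller that wants to
       pre-screen UTF-8 subjects has to count characters rather than bytes.
       pcre_exec() applies the bound itself for a studied pattern, so the
       following is only a sketch of how an application might filter
       candidates early. It assumes the subject is valid UTF-8 and that
       minlength holds the PCRE_INFO_MINLENGTH result; both function names
       are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a valid UTF-8 string by skipping continuation
   bytes, which always have the bit pattern 10xxxxxx. */
size_t utf8_char_count(const char *s, size_t byte_length)
{
  size_t i, count = 0;
  assert(s != NULL);
  for (i = 0; i < byte_length; i++)
    if (((unsigned char)s[i] & 0xC0) != 0x80) count++;
  return count;
}

/* Return 1 if the subject is long enough for a match to be possible,
   otherwise 0. A minlength of -1 means that no bound is known. */
int may_match(int minlength, const char *subject, size_t byte_length)
{
  if (minlength < 0) return 1;
  return utf8_char_count(subject, byte_length) >= (size_t)minlength;
}
```

       A result of 1 does not mean that the subject matches, only that it is
       not ruled out by its length.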
2208
2209         PCRE_INFO_NAMECOUNT
2210         PCRE_INFO_NAMEENTRYSIZE
2211         PCRE_INFO_NAMETABLE
2212
2213       PCRE  supports the use of named as well as numbered capturing parenthe-
2214       ses. The names are just an additional way of identifying the  parenthe-
2215       ses, which still acquire numbers. Several convenience functions such as
2216       pcre_get_named_substring() are provided for  extracting  captured  sub-
2217       strings  by  name. It is also possible to extract the data directly, by
2218       first converting the name to a number in order to  access  the  correct
2219       pointers in the output vector (described with pcre_exec() below). To do
2220       the conversion, you need  to  use  the  name-to-number  map,  which  is
2221       described by these three values.
2222
2223       The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2224       gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2225       of  each  entry;  both  of  these  return  an int value. The entry size
2226       depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
2227       a pointer to the first entry of the table. This is a pointer to char in
2228       the 8-bit library, where the first two bytes of each entry are the num-
2229       ber  of  the capturing parenthesis, most significant byte first. In the
2230       16-bit library, the pointer points to 16-bit data units, the  first  of
2231       which  contains  the  parenthesis  number. The rest of the entry is the
2232       corresponding name, zero terminated.
2233
2234       The names are in alphabetical order. Duplicate names may appear if  (?|
2235       is used to create multiple groups with the same number, as described in
2236       the section on duplicate subpattern numbers in  the  pcrepattern  page.
2237       Duplicate  names  for  subpatterns with different numbers are permitted
2238       only if PCRE_DUPNAMES is set. In all cases  of  duplicate  names,  they
2239       appear  in  the table in the order in which they were found in the pat-
2240       tern. In the absence of (?| this is the  order  of  increasing  number;
2241       when (?| is used this is not necessarily the case because later subpat-
2242       terns may have lower numbers.
2243
2244       As a simple example of the name/number table,  consider  the  following
2245       pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
2246       set, so white space - including newlines - is ignored):
2247
2248         (?<date> (?<year>(\d\d)?\d\d) -
2249         (?<month>\d\d) - (?<day>\d\d) )
2250
       There are four named subpatterns, so the table has four entries, and
       each entry in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
       as ??:
2255
2256         00 01 d  a  t  e  00 ??
2257         00 05 d  a  y  00 ?? ??
2258         00 04 m  o  n  t  h  00
2259         00 02 y  e  a  r  00 ??
2260
2261       When  writing  code  to  extract  data from named subpatterns using the
2262       name-to-number map, remember that the length of the entries  is  likely
2263       to be different for each compiled pattern.
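
       A direct lookup in the 8-bit name table can be sketched as follows.
       The function name_to_number() and the array example_table are
       hypothetical; the three parameters are assumed to hold the values
       obtained with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and
       PCRE_INFO_NAMEENTRYSIZE. The array reproduces the example table above,
       with "?" standing in for the undefined bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up a subpattern name in an 8-bit name table.
   Returns the subpattern number, or -1 if the name is not present. */
int name_to_number(const unsigned char *table, int count, int entry_size,
                   const char *name)
{
  int i;
  assert(table != NULL && name != NULL && entry_size > 2);
  for (i = 0; i < count; i++) {
    /* Entries are fixed-size, so entry i starts at i * entry_size. */
    const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
    /* Bytes 0 and 1: the number, most significant byte first. */
    int number = (entry[0] << 8) | entry[1];
    /* Bytes 2 onwards: the zero-terminated name. */
    if (strcmp((const char *)(entry + 2), name) == 0) return number;
  }
  return -1;
}

/* The four eight-byte entries of the example table. */
const unsigned char example_table[4 * 8] = {
  0, 1, 'd', 'a', 't', 'e', 0,   '?',
  0, 5, 'd', 'a', 'y', 0,   '?', '?',
  0, 4, 'm', 'o', 'n', 't', 'h', 0,
  0, 2, 'y', 'e', 'a', 'r', 0,   '?'
};
```

       With the example table, name_to_number(example_table, 4, 8, "month")
       yields 4. A linear search is used for brevity; because the names are
       in alphabetical order, a binary search is also possible. When
       duplicate names are present, this sketch returns the number from the
       first matching entry only.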
2264
2265         PCRE_INFO_OKPARTIAL
2266
2267       Return  1  if  the  pattern  can  be  used  for  partial  matching with
2268       pcre_exec(), otherwise 0. The fourth argument should point  to  an  int
2269       variable.  From  release  8.00,  this  always  returns  1,  because the
2270       restrictions that previously applied  to  partial  matching  have  been
2271       lifted.  The  pcrepartial documentation gives details of partial match-
2272       ing.
2273
2274         PCRE_INFO_OPTIONS
2275
2276       Return a copy of the options with which the pattern was  compiled.  The
2277       fourth  argument  should  point to an unsigned long int variable. These
2278       option bits are those specified in the call to pcre_compile(), modified
2279       by any top-level option settings at the start of the pattern itself. In
2280       other words, they are the options that will be in force  when  matching
2281       starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
2282       the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
2283       and PCRE_EXTENDED.
2284
2285       A  pattern  is  automatically  anchored by PCRE if all of its top-level
2286       alternatives begin with one of the following:
2287
2288         ^     unless PCRE_MULTILINE is set
2289         \A    always
2290         \G    always
2291         .*    if PCRE_DOTALL is set and there are no back
2292                 references to the subpattern in which .* appears
2293
2294       For such patterns, the PCRE_ANCHORED bit is set in the options returned
2295       by pcre_fullinfo().
2296
2297         PCRE_INFO_SIZE
2298
2299       Return  the size of the compiled pattern in bytes (for both libraries).
2300       The fourth argument should point to a size_t variable. This value  does
2301       not  include  the  size  of  the  pcre  structure  that  is returned by
2302       pcre_compile(). The value that is passed as the argument  to  pcre_mal-
2303       loc()  when pcre_compile() is getting memory in which to place the com-
2304       piled data is the value returned by this option plus the  size  of  the
2305       pcre  structure. Studying a compiled pattern, with or without JIT, does
2306       not alter the value returned by this option.
2307
2308         PCRE_INFO_STUDYSIZE
2309
       Return the size in bytes of the data block pointed to by the study_data
       field in a pcre_extra block. If the extra argument is NULL, or there is
       no study data, zero is returned. The fourth argument should point to a
       size_t variable. The study_data field is set by pcre_study() to record
       information that will speed up matching (see the section entitled
       "Studying a pattern" above). The format of the study_data block is
       private, but its length is made available via this option so that it
       can be saved and restored (see the pcreprecompile documentation for
       details).
2319
2320
REFERENCE COUNTS
2322
2323       int pcre_refcount(pcre *code, int adjust);
2324
2325       The pcre_refcount() function is used to maintain a reference  count  in
2326       the data block that contains a compiled pattern. It is provided for the
2327       benefit of applications that  operate  in  an  object-oriented  manner,
2328       where different parts of the application may be using the same compiled
2329       pattern, but you want to free the block when they are all done.
2330
2331       When a pattern is compiled, the reference count field is initialized to
2332       zero.   It is changed only by calling this function, whose action is to
2333       add the adjust value (which may be positive or  negative)  to  it.  The
2334       yield of the function is the new value. However, the value of the count
2335       is constrained to lie between 0 and 65535, inclusive. If the new  value
2336       is outside these limits, it is forced to the appropriate limit value.
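
       The clamping rule can be modelled in a few lines of C. This is a model
       of the documented arithmetic only, not library code, and the function
       name refcount_after() is hypothetical:

```c
#include <assert.h>

/* Return the count that results from adding 'adjust' to 'current',
   forced into the documented range 0..65535. The comparisons are
   arranged so that the addition itself can never overflow. */
int refcount_after(int current, int adjust)
{
  assert(current >= 0 && current <= 65535);
  if (adjust > 65535 - current) return 65535;  /* forced to upper limit */
  if (adjust < -current) return 0;             /* forced to lower limit */
  return current + adjust;
}
```

       Under this model an adjust value of zero leaves the count unchanged,
       so it can be used to read the current value. Note also that releasing
       more often than acquiring is silently clamped at zero, so it cannot be
       detected from the returned value.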
2337
2338       Except  when it is zero, the reference count is not correctly preserved
2339       if a pattern is compiled on one host and then  transferred  to  a  host
2340       whose byte-order is different. (This seems a highly unlikely scenario.)
2341
2342
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2344
2345       int pcre_exec(const pcre *code, const pcre_extra *extra,
2346            const char *subject, int length, int startoffset,
2347            int options, int *ovector, int ovecsize);
2348
2349       The  function pcre_exec() is called to match a subject string against a
2350       compiled pattern, which is passed in the code argument. If the  pattern
2351       was  studied,  the  result  of  the study should be passed in the extra
2352       argument. You can call pcre_exec() with the same code and  extra  argu-
2353       ments  as  many  times as you like, in order to match different subject
2354       strings with the same pattern.
2355
2356       This function is the main matching facility  of  the  library,  and  it
2357       operates  in  a  Perl-like  manner. For specialist use there is also an
2358       alternative matching function, which is described below in the  section
2359       about the pcre_dfa_exec() function.
2360
2361       In  most applications, the pattern will have been compiled (and option-
2362       ally studied) in the same process that calls pcre_exec().  However,  it
2363       is possible to save compiled patterns and study data, and then use them
2364       later in different processes, possibly even on different hosts.  For  a
2365       discussion about this, see the pcreprecompile documentation.
2366
2367       Here is an example of a simple call to pcre_exec():
2368
         int rc;
         int ovector[30];
         rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector of integers for substring information */
           30);            /* number of elements (NOT size in bytes) */
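
       After a successful call, the output vector holds pairs of offsets into
       the subject: element 2n is the start of substring n and element 2n+1
       is the offset just past its end, with -1 for a group that took no part
       in the match. A positive return value is one more than the highest
       numbered pair that was set. The following sketch shows how such a pair
       might be turned into a string; the function copy_capture() is a
       hypothetical helper, and the library's own pcre_copy_substring()
       offers a similar convenience:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy captured substring n into buf as a zero-terminated string.
   rc is the value that was returned by pcre_exec(). Returns the length
   of the substring, or -1 if substring n is unavailable or buf is too
   small to hold it. */
int copy_capture(const char *subject, const int *ovector, int rc, int n,
                 char *buf, size_t bufsize)
{
  int start, end, len;
  assert(subject != NULL && ovector != NULL && buf != NULL);
  if (n < 0 || n >= rc) return -1;          /* no pair was set for n */
  start = ovector[2 * n];
  end = ovector[2 * n + 1];
  if (start < 0 || end < start) return -1;  /* group took no part */
  len = end - start;
  if ((size_t)len + 1 > bufsize) return -1; /* no room for the text */
  memcpy(buf, subject + start, (size_t)len);
  buf[len] = '\0';
  return len;
}
```

       The sketch treats a return value of zero or less from pcre_exec() as
       "nothing available", which is conservative for the zero case.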
2380
2381   Extra data for pcre_exec()
2382
2383       If  the  extra argument is not NULL, it must point to a pcre_extra data
2384       block. The pcre_study() function returns such a block (when it  doesn't
2385       return  NULL), but you can also create one for yourself, and pass addi-
2386       tional information in it. The pcre_extra block contains  the  following
2387       fields (not necessarily in this order):
2388
2389         unsigned long int flags;
2390         void *study_data;
2391         void *executable_jit;
2392         unsigned long int match_limit;
2393         unsigned long int match_limit_recursion;
2394         void *callout_data;
2395         const unsigned char *tables;
2396         unsigned char **mark;
2397
2398       In  the  16-bit  version  of  this  structure,  the mark field has type
2399       "PCRE_UCHAR16 **".
2400
2401       The flags field is used to specify which of the other fields  are  set.
2402       The flag bits are:
2403
2404         PCRE_EXTRA_CALLOUT_DATA
2405         PCRE_EXTRA_EXECUTABLE_JIT
2406         PCRE_EXTRA_MARK
2407         PCRE_EXTRA_MATCH_LIMIT
2408         PCRE_EXTRA_MATCH_LIMIT_RECURSION
2409         PCRE_EXTRA_STUDY_DATA
2410         PCRE_EXTRA_TABLES
2411
2412       Other  flag  bits should be set to zero. The study_data field and some-
2413       times the executable_jit field are set in the pcre_extra block that  is
2414       returned  by pcre_study(), together with the appropriate flag bits. You
2415       should not set these yourself, but you may add to the block by  setting
2416       other fields and their corresponding flag bits.
2417
2418       The match_limit field provides a means of preventing PCRE from using up
2419       a vast amount of resources when running patterns that are not going  to
2420       match,  but  which  have  a very large number of possibilities in their
2421       search trees. The classic example is a pattern that uses nested  unlim-
2422       ited repeats.
2423
2424       Internally,  pcre_exec() uses a function called match(), which it calls
2425       repeatedly (sometimes recursively). The limit  set  by  match_limit  is
2426       imposed  on the number of times this function is called during a match,
2427       which has the effect of limiting the amount of  backtracking  that  can
2428       take place. For patterns that are not anchored, the count restarts from
2429       zero for each position in the subject string.
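
       The effect of limiting the number of calls can be seen in a toy
       backtracking matcher. The sketch below is not PCRE's match() function,
       which is far more elaborate and counts differently; toy_match() and
       the TOY_ constants are hypothetical. It hard-wires the pattern (a+)+b,
       a nested unlimited repeat, and gives up once the call count passes the
       limit:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MATCH     1
#define TOY_NOMATCH   0
#define TOY_LIMIT   (-1)

/* Try to match (a+)+b starting at offset pos of the zero-terminated
   string s. *calls counts every invocation, including recursive ones;
   once it exceeds limit, TOY_LIMIT is returned all the way up. */
int toy_match(const char *s, size_t pos, unsigned long *calls,
              unsigned long limit)
{
  size_t end;
  assert(s != NULL && calls != NULL);
  if (++*calls > limit) return TOY_LIMIT;
  /* s[pos..end] is one candidate for the next a+ group. */
  for (end = pos; s[end] == 'a'; end++) {
    int rc;
    if (s[end + 1] == 'b') return TOY_MATCH;
    /* Otherwise try to start another a+ group after this one. */
    rc = toy_match(s, end + 1, calls, limit);
    if (rc != TOY_NOMATCH) return rc;
  }
  return TOY_NOMATCH;
}
```

       With this toy, the subject "aaaac" fails after 16 calls, and each
       extra "a" doubles the count, so a modest limit turns a runaway search
       into a prompt error return.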
2430
2431       When pcre_exec() is called with a pattern that was successfully studied
2432       with  a  JIT  option, the way that the matching is executed is entirely
2433       different.  However, there is still the possibility of runaway matching
2434       that goes on for a very long time, and so the match_limit value is also
2435       used in this case (but in a different way) to limit how long the match-
2436       ing can continue.
2437
       The default value for the limit can be set when PCRE is built; if it
       is not set then, the default is 10 million, which handles all but the
       most extreme cases. You can override the default by supplying
       pcre_exec() with a pcre_extra block in which match_limit is set, and
       PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
       exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
2444
2445       The match_limit_recursion field is similar to match_limit, but  instead
2446       of limiting the total number of times that match() is called, it limits
2447       the depth of recursion. The recursion depth is a  smaller  number  than
2448       the  total number of calls, because not all calls to match() are recur-
2449       sive.  This limit is of use only if it is set smaller than match_limit.
2450
2451       Limiting the recursion depth limits the amount of  machine  stack  that
2452       can  be used, or, when PCRE has been compiled to use memory on the heap
2453       instead of the stack, the amount of heap memory that can be used.  This
2454       limit  is not relevant, and is ignored, when matching is done using JIT
2455       compiled code.
2456
       The default value for match_limit_recursion can be set when PCRE is
       built; if it is not set then, the default is the same value as the
       default for match_limit. You can override the default by supplying
       pcre_exec() with a pcre_extra block in which match_limit_recursion is
       set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field.
       If the limit is exceeded, pcre_exec() returns
       PCRE_ERROR_RECURSIONLIMIT.
2463
2464       The  callout_data  field is used in conjunction with the "callout" fea-
2465       ture, and is described in the pcrecallout documentation.
2466
2467       The tables field  is  used  to  pass  a  character  tables  pointer  to
2468       pcre_exec();  this overrides the value that is stored with the compiled
2469       pattern. A non-NULL value is stored with the compiled pattern  only  if
2470       custom  tables  were  supplied to pcre_compile() via its tableptr argu-
2471       ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
2472       PCRE's  internal  tables  to be used. This facility is helpful when re-
2473       using patterns that have been saved after compiling  with  an  external
2474       set  of  tables,  because  the  external tables might be at a different
2475       address when pcre_exec() is called. See the  pcreprecompile  documenta-
2476       tion for a discussion of saving compiled patterns for later use.
2477
2478       If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be
2479       set to point to a suitable variable. If the pattern contains any  back-
2480       tracking  control verbs such as (*MARK:NAME), and the execution ends up
2481       with a name to pass back, a pointer to the  name  string  (zero  termi-
2482       nated)  is  placed  in  the  variable pointed to by the mark field. The
2483       names are within the compiled pattern; if you wish  to  retain  such  a
2484       name  you must copy it before freeing the memory of a compiled pattern.
2485       If there is no name to pass back, the variable pointed to by  the  mark
2486       field  is  set  to NULL. For details of the backtracking control verbs,
2487       see the section entitled "Backtracking control" in the pcrepattern doc-
2488       umentation.
2489
2490   Option bits for pcre_exec()
2491
2492       The  unused  bits of the options argument for pcre_exec() must be zero.
2493       The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
2494       PCRE_NOTBOL,    PCRE_NOTEOL,    PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,
2495       PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_HARD,   and
2496       PCRE_PARTIAL_SOFT.
2497
2498       If  the  pattern  was successfully studied with one of the just-in-time
2499       (JIT) compile options, the only supported options for JIT execution are
2500       PCRE_NO_UTF8_CHECK,     PCRE_NOTBOL,     PCRE_NOTEOL,    PCRE_NOTEMPTY,
2501       PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If  an
2502       unsupported  option  is  used, JIT execution is disabled and the normal
2503       interpretive code in pcre_exec() is run.
2504
2505         PCRE_ANCHORED
2506
2507       The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
2508       matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
2509       turned out to be anchored by virtue of its contents, it cannot be  made
2510       unanchored at matching time.
2511
2512         PCRE_BSR_ANYCRLF
2513         PCRE_BSR_UNICODE
2514
2515       These options (which are mutually exclusive) control what the \R escape
2516       sequence matches. The choice is either to match only CR, LF,  or  CRLF,
2517       or  to  match  any Unicode newline sequence. These options override the
2518       choice that was made or defaulted when the pattern was compiled.
2519
2520         PCRE_NEWLINE_CR
2521         PCRE_NEWLINE_LF
2522         PCRE_NEWLINE_CRLF
2523         PCRE_NEWLINE_ANYCRLF
2524         PCRE_NEWLINE_ANY
2525
2526       These options override  the  newline  definition  that  was  chosen  or
2527       defaulted  when the pattern was compiled. For details, see the descrip-
2528       tion of pcre_compile()  above.  During  matching,  the  newline  choice
2529       affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
2530       ters. It may also alter the way the match position is advanced after  a
2531       match failure for an unanchored pattern.
2532
2533       When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
2534       set, and a match attempt for an unanchored pattern fails when the  cur-
2535       rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
2536       explicit matches for  CR  or  LF  characters,  the  match  position  is
2537       advanced by two characters instead of one, in other words, to after the
2538       CRLF.
2539
2540       The above rule is a compromise that makes the most common cases work as
2541       expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
2542       option is not set), it does not match the string "\r\nA" because, after
2543       failing  at the start, it skips both the CR and the LF before retrying.
2544       However, the pattern [\r\n]A does match that string,  because  it  con-
2545       tains an explicit CR or LF reference, and so advances only by one char-
2546       acter after the first failure.
2547
2548       An explicit match for CR or LF is either a literal appearance of one of
2549       those  characters,  or  one  of the \r or \n escape sequences. Implicit
2550       matches such as [^X] do not count, nor does \s (which includes  CR  and
2551       LF in the characters that it matches).
2552
2553       Notwithstanding  the above, anomalous effects may still occur when CRLF
2554       is a valid newline sequence and explicit \r or \n escapes appear in the
2555       pattern.
2556
2557         PCRE_NOTBOL
2558
2559       This option specifies that the first character of the subject string is
2560       not the beginning of a line, so the circumflex metacharacter should not
2561       match  before it. Setting this without PCRE_MULTILINE (at compile time)
2562       causes circumflex never to match. This option affects only  the  behav-
2563       iour of the circumflex metacharacter. It does not affect \A.
2564
2565         PCRE_NOTEOL
2566
2567       This option specifies that the end of the subject string is not the end
2568       of a line, so the dollar metacharacter should not match it nor  (except
2569       in  multiline mode) a newline immediately before it. Setting this with-
2570       out PCRE_MULTILINE (at compile time) causes dollar never to match. This
2571       option  affects only the behaviour of the dollar metacharacter. It does
2572       not affect \Z or \z.
2573
2574         PCRE_NOTEMPTY
2575
2576       An empty string is not considered to be a valid match if this option is
2577       set.  If  there are alternatives in the pattern, they are tried. If all
2578       the alternatives match the empty string, the entire  match  fails.  For
2579       example, if the pattern
2580
2581         a?b?
2582
2583       is  applied  to  a  string not beginning with "a" or "b", it matches an
2584       empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
2585       match is not valid, so PCRE searches further into the string for occur-
2586       rences of "a" or "b".
2587
2588         PCRE_NOTEMPTY_ATSTART
2589
2590       This is like PCRE_NOTEMPTY, except that an empty string match  that  is
2591       not  at  the  start  of  the  subject  is  permitted. If the pattern is
2592       anchored, such a match can occur only if the pattern contains \K.
2593
2594       Perl    has    no    direct    equivalent    of    PCRE_NOTEMPTY     or
2595       PCRE_NOTEMPTY_ATSTART,  but  it  does  make a special case of a pattern
2596       match of the empty string within its split() function, and  when  using
2597       the  /g  modifier.  It  is  possible  to emulate Perl's behaviour after
2598       matching a null string by first trying the match again at the same off-
2599       set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
2600       fails, by advancing the starting offset (see below) and trying an ordi-
2601       nary  match  again. There is some code that demonstrates how to do this
2602       in the pcredemo sample program. In the most general case, you  have  to
2603       check  to  see  if the newline convention recognizes CRLF as a newline,
2604       and if so, and the current character is CR followed by LF, advance  the
2605       starting offset by two characters instead of one.
2606
2607         PCRE_NO_START_OPTIMIZE
2608
2609       There  are a number of optimizations that pcre_exec() uses at the start
2610       of a match, in order to speed up the process. For  example,  if  it  is
2611       known that an unanchored match must start with a specific character, it
2612       searches the subject for that character, and fails  immediately  if  it
2613       cannot  find  it,  without actually running the main matching function.
2614       This means that a special item such as (*COMMIT) at the start of a pat-
2615       tern  is  not  considered until after a suitable starting point for the
2616       match has been found. When callouts or (*MARK) items are in use,  these
2617       "start-up" optimizations can cause them to be skipped if the pattern is
2618       never actually used. The start-up optimizations are in  effect  a  pre-
2619       scan of the subject that takes place before the pattern is run.
2620
2621       The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
2622       possibly causing performance to suffer,  but  ensuring  that  in  cases
2623       where  the  result is "no match", the callouts do occur, and that items
2624       such as (*COMMIT) and (*MARK) are considered at every possible starting
2625       position  in  the  subject  string. If PCRE_NO_START_OPTIMIZE is set at
2626       compile time,  it  cannot  be  unset  at  matching  time.  The  use  of
2627       PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set, matching
2628       is always done interpretively.
2629
2630       Setting PCRE_NO_START_OPTIMIZE can change the  outcome  of  a  matching
2631       operation.  Consider the pattern
2632
2633         (*COMMIT)ABC
2634
2635       When  this  is  compiled, PCRE records the fact that a match must start
2636       with the character "A". Suppose the subject  string  is  "DEFABC".  The
2637       start-up  optimization  scans along the subject, finds "A" and runs the
2638       first match attempt from there. The (*COMMIT) item means that the  pat-
2639       tern  must  match the current starting position, which in this case, it
2640       does. However, if the same match  is  run  with  PCRE_NO_START_OPTIMIZE
2641       set,  the  initial  scan  along the subject string does not happen. The
2642       first match attempt is run starting  from  "D"  and  when  this  fails,
2643       (*COMMIT)  prevents  any  further  matches  being tried, so the overall
2644       result is "no match". If the pattern is studied,  more  start-up  opti-
2645       mizations  may  be  used. For example, a minimum length for the subject
2646       may be recorded. Consider the pattern
2647
2648         (*MARK:A)(X|Y)
2649
2650       The minimum length for a match is one  character.  If  the  subject  is
2651       "ABC",  there  will  be  attempts  to  match "ABC", "BC", "C", and then
2652       finally an empty string.  If the pattern is studied, the final  attempt
2653       does  not take place, because PCRE knows that the subject is too short,
2654       and so the (*MARK) is never encountered.  In this  case,  studying  the
2655       pattern  does  not  affect the overall match result, which is still "no
2656       match", but it does affect the auxiliary information that is returned.
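
       The effect described above for (*COMMIT)ABC can be imitated  with  a
       toy  model  in  plain  C.  This  is  only  a  sketch  of the documented
       behaviour, not PCRE code: the function name toy_commit_abc() and its
       start_optimize flag are invented for the illustration.

```c
#include <string.h>

/* Toy model of matching the pattern (*COMMIT)ABC. This is NOT PCRE code;
   it only imitates the behaviour described in the text. Returns the byte
   offset of the match, or -1 for "no match". */
static int toy_commit_abc(const char *subject, int start_optimize)
{
    size_t len = strlen(subject);
    size_t start = 0;

    if (start_optimize) {
        /* Start-up optimization: pre-scan the subject for the required
           first character before the pattern itself is run. */
        const char *a = memchr(subject, 'A', len);
        if (a == NULL)
            return -1;
        start = (size_t)(a - subject);
    }

    /* The pattern proper runs once from 'start'. (*COMMIT) is passed
       immediately, so a failure here is never retried further along. */
    if (len - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
        return (int)start;
    return -1;
}
```

       With the subject "DEFABC" the model reports a match at offset 3 when
       the pre-scan is enabled, and "no match" when it is disabled.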
2657
2658         PCRE_NO_UTF8_CHECK
2659
2660       When PCRE_UTF8 is set at compile time, the validity of the subject as a
2661       UTF-8  string is automatically checked when pcre_exec() is subsequently
2662       called.  The entire string is checked before any other processing takes
2663       place.  The  value  of  startoffset  is  also checked to ensure that it
2664       points to the start of a UTF-8 character. There is a  discussion  about
2665       the  validity  of  UTF-8 strings in the pcreunicode page. If an invalid
2666       sequence  of  bytes   is   found,   pcre_exec()   returns   the   error
2667       PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
2668       truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
2669       both  cases, information about the precise nature of the error may also
2670       be returned (see the descriptions of these errors in the section  enti-
2671       tled  Error return values from pcre_exec() below).  If startoffset con-
2672       tains a value that does not point to the start of a UTF-8 character (or
2673       to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
2674
2675       If  you  already  know that your subject is valid, and you want to skip
2676       these   checks   for   performance   reasons,   you   can    set    the
2677       PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
2678       do this for the second and subsequent calls to pcre_exec() if  you  are
2679       making  repeated  calls  to  find  all  the matches in a single subject
2680       string. However, you should be  sure  that  the  value  of  startoffset
2681       points  to  the  start of a character (or the end of the subject). When
2682       PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
2683       subject  or  an invalid value of startoffset is undefined. Your program
2684       may crash.
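
       Before setting PCRE_NO_UTF8_CHECK, a caller may want to confirm that
       startoffset is acceptable. The helper below is a minimal sketch; its
       name, offset_is_char_start(), is hypothetical and it is not  part  of
       PCRE.  It checks only the offset, and assumes that the subject itself
       is already known to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE). Returns 1 if startoffset lies
   within the subject and points at the start of a UTF-8 character or at
   the end of the subject; returns 0 otherwise. It does NOT validate the
   subject string itself. */
static int offset_is_char_start(const char *subject, int length,
                                int startoffset)
{
    if (subject == NULL || startoffset < 0 || startoffset > length)
        return 0;
    if (startoffset == length)
        return 1;                      /* the end of the subject is allowed */
    /* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
    return (((unsigned char)subject[startoffset]) & 0xc0) != 0x80;
}
```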
2685
2686         PCRE_PARTIAL_HARD
2687         PCRE_PARTIAL_SOFT
2688
2689       These options turn on the partial matching feature. For backwards  com-
2690       patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2691       match occurs if the end of the subject string is reached  successfully,
2692       but  there  are not enough subject characters to complete the match. If
2693       this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2694       matching  continues  by  testing any remaining alternatives. Only if no
2695       complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of
2696       PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the
2697       caller is prepared to handle a partial match, but only if  no  complete
2698       match can be found.
2699
2700       If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this
2701       case, if a partial match  is  found,  pcre_exec()  immediately  returns
2702       PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In
2703       other words, when PCRE_PARTIAL_HARD is set, a partial match is  consid-
2704       ered to be more important than an alternative complete match.
2705
2706       In  both  cases,  the portion of the string that was inspected when the
2707       partial match was found is set as the first matching string. There is a
2708       more  detailed  discussion  of partial and multi-segment matching, with
2709       examples, in the pcrepartial documentation.
2710
2711   The string to be matched by pcre_exec()
2712
2713       The subject string is passed to pcre_exec() as a pointer in subject,  a
2714       length  in  bytes in length, and a starting byte offset in startoffset.
2715       If this is  negative  or  greater  than  the  length  of  the  subject,
2716       pcre_exec()  returns  PCRE_ERROR_BADOFFSET. When the starting offset is
2717       zero, the search for a match starts at the beginning  of  the  subject,
2718       and this is by far the most common case. In UTF-8 mode, the byte offset
2719       must point to the start of a UTF-8 character (or the end  of  the  sub-
2720       ject).  Unlike  the pattern string, the subject may contain binary zero
2721       bytes.
2722
2723       A non-zero starting offset is useful when searching for  another  match
2724       in  the same subject by calling pcre_exec() again after a previous suc-
2725       cess.  Setting startoffset differs from just passing over  a  shortened
2726       string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
2727       with any kind of lookbehind. For example, consider the pattern
2728
2729         \Biss\B
2730
2731       which finds occurrences of "iss" in the middle of  words.  (\B  matches
2732       only  if  the  current position in the subject is not a word boundary.)
2733       When applied to the string "Mississippi" the first call to pcre_exec()
2734       finds  the  first  occurrence. If pcre_exec() is called again with just
2735       the remainder of the subject, namely "issippi", it  does  not  match,
2736       because \B is always false at the start of the subject, which is deemed
2737       to be a word boundary. However, if pcre_exec()  is  passed  the  entire
2738       string again, but with startoffset set to 4, it finds the second occur-
2739       rence of "iss" because it is able to look behind the starting point  to
2740       discover that it is preceded by a letter.
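
       The reason that the two calls differ can be seen in a small model of
       the \B test. The code below is a sketch for  ASCII  subjects  only;  the
       function  names  are  invented,  and  PCRE's  real test depends on its
       character tables. The point is that the character before the starting
       position is visible only when the whole subject is passed.

```c
#include <ctype.h>

/* Returns 1 if the byte at 'pos' is an ASCII word character. Positions
   outside the subject count as non-word, as at the ends of a subject. */
static int is_word_char(const char *subject, int length, int pos)
{
    unsigned char c;
    if (pos < 0 || pos >= length)
        return 0;
    c = (unsigned char)subject[pos];
    return isalnum(c) || c == '_';
}

/* Toy model of \B (NOT PCRE code): true when 'pos' is not a word boundary,
   that is, when the characters on both sides are of the same kind. */
static int toy_not_word_boundary(const char *subject, int length, int pos)
{
    int before = is_word_char(subject, length, pos - 1);
    int after  = is_word_char(subject, length, pos);
    return before == after;
}
```

       For the whole string with pos set to 4, the preceding "s" is seen and
       the test succeeds; for the shortened string with pos set to  0,  there
       is nothing to look behind to, and the test fails.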
2741
2742       Finding  all  the  matches  in a subject is tricky when the pattern can
2743       match an empty string. It is possible to emulate Perl's /g behaviour by
2744       first   trying   the   match   again  at  the  same  offset,  with  the
2745       PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that
2746       fails,  advancing  the  starting  offset  and  trying an ordinary match
2747       again. There is some code that demonstrates how to do this in the pcre-
2748       demo sample program. In the most general case, you have to check to see
2749       if the newline convention recognizes CRLF as a newline, and if so,  and
2750       the current character is CR followed by LF, advance the starting offset
2751       by two characters instead of one.
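
       The offset adjustment described above might be written as follows.
       This is a sketch modelled on the approach taken in pcredemo,  not  code
       taken  from  it; the function next_start_offset() and its crlf_is_new-
       line and utf8 parameters are assumptions. The caller is  expected  to
       have checked that offset is less than length.

```c
/* Hypothetical helper (not part of PCRE). Returns the starting offset to
   use for an ordinary match after the anchored, non-empty retry at
   'offset' has failed. Requires offset < length. */
static int next_start_offset(const char *subject, int length, int offset,
                             int crlf_is_newline, int utf8)
{
    int next = offset + 1;                     /* advance by one byte */

    if (crlf_is_newline && offset < length - 1 &&
        subject[offset] == '\r' && subject[offset + 1] == '\n') {
        next += 1;                             /* step over the LF as well */
    } else if (utf8) {
        /* Do not stop in the middle of a multi-byte character: skip any
           continuation bytes (bit pattern 10xxxxxx). */
        while (next < length &&
               (((unsigned char)subject[next]) & 0xc0) == 0x80)
            next += 1;
    }
    return next;
}
```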
2752
2753       If a non-zero starting offset is passed when the pattern  is  anchored,
2754       one attempt to match at the given offset is made. This can only succeed
2755       if the pattern does not require the match to be at  the  start  of  the
2756       subject.
2757
2758   How pcre_exec() returns captured substrings
2759
2760       In  general, a pattern matches a certain portion of the subject, and in
2761       addition, further substrings from the subject  may  be  picked  out  by
2762       parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
2763       this is called "capturing" in what follows, and the  phrase  "capturing
2764       subpattern"  is  used for a fragment of a pattern that picks out a sub-
2765       string. PCRE supports several other kinds of  parenthesized  subpattern
2766       that do not cause substrings to be captured.
2767
2768       Captured substrings are returned to the caller via a vector of integers
2769       whose address is passed in ovector. The number of elements in the  vec-
2770       tor  is  passed in ovecsize, which must be a non-negative number. Note:
2771       this argument is NOT the size of ovector in bytes.
2772
2773       The first two-thirds of the vector is used to pass back  captured  sub-
2774       strings,  each  substring using a pair of integers. The remaining third
2775       of the vector is used as workspace by pcre_exec() while  matching  cap-
2776       turing  subpatterns, and is not available for passing back information.
2777       The number passed in ovecsize should always be a multiple of three.  If
2778       it is not, it is rounded down.
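
       Because ovecsize counts elements rather than bytes, it is convenient
       to derive it from the array itself. The macro and function  below  are
       a  sketch;  the names OVEC_ELEMENTS and usable_pairs() are hypothetical
       and do not appear in pcre.h.

```c
#include <stddef.h>

/* Number of elements in a true array (this does not work on a pointer).
   This, not sizeof(v), is the value to pass as ovecsize. */
#define OVEC_ELEMENTS(v) ((int)(sizeof(v) / sizeof((v)[0])))

/* Number of offset pairs that can be passed back for a given ovecsize:
   the size is rounded down to a multiple of three, and two-thirds of
   that is used for pairs of integers. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

       For a vector declared as int ovector[30], OVEC_ELEMENTS(ovector) is 30
       and up to 10 pairs can be returned; a size of 31 gives the same result.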
2779
2780       When  a  match  is successful, information about captured substrings is
2781       returned in pairs of integers, starting at the  beginning  of  ovector,
2782       and  continuing  up  to two-thirds of its length at the most. The first
2783       element of each pair is set to the byte offset of the  first  character
2784       in  a  substring, and the second is set to the byte offset of the first
2785       character after the end of a substring. Note: these values  are  always
2786       byte offsets, even in UTF-8 mode. They are not character counts.
2787
2788       The  first  pair  of  integers, ovector[0] and ovector[1], identify the
2789       portion of the subject string matched by the entire pattern.  The  next
2790       pair  is  used for the first capturing subpattern, and so on. The value
2791       returned by pcre_exec() is one more than the highest numbered pair that
2792       has  been  set.  For example, if two substrings have been captured, the
2793       returned value is 3. If there are no capturing subpatterns, the  return
2794       value from a successful match is 1, indicating that just the first pair
2795       of offsets has been set.
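
       The pairs can be turned into strings by hand, as sketched below.  The
       function  copy_capture()  is  hypothetical  and its return convention
       (-1 for any failure) is an assumption  made  for  this  example;  the
       library's  own  convenience  functions,  described  later,  should  be
       preferred in real code.

```c
#include <string.h>

/* Hypothetical helper (not part of PCRE). Copies captured substring n
   into buffer as a zero-terminated string, where rc is a positive value
   returned by pcre_exec(). Returns the length copied, or -1 if pair n
   was not set or the buffer is too small. */
static int copy_capture(const char *subject, const int *ovector, int rc,
                        int n, char *buffer, int buffersize)
{
    int start, end, len;

    if (n < 0 || n >= rc)
        return -1;                 /* pair n is beyond the highest pair set */
    start = ovector[2 * n];        /* byte offset of the first character */
    end   = ovector[2 * n + 1];    /* byte offset just past the last one */
    if (start < 0 || end < start)
        return -1;                 /* the subpattern did not take part */
    len = end - start;
    if (len + 1 > buffersize)
        return -1;
    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```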
2796
2797       If a capturing subpattern is matched repeatedly, it is the last portion
2798       of the string that it matched that is returned.
2799
2800       If  the vector is too small to hold all the captured substring offsets,
2801       it is used as far as possible (up to two-thirds of its length), and the
2802       function  returns a value of zero. If neither the actual string matched
2803       nor any captured substrings are of interest, pcre_exec() may be  called
2804       with  ovector passed as NULL and ovecsize as zero. However, if the pat-
2805       tern contains back references and the ovector  is  not  big  enough  to
2806       remember  the related substrings, PCRE has to get additional memory for
2807       use during matching. Thus it is usually advisable to supply an  ovector
2808       of reasonable size.
2809
2810       There  are  some  cases where zero is returned (indicating vector over-
2811       flow) when in fact the vector is exactly the right size for  the  final
2812       match. For example, consider the pattern
2813
2814         (a)(?:(b)c|bd)
2815
2816       If  a  vector of 6 elements (allowing for only 1 captured substring) is
2817       given with subject string "abd", pcre_exec() will try to set the second
2818       captured string, thereby recording a vector overflow, before failing to
2819       match "c" and backing up  to  try  the  second  alternative.  The  zero
2820       return,  however,  does  correctly  indicate that the maximum number of
2821       slots (namely 2) has been filled. In similar cases where there is
2822       temporary overflow, but the final number of used slots is actually less
2823       than the maximum, a non-zero value is returned.
2824
2825       The pcre_fullinfo() function can be used to find out how many capturing
2826       subpatterns  there  are  in  a  compiled pattern. The smallest size for
2827       ovector that will allow for n captured substrings, in addition  to  the
2828       offsets of the substring matched by the whole pattern, is (n+1)*3.
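
       The sizing rule can be applied when allocating the vector, as in  the
       sketch  below.  The  function names are hypothetical; capture_count is
       assumed to be the number of capturing subpatterns, which would normally
       be obtained from pcre_fullinfo().

```c
#include <stdlib.h>

/* Smallest ovecsize that allows for capture_count captured substrings in
   addition to the offsets of the whole match: (n+1)*3. */
static int min_ovecsize(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Hypothetical helper (not part of PCRE). Allocates a vector of that
   size and stores the element count in *ovecsize. The caller frees the
   vector with free(). Returns NULL if the allocation fails. */
static int *alloc_ovector(int capture_count, int *ovecsize)
{
    *ovecsize = min_ovecsize(capture_count);
    return (int *)malloc((size_t)*ovecsize * sizeof(int));
}
```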
2829
2830       It  is  possible for capturing subpattern number n+1 to match some part
2831       of the subject when subpattern n has not been used at all. For example,
2832       if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
2833       return from the function is 4, and subpatterns 1 and 3 are matched, but
2834       2  is  not.  When  this happens, both values in the offset pairs corre-
2835       sponding to unused subpatterns are set to -1.
2836
2837       Offset values that correspond to unused subpatterns at the end  of  the
2838       expression  are  also  set  to  -1. For example, if the string "abc" is
2839       matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
2840       matched.  The  return  from the function is 2, because the highest used
2841       capturing subpattern number is 1, and the offsets for the second
2842       and  third  capturing subpatterns (assuming the vector is large enough,
2843       of course) are set to -1.
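
       A caller can test for an unused subpattern as sketched below. The name
       capture_is_set()  is  hypothetical.  The  sketch  assumes  that rc is a
       positive return value from pcre_exec(); a return of  zero  (vector  too
       small) needs separate handling.

```c
/* Hypothetical helper (not part of PCRE). Returns 1 if capturing
   subpattern n took part in the match, 0 otherwise. */
static int capture_is_set(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return 0;           /* beyond the highest numbered pair that was set */
    /* Unused subpatterns below that point have both offsets set to -1. */
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

       For "abc" matched against (a|(z))(bc), where the return value  is  4,
       this reports subpatterns 1 and 3 as set and 2 as unset.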
2844
2845       Note: Elements in the first two-thirds of ovector that  do  not  corre-
2846       spond  to  capturing parentheses in the pattern are never changed. That
2847       is, if a pattern contains n capturing parentheses, no more  than  ovec-
2848       tor[0]  to ovector[2n+1] are set by pcre_exec(). The other elements (in
2849       the first two-thirds) retain whatever values they previously had.
2850
2851       Some convenience functions are provided  for  extracting  the  captured
2852       substrings as separate strings. These are described below.
2853
2854   Error return values from pcre_exec()
2855
2856       If  pcre_exec()  fails, it returns a negative number. The following are
2857       defined in the header file:
2858
2859         PCRE_ERROR_NOMATCH        (-1)
2860
2861       The subject string did not match the pattern.
2862
2863         PCRE_ERROR_NULL           (-2)
2864
2865       Either code or subject was passed as NULL,  or  ovector  was  NULL  and
2866       ovecsize was not zero.
2867
2868         PCRE_ERROR_BADOPTION      (-3)
2869
2870       An unrecognized bit was set in the options argument.
2871
2872         PCRE_ERROR_BADMAGIC       (-4)
2873
2874       PCRE  stores a 4-byte "magic number" at the start of the compiled code,
2875       to catch the case when it is passed a junk pointer and to detect when a
2876       pattern that was compiled in an environment of one endianness is run in
2877       an environment with the other endianness. This is the error  that  PCRE
2878       gives when the magic number is not present.
2879
2880         PCRE_ERROR_UNKNOWN_OPCODE (-5)
2881
2882       While running the pattern match, an unknown item was encountered in the
2883       compiled pattern. This error could be caused by a bug  in  PCRE  or  by
2884       overwriting of the compiled pattern.
2885
2886         PCRE_ERROR_NOMEMORY       (-6)
2887
2888       If  a  pattern contains back references, but the ovector that is passed
2889       to pcre_exec() is not big enough to remember the referenced substrings,
2890       PCRE  gets  a  block of memory at the start of matching to use for this
2891       purpose. If the call via pcre_malloc() fails, this error is given.  The
2892       memory is automatically freed at the end of matching.
2893
2894       This  error  is also given if pcre_stack_malloc() fails in pcre_exec().
2895       This can happen only when PCRE has been compiled with  --disable-stack-
2896       for-recursion.
2897
2898         PCRE_ERROR_NOSUBSTRING    (-7)
2899
2900       This  error is used by the pcre_copy_substring(), pcre_get_substring(),
2901       and  pcre_get_substring_list()  functions  (see  below).  It  is  never
2902       returned by pcre_exec().
2903
2904         PCRE_ERROR_MATCHLIMIT     (-8)
2905
2906       The  backtracking  limit,  as  specified  by the match_limit field in a
2907       pcre_extra structure (or defaulted) was reached.  See  the  description
2908       above.
2909
2910         PCRE_ERROR_CALLOUT        (-9)
2911
2912       This error is never generated by pcre_exec() itself. It is provided for
2913       use by callout functions that want to yield a distinctive  error  code.
2914       See the pcrecallout documentation for details.
2915
2916         PCRE_ERROR_BADUTF8        (-10)
2917
2918       A  string  that contains an invalid UTF-8 byte sequence was passed as a
2919       subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of
2920       the  output  vector  (ovecsize)  is  at least 2, the byte offset to the
2921       start of the invalid UTF-8 character is placed in the first element,
2922       and a reason code is placed in the second element. The reason
2923       codes are listed in the following section.  For backward compatibility,
2924       if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
2925       acter  at  the  end  of  the   subject   (reason   codes   1   to   5),
2926       PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
2927
2928         PCRE_ERROR_BADUTF8_OFFSET (-11)
2929
2930       The  UTF-8  byte  sequence that was passed as a subject was checked and
2931       found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the
2932       value  of startoffset did not point to the beginning of a UTF-8 charac-
2933       ter or the end of the subject.
2934
2935         PCRE_ERROR_PARTIAL        (-12)
2936
2937       The subject string did not match, but it did match partially.  See  the
2938       pcrepartial documentation for details of partial matching.
2939
2940         PCRE_ERROR_BADPARTIAL     (-13)
2941
2942       This  code  is  no  longer  in  use.  It was formerly returned when the
2943       PCRE_PARTIAL option was used with a compiled pattern  containing  items
2944       that  were  not  supported  for  partial  matching.  From  release 8.00
2945       onwards, there are no restrictions on partial matching.
2946
2947         PCRE_ERROR_INTERNAL       (-14)
2948
2949       An unexpected internal error has occurred. This error could  be  caused
2950       by a bug in PCRE or by overwriting of the compiled pattern.
2951
2952         PCRE_ERROR_BADCOUNT       (-15)
2953
2954       This error is given if the value of the ovecsize argument is negative.
2955
2956         PCRE_ERROR_RECURSIONLIMIT (-21)
2957
2958       The internal recursion limit, as specified by the match_limit_recursion
2959       field in a pcre_extra structure (or defaulted)  was  reached.  See  the
2960       description above.
2961
2962         PCRE_ERROR_BADNEWLINE     (-23)
2963
2964       An invalid combination of PCRE_NEWLINE_xxx options was given.
2965
2966         PCRE_ERROR_BADOFFSET      (-24)
2967
2968       The value of startoffset was negative or greater than the length of the
2969       subject, that is, the value in length.
2970
2971         PCRE_ERROR_SHORTUTF8      (-25)
2972
2973       This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject
2974       string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
2975       option is set.  Information  about  the  failure  is  returned  as  for
2976       PCRE_ERROR_BADUTF8.  It  is in fact sufficient to detect this case, but
2977       this special error code for PCRE_PARTIAL_HARD precedes the  implementa-
2978       tion  of returned information; it is retained for backwards compatibil-
2979       ity.
2980
2981         PCRE_ERROR_RECURSELOOP    (-26)
2982
2983       This error is returned when pcre_exec() detects a recursion loop within
2984       the  pattern. Specifically, it means that either the whole pattern or a
2985       subpattern has been called recursively for the second time at the  same
2986       position in the subject string. Some simple patterns that might do this
2987       are detected and faulted at compile time, but more  complicated  cases,
2988       in particular mutual recursions between two different subpatterns, can-
2989       not be detected until run time.
2990
2991         PCRE_ERROR_JIT_STACKLIMIT (-27)
2992
2993       This error is returned when a pattern  that  was  successfully  studied
2994       using  a  JIT compile option is being matched, but the memory available
2995       for the just-in-time processing stack is  not  large  enough.  See  the
2996       pcrejit documentation for more details.
2997
2998         PCRE_ERROR_BADMODE        (-28)
2999
3000       This error is given if a pattern that was compiled by the 8-bit library
3001       is passed to a 16-bit library function, or vice versa.
3002
3003         PCRE_ERROR_BADENDIANNESS  (-29)
3004
3005       This error is given if  a  pattern  that  was  compiled  and  saved  is
3006       reloaded  on  a  host  with  different endianness. The utility function
3007       pcre_pattern_to_host_byte_order() can be used to convert such a pattern
3008       so that it runs on the new host.
3009
3010       Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
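
       For diagnostics, the codes listed above can be mapped  to  their  names
       with  a  small  table.  The  function  exec_error_name() is a hypothetical
       helper, not part of PCRE. The numeric  values  are  copied  from  this
       section;  a real program would normally include pcre.h and compare
       against the named constants instead.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE). Maps a pcre_exec() return
   value to the name of the corresponding error, using the values
   documented in this section. */
static const char *exec_error_name(int rc)
{
    static const struct { int code; const char *name; } table[] = {
        {  -1, "PCRE_ERROR_NOMATCH" },
        {  -2, "PCRE_ERROR_NULL" },
        {  -3, "PCRE_ERROR_BADOPTION" },
        {  -4, "PCRE_ERROR_BADMAGIC" },
        {  -5, "PCRE_ERROR_UNKNOWN_OPCODE" },
        {  -6, "PCRE_ERROR_NOMEMORY" },
        {  -7, "PCRE_ERROR_NOSUBSTRING" },
        {  -8, "PCRE_ERROR_MATCHLIMIT" },
        {  -9, "PCRE_ERROR_CALLOUT" },
        { -10, "PCRE_ERROR_BADUTF8" },
        { -11, "PCRE_ERROR_BADUTF8_OFFSET" },
        { -12, "PCRE_ERROR_PARTIAL" },
        { -13, "PCRE_ERROR_BADPARTIAL" },
        { -14, "PCRE_ERROR_INTERNAL" },
        { -15, "PCRE_ERROR_BADCOUNT" },
        { -21, "PCRE_ERROR_RECURSIONLIMIT" },
        { -23, "PCRE_ERROR_BADNEWLINE" },
        { -24, "PCRE_ERROR_BADOFFSET" },
        { -25, "PCRE_ERROR_SHORTUTF8" },
        { -26, "PCRE_ERROR_RECURSELOOP" },
        { -27, "PCRE_ERROR_JIT_STACKLIMIT" },
        { -28, "PCRE_ERROR_BADMODE" },
        { -29, "PCRE_ERROR_BADENDIANNESS" }
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].code == rc)
            return table[i].name;
    /* Zero and positive values are not errors. */
    return rc >= 0 ? "not an error" : "unknown error";
}
```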
3011
3012   Reason codes for invalid UTF-8 strings
3013
3014       This  section  applies  only  to  the  8-bit library. The corresponding
3015       information for the 16-bit library is given in the pcre16 page.
3016
3017       When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
3018       UTF8,  and  the size of the output vector (ovecsize) is at least 2, the
3019       offset of the start of the invalid UTF-8 character  is  placed  in  the
3020       first output vector element (ovector[0]) and a reason code is placed in
3021       the second element (ovector[1]). The reason codes are  given  names  in
3022       the pcre.h header file:
3023
3024         PCRE_UTF8_ERR1
3025         PCRE_UTF8_ERR2
3026         PCRE_UTF8_ERR3
3027         PCRE_UTF8_ERR4
3028         PCRE_UTF8_ERR5
3029
3030       The  string  ends  with a truncated UTF-8 character; the code specifies
3031       how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
3032       characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
3033       nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
3034       checked first; hence the possibility of 4 or 5 missing bytes.
3035
3036         PCRE_UTF8_ERR6
3037         PCRE_UTF8_ERR7
3038         PCRE_UTF8_ERR8
3039         PCRE_UTF8_ERR9
3040         PCRE_UTF8_ERR10
3041
3042       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
3043       the character do not have the binary value 0b10 (that  is,  either  the
3044       most significant bit is 0, or the next bit is 1).
3045
3046         PCRE_UTF8_ERR11
3047         PCRE_UTF8_ERR12
3048
3049       A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
3050       long; these code points are excluded by RFC 3629.
3051
3052         PCRE_UTF8_ERR13
3053
       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.
3056
3057         PCRE_UTF8_ERR14
3058
       A 3-byte character has a value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16, and
       so is excluded from UTF-8.
3062
3063         PCRE_UTF8_ERR15
3064         PCRE_UTF8_ERR16
3065         PCRE_UTF8_ERR17
3066         PCRE_UTF8_ERR18
3067         PCRE_UTF8_ERR19
3068
3069       A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
3070       for a value that can be represented by fewer bytes, which  is  invalid.
3071       For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
3072       rect coding uses just one byte.
3073
3074         PCRE_UTF8_ERR20
3075
3076       The two most significant bits of the first byte of a character have the
3077       binary  value 0b10 (that is, the most significant bit is 1 and the sec-
3078       ond is 0). Such a byte can only validly occur as the second  or  subse-
3079       quent byte of a multi-byte character.
3080
3081         PCRE_UTF8_ERR21
3082
3083       The  first byte of a character has the value 0xfe or 0xff. These values
3084       can never occur in a valid UTF-8 string.
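
       As a rough illustration of how a caller might report these codes, the
       sketch below maps the reason code found in ovector[1] to a short
       description. The function names are made up for this example, and it
       assumes that PCRE_UTF8_ERR1 to PCRE_UTF8_ERR21 have the values 1 to
       21; real code should compare against the names in pcre.h instead.

```c
#include <assert.h>

/* Hypothetical helper: describe the reason code taken from ovector[1].
   ASSUMPTION: the codes PCRE_UTF8_ERR1..PCRE_UTF8_ERR21 are numbered
   1..21, so that this sketch compiles without pcre.h. */
const char *utf8_reason_text(int reason)
{
    if (reason >= 1 && reason <= 5)   return "truncated character";
    if (reason >= 6 && reason <= 10)  return "bad continuation byte";
    if (reason == 11 || reason == 12) return "5- or 6-byte character";
    if (reason == 13)                 return "code point above 0x10ffff";
    if (reason == 14)                 return "surrogate code point";
    if (reason >= 15 && reason <= 19) return "overlong encoding";
    if (reason == 20)                 return "isolated continuation byte";
    if (reason == 21)                 return "byte 0xfe or 0xff";
    return "unknown reason code";
}

/* For the truncation codes (1..5) the code is also the number of missing
   bytes; for every other code this returns 0. */
int utf8_missing_bytes(int reason)
{
    return (reason >= 1 && reason <= 5) ? reason : 0;
}
```

       A caller would typically print ovector[0] (the offset of the bad
       character) together with the text returned for ovector[1].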
3085
3086
3087EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3088
3089       int pcre_copy_substring(const char *subject, int *ovector,
3090            int stringcount, int stringnumber, char *buffer,
3091            int buffersize);
3092
3093       int pcre_get_substring(const char *subject, int *ovector,
3094            int stringcount, int stringnumber,
3095            const char **stringptr);
3096
3097       int pcre_get_substring_list(const char *subject,
3098            int *ovector, int stringcount, const char ***listptr);
3099
       Captured substrings can be accessed directly by using the offsets
       returned by pcre_exec() in ovector. For convenience, the functions
       pcre_copy_substring(), pcre_get_substring(), and
       pcre_get_substring_list() are provided for extracting captured
       substrings as new, separate, zero-terminated strings. These functions
       identify substrings by number. The next section describes functions
       for extracting named substrings.
3107
3108       A substring that contains a binary zero is correctly extracted and  has
3109       a  further zero added on the end, but the result is not, of course, a C
3110       string.  However, you can process such a string  by  referring  to  the
3111       length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
3112       string().  Unfortunately, the interface to pcre_get_substring_list() is
3113       not  adequate for handling strings containing binary zeros, because the
3114       end of the final string is not independently indicated.
3115
3116       The first three arguments are the same for all  three  of  these  func-
3117       tions:  subject  is  the subject string that has just been successfully
3118       matched, ovector is a pointer to the vector of integer offsets that was
3119       passed to pcre_exec(), and stringcount is the number of substrings that
3120       were captured by the match, including the substring  that  matched  the
3121       entire regular expression. This is the value returned by pcre_exec() if
3122       it is greater than zero. If pcre_exec() returned zero, indicating  that
3123       it  ran out of space in ovector, the value passed as stringcount should
3124       be the number of elements in the vector divided by three.
3125
3126       The functions pcre_copy_substring() and pcre_get_substring() extract  a
3127       single  substring,  whose  number  is given as stringnumber. A value of
3128       zero extracts the substring that matched the  entire  pattern,  whereas
3129       higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
3130       string(), the string is placed in buffer,  whose  length  is  given  by
3131       buffersize,  while  for  pcre_get_substring()  a new block of memory is
3132       obtained via pcre_malloc, and its address is  returned  via  stringptr.
3133       The  yield  of  the function is the length of the string, not including
3134       the terminating zero, or one of these error codes:
3135
3136         PCRE_ERROR_NOMEMORY       (-6)
3137
3138       The buffer was too small for pcre_copy_substring(), or the  attempt  to
3139       get memory failed for pcre_get_substring().
3140
3141         PCRE_ERROR_NOSUBSTRING    (-7)
3142
3143       There is no substring whose number is stringnumber.
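
       The behaviour described above for pcre_copy_substring() can be
       sketched in plain C. This is not the library's source: the function
       sketch_copy_substring() and the SKETCH_ names are invented for
       illustration, and only the layout of ovector and the two documented
       error values are taken from the text.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)  /* value of PCRE_ERROR_NOMEMORY */
#define SKETCH_ERROR_NOSUBSTRING (-7)  /* value of PCRE_ERROR_NOSUBSTRING */

/* Illustrative stand-in for pcre_copy_substring(): copy captured
   substring number n out of subject, using the offset pairs in ovector.
   Returns the length of the copied string or a negative error value. */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int n,
                          char *buffer, int buffersize)
{
    int start, length;

    assert(subject != NULL && ovector != NULL && buffer != NULL);

    if (n < 0 || n >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    start  = ovector[2 * n];
    length = ovector[2 * n + 1] - start;  /* 0 when both offsets are equal */

    if (buffersize < length + 1)          /* need room for the added zero */
        return SKETCH_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

       For a match of (a+)b(\d+) against "ab123", ovector would begin 0, 5,
       0, 1, 2, 5, so asking for substring 2 yields "123" and a length of 3.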
3144
3145       The  pcre_get_substring_list()  function  extracts  all  available sub-
3146       strings and builds a list of pointers to them. All this is  done  in  a
3147       single block of memory that is obtained via pcre_malloc. The address of
3148       the memory block is returned via listptr, which is also  the  start  of
3149       the  list  of  string pointers. The end of the list is marked by a NULL
3150       pointer. The yield of the function is zero if all  went  well,  or  the
3151       error code
3152
3153         PCRE_ERROR_NOMEMORY       (-6)
3154
3155       if the attempt to get the memory block failed.
3156
       When any of these functions encounters a substring that is unset, which
3158       can happen when capturing subpattern number n+1 matches  some  part  of
3159       the  subject, but subpattern n has not been used at all, they return an
3160       empty string. This can be distinguished from a genuine zero-length sub-
3161       string  by inspecting the appropriate offset in ovector, which is nega-
3162       tive for unset substrings.
3163
3164       The two convenience functions pcre_free_substring() and  pcre_free_sub-
3165       string_list()  can  be  used  to free the memory returned by a previous
3166       call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
3167       tively.  They  do  nothing  more  than  call the function pointed to by
3168       pcre_free, which of course could be called directly from a  C  program.
3169       However,  PCRE is used in some situations where it is linked via a spe-
3170       cial  interface  to  another  programming  language  that  cannot   use
3171       pcre_free  directly;  it is for these cases that the functions are pro-
3172       vided.
3173
3174
3175EXTRACTING CAPTURED SUBSTRINGS BY NAME
3176
3177       int pcre_get_stringnumber(const pcre *code,
3178            const char *name);
3179
3180       int pcre_copy_named_substring(const pcre *code,
3181            const char *subject, int *ovector,
3182            int stringcount, const char *stringname,
3183            char *buffer, int buffersize);
3184
3185       int pcre_get_named_substring(const pcre *code,
3186            const char *subject, int *ovector,
3187            int stringcount, const char *stringname,
3188            const char **stringptr);
3189
       To extract a substring by name, you first have to find the associated
       number. For example, for this pattern
3192
3193         (a+)b(?<xxx>\d+)...
3194
3195       the number of the subpattern called "xxx" is 2. If the name is known to
3196       be unique (PCRE_DUPNAMES was not set), you can find the number from the
3197       name by calling pcre_get_stringnumber(). The first argument is the com-
3198       piled pattern, and the second is the name. The yield of the function is
3199       the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
3200       subpattern of that name.
3201
3202       Given the number, you can extract the substring directly, or use one of
3203       the functions described in the previous section. For convenience, there
3204       are also two functions that do the whole job.
3205
3206       Most   of   the   arguments    of    pcre_copy_named_substring()    and
3207       pcre_get_named_substring()  are  the  same  as  those for the similarly
3208       named functions that extract by number. As these are described  in  the
3209       previous  section,  they  are not re-described here. There are just two
3210       differences:
3211
3212       First, instead of a substring number, a substring name is  given.  Sec-
3213       ond, there is an extra argument, given at the start, which is a pointer
3214       to the compiled pattern. This is needed in order to gain access to  the
3215       name-to-number translation table.
3216
3217       These  functions call pcre_get_stringnumber(), and if it succeeds, they
3218       then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
3219       ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
3220       behaviour may not be what you want (see the next section).
3221
3222       Warning: If the pattern uses the (?| feature to set up multiple subpat-
3223       terns  with  the  same number, as described in the section on duplicate
3224       subpattern numbers in the pcrepattern page, you  cannot  use  names  to
3225       distinguish  the  different subpatterns, because names are not included
3226       in the compiled code. The matching process uses only numbers. For  this
3227       reason,  the  use of different names for subpatterns of the same number
3228       causes an error at compile time.
3229
3230
3231DUPLICATE SUBPATTERN NAMES
3232
3233       int pcre_get_stringtable_entries(const pcre *code,
3234            const char *name, char **first, char **last);
3235
3236       When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
3237       subpatterns  are not required to be unique. (Duplicate names are always
3238       allowed for subpatterns with the same number, created by using the  (?|
3239       feature.  Indeed,  if  such subpatterns are named, they are required to
3240       use the same names.)
3241
3242       Normally, patterns with duplicate names are such that in any one match,
3243       only  one of the named subpatterns participates. An example is shown in
3244       the pcrepattern documentation.
3245
3246       When   duplicates   are   present,   pcre_copy_named_substring()    and
3247       pcre_get_named_substring()  return the first substring corresponding to
3248       the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
3249       (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
3250       function returns one of the numbers that are associated with the  name,
3251       but it is not defined which it is.
3252
3253       If  you want to get full details of all captured substrings for a given
3254       name, you must use  the  pcre_get_stringtable_entries()  function.  The
3255       first argument is the compiled pattern, and the second is the name. The
3256       third and fourth are pointers to variables which  are  updated  by  the
3257       function. After it has run, they point to the first and last entries in
3258       the name-to-number table  for  the  given  name.  The  function  itself
3259       returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
       there are none. The format of the table is described above in the
       section entitled "Information about a pattern". Given all the relevant
       entries for the name, you can extract each of their numbers, and hence
       the captured data, if any.
3264
3265
3266FINDING ALL POSSIBLE MATCHES
3267
3268       The  traditional  matching  function  uses a similar algorithm to Perl,
3269       which stops when it finds the first match, starting at a given point in
3270       the  subject.  If you want to find all possible matches, or the longest
3271       possible match, consider using the alternative matching  function  (see
3272       below)  instead.  If you cannot use the alternative function, but still
3273       need to find all possible matches, you can kludge it up by  making  use
3274       of the callout facility, which is described in the pcrecallout documen-
3275       tation.
3276
3277       What you have to do is to insert a callout right at the end of the pat-
3278       tern.   When your callout function is called, extract and save the cur-
3279       rent matched substring. Then return  1,  which  forces  pcre_exec()  to
3280       backtrack  and  try other alternatives. Ultimately, when it runs out of
3281       matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
3282
3283
3284OBTAINING AN ESTIMATE OF STACK USAGE
3285
3286       Matching certain patterns using pcre_exec() can use a  lot  of  process
3287       stack,  which  in  certain  environments can be rather limited in size.
3288       Some users find it helpful to have an estimate of the amount  of  stack
3289       that  is  used  by  pcre_exec(),  to help them set recursion limits, as
3290       described in the pcrestack documentation. The estimate that  is  output
       by pcretest when called with the -m and -C options is obtained by
       calling pcre_exec() with the values NULL, NULL, NULL, -999, and -999
       for its first five arguments.
3294
3295       Normally,  if  its  first  argument  is  NULL,  pcre_exec() immediately
3296       returns the negative error code PCRE_ERROR_NULL, but with this  special
3297       combination  of  arguments,  it returns instead a negative number whose
3298       absolute value is the approximate stack frame size in bytes.  (A  nega-
3299       tive  number  is  used so that it is clear that no match has happened.)
3300       The value is approximate because in  some  cases,  recursive  calls  to
3301       pcre_exec() occur when there are one or two additional variables on the
3302       stack.
3303
3304       If PCRE has been compiled to use the heap  instead  of  the  stack  for
3305       recursion,  the  value  returned  is  the  size  of  each block that is
3306       obtained from the heap.
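
       One way to use the estimate is sketched below: negate the return
       value to get the frame size, then divide a stack budget by it. Both
       helper names and the idea of a fixed budget are assumptions made for
       illustration; the pcrestack documentation remains the reference for
       choosing limits.

```c
#include <assert.h>

/* Turn the special negative return value into a frame size in bytes.
   Returns 0 if rc is not negative. */
long frame_size_from_rc(int rc)
{
    return rc < 0 ? -(long)rc : 0;
}

/* Largest recursion depth whose frames fit in stack_budget bytes, or 0
   if the frame size is unknown. */
long recursion_limit_for(long stack_budget, long frame_size)
{
    return frame_size > 0 ? stack_budget / frame_size : 0;
}
```

       With a frame size of 400 bytes and a budget of 1000000 bytes, this
       gives a limit of 2500. Because the estimate is approximate, a real
       program would probably leave some headroom.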
3307
3308
3309MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3310
3311       int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
3312            const char *subject, int length, int startoffset,
3313            int options, int *ovector, int ovecsize,
3314            int *workspace, int wscount);
3315
3316       The function pcre_dfa_exec()  is  called  to  match  a  subject  string
3317       against  a  compiled pattern, using a matching algorithm that scans the
3318       subject string just once, and does not backtrack.  This  has  different
3319       characteristics  to  the  normal  algorithm, and is not compatible with
3320       Perl. Some of the features of PCRE patterns are not  supported.  Never-
3321       theless,  there are times when this kind of matching can be useful. For
3322       a discussion of the two matching algorithms, and  a  list  of  features
3323       that  pcre_dfa_exec() does not support, see the pcrematching documenta-
3324       tion.
3325
3326       The arguments for the pcre_dfa_exec() function  are  the  same  as  for
3327       pcre_exec(), plus two extras. The ovector argument is used in a differ-
3328       ent way, and this is described below. The other  common  arguments  are
3329       used  in  the  same way as for pcre_exec(), so their description is not
3330       repeated here.
3331
3332       The two additional arguments provide workspace for  the  function.  The
3333       workspace  vector  should  contain at least 20 elements. It is used for
3334       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
3335       workspace  will  be  needed for patterns and subjects where there are a
3336       lot of potential matches.
3337
3338       Here is an example of a simple call to pcre_dfa_exec():
3339
3340         int rc;
3341         int ovector[10];
3342         int wspace[20];
3343         rc = pcre_dfa_exec(
3344           re,             /* result of pcre_compile() */
3345           NULL,           /* we didn't study the pattern */
3346           "some string",  /* the subject string */
3347           11,             /* the length of the subject string */
3348           0,              /* start at offset 0 in the subject */
3349           0,              /* default options */
3350           ovector,        /* vector of integers for substring information */
3351           10,             /* number of elements (NOT size in bytes) */
3352           wspace,         /* working space vector */
3353           20);            /* number of elements (NOT size in bytes) */
3354
3355   Option bits for pcre_dfa_exec()
3356
       The unused bits of the options argument for pcre_dfa_exec() must be
       zero. The only bits that may be set are PCRE_ANCHORED,
       PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
       PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
       PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,
       PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but
       the last four of these are exactly the same as for pcre_exec(), so
       their description is not repeated here.
3365
3366         PCRE_PARTIAL_HARD
3367         PCRE_PARTIAL_SOFT
3368
3369       These  have the same general effect as they do for pcre_exec(), but the
3370       details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for
3371       pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-
3372       ject is reached and there is still at least  one  matching  possibility
3373       that requires additional characters. This happens even if some complete
3374       matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
3375       code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
3376       of the subject is reached, there have been  no  complete  matches,  but
3377       there  is  still  at least one matching possibility. The portion of the
3378       string that was inspected when the longest partial match was  found  is
3379       set  as  the  first  matching  string  in  both cases.  There is a more
3380       detailed discussion of partial and multi-segment matching,  with  exam-
3381       ples, in the pcrepartial documentation.
3382
3383         PCRE_DFA_SHORTEST
3384
3385       Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
3386       stop as soon as it has found one match. Because of the way the alterna-
3387       tive  algorithm  works, this is necessarily the shortest possible match
3388       at the first possible matching point in the subject string.
3389
3390         PCRE_DFA_RESTART
3391
3392       When pcre_dfa_exec() returns a partial match, it is possible to call it
3393       again,  with  additional  subject characters, and have it continue with
3394       the same match. The PCRE_DFA_RESTART option requests this action;  when
       it is set, the workspace and wscount arguments must reference the same
3396       vector as before because data about the match so far is  left  in  them
3397       after a partial match. There is more discussion of this facility in the
3398       pcrepartial documentation.
3399
3400   Successful returns from pcre_dfa_exec()
3401
3402       When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
3403       string in the subject. Note, however, that all the matches from one run
3404       of the function start at the same point in  the  subject.  The  shorter
3405       matches  are all initial substrings of the longer matches. For example,
3406       if the pattern
3407
3408         <.*>
3409
3410       is matched against the string
3411
3412         This is <something> <something else> <something further> no more
3413
3414       the three matched strings are
3415
3416         <something>
3417         <something> <something else>
3418         <something> <something else> <something further>
3419
3420       On success, the yield of the function is a number  greater  than  zero,
3421       which  is  the  number of matched substrings. The substrings themselves
3422       are returned in ovector. Each string uses two elements;  the  first  is
3423       the  offset  to  the start, and the second is the offset to the end. In
3424       fact, all the strings have the same start  offset.  (Space  could  have
3425       been  saved by giving this only once, but it was decided to retain some
3426       compatibility with the way pcre_exec() returns data,  even  though  the
3427       meaning of the strings is different.)
3428
3429       The strings are returned in reverse order of length; that is, the long-
3430       est matching string is given first. If there were too many  matches  to
3431       fit  into ovector, the yield of the function is zero, and the vector is
3432       filled with the longest matches.  Unlike  pcre_exec(),  pcre_dfa_exec()
3433       can use the entire ovector for returning matched strings.
3434
3435   Error returns from pcre_dfa_exec()
3436
3437       The  pcre_dfa_exec()  function returns a negative number when it fails.
3438       Many of the errors are the same  as  for  pcre_exec(),  and  these  are
3439       described  above.   There are in addition the following errors that are
3440       specific to pcre_dfa_exec():
3441
3442         PCRE_ERROR_DFA_UITEM      (-16)
3443
3444       This return is given if pcre_dfa_exec() encounters an item in the  pat-
3445       tern  that  it  does not support, for instance, the use of \C or a back
3446       reference.
3447
3448         PCRE_ERROR_DFA_UCOND      (-17)
3449
3450       This return is given if pcre_dfa_exec()  encounters  a  condition  item
3451       that  uses  a back reference for the condition, or a test for recursion
3452       in a specific group. These are not supported.
3453
3454         PCRE_ERROR_DFA_UMLIMIT    (-18)
3455
3456       This return is given if pcre_dfa_exec() is called with an  extra  block
3457       that  contains  a  setting  of the match_limit or match_limit_recursion
3458       fields. This is not supported (these fields  are  meaningless  for  DFA
3459       matching).
3460
3461         PCRE_ERROR_DFA_WSSIZE     (-19)
3462
3463       This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
3464       workspace vector.
3465
3466         PCRE_ERROR_DFA_RECURSE    (-20)
3467
3468       When a recursive subpattern is processed, the matching  function  calls
3469       itself  recursively,  using  private vectors for ovector and workspace.
3470       This error is given if the output vector  is  not  large  enough.  This
3471       should be extremely rare, as a vector of size 1000 is used.
3472
3473         PCRE_ERROR_DFA_BADRESTART (-30)
3474
3475       When  pcre_dfa_exec()  is called with the PCRE_DFA_RESTART option, some
3476       plausibility checks are made on the contents of  the  workspace,  which
3477       should  contain  data about the previous partial match. If any of these
3478       checks fail, this error is given.
3479
3480
3481SEE ALSO
3482
       pcre16(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
       pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
       pcrestack(3).
3486
3487
3488AUTHOR
3489
3490       Philip Hazel
3491       University Computing Service
3492       Cambridge CB2 3QH, England.
3493
3494
3495REVISION
3496
3497       Last updated: 17 June 2012
3498       Copyright (c) 1997-2012 University of Cambridge.
3499------------------------------------------------------------------------------
3500
3501
3502PCRECALLOUT(3)                                                  PCRECALLOUT(3)
3503
3504
3505NAME
3506       PCRE - Perl-compatible regular expressions
3507
3508
3509PCRE CALLOUTS
3510
3511       int (*pcre_callout)(pcre_callout_block *);
3512
3513       int (*pcre16_callout)(pcre16_callout_block *);
3514
3515       PCRE provides a feature called "callout", which is a means of temporar-
3516       ily passing control to the caller of PCRE  in  the  middle  of  pattern
3517       matching.  The  caller of PCRE provides an external function by putting
3518       its entry point in the global variable pcre_callout (pcre16_callout for
3519       the  16-bit  library).  By  default, this variable contains NULL, which
3520       disables all calling out.
3521
3522       Within a regular expression, (?C) indicates the  points  at  which  the
3523       external  function  is  to  be  called. Different callout points can be
3524       identified by putting a number less than 256 after the  letter  C.  The
3525       default  value  is  zero.   For  example,  this pattern has two callout
3526       points:
3527
3528         (?C1)abc(?C2)def
3529
3530       If the PCRE_AUTO_CALLOUT option bit is set when a pattern is  compiled,
3531       PCRE  automatically  inserts callouts, all with number 255, before each
3532       item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
3533       pattern
3534
3535         A(\d{2}|--)
3536
3537       it is processed as if it were
3538
3539       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
3540
3541       Notice  that  there  is a callout before and after each parenthesis and
3542       alternation bar. Automatic  callouts  can  be  used  for  tracking  the
3543       progress  of  pattern matching. The pcretest command has an option that
3544       sets automatic callouts; when it is used, the output indicates how  the
3545       pattern  is  matched. This is useful information when you are trying to
3546       optimize the performance of a particular pattern.
3547
3548       The use of callouts in a pattern makes it ineligible  for  optimization
3549       by  the  just-in-time  compiler.  Studying  such  a  pattern  with  the
3550       PCRE_STUDY_JIT_COMPILE option always fails.
3551
3552
3553MISSING CALLOUTS
3554
3555       You should be aware that, because of  optimizations  in  the  way  PCRE
3556       matches  patterns  by  default,  callouts  sometimes do not happen. For
3557       example, if the pattern is
3558
3559         ab(?C4)cd
3560
3561       PCRE knows that any matching string must contain the letter "d". If the
3562       subject  string  is "abyz", the lack of "d" means that matching doesn't
3563       ever start, and the callout is never  reached.  However,  with  "abyd",
3564       though the result is still no match, the callout is obeyed.
3565
3566       If  the pattern is studied, PCRE knows the minimum length of a matching
3567       string, and will immediately give a "no match" return without  actually
3568       running  a  match if the subject is not long enough, or, for unanchored
3569       patterns, if it has been scanned far enough.
3570
       You can disable these optimizations by passing the
       PCRE_NO_START_OPTIMIZE option to the matching function, or by starting
       the pattern with (*NO_START_OPT). This slows down the matching process,
       but does ensure that callouts such as the example above are obeyed.
3575
3576
3577THE CALLOUT INTERFACE
3578
3579       During  matching, when PCRE reaches a callout point, the external func-
3580       tion defined by pcre_callout or pcre16_callout  is  called  (if  it  is
3581       set).   This applies to both normal and DFA matching. The only argument
       to the callout function is a pointer to a pcre_callout_block or
       pcre16_callout_block structure. These structures contain the following
       fields:
3584
3585         int           version;
3586         int           callout_number;
3587         int          *offset_vector;
3588         const char   *subject;           (8-bit version)
3589         PCRE_SPTR16   subject;           (16-bit version)
3590         int           subject_length;
3591         int           start_match;
3592         int           current_position;
3593         int           capture_top;
3594         int           capture_last;
3595         void         *callout_data;
3596         int           pattern_position;
3597         int           next_item_length;
3598         const unsigned char *mark;       (8-bit version)
3599         const PCRE_UCHAR16  *mark;       (16-bit version)
3600
3601       The  version  field  is an integer containing the version number of the
3602       block format. The initial version was 0; the current version is 2.  The
3603       version  number  will  change  again in future if additional fields are
3604       added, but the intention is never to remove any of the existing fields.
3605
3606       The callout_number field contains the number of the  callout,  as  com-
3607       piled  into  the pattern (that is, the number after ?C for manual call-
3608       outs, and 255 for automatically generated callouts).
3609
3610       The offset_vector field is a pointer to the vector of offsets that  was
3611       passed  by  the  caller  to  the matching function. When pcre_exec() or
3612       pcre16_exec() is used, the contents  can  be  inspected,  in  order  to
3613       extract  substrings  that  have been matched so far, in the same way as
3614       for extracting substrings after a match  has  completed.  For  the  DFA
3615       matching functions, this field is not useful.
3616
3617       The subject and subject_length fields contain copies of the values that
3618       were passed to the matching function.
3619
3620       The start_match field normally contains the offset within  the  subject
3621       at  which  the  current  match  attempt started. However, if the escape
3622       sequence \K has been encountered, this value is changed to reflect  the
3623       modified  starting  point.  If the pattern is not anchored, the callout
3624       function may be called several times from the same point in the pattern
3625       for different starting points in the subject.
3626
3627       The  current_position  field  contains the offset within the subject of
3628       the current match pointer.
3629
       When pcre_exec() or pcre16_exec() is used, the capture_top field
3631       contains one more than the number of the highest numbered captured sub-
3632       string so far. If no substrings have been captured, the value  of  cap-
3633       ture_top  is  one.  This  is always the case when the DFA functions are
3634       used, because they do not support captured substrings.
3635
3636       The capture_last field contains the number of the  most  recently  cap-
3637       tured  substring. If no substrings have been captured, its value is -1.
3638       This is always the case for the DFA matching functions.
3639
3640       The callout_data field contains a value that is passed  to  a  matching
3641       function  specifically so that it can be passed back in callouts. It is
3642       passed in the callout_data field of a pcre_extra or  pcre16_extra  data
3643       structure.  If  no such data was passed, the value of callout_data in a
3644       callout block is NULL. There is a description of the pcre_extra  struc-
3645       ture in the pcreapi documentation.
3646
3647       The  pattern_position  field  is  present from version 1 of the callout
3648       structure. It contains the offset to the next item to be matched in the
3649       pattern string.
3650
3651       The  next_item_length  field  is  present from version 1 of the callout
3652       structure. It contains the length of the next item to be matched in the
3653       pattern  string.  When  the callout immediately precedes an alternation
3654       bar, a closing parenthesis, or the end of the pattern,  the  length  is
3655       zero.  When  the callout precedes an opening parenthesis, the length is
3656       that of the entire subpattern.
3657
3658       The pattern_position and next_item_length fields are intended  to  help
3659       in  distinguishing between different automatic callouts, which all have
3660       the same callout number. However, they are set for all callouts.
3661
3662       The mark field is present from version 2 of the callout  structure.  In
3663       callouts from pcre_exec() or pcre16_exec() it contains a pointer to the
3664       zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
3665       (*THEN)  item  in the match, or NULL if no such items have been passed.
3666       Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a
3667       previous  (*MARK).  In  callouts  from  the DFA matching functions this
3668       field always contains NULL.
3669
3670
3671RETURN VALUES
3672
3673       The external callout function returns an integer to PCRE. If the  value
3674       is  zero,  matching  proceeds  as  normal. If the value is greater than
3675       zero, matching fails at the current point, but  the  testing  of  other
3676       matching possibilities goes ahead, just as if a lookahead assertion had
3677       failed.  If the value is less than zero, the match is abandoned, and the
3678       matching function returns the negative value.
3679
3680       Negative   values   should   normally   be   chosen  from  the  set  of
3681       PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
3682       dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
3683       reserved for use by callout functions; it will never be  used  by  PCRE
3684       itself.
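
       The three cases can be summarized in a small Python model. This is a
       sketch of the rule only, not of PCRE's matching loop; the function
       name and the returned labels are invented for the example, and the
       two numeric constants are the values believed to be defined in pcre.h
       (check the header you build against).

```python
# Assumed to mirror pcre.h; verify against your installed header.
PCRE_ERROR_NOMATCH = -1
PCRE_ERROR_CALLOUT = -9


def after_callout(return_value):
    # Describe what the matching function does with a callout's result.
    if return_value == 0:
        return ("continue", None)
    if return_value > 0:
        # Fail at this point only; other possibilities are still tried,
        # as if a lookahead assertion had failed.
        return ("backtrack", None)
    # Negative: the match is abandoned and this value is returned.
    return ("abandon", return_value)
```

       For instance, a callout that returns PCRE_ERROR_CALLOUT makes the
       model report that the match is abandoned with that same value.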
3685
3686
3687AUTHOR
3688
3689       Philip Hazel
3690       University Computing Service
3691       Cambridge CB2 3QH, England.
3692
3693
3694REVISION
3695
3696       Last updated: 08 January 2012
3697       Copyright (c) 1997-2012 University of Cambridge.
3698------------------------------------------------------------------------------
3699
3700
3701PCRECOMPAT(3)                                                    PCRECOMPAT(3)
3702
3703
3704NAME
3705       PCRE - Perl-compatible regular expressions
3706
3707
3708DIFFERENCES BETWEEN PCRE AND PERL
3709
3710       This  document describes the differences in the ways that PCRE and Perl
3711       handle regular expressions. The differences  described  here  are  with
3712       respect to Perl versions 5.10 and above.
3713
3714       1. PCRE has only a subset of Perl's Unicode support. Details of what it
3715       does have are given in the pcreunicode page.
3716
3717       2. PCRE allows repeat quantifiers only on parenthesized assertions, but
3718       they  do  not mean what you might think. For example, (?!a){3} does not
3719       assert that the next three characters are not "a". It just asserts that
3720       the next character is not "a" three times (in principle: PCRE optimizes
3721       this to run the assertion just once). Perl allows repeat quantifiers on
3722       other assertions such as \b, but these do not seem to have any use.
3723
3724       3.  Capturing  subpatterns  that occur inside negative lookahead asser-
3725       tions are counted, but their entries in the offsets  vector  are  never
3726       set.  Perl sets its numerical variables from any such patterns that are
3727       matched before the assertion fails to match something (thereby succeed-
3728       ing),  but  only  if the negative lookahead assertion contains just one
3729       branch.
3730
3731       4. Though binary zero characters are supported in the  subject  string,
3732       they are not allowed in a pattern string because it is passed as a nor-
3733       mal C string, terminated by zero. The escape sequence \0 can be used in
3734       the pattern to represent a binary zero.
3735
3736       5.  The  following Perl escape sequences are not supported: \l, \u, \L,
3737       \U, and \N when followed by a character name or Unicode value.  (\N  on
3738       its own, matching a non-newline character, is supported.) In fact these
3739       are implemented by Perl's general string-handling and are not  part  of
3740       its  pattern  matching engine. If any of these are encountered by PCRE,
3741       an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-
3742       PAT  option  is set, \U and \u are interpreted as JavaScript interprets
3743       them.
3744
3745       6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
3746       is  built  with Unicode character property support. The properties that
3747       can be tested with \p and \P are limited to the general category  prop-
3748       erties  such  as  Lu and Nd, script names such as Greek or Han, and the
3749       derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
3750       property,  which  Perl  does  not; the Perl documentation says "Because
3751       Perl hides the need for the user to understand the internal representa-
3752       tion  of Unicode characters, there is no need to implement the somewhat
3753       messy concept of surrogates."
3754
3755       7. PCRE implements a simpler version of \X than Perl, which changed  to
3756       make  \X  match what Unicode calls an "extended grapheme cluster". This
3757       is more complicated than an extended Unicode sequence,  which  is  what
3758       PCRE matches.
3759
3760       8. PCRE does support the \Q...\E escape for quoting substrings. Charac-
3761       ters in between are treated as literals.  This  is  slightly  different
3762       from  Perl  in  that  $  and  @ are also handled as literals inside the
3763       quotes. In Perl, they cause variable interpolation (but of course  PCRE
3764       does not have variables). Note the following examples:
3765
3766           Pattern            PCRE matches      Perl matches
3767
3768           \Qabc$xyz\E        abc$xyz           abc followed by the
3769                                                  contents of $xyz
3770           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
3771           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
3772
3773       The  \Q...\E  sequence  is recognized both inside and outside character
3774       classes.
3775
3776       9. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
3777       constructions.  However,  there is support for recursive patterns. This
3778       is not available in Perl 5.8, but it is in Perl 5.10.  Also,  the  PCRE
3779       "callout"  feature allows an external function to be called during pat-
3780       tern matching. See the pcrecallout documentation for details.
3781
3782       10. Subpatterns that are called as subroutines (whether or  not  recur-
3783       sively)  are  always  treated  as  atomic  groups in PCRE. This is like
3784       Python, but unlike Perl.  Captured values that are set outside  a  sub-
3785       routine  call  can  be  referenced from inside in PCRE, but not in Perl.
3786       There is a discussion that explains these differences in more detail in
3787       the section on recursion differences from Perl in the pcrepattern page.
3788
3789       11.  If  any of the backtracking control verbs are used in an assertion
3790       or in a subpattern that is called  as  a  subroutine  (whether  or  not
3791       recursively),  their effect is confined to that subpattern; it does not
3792       extend to the surrounding pattern. This is not always the case in Perl.
3793       In  particular,  if  (*THEN)  is present in a group that is called as a
3794       subroutine, its action is limited to that group, even if the group does
3795       not  contain any | characters. There is one exception to this: the name
3796       from a (*MARK), (*PRUNE), or (*THEN) that is encountered in a  success-
3797       ful  positive  assertion  is passed back when a match succeeds (compare
3798       capturing parentheses in assertions). Note that  such  subpatterns  are
3799       processed as anchored at the point where they are tested.
3800
3801       12.  There are some differences that are concerned with the settings of
3802       captured strings when part of  a  pattern  is  repeated.  For  example,
3803       matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
3804       unset, but in PCRE it is set to "b".
3805
3806       13. PCRE's handling of duplicate subpattern numbers and duplicate  sub-
3807       pattern names is not as general as Perl's. This is a consequence of the
3808       fact that PCRE works internally just  with  numbers,  using  an  external
3809       table  to  translate between numbers and names. In particular, a pattern
3810       such as (?|(?<a>A)|(?<b>B)), where the two capturing  parentheses  have
3811       the  same  number  but different names, is not supported, and causes an
3812       error at compile time. If it were allowed, it would not be possible  to
3813       distinguish  which  parentheses matched, because both names map to cap-
3814       turing subpattern number 1. To avoid this confusing situation, an error
3815       is given at compile time.
3816
3817       14.  Perl  recognizes  comments  in some places that PCRE does not, for
3818       example, between the ( and ? at the start of a subpattern.  If  the  /x
3819       modifier is set, Perl allows white space between ( and ? but PCRE never
3820       does, even if the PCRE_EXTENDED option is set.
3821
3822       15. PCRE provides some extensions to the Perl regular expression facil-
3823       ities.   Perl  5.10  includes new features that are not in earlier ver-
3824       sions of Perl, some of which (such as named parentheses) have  been  in
3825       PCRE for some time. This list is with respect to Perl 5.10:
3826
3827       (a)  Although  lookbehind  assertions  in  PCRE must match fixed length
3828       strings, each alternative branch of a lookbehind assertion can match  a
3829       different  length  of  string.  Perl requires them all to have the same
3830       length.
3831
3832       (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
3833       meta-character matches only at the very end of the string.
3834
3835       (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
3836       cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
3837       ignored.  (Perl can be made to issue a warning.)
3838
3839       (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
3840       fiers is inverted, that is, by default they are not greedy, but if fol-
3841       lowed by a question mark they are.
3842
3843       (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
3844       tried only at the first matching position in the subject string.
3845
3846       (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
3847       and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-
3848       lents.
3849
3850       (g) The \R escape sequence can be restricted to match only CR,  LF,  or
3851       CRLF by the PCRE_BSR_ANYCRLF option.
3852
3853       (h) The callout facility is PCRE-specific.
3854
3855       (i) The partial matching facility is PCRE-specific.
3856
3857       (j) Patterns compiled by PCRE can be saved and re-used at a later time,
3858       even on different hosts that have the other endianness.  However,  this
3859       does not apply to optimized data created by the just-in-time compiler.
3860
3861       (k)   The   alternative   matching   functions   (pcre_dfa_exec()   and
3862       pcre16_dfa_exec()) match in a different way and are  not  Perl-compati-
3863       ble.
3864
3865       (l)  PCRE  recognizes some special sequences such as (*CR) at the start
3866       of a pattern that set overall options that cannot be changed within the
3867       pattern.
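
       Item 13 above explains that names are translated to numbers through a
       table, so one number cannot carry two names. The Python sketch below
       models that check. It is an illustration only: the function name and
       the (name, number) input format are assumptions, not part of PCRE.

```python
def build_name_table(named_groups):
    # named_groups: (name, number) pairs in the order they appear in the
    # pattern. Returns a name -> number table, or raises if one number
    # is given two different names.
    name_for_number = {}
    table = {}
    for name, number in named_groups:
        first_name = name_for_number.setdefault(number, name)
        if first_name != name:
            raise ValueError(
                "group %d is named both %r and %r" % (number, first_name, name)
            )
        table[name] = number
    return table
```

       In a (?|...) group both alternatives receive number 1, so feeding the
       model the pairs ("a", 1) and ("b", 1) raises an error, mirroring the
       compile-time error described in item 13.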
3868
3869
3870AUTHOR
3871
3872       Philip Hazel
3873       University Computing Service
3874       Cambridge CB2 3QH, England.
3875
3876
3877REVISION
3878
3879       Last updated: 01 June 2012
3880       Copyright (c) 1997-2012 University of Cambridge.
3881------------------------------------------------------------------------------
3882
3883
3884PCREPATTERN(3)                                                  PCREPATTERN(3)
3885
3886
3887NAME
3888       PCRE - Perl-compatible regular expressions
3889
3890
3891PCRE REGULAR EXPRESSION DETAILS
3892
3893       The  syntax and semantics of the regular expressions that are supported
3894       by PCRE are described in detail below. There is a quick-reference  syn-
3895       tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
3896       semantics as closely as it can. PCRE  also  supports  some  alternative
3897       regular  expression  syntax (which does not conflict with the Perl syn-
3898       tax) in order to provide some compatibility with regular expressions in
3899       Python, .NET, and Oniguruma.
3900
3901       Perl's  regular expressions are described in its own documentation, and
3902       regular expressions in general are covered in a number of  books,  some
3903       of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
3904       Expressions", published by  O'Reilly,  covers  regular  expressions  in
3905       great  detail.  This  description  of  PCRE's  regular  expressions  is
3906       intended as reference material.
3907
3908       The original operation of PCRE was on strings of  one-byte  characters.
3909       However,  there  is  now also support for UTF-8 strings in the original
3910       library, and a second library that supports 16-bit and UTF-16 character
3911       strings. To use these features, PCRE must be built to include appropri-
3912       ate support. When using UTF strings you must either call the  compiling
3913       function  with  the PCRE_UTF8 or PCRE_UTF16 option, or the pattern must
3914       start with one of these special sequences:
3915
3916         (*UTF8)
3917         (*UTF16)
3918
3919       Starting a pattern with such a sequence is equivalent  to  setting  the
3920       relevant option. This feature is not Perl-compatible. How setting a UTF
3921       mode affects pattern matching is mentioned  in  several  places  below.
3922       There is also a summary of features in the pcreunicode page.
3923
3924       Another  special  sequence that may appear at the start of a pattern or
3925       in combination with (*UTF8) or (*UTF16) is:
3926
3927         (*UCP)
3928
3929       This has the same effect as setting  the  PCRE_UCP  option:  it  causes
3930       sequences  such  as  \d  and  \w to use Unicode properties to determine
3931       character types, instead of recognizing only characters with codes less
3932       than 128 via a lookup table.
3933
3934       If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
3935       setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
3936       time. There are also some more of these special sequences that are con-
3937       cerned with the handling of newlines; they are described below.
3938
3939       The remainder of this document discusses the  patterns  that  are  sup-
3940       ported  by  PCRE  when  one  of  its main matching functions, pcre_exec()
3941       (8-bit) or pcre16_exec() (16-bit), is used. PCRE also  has  alternative
3942       matching  functions, pcre_dfa_exec() and pcre16_dfa_exec(), which match
3943       using a different algorithm that is not Perl-compatible.  Some  of  the
3944       features  discussed  below are not available when DFA matching is used.
3945       The advantages and disadvantages of the alternative functions, and  how
3946       they  differ from the normal functions, are discussed in the pcrematch-
3947       ing page.
3948
3949
3950NEWLINE CONVENTIONS
3951
3952       PCRE supports five different conventions for indicating line breaks  in
3953       strings:  a  single  CR (carriage return) character, a single LF (line-
3954       feed) character, the two-character sequence CRLF, any of the three pre-
3955       ceding,  or  any Unicode newline sequence. The pcreapi page has further
3956       discussion about newlines, and shows how to set the newline  convention
3957       in the options arguments for the compiling and matching functions.
3958
3959       It  is also possible to specify a newline convention by starting a pat-
3960       tern string with one of the following five sequences:
3961
3962         (*CR)        carriage return
3963         (*LF)        linefeed
3964         (*CRLF)      carriage return, followed by linefeed
3965         (*ANYCRLF)   any of the three above
3966         (*ANY)       all Unicode newline sequences
3967
3968       These override the default and the options given to the compiling func-
3969       tion.  For  example,  on  a Unix system where LF is the default newline
3970       sequence, the pattern
3971
3972         (*CR)a.b
3973
3974       changes the convention to CR. That pattern matches "a\nb" because LF is
3975       no  longer  a  newline. Note that these special settings, which are not
3976       Perl-compatible, are recognized only at the very start  of  a  pattern,
3977       and  that  they  must  be  in  upper  case. If more than one of them is
3978       present, the last one is used.
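
       The rules for these leading sequences can be modelled in a few lines
       of Python. This is a sketch under stated assumptions, not PCRE's
       parser: the function name and the returned dictionary are invented,
       and only the sequences named so far in this document are handled.

```python
NEWLINE_SEQUENCES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}
FLAG_SEQUENCES = {"UTF8", "UTF16", "UCP", "NO_START_OPT"}


def split_leading_settings(pattern):
    # Returns (settings, rest_of_pattern).
    settings = {"newline": None, "flags": set()}
    pos = 0
    while pattern.startswith("(*", pos):
        end = pattern.find(")", pos)
        if end == -1:
            break
        name = pattern[pos + 2:end]          # case-sensitive: upper case only
        if name in NEWLINE_SEQUENCES:
            settings["newline"] = name       # a later setting replaces an earlier one
        elif name in FLAG_SEQUENCES:
            settings["flags"].add(name)
        else:
            break                            # not one of the leading sequences
        pos = end + 1
    return settings, pattern[pos:]
```

       Applied to (*CR)a.b, the model reports the CR convention and leaves
       a.b as the pattern proper. A lower case (*cr) is not recognized and
       is left in place for the rest of the compile to deal with.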
3979
3980       The newline convention affects the interpretation of the dot  metachar-
3981       acter  when  PCRE_DOTALL is not set, and also the behaviour of \N. How-
3982       ever, it does not affect  what  the  \R  escape  sequence  matches.  By
3983       default,  this is any Unicode newline sequence, for Perl compatibility.
3984       However, this can be changed; see the description of \R in the  section
3985       entitled  "Newline sequences" below. A change of \R setting can be com-
3986       bined with a change of newline convention.
3987
3988
3989CHARACTERS AND METACHARACTERS
3990
3991       A regular expression is a pattern that is  matched  against  a  subject
3992       string  from  left  to right. Most characters stand for themselves in a
3993       pattern, and match the corresponding characters in the  subject.  As  a
3994       trivial example, the pattern
3995
3996         The quick brown fox
3997
3998       matches a portion of a subject string that is identical to itself. When
3999       caseless matching is specified (the PCRE_CASELESS option), letters  are
4000       matched  independently  of case. In a UTF mode, PCRE always understands
4001       the concept of case for characters whose values are less than  128,  so
4002       caseless  matching  is always possible. For characters with higher val-
4003       ues, the concept of case is supported if PCRE is compiled with  Unicode
4004       property  support,  but  not  otherwise.   If  you want to use caseless
4005       matching for characters 128 and above, you must  ensure  that  PCRE  is
4006       compiled with Unicode property support as well as with UTF support.
4007
4008       The  power  of  regular  expressions  comes from the ability to include
4009       alternatives and repetitions in the pattern. These are encoded  in  the
4010       pattern by the use of metacharacters, which do not stand for themselves
4011       but instead are interpreted in some special way.
4012
4013       There are two different sets of metacharacters: those that  are  recog-
4014       nized  anywhere in the pattern except within square brackets, and those
4015       that are recognized within square brackets.  Outside  square  brackets,
4016       the metacharacters are as follows:
4017
4018         \      general escape character with several uses
4019         ^      assert start of string (or line, in multiline mode)
4020         $      assert end of string (or line, in multiline mode)
4021         .      match any character except newline (by default)
4022         [      start character class definition
4023         |      start of alternative branch
4024         (      start subpattern
4025         )      end subpattern
4026         ?      extends the meaning of (
4027                also 0 or 1 quantifier
4028                also quantifier minimizer
4029         *      0 or more quantifier
4030         +      1 or more quantifier
4031                also "possessive quantifier"
4032         {      start min/max quantifier
4033
4034       Part  of  a  pattern  that is in square brackets is called a "character
4035       class". In a character class the only metacharacters are:
4036
4037         \      general escape character
4038         ^      negate the class, but only if the first character
4039         -      indicates character range
4040         [      POSIX character class (only if followed by POSIX
4041                  syntax)
4042         ]      terminates the character class
4043
4044       The following sections describe the use of each of the metacharacters.
4045
4046
4047BACKSLASH
4048
4049       The backslash character has several uses. Firstly, if it is followed by
4050       a character that is not a number or a letter, it takes away any special
4051       meaning that character may have. This use of  backslash  as  an  escape
4052       character applies both inside and outside character classes.
4053
4054       For  example,  if  you want to match a * character, you write \* in the
4055       pattern.  This escaping action applies whether  or  not  the  following
4056       character  would  otherwise be interpreted as a metacharacter, so it is
4057       always safe to precede a non-alphanumeric  with  backslash  to  specify
4058       that  it stands for itself. In particular, if you want to match a back-
4059       slash, you write \\.
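
       Because a backslash before any non-alphanumeric is always safe, a
       literal string can be quoted mechanically. The Python sketch below
       shows the idea; the function name is an assumption, and Python's re
       module is used only as a stand-in to check the result, since it
       treats a backslash before ASCII punctuation in the same way.

```python
import re


def quote_literal(text):
    # Put a backslash before every ASCII character that is not a letter
    # or a digit; leave everything else unchanged.
    out = []
    for ch in text:
        if ord(ch) < 128 and not ch.isalnum():
            out.append("\\")
        out.append(ch)
    return "".join(out)


# The quoted form matches the original text and nothing looser.
quoted = quote_literal("1+1=2?")
matches_itself = re.fullmatch(quoted, "1+1=2?") is not None
```

       Characters above 127 are left alone here, in line with the next
       paragraph's note that they are treated as literals in a UTF mode.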
4060
4061       In a UTF mode, only ASCII numbers and letters have any special  meaning
4062       after  a  backslash.  All  other characters (in particular, those whose
4063       codepoints are greater than 127) are treated as literals.
4064
4065       If a pattern is compiled with the PCRE_EXTENDED option, white space  in
4066       the  pattern (other than in a character class) and characters between a
4067       # outside a character class and the next newline are ignored. An escap-
4068       ing  backslash  can  be used to include a white space or # character as
4069       part of the pattern.
4070
4071       If you want to remove the special meaning from a  sequence  of  charac-
4072       ters,  you can do so by putting them between \Q and \E. This is differ-
4073       ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
4074       sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
4075       tion. Note the following examples:
4076
4077         Pattern            PCRE matches   Perl matches
4078
4079         \Qabc$xyz\E        abc$xyz        abc followed by the
4080                                             contents of $xyz
4081         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
4082         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
4083
4084       The \Q...\E sequence is recognized both inside  and  outside  character
4085       classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
4086       is not followed by \E later in the pattern, the literal  interpretation
4087       continues  to  the  end  of  the pattern (that is, \E is assumed at the
4088       end). If the isolated \Q is inside a character class,  this  causes  an
4089       error, because the character class is not terminated.
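
       The \Q...\E rules above can be sketched as a rewrite into ordinary
       backslash escapes. The Python code below is a model, not PCRE's
       implementation: the function name is invented, character classes are
       not handled, and Python's re module (which has no \Q...\E of its own)
       is used only to check the rewritten patterns.

```python
import re


def expand_quotes(pattern):
    # Rewrite \Q...\E spans as backslash-escaped literal characters.
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == r"\E":
                quoting = False
                i += 2
            else:
                out.append(re.escape(pattern[i]))  # $ and @ become literals too
                i += 1
        elif two == r"\Q":
            quoting = True
            i += 2
        elif two == r"\E":
            i += 2                                 # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)                        # keep other escapes intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # If \Q was never closed, quoting simply ran to the end of the pattern.
    return "".join(out)
```

       With this model the three patterns in the table rewrite to forms
       that match abc$xyz, abc\$xyz, and abc$xyz respectively, which is the
       "PCRE matches" column.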
4090
4091   Non-printing characters
4092
4093       A second use of backslash provides a way of encoding non-printing char-
4094       acters in patterns in a visible manner. There is no restriction on  the
4095       appearance  of non-printing characters, apart from the binary zero that
4096       terminates a pattern, but when a pattern  is  being  prepared  by  text
4097       editing,  it  is  often  easier  to  use  one  of  the following escape
4098       sequences than the binary character it represents:
4099
4100         \a        alarm, that is, the BEL character (hex 07)
4101         \cx       "control-x", where x is any ASCII character
4102         \e        escape (hex 1B)
4103         \f        form feed (hex 0C)
4104         \n        linefeed (hex 0A)
4105         \r        carriage return (hex 0D)
4106         \t        tab (hex 09)
4107         \ddd      character with octal code ddd, or back reference
4108         \xhh      character with hex code hh
4109         \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
4110         \uhhhh    character with hex code hhhh (JavaScript mode only)
4111
4112       The precise effect of \cx is as follows: if x is a lower  case  letter,
4113       it  is converted to upper case. Then bit 6 of the character (hex 40) is
4114       inverted.  Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({
4115       is  7B),  while  \c; becomes hex 7B (; is 3B). If the byte following \c
4116       has a value greater than 127, a compile-time error occurs.  This  locks
4117       out non-ASCII characters in all modes. (When PCRE is compiled in EBCDIC
4118       mode, all byte values are valid. A lower case letter  is  converted  to
4119       upper case, and then the 0xc0 bits are flipped.)
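
       The ASCII rule for \cx is short enough to write out. The following
       Python sketch reproduces the arithmetic described above; the function
       name is an assumption, and the EBCDIC variant is not modelled.

```python
def control_escape(x):
    # Return the character code that \cx stands for (ASCII rule only).
    code = ord(x)
    if code > 127:
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code -= 0x20          # convert lower case to upper case
    return code ^ 0x40        # invert bit 6 (hex 40)
```

       For the examples in the text, control_escape("z") is hex 1A,
       control_escape("{") is hex 3B, and control_escape(";") is hex 7B.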
4120
4121       By  default,  after  \x,  from  zero to two hexadecimal digits are read
4122       (letters can be in upper or lower case). Any number of hexadecimal dig-
4123       its may appear between \x{ and }, but the character code is constrained
4124       as follows:
4125
4126         8-bit non-UTF mode    less than 0x100
4127         8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
4128         16-bit non-UTF mode   less than 0x10000
4129         16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
4130
4131       Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
4132       called "surrogate" codepoints).
4133
4134       If  characters  other than hexadecimal digits appear between \x{ and },
4135       or if there is no terminating }, this form of escape is not recognized.
4136       Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
4137       escape, with no following digits, giving a  character  whose  value  is
4138       zero.
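
       The default (non-JavaScript) reading of \x can be modelled as
       follows. This Python sketch is an illustration only: the function
       name and return format are assumptions, the range limits in the table
       above are not enforced, and empty braces, which the text does not
       cover, are treated as unrecognized.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_x_escape(pattern, pos):
    # Decode the \x escape starting at pattern[pos].
    # Returns (character_code, index_just_after_the_escape).
    assert pattern.startswith("\\x", pos)
    start = pos + 2
    if pattern.startswith("{", start):
        close = pattern.find("}", start)
        digits = pattern[start + 1:close] if close != -1 else ""
        if digits and all(ch in HEX_DIGITS for ch in digits):
            return int(digits, 16), close + 1
        # Otherwise the brace form is not recognized; fall through.
    end = start
    while end < len(pattern) and end < start + 2 and pattern[end] in HEX_DIGITS:
        end += 1                      # basic form: zero to two hex digits
    value = int(pattern[start:end], 16) if end > start else 0
    return value, end
```

       In this model \x{12g} and \x{41 both decode as a zero character, with
       the brace and what follows it left to be read as ordinary pattern
       text.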
4139
4140       If  the  PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
4141       is as just described only when it is followed by two  hexadecimal  dig-
4142       its.   Otherwise,  it  matches  a  literal "x" character. In JavaScript
4143       mode, support for code points greater than 256 is provided by \u, which
4144       must  be  followed  by  four hexadecimal digits; otherwise it matches a
4145       literal "u" character.  Character codes specified by \u  in  JavaScript
4146       mode  are  constrained in the same way as those specified by \x in non-
4147       JavaScript mode.
4148
4149       Characters whose value is less than 256 can be defined by either of the
4150       two  syntaxes for \x (or by \u in JavaScript mode). There is no differ-
4151       ence in the way they are handled. For example, \xdc is exactly the same
4152       as \x{dc} (or \u00dc in JavaScript mode).
4153
4154       After  \0  up  to two further octal digits are read. If there are fewer
4155       than two digits, just  those  that  are  present  are  used.  Thus  the
4156       sequence \0\x\07 specifies two binary zeros followed by a BEL character
4157       (code value 7). Make sure you supply two digits after the initial  zero
4158       if the pattern character that follows is itself an octal digit.
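
       The advice about supplying two digits is easier to see with a small
       model of the \0 rule. The Python sketch below is illustrative only;
       the function name and return format are assumptions.

```python
OCTAL_DIGITS = "01234567"


def parse_zero_escape(pattern, pos):
    # Decode the \0 escape starting at pattern[pos].
    # Returns (character_code, index_just_after_the_escape).
    assert pattern.startswith("\\0", pos)
    end = pos + 2
    # Read up to two further octal digits after the initial zero.
    while end < len(pattern) and end < pos + 4 and pattern[end] in OCTAL_DIGITS:
        end += 1
    return int(pattern[pos + 1:end], 8), end
```

       Under this model \07 is the BEL character, whereas a binary zero
       followed by a literal "7" has to be written as \0007: the escape
       consumes \000 and the final 7 is left as a data character.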
4159
4160       The handling of a backslash followed by a digit other than 0 is compli-
4161       cated.  Outside a character class, PCRE reads it and any following dig-
4162       its  as  a  decimal  number. If the number is less than 10, or if there
4163       have been at least that many previous capturing left parentheses in the
4164       expression,  the  entire  sequence  is  taken  as  a  back reference. A
4165       description of how this works is given later, following the  discussion
4166       of parenthesized subpatterns.
4167
4168       Inside  a  character  class, or if the decimal number is greater than 9
4169       and there have not been that many capturing subpatterns, PCRE  re-reads
4170       up to three octal digits following the backslash, and uses them to gen-
4171       erate a data character. Any subsequent digits stand for themselves. The
4172       value  of  the  character  is constrained in the same way as characters
4173       specified in hexadecimal.  For example:
4174
4175         \040   is another way of writing a space
4176         \40    is the same, provided there are fewer than 40
4177                   previous capturing subpatterns
4178         \7     is always a back reference
4179         \11    might be a back reference, or another way of
4180                   writing a tab
4181         \011   is always a tab
4182         \0113  is a tab followed by the character "3"
4183         \113   might be a back reference, otherwise the
4184                   character with octal code 113
4185         \377   might be a back reference, otherwise
4186                   the value 255 (decimal)
4187         \81    is either a back reference, or a binary zero
4188                   followed by the two characters "8" and "1"
4189
4190       Note that octal values of 100 or greater must not be  introduced  by  a
4191       leading zero, because no more than three octal digits are ever read.
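
       The decision between a back reference and an octal character code
       can be written down as a short Python model. It is a sketch of the
       rules in this section, not PCRE's code: the function name, argument
       names, and result tuples are assumptions, and the limit on the
       resulting character value is not enforced.

```python
OCTAL_DIGITS = "01234567"


def classify_digit_escape(digits, captures_so_far, in_class=False):
    # digits: the run of decimal digits after the backslash (first is 1-9).
    # captures_so_far: capturing left parentheses seen earlier in the pattern.
    # Returns ("backref", number, "") or ("char", code, leftover_digits).
    number = int(digits)
    if not in_class and (number < 10 or captures_so_far >= number):
        return ("backref", number, "")
    # Re-read up to three octal digits; later digits stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL_DIGITS:
        n += 1
    code = int(digits[:n], 8) if n else 0
    return ("char", code, digits[n:])
```

       With no earlier capturing groups the model reads \11 as a tab and
       \81 as a binary zero followed by "8" and "1", while \7 is a back
       reference however many groups there are.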
4192
4193       All the sequences that define a single character value can be used both
4194       inside and outside character classes. In addition, inside  a  character
4195       class, \b is interpreted as the backspace character (hex 08).
4196
4197       \N  is not allowed in a character class. \B, \R, and \X are not special
4198       inside a character class. Like  other  unrecognized  escape  sequences,
4199       they  are  treated  as  the  literal  characters  "B",  "R", and "X" by
4200       default, but cause an error if the PCRE_EXTRA option is set. Outside  a
4201       character class, these sequences have different meanings.
4202
4203   Unsupported escape sequences
4204
4205       In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
4206       handler and used  to  modify  the  case  of  following  characters.  By
4207       default,  PCRE does not support these escape sequences. However, if the
4208       PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U"  character,  and
4209       \u can be used to define a character by code point, as described in the
4210       previous section.
4211
4212   Absolute and relative back references
4213
4214       The sequence \g followed by an unsigned or a negative  number,  option-
4215       ally  enclosed  in braces, is an absolute or relative back reference. A
4216       named back reference can be coded as \g{name}. Back references are dis-
4217       cussed later, following the discussion of parenthesized subpatterns.
4218
4219   Absolute and relative subroutine calls
4220
4221       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
4222       name or a number enclosed either in angle brackets or single quotes, is
4223       an  alternative  syntax for referencing a subpattern as a "subroutine".
4224       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
4225       \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
4226       reference; the latter is a subroutine call.
4227
4228   Generic character types
4229
4230       Another use of backslash is for specifying generic character types:
4231
4232         \d     any decimal digit
4233         \D     any character that is not a decimal digit
4234         \h     any horizontal white space character
4235         \H     any character that is not a horizontal white space character
4236         \s     any white space character
4237         \S     any character that is not a white space character
4238         \v     any vertical white space character
4239         \V     any character that is not a vertical white space character
4240         \w     any "word" character
4241         \W     any "non-word" character
4242
4243       There is also the single sequence \N, which matches a non-newline char-
4244       acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is
4245       not set. Perl also uses \N to match characters by name; PCRE  does  not
4246       support this.
4247
4248       Each  pair of lower and upper case escape sequences partitions the com-
4249       plete set of characters into two disjoint  sets.  Any  given  character
4250       matches  one, and only one, of each pair. The sequences can appear both
4251       inside and outside character classes. They each match one character  of
4252       the  appropriate  type.  If the current matching point is at the end of
4253       the subject string, all of them fail, because there is no character  to
4254       match.
4255
4256       For  compatibility  with Perl, \s does not match the VT character (code
4257       11).  This makes it different from the POSIX "space"  class.  The  \s
4258       characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
4259       "use locale;" is included in a Perl script, \s may match the VT charac-
4260       ter. In PCRE, it never does.
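
       The difference from the C library's idea of white space can be  checked
       with  a  small  sketch. This is not PCRE source code; the function names
       are invented for illustration, and the comparison assumes  the  default
       "C" locale, in which isspace() accepts VT.

```c
#include <assert.h>
#include <ctype.h>

/* Sketch only, not PCRE source: the default (non-UCP) \s set listed
   above, namely HT (9), LF (10), FF (12), CR (13) and space (32). */
int pcre_style_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Returns 1 when isspace() and the \s set above disagree about the
   character value c, otherwise 0. isspace() needs a value in the
   unsigned char range, hence the precondition. */
int space_sets_differ(int c)
{
    assert(c >= 0 && c <= 255);
    return (isspace(c) != 0) != (pcre_style_space(c) != 0);
}
```

       Under these assumptions the two sets should differ only for VT (11).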
4261
4262       A  "word"  character is an underscore or any character that is a letter
4263       or digit.  By default, the definition of letters  and  digits  is  con-
4264       trolled  by PCRE's low-valued character tables, and may vary if locale-
4265       specific matching is taking place (see "Locale support" in the  pcreapi
4266       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
4267       systems, or "french" in Windows, some character codes greater than  127
4268       are  used  for  accented letters, and these are then matched by \w. The
4269       use of locales with Unicode is discouraged.
4270
4271       By default, in a UTF mode, characters  with  values  greater  than  127
4272       never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
4273       sequences retain their original meanings from before  UTF  support  was
4274       available,  mainly for efficiency reasons. However, if PCRE is compiled
4275       with Unicode property support, and the PCRE_UCP option is set, the  be-
4276       haviour  is  changed  so  that Unicode properties are used to determine
4277       character types, as follows:
4278
4279         \d  any character that \p{Nd} matches (decimal digit)
4280         \s  any character that \p{Z} matches, plus HT, LF, FF, CR
4281         \w  any character that \p{L} or \p{N} matches, plus underscore
4282
4283       The upper case escapes match the inverse sets of characters. Note  that
4284       \d  matches  only decimal digits, whereas \w matches any Unicode digit,
4285       as well as any Unicode letter, and underscore. Note also that  PCRE_UCP
4286       affects \b and \B because they are defined in terms of  \w  and  \W.
4287       Matching these sequences is noticeably slower when PCRE_UCP is set.
4288
4289       The sequences \h, \H, \v, and \V are features that were added  to  Perl
4290       at  release  5.10. In contrast to the other sequences, which match only
4291       ASCII characters by default, these  always  match  certain  high-valued
4292       codepoints,  whether or not PCRE_UCP is set. The horizontal space char-
4293       acters are:
4294
4295         U+0009     Horizontal tab
4296         U+0020     Space
4297         U+00A0     Non-break space
4298         U+1680     Ogham space mark
4299         U+180E     Mongolian vowel separator
4300         U+2000     En quad
4301         U+2001     Em quad
4302         U+2002     En space
4303         U+2003     Em space
4304         U+2004     Three-per-em space
4305         U+2005     Four-per-em space
4306         U+2006     Six-per-em space
4307         U+2007     Figure space
4308         U+2008     Punctuation space
4309         U+2009     Thin space
4310         U+200A     Hair space
4311         U+202F     Narrow no-break space
4312         U+205F     Medium mathematical space
4313         U+3000     Ideographic space
4314
4315       The vertical space characters are:
4316
4317         U+000A     Linefeed
4318         U+000B     Vertical tab
4319         U+000C     Form feed
4320         U+000D     Carriage return
4321         U+0085     Next line
4322         U+2028     Line separator
4323         U+2029     Paragraph separator
4324
4325       In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
4326       256 are relevant.
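
       The two lists above can be written as a pair of membership tests.  This
       is  a  sketch  for illustration only (is_hspace and is_vspace are hypo-
       thetical names, not PCRE functions); it simply transcribes the  tables,
       folding U+2000 to U+200A into a single range.

```c
#include <assert.h>

/* Sketch of the \h set listed above; c is a Unicode code point. */
int is_hspace(unsigned long c)
{
    assert(c <= 0x10FFFFUL);
    switch (c) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680: case 0x180E:
    case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        return c >= 0x2000 && c <= 0x200A;  /* En quad .. Hair space */
    }
}

/* Sketch of the \v set listed above. */
int is_vspace(unsigned long c)
{
    assert(c <= 0x10FFFFUL);
    switch (c) {
    case 0x000A: case 0x000B: case 0x000C: case 0x000D:
    case 0x0085: case 0x2028: case 0x2029:
        return 1;
    default:
        return 0;
    }
}
```

       \H and \V correspond to the negations of these two tests.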
4327
4328   Newline sequences
4329
4330       Outside  a  character class, by default, the escape sequence \R matches
4331       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
4332       to the following:
4333
4334         (?>\r\n|\n|\x0b|\f|\r|\x85)
4335
4336       This  is  an  example  of an "atomic group", details of which are given
4337       below.  This particular group matches either the two-character sequence
4338       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
4339       U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
4340       riage  return,  U+000D),  or NEL (next line, U+0085). The two-character
4341       sequence is treated as a single unit that cannot be split.
4342
4343       In other modes, two additional characters whose codepoints are  greater
4344       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
4345       rator, U+2029).  Unicode character property support is not  needed  for
4346       these characters to be recognized.
4347
4348       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
4349       the complete set  of  Unicode  line  endings)  by  setting  the  option
4350       PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
4351       (BSR is an abbreviation for "backslash R".) This can be made the default
4352       when  PCRE  is  built;  if this is the case, the other behaviour can be
4353       requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
4354       specify  these  settings  by  starting a pattern string with one of the
4355       following sequences:
4356
4357         (*BSR_ANYCRLF)   CR, LF, or CRLF only
4358         (*BSR_UNICODE)   any Unicode newline sequence
4359
4360       These override the default and the options given to the compiling func-
4361       tion,  but  they  can  themselves  be  overridden by options given to a
4362       matching function. Note that these  special  settings,  which  are  not
4363       Perl-compatible,  are  recognized  only at the very start of a pattern,
4364       and that they must be in upper case.  If  more  than  one  of  them  is
4365       present,  the  last  one is used. They can be combined with a change of
4366       newline convention; for example, a pattern can start with:
4367
4368         (*ANY)(*BSR_ANYCRLF)
4369
4370       They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special
4371       sequences.  Inside  a character class, \R is treated as an unrecognized
4372       escape sequence, and so matches the letter "R" by default,  but  causes
4373       an error if PCRE_EXTRA is set.
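
       As  a  rough  model  of the 8-bit non-UTF-8 behaviour described in this
       section, the following sketch reports how many bytes \R  would  consume
       at  a given point. It is an illustration, not PCRE's implementation; the
       function name and the anycrlf parameter (standing for PCRE_BSR_ANYCRLF)
       are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: how many bytes \R would consume at the start of s in
   8-bit non-UTF-8 mode, or 0 for no match. A non-zero anycrlf models
   PCRE_BSR_ANYCRLF (CR, LF, or CRLF only). */
size_t bsr_match_length(const unsigned char *s, size_t len, int anycrlf)
{
    assert(s != NULL);
    if (len == 0) return 0;
    /* CRLF is tried first and taken as one unit that is never split. */
    if (len >= 2 && s[0] == '\r' && s[1] == '\n') return 2;
    switch (s[0]) {
    case '\n': case '\r':
        return 1;
    case 0x0b: case 0x0c: case 0x85:   /* VT, FF, NEL */
        return anycrlf ? 0 : 1;
    default:
        return 0;
    }
}
```

       Because  the  CRLF  test  comes first and its result is final, the model
       behaves like the atomic group: it never falls  back  to  matching  only
       the CR of a CRLF pair.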
4374
4375   Unicode character properties
4376
4377       When PCRE is built with Unicode character property support, three addi-
4378       tional escape sequences that match characters with specific  properties
4379       are  available.   When  in 8-bit non-UTF-8 mode, these sequences are of
4380       course limited to testing characters whose  codepoints  are  less  than
4381       256, but they do work in this mode.  The extra escape sequences are:
4382
4383         \p{xx}   a character with the xx property
4384         \P{xx}   a character without the xx property
4385         \X       an extended Unicode sequence
4386
4387       The  property  names represented by xx above are limited to the Unicode
4388       script names, the general category properties, "Any", which matches any
4389       character   (including  newline),  and  some  special  PCRE  properties
4390       (described in the next section).  Other Perl properties such as  "InMu-
4391       sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}
4392       does not match any characters, so always causes a match failure.
4393
4394       Sets of Unicode characters are defined as belonging to certain scripts.
4395       A  character from one of these sets can be matched using a script name.
4396       For example:
4397
4398         \p{Greek}
4399         \P{Han}
4400
4401       Those that are not part of an identified script are lumped together  as
4402       "Common". The current list of scripts is:
4403
4404       Arabic,  Armenian,  Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
4405       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Chakma,
4406       Cham,  Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
4407       Devanagari,  Egyptian_Hieroglyphs,  Ethiopic,   Georgian,   Glagolitic,
4408       Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
4409       gana,  Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,   Inscrip-
4410       tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
4411       Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B,  Lisu,  Lycian,
4412       Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
4413       Meroitic_Hieroglyphs,  Miao,  Mongolian,  Myanmar,  New_Tai_Lue,   Nko,
4414       Ogham,    Old_Italic,   Old_Persian,   Old_South_Arabian,   Old_Turkic,
4415       Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,  Samari-
4416       tan,  Saurashtra,  Sharada,  Shavian, Sinhala, Sora_Sompeng, Sundanese,
4417       Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,  Tai_Viet,
4418       Takri,  Tamil,  Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
4419       Yi.
4420
4421       Each character has exactly one Unicode general category property, spec-
4422       ified  by a two-letter abbreviation. For compatibility with Perl, nega-
4423       tion can be specified by including a  circumflex  between  the  opening
4424       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
4425       \P{Lu}.
4426
4427       If only one letter is specified with \p or \P, it includes all the gen-
4428       eral  category properties that start with that letter. In this case, in
4429       the absence of negation, the curly brackets in the escape sequence  are
4430       optional; these two examples have the same effect:
4431
4432         \p{L}
4433         \pL
4434
4435       The following general category property codes are supported:
4436
4437         C     Other
4438         Cc    Control
4439         Cf    Format
4440         Cn    Unassigned
4441         Co    Private use
4442         Cs    Surrogate
4443
4444         L     Letter
4445         Ll    Lower case letter
4446         Lm    Modifier letter
4447         Lo    Other letter
4448         Lt    Title case letter
4449         Lu    Upper case letter
4450
4451         M     Mark
4452         Mc    Spacing mark
4453         Me    Enclosing mark
4454         Mn    Non-spacing mark
4455
4456         N     Number
4457         Nd    Decimal number
4458         Nl    Letter number
4459         No    Other number
4460
4461         P     Punctuation
4462         Pc    Connector punctuation
4463         Pd    Dash punctuation
4464         Pe    Close punctuation
4465         Pf    Final punctuation
4466         Pi    Initial punctuation
4467         Po    Other punctuation
4468         Ps    Open punctuation
4469
4470         S     Symbol
4471         Sc    Currency symbol
4472         Sk    Modifier symbol
4473         Sm    Mathematical symbol
4474         So    Other symbol
4475
4476         Z     Separator
4477         Zl    Line separator
4478         Zp    Paragraph separator
4479         Zs    Space separator
4480
4481       The  special property L& is also supported: it matches a character that
4482       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
4483       classified as a modifier or "other".
4484
4485       The  Cs  (Surrogate)  property  applies only to characters in the range
4486       U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
4487       so  cannot  be  tested  by  PCRE, unless UTF validity checking has been
4488       turned   off   (see   the   discussion   of   PCRE_NO_UTF8_CHECK    and
4489       PCRE_NO_UTF16_CHECK  in the pcreapi page). Perl does not support the Cs
4490       property.
4491
4492       The long synonyms for  property  names  that  Perl  supports  (such  as
4493       \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
4494       any of these properties with "Is".
4495
4496       No character that is in the Unicode table has the Cn (unassigned) prop-
4497       erty.  Instead, this property is assumed for any code point that is not
4498       in the Unicode table.
4499
4500       Specifying caseless matching does not affect  these  escape  sequences.
4501       For example, \p{Lu} always matches only upper case letters.
4502
4503       The  \X  escape  matches  any number of Unicode characters that form an
4504       extended Unicode sequence. \X is equivalent to
4505
4506         (?>\PM\pM*)
4507
4508       That is, it matches a character without the "mark"  property,  followed
4509       by  zero  or  more  characters with the "mark" property, and treats the
4510       sequence as an atomic group (see below).  Characters  with  the  "mark"
4511       property  are  typically  accents  that affect the preceding character.
4512       None of them have codepoints less than 256, so in 8-bit non-UTF-8  mode
4513       \X matches any one character.
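
       The shape of (?>\PM\pM*) can be modelled on an array  of  code  points.
       The  sketch  below is illustrative only and makes a loud simplification:
       it treats just the Combining Diacritical  Marks  block  (U+0300  to
       U+036F)  as  having  the  "mark" property, whereas the real property is
       much larger. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION for illustration: only the Combining Diacritical Marks
   block (U+0300..U+036F) is treated as having the "mark" property.
   The real property covers many more characters. */
int has_mark_property(unsigned long c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Number of code points that (?>\PM\pM*) would consume at cp[0],
   or 0 if there is no character or the first one is itself a mark. */
size_t extended_sequence_length(const unsigned long *cp, size_t n)
{
    size_t i;
    assert(cp != NULL);
    if (n == 0 || has_mark_property(cp[0])) return 0;
    i = 1;
    while (i < n && has_mark_property(cp[i])) i++;
    return i;
}
```

       For the code points of "e" followed by U+0301 (combining acute  accent)
       the  model  consumes  two code points; for a plain letter followed by a
       non-mark it consumes one.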
4514
4515       Note that recent versions of Perl have changed \X to match what Unicode
4516       calls an "extended grapheme cluster", which has a more complicated def-
4517       inition.
4518
4519       Matching  characters  by Unicode property is not fast, because PCRE has
4520       to search a structure that contains  data  for  over  fifteen  thousand
4521       characters. That is why the traditional escape sequences such as \d and
4522       \w do not use Unicode properties in PCRE by  default,  though  you  can
4523       make  them do so by setting the PCRE_UCP option or by starting the pat-
4524       tern with (*UCP).
4525
4526   PCRE's additional properties
4527
4528       As well as the standard Unicode properties described  in  the  previous
4529       section,  PCRE supports four more that make it possible to convert tra-
4530       ditional escape sequences such as \w and \s and POSIX character classes
4531       to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
4532       erties internally when PCRE_UCP is set. They are:
4533
4534         Xan   Any alphanumeric character
4535         Xps   Any POSIX space character
4536         Xsp   Any Perl space character
4537         Xwd   Any Perl "word" character
4538
4539       Xan matches characters that have either the L (letter) or the  N  (num-
4540       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
4541       form feed, or carriage return, and any other character that has  the  Z
4542       (separator) property.  Xsp is the same as Xps, except that vertical tab
4543       is excluded. Xwd matches the same characters as Xan, plus underscore.
4544
4545   Resetting the match start
4546
4547       The escape sequence \K causes any previously matched characters not  to
4548       be included in the final matched sequence. For example, the pattern:
4549
4550         foo\Kbar
4551
4552       matches  "foobar",  but reports that it has matched "bar". This feature
4553       is similar to a lookbehind assertion (described  below).   However,  in
4554       this  case, the part of the subject before the real match does not have
4555       to be of fixed length, as lookbehind assertions do. The use of \K  does
4556       not  interfere  with  the setting of captured substrings.  For example,
4557       when the pattern
4558
4559         (foo)\Kbar
4560
4561       matches "foobar", the first substring is still set to "foo".
4562
4563       Perl documents that the use  of  \K  within  assertions  is  "not  well
4564       defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
4565       assertions, but is ignored in negative assertions.
4566
4567   Simple assertions
4568
4569       The final use of backslash is for certain simple assertions. An  asser-
4570       tion  specifies a condition that has to be met at a particular point in
4571       a match, without consuming any characters from the subject string.  The
4572       use  of subpatterns for more complicated assertions is described below.
4573       The backslashed assertions are:
4574
4575         \b     matches at a word boundary
4576         \B     matches when not at a word boundary
4577         \A     matches at the start of the subject
4578         \Z     matches at the end of the subject
4579                 also matches before a newline at the end of the subject
4580         \z     matches only at the end of the subject
4581         \G     matches at the first matching position in the subject
4582
4583       Inside a character class, \b has a different meaning;  it  matches  the
4584       backspace  character.  If  any  other  of these assertions appears in a
4585       character class, by default it matches the corresponding literal  char-
4586       acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
4587       PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-
4588       ated instead.
4589
4590       A  word  boundary is a position in the subject string where the current
4591       character and the previous character do not both match \w or  \W  (i.e.
4592       one  matches  \w  and the other matches \W), or the start or end of the
4593       string if the first or last character matches \w,  respectively.  In  a
4594       UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
4595       PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
4596       PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
4597       quence. However, whatever follows \b normally determines which  it  is.
4598       For example, the fragment \ba matches "a" at the start of a word.
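
       The word boundary rule above can be expressed in a few lines. This is a
       sketch under stated assumptions, not PCRE code: \w is taken to  be  the
       default  ASCII  set (letters, digits, and underscore in the "C" locale),
       and the function names are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Default ASCII \w: a letter, a digit, or underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if offset pos (0..len) in s is a word boundary: exactly
   one of the previous and current characters matches \w. A missing
   character (before the start or after the end) counts as non-word. */
int is_word_boundary(const char *s, size_t len, size_t pos)
{
    int before, after;
    assert(s != NULL && pos <= len);
    before = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    after  = pos < len && is_word_char((unsigned char)s[pos]);
    return before != after;
}
```

       \B corresponds to the negation of is_word_boundary() at the same  posi-
       tion.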
4599
4600       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
4601       and dollar (described in the next section) in that they only ever match
4602       at  the  very start and end of the subject string, whatever options are
4603       set. Thus, they are independent of multiline mode. These  three  asser-
4604       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
4605       affect only the behaviour of the circumflex and dollar  metacharacters.
4606       However,  if the startoffset argument of pcre_exec() is non-zero, indi-
4607       cating that matching is to start at a point other than the beginning of
4608       the  subject,  \A  can never match. The difference between \Z and \z is
4609       that \Z matches before a newline at the end of the string as well as at
4610       the very end, whereas \z matches only at the end.
4611
4612       The  \G assertion is true only when the current matching position is at
4613       the start point of the match, as specified by the startoffset  argument
4614       of  pcre_exec().  It  differs  from \A when the value of startoffset is
4615       non-zero. By calling pcre_exec() multiple times with appropriate
4616       arguments, you can mimic Perl's /g option, and it is in this kind of
4617       implementation that \G can be useful.
4618
4619       Note, however, that PCRE's interpretation of \G, as the  start  of  the
4620       current match, is subtly different from Perl's, which defines it as the
4621       end of the previous match. In Perl, these can  be  different  when  the
4622       previously  matched  string was empty. Because PCRE does just one match
4623       at a time, it cannot reproduce this behaviour.
4624
4625       If all the alternatives of a pattern begin with \G, the  expression  is
4626       anchored to the starting match position, and the "anchored" flag is set
4627       in the compiled regular expression.
4628
4629
4630CIRCUMFLEX AND DOLLAR
4631
4632       Outside a character class, in the default matching mode, the circumflex
4633       character  is  an  assertion  that is true only if the current matching
4634       point is at the start of the subject string. If the  startoffset  argu-
4635       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
4636       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
4637       has an entirely different meaning (see below).
4638
4639       Circumflex  need  not be the first character of the pattern if a number
4640       of alternatives are involved, but it should be the first thing in  each
4641       alternative  in  which  it appears if the pattern is ever to match that
4642       branch. If all possible alternatives start with a circumflex, that  is,
4643       if  the  pattern  is constrained to match only at the start of the sub-
4644       ject, it is said to be an "anchored" pattern.  (There  are  also  other
4645       constructs that can cause a pattern to be anchored.)
4646
4647       A  dollar  character  is  an assertion that is true only if the current
4648       matching point is at the end of  the  subject  string,  or  immediately
4649       before a newline at the end of the string (by default). Dollar need not
4650       be the last character of the pattern if a number  of  alternatives  are
4651       involved,  but  it  should  be  the last item in any branch in which it
4652       appears. Dollar has no special meaning in a character class.
4653
4654       The meaning of dollar can be changed so that it  matches  only  at  the
4655       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
4656       compile time. This does not affect the \Z assertion.
4657
4658       The meanings of the circumflex and dollar characters are changed if the
4659       PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
4660       matches immediately after internal newlines as well as at the start  of
4661       the  subject  string.  It  does not match after a newline that ends the
4662       string. A dollar matches before any newlines in the string, as well  as
4663       at  the very end, when PCRE_MULTILINE is set. When newline is specified
4664       as the two-character sequence CRLF, isolated CR and  LF  characters  do
4665       not indicate newlines.
4666
4667       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
4668       (where \n represents a newline) in multiline mode, but  not  otherwise.
4669       Consequently,  patterns  that  are anchored in single line mode because
4670       all branches start with ^ are not anchored in  multiline  mode,  and  a
4671       match  for  circumflex  is  possible  when  the startoffset argument of
4672       pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
4673       PCRE_MULTILINE is set.
4674
4675       Note  that  the sequences \A, \Z, and \z can be used to match the start
4676       and end of the subject in both modes, and if all branches of a  pattern
4677       start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
4678       set.
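
       The  rules  in  this  section can be summarized as two position tests. The
       following is a minimal sketch, not PCRE's implementation.  It  assumes
       that  newline is the single character LF, that matching starts at offset
       zero, and that PCRE_NOTBOL and PCRE_NOTEOL are not  set;  the  function
       and parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only; newline is assumed to be the single character LF. */

/* Does ^ match at offset pos of subject s (length len)? */
int circumflex_matches(const char *s, size_t len, size_t pos, int multiline)
{
    assert(s != NULL && pos <= len);
    if (pos == 0) return 1;
    /* After an internal newline, but not after one that ends the string. */
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* Does $ match at offset pos? dollar_endonly models
   PCRE_DOLLAR_ENDONLY and is ignored in multiline mode. */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
    assert(s != NULL && pos <= len);
    if (pos == len) return 1;                  /* very end of the subject */
    if (multiline) return s[pos] == '\n';      /* before any newline */
    if (dollar_endonly) return 0;
    return pos + 1 == len && s[pos] == '\n';   /* before a final newline */
}
```

       In  this  model,  the  default  single-line  dollar accepts the same posi-
       tions as \Z, and the dollar_endonly variant the same positions as \z.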
4679
4680
4681FULL STOP (PERIOD, DOT) AND \N
4682
4683       Outside a character class, a dot in the pattern matches any one charac-
4684       ter  in  the subject string except (by default) a character that signi-
4685       fies the end of a line.
4686
4687       When a line ending is defined as a single character, dot never  matches
4688       that  character; when the two-character sequence CRLF is used, dot does
4689       not match CR if it is immediately followed  by  LF,  but  otherwise  it
4690       matches  all characters (including isolated CRs and LFs). When any Uni-
4691       code line endings are being recognized, dot does not match CR or LF  or
4692       any of the other line ending characters.
4693
4694       The  behaviour  of  dot  with regard to newlines can be changed. If the
4695       PCRE_DOTALL option is set, a dot matches  any  one  character,  without
4696       exception. If the two-character sequence CRLF is present in the subject
4697       string, it takes two dots to match it.
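
       The  interaction  between  dot  and the newline convention can be sketched
       as follows for 8-bit non-UTF-8 mode. This is an illustration  only:  the
       enum  and  function  names  are  invented, and the "any" case lists only
       the line ending characters with codepoints below 256  (CR,  LF,  VT,  FF,
       NEL).

```c
#include <assert.h>
#include <stddef.h>

/* Newline conventions modelled by this sketch (names are illustrative,
   not PCRE identifiers). */
enum newline_kind { NL_LF, NL_CR, NL_CRLF, NL_ANY };

/* Does dot match the byte at s[pos] in 8-bit non-UTF-8 mode? */
int dot_matches(const unsigned char *s, size_t len, size_t pos,
                enum newline_kind nl, int dotall)
{
    unsigned char c;
    assert(s != NULL);
    if (pos >= len) return 0;          /* no character left to match */
    if (dotall) return 1;              /* any one character, no exception */
    c = s[pos];
    switch (nl) {
    case NL_LF:
        return c != '\n';
    case NL_CR:
        return c != '\r';
    case NL_CRLF:
        /* Only a CR immediately followed by LF is refused. */
        return !(c == '\r' && pos + 1 < len && s[pos + 1] == '\n');
    case NL_ANY:
        return !(c == '\n' || c == '\r' ||
                 c == 0x0b || c == 0x0c || c == 0x85);
    }
    return 0;
}
```

       With  the  CRLF  convention  the  model  refuses  the CR of a CRLF pair but
       accepts an isolated CR or LF, which is the behaviour described above.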
4698
4699       The handling of dot is entirely independent of the handling of  circum-
4700       flex  and  dollar,  the  only relationship being that they both involve
4701       newlines. Dot has no special meaning in a character class.
4702
4703       The escape sequence \N behaves like  a  dot,  except  that  it  is  not
4704       affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
4705       character except one that signifies the end of a line. Perl  also  uses
4706       \N to match characters by name; PCRE does not support this.
4707
4708
4709MATCHING A SINGLE DATA UNIT
4710
4711       Outside  a character class, the escape sequence \C matches any one data
4712       unit, whether or not a UTF mode is set. In the 8-bit library, one  data
4713       unit  is  one byte; in the 16-bit library it is a 16-bit unit. Unlike a
4714       dot, \C always matches line-ending characters. The feature is  provided
4715       in  Perl  in  order  to match individual bytes in UTF-8 mode, but it is
4716       unclear how it can usefully be used. Because \C  breaks  up  characters
4717       into  individual  data  units,  matching one unit with \C in a UTF mode
4718       means that the rest of the string may start with a malformed UTF  char-
4719       acter.  This  has  undefined  results,  because PCRE assumes that it is
4720       dealing with valid UTF strings (and by default it checks  this  at  the
4721       start     of    processing    unless    the    PCRE_NO_UTF8_CHECK    or
4722       PCRE_NO_UTF16_CHECK option is used).
4723
4724       PCRE does not allow \C to appear in  lookbehind  assertions  (described
4725       below)  in  a UTF mode, because this would make it impossible to calcu-
4726       late the length of the lookbehind.
4727
4728       In general, the \C escape sequence is best avoided. However, one way of
4729       using  it that avoids the problem of malformed UTF characters is to use
4730       a lookahead to check the length of the next character, as in this  pat-
4731       tern,  which  could be used with a UTF-8 string (ignore white space and
4732       line breaks):
4733
4734         (?| (?=[\x00-\x7f])(\C) |
4735             (?=[\x80-\x{7ff}])(\C)(\C) |
4736             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
4737             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
4738
4739       A group that starts with (?| resets the capturing  parentheses  numbers
4740       in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
4741       assertions at the start of each branch check the next  UTF-8  character
4742       for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
4743       character's individual bytes are then captured by the appropriate  num-
4744       ber of groups.
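
       The four lookahead ranges correspond to the four  UTF-8  encoded  lengths.
       The  sketch  below (utf8_encode is a hypothetical helper, not part of PCRE)
       encodes one code point and returns the number of  bytes  produced,  which
       is  the  number  of  \C items the matching branch of the pattern would use.
       The upper limit mirrors the \x{1fffff} bound in the pattern rather than
       the Unicode limit of U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes code point cp as UTF-8 into out (room for 4 bytes needed)
   and returns the number of bytes written (1 to 4). */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL && cp <= 0x1FFFFFUL);
    if (cp <= 0x7F) {                      /* branch 1: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                     /* branch 2: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                    /* branch 3: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));   /* branch 4 */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```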
4745
4746
4747SQUARE BRACKETS AND CHARACTER CLASSES
4748
4749       An opening square bracket introduces a character class, terminated by a
4750       closing square bracket. A closing square bracket on its own is not spe-
4751       cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
4752       a lone closing square bracket causes a compile-time error. If a closing
4753       square  bracket  is required as a member of the class, it should be the
4754       first data character in the class  (after  an  initial  circumflex,  if
4755       present) or escaped with a backslash.
4756
4757       A  character  class matches a single character in the subject. In a UTF
4758       mode, the character may be more than one  data  unit  long.  A  matched
4759       character must be in the set of characters defined by the class, unless
4760       the first character in the class definition is a circumflex,  in  which
4761       case the subject character must not be in the set defined by the class.
4762       If a circumflex is actually required as a member of the  class,  ensure
4763       it is not the first character, or escape it with a backslash.
4764
4765       For  example, the character class [aeiou] matches any lower case vowel,
4766       while [^aeiou] matches any character that is not a  lower  case  vowel.
4767       Note that a circumflex is just a convenient notation for specifying the
4768       characters that are in the class by enumerating those that are  not.  A
4769       class  that starts with a circumflex is not an assertion; it still con-
4770       sumes a character from the subject string, and therefore  it  fails  if
4771       the current pointer is at the end of the string.
4772
4773       In  UTF-8  (UTF-16)  mode,  characters  with  values  greater  than 255
4774       (0xffff) can be included in a class as a literal string of data  units,
4775       or by using the \x{ escaping mechanism.
4776
4777       When  caseless  matching  is set, any letters in a class represent both
4778       their upper case and lower case versions, so for  example,  a  caseless
4779       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
4780       match "A", whereas a caseful version would. In a UTF mode, PCRE  always
4781       understands  the  concept  of case for characters whose values are less
4782       than 128, so caseless matching is always possible. For characters  with
4783       higher  values,  the  concept  of case is supported if PCRE is compiled
4784       with Unicode property support, but not otherwise.  If you want  to  use
4785       caseless  matching in a UTF mode for characters 128 and above, you must
4786       ensure that PCRE is compiled with Unicode property support as  well  as
4787       with UTF support.
4788
4789       Characters  that  might  indicate  line breaks are never treated in any
4790       special way  when  matching  character  classes,  whatever  line-ending
4791       sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
4792       PCRE_MULTILINE options is used. A class such as [^a] always matches one
4793       of these characters.
4794
4795       The  minus (hyphen) character can be used to specify a range of charac-
4796       ters in a character  class.  For  example,  [d-m]  matches  any  letter
4797       between  d  and  m,  inclusive.  If  a minus character is required in a
4798       class, it must be escaped with a backslash  or  appear  in  a  position
4799       where  it cannot be interpreted as indicating a range, typically as the
4800       first or last character in the class.
4801
4802       It is not possible to have the literal character "]" as the end charac-
4803       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
4804       two characters ("W" and "-") followed by a literal string "46]", so  it
4805       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
4806       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
4807       preted  as a class containing a range followed by two other characters.
4808       The octal or hexadecimal representation of "]" can also be used to  end
4809       a range.
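
       A minimal sketch of the two patterns above, using Python's re module as
       a stand-in (it is not PCRE, but parses these two classes the same way).
       The variable names are invented for the example.

```python
import re

# [W-] is a class of two characters; "46]" is then a literal string.
two_char_class = re.compile(r"[W-]46]")

# With the "]" escaped, W-\] is a range (W X Y Z [ \ ]), followed by the
# two further class members "4" and "6".
range_class = re.compile(r"[W-\]46]")

literal_hits = [s for s in ("W46]", "-46]", "X46]")
                if two_char_class.fullmatch(s)]
range_hits = [c for c in "WZ]46-V" if range_class.fullmatch(c)]
```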
4810
4811       Ranges  operate in the collating sequence of character values. They can
4812       also  be  used  for  characters  specified  numerically,  for   example
4813       [\000-\037].  Ranges  can include any characters that are valid for the
4814       current mode.
4815
4816       If a range that includes letters is used when caseless matching is set,
4817       it matches the letters in either case. For example, [W-c] is equivalent
4818       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
4819       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases. In UTF modes,  PCRE  supports  the
       concept  of  case for characters with values greater than 127 only when
       it is compiled with Unicode property support.
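
       The [W-c] example can be checked with a short sketch. Python's re
       module is assumed here as a stand-in for PCRE; for this ASCII range its
       caseless matching gives the same result. The helper name is invented.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

def in_range(ch):
    """Return True if the single character ch matches the caseless class."""
    return caseless_range.fullmatch(ch) is not None

# W..Z and a..c match in either case, as do the characters between "Z"
# and "a"; "D"/"d" and "V"/"v" lie outside the range in both cases.
matched = "".join(ch for ch in "WwZz[^_`AaCcDdVv" if in_range(ch))
```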
4823
4824       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
4825       \w, and \W may appear in a character class, and add the characters that
4826       they match to the class. For example, [\dABCDEF] matches any  hexadeci-
4827       mal  digit.  In  UTF modes, the PCRE_UCP option affects the meanings of
4828       \d, \s, \w and their upper case partners, just as  it  does  when  they
4829       appear  outside a character class, as described in the section entitled
4830       "Generic character types" above. The escape sequence \b has a different
4831       meaning  inside  a character class; it matches the backspace character.
4832       The sequences \B, \N, \R, and \X are not  special  inside  a  character
4833       class.  Like  any other unrecognized escape sequences, they are treated
4834       as the literal characters "B", "N", "R", and "X" by default, but  cause
4835       an error if the PCRE_EXTRA option is set.
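
       A short sketch of the first and the \b cases, using Python's re module
       as an assumed stand-in for PCRE. The \B, \N, \R, and \X cases are left
       out because Python does not treat them as PCRE does.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]")
backspace_class = re.compile(r"[\b]")

all_hex = all(hex_digit.fullmatch(c) is not None for c in "09AF")
# Lower case "f" is not listed in the class and matching is caseful.
lower_f = hex_digit.fullmatch("f") is not None

# Inside a class, \b is the backspace character (code 8), not the letter.
matches_backspace = backspace_class.fullmatch("\x08") is not None
matches_letter_b = backspace_class.fullmatch("b") is not None
```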
4836
4837       A  circumflex  can  conveniently  be used with the upper case character
4838       types to specify a more restricted set of characters than the  matching
4839       lower  case  type.  For example, the class [^\W_] matches any letter or
4840       digit, but not underscore, whereas [\w] includes underscore. A positive
4841       character class should be read as "something OR something OR ..." and a
4842       negative class as "NOT something AND NOT something AND NOT ...".
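
       The two classes can be compared directly. The sketch assumes Python's
       re module, which is not PCRE but handles \w and \W in a class the same
       way for ASCII input.

```python
import re

with_underscore = re.compile(r"[\w]")
without_underscore = re.compile(r"[^\W_]")   # NOT non-word AND NOT "_"

sample = "a9_-"
w_hits = [c for c in sample if with_underscore.fullmatch(c)]
nw_hits = [c for c in sample if without_underscore.fullmatch(c)]
```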
4843
4844       The only metacharacters that are recognized in  character  classes  are
4845       backslash,  hyphen  (only  where  it can be interpreted as specifying a
4846       range), circumflex (only at the start), opening  square  bracket  (only
4847       when  it can be interpreted as introducing a POSIX class name - see the
4848       next section), and the terminating  closing  square  bracket.  However,
4849       escaping other non-alphanumeric characters does no harm.
4850
4851
4852POSIX CHARACTER CLASSES
4853
4854       Perl supports the POSIX notation for character classes. This uses names
4855       enclosed by [: and :] within the enclosing square brackets.  PCRE  also
4856       supports this notation. For example,
4857
4858         [01[:alpha:]%]
4859
4860       matches "0", "1", any alphabetic character, or "%". The supported class
4861       names are:
4862
4863         alnum    letters and digits
4864         alpha    letters
4865         ascii    character codes 0 - 127
4866         blank    space or tab only
4867         cntrl    control characters
4868         digit    decimal digits (same as \d)
4869         graph    printing characters, excluding space
4870         lower    lower case letters
4871         print    printing characters, including space
4872         punct    printing characters, excluding letters and digits and space
4873         space    white space (not quite the same as \s)
4874         upper    upper case letters
4875         word     "word" characters (same as \w)
4876         xdigit   hexadecimal digits
4877
4878       The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
4879       and  space  (32). Notice that this list includes the VT character (code
4880       11). This makes "space" different to \s, which does not include VT (for
4881       Perl compatibility).
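
       The difference can be written out as sets of character codes. This is
       only a model of the paragraph above, not a call into PCRE, and the two
       set names are invented for the example.

```python
# Character codes taken from the paragraph above.
POSIX_SPACE = {9, 10, 11, 12, 13, 32}   # HT, LF, VT, FF, CR, space
BACKSLASH_S = POSIX_SPACE - {11}        # \s as described here: no VT

only_in_posix_space = sorted(POSIX_SPACE - BACKSLASH_S)
```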
4882
4883       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
4884       from Perl 5.8. Another Perl extension is negation, which  is  indicated
4885       by a ^ character after the colon. For example,
4886
4887         [12[:^digit:]]
4888
4889       matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
4890       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
4891       these are not supported, and an error is given if they are encountered.
4892
       By  default,  in  UTF modes, characters with values greater than 127 do
4894       not match any of the POSIX character classes. However, if the  PCRE_UCP
4895       option  is passed to pcre_compile(), some of the classes are changed so
4896       that Unicode character properties are used. This is achieved by replac-
4897       ing the POSIX classes by other sequences, as follows:
4898
4899         [:alnum:]  becomes  \p{Xan}
4900         [:alpha:]  becomes  \p{L}
4901         [:blank:]  becomes  \h
4902         [:digit:]  becomes  \p{Nd}
4903         [:lower:]  becomes  \p{Ll}
4904         [:space:]  becomes  \p{Xps}
4905         [:upper:]  becomes  \p{Lu}
4906         [:word:]   becomes  \p{Xwd}
4907
       Negated  versions,  such as [:^alpha:], use \P instead of \p. The other
4909       POSIX classes are unchanged, and match only characters with code points
4910       less than 128.
4911
4912
4913VERTICAL BAR
4914
4915       Vertical  bar characters are used to separate alternative patterns. For
4916       example, the pattern
4917
4918         gilbert|sullivan
4919
4920       matches either "gilbert" or "sullivan". Any number of alternatives  may
4921       appear,  and  an  empty  alternative  is  permitted (matching the empty
4922       string). The matching process tries each alternative in turn, from left
4923       to  right, and the first one that succeeds is used. If the alternatives
4924       are within a subpattern (defined below), "succeeds" means matching  the
4925       rest of the main pattern as well as the alternative in the subpattern.
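
       The left-to-right rule can be sketched as follows. Python's re module
       is assumed as a stand-in for PCRE; both try alternatives in this order.
       The extra patterns are variations made up for the example.

```python
import re

# The leftmost alternative that succeeds is used, even though a longer
# alternative would also match.
first_wins = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern, so the
# second alternative is used when the first leaves the rest unmatched.
m = re.fullmatch(r"(gil|gilbert) and sullivan", "gilbert and sullivan")
branch_used = m.group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|", "") is not None
```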
4926
4927
4928INTERNAL OPTION SETTING
4929
4930       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
4931       PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
4932       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
4933       between "(?" and ")".  The option letters are
4934
4935         i  for PCRE_CASELESS
4936         m  for PCRE_MULTILINE
4937         s  for PCRE_DOTALL
4938         x  for PCRE_EXTENDED
4939
4940       For example, (?im) sets caseless, multiline matching. It is also possi-
4941       ble to unset these options by preceding the letter with a hyphen, and a
4942       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
4943       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
4944       is also permitted. If a  letter  appears  both  before  and  after  the
4945       hyphen, the option is unset.
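
       A sketch of the (?im) case, assuming Python's re module as a stand-in.
       Python is stricter than PCRE about where such settings may appear, so
       only a setting at the very start of the pattern is shown.

```python
import re

subject = "first\nSECOND"

# Options passed by the application ...
by_flags = re.search(r"^second$", subject, re.IGNORECASE | re.MULTILINE)
# ... or set by option letters at the start of the pattern.
by_letters = re.search(r"(?im)^second$", subject)
# Without either, the pattern does not match.
plain = re.search(r"^second$", subject)

# The leading setting becomes part of the compiled pattern's options.
compiled_flags = re.compile(r"(?im)^second$").flags
```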
4946
4947       The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
4948       can be changed in the same way as the Perl-compatible options by  using
4949       the characters J, U and X respectively.
4950
4951       When  one  of  these  option  changes occurs at top level (that is, not
4952       inside subpattern parentheses), the change applies to the remainder  of
4953       the pattern that follows. If the change is placed right at the start of
4954       a pattern, PCRE extracts it into the global options (and it will there-
4955       fore show up in data extracted by the pcre_fullinfo() function).
4956
4957       An  option  change  within a subpattern (see below for a description of
4958       subpatterns) affects only that part of the subpattern that follows  it,
4959       so
4960
4961         (a(?i)b)c
4962
4963       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
4964       used).  By this means, options can be made to have  different  settings
4965       in  different parts of the pattern. Any changes made in one alternative
4966       do carry on into subsequent branches within the  same  subpattern.  For
4967       example,
4968
4969         (a(?i)b|c)
4970
4971       matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
4972       first branch is abandoned before the option setting.  This  is  because
4973       the  effects  of option settings happen at compile time. There would be
4974       some very weird behaviour otherwise.
4975
4976       Note: There are other PCRE-specific options that  can  be  set  by  the
4977       application  when  the  compiling  or matching functions are called. In
4978       some cases the pattern can contain special leading  sequences  such  as
4979       (*CRLF)  to  override  what  the  application  has set or what has been
4980       defaulted.  Details  are  given  in  the  section   entitled   "Newline
4981       sequences"  above.  There  are  also  the (*UTF8), (*UTF16), and (*UCP)
4982       leading sequences that can be used to  set  UTF  and  Unicode  property
4983       modes;  they  are  equivalent to setting the PCRE_UTF8, PCRE_UTF16, and
4984       the PCRE_UCP options, respectively.
4985
4986
4987SUBPATTERNS
4988
4989       Subpatterns are delimited by parentheses (round brackets), which can be
4990       nested.  Turning part of a pattern into a subpattern does two things:
4991
4992       1. It localizes a set of alternatives. For example, the pattern
4993
4994         cat(aract|erpillar|)
4995
4996       matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
4997       it would match "cataract", "erpillar" or an empty string.
4998
4999       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
5000       that,  when  the  whole  pattern  matches,  that portion of the subject
5001       string that matched the subpattern is passed back to the caller via the
5002       ovector  argument  of  the matching function. (This applies only to the
5003       traditional matching functions; the DFA matching functions do not  sup-
5004       port capturing.)
5005
5006       Opening parentheses are counted from left to right (starting from 1) to
5007       obtain numbers for the  capturing  subpatterns.  For  example,  if  the
5008       string "the red king" is matched against the pattern
5009
5010         the ((red|white) (king|queen))
5011
5012       the captured substrings are "red king", "red", and "king", and are num-
5013       bered 1, 2, and 3, respectively.
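
       Both effects of parentheses can be sketched with Python's re module,
       assumed here as a stand-in for PCRE; group numbering works the same
       way in both.

```python
import re

candidates = ("cataract", "caterpillar", "cat", "erpillar", "")

# With parentheses the alternatives are local to the group.
grouped = [s for s in candidates
           if re.fullmatch(r"cat(aract|erpillar|)", s)]
# Without them the alternation splits the whole pattern.
ungrouped = [s for s in candidates
             if re.fullmatch(r"cataract|erpillar|", s)]

# Groups are numbered by the position of their opening parenthesis.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captures = m.groups()
```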
5014
5015       The fact that plain parentheses fulfil  two  functions  is  not  always
5016       helpful.   There are often times when a grouping subpattern is required
5017       without a capturing requirement. If an opening parenthesis is  followed
5018       by  a question mark and a colon, the subpattern does not do any captur-
5019       ing, and is not counted when computing the  number  of  any  subsequent
5020       capturing  subpatterns. For example, if the string "the white queen" is
5021       matched against the pattern
5022
5023         the ((?:red|white) (king|queen))
5024
5025       the captured substrings are "white queen" and "queen", and are numbered
5026       1 and 2. The maximum number of capturing subpatterns is 65535.
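
       The non-capturing form can be checked in the same way. Python's re
       module is again only an assumed stand-in for PCRE.

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
m = pattern.fullmatch("the white queen")

captures = m.groups()          # the (?:...) group is not counted
group_count = pattern.groups   # number of capturing groups
```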
5027
5028       As  a  convenient shorthand, if any option settings are required at the
5029       start of a non-capturing subpattern,  the  option  letters  may  appear
5030       between the "?" and the ":". Thus the two patterns
5031
5032         (?i:saturday|sunday)
5033         (?:(?i)saturday|sunday)
5034
5035       match exactly the same set of strings. Because alternative branches are
5036       tried from left to right, and options are not reset until  the  end  of
5037       the  subpattern is reached, an option setting in one branch does affect
5038       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
5039       "Saturday".
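
       A sketch of the shorthand form, assuming Python's re module as a
       stand-in for PCRE. Only (?i:...) is shown, because Python does not
       accept an option setting part-way through a pattern. The second
       pattern is a variation made up for the example.

```python
import re

short_form = re.compile(r"(?i:saturday|sunday)")
subjects = ["Saturday", "SUNDAY", "sunday", "Monday"]
hits = [s for s in subjects if short_form.fullmatch(s)]

# The setting is local to the group: "day" outside it stays caseful.
scoped = re.compile(r"(?i:sun)day")
scoped_hits = [s for s in ("SUNday", "SUNDAY") if scoped.fullmatch(s)]
```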
5040
5041
5042DUPLICATE SUBPATTERN NUMBERS
5043
5044       Perl 5.10 introduced a feature whereby each alternative in a subpattern
5045       uses the same numbers for its capturing parentheses. Such a  subpattern
5046       starts  with (?| and is itself a non-capturing subpattern. For example,
5047       consider this pattern:
5048
5049         (?|(Sat)ur|(Sun))day
5050
5051       Because the two alternatives are inside a (?| group, both sets of  cap-
5052       turing  parentheses  are  numbered one. Thus, when the pattern matches,
5053       you can look at captured substring number  one,  whichever  alternative
5054       matched.  This  construct  is useful when you want to capture part, but
5055       not all, of one of a number of alternatives. Inside a (?| group, paren-
5056       theses  are  numbered as usual, but the number is reset at the start of
5057       each branch. The numbers of any capturing parentheses that  follow  the
5058       subpattern  start after the highest number used in any branch. The fol-
5059       lowing example is taken from the Perl documentation. The numbers under-
5060       neath show in which buffer the captured content will be stored.
5061
5062         # before  ---------------branch-reset----------- after
5063         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
5064         # 1            2         2  3        2     3     4
5065
5066       A  back  reference  to a numbered subpattern uses the most recent value
5067       that is set for that number by any subpattern.  The  following  pattern
5068       matches "abcabc" or "defdef":
5069
5070         /(?|(abc)|(def))\1/
5071
5072       In  contrast,  a subroutine call to a numbered subpattern always refers
5073       to the first one in the pattern with the given  number.  The  following
5074       pattern matches "abcabc" or "defabc":
5075
5076         /(?|(abc)|(def))(?1)/
5077
5078       If  a condition test for a subpattern's having matched refers to a non-
5079       unique number, the test is true if any of the subpatterns of that  num-
5080       ber have matched.
5081
5082       An  alternative approach to using this "branch reset" feature is to use
5083       duplicate named subpatterns, as described in the next section.
5084
5085
5086NAMED SUBPATTERNS
5087
5088       Identifying capturing parentheses by number is simple, but  it  can  be
5089       very  hard  to keep track of the numbers in complicated regular expres-
5090       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
5091       change.  To help with this difficulty, PCRE supports the naming of sub-
5092       patterns. This feature was not added to Perl until release 5.10. Python
5093       had  the  feature earlier, and PCRE introduced it at release 4.0, using
5094       the Python syntax. PCRE now supports both the Perl and the Python  syn-
5095       tax.  Perl  allows  identically  numbered subpatterns to have different
5096       names, but PCRE does not.
5097
5098       In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
5099       or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
5100       to capturing parentheses from other parts of the pattern, such as  back
5101       references,  recursion,  and conditions, can be made by name as well as
5102       by number.
5103
5104       Names consist of up to  32  alphanumeric  characters  and  underscores.
5105       Named  capturing  parentheses  are  still  allocated numbers as well as
5106       names, exactly as if the names were not present. The PCRE API  provides
5107       function calls for extracting the name-to-number translation table from
5108       a compiled pattern. There is also a convenience function for extracting
5109       a captured substring by name.
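
       The relation between names and numbers can be sketched with Python's
       re module, which supports the (?P<name>...) syntax. This is not the
       PCRE API; groupindex merely plays a role similar to the name-to-number
       table. The pattern is made up for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2012-01")

# Named groups still have numbers, allocated as if unnamed.
by_name = (m.group("year"), m.group("month"))
by_number = (m.group(1), m.group(2))

# The name-to-number mapping of the compiled pattern.
name_table = dict(date.groupindex)
```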
5110
5111       By  default, a name must be unique within a pattern, but it is possible
5112       to relax this constraint by setting the PCRE_DUPNAMES option at compile
5113       time.  (Duplicate  names are also always permitted for subpatterns with
5114       the same number, set up as described in the previous  section.)  Dupli-
5115       cate  names  can  be useful for patterns where only one instance of the
5116       named parentheses can match. Suppose you want to match the  name  of  a
5117       weekday,  either as a 3-letter abbreviation or as the full name, and in
5118       both cases you want to extract the abbreviation. This pattern (ignoring
5119       the line breaks) does the job:
5120
5121         (?<DN>Mon|Fri|Sun)(?:day)?|
5122         (?<DN>Tue)(?:sday)?|
5123         (?<DN>Wed)(?:nesday)?|
5124         (?<DN>Thu)(?:rsday)?|
5125         (?<DN>Sat)(?:urday)?
5126
5127       There  are  five capturing substrings, but only one is ever set after a
5128       match.  (An alternative way of solving this problem is to use a "branch
5129       reset" subpattern, as described in the previous section.)
5130
5131       The  convenience  function  for extracting the data by name returns the
5132       substring for the first (and in this example, the only)  subpattern  of
5133       that  name  that  matched.  This saves searching to find which numbered
5134       subpattern it was.
5135
5136       If you make a back reference to  a  non-unique  named  subpattern  from
5137       elsewhere  in the pattern, the one that corresponds to the first occur-
5138       rence of the name is used. In the absence of duplicate numbers (see the
5139       previous  section) this is the one with the lowest number. If you use a
5140       named reference in a condition test (see the section  about  conditions
5141       below),  either  to check whether a subpattern has matched, or to check
5142       for recursion, all subpatterns with the same name are  tested.  If  the
5143       condition  is  true for any one of them, the overall condition is true.
5144       This is the same behaviour as testing by number. For further details of
5145       the interfaces for handling named subpatterns, see the pcreapi documen-
5146       tation.
5147
5148       Warning: You cannot use different names to distinguish between two sub-
5149       patterns  with  the same number because PCRE uses only the numbers when
5150       matching. For this reason, an error is given at compile time if differ-
5151       ent  names  are given to subpatterns with the same number. However, you
5152       can give the same name to subpatterns with the same number,  even  when
5153       PCRE_DUPNAMES is not set.
5154
5155
5156REPETITION
5157
5158       Repetition  is  specified  by  quantifiers, which can follow any of the
5159       following items:
5160
5161         a literal data character
5162         the dot metacharacter
5163         the \C escape sequence
5164         the \X escape sequence
5165         the \R escape sequence
5166         an escape such as \d or \pL that matches a single character
5167         a character class
5168         a back reference (see next section)
5169         a parenthesized subpattern (including assertions)
5170         a subroutine call to a subpattern (recursive or otherwise)
5171
5172       The general repetition quantifier specifies a minimum and maximum  num-
5173       ber  of  permitted matches, by giving the two numbers in curly brackets
5174       (braces), separated by a comma. The numbers must be  less  than  65536,
5175       and the first must be less than or equal to the second. For example:
5176
5177         z{2,4}
5178
5179       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
5180       special character. If the second number is omitted, but  the  comma  is
5181       present,  there  is  no upper limit; if the second number and the comma
5182       are both omitted, the quantifier specifies an exact number of  required
5183       matches. Thus
5184
5185         [aeiou]{3,}
5186
5187       matches at least 3 successive vowels, but may match many more, while
5188
5189         \d{8}
5190
5191       matches  exactly  8  digits. An opening curly bracket that appears in a
5192       position where a quantifier is not allowed, or one that does not  match
5193       the  syntax of a quantifier, is taken as a literal character. For exam-
5194       ple, {,6} is not a quantifier, but a literal string of four characters.
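
       The quantifier examples can be sketched as below, assuming Python's re
       module as a stand-in for PCRE. The {,6} case is left out because Python
       treats it as a quantifier, unlike PCRE as described above. The helper
       name is invented.

```python
import re

def whole(pattern, subject):
    """Return True if pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None

z_counts = [n for n in range(1, 6) if whole(r"z{2,4}", "z" * n)]
vowel_runs = [s for s in ("ae", "aei", "aeiouaeiou")
              if whole(r"[aeiou]{3,}", s)]
eight_digits = [s for s in ("1234567", "12345678", "123456789")
                if whole(r"\d{8}", s)]

# A closing brace on its own is an ordinary character.
lone_brace = whole(r"a}", "a}")
```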
5195
5196       In UTF modes, quantifiers apply to characters rather than to individual
5197       data  units. Thus, for example, \x{100}{2} matches two characters, each
5198       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
5199       larly,  \X{3}  matches  three Unicode extended sequences, each of which
5200       may be several data units long (and they may be of different lengths).
5201
5202       The quantifier {0} is permitted, causing the expression to behave as if
5203       the previous item and the quantifier were not present. This may be use-
5204       ful for subpatterns that are referenced as subroutines  from  elsewhere
5205       in the pattern (but see also the section entitled "Defining subpatterns
5206       for use by reference only" below). Items other  than  subpatterns  that
5207       have a {0} quantifier are omitted from the compiled pattern.
5208
5209       For  convenience, the three most common quantifiers have single-charac-
5210       ter abbreviations:
5211
5212         *    is equivalent to {0,}
5213         +    is equivalent to {1,}
5214         ?    is equivalent to {0,1}
5215
5216       It is possible to construct infinite loops by  following  a  subpattern
5217       that can match no characters with a quantifier that has no upper limit,
5218       for example:
5219
5220         (a?)*
5221
5222       Earlier versions of Perl and PCRE used to give an error at compile time
5223       for  such  patterns. However, because there are cases where this can be
5224       useful, such patterns are now accepted, but if any  repetition  of  the
5225       subpattern  does in fact match no characters, the loop is forcibly bro-
5226       ken.
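
       A small sketch showing that such a pattern terminates. It assumes
       Python's re module, which is not PCRE but likewise breaks the loop
       when an iteration matches no characters.

```python
import re

loop = re.compile(r"(a?)*")

# The match completes rather than looping for ever.
full = loop.fullmatch("aaa") is not None
# Against a subject with no "a", the pattern matches the empty string.
empty_at_b = loop.match("b").span()
```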
5227
5228       By default, the quantifiers are "greedy", that is, they match  as  much
5229       as  possible  (up  to  the  maximum number of permitted times), without
5230       causing the rest of the pattern to fail. The classic example  of  where
5231       this gives problems is in trying to match comments in C programs. These
5232       appear between /* and */ and within the comment,  individual  *  and  /
5233       characters  may  appear. An attempt to match C comments by applying the
5234       pattern
5235
5236         /\*.*\*/
5237
5238       to the string
5239
5240         /* first comment */  not comment  /* second comment */
5241
5242       fails, because it matches the entire string owing to the greediness  of
5243       the .*  item.
5244
5245       However,  if  a quantifier is followed by a question mark, it ceases to
5246       be greedy, and instead matches the minimum number of times possible, so
5247       the pattern
5248
5249         /\*.*?\*/
5250
5251       does  the  right  thing with the C comments. The meaning of the various
5252       quantifiers is not otherwise changed,  just  the  preferred  number  of
5253       matches.   Do  not  confuse this use of question mark with its use as a
5254       quantifier in its own right. Because it has two uses, it can  sometimes
5255       appear doubled, as in
5256
5257         \d??\d
5258
5259       which matches one digit by preference, but can match two if that is the
5260       only way the rest of the pattern matches.
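
       The greedy and minimizing forms can be compared on the subject string
       shown above. The sketch assumes Python's re module as a stand-in for
       PCRE; the two agree on these patterns.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last "*/", so the whole string is matched.
greedy = re.search(r"/\*.*\*/", subject).group()
# Minimizing: .*? stops at the first "*/" each time.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two if the rest requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```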
5261
5262       If the PCRE_UNGREEDY option is set (an option that is not available  in
5263       Perl),  the  quantifiers are not greedy by default, but individual ones
5264       can be made greedy by following them with a  question  mark.  In  other
5265       words, it inverts the default behaviour.
5266
5267       When  a  parenthesized  subpattern  is quantified with a minimum repeat
5268       count that is greater than 1 or with a limited maximum, more memory  is
5269       required  for  the  compiled  pattern, in proportion to the size of the
5270       minimum or maximum.
5271
5272       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
5273       alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
5274       the pattern is implicitly anchored, because whatever  follows  will  be
5275       tried  against every character position in the subject string, so there
5276       is no point in retrying the overall match at  any  position  after  the
5277       first.  PCRE  normally treats such a pattern as though it were preceded
5278       by \A.
5279
5280       In cases where it is known that the subject  string  contains  no  new-
5281       lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
5282       mization, or alternatively using ^ to indicate anchoring explicitly.
5283
5284       However, there is one situation where the optimization cannot be  used.
5285       When .*  is inside capturing parentheses that are the subject of a back
5286       reference elsewhere in the pattern, a match at the start may fail where
5287       a later one succeeds. Consider, for example:
5288
5289         (.*)abc\1
5290
5291       If  the subject is "xyz123abc123" the match point is the fourth charac-
5292       ter. For this reason, such a pattern is not implicitly anchored.
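
       The match point can be checked with a short sketch, assuming Python's
       re module as a stand-in for PCRE. Offsets count from zero, so the
       fourth character is at offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()       # attempts at offsets 0, 1 and 2 fail
matched = m.group()
captured = m.group(1)
```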
5293
5294       When a capturing subpattern is repeated, the value captured is the sub-
5295       string that matched the final iteration. For example, after
5296
5297         (tweedle[dume]{3}\s*)+
5298
5299       has matched "tweedledum tweedledee" the value of the captured substring
5300       is "tweedledee". However, if there are  nested  capturing  subpatterns,
5301       the  corresponding captured values may have been set in previous itera-
5302       tions. For example, after
5303
5304         /(a|(b))+/
5305
5306       matches "aba" the value of the second captured substring is "b".
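
       Both examples can be sketched with Python's re module, assumed here as
       a stand-in for PCRE; it keeps captured values across iterations in the
       same way.

```python
import re

m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# Group 2 was set in the second iteration and is not touched by the third.
m2 = re.fullmatch(r"(a|(b))+", "aba")
nested = m2.groups()
```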
5307
5308
5309ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
5310
5311       With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
5312       repetition,  failure  of what follows normally causes the repeated item
5313       to be re-evaluated to see if a different number of repeats  allows  the
5314       rest  of  the pattern to match. Sometimes it is useful to prevent this,
       either to change the nature of the match, or to cause it to fail earlier
5316       than  it otherwise might, when the author of the pattern knows there is
5317       no point in carrying on.
5318
5319       Consider, for example, the pattern \d+foo when applied to  the  subject
5320       line
5321
5322         123456bar
5323
5324       After matching all 6 digits and then failing to match "foo", the normal
5325       action of the matcher is to try again with only 5 digits  matching  the
5326       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
5327       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
5328       the  means for specifying that once a subpattern has matched, it is not
5329       to be re-evaluated in this way.
5330
5331       If we use atomic grouping for the previous example, the  matcher  gives
5332       up  immediately  on failing to match "foo" the first time. The notation
5333       is a kind of special parenthesis, starting with (?> as in this example:
5334
5335         (?>\d+)foo
5336
5337       This kind of parenthesis "locks up" the  part of the  pattern  it  con-
5338       tains  once  it  has matched, and a failure further into the pattern is
5339       prevented from backtracking into it. Backtracking past it  to  previous
5340       items, however, works as normal.
5341
5342       An  alternative  description  is that a subpattern of this type matches
5343       the string of characters that an  identical  standalone  pattern  would
5344       match, if anchored at the current point in the subject string.
5345
5346       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
5347       such as the above example can be thought of as a maximizing repeat that
5348       must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
5349       pared to adjust the number of digits they match in order  to  make  the
5350       rest of the pattern match, (?>\d+) can only match an entire sequence of
5351       digits.
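
       The effect can be sketched in Python, with two caveats. Python's re
       module is not PCRE, and it accepts (?>...) only from version 3.11, so
       the sketch emulates an atomic group with a lookahead followed by a
       back reference. That is a substitute technique, not how PCRE works
       internally. The name ATOMIC_DIGITS and the "6" pattern are invented.

```python
import re

# Ordinary greedy repeat: \d+ gives back the final "6" so that the
# literal "6" can match.
backtracking = re.match(r"\d+6", "123456") is not None

# Emulated (?>\d+): the lookahead captures the whole run of digits, the
# back reference consumes exactly that run, and nothing is given back.
ATOMIC_DIGITS = r"(?=(?P<run>\d+))(?P=run)"
atomic = re.match(ATOMIC_DIGITS + r"6", "123456") is not None

# The pattern from the text: success and failure are unchanged, but the
# failing case is abandoned without retrying shorter runs of digits.
atomic_foo = re.match(ATOMIC_DIGITS + r"foo", "123456foo") is not None
atomic_bar = re.match(ATOMIC_DIGITS + r"foo", "123456bar") is not None
```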
5352
5353       Atomic groups in general can of course contain arbitrarily  complicated
5354       subpatterns,  and  can  be  nested. However, when the subpattern for an
5355       atomic group is just a single repeated item, as in the example above, a
       simpler  notation,  called a "possessive quantifier", can be used. This
5357       consists of an additional + character  following  a  quantifier.  Using
5358       this notation, the previous example can be rewritten as
5359
5360         \d++foo
5361
5362       Note that a possessive quantifier can be used with an entire group, for
5363       example:
5364
5365         (abc|xyz){2,3}+
5366
5367       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
5368       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
5369       simpler forms of atomic group. However, there is no difference  in  the
5370       meaning  of  a  possessive  quantifier and the equivalent atomic group,
5371       though there may be a performance  difference;  possessive  quantifiers
5372       should be slightly faster.
5373
5374       The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
5375       tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
5376       edition of his book. Mike McCloskey liked it, so implemented it when he
5377       built Sun's Java package, and PCRE copied it from there. It  ultimately
5378       found its way into Perl at release 5.10.
5379
5380       PCRE has an optimization that automatically "possessifies" certain sim-
5381       ple pattern constructs. For example, the sequence  A+B  is  treated  as
5382       A++B  because  there is no point in backtracking into a sequence of A's
5383       when B must follow.
5384
5385       When a pattern contains an unlimited repeat inside  a  subpattern  that
5386       can  itself  be  repeated  an  unlimited number of times, the use of an
5387       atomic group is the only way to avoid some  failing  matches  taking  a
5388       very long time indeed. The pattern
5389
5390         (\D+|<\d+>)*[!?]
5391
5392       matches  an  unlimited number of substrings that either consist of non-
5393       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
5394       matches, it runs quickly. However, if it is applied to
5395
5396         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
5397
5398       it  takes  a  long  time  before reporting failure. This is because the
5399       string can be divided between the internal \D+ repeat and the  external
5400       *  repeat  in  a  large  number of ways, and all have to be tried. (The
5401       example uses [!?] rather than a single character at  the  end,  because
5402       both  PCRE  and  Perl have an optimization that allows for fast failure
5403       when a single character is used. They remember the last single  charac-
5404       ter  that  is required for a match, and fail early if it is not present
5405       in the string.) If the pattern is changed so that  it  uses  an  atomic
5406       group, like this:
5407
5408         ((?>\D+)|<\d+>)*[!?]
5409
5410       sequences of non-digits cannot be broken, and failure happens quickly.
5411
5412
5413BACK REFERENCES
5414
5415       Outside a character class, a backslash followed by a digit greater than
5416       0 (and possibly further digits) is a back reference to a capturing sub-
5417       pattern  earlier  (that is, to its left) in the pattern, provided there
5418       have been that many previous capturing left parentheses.
5419
5420       However, if the decimal number following the backslash is less than 10,
5421       it  is  always  taken  as a back reference, and causes an error only if
5422       there are not that many capturing left parentheses in the  entire  pat-
5423       tern.  In  other words, the parentheses that are referenced need not be
5424       to the left of the reference for numbers less than 10. A "forward  back
5425       reference"  of  this  type can make sense when a repetition is involved
5426       and the subpattern to the right has participated in an  earlier  itera-
5427       tion.
5428
5429       It  is  not  possible to have a numerical "forward back reference" to a
5430       subpattern whose number is 10 or  more  using  this  syntax  because  a
5431       sequence  such  as  \50 is interpreted as a character defined in octal.
5432       See the subsection entitled "Non-printing characters" above for further
5433       details  of  the  handling of digits following a backslash. There is no
5434       such problem when named parentheses are used. A back reference  to  any
5435       subpattern is possible using named parentheses (see below).
5436
5437       Another  way  of  avoiding  the ambiguity inherent in the use of digits
5438       following a backslash is to use the \g  escape  sequence.  This  escape
5439       must be followed by an unsigned number or a negative number, optionally
5440       enclosed in braces. These examples are all identical:
5441
5442         (ring), \1
5443         (ring), \g1
5444         (ring), \g{1}
5445
5446       An unsigned number specifies an absolute reference without the  ambigu-
5447       ity that is present in the older syntax. It is also useful when literal
5448       digits follow the reference. A negative number is a relative reference.
5449       Consider this example:
5450
5451         (abc(def)ghi)\g{-1}
5452
5453       The sequence \g{-1} is a reference to the most recently started captur-
5454       ing subpattern before \g, that is, it is equivalent to \2 in this exam-
5455       ple.   Similarly, \g{-2} would be equivalent to \1. The use of relative
5456       references can be helpful in long patterns, and also in  patterns  that
5457       are  created  by  joining  together  fragments  that contain references
5458       within themselves.
5459
5460       A back reference matches whatever actually matched the  capturing  sub-
5461       pattern  in  the  current subject string, rather than anything matching
5462       the subpattern itself (see "Subpatterns as subroutines" below for a way
5463       of doing that). So the pattern
5464
5465         (sens|respons)e and \1ibility
5466
5467       matches  "sense and sensibility" and "response and responsibility", but
5468       not "sense and responsibility". If caseful matching is in force at  the
5469       time  of the back reference, the case of letters is relevant. For exam-
5470       ple,
5471
5472         ((?i)rah)\s+\1
5473
5474       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
5475       original capturing subpattern is matched caselessly.
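
       The two patterns above can be tried out with the following sketch,
       which assumes Python's re module as a stand-in for PCRE. Python does
       not scope a bare (?i) to the enclosing group, so the sketch writes
       the scoped form (?i:rah) instead; the helper name matches is
       hypothetical.

```python
import re

# Stand-in engine: Python's re module, not PCRE.
PAIR = re.compile(r"(sens|respons)e and \1ibility")

# (?i:rah) makes only the group caseless; the back reference \1 is
# outside that scope and therefore compares case-sensitively.
CHEER = re.compile(r"((?i:rah))\s+\1")


def matches(pattern, subject):
    """True when the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None


print(matches(PAIR, "sense and sensibility"))      # True
print(matches(PAIR, "sense and responsibility"))   # False
print(matches(CHEER, "RAH RAH"))                   # True
print(matches(CHEER, "RAH rah"))                   # False
```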
5476
5477       There  are  several  different ways of writing back references to named
5478       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
5479       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
5480       unified back reference syntax, in which \g can be used for both numeric
5481       and  named  references,  is  also supported. We could rewrite the above
5482       example in any of the following ways:
5483
5484         (?<p1>(?i)rah)\s+\k<p1>
5485         (?'p1'(?i)rah)\s+\k{p1}
5486         (?P<p1>(?i)rah)\s+(?P=p1)
5487         (?<p1>(?i)rah)\s+\g{p1}
5488
5489       A subpattern that is referenced by  name  may  appear  in  the  pattern
5490       before or after the reference.
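
       As an illustration of a named back reference, the sketch below uses
       Python's re module as a stand-in for PCRE. That is an assumption
       with a consequence: of the four spellings listed above, re accepts
       only the Python one, (?P<p1>...) with (?P=p1), and it needs the
       scoped flag form (?i:rah).

```python
import re

# Stand-in engine: Python's re module, not PCRE.
NAMED = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

m = NAMED.fullmatch("Rah Rah")
print(m.group("p1"))                        # Rah
print(NAMED.fullmatch("Rah rah") is None)   # True: the case differs
```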
5491
5492       There  may be more than one back reference to the same subpattern. If a
5493       subpattern has not actually been used in a particular match,  any  back
5494       references to it always fail by default. For example, the pattern
5495
5496         (a|(bc))\2
5497
5498       always  fails  if  it starts to match "a" rather than "bc". However, if
5499       the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
5500       ence to an unset value matches an empty string.
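
       The default behaviour (a reference to an unset group fails) can be
       sketched as follows. Python's re module is assumed as a stand-in
       for PCRE; it behaves the same way in this respect and has no
       counterpart of PCRE_JAVASCRIPT_COMPAT. The helper name try_match is
       hypothetical.

```python
import re

# Stand-in engine: Python's re module, not PCRE.
UNSET = re.compile(r"(a|(bc))\2")


def try_match(subject):
    """Return the matched text for a whole-subject match, else None."""
    m = UNSET.fullmatch(subject)
    return m.group(0) if m else None


print(try_match("bcbc"))  # bcbc: group 2 is set to "bc", so \2 can match
print(try_match("aa"))    # None: group 2 was never set, so \2 fails
```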
5501
5502       Because  there may be many capturing parentheses in a pattern, all dig-
5503       its following a backslash are taken as part of a potential back  refer-
5504       ence  number.   If  the  pattern continues with a digit character, some
5505       delimiter must  be  used  to  terminate  the  back  reference.  If  the
5506       PCRE_EXTENDED  option  is  set, this can be white space. Otherwise, the
5507       \g{ syntax or an empty comment (see "Comments" below) can be used.
5508
5509   Recursive back references
5510
5511       A back reference that occurs inside the parentheses to which it  refers
5512       fails  when  the subpattern is first used, so, for example, (a\1) never
5513       matches.  However, such references can be useful inside  repeated  sub-
5514       patterns. For example, the pattern
5515
5516         (a|b\1)+
5517
5518       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
5519       ation of the subpattern,  the  back  reference  matches  the  character
5520       string  corresponding  to  the previous iteration. In order for this to
5521       work, the pattern must be such that the first iteration does  not  need
5522       to  match the back reference. This can be done using alternation, as in
5523       the example above, or by a quantifier with a minimum of zero.
5524
5525       Back references of this type cause the group that they reference to  be
5526       treated  as  an atomic group.  Once the whole group has been matched, a
5527       subsequent matching failure cannot cause backtracking into  the  middle
5528       of the group.
5529
5530
5531ASSERTIONS
5532
5533       An  assertion  is  a  test on the characters following or preceding the
5534       current matching point that does not actually consume  any  characters.
5535       The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
5536       described above.
5537
5538       More complicated assertions are coded as  subpatterns.  There  are  two
5539       kinds:  those  that  look  ahead of the current position in the subject
5540       string, and those that look  behind  it.  An  assertion  subpattern  is
5541       matched  in  the  normal way, except that it does not cause the current
5542       matching position to be changed.
5543
5544       Assertion subpatterns are not capturing subpatterns. If such an  asser-
5545       tion  contains  capturing  subpatterns within it, these are counted for
5546       the purposes of numbering the capturing subpatterns in the  whole  pat-
5547       tern.  However,  substring  capturing  is carried out only for positive
5548       assertions, because it does not make sense for negative assertions.
5549
5550       For compatibility with Perl, assertion  subpatterns  may  be  repeated;
5551       though  it  makes  no sense to assert the same thing several times, the
5552       side effect of capturing parentheses may  occasionally  be  useful.  In
5553       practice, there are only three cases:
5554
5555       (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
5556       matching.  However, it may  contain  internal  capturing  parenthesized
5557       groups that are called from elsewhere via the subroutine mechanism.
5558
5559       (2) If the quantifier is {0,n} where n is greater than zero,  it  is
5560       treated as if it were {0,1}. At run time, the rest of the pattern match
5561       is tried with and without the assertion, the order  depending  on  the
5562       greediness of the quantifier.
5563
5564       (3) If the minimum repetition is greater than zero, the  quantifier  is
5565       ignored.   The  assertion  is  obeyed just once when encountered during
5566       matching.
5567
5568   Lookahead assertions
5569
5570       Lookahead assertions start with (?= for positive assertions and (?! for
5571       negative assertions. For example,
5572
5573         \w+(?=;)
5574
5575       matches  a word followed by a semicolon, but does not include the semi-
5576       colon in the match, and
5577
5578         foo(?!bar)
5579
5580       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
5581       that the apparently similar pattern
5582
5583         (?!foo)bar
5584
5585       does  not  find  an  occurrence  of "bar" that is preceded by something
5586       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
5587       the assertion (?!foo) is always true when the next three characters are
5588       "bar". A lookbehind assertion is needed to achieve the other effect.
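
       The three lookahead patterns above can be compared with a short
       sketch. It assumes Python's re module as a stand-in for PCRE; the
       helper name first is hypothetical and reports the offset and text
       of the first match.

```python
import re


def first(pattern, subject):
    """Return (offset, text) of the first match, or None."""
    m = re.search(pattern, subject)
    return (m.start(), m.group(0)) if m else None


# The semicolon is required but is not part of the match.
print(first(r"\w+(?=;)", "total = count;"))     # (8, 'count')

# The first "foo" is followed by "bar", so the second one is found.
print(first(r"foo(?!bar)", "foobar foobaz"))    # (7, 'foo')

# (?!foo) looks ahead, not behind, so "bar" is found even after "foo".
print(first(r"(?!foo)bar", "foobar"))           # (3, 'bar')
```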
5589
5590       If you want to force a matching failure at some point in a pattern, the
5591       most  convenient  way  to  do  it  is with (?!) because an empty string
5592       always matches, so an assertion that requires there not to be an  empty
5593       string must always fail.  The backtracking control verb (*FAIL) or (*F)
5594       is a synonym for (?!).
5595
5596   Lookbehind assertions
5597
5598       Lookbehind assertions start with (?<= for positive assertions and  (?<!
5599       for negative assertions. For example,
5600
5601         (?<!foo)bar
5602
5603       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
5604       contents of a lookbehind assertion are restricted  such  that  all  the
5605       strings it matches must have a fixed length. However, if there are sev-
5606       eral top-level alternatives, they do not all  have  to  have  the  same
5607       fixed length. Thus
5608
5609         (?<=bullock|donkey)
5610
5611       is permitted, but
5612
5613         (?<!dogs?|cats?)
5614
5615       causes  an  error at compile time. Branches that match different length
5616       strings are permitted only at the top level of a lookbehind  assertion.
5617       This is an extension compared with Perl, which requires all branches to
5618       match the same length of string. An assertion such as
5619
5620         (?<=ab(c|de))
5621
5622       is not permitted, because its single top-level  branch  can  match  two
5623       different lengths, but it is acceptable to PCRE if rewritten to use two
5624       top-level branches:
5625
5626         (?<=abc|abde)
5627
5628       In some cases, the escape sequence \K (see above) can be  used  instead
5629       of a lookbehind assertion to get round the fixed-length restriction.
5630
5631       The  implementation  of lookbehind assertions is, for each alternative,
5632       to temporarily move the current position back by the fixed  length  and
5633       then try to match. If there are insufficient characters before the cur-
5634       rent position, the assertion fails.
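
       A minimal sketch of lookbehind behaviour follows, assuming Python's
       re module as a stand-in for PCRE. Only single-branch lookbehinds
       are used, because re is stricter than PCRE about alternatives of
       different lengths. The helper name first is hypothetical.

```python
import re


def first(pattern, subject):
    """Return (offset, text) of the first match, or None."""
    m = re.search(pattern, subject)
    return (m.start(), m.group(0)) if m else None


# "bar" is rejected only when the three preceding characters are "foo".
print(first(r"(?<!foo)bar", "foobar"))   # None
print(first(r"(?<!foo)bar", "bazbar"))   # (3, 'bar')

# With fewer than three characters before "bar", the positive form
# cannot succeed, whereas the negative form is satisfied.
print(first(r"(?<=foo)bar", "bar"))      # None
print(first(r"(?<!foo)bar", "bar"))      # (0, 'bar')
```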
5635
5636       In a UTF mode, PCRE does not allow the \C escape (which matches a  sin-
5637       gle  data  unit even in a UTF mode) to appear in lookbehind assertions,
5638       because it makes it impossible to calculate the length of  the  lookbe-
5639       hind.  The \X and \R escapes, which can match different numbers of data
5640       units, are also not permitted.
5641
5642       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
5643       lookbehinds,  as  long as the subpattern matches a fixed-length string.
5644       Recursion, however, is not supported.
5645
5646       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
5647       assertions to specify efficient matching of fixed-length strings at the
5648       end of subject strings. Consider a simple pattern such as
5649
5650         abcd$
5651
5652       when applied to a long string that does  not  match.  Because  matching
5653       proceeds from left to right, PCRE will look for each "a" in the subject
5654       and then see if what follows matches the rest of the  pattern.  If  the
5655       pattern is specified as
5656
5657         ^.*abcd$
5658
5659       the  initial .* matches the entire string at first, but when this fails
5660       (because there is no following "a"), it backtracks to match all but the
5661       last  character,  then all but the last two characters, and so on. Once
5662       again the search for "a" covers the entire string, from right to  left,
5663       so we are no better off. However, if the pattern is written as
5664
5665         ^.*+(?<=abcd)
5666
5667       there  can  be  no backtracking for the .*+ item; it can match only the
5668       entire string. The subsequent lookbehind assertion does a  single  test
5669       on  the last four characters. If it fails, the match fails immediately.
5670       For long strings, this approach makes a significant difference  to  the
5671       processing time.
5672
5673   Using multiple assertions
5674
5675       Several assertions (of any sort) may occur in succession. For example,
5676
5677         (?<=\d{3})(?<!999)foo
5678
5679       matches  "foo" preceded by three digits that are not "999". Notice that
5680       each of the assertions is applied independently at the  same  point  in
5681       the  subject  string.  First  there  is a check that the previous three
5682       characters are all digits, and then there is  a  check  that  the  same
5683       three characters are not "999".  This pattern does not match "foo" pre-
5684       ceded by six characters, the first three of which are digits  and  the
5685       last  three  of  which  are  not "999". For example, it doesn't match
5686       "123abcfoo". A pattern to do that is
5687
5688         (?<=\d{3}...)(?<!999)foo
5689
5690       This time the first assertion looks at the  preceding  six  characters,
5691       checking that the first three are digits, and then the second assertion
5692       checks that the preceding three characters are not "999".
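
       The difference between the two patterns can be checked with the
       sketch below, which assumes Python's re module as a stand-in for
       PCRE. The names SAME_POINT, SIX_BACK and found are hypothetical.

```python
import re

# Both assertions inspect the same three characters before "foo".
SAME_POINT = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """True when the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


print(found(SAME_POINT, "123foo"))      # True
print(found(SAME_POINT, "999foo"))      # False
print(found(SAME_POINT, "123abcfoo"))   # False
print(found(SIX_BACK, "123abcfoo"))     # True
print(found(SIX_BACK, "123999foo"))     # False
```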
5693
5694       Assertions can be nested in any combination. For example,
5695
5696         (?<=(?<!foo)bar)baz
5697
5698       matches an occurrence of "baz" that is preceded by "bar" which in  turn
5699       is not preceded by "foo", while
5700
5701         (?<=\d{3}(?!999)...)foo
5702
5703       is  another pattern that matches "foo" preceded by three digits and any
5704       three characters that are not "999".
5705
5706
5707CONDITIONAL SUBPATTERNS
5708
5709       It is possible to cause the matching process to obey a subpattern  con-
5710       ditionally  or to choose between two alternative subpatterns, depending
5711       on the result of an assertion, or whether a specific capturing  subpat-
5712       tern  has  already  been matched. The two possible forms of conditional
5713       subpattern are:
5714
5715         (?(condition)yes-pattern)
5716         (?(condition)yes-pattern|no-pattern)
5717
5718       If the condition is satisfied, the yes-pattern is used;  otherwise  the
5719       no-pattern  (if  present)  is used. If there are more than two alterna-
5720       tives in the subpattern, a compile-time error occurs. Each of  the  two
5721       alternatives may itself contain nested subpatterns of any form, includ-
5722       ing  conditional  subpatterns;  the  restriction  to  two  alternatives
5723       applies only at the level of the condition. This pattern fragment is an
5724       example where the alternatives are complex:
5725
5726         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
5727
5728
5729       There are four kinds of condition: references  to  subpatterns,  refer-
5730       ences to recursion, a pseudo-condition called DEFINE, and assertions.
5731
5732   Checking for a used subpattern by number
5733
5734       If  the  text between the parentheses consists of a sequence of digits,
5735       the condition is true if a capturing subpattern of that number has pre-
5736       viously  matched.  If  there is more than one capturing subpattern with
5737       the same number (see the earlier  section  about  duplicate  subpattern
5738       numbers),  the condition is true if any of them have matched. An alter-
5739       native notation is to precede the digits with a plus or minus sign.  In
5740       this  case, the subpattern number is relative rather than absolute. The
5741       most recently opened parentheses can be referenced by (?(-1), the  next
5742       most  recent  by (?(-2), and so on. Inside loops it can also make sense
5743       to refer to subsequent groups. The next parentheses to be opened can be
5744       referenced  as (?(+1), and so on. (The value zero in any of these forms
5745       is not used; it provokes a compile-time error.)
5746
5747       Consider the following pattern, which  contains  non-significant  white
5748       space to make it more readable (assume the PCRE_EXTENDED option) and to
5749       divide it into three parts for ease of discussion:
5750
5751         ( \( )?    [^()]+    (?(1) \) )
5752
5753       The first part matches an optional opening  parenthesis,  and  if  that
5754       character is present, sets it as the first captured substring. The sec-
5755       ond part matches one or more characters that are not  parentheses.  The
5756       third  part  is  a conditional subpattern that tests whether or not the
5757       first set of parentheses matched. If they did, that is, if the subject
5758       started  with an opening parenthesis, the condition is true, and so the
5759       yes-pattern is executed and a closing parenthesis is  required.  Other-
5760       wise,  since no-pattern is not present, the subpattern matches nothing.
5761       In other words, this pattern matches  a  sequence  of  non-parentheses,
5762       optionally enclosed in parentheses.
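
       The behaviour just described can be sketched as follows, assuming
       Python's re module as a stand-in for PCRE, with re.VERBOSE playing
       the role of PCRE_EXTENDED. The names OPTIONAL_PARENS and accepts
       are hypothetical.

```python
import re

# Same three parts as above: optional "(", non-parentheses, and a
# closing ")" that is required only if group 1 matched.
OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """True when the whole subject matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


print(accepts("(abc)"))   # True
print(accepts("abc"))     # True
print(accepts("(abc"))    # False: the closing parenthesis is missing
print(accepts("abc)"))    # False: nothing in the pattern consumes ")"
```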
5763
5764       If  you  were  embedding  this pattern in a larger one, you could use a
5765       relative reference:
5766
5767         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
5768
5769       This makes the fragment independent of the parentheses  in  the  larger
5770       pattern.
5771
5772   Checking for a used subpattern by name
5773
5774       Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
5775       used subpattern by name. For compatibility  with  earlier  versions  of
5776       PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
5777       also recognized. However, there is a possible ambiguity with this  syn-
5778       tax,  because  subpattern  names  may  consist entirely of digits. PCRE
5779       looks first for a named subpattern; if it cannot find one and the  name
5780       consists  entirely  of digits, PCRE looks for a subpattern of that num-
5781       ber, which must be greater than zero. Using subpattern names that  con-
5782       sist entirely of digits is not recommended.
5783
5784       Rewriting the above example to use a named subpattern gives this:
5785
5786         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
5787
5788       If  the  name used in a condition of this kind is a duplicate, the test
5789       is applied to all subpatterns of the same name, and is true if any  one
5790       of them has matched.
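
       A sketch of the named form follows, assuming Python's re module as
       a stand-in for PCRE. Python spells the group (?P<OPEN>...) and the
       test (?(OPEN)...), which corresponds to the older (?(name)...)
       syntax mentioned above rather than to the Perl forms.

```python
import re

# Stand-in engine: Python's re module, not PCRE.
NAMED_PARENS = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE
)

m = NAMED_PARENS.fullmatch("(abc)")
print(m.group("OPEN"))                          # (
print(NAMED_PARENS.fullmatch("abc").group("OPEN"))   # None: group unset
print(NAMED_PARENS.fullmatch("(abc") is None)   # True
```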
5791
5792   Checking for pattern recursion
5793
5794       If the condition is the string (R), and there is no subpattern with the
5795       name R, the condition is true if a recursive call to the whole  pattern
5796       or any subpattern has been made. If digits or a name preceded by amper-
5797       sand follow the letter R, for example:
5798
5799         (?(R3)...) or (?(R&name)...)
5800
5801       the condition is true if the most recent recursion is into a subpattern
5802       whose number or name is given. This condition does not check the entire
5803       recursion stack. If the name used in a condition  of  this  kind  is  a
5804       duplicate, the test is applied to all subpatterns of the same name, and
5805       is true if any one of them is the most recent recursion.
5806
5807       At "top level", all these recursion test  conditions  are  false.   The
5808       syntax for recursive patterns is described below.
5809
5810   Defining subpatterns for use by reference only
5811
5812       If  the  condition  is  the string (DEFINE), and there is no subpattern
5813       with the name DEFINE, the condition is  always  false.  In  this  case,
5814       there  may  be  only  one  alternative  in the subpattern. It is always
5815       skipped if control reaches this point  in  the  pattern;  the  idea  of
5816       DEFINE  is that it can be used to define subroutines that can be refer-
5817       enced from elsewhere. (The use of subroutines is described below.)  For
5818       example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
5819       could be written like this (ignore white space and line breaks):
5820
5821         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
5822         \b (?&byte) (\.(?&byte)){3} \b
5823
5824       The first part of the pattern is a DEFINE group inside  which  another
5825       group  named "byte" is defined. This matches an individual component of
5826       an IPv4 address (a number less than 256). When  matching  takes  place,
5827       this  part  of  the pattern is skipped because DEFINE acts like a false
5828       condition. The rest of the pattern uses references to the  named  group
5829       to  match the four dot-separated components of an IPv4 address, insist-
5830       ing on a word boundary at each end.
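
       The idea of defining a subpattern once and using it several times
       can be approximated in an engine without DEFINE. The sketch below
       assumes Python's re module, which has neither DEFINE nor (?&name)
       calls, so the "byte" subpattern is kept in a string and spliced in
       wherever (?&byte) appears above. The names BYTE, IPV4 and is_ipv4
       are hypothetical.

```python
import re

# Stand-in for the DEFINE group: the subpattern lives in a string.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Same shape as the PCRE pattern: one byte, then three dotted bytes,
# with a word boundary at each end.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")


def is_ipv4(text):
    """True when the whole text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


print(is_ipv4("192.168.23.245"))   # True
print(is_ipv4("256.1.1.1"))        # False: 256 is not a valid byte
print(is_ipv4("1.2.3"))            # False: only three components
```

       String splicing duplicates the subpattern text at compile time,
       whereas a DEFINE group keeps a single copy that is called by
       reference.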
5831
5832   Assertion conditions
5833
5834       If the condition is not in any of the above  formats,  it  must  be  an
5835       assertion.   This may be a positive or negative lookahead or lookbehind
5836       assertion. Consider  this  pattern,  again  containing  non-significant
5837       white space, and with the two alternatives on the second line:
5838
5839         (?(?=[^a-z]*[a-z])
5840         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
5841
5842       The  condition  is  a  positive  lookahead  assertion  that  matches an
5843       optional sequence of non-letters followed by a letter. In other  words,
5844       it  tests  for the presence of at least one letter in the subject. If a
5845       letter is found, the subject is matched against the first  alternative;
5846       otherwise  it  is  matched  against  the  second.  This pattern matches
5847       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
5848       letters and dd are digits.
5849
5850
5851COMMENTS
5852
5853       There are two ways of including comments in patterns that are processed
5854       by PCRE. In both cases, the start of the comment must not be in a char-
5855       acter class, nor in the middle of any other sequence of related charac-
5856       ters such as (?: or a subpattern name or number.  The  characters  that
5857       make up a comment play no part in the pattern matching.
5858
5859       The  sequence (?# marks the start of a comment that continues up to the
5860       next closing parenthesis. Nested parentheses are not permitted. If  the
5861       PCRE_EXTENDED option is set, an unescaped # character also introduces a
5862       comment, which in this case continues to  immediately  after  the  next
5863       newline  character  or character sequence in the pattern. Which charac-
5864       ters are interpreted as newlines is controlled by the options passed to
5865       a  compiling function or by a special sequence at the start of the pat-
5866       tern, as described in the section entitled "Newline conventions" above.
5867       Note that the end of this type of comment is a literal newline sequence
5868       in the pattern; escape sequences that happen to represent a newline  do
5869       not  count.  For  example,  consider this pattern when PCRE_EXTENDED is
5870       set, and the default newline convention is in force:
5871
5872         abc #comment \n still comment
5873
5874       On encountering the # character, pcre_compile()  skips  along,  looking
5875       for  a newline in the pattern. The sequence \n is still literal at this
5876       stage, so it does not terminate the comment. Only an  actual  character
5877       with the code value 0x0a (the default newline) does so.
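
       The distinction between an escape sequence and a literal newline
       can be sketched as follows. Python's re module is assumed as a
       stand-in for PCRE, with re.VERBOSE playing the role of
       PCRE_EXTENDED; the names INLINE, ESCAPED and LITERAL are
       hypothetical.

```python
import re

# (?# ... ) comment: everything up to the next ")" is ignored.
INLINE = re.compile(r"ab(?# not part of the pattern)c")

# The raw string holds a backslash and the letter n, not a newline, so
# the # comment runs to the end of the pattern.
ESCAPED = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# This string holds a real newline (code value 0x0a), which ends the
# comment; "def" is part of the pattern again.
LITERAL = re.compile("abc #comment \n def", re.VERBOSE)

print(INLINE.fullmatch("abc") is not None)       # True
print(ESCAPED.fullmatch("abc") is not None)      # True
print(LITERAL.fullmatch("abcdef") is not None)   # True
print(LITERAL.fullmatch("abc") is not None)      # False
```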
5878
5879
5880RECURSIVE PATTERNS
5881
5882       Consider  the problem of matching a string in parentheses, allowing for
5883       unlimited nested parentheses. Without the use of  recursion,  the  best
5884       that  can  be  done  is  to use a pattern that matches up to some fixed
5885       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
5886       depth.
5887
5888       For some time, Perl has provided a facility that allows regular expres-
5889       sions to recurse (amongst other things). It does this by  interpolating
5890       Perl  code in the expression at run time, and the code can refer to the
5891       expression itself. A Perl pattern using code interpolation to solve the
5892       parentheses problem can be created like this:
5893
5894         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
5895
5896       The (?p{...}) item interpolates Perl code at run time, and in this case
5897       refers recursively to the pattern in which it appears.
5898
5899       Obviously, PCRE cannot support the interpolation of Perl code. Instead,
5900       it  supports  special  syntax  for recursion of the entire pattern, and
5901       also for individual subpattern recursion.  After  its  introduction  in
5902       PCRE  and  Python,  this  kind of recursion was subsequently introduced
5903       into Perl at release 5.10.
5904
5905       A special item that consists of (? followed by a  number  greater  than
5906       zero  and  a  closing parenthesis is a recursive subroutine call of the
5907       subpattern of the given number, provided that  it  occurs  inside  that
5908       subpattern.  (If  not,  it is a non-recursive subroutine call, which is
5909       described in the next section.) The special item  (?R)  or  (?0)  is  a
5910       recursive call of the entire regular expression.
5911
5912       This  PCRE  pattern  solves  the nested parentheses problem (assume the
5913       PCRE_EXTENDED option is set so that white space is ignored):
5914
5915         \( ( [^()]++ | (?R) )* \)
5916
5917       First it matches an opening parenthesis. Then it matches any number  of
5918       substrings  which  can  either  be  a sequence of non-parentheses, or a
5919       recursive match of the pattern itself (that is, a  correctly  parenthe-
5920       sized substring).  Finally there is a closing parenthesis. Note the use
5921       of a possessive quantifier to avoid backtracking into sequences of non-
5922       parentheses.
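
       To make the control flow of this pattern concrete, here is a
       hand-written counterpart in Python. It is a sketch and not how
       PCRE implements recursion: Python's re module has no (?R), so the
       recursion is expressed as an ordinary recursive function whose
       name, match_parens, is hypothetical.

```python
def match_parens(subject, pos=0):
    """Match one correctly parenthesized substring starting at pos.

    Returns the index just past the match, or None if there is none.
    """
    # Opening parenthesis of the pattern.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject):
        if subject[pos] == ")":
            return pos + 1          # closing parenthesis: success
        if subject[pos] == "(":
            # Counterpart of (?R): match a nested group recursively.
            pos = match_parens(subject, pos)
            if pos is None:
                return None
        else:
            # Counterpart of [^()]++ : consume the whole run of
            # non-parentheses and never give any of it back.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None                     # subject ended before the closing ")"


print(match_parens("(ab(cd)ef)"))   # 10
print(match_parens("(ab(cd)ef"))    # None
```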
5923
5924       If  this  were  part of a larger pattern, you would not want to recurse
5925       the entire pattern, so instead you could use this:
5926
5927         ( \( ( [^()]++ | (?1) )* \) )
5928
5929       We have put the pattern into parentheses, and caused the  recursion  to
5930       refer to them instead of the whole pattern.
5931
5932       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
5933       tricky. This is made easier by the use of relative references.  Instead
5934       of (?1) in the pattern above you can write (?-2) to refer to the second
5935       most recently opened parentheses  preceding  the  recursion.  In  other
5936       words,  a  negative  number counts capturing parentheses leftwards from
5937       the point at which it is encountered.
5938
5939       It is also possible to refer to  subsequently  opened  parentheses,  by
5940       writing  references  such  as (?+2). However, these cannot be recursive
5941       because the reference is not inside the  parentheses  that  are  refer-
5942       enced.  They are always non-recursive subroutine calls, as described in
5943       the next section.
5944
5945       An alternative approach is to use named parentheses instead.  The  Perl
5946       syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
5947       supported. We could rewrite the above example as follows:
5948
5949         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
5950
5951       If there is more than one subpattern with the same name,  the  earliest
5952       one is used.
5953
5954       This  particular  example pattern that we have been looking at contains
5955       nested unlimited repeats, and so the use of a possessive quantifier for
5956       matching strings of non-parentheses is important when applying the pat-
5957       tern to strings that do not match. For example, when  this  pattern  is
5958       applied to
5959
5960         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
5961
5962       it  yields  "no  match" quickly. However, if a possessive quantifier is
5963       not used, the match runs for a very long time indeed because there  are
5964       so  many  different  ways the + and * repeats can carve up the subject,
5965       and all have to be tested before failure can be reported.
5966
5967       At the end of a match, the values of capturing  parentheses  are  those
5968       from  the outermost level. If you want to obtain intermediate values, a
5969       callout function can be used (see below and the pcrecallout  documenta-
5970       tion). If the pattern above is matched against
5971
5972         (ab(cd)ef)
5973
5974       the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
5975       which is the last value taken on at the top level. If a capturing  sub-
5976       pattern  is  not  matched at the top level, its final captured value is
5977       unset, even if it was (temporarily) set at a deeper  level  during  the
5978       matching process.
5979
5980       If  there are more than 15 capturing parentheses in a pattern, PCRE has
5981       to obtain extra memory to store data during a recursion, which it  does
5982       by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
5983       can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
5984
5985       Do not confuse the (?R) item with the condition (R),  which  tests  for
5986       recursion.   Consider  this pattern, which matches text in angle brack-
5987       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
5988       brackets  (that is, when recursing), whereas any characters are permit-
5989       ted at the outer level.
5990
5991         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
5992
5993       In this pattern, (?(R) is the start of a conditional  subpattern,  with
5994       two  different  alternatives for the recursive and non-recursive cases.
5995       The (?R) item is the actual recursive call.
5996
5997   Differences in recursion processing between PCRE and Perl
5998
5999       Recursion processing in PCRE differs from Perl in two  important  ways.
6000       In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
6001       always treated as an atomic group. That is, once it has matched some of
6002       the subject string, it is never re-entered, even if it contains untried
6003       alternatives and there is a subsequent matching failure.  This  can  be
6004       illustrated  by the following pattern, which purports to match a palin-
6005       dromic string that contains an odd number of characters  (for  example,
6006       "a", "aba", "abcba", "abcdcba"):
6007
6008         ^(.|(.)(?1)\2)$
6009
6010       The idea is that it either matches a single character, or two identical
6011       characters surrounding a sub-palindrome. In Perl, this  pattern  works;
6012       in  PCRE  it  does  not  if the subject is longer than three characters.
6013       Consider the subject string "abcba":
6014
6015       At the top level, the first character is matched, but as it is  not  at
6016       the end of the string, the first alternative fails; the second alterna-
6017       tive is taken and the recursion kicks in. The recursive call to subpat-
6018       tern  1  successfully  matches the next character ("b"). (Note that the
6019       beginning and end of line tests are not part of the recursion.)
6020
6021       Back at the top level, the next character ("c") is compared  with  what
6022       subpattern  2 matched, which was "a". This fails. Because the recursion
6023       is treated as an atomic group, there are now  no  backtracking  points,
6024       and  so  the  entire  match fails. (Perl is able, at this point, to re-
6025       enter the recursion and try the second alternative.)  However,  if  the
6026       pattern is written with the alternatives in the other order, things are
6027       different:
6028
6029         ^((.)(?1)\2|.)$
6030
6031       This time, the recursing alternative is tried first, and  continues  to
6032       recurse  until  it runs out of characters, at which point the recursion
6033       fails. But this time we do have  another  alternative  to  try  at  the
6034       higher  level.  That  is  the  big difference: in the previous case the
6035       remaining alternative is at a deeper recursion level, which PCRE cannot
6036       use.
6037
6038       To  change  the pattern so that it matches all palindromic strings, not
6039       just those with an odd number of characters, it is tempting  to  change
6040       the pattern to this:
6041
6042         ^((.)(?1)\2|.?)$
6043
6044       Again,  this  works  in Perl, but not in PCRE, and for the same reason.
6045       When a deeper recursion has matched a single character,  it  cannot  be
6046       entered  again  in  order  to match an empty string. The solution is to
6047       separate the two cases, and write out the odd and even cases as  alter-
6048       natives at the higher level:
6049
6050         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
6051
6052       If  you  want  to match typical palindromic phrases, the pattern has to
6053       ignore all non-word characters, which can be done like this:
6054
6055         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
6056
6057       If run with the PCRE_CASELESS option, this pattern matches phrases such
6058       as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
6059       Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
6060       ing  into  sequences of non-word characters. Without this, PCRE takes a
6061       great deal longer (ten times or more) to  match  typical  phrases,  and
6062       Perl takes so long that you think it has gone into a loop.
6063
6064       WARNING:  The  palindrome-matching patterns above work only if the sub-
6065       ject string does not start with a palindrome that is shorter  than  the
6066       entire  string.  For example, although "abcba" is correctly matched, if
6067       the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
6068       then  fails at top level because the end of the string does not follow.
6069       Once again, it cannot jump back into the recursion to try other  alter-
6070       natives, so the entire match fails.
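
       When checking what the patterns above are intended to accept, a direct
       test outside the regular expression engine can serve as a reference.
       The Python sketch below is such a cross-check and nothing more: it
       does not use recursion in a pattern (Python's re module has none), and
       the function name is hypothetical. Unlike the PCRE patterns, it is not
       affected by a shorter palindrome at the start of the subject.

```python
import re


def is_palindromic_phrase(text: str) -> bool:
    """Ignore non-word characters and case, then compare with the reverse."""
    # With re.ASCII, \W matches everything except [A-Za-z0-9_].
    letters = re.sub(r"\W", "", text, flags=re.ASCII).lower()
    return letters == letters[::-1]


if __name__ == "__main__":
    for phrase in ("A man, a plan, a canal: Panama!", "abcba", "ababa", "abcab"):
        print(repr(phrase), is_palindromic_phrase(phrase))
```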
6071
6072       The  second  way  in which PCRE and Perl differ in their recursion pro-
6073       cessing is in the handling of captured values. In Perl, when a  subpat-
6074       tern  is  called recursively or as a subroutine (see the next section),
6075       it has no access to any values that were captured  outside  the  recur-
6076       sion,  whereas  in  PCRE  these values can be referenced. Consider this
6077       pattern:
6078
6079         ^(.)(\1|a(?2))
6080
6081       In PCRE, this pattern matches "bab". The  first  capturing  parentheses
6082       match "b"; then, in the second group, the back reference \1 (now  "b")
6083       fails at "a", so the second alternative matches "a" and  recurses.  In
6084       the  recursion,  \1 does now match "b" and so the whole match succeeds.
6085       In Perl, the pattern fails to match because inside the  recursive  call
6086       \1 cannot access the externally set value.
6087
6088
6089SUBPATTERNS AS SUBROUTINES
6090
6091       If  the  syntax for a recursive subpattern call (either by number or by
6092       name) is used outside the parentheses to which it refers,  it  operates
6093       like  a subroutine in a programming language. The called subpattern may
6094       be defined before or after the reference. A numbered reference  can  be
6095       absolute or relative, as in these examples:
6096
6097         (...(absolute)...)...(?2)...
6098         (...(relative)...)...(?-1)...
6099         (...(?+1)...(relative)...
6100
6101       An earlier example pointed out that the pattern
6102
6103         (sens|respons)e and \1ibility
6104
6105       matches  "sense and sensibility" and "response and responsibility", but
6106       not "sense and responsibility". If instead the pattern
6107
6108         (sens|respons)e and (?1)ibility
6109
6110       is used, it does match "sense and responsibility" as well as the  other
6111       two  strings.  Another  example  is  given  in the discussion of DEFINE
6112       above.
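
       The difference between the back reference and the subroutine call can
       be approximated in Python, with one caveat: Python's re module has no
       (?1) syntax, so the sketch below emulates the call by writing the
       subpattern out a second time. That textual copy is an assumption made
       for illustration, not how PCRE implements the call.

```python
import re

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern is repeated, so the second
# occurrence is free to choose a different alternative from the first.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

if __name__ == "__main__":
    for s in subjects:
        print(s, bool(backref.fullmatch(s)), bool(subroutine_like.fullmatch(s)))
```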
6113
6114       All subroutine calls, whether recursive or not, are always  treated  as
6115       atomic  groups. That is, once a subroutine has matched some of the sub-
6116       ject string, it is never re-entered, even if it contains untried alter-
6117       natives  and  there  is  a  subsequent  matching failure. Any capturing
6118       parentheses that are set during the subroutine  call  revert  to  their
6119       previous values afterwards.
6120
6121       Processing  options  such as case-independence are fixed when a subpat-
6122       tern is defined, so if it is used as a subroutine, such options  cannot
6123       be changed for different calls. For example, consider this pattern:
6124
6125         (abc)(?i:(?-1))
6126
6127       It  matches  "abcabc". It does not match "abcABC" because the change of
6128       processing option does not affect the called subpattern.
6129
6130
6131ONIGURUMA SUBROUTINE SYNTAX
6132
6133       For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
6134       name or a number enclosed either in angle brackets or single quotes  is
6135       an alternative syntax for referencing a  subpattern  as  a  subroutine,
6136       possibly  recursively. Here are two of the examples used above, rewrit-
6137       ten using this syntax:
6138
6139         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
6140         (sens|respons)e and \g'1'ibility
6141
6142       PCRE supports an extension to Oniguruma: if a number is preceded  by  a
6143       plus or a minus sign it is taken as a relative reference. For example:
6144
6145         (abc)(?i:\g<-1>)
6146
6147       Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
6148       synonymous. The former is a back reference; the latter is a  subroutine
6149       call.
6150
6151
6152CALLOUTS
6153
6154       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
6155       Perl code to be obeyed in the middle of matching a regular  expression.
6156       This makes it possible, amongst other things, to extract different sub-
6157       strings that match the same pair of parentheses when there is a repeti-
6158       tion.
6159
6160       PCRE provides a similar feature, but of course it cannot obey arbitrary
6161       Perl code. The feature is called "callout". The caller of PCRE provides
6162       an  external function by putting its entry point in the global variable
6163       pcre_callout (8-bit library) or  pcre16_callout  (16-bit  library).  By
6164       default, this variable contains NULL, which disables all calling out.
6165
6166       Within  a  regular  expression,  (?C) indicates the points at which the
6167       external function is to be called. If you want  to  identify  different
6168       callout  points, you can put a number less than 256 after the letter C.
6169       The default value is zero.  For example, this pattern has  two  callout
6170       points:
6171
6172         (?C1)abc(?C2)def
6173
6174       If  the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
6175       outs are automatically installed before each item in the pattern.  They
6176       are all numbered 255.
6177
6178       During  matching, when PCRE reaches a callout point, the external func-
6179       tion is called. It is provided with the  number  of  the  callout,  the
6180       position  in  the pattern, and, optionally, one item of data originally
6181       supplied by the caller of the matching function. The  callout  function
6182       may  cause  matching to proceed, to backtrack, or to fail altogether. A
6183       complete description of the interface to the callout function is  given
6184       in the pcrecallout documentation.
6185
6186
6187BACKTRACKING CONTROL
6188
6189       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
6190       which are described in the Perl documentation as "experimental and sub-
6191       ject  to  change or removal in a future version of Perl". It goes on to
6192       say: "Their usage in production code should be noted to avoid  problems
6193       during upgrades." The same remarks apply to the PCRE features described
6194       in this section.
6195
6196       Since these verbs are specifically related  to  backtracking,  most  of
6197       them  can  be  used only when the pattern is to be matched using one of
6198       the traditional matching functions, which use a backtracking algorithm.
6199       With  the  exception  of (*FAIL), which behaves like a failing negative
6200       assertion, they cause an error if encountered by a DFA  matching  func-
6201       tion.
6202
6203       If  any of these verbs are used in an assertion or in a subpattern that
6204       is called as a subroutine (whether or not recursively), their effect is
6205       confined to that subpattern; it does not extend to the surrounding pat-
6206       tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
6207       that  is  encountered in a successful positive assertion is passed back
6208       when a match succeeds (compare capturing  parentheses  in  assertions).
6209       Note that such subpatterns are processed as anchored at the point where
6210       they are tested. Note also that Perl's  treatment  of  subroutines  and
6211       assertions is different in some cases.
6212
6213       The  new verbs make use of what was previously invalid syntax: an open-
6214       ing parenthesis followed by an asterisk. They are generally of the form
6215       (*VERB)  or (*VERB:NAME). Some may take either form, with differing be-
6216       haviour, depending on whether or not an argument is present. A name  is
6217       any sequence of characters that does not include a closing parenthesis.
6218       The maximum length of a name is 255 in the 8-bit library and  65535  in
6219       the  16-bit  library.  If  the  name is empty, that is, if the closing
6220       parenthesis immediately follows the colon, the effect is as if the colon
6221       were not there. Any number of these verbs may occur in a pattern.
6222
6223   Optimizations that affect backtracking verbs
6224
6225       PCRE  contains some optimizations that are used to speed up matching by
6226       running some checks at the start of each match attempt. For example, it
6227       may  know  the minimum length of matching subject, or that a particular
6228       character must be present. When one of these  optimizations  suppresses
6229       the  running  of  a match, any included backtracking verbs will not, of
6230       course, be processed. You can suppress the start-of-match optimizations
6231       by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-
6232       pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
6233       There is more discussion of this option in the section entitled "Option
6234       bits for pcre_exec()" in the pcreapi documentation.
6235
6236       Experiments with Perl suggest that it too  has  similar  optimizations,
6237       sometimes leading to anomalous results.
6238
6239   Verbs that act immediately
6240
6241       The  following  verbs act as soon as they are encountered. They may not
6242       be followed by a name.
6243
6244          (*ACCEPT)
6245
6246       This verb causes the match to end successfully, skipping the  remainder
6247       of  the pattern. However, when it is inside a subpattern that is called
6248       as a subroutine, only that subpattern is ended  successfully.  Matching
6249       then  continues  at  the  outer level. If (*ACCEPT) is inside capturing
6250       parentheses, the data so far is captured. For example:
6251
6252         A((?:A|B(*ACCEPT)|C)D)
6253
6254       This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
6255       tured by the outer parentheses.
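
       For this particular pattern, where nothing follows the capturing
       group, the effect of (*ACCEPT) can be approximated without the verb by
       moving the "B" branch out of the part that requires a trailing "D".
       The Python sketch below shows that rewrite; it is an approximation
       that holds for this example only, and the name approx is hypothetical.

```python
import re

# Verb-free approximation of A((?:A|B(*ACCEPT)|C)D): the B branch ends the
# match at once, so it is written as an alternative that needs no D.
approx = re.compile(r"A(B|(?:A|C)D)")

if __name__ == "__main__":
    for subject in ("AB", "AAD", "ACD", "AD"):
        m = approx.fullmatch(subject)
        print(subject, m.group(1) if m else None)
```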
6256
6257         (*FAIL) or (*F)
6258
6259       This  verb causes a matching failure, forcing backtracking to occur. It
6260       is equivalent to (?!) but easier to read. The Perl documentation  notes
6261       that  it  is  probably  useful only when combined with (?{}) or (??{}).
6262       Those are, of course, Perl features that are not present in  PCRE.  The
6263       nearest  equivalent is the callout feature, as for example in this pat-
6264       tern:
6265
6266         a+(?C)(*FAIL)
6267
6268       A match with the string "aaaa" always fails, but the callout  is  taken
6269       before each backtrack happens (in this example, 10 times).
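
       The figure of 10 can be checked with a toy model of the matcher. The
       Python sketch below is not PCRE code; it assumes an unanchored match
       with no start-of-match optimizations, a greedy a+ that gives back one
       character per backtrack, and one callout each time (?C) is reached.
       The function name is hypothetical.

```python
def count_callouts(subject: str) -> int:
    """Count (?C) invocations for a+(?C)(*FAIL) in a simplified model."""
    calls = 0
    for start in range(len(subject) + 1):  # unanchored: try every start offset
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then backtracks one character at a
        # time; every length from run down to 1 reaches the callout once.
        for _length in range(run, 0, -1):
            calls += 1
    return calls


if __name__ == "__main__":
    print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```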
6270
6271   Recording which path was taken
6272
6273       There  is  one  verb  whose  main  purpose  is to track how a match was
6274       arrived at, though it also has a  secondary  use  in  conjunction  with
6275       advancing the match starting point (see (*SKIP) below).
6276
6277         (*MARK:NAME) or (*:NAME)
6278
6279       A  name  is  always  required  with  this  verb.  There  may be as many
6280       instances of (*MARK) as you like in a pattern, and their names  do  not
6281       have to be unique.
6282
6283       When  a match succeeds, the name of the last-encountered (*MARK) on the
6284       matching path is passed back to the caller as described in the  section
6285       entitled  "Extra  data  for  pcre_exec()" in the pcreapi documentation.
6286       Here is an example of pcretest output, where the /K  modifier  requests
6287       the retrieval and outputting of (*MARK) data:
6288
6289           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
6290         data> XY
6291          0: XY
6292         MK: A
6293         XZ
6294          0: XZ
6295         MK: B
6296
6297       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
6298       ple it indicates which of the two alternatives matched. This is a  more
6299       efficient  way of obtaining this information than putting each alterna-
6300       tive in its own capturing parentheses.
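
       For comparison, the capturing-parentheses approach mentioned above can
       be sketched in Python, whose re module has no (*MARK). Each
       alternative gets its own named group, and the name of the group that
       took part in the match is read back afterwards. The group names and
       the helper which_alternative are hypothetical.

```python
import re

# Each alternative sits in its own named capturing group.
pattern = re.compile(r"X(?:(?P<A>Y)|(?P<B>Z))")


def which_alternative(subject: str):
    """Return the name of the alternative that matched, or None."""
    m = pattern.fullmatch(subject)
    # lastgroup is the name of the last capturing group that matched.
    return m.lastgroup if m else None


if __name__ == "__main__":
    for subject in ("XY", "XZ", "XP"):
        print(subject, which_alternative(subject))
```

       Note that this approach reports nothing after a failed match, whereas
       (*MARK) data is also available in that case, as described below.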
6301
6302       If (*MARK) is encountered in a positive assertion, its name is recorded
6303       and passed back if it is the last-encountered. This does not happen for
6304       negative assertions.
6305
6306       After a partial match or a failed match, the name of the  last  encoun-
6307       tered (*MARK) in the entire match process is returned. For example:
6308
6309           re> /X(*MARK:A)Y|X(*MARK:B)Z/K
6310         data> XP
6311         No match, mark = B
6312
6313       Note  that  in  this  unanchored  example the mark is retained from the
6314       match attempt that started at the letter "X" in the subject. Subsequent
6315       match attempts starting at "P" and then with an empty string do not get
6316       as far as the (*MARK) item, but nevertheless do not reset it.
6317
6318       If you are interested in  (*MARK)  values  after  failed  matches,  you
6319       should  probably  set  the PCRE_NO_START_OPTIMIZE option (see above) to
6320       ensure that the match is always attempted.
6321
6322   Verbs that act after backtracking
6323
6324       The following verbs do nothing when they are encountered. Matching con-
6325       tinues  with what follows, but if there is no subsequent match, causing
6326       a backtrack to the verb, a failure is  forced.  That  is,  backtracking
6327       cannot  pass  to the left of the verb. However, when one of these verbs
6328       appears inside an atomic group, its effect is confined to  that  group,
6329       because  once the group has been matched, there is never any backtrack-
6330       ing into it. In this situation, backtracking can  "jump  back"  to  the
6331       left  of the entire atomic group. (Remember also, as stated above, that
6332       this localization also applies in subroutine calls and assertions.)
6333
6334       These verbs differ in exactly what kind of failure  occurs  when  back-
6335       tracking reaches them.
6336
6337         (*COMMIT)
6338
6339       This  verb, which may not be followed by a name, causes the whole match
6340       to fail outright if the rest of the pattern does not match. Even if the
6341       pattern is unanchored, no further attempts to find a match by advancing
6342       the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
6343       pcre_exec()  is  committed  to  finding a match at the current starting
6344       point, or not at all. For example:
6345
6346         a+(*COMMIT)b
6347
6348       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
6349       of dynamic anchor, or "I've started, so I must finish." The name of the
6350       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
6351       forces a match failure.
6352
6353       Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
6354       anchor, unless PCRE's start-of-match optimizations are turned  off,  as
6355       shown in this pcretest example:
6356
6357           re> /(*COMMIT)abc/
6358         data> xyzabc
6359          0: abc
6360         xyzabc\Y
6361         No match
6362
6363       PCRE  knows  that  any  match  must start with "a", so the optimization
6364       skips along the subject to "a" before running the first match  attempt,
6365       which  succeeds.  When the optimization is disabled by the \Y escape in
6366       the second subject, the match starts at "x" and so the (*COMMIT) causes
6367       it to fail without trying any other starting points.
6368
6369         (*PRUNE) or (*PRUNE:NAME)
6370
6371       This  verb causes the match to fail at the current starting position in
6372       the subject if the rest of the pattern does not match. If  the  pattern
6373       is  unanchored,  the  normal  "bumpalong"  advance to the next starting
6374       character then happens. Backtracking can occur as usual to the left  of
6375       (*PRUNE),  before  it  is  reached,  or  when  matching to the right of
6376       (*PRUNE), but if there is no match to the  right,  backtracking  cannot
6377       cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
6378       native to an atomic group or possessive quantifier, but there are  some
6379       uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
6380       iour of (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE).  In  an
6381       anchored pattern (*PRUNE) has the same effect as (*COMMIT).
6382
6383         (*SKIP)
6384
6385       This  verb, when given without a name, is like (*PRUNE), except that if
6386       the pattern is unanchored, the "bumpalong" advance is not to  the  next
6387       character, but to the position in the subject where (*SKIP) was encoun-
6388       tered. (*SKIP) signifies that whatever text was matched leading  up  to
6389       it cannot be part of a successful match. Consider:
6390
6391         a+(*SKIP)b
6392
6393       If  the  subject  is  "aaaac...",  after  the first match attempt fails
6394       (starting at the first character in the  string),  the  starting  point
6395       skips on to start the next attempt at "c". Note that a possessive quan-
6396       tifier does not have the same effect as this example; although it would
6397       suppress  backtracking  during  the  first  match  attempt,  the second
6398       attempt would start at the second character instead of skipping  on  to
6399       "c".
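
       The effect of these verbs on the "bumpalong" advance can be sketched
       with a toy model. The Python code below is not PCRE code: it hard-
       codes the pattern a+(*VERB)b, assumes that start-of-match
       optimizations are disabled, and records which start offsets are
       tried. The function name search and its verb parameter are
       hypothetical.

```python
def search(subject, verb=None):
    """Toy model of an unanchored search for a+(*VERB)b.

    verb is None, "PRUNE", "SKIP" or "COMMIT". Returns (match, starts):
    the matched text or None, and the list of start offsets tried.
    """
    n = len(subject)
    starts = []
    start = 0
    while start <= n:
        starts.append(start)
        end = start
        while end < n and subject[end] == "a":  # greedy a+
            end += 1
        if end == start:
            start += 1  # a+ failed before the verb was reached
            continue
        if end < n and subject[end] == "b":
            return subject[start:end + 1], starts
        # "b" failed. Without a verb the matcher would also retry shorter
        # runs of "a", but the character after a shorter run is still "a",
        # so those retries cannot succeed for this particular pattern.
        if verb == "COMMIT":
            return None, starts  # no further start offsets are tried
        if verb == "SKIP":
            start = end  # resume where (*SKIP) was encountered
        else:
            start += 1  # None or "PRUNE": advance by one character
    return None, starts


if __name__ == "__main__":
    for verb in (None, "PRUNE", "SKIP", "COMMIT"):
        print(verb, search("aaaac", verb))
```

       In this model, (*SKIP) on "aaaac" tries offsets 0, 4 and 5 only, while
       (*PRUNE) and the verb-free pattern try every offset, and (*COMMIT)
       stops after offset 0.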
6400
6401         (*SKIP:NAME)
6402
6403       When  (*SKIP) has an associated name, its behaviour is modified. If the
6404       following pattern fails to match, the previous path through the pattern
6405       is  searched for the most recent (*MARK) that has the same name. If one
6406       is found, the "bumpalong" advance is to the subject position that  cor-
6407       responds  to  that (*MARK) instead of to where (*SKIP) was encountered.
6408       If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
6409
6410         (*THEN) or (*THEN:NAME)
6411
6412       This verb causes a skip to the next innermost alternative if  the  rest
6413       of  the  pattern does not match. That is, it cancels pending backtrack-
6414       ing, but only within the current alternative. Its name comes  from  the
6415       observation that it can be used for a pattern-based if-then-else block:
6416
6417         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
6418
6419       If  the COND1 pattern matches, FOO is tried (and possibly further items
6420       after the end of the group if FOO succeeds); on  failure,  the  matcher
6421       skips  to  the second alternative and tries COND2, without backtracking
6422       into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as
6423       (*MARK:NAME)(*THEN).   If (*THEN) is not inside an alternation, it acts
6424       like (*PRUNE).
6425
6426       Note that a subpattern that does not contain a | character  is  just  a
6427       part  of the enclosing alternative; it is not a nested alternation with
6428       only one alternative. The effect of (*THEN) extends beyond such a  sub-
6429       pattern  to  the enclosing alternative. Consider this pattern, where A,
6430       B, etc. are complex pattern fragments that do not contain any | charac-
6431       ters at this level:
6432
6433         A (B(*THEN)C) | D
6434
6435       If  A and B are matched, but there is a failure in C, matching does not
6436       backtrack into A; instead it moves to the next alternative, that is, D.
6437       However,  if the subpattern containing (*THEN) is given an alternative,
6438       it behaves differently:
6439
6440         A (B(*THEN)C | (*FAIL)) | D
6441
6442       The effect of (*THEN) is now confined to the inner subpattern. After  a
6443       failure in C, matching moves to (*FAIL), which causes the whole subpat-
6444       tern to fail because there are no more alternatives  to  try.  In  this
6445       case, matching does now backtrack into A.
6446
6447       Note also that a conditional subpattern is not considered as having two
6448       alternatives, because only one is ever used.  In  other  words,  the  |
6449       character in a conditional subpattern has a different meaning. Ignoring
6450       white space, consider:
6451
6452         ^.*? (?(?=a) a | b(*THEN)c )
6453
6454       If the subject is "ba", this pattern does not  match.  Because  .*?  is
6455       ungreedy,  it  initially  matches  zero characters. The condition (?=a)
6456       then fails, the character "b" is matched,  but  "c"  is  not.  At  this
6457       point,  matching does not backtrack to .*? as might perhaps be expected
6458       from the presence of the | character.  The  conditional  subpattern  is
6459       part of the single alternative that comprises the whole pattern, and so
6460       the match fails. (If there were a backtrack into  .*?,  allowing it  to
6461       match "b", the match would succeed.)
6462
6463       The  verbs just described provide four different "strengths" of control
6464       when subsequent matching fails. (*THEN) is the weakest, carrying on the
6465       match  at  the next alternative. (*PRUNE) comes next, failing the match
6466       at the current starting position, but allowing an advance to  the  next
6467       character  (for an unanchored pattern). (*SKIP) is similar, except that
6468       the advance may be more than one character. (*COMMIT) is the strongest,
6469       causing the entire match to fail.
6470
6471       If more than one such verb is present in a pattern, the "strongest" one
6472       wins.  For example, consider this pattern, where A, B, etc. are complex
6473       pattern fragments:
6474
6475         (A(*COMMIT)B(*THEN)C|D)
6476
6477       Once  A  has  matched,  PCRE is committed to this match, at the current
6478       starting position. If subsequently B matches, but C does not, the  nor-
6479       mal (*THEN) action of trying the next alternative (that is, D) does not
6480       happen because (*COMMIT) overrides.
6481
6482
6483SEE ALSO
6484
6485       pcreapi(3), pcrecallout(3),  pcrematching(3),  pcresyntax(3),  pcre(3),
6486       pcre16(3).
6487
6488
6489AUTHOR
6490
6491       Philip Hazel
6492       University Computing Service
6493       Cambridge CB2 3QH, England.
6494
6495
6496REVISION
6497
6498       Last updated: 17 June 2012
6499       Copyright (c) 1997-2012 University of Cambridge.
6500------------------------------------------------------------------------------
6501
6502
6503PCRESYNTAX(3)                                                    PCRESYNTAX(3)
6504
6505
6506NAME
6507       PCRE - Perl-compatible regular expressions
6508
6509
6510PCRE REGULAR EXPRESSION SYNTAX SUMMARY
6511
6512       The  full syntax and semantics of the regular expressions that are sup-
6513       ported by PCRE are described in  the  pcrepattern  documentation.  This
6514       document contains a quick-reference summary of the syntax.
6515
6516
6517QUOTING
6518
6519         \x         where x is non-alphanumeric is a literal x
6520         \Q...\E    treat enclosed characters as literal
6521
6522
6523CHARACTERS
6524
6525         \a         alarm, that is, the BEL character (hex 07)
6526         \cx        "control-x", where x is any ASCII character
6527         \e         escape (hex 1B)
6528         \f         form feed (hex 0C)
6529         \n         newline (hex 0A)
6530         \r         carriage return (hex 0D)
6531         \t         tab (hex 09)
6532         \ddd       character with octal code ddd, or backreference
6533         \xhh       character with hex code hh
6534         \x{hhh..}  character with hex code hhh..
6535
6536
6537CHARACTER TYPES
6538
6539         .          any character except newline;
6540                      in dotall mode, any character whatsoever
6541         \C         one data unit, even in UTF mode (best avoided)
6542         \d         a decimal digit
6543         \D         a character that is not a decimal digit
6544         \h         a horizontal white space character
6545         \H         a character that is not a horizontal white space character
6546         \N         a character that is not a newline
6547         \p{xx}     a character with the xx property
6548         \P{xx}     a character without the xx property
6549         \R         a newline sequence
6550         \s         a white space character
6551         \S         a character that is not a white space character
6552         \v         a vertical white space character
6553         \V         a character that is not a vertical white space character
6554         \w         a "word" character
6555         \W         a "non-word" character
6556         \X         an extended Unicode sequence
6557
6558       In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
6559       characters, even in a UTF mode. However, this can be changed by setting
6560       the PCRE_UCP option.
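
       As a rough analogy only, the same distinction exists in Python's re
       module, though with the opposite default: for str patterns \d is
       Unicode-aware unless re.ASCII is given. The sketch below uses Python,
       not PCRE, and the variable names are hypothetical; re.ASCII here
       corresponds loosely to PCRE's default, and Python's default to PCRE
       with PCRE_UCP set.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# Python's default for str patterns: \d uses Unicode properties.
unicode_digit = bool(re.fullmatch(r"\d", arabic_three))

# With re.ASCII, \d matches only the ASCII digits 0-9.
ascii_digit = bool(re.fullmatch(r"\d", arabic_three, re.ASCII))

if __name__ == "__main__":
    print(unicode_digit, ascii_digit)
```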
6561
6562
6563GENERAL CATEGORY PROPERTIES FOR \p and \P
6564
6565         C          Other
6566         Cc         Control
6567         Cf         Format
6568         Cn         Unassigned
6569         Co         Private use
6570         Cs         Surrogate
6571
6572         L          Letter
6573         Ll         Lower case letter
6574         Lm         Modifier letter
6575         Lo         Other letter
6576         Lt         Title case letter
6577         Lu         Upper case letter
6578         L&         Ll, Lu, or Lt
6579
6580         M          Mark
6581         Mc         Spacing mark
6582         Me         Enclosing mark
6583         Mn         Non-spacing mark
6584
6585         N          Number
6586         Nd         Decimal number
6587         Nl         Letter number
6588         No         Other number
6589
6590         P          Punctuation
6591         Pc         Connector punctuation
6592         Pd         Dash punctuation
6593         Pe         Close punctuation
6594         Pf         Final punctuation
6595         Pi         Initial punctuation
6596         Po         Other punctuation
6597         Ps         Open punctuation
6598
6599         S          Symbol
6600         Sc         Currency symbol
6601         Sk         Modifier symbol
6602         Sm         Mathematical symbol
6603         So         Other symbol
6604
6605         Z          Separator
6606         Zl         Line separator
6607         Zp         Paragraph separator
6608         Zs         Space separator
6609
6610
6611PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
6612
6613         Xan        Alphanumeric: union of properties L and N
6614         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
6615         Xsp        Perl space: property Z or tab, NL, FF, CR
6616         Xwd        Perl word: property Xan or underscore
6617
6618
6619SCRIPT NAMES FOR \p and \P
6620
6621       Arabic,  Armenian,  Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
6622       Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Chakma,
6623       Cham,  Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
6624       Devanagari,  Egyptian_Hieroglyphs,  Ethiopic,   Georgian,   Glagolitic,
6625       Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
6626       gana,  Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,   Inscrip-
6627       tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
6628       Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B,  Lisu,  Lycian,
6629       Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
6630       Meroitic_Hieroglyphs,  Miao,  Mongolian,  Myanmar,  New_Tai_Lue,   Nko,
6631       Ogham,    Old_Italic,   Old_Persian,   Old_South_Arabian,   Old_Turkic,
6632       Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,  Samari-
6633       tan,  Saurashtra,  Sharada,  Shavian, Sinhala, Sora_Sompeng, Sundanese,
6634       Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,  Tai_Viet,
6635       Takri,  Tamil,  Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
6636       Yi.
6637
6638
6639CHARACTER CLASSES
6640
6641         [...]       positive character class
6642         [^...]      negative character class
6643         [x-y]       range (can be used for hex characters)
6644         [[:xxx:]]   positive POSIX named set
6645         [[:^xxx:]]  negative POSIX named set
6646
6647         alnum       alphanumeric
6648         alpha       alphabetic
6649         ascii       0-127
6650         blank       space or tab
6651         cntrl       control character
6652         digit       decimal digit
6653         graph       printing, excluding space
6654         lower       lower case letter
6655         print       printing, including space
6656         punct       printing, excluding alphanumeric
6657         space       white space
6658         upper       upper case letter
6659         word        same as \w
6660         xdigit      hexadecimal digit
6661
6662       In PCRE, POSIX character set names recognize only ASCII  characters  by
6663       default,  but  some  of them use Unicode properties if PCRE_UCP is set.
6664       You can use \Q...\E inside a character class.
6665
6666
6667QUANTIFIERS
6668
6669         ?           0 or 1, greedy
6670         ?+          0 or 1, possessive
6671         ??          0 or 1, lazy
6672         *           0 or more, greedy
6673         *+          0 or more, possessive
6674         *?          0 or more, lazy
6675         +           1 or more, greedy
6676         ++          1 or more, possessive
6677         +?          1 or more, lazy
6678         {n}         exactly n
6679         {n,m}       at least n, no more than m, greedy
6680         {n,m}+      at least n, no more than m, possessive
6681         {n,m}?      at least n, no more than m, lazy
6682         {n,}        n or more, greedy
6683         {n,}+       n or more, possessive
6684         {n,}?       n or more, lazy
6685
6686
6687ANCHORS AND SIMPLE ASSERTIONS
6688
6689         \b          word boundary
6690         \B          not a word boundary
6691         ^           start of subject
6692                      also after internal newline in multiline mode
6693         \A          start of subject
6694         $           end of subject
6695                      also before newline at end of subject
6696                      also before internal newline in multiline mode
6697         \Z          end of subject
6698                      also before newline at end of subject
6699         \z          end of subject
6700         \G          first matching position in subject
6701
6702
6703MATCH POINT RESET
6704
6705         \K          reset start of match
6706
6707
6708ALTERNATION
6709
6710         expr|expr|expr...
6711
6712
6713CAPTURING
6714
6715         (...)           capturing group
6716         (?<name>...)    named capturing group (Perl)
6717         (?'name'...)    named capturing group (Perl)
6718         (?P<name>...)   named capturing group (Python)
6719         (?:...)         non-capturing group
6720         (?|...)         non-capturing group; reset group numbers for
6721                          capturing groups in each alternative
6722
6723
6724ATOMIC GROUPS
6725
6726         (?>...)         atomic, non-capturing group
6727
6728
6729COMMENT
6730
6731         (?#....)        comment (not nestable)
6732
6733
6734OPTION SETTING
6735
6736         (?i)            caseless
6737         (?J)            allow duplicate names
6738         (?m)            multiline
6739         (?s)            single line (dotall)
6740         (?U)            default ungreedy (lazy)
6741         (?x)            extended (ignore white space)
6742         (?-...)         unset option(s)
6743
6744       The following are recognized only at the start of a  pattern  or  after
6745       one of the newline-setting options with similar syntax:
6746
6747         (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
6748         (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
6749         (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
6750         (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
6751
6752
6753LOOKAHEAD AND LOOKBEHIND ASSERTIONS
6754
6755         (?=...)         positive look ahead
6756         (?!...)         negative look ahead
6757         (?<=...)        positive look behind
6758         (?<!...)        negative look behind
6759
6760       Each top-level branch of a look behind must be of a fixed length.
6761
6762
6763BACKREFERENCES
6764
6765         \n              reference by number (can be ambiguous)
6766         \gn             reference by number
6767         \g{n}           reference by number
6768         \g{-n}          relative reference by number
6769         \k<name>        reference by name (Perl)
6770         \k'name'        reference by name (Perl)
6771         \g{name}        reference by name (Perl)
6772         \k{name}        reference by name (.NET)
6773         (?P=name)       reference by name (Python)
6774
6775
6776SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
6777
6778         (?R)            recurse whole pattern
6779         (?n)            call subpattern by absolute number
6780         (?+n)           call subpattern by relative number
6781         (?-n)           call subpattern by relative number
6782         (?&name)        call subpattern by name (Perl)
6783         (?P>name)       call subpattern by name (Python)
6784         \g<name>        call subpattern by name (Oniguruma)
6785         \g'name'        call subpattern by name (Oniguruma)
6786         \g<n>           call subpattern by absolute number (Oniguruma)
6787         \g'n'           call subpattern by absolute number (Oniguruma)
6788         \g<+n>          call subpattern by relative number (PCRE extension)
6789         \g'+n'          call subpattern by relative number (PCRE extension)
6790         \g<-n>          call subpattern by relative number (PCRE extension)
6791         \g'-n'          call subpattern by relative number (PCRE extension)
6792
6793
6794CONDITIONAL PATTERNS
6795
6796         (?(condition)yes-pattern)
6797         (?(condition)yes-pattern|no-pattern)
6798
6799         (?(n)...        absolute reference condition
6800         (?(+n)...       relative reference condition
6801         (?(-n)...       relative reference condition
6802         (?(<name>)...   named reference condition (Perl)
6803         (?('name')...   named reference condition (Perl)
6804         (?(name)...     named reference condition (PCRE)
6805         (?(R)...        overall recursion condition
6806         (?(Rn)...       specific group recursion condition
6807         (?(R&name)...   specific recursion condition
6808         (?(DEFINE)...   define subpattern for reference
6809         (?(assert)...   assertion condition
6810
6811
6812BACKTRACKING CONTROL
6813
       The following act as soon as they are reached:
6815
6816         (*ACCEPT)       force successful match
6817         (*FAIL)         force backtrack; synonym (*F)
6818         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
6819
6820       The  following  act only when a subsequent match failure causes a back-
6821       track to reach them. They all force a match failure, but they differ in
6822       what happens afterwards. Those that advance the start-of-match point do
6823       so only if the pattern is not anchored.
6824
6825         (*COMMIT)       overall failure, no advance of starting point
6826         (*PRUNE)        advance to next starting character
6827         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
6828         (*SKIP)         advance to current matching position
6829         (*SKIP:NAME)    advance to position corresponding to an earlier
6830                         (*MARK:NAME); if not found, the (*SKIP) is ignored
6831         (*THEN)         local failure, backtrack to next alternation
6832         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
6833
6834
6835NEWLINE CONVENTIONS
6836
6837       These are recognized only at the very start of the pattern or  after  a
6838       (*BSR_...), (*UTF8), (*UTF16) or (*UCP) option.
6839
6840         (*CR)           carriage return only
6841         (*LF)           linefeed only
6842         (*CRLF)         carriage return followed by linefeed
6843         (*ANYCRLF)      all three of the above
6844         (*ANY)          any Unicode newline sequence
6845
6846
6847WHAT \R MATCHES
6848
6849       These  are  recognized only at the very start of the pattern or after a
6850       (*...) option that sets the newline convention or a UTF or UCP mode.
6851
6852         (*BSR_ANYCRLF)  CR, LF, or CRLF
6853         (*BSR_UNICODE)  any Unicode newline sequence
6854
6855
6856CALLOUTS
6857
6858         (?C)      callout
6859         (?Cn)     callout with data n
6860
6861
6862SEE ALSO
6863
6864       pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
6865
6866
6867AUTHOR
6868
6869       Philip Hazel
6870       University Computing Service
6871       Cambridge CB2 3QH, England.
6872
6873
6874REVISION
6875
6876       Last updated: 10 January 2012
6877       Copyright (c) 1997-2012 University of Cambridge.
6878------------------------------------------------------------------------------
6879
6880
6881PCREUNICODE(3)                                                  PCREUNICODE(3)
6882
6883
6884NAME
6885       PCRE - Perl-compatible regular expressions
6886
6887
6888UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT
6889
6890       From Release 8.30, in addition to its previous UTF-8 support, PCRE also
6891       supports UTF-16 by means of a separate  16-bit  library.  This  can  be
6892       built as well as, or instead of, the 8-bit library.
6893
6894
6895UTF-8 SUPPORT
6896
       In order to process UTF-8 strings, you must build PCRE's 8-bit library
       with UTF support, and, in addition, you must call pcre_compile() with
       the PCRE_UTF8 option flag, or the pattern must start with the sequence
       (*UTF8). When either of these is the case, both the pattern and any
       subject strings that are matched against it are treated as UTF-8
       strings instead of strings of 1-byte characters.
6903
6904
6905UTF-16 SUPPORT
6906
       In order to process UTF-16 strings, you must build PCRE's 16-bit
       library with UTF support, and, in addition, you must call
       pcre16_compile() with the PCRE_UTF16 option flag, or the pattern must
       start with the sequence (*UTF16). When either of these is the case,
       both the pattern and any subject strings that are matched against it
       are treated as UTF-16 strings instead of strings of 16-bit characters.
6913
6914
6915UTF SUPPORT OVERHEAD
6916
       If you compile PCRE with UTF support, but do not use it at run time,
       the library will be a bit bigger, but the additional run-time overhead
       is limited to occasionally testing the PCRE_UTF8 or PCRE_UTF16 flag,
       so it should be small.
6921
6922
6923UNICODE PROPERTY SUPPORT
6924
6925       If PCRE is built with Unicode character property support (which implies
6926       UTF  support), the escape sequences \p{..}, \P{..}, and \X can be used.
6927       The available properties that can be tested are limited to the  general
6928       category  properties  such  as  Lu for an upper case letter or Nd for a
6929       decimal number, the Unicode script names such as Arabic or Han, and the
6930       derived  properties Any and L&. A full list is given in the pcrepattern
6931       documentation. Only the short names for properties are  supported.  For
6932       example,  \p{L}  matches a letter. Its Perl synonym, \p{Letter}, is not
6933       supported.  Furthermore, in Perl, many  properties  may  optionally  be
6934       prefixed  by  "Is", for compatibility with Perl 5.6. PCRE does not sup-
6935       port this.
6936
6937   Validity of UTF-8 strings
6938
       When you set the PCRE_UTF8 flag, the byte strings passed as patterns
       and subjects are (by default) checked for validity on entry to the
       relevant functions. The entire string is checked before any other
       processing takes place. From release 7.3 of PCRE, the check is
       according to the rules of RFC 3629, which are themselves derived from
       the Unicode specification. Earlier releases of PCRE followed the rules
       of RFC 2279, which allows the full range of 31-bit values (0 to
       0x7FFFFFFF). The current check allows only values in the range U+0 to
       U+10FFFF, excluding U+D800 to U+DFFF.
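
       The rules above can be sketched as a standalone check. This is not
       PCRE's own code; utf8_is_valid is a hypothetical helper that applies
       the RFC 3629 constraints (at most four bytes per character, no
       overlong forms, nothing above U+10FFFF, no surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API. Returns 1 if the n bytes
   at s are valid UTF-8 according to RFC 3629, and 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        uint32_t cp;     /* code point being assembled */
        uint32_t min;    /* smallest value allowed for this length */
        size_t extra;    /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80) { i++; continue; }              /* 1-byte character */
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; min = 0x10000; }
        else return 0;   /* stray continuation byte, or an RFC 2279
                            5-byte or 6-byte lead byte */

        if (n - i <= extra) return 0;                 /* truncated character */

        for (k = 1; k <= extra; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;         /* not 10xxxxxx */
            cp = (cp << 6) | (uint32_t)(t & 0x3F);
        }

        if (cp < min) return 0;                       /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate area */

        i += extra + 1;
    }
    return 1;
}
```

       For example, the three bytes 0xED 0xA0 0x80 decode to U+D800 and are
       rejected, although they are well formed under the older RFC 2279.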
6948
6949       The excluded code points are the "Surrogate Area" of Unicode. They  are
6950       reserved  for  use  by  UTF-16,  where they are used in pairs to encode
6951       codepoints with values greater than 0xFFFF. The code  points  that  are
6952       encoded by UTF-16 pairs are available independently in the UTF-8 encod-
6953       ing. (In other words, the whole surrogate thing is a fudge  for  UTF-16
6954       which unfortunately messes up UTF-8.)
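
       To make the role of the surrogate area concrete, here is a minimal
       sketch of how UTF-16 represents a code point above 0xFFFF as a pair
       of values taken from that area. The function name utf16_encode_pair
       is hypothetical and is not part of PCRE.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API. Splits a code point in
   the range U+10000 to U+10FFFF into a UTF-16 surrogate pair. Returns 1
   on success, or 0 if cp is outside that range. */
int utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;

    cp -= 0x10000;                              /* leaves a 20-bit value */
    *high = (uint16_t)(0xD800 | (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* bottom 10 bits */
    return 1;
}
```

       For U+1F600 this yields the pair 0xD83D, 0xDE00. In UTF-8 the same
       code point is encoded directly in four bytes, which is why the
       values U+D800 to U+DFFF never need to appear in a UTF-8 string.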
6955
6956       If an invalid UTF-8 string is passed to PCRE, an error return is given.
6957       At compile time, the only additional information is the offset  to  the
6958       first byte of the failing character. The run-time functions pcre_exec()
6959       and pcre_dfa_exec() also pass back this information, as well as a  more
6960       detailed  reason  code if the caller has provided memory in which to do
6961       this.
6962
6963       In some situations, you may already know that your strings  are  valid,
6964       and  therefore  want  to  skip these checks in order to improve perfor-
6965       mance, for example in the case of a long subject string that  is  being
6966       scanned   repeatedly   with   different   patterns.   If  you  set  the
6967       PCRE_NO_UTF8_CHECK flag at compile time or at run  time,  PCRE  assumes
6968       that  the  pattern  or subject it is given (respectively) contains only
6969       valid UTF-8 codes. In this case, it does not diagnose an invalid  UTF-8
6970       string.
6971
6972       If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
6973       what happens depends on why the string is invalid. If the  string  con-
6974       forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
6975       string of characters in the range 0 to  0x7FFFFFFF  by  pcre_dfa_exec()
6976       and  the interpreted version of pcre_exec(). In other words, apart from
6977       the initial validity test, these functions (when in UTF-8 mode)  handle
6978       strings  according  to the more liberal rules of RFC 2279. However, the
6979       just-in-time (JIT) optimization for pcre_exec() supports only RFC 3629.
6980       If  you are using JIT optimization, or if the string does not even con-
6981       form to RFC 2279, the result is undefined. Your program may crash.
6982
6983       If you want to process strings  of  values  in  the  full  range  0  to
6984       0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can
6985       set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
6986       this  situation,  you  will  have to apply your own validity check, and
6987       avoid the use of JIT optimization.
6988
6989   Validity of UTF-16 strings
6990
6991       When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
6992       are passed as patterns and subjects are (by default) checked for valid-
6993       ity on entry to the relevant functions. Values other than those in  the
6994       surrogate range U+D800 to U+DFFF are independent code points. Values in
6995       the surrogate range must be used in pairs in the correct manner.
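
       A minimal sketch of this pairing rule follows. It is not PCRE's own
       check; utf16_is_valid is a hypothetical helper, and "the correct
       manner" is taken to mean a high surrogate (0xD800 to 0xDBFF)
       immediately followed by a low surrogate (0xDC00 to 0xDFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API. Returns 1 if the n
   16-bit data units at s form a valid UTF-16 string, and 0 otherwise. */
int utf16_is_valid(const uint16_t *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        uint16_t u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow at once. */
            if (i + 1 >= n)
                return 0;
            if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 0;
            i++;                      /* skip the low surrogate */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return 0;                 /* low surrogate with no partner */
        }
        /* Any other value is an independent code point. */
    }
    return 1;
}
```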
6996
6997       If an invalid UTF-16 string is passed  to  PCRE,  an  error  return  is
6998       given.  At  compile time, the only additional information is the offset
6999       to the first data unit of the failing character. The run-time functions
7000       pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
7001       well as a more detailed reason code if the caller has  provided  memory
7002       in which to do this.
7003
7004       In  some  situations, you may already know that your strings are valid,
7005       and therefore want to skip these checks in  order  to  improve  perfor-
7006       mance.  If  you  set the PCRE_NO_UTF16_CHECK flag at compile time or at
7007       run time, PCRE assumes that the pattern or subject it is given (respec-
7008       tively) contains only valid UTF-16 sequences. In this case, it does not
7009       diagnose an invalid UTF-16 string.
7010
7011   General comments about UTF modes
7012
7013       1. Codepoints less than 256  can  be  specified  by  either  braced  or
7014       unbraced  hexadecimal  escape  sequences (for example, \x{b3} or \xb3).
7015       Larger values have to use braced sequences.
7016
7017       2. Octal numbers up to \777 are recognized, and  in  UTF-8  mode,  they
7018       match two-byte characters for values greater than \177.
7019
7020       3. Repeat quantifiers apply to complete UTF characters, not to individ-
7021       ual data units, for example: \x{100}{3}.
7022
7023       4. The dot metacharacter matches one UTF character instead of a  single
7024       data unit.
7025
7026       5.  The  escape sequence \C can be used to match a single byte in UTF-8
7027       mode, or a single 16-bit data unit in UTF-16 mode, but its use can lead
7028       to some strange effects because it breaks up multi-unit characters (see
7029       the description of \C in the pcrepattern documentation). The use of  \C
7030       is    not    supported    in    the   alternative   matching   function
7031       pcre[16]_dfa_exec(), nor is it supported in UTF mode by the  JIT  opti-
7032       mization of pcre[16]_exec(). If JIT optimization is requested for a UTF
7033       pattern that contains \C, it will not succeed, and so the matching will
7034       be carried out by the normal interpretive function.
7035
7036       6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
7037       test characters of any code value, but, by default, the characters that
7038       PCRE  recognizes  as digits, spaces, or word characters remain the same
7039       set as in non-UTF mode, all with values less  than  256.  This  remains
7040       true  even  when  PCRE  is  built  to include Unicode property support,
7041       because to do otherwise would slow down PCRE in many common cases. Note
7042       in  particular that this applies to \b and \B, because they are defined
7043       in terms of \w and \W. If you really want to test for a wider sense of,
7044       say,  "digit",  you  can  use  explicit  Unicode property tests such as
7045       \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
7046       character  escapes  work is changed so that Unicode properties are used
7047       to determine which characters match. There are more details in the sec-
7048       tion on generic character types in the pcrepattern documentation.
7049
7050       7.  Similarly,  characters that match the POSIX named character classes
7051       are all low-valued characters, unless the PCRE_UCP option is set.
7052
7053       8. However, the horizontal and vertical white  space  matching  escapes
7054       (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
7055       whether or not PCRE_UCP is set.
7056
7057       9. Case-insensitive matching applies only to  characters  whose  values
7058       are  less than 128, unless PCRE is built with Unicode property support.
7059       Even when Unicode property support is available, PCRE  still  uses  its
7060       own  character  tables when checking the case of low-valued characters,
7061       so as not to degrade performance.  The Unicode property information  is
7062       used only for characters with higher values. Furthermore, PCRE supports
7063       case-insensitive matching only  when  there  is  a  one-to-one  mapping
7064       between  a letter's cases. There are a small number of many-to-one map-
7065       pings in Unicode; these are not supported by PCRE.
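
       Points 2 to 4 above depend on how many data units a character
       occupies. As a sketch of the UTF-8 case, the hypothetical helper
       below (not part of PCRE) encodes a code point according to RFC 3629
       and returns the number of bytes used.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API. Encodes cp as UTF-8 into
   out and returns the number of bytes written, or 0 if cp is a surrogate
   or is greater than 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogate area */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

       The value \777 (0x1FF) encodes as the two bytes 0xC7 0xBF, and
       \x{100} as 0xC4 0x80, so \x{100}{3} matches six bytes in a UTF-8
       subject, whereas \177 still occupies a single byte.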
7066
7067
7068AUTHOR
7069
7070       Philip Hazel
7071       University Computing Service
7072       Cambridge CB2 3QH, England.
7073
7074
7075REVISION
7076
7077       Last updated: 14 April 2012
7078       Copyright (c) 1997-2012 University of Cambridge.
7079------------------------------------------------------------------------------
7080
7081
7082PCREJIT(3)                                                          PCREJIT(3)
7083
7084
7085NAME
7086       PCRE - Perl-compatible regular expressions
7087
7088
7089PCRE JUST-IN-TIME COMPILER SUPPORT
7090
7091       Just-in-time  compiling  is a heavyweight optimization that can greatly
7092       speed up pattern matching. However, it comes at the cost of extra  pro-
7093       cessing before the match is performed. Therefore, it is of most benefit
7094       when the same pattern is going to be matched many times. This does  not
7095       necessarily  mean  many calls of a matching function; if the pattern is
7096       not anchored, matching attempts may take place many  times  at  various
7097       positions  in  the  subject, even for a single call.  Therefore, if the
7098       subject string is very long, it may still pay to use  JIT  for  one-off
7099       matches.
7100
7101       JIT  support  applies  only to the traditional Perl-compatible matching
7102       function.  It does not apply when the DFA matching  function  is  being
7103       used. The code for this support was written by Zoltan Herczeg.
7104
7105
8-BIT AND 16-BIT SUPPORT
7107
7108       JIT  support is available for both the 8-bit and 16-bit PCRE libraries.
7109       To  keep  this  documentation  simple,  only  the  8-bit  interface  is
7110       described in what follows. If you are using the 16-bit library, substi-
7111       tute  the  16-bit  functions  and  16-bit  structures   (for   example,
7112       pcre16_jit_stack instead of pcre_jit_stack).
7113
7114
7115AVAILABILITY OF JIT SUPPORT
7116
7117       JIT  support  is  an  optional  feature of PCRE. The "configure" option
7118       --enable-jit (or equivalent CMake option) must  be  set  when  PCRE  is
7119       built  if  you want to use JIT. The support is limited to the following
7120       hardware platforms:
7121
7122         ARM v5, v7, and Thumb2
7123         Intel x86 32-bit and 64-bit
7124         MIPS 32-bit
7125         Power PC 32-bit and 64-bit
7126
7127       If --enable-jit is set on an unsupported platform, compilation fails.
7128
7129       A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-
7130       port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT
7131       option. The result is 1 when JIT is available, and  0  otherwise.  How-
7132       ever, a simple program does not need to check this in order to use JIT.
7133       The API is implemented in a way that falls  back  to  the  interpretive
7134       code if JIT is not available.
7135
7136       If  your program may sometimes be linked with versions of PCRE that are
7137       older than 8.20, but you want to use JIT when it is available, you  can
7138       test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
7139       macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
7140
7141
7142SIMPLE USE OF JIT
7143
7144       You have to do two things to make use of the JIT support  in  the  sim-
7145       plest way:
7146
7147         (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
7148             each compiled pattern, and pass the resulting pcre_extra block to
7149             pcre_exec().
7150
7151         (2) Use pcre_free_study() to free the pcre_extra block when it is
7152             no longer needed, instead of just freeing it yourself. This
7153             ensures that any JIT data is also freed.
7154
7155       For  a  program  that may be linked with pre-8.20 versions of PCRE, you
7156       can insert
7157
7158         #ifndef PCRE_STUDY_JIT_COMPILE
7159         #define PCRE_STUDY_JIT_COMPILE 0
7160         #endif
7161
7162       so that no option is passed to pcre_study(),  and  then  use  something
7163       like this to free the study data:
7164
7165         #ifdef PCRE_CONFIG_JIT
7166             pcre_free_study(study_ptr);
7167         #else
7168             pcre_free(study_ptr);
7169         #endif
7170
7171       PCRE_STUDY_JIT_COMPILE  requests  the JIT compiler to generate code for
7172       complete matches.  If  you  want  to  run  partial  matches  using  the
7173       PCRE_PARTIAL_HARD  or  PCRE_PARTIAL_SOFT  options  of  pcre_exec(), you
7174       should set one or both of the following  options  in  addition  to,  or
7175       instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():
7176
7177         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
7178         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
7179
7180       The  JIT  compiler  generates  different optimized code for each of the
7181       three modes (normal, soft partial, hard partial). When  pcre_exec()  is
7182       called,  the appropriate code is run if it is available. Otherwise, the
7183       pattern is matched using interpretive code.
7184
7185       In some circumstances you may need to call additional functions.  These
7186       are  described  in  the  section  entitled  "Controlling the JIT stack"
7187       below.
7188
       If JIT support is not available, PCRE_STUDY_JIT_COMPILE etc. are
       ignored, and no JIT data is created. Otherwise, the compiled pattern is
       passed to the JIT compiler, which turns it into machine code. When
       pcre_exec() is passed a pcre_extra block containing a pointer to JIT
       code of the appropriate mode (normal or hard/soft partial), it obeys
       that code instead of running the interpreter. The result is identical,
       but the compiled JIT code runs much faster than the normal interpretive
       code.
7197
7198       There  are some pcre_exec() options that are not supported for JIT exe-
7199       cution. There are also some  pattern  items  that  JIT  cannot  handle.
7200       Details  are  given below. In both cases, execution automatically falls
7201       back to the interpretive code. If you want  to  know  whether  JIT  was
7202       actually  used  for  a  particular  match, you should arrange for a JIT
7203       callback function to be set up as described  in  the  section  entitled
7204       "Controlling  the JIT stack" below, even if you do not need to supply a
7205       non-default JIT stack. Such a callback function is called whenever  JIT
7206       code  is about to be obeyed. If the execution options are not right for
7207       JIT execution, the callback function is not obeyed.
7208
7209       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
7210       ated.  You  can find out if JIT execution is available after studying a
7211       pattern by calling pcre_fullinfo() with  the  PCRE_INFO_JIT  option.  A
7212       result  of  1  means that JIT compilation was successful. A result of 0
7213       means that JIT support is not available, or the pattern was not studied
7214       with  PCRE_STUDY_JIT_COMPILE  etc., or the JIT compiler was not able to
7215       handle the pattern.
7216
7217       Once a pattern has been studied, with or without JIT, it can be used as
7218       many times as you like for matching different subject strings.
7219
7220
7221UNSUPPORTED OPTIONS AND PATTERN ITEMS
7222
7223       The  only  pcre_exec() options that are supported for JIT execution are
7224       PCRE_NO_UTF8_CHECK,  PCRE_NO_UTF16_CHECK,   PCRE_NOTBOL,   PCRE_NOTEOL,
7225       PCRE_NOTEMPTY,  PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PAR-
7226       TIAL_SOFT.
7227
7228       The unsupported pattern items are:
7229
7230         \C             match a single byte; not supported in UTF-8 mode
7231         (?Cn)          callouts
7232         (*PRUNE)       )
7233         (*SKIP)        ) backtracking control verbs
7234         (*THEN)        )
7235
7236       Support for some of these may be added in future.
7237
7238
7239RETURN VALUES FROM JIT EXECUTION
7240
7241       When a pattern is matched using JIT execution, the  return  values  are
7242       the  same as those given by the interpretive pcre_exec() code, with the
7243       addition of one new error code: PCRE_ERROR_JIT_STACKLIMIT.  This  means
7244       that  the memory used for the JIT stack was insufficient. See "Control-
7245       ling the JIT stack" below for a discussion of JIT stack usage. For com-
7246       patibility  with  the  interpretive pcre_exec() code, no more than two-
7247       thirds of the ovector argument is used for passing back  captured  sub-
7248       strings.
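
       The two-thirds rule fixes how large the ovector must be. As a sketch
       (the helper names are hypothetical, and the layout assumed is the
       one documented for pcre_exec(): one pair of offsets per returned
       substring, with the final third of the vector used as workspace):

```c
/* Hypothetical helpers, not part of the PCRE API. */

/* Number of ints to allocate so that the whole match (group 0) and
   capture_count capturing groups can all be passed back. */
int ovector_size_for(int capture_count)
{
    int pairs = capture_count + 1;    /* group 0 is the whole match */
    return pairs * 3;                 /* two ints per pair, plus one third
                                         of the vector as workspace */
}

/* Largest number of offset pairs that can be passed back in an ovector
   of ovecsize ints. */
int max_pairs_returned(int ovecsize)
{
    return ovecsize / 3;
}
```

       For example, a pattern with two capturing groups needs an ovector of
       at least 9 ints, and an ovector of 30 ints can pass back at most 10
       pairs.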
7249
       The error code PCRE_ERROR_MATCHLIMIT is returned by the JIT code if
       searching a very large pattern tree goes on for too long, just as it
       is when JIT is not used, but the details of exactly what is counted are
       not the same. The PCRE_ERROR_RECURSIONLIMIT error code is never
       returned by JIT execution.
7255
7256
7257SAVING AND RESTORING COMPILED PATTERNS
7258
7259       The  code  that  is  generated by the JIT compiler is architecture-spe-
7260       cific, and is also position dependent. For those reasons it  cannot  be
7261       saved  (in a file or database) and restored later like the bytecode and
7262       other data of a compiled pattern. Saving and  restoring  compiled  pat-
7263       terns  is not something many people do. More detail about this facility
7264       is given in the pcreprecompile documentation. It should be possible  to
7265       run  pcre_study() on a saved and restored pattern, and thereby recreate
7266       the JIT data, but because JIT compilation uses  significant  resources,
7267       it  is  probably  not worth doing this; you might as well recompile the
7268       original pattern.
7269
7270
7271CONTROLLING THE JIT STACK
7272
7273       When the compiled JIT code runs, it needs a block of memory to use as a
7274       stack.   By  default,  it  uses 32K on the machine stack. However, some
7275       large  or  complicated  patterns  need  more  than  this.   The   error
7276       PCRE_ERROR_JIT_STACKLIMIT  is  given  when  there  is not enough stack.
7277       Three functions are provided for managing blocks of memory for  use  as
7278       JIT  stacks. There is further discussion about the use of JIT stacks in
7279       the section entitled "JIT stack FAQ" below.
7280
7281       The pcre_jit_stack_alloc() function creates a JIT stack. Its  arguments
7282       are  a starting size and a maximum size, and it returns a pointer to an
7283       opaque structure of type pcre_jit_stack, or NULL if there is an  error.
7284       The  pcre_jit_stack_free() function can be used to free a stack that is
7285       no longer needed. (For the technically minded:  the  address  space  is
7286       allocated by mmap or VirtualAlloc.)
7287
7288       JIT  uses far less memory for recursion than the interpretive code, and
7289       a maximum stack size of 512K to 1M should be more than enough  for  any
7290       pattern.
7291
7292       The  pcre_assign_jit_stack()  function  specifies  which stack JIT code
7293       should use. Its arguments are as follows:
7294
7295         pcre_extra         *extra
7296         pcre_jit_callback  callback
7297         void               *data
7298
7299       The extra argument must be  the  result  of  studying  a  pattern  with
7300       PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
       other two arguments:
7302
7303         (1) If callback is NULL and data is NULL, an internal 32K block
7304             on the machine stack is used.
7305
7306         (2) If callback is NULL and data is not NULL, data must be
7307             a valid JIT stack, the result of calling pcre_jit_stack_alloc().
7308
7309         (3) If callback is not NULL, it must point to a function that is
7310             called with data as an argument at the start of matching, in
7311             order to set up a JIT stack. If the return from the callback
7312             function is NULL, the internal 32K stack is used; otherwise the
7313             return value must be a valid JIT stack, the result of calling
7314             pcre_jit_stack_alloc().
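
       The  three cases can be modelled in a few lines of plain C. This is a
       sketch of the selection rule described above, not PCRE's own code: the
       names jit_stack, jit_callback, select_jit_stack() and pass_through_cb()
       are hypothetical stand-ins, chosen so that the fragment compiles without
       the library. A NULL result stands for "use the internal 32K block".

```c
#include <stddef.h>

/* Hypothetical stand-in for the opaque pcre_jit_stack type. */
typedef struct jit_stack { int placeholder; } jit_stack;

/* Same shape as pcre_jit_callback: takes the data pointer, returns a stack. */
typedef jit_stack *(*jit_callback)(void *);

/* Returns the stack one match would run on; NULL means the internal block. */
jit_stack *select_jit_stack(jit_callback callback, void *data)
{
    if (callback != NULL)
        return callback(data);   /* case (3): the callback decides, per match */
    return (jit_stack *)data;    /* case (2) if non-NULL, case (1) if NULL */
}

/* Example callback (assumed for illustration): hands the data pointer back. */
jit_stack *pass_through_cb(void *data)
{
    return (jit_stack *)data;
}
```

       Because the callback is consulted at the start of each match, case (3)
       is the one that allows the choice of stack to vary from call to call.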
7315
       A callback function is obeyed whenever JIT code is about to be run;  it
       is  not  obeyed  when  pcre_exec()  is  called  with  options  that are
       incompatible with JIT execution. A callback function can  therefore  be
       used  to determine whether a match operation was executed by JIT or by
       the interpreter.
7321
7322       You may safely use the same JIT stack for more than one pattern (either
7323       by  assigning directly or by callback), as long as the patterns are all
       matched sequentially in the same thread. In a multithreaded application,
7325       if  you  do not specify a JIT stack, or if you assign or pass back NULL
7326       from a callback, that is thread-safe, because each thread has  its  own
7327       machine  stack.  However,  if  you  assign  or pass back a non-NULL JIT
7328       stack, this must be a different stack  for  each  thread  so  that  the
7329       application is thread-safe.
7330
7331       Strictly  speaking,  even more is allowed. You can assign the same non-
7332       NULL stack to any number of patterns as long as they are not  used  for
7333       matching  by  multiple  threads  at the same time. For example, you can
7334       assign the same stack to all compiled patterns, and use a global  mutex
7335       in  the callback to wait until the stack is available for use. However,
7336       this is an inefficient solution, and not recommended.
7337
7338       This is a suggestion for how a multithreaded program that needs to  set
7339       up non-default JIT stacks might operate:
7340
         During thread initialization
7342           thread_local_var = pcre_jit_stack_alloc(...)
7343
7344         During thread exit
7345           pcre_jit_stack_free(thread_local_var)
7346
7347         Use a one-line callback function
7348           return thread_local_var
7349
7350       All  the  functions  described in this section do nothing if JIT is not
7351       available, and pcre_assign_jit_stack() does nothing  unless  the  extra
7352       argument  is  non-NULL  and  points  to  a pcre_extra block that is the
7353       result of a successful study with PCRE_STUDY_JIT_COMPILE etc.
7354
7355
7356JIT STACK FAQ
7357
7358       (1) Why do we need JIT stacks?
7359
       PCRE (and JIT) is a recursive, depth-first engine, so it needs a  stack
       where  the local data of the current node is pushed before checking its
       child nodes. Allocating real machine stack space is difficult  on  some
       platforms.  For  example, on PowerPC the stack chain needs to be updated
       every time the stack is extended. Although this is possible,  the  time
       spent on these updates decreases performance. So we do the recursion in
       memory.
7366
7367       (2) Why don't we simply allocate blocks of memory with malloc()?
7368
7369       Modern  operating  systems  have  a  nice  feature: they can reserve an
7370       address space instead of allocating memory. We can safely allocate mem-
7371       ory  pages  inside  this address space, so the stack could grow without
7372       moving memory data (this is important because of pointers). Thus we can
7373       allocate  1M  address space, and use only a single memory page (usually
7374       4K) if that is enough. However, we can still grow up to 1M  anytime  if
7375       needed.
7376
7377       (3) Who "owns" a JIT stack?
7378
       The owner of the stack is the user program, not the JIT studied pattern
       or anything else. The user program must ensure that while a stack is in
       use by pcre_exec() (that is, while it is assigned to the  pattern  that
       is  currently  running),  it  is not used by any other thread (to avoid
       overwriting the same memory area). The best practice for  multithreaded
       programs  is  to allocate a stack for each thread, and return this stack
       through the JIT callback function.
7386
7387       (4) When should a JIT stack be freed?
7388
       You can free a JIT stack at any time, as long as it will not be used by
       pcre_exec() again. When you assign the  stack  to  a  pattern,  only  a
       pointer  is set. There is no reference counting or any other magic. You
       can free the patterns and stacks in any order, at any time. Just do not
       call  pcre_exec() with a pattern pointing to an already freed stack, as
       that is likely to cause a segmentation fault. (Also,  do  not  free  a
       stack  that is currently being used by pcre_exec() in another thread.)
       You can also replace the stack for a pattern at any time. You can  even
       free the previous stack before assigning a replacement.
7398
7399       (5)  Should  I  allocate/free  a  stack every time before/after calling
7400       pcre_exec()?
7401
       No, because this is too costly in  terms  of  resources.  However,  you
       could  implement a scheme that releases the stack if it has not been used
       for, say, two minutes. The JIT callback can help to achieve this without
       keeping a list of the currently JIT studied patterns.
7406
7407       (6)  OK, the stack is for long term memory allocation. But what happens
7408       if a pattern causes stack overflow with a stack of 1M? Is that 1M  kept
7409       until the stack is freed?
7410
       Especially on embedded systems, it might be a good idea to release memory
       sometimes without freeing the stack. There is no API for this  at  the
       moment.  A function that returns the amount of memory currently allocated
       for a stack, and another that releases memory (shrinking the  stack),
       would probably be a good idea if someone needs this.
7416
7417       (7) This is too much of a headache. Isn't there any better solution for
7418       JIT stack handling?
7419
7420       No, thanks to Windows. If POSIX threads were used everywhere, we  could
7421       throw out this complicated API.
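
       Item (2) above relies on reserving address space first  and  committing
       pages  later.  On a POSIX system this idea can be sketched with mmap()
       and mprotect(). The struct region type and the region_*() functions are
       hypothetical names, and PCRE's own allocator may well be organised
       differently; the fragment only illustrates why  growth  does  not  move
       existing data.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct region {
    unsigned char *base;   /* start of the reserved address range */
    size_t reserved;       /* bytes of address space reserved */
    size_t committed;      /* bytes currently readable and writable */
};

/* Reserve max_size bytes of address space without making them usable yet. */
int region_reserve(struct region *r, size_t max_size)
{
    void *p;
    assert(r != NULL);
    p = mmap(NULL, max_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    r->base = p;
    r->reserved = max_size;
    r->committed = 0;
    return 0;
}

/* Make the first new_size bytes usable; the base address never changes. */
int region_grow(struct region *r, size_t new_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (new_size + page - 1) / page * page;
    if (rounded > r->reserved)
        return -1;                       /* would exceed the reservation */
    if (rounded <= r->committed)
        return 0;                        /* already large enough */
    if (mprotect(r->base, rounded, PROT_READ | PROT_WRITE) != 0)
        return -1;
    r->committed = rounded;
    return 0;
}

/* Give the whole range back to the operating system. */
void region_release(struct region *r)
{
    munmap(r->base, r->reserved);
    r->base = NULL;
    r->reserved = r->committed = 0;
}
```

       Pointers into the region remain valid across  region_grow(),  which  is
       the property that item (2) describes as important.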
7422
7423
7424EXAMPLE CODE
7425
7426       This  is  a  single-threaded example that specifies a JIT stack without
7427       using a callback.
7428
         int rc;
         int ovector[30];
         int erroffset;
         const char *error;
         pcre *re;
         pcre_extra *extra;
         pcre_jit_stack *jit_stack;

         /* pattern, subject, and length are assumed to be defined elsewhere */
7435         re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
7436         /* Check for errors */
7437         extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
7438         jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
7439         /* Check for error (NULL) */
7440         pcre_assign_jit_stack(extra, NULL, jit_stack);
7441         rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
7442         /* Check results */
7443         pcre_free(re);
7444         pcre_free_study(extra);
7445         pcre_jit_stack_free(jit_stack);
7446
7447
7448SEE ALSO
7449
7450       pcreapi(3)
7451
7452
7453AUTHOR
7454
7455       Philip Hazel (FAQ by Zoltan Herczeg)
7456       University Computing Service
7457       Cambridge CB2 3QH, England.
7458
7459
7460REVISION
7461
7462       Last updated: 04 May 2012
7463       Copyright (c) 1997-2012 University of Cambridge.
7464------------------------------------------------------------------------------
7465
7466
7467PCREPARTIAL(3)                                                  PCREPARTIAL(3)
7468
7469
7470NAME
7471       PCRE - Perl-compatible regular expressions
7472
7473
7474PARTIAL MATCHING IN PCRE
7475
7476       In normal use of PCRE, if the subject string that is passed to a match-
7477       ing function matches as far as it goes, but is too short to  match  the
7478       entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
7479       where it might be helpful to distinguish this case from other cases  in
7480       which there is no match.
7481
7482       Consider, for example, an application where a human is required to type
7483       in data for a field with specific formatting requirements.  An  example
7484       might be a date in the form ddmmmyy, defined by this pattern:
7485
7486         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
7487
7488       If the application sees the user's keystrokes one by one, and can check
7489       that what has been typed so far is potentially valid,  it  is  able  to
7490       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
7491       reflecting the character that has been typed, for example. This immedi-
7492       ate  feedback is likely to be a better user interface than a check that
7493       is delayed until the entire string has been entered.  Partial  matching
7494       can  also be useful when the subject string is very long and is not all
7495       available at once.
7496
7497       PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
7498       PCRE_PARTIAL_HARD  options,  which  can  be set when calling any of the
7499       matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
7500       onym  for  PCRE_PARTIAL_SOFT.  The essential difference between the two
7501       options is whether or not a partial match is preferred to  an  alterna-
7502       tive complete match, though the details differ between the two types of
7503       matching function. If both options  are  set,  PCRE_PARTIAL_HARD  takes
7504       precedence.
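
       In the keystroke scenario, the application mainly needs  to  sort  the
       value returned by the matching function into a few cases. The following
       is a minimal sketch: enum input_state and classify_input()  are  hypo-
       thetical names, and the two #define lines repeat values from pcre.h only
       so that the fragment compiles without the library (a real program would
       include pcre.h instead).

```c
/* A real program gets these from <pcre.h>; they are repeated here only so
   that the sketch compiles without the library installed. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH  (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL  (-12)
#endif

enum input_state {
    INPUT_COMPLETE,     /* the whole pattern matched */
    INPUT_INCOMPLETE,   /* a partial match: more characters are needed */
    INPUT_INVALID,      /* neither a match nor a partial match */
    INPUT_ERROR         /* some other negative error code */
};

/* rc is the value returned by a matching call made with a partial option. */
enum input_state classify_input(int rc)
{
    if (rc >= 0)
        return INPUT_COMPLETE;
    if (rc == PCRE_ERROR_PARTIAL)
        return INPUT_INCOMPLETE;
    if (rc == PCRE_ERROR_NOMATCH)
        return INPUT_INVALID;
    return INPUT_ERROR;
}
```

       An  application  of  the  kind described above might accept the latest
       keystroke for INPUT_COMPLETE and INPUT_INCOMPLETE, and  reject  it  for
       INPUT_INVALID.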
7505
7506       If  you  want to use partial matching with just-in-time optimized code,
7507       you must call pcre_study() or pcre16_study() with one or both of  these
7508       options:
7509
7510         PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
7511         PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
7512
7513       PCRE_STUDY_JIT_COMPILE  should also be set if you are going to run non-
7514       partial matches on the same pattern. If the appropriate JIT study  mode
7515       has not been set for a match, the interpretive matching code is used.
7516
7517       Setting a partial matching option disables two of PCRE's standard opti-
7518       mizations. PCRE remembers the last literal data unit in a pattern,  and
7519       abandons  matching  immediately  if  it  is  not present in the subject
7520       string. This optimization cannot be used  for  a  subject  string  that
7521       might  match only partially. If the pattern was studied, PCRE knows the
7522       minimum length of a matching string, and does not  bother  to  run  the
7523       matching  function  on  shorter strings. This optimization is also dis-
7524       abled for partial matching.
7525
7526
7527PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()
7528
7529       A partial match occurs during a call to  pcre_exec()  or  pcre16_exec()
7530       when  the end of the subject string is reached successfully, but match-
7531       ing cannot continue because more characters  are  needed.  However,  at
7532       least one character in the subject must have been inspected. This char-
7533       acter need not form part of the final matched string; lookbehind asser-
7534       tions  and the \K escape sequence provide ways of inspecting characters
7535       before the start of a matched substring. The requirement for inspecting
7536       at  least  one  character  exists because an empty string can always be
7537       matched; without such a restriction there would  always  be  a  partial
7538       match of an empty string at the end of the subject.
7539
7540       If  there  are  at least two slots in the offsets vector when a partial
7541       match is returned, the first slot is set to the offset of the  earliest
7542       character that was inspected. For convenience, the second offset points
7543       to the end of the subject so that a substring can easily be identified.
7544
7545       For the majority of patterns, the first offset identifies the start  of
7546       the  partially matched string. However, for patterns that contain look-
7547       behind assertions, or \K, or begin with \b or  \B,  earlier  characters
7548       have been inspected while carrying out the match. For example:
7549
7550         /(?<=abc)123/
7551
7552       This pattern matches "123", but only if it is preceded by "abc". If the
7553       subject string is "xyzabc12", the offsets after a partial match are for
7554       the  substring  "abc12",  because  all  these  characters are needed if
7555       another match is tried with extra characters added to the subject.
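
       Given the two offsets, the text that has to be kept can be  copied  out
       directly.  A  minimal sketch, assuming an 8-bit subject and a caller-
       supplied buffer; copy_partial() is a hypothetical helper, not  part  of
       the PCRE API.

```c
#include <assert.h>
#include <string.h>

/* Copies subject[ovector[0] .. ovector[1]) into out as a C string.
   Returns the number of bytes copied, or -1 if out is too small or the
   offsets are not sensible. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t out_size)
{
    int start, end;
    assert(subject != NULL && ovector != NULL && out != NULL);
    start = ovector[0];   /* earliest character that was inspected */
    end = ovector[1];     /* end of the subject */
    if (start < 0 || end < start || (size_t)(end - start) >= out_size)
        return -1;
    memcpy(out, subject + start, (size_t)(end - start));
    out[end - start] = '\0';
    return end - start;
}
```

       With the subject "xyzabc12" and the offsets 3 and 8, as in the  lookbe-
       hind example above, the buffer receives "abc12".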
7556
       What happens when a partial match is identified depends on which of the
       two partial matching options is set.
7559
7560   PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre16_exec()
7561
7562       If  PCRE_PARTIAL_SOFT  is set when pcre_exec() or pcre16_exec() identi-
7563       fies a partial match, the partial match  is  remembered,  but  matching
7564       continues  as  normal, and other alternatives in the pattern are tried.
7565       If no complete match  can  be  found,  PCRE_ERROR_PARTIAL  is  returned
7566       instead of PCRE_ERROR_NOMATCH.
7567
7568       This  option  is "soft" because it prefers a complete match over a par-
7569       tial match.  All the various matching items in a pattern behave  as  if
7570       the  subject string is potentially complete. For example, \z, \Z, and $
7571       match at the end of the subject, as normal, and for \b and \B  the  end
7572       of the subject is treated as a non-alphanumeric.
7573
7574       If  there  is more than one partial match, the first one that was found
7575       provides the data that is returned. Consider this pattern:
7576
7577         /123\w+X|dogY/
7578
7579       If this is matched against the subject string "abc123dog", both  alter-
7580       natives  fail  to  match,  but the end of the subject is reached during
7581       matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set  to  3
7582       and  9, identifying "123dog" as the first partial match that was found.
7583       (In this example, there are two partial matches, because "dog"  on  its
7584       own partially matches the second alternative.)
7585
7586   PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre16_exec()
7587
7588       If   PCRE_PARTIAL_HARD   is   set  for  pcre_exec()  or  pcre16_exec(),
7589       PCRE_ERROR_PARTIAL is returned as soon as a  partial  match  is  found,
7590       without continuing to search for possible complete matches. This option
7591       is "hard" because it prefers an earlier partial match over a later com-
7592       plete  match.  For  this reason, the assumption is made that the end of
7593       the supplied subject string may not be the true end  of  the  available
7594       data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
7595       subject, the result is PCRE_ERROR_PARTIAL, provided that at  least  one
7596       character in the subject has been inspected.
7597
7598       Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
7599       strings are checked for validity. Normally, an invalid sequence  causes
7600       the  error  PCRE_ERROR_BADUTF8  or PCRE_ERROR_BADUTF16. However, in the
7601       special case of a truncated  character  at  the  end  of  the  subject,
7602       PCRE_ERROR_SHORTUTF8   or   PCRE_ERROR_SHORTUTF16   is   returned  when
7603       PCRE_PARTIAL_HARD is set.
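
       When  a  "short" error is returned, a caller may want to hold back the
       incomplete character until more data arrives. The following sketch  is
       independent  of PCRE and does not reproduce its validity checks; it only
       works out how many bytes at the end of a UTF-8 buffer belong to a trun-
       cated character. The name utf8_truncated_tail() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of bytes at the end of s[0..n) that start a multi-byte
   UTF-8 character whose remaining bytes are missing, or 0 if the buffer
   does not end in a truncated character. */
size_t utf8_truncated_tail(const unsigned char *s, size_t n)
{
    size_t back = 0;
    size_t need, have;
    unsigned char lead;

    assert(s != NULL || n == 0);

    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (back < n && back < 3 && (s[n - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == n)
        return 0;                 /* empty, or no lead byte in the buffer */

    lead = s[n - 1 - back];
    if ((lead & 0x80) == 0x00)      need = 1;   /* ASCII */
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;                /* malformed rather than truncated */

    have = back + 1;              /* lead byte plus continuation bytes seen */
    return have < need ? have : 0;
}
```

       The  returned  count is the number of trailing bytes that would have to
       be carried over and placed in front of the next segment.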
7604
7605   Comparing hard and soft partial matching
7606
7607       The difference between the two partial matching options can  be  illus-
7608       trated by a pattern such as:
7609
7610         /dog(sbody)?/
7611
7612       This  matches either "dog" or "dogsbody", greedily (that is, it prefers
7613       the longer string if possible). If it is  matched  against  the  string
7614       "dog"  with  PCRE_PARTIAL_SOFT,  it  yields a complete match for "dog".
7615       However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
7616       On  the  other hand, if the pattern is made ungreedy the result is dif-
7617       ferent:
7618
7619         /dog(sbody)??/
7620
7621       In this case the result is always a  complete  match  because  that  is
7622       found  first,  and  matching  never  continues after finding a complete
7623       match. It might be easier to follow this explanation by thinking of the
7624       two patterns like this:
7625
7626         /dog(sbody)?/    is the same as  /dogsbody|dog/
7627         /dog(sbody)??/   is the same as  /dog|dogsbody/
7628
7629       The  second pattern will never match "dogsbody", because it will always
7630       find the shorter match first.
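
       The ordering argument can be checked with a toy model written in  plain
       C.  It  handles only literal alternatives, tried in order at the start
       of the subject, which is far simpler than PCRE's matcher;  the  names
       enum outcome and match_alternatives() are hypothetical. A non-zero hard
       argument models PCRE_PARTIAL_HARD, and zero models PCRE_PARTIAL_SOFT.

```c
#include <assert.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL_MATCH, COMPLETE_MATCH };

/* Tries each literal alternative in order, anchored at the start of subject. */
enum outcome match_alternatives(const char *const *alts, size_t nalts,
                                const char *subject, int hard)
{
    size_t slen, i;
    int saw_partial = 0;

    assert(alts != NULL && subject != NULL);
    slen = strlen(subject);

    for (i = 0; i < nalts; i++) {
        size_t alen = strlen(alts[i]);

        /* The alternative fits entirely: a complete match ends the search. */
        if (alen <= slen && strncmp(alts[i], subject, alen) == 0)
            return COMPLETE_MATCH;

        /* The subject ran out while still matching (and at least one
           character was inspected): this is a partial match. */
        if (slen > 0 && slen < alen && strncmp(alts[i], subject, slen) == 0) {
            if (hard)
                return PARTIAL_MATCH;   /* hard: stop at the first partial */
            saw_partial = 1;            /* soft: remember it, keep looking */
        }
    }
    return saw_partial ? PARTIAL_MATCH : NO_MATCH;
}
```

       With  the  alternatives in the order "dogsbody", "dog" and the subject
       "dog", the model gives a complete match in soft  mode  and  a  partial
       match  in  hard mode. With the order reversed it gives a complete match
       in both modes, mirroring the two patterns above.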
7631
7632
7633PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()
7634
7635       The DFA functions move along the subject string character by character,
7636       without  backtracking,  searching  for  all possible matches simultane-
7637       ously. If the end of the subject is reached before the end of the  pat-
7638       tern,  there is the possibility of a partial match, again provided that
7639       at least one character has been inspected.
7640
7641       When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned  only  if
7642       there  have  been  no complete matches. Otherwise, the complete matches
7643       are returned.  However, if PCRE_PARTIAL_HARD is set,  a  partial  match
7644       takes  precedence  over any complete matches. The portion of the string
7645       that was inspected when the longest partial match was found is  set  as
7646       the first matching string, provided there are at least two slots in the
7647       offsets vector.
7648
       Because the DFA functions always search for all possible  matches,  and
       there  is  no  difference between greedy and ungreedy repetition, their
       behaviour   is   different   from   the   standard   functions   when
       PCRE_PARTIAL_HARD is set. Consider the string "dog" matched against the
       ungreedy pattern shown above:
7654
7655         /dog(sbody)??/
7656
7657       Whereas the standard functions stop as soon as they find  the  complete
7658       match  for  "dog",  the  DFA  functions also find the partial match for
7659       "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
7660
7661
7662PARTIAL MATCHING AND WORD BOUNDARIES
7663
       If a pattern ends with one of the sequences \b or \B,  which  test  for
       word  boundaries,  partial  matching  with  PCRE_PARTIAL_SOFT  can give
       counter-intuitive results. Consider this pattern:
7667
7668         /\bcat\b/
7669
7670       This matches "cat", provided there is a word boundary at either end. If
7671       the subject string is "the cat", the comparison of the final "t" with a
7672       following character cannot take place, so a  partial  match  is  found.
7673       However,  normal  matching carries on, and \b matches at the end of the
7674       subject when the last character is a letter, so  a  complete  match  is
7675       found.   The   result,  therefore,  is  not  PCRE_ERROR_PARTIAL.  Using
7676       PCRE_PARTIAL_HARD in this case does yield  PCRE_ERROR_PARTIAL,  because
7677       then the partial match takes precedence.
7678
7679
7680FORMERLY RESTRICTED PATTERNS
7681
7682       For releases of PCRE prior to 8.00, because of the way certain internal
7683       optimizations  were  implemented  in  the  pcre_exec()  function,   the
7684       PCRE_PARTIAL  option  (predecessor  of  PCRE_PARTIAL_SOFT) could not be
7685       used with all patterns. From release 8.00 onwards, the restrictions  no
       longer apply, and partial matching can be requested for any pattern.
7688
7689       Items that were formerly restricted were repeated single characters and
7690       repeated  metasequences. If PCRE_PARTIAL was set for a pattern that did
7691       not conform to the restrictions, pcre_exec() returned  the  error  code
7692       PCRE_ERROR_BADPARTIAL  (-13).  This error code is no longer in use. The
7693       PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if  a  compiled
7694       pattern can be used for partial matching now always returns 1.
7695
7696
7697EXAMPLE OF PARTIAL MATCHING USING PCRETEST
7698
7699       If  the  escape  sequence  \P  is  present in a pcretest data line, the
7700       PCRE_PARTIAL_SOFT option is used for  the  match.  Here  is  a  run  of
7701       pcretest that uses the date example quoted above:
7702
7703           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
7704         data> 25jun04\P
7705          0: 25jun04
7706          1: jun
7707         data> 25dec3\P
         Partial match: 25dec3
7709         data> 3ju\P
7710         Partial match: 3ju
7711         data> 3juj\P
7712         No match
7713         data> j\P
7714         No match
7715
7716       The  first  data  string  is  matched completely, so pcretest shows the
7717       matched substrings. The remaining four strings do not  match  the  com-
7718       plete pattern, but the first two are partial matches. Similar output is
7719       obtained if DFA matching is used.
7720
7721       If the escape sequence \P is present more than once in a pcretest  data
7722       line, the PCRE_PARTIAL_HARD option is set for the match.
7723
7724
7725MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()
7726
7727       When  a  partial match has been found using a DFA matching function, it
7728       is possible to continue the match by providing additional subject  data
7729       and  calling  the function again with the same compiled regular expres-
7730       sion, this time setting the PCRE_DFA_RESTART option. You must pass  the
7731       same working space as before, because this is where details of the pre-
7732       vious partial match are stored. Here  is  an  example  using  pcretest,
7733       using  the  \R  escape  sequence to set the PCRE_DFA_RESTART option (\D
7734       specifies the use of the DFA matching function):
7735
7736           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
7737         data> 23ja\P\D
7738         Partial match: 23ja
7739         data> n05\R\D
7740          0: n05
7741
7742       The first call has "23ja" as the subject, and requests  partial  match-
7743       ing;  the  second  call  has  "n05"  as  the  subject for the continued
7744       (restarted) match.  Notice that when the match is  complete,  only  the
7745       last  part  is  shown;  PCRE  does not retain the previously partially-
7746       matched string. It is up to the calling program to do that if it  needs
7747       to.
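
       A  program  that wants the whole matched text therefore has to collect
       the pieces itself. The following is a minimal  sketch,  assuming  8-bit
       subjects  and  a  fixed-size  buffer; struct match_accum and accum_ap-
       pend() are hypothetical names. The caller is assumed to pass the first
       two offsets returned by each call, and the sketch ignores the extra
       leading characters that lookbehinds can add to a partial match.

```c
#include <assert.h>
#include <string.h>

struct match_accum {
    char text[256];   /* accumulated matched text, always NUL-terminated */
    size_t len;       /* number of bytes currently stored */
};

/* Appends subject[start .. end) to the accumulated text.
   Returns 0 on success, or -1 if the offsets are invalid or it will not fit. */
int accum_append(struct match_accum *a, const char *subject,
                 int start, int end)
{
    size_t n;
    assert(a != NULL && subject != NULL);
    if (start < 0 || end < start)
        return -1;
    n = (size_t)(end - start);
    if (a->len + n >= sizeof a->text)
        return -1;
    memcpy(a->text + a->len, subject + start, n);
    a->len += n;
    a->text[a->len] = '\0';
    return 0;
}
```

       For the run shown above, appending offsets 0 to 4 of "23ja"  and  then
       offsets 0 to 3 of "n05" leaves "23jan05" in the buffer.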
7748
7749       You  can  set  the  PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
7750       PCRE_DFA_RESTART to continue partial matching over  multiple  segments.
7751       This  facility can be used to pass very long subject strings to the DFA
7752       matching functions.
7753
7754
7755MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()
7756
7757       From release 8.00, the standard matching functions can also be used  to
7758       do multi-segment matching. Unlike the DFA functions, it is not possible
7759       to restart the previous match with a new segment of data. Instead,  new
7760       data must be added to the previous subject string, and the entire match
7761       re-run, starting from the point where the partial match occurred.  Ear-
7762       lier data can be discarded.
7763
7764       It  is best to use PCRE_PARTIAL_HARD in this situation, because it does
7765       not treat the end of a segment as the end of the subject when  matching
7766       \z,  \Z,  \b,  \B,  and  $. Consider an unanchored pattern that matches
7767       dates:
7768
7769           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
7770         data> The date is 23ja\P\P
7771         Partial match: 23ja
7772
7773       At this stage, an application could discard the text preceding  "23ja",
7774       add  on  text  from  the  next  segment, and call the matching function
7775       again. Unlike the DFA matching functions, the  entire  matching  string
7776       must  always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.
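
       The  buffer  handling  can be sketched as follows, assuming an 8-bit
       subject held in a caller-owned buffer. slide_and_append() is  a  hypo-
       thetical  helper; keep_from would normally be the first offset returned
       with PCRE_ERROR_PARTIAL.

```c
#include <assert.h>
#include <string.h>

/* Drops buf[0 .. keep_from), moves the rest to the front, then appends the
   next segment and a terminating NUL. Returns the new length, or
   (size_t)-1 if keep_from is out of range or the result would not fit. */
size_t slide_and_append(char *buf, size_t len, size_t cap,
                        size_t keep_from, const char *seg, size_t seg_len)
{
    size_t kept;
    assert(buf != NULL && seg != NULL);
    if (keep_from > len)
        return (size_t)-1;
    kept = len - keep_from;
    if (kept + seg_len + 1 > cap)
        return (size_t)-1;
    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, seg, seg_len);
    buf[kept + seg_len] = '\0';
    return kept + seg_len;
}
```

       In the example above the partial match "23ja" starts at offset  12  of
       "The  date  is  23ja", so calling the helper with keep_from set to 12
       leaves "23ja" at the front of the buffer, followed by the new segment.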
7778
7779       Note: If the pattern contains lookbehind assertions, or \K,  or  starts
7780       with \b or \B, the string that is returned for a partial match includes
7781       characters that precede the partially matched  string  itself,  because
7782       these  must be retained when adding on more characters for a subsequent
7783       matching attempt.  However, in some cases you may need to  retain  even
7784       earlier characters, as discussed in the next section.
7785
7786
7787ISSUES WITH MULTI-SEGMENT MATCHING
7788
7789       Certain types of pattern may give problems with multi-segment matching,
7790       whichever matching function is used.
7791
7792       1. If the pattern contains a test for the beginning of a line, you need
7793       to  pass  the  PCRE_NOTBOL  option when the subject string for any call
       does  not start at the beginning of a line. There is also a PCRE_NOTEOL
7795       option, but in practice when doing multi-segment matching you should be
7796       using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
7797
7798       2. Lookbehind assertions that have already been obeyed are catered  for
7799       in the offsets that are returned for a partial match. However a lookbe-
7800       hind assertion later in the pattern could require even earlier  charac-
7801       ters   to  be  inspected.  You  can  handle  this  case  by  using  the
7802       PCRE_INFO_MAXLOOKBEHIND    option    of    the    pcre_fullinfo()    or
7803       pcre16_fullinfo() functions to obtain the length of the largest lookbe-
7804       hind in the pattern. This length is given in characters, not bytes.  If
7805       you  always  retain  at least that many characters before the partially
7806       matched string, all should be well. (Of course, near the start  of  the
7807       subject,  fewer  characters may be present; in that case all characters
7808       should be retained.)
7809
7810       3. Because a partial match must always contain at least one  character,
7811       what  might  be  considered a partial match of an empty string actually
7812       gives a "no match" result. For example:
7813
7814           re> /c(?<=abc)x/
7815         data> ab\P
7816         No match
7817
7818       If the next segment begins "cx", a match should be found, but this will
7819       only  happen  if characters from the previous segment are retained. For
7820       this reason, a "no match" result  should  be  interpreted  as  "partial
7821       match of an empty string" when the pattern contains lookbehinds.
7822
7823       4.  Matching  a subject string that is split into multiple segments may
7824       not always produce exactly the same result as matching over one  single
7825       long  string,  especially  when  PCRE_PARTIAL_SOFT is used. The section
7826       "Partial Matching and Word Boundaries" above describes  an  issue  that
7827       arises  if  the  pattern ends with \b or \B. Another kind of difference
7828       may occur when there are multiple matching possibilities, because  (for
7829       PCRE_PARTIAL_SOFT)  a partial match result is given only when there are
7830       no completed matches. This means that as soon as the shortest match has
7831       been  found,  continuation to a new subject segment is no longer possi-
7832       ble. Consider again this pcretest example:
7833
7834           re> /dog(sbody)?/
7835         data> dogsb\P
7836          0: dog
7837         data> do\P\D
7838         Partial match: do
7839         data> gsb\R\P\D
7840          0: g
7841         data> dogsbody\D
7842          0: dogsbody
7843          1: dog
7844
7845       The first data line passes the string "dogsb" to  a  standard  matching
7846       function,  setting the PCRE_PARTIAL_SOFT option. Although the string is
7847       a partial match for "dogsbody", the result is  not  PCRE_ERROR_PARTIAL,
7848       because  the  shorter string "dog" is a complete match. Similarly, when
7849       the subject is presented to a DFA matching function  in  several  parts
7850       ("do"  and  "gsb"  being  the first two) the match stops when "dog" has
7851       been found, and it is not possible to continue.  On the other hand,  if
7852       "dogsbody"  is  presented  as  a single string, a DFA matching function
7853       finds both matches.
7854
7855       Because of these problems, it is best  to  use  PCRE_PARTIAL_HARD  when
7856       matching  multi-segment  data.  The  example above then behaves differ-
7857       ently:
7858
7859           re> /dog(sbody)?/
7860         data> dogsb\P\P
7861         Partial match: dogsb
7862         data> do\P\D
7863         Partial match: do
7864         data> gsb\R\P\P\D
7865         Partial match: gsb
7866
7867       5. Patterns that contain alternatives at the top level which do not all
7868       start  with  the  same  pattern  item  may  not  work  as expected when
7869       PCRE_DFA_RESTART is used. For example, consider this pattern:
7870
7871         1234|3789
7872
7873       If the first part of the subject is "ABC123", a partial  match  of  the
7874       first  alternative  is found at offset 3. There is no partial match for
7875       the second alternative, because such a match does not start at the same
7876       point  in  the  subject  string. Attempting to continue with the string
7877       "7890" does not yield a match  because  only  those  alternatives  that
7878       match  at  one  point in the subject are remembered. The problem arises
7879       because the start of the second alternative matches  within  the  first
7880       alternative.  There  is  no  problem with anchored patterns or patterns
7881       such as:
7882
7883         1234|ABCD
7884
7885       where no string can be a partial match for both alternatives.  This  is
7886       not  a  problem  if  a  standard matching function is used, because the
7887       entire match has to be rerun each time:
7888
7889           re> /1234|3789/
7890         data> ABC123\P\P
7891         Partial match: 123
7892         data> 1237890
7893          0: 3789
7894
7895       Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
7896       running  the  entire match can also be used with the DFA matching func-
7897       tions. Another possibility is to work with two buffers.  If  a  partial
7898       match  at  offset  n in the first buffer is followed by "no match" when
7899       PCRE_DFA_RESTART is used on the second buffer, you can then try  a  new
7900       match starting at offset n+1 in the first buffer.
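
       The technique of re-running the entire match over the joined data can be
       illustrated without PCRE. In the following minimal sketch, plain
       strstr() stands in for the matching function, and search_joined() is a
       hypothetical helper written for this illustration, not a PCRE function:

```c
#include <stdlib.h>
#include <string.h>

/* Joins two segments and searches the joined text for the leftmost
   occurrence of any alternative. Plain strstr() stands in for a regex
   matcher. Returns the offset in the joined text, or -1 if nothing is
   found or memory cannot be allocated. */
long search_joined(const char *seg1, const char *seg2,
                   const char *const alts[], size_t nalts)
{
    size_t len1 = strlen(seg1), len2 = strlen(seg2);
    char *joined = malloc(len1 + len2 + 1);
    long best = -1;
    size_t i;

    if (joined == NULL) return -1;
    memcpy(joined, seg1, len1);
    memcpy(joined + len1, seg2, len2 + 1);   /* copies the final NUL too */

    for (i = 0; i < nalts; i++) {
        const char *hit = strstr(joined, alts[i]);
        if (hit != NULL && (best < 0 || (long)(hit - joined) < best))
            best = (long)(hit - joined);
    }
    free(joined);
    return best;
}
```

       With the segments "ABC123" and "7890" and the alternatives "1234" and
       "3789", the search over the joined text finds "3789" at offset 5, which
       a restart that remembers only the first alternative cannot do.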
7901
7902
7903AUTHOR
7904
7905       Philip Hazel
7906       University Computing Service
7907       Cambridge CB2 3QH, England.
7908
7909
7910REVISION
7911
7912       Last updated: 24 February 2012
7913       Copyright (c) 1997-2012 University of Cambridge.
7914------------------------------------------------------------------------------
7915
7916
7917PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
7918
7919
7920NAME
7921       PCRE - Perl-compatible regular expressions
7922
7923
7924SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
7925
7926       If  you  are running an application that uses a large number of regular
7927       expression patterns, it may be useful to store them  in  a  precompiled
7928       form  instead  of  having to compile them every time the application is
7929       run.  If you are not  using  any  private  character  tables  (see  the
7930       pcre_maketables()  documentation),  this is relatively straightforward.
7931       If you are using private tables, it is a little bit  more  complicated.
7932       However,  if you are using the just-in-time optimization feature, it is
7933       not possible to save and reload the JIT data.
7934
7935       If you save compiled patterns to a file, you can copy them to a differ-
7936       ent host and run them there. If the two hosts have different endianness
7937       (byte order), you should run the  pcre[16]_pattern_to_host_byte_order()
7938       function on the new host before trying to match the pattern. The match-
7939       ing functions return PCRE_ERROR_BADENDIANNESS if they detect a  pattern
7940       with the wrong endianness.
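
       To make the notion of differing byte order concrete, here is a minimal
       sketch in standard C. PCRE performs the actual conversion itself in
       pcre[16]_pattern_to_host_byte_order(); the helpers host_is_little_endian()
       and swap32() are hypothetical names used only for this illustration:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);      /* lowest-addressed byte of probe */
    return first == 1;
}

/* Reverses the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}
```

       A multi-byte value written on one kind of host and read back unchanged
       on the other appears byte-reversed, as swap32() shows: 0x11223344
       becomes 0x44332211.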
7941
7942       Compiling  regular  expressions with one version of PCRE for use with a
7943       different version is not guaranteed to work and may cause crashes,  and
7944       saving  and  restoring  a  compiled  pattern loses any JIT optimization
7945       data.
7946
7947
7948SAVING A COMPILED PATTERN
7949
7950       The value returned by pcre[16]_compile() points to a  single  block  of
7951       memory  that  holds  the  compiled pattern and associated data. You can
7952       find the length of this block in bytes by  calling  pcre[16]_fullinfo()
7953       with  an  argument of PCRE_INFO_SIZE. You can then save the data in any
7954       appropriate manner. Here is sample code for the 8-bit library that com-
7955       piles  a  pattern and writes it to a file. It assumes that the variable
7956       fd refers to a file that is open for output:
7957
         int erroroffset, rc;
         size_t size, written;
         const char *error;
         pcre *re;

         /* fd is a FILE * that is open for binary output */
         re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
         if (re == NULL) { ... handle errors ... }
         rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
         if (rc < 0) { ... handle errors ... }
         written = fwrite(re, 1, size, fd);
         if (written != size) { ... handle errors ... }
7968
7969       In this example, the bytes  that  comprise  the  compiled  pattern  are
7970       copied  exactly.  Note that this is binary data that may contain any of
7971       the 256 possible byte  values.  On  systems  that  make  a  distinction
7972       between binary and non-binary data, be sure that the file is opened for
7973       binary output.
7974
7975       If you want to write more than one pattern to a file, you will have  to
7976       devise  a  way of separating them. For binary data, preceding each pat-
7977       tern with its length is probably  the  most  straightforward  approach.
7978       Another  possibility is to write out the data in hexadecimal instead of
7979       binary, one pattern to a line.
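
       The length-prefix approach might look like the following sketch. The
       record layout (a 4-byte big-endian length followed by the data) and
       the names write_record() and read_record() are assumptions made for
       this illustration; PCRE does not define a file format:

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes one record: a 4-byte big-endian length, then the data bytes.
   Returns 0 on success, -1 on failure. */
int write_record(FILE *fp, const void *data, size_t len)
{
    unsigned char hdr[4];

    if (len > 0xFFFFFFFFul) return -1;       /* does not fit the prefix */
    hdr[0] = (unsigned char)((len >> 24) & 0xFF);
    hdr[1] = (unsigned char)((len >> 16) & 0xFF);
    hdr[2] = (unsigned char)((len >> 8) & 0xFF);
    hdr[3] = (unsigned char)(len & 0xFF);
    if (fwrite(hdr, 1, 4, fp) != 4) return -1;
    if (len > 0 && fwrite(data, 1, len, fp) != len) return -1;
    return 0;
}

/* Reads one record into a malloc()ed buffer that the caller frees.
   Stores the data length in *len. Returns NULL at end of file or on
   error. */
void *read_record(FILE *fp, size_t *len)
{
    unsigned char hdr[4];
    void *buf;

    if (fread(hdr, 1, 4, fp) != 4) return NULL;
    *len = ((size_t)hdr[0] << 24) | ((size_t)hdr[1] << 16) |
           ((size_t)hdr[2] << 8) | (size_t)hdr[3];
    buf = malloc(*len > 0 ? *len : 1);
    if (buf == NULL) return NULL;
    if (*len > 0 && fread(buf, 1, *len, fp) != *len) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

       Writing the length byte by byte keeps the prefix itself independent of
       the host byte order. The pattern data that follows it is still subject
       to the endianness considerations described earlier.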
7980
7981       Saving compiled patterns in a file is only one possible way of  storing
7982       them  for later use. They could equally well be saved in a database, or
7983       in the memory of some daemon process that passes them  via  sockets  to
7984       the processes that want them.
7985
       If the pattern has been studied, it is also possible to save the normal
       study data in a similar way to the compiled pattern itself. However, if
       the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data that
       is created cannot be saved because it is too dependent on the current
       environment. When studying generates additional information,
       pcre[16]_study() returns a pointer to a pcre[16]_extra data block. Its
       format is defined in the section on matching a pattern in the pcreapi
       documentation. The study_data field points to the binary study data,
       and this is what you must save (not the pcre[16]_extra block itself).
       The length of the study data can be obtained by calling
       pcre[16]_fullinfo() with an argument of PCRE_INFO_STUDYSIZE. Remember
       to check that pcre[16]_study() did return a non-NULL value before
       trying to save the study data.
7999
8000
8001RE-USING A PRECOMPILED PATTERN
8002
8003       Re-using  a  precompiled pattern is straightforward. Having reloaded it
8004       into main memory, called pcre[16]_pattern_to_host_byte_order() if  nec-
8005       essary,  you pass its pointer to pcre[16]_exec() or pcre[16]_dfa_exec()
8006       in the usual way.
8007
       However, if you passed a pointer to custom character tables when the
       pattern was compiled (the tableptr argument of pcre[16]_compile()), you
       must now pass a similar pointer to pcre[16]_exec() or
       pcre[16]_dfa_exec(), because the pointer value saved with the compiled
       pattern is meaningless after a reload. A field in a pcre[16]_extra
       block is used to pass this data, as described in the section on
       matching a pattern in the pcreapi documentation.
8015
8016       If you did not provide custom character tables  when  the  pattern  was
8017       compiled, the pointer in the compiled pattern is NULL, which causes the
8018       matching functions to use PCRE's internal tables. Thus, you do not need
8019       to take any special action at run time in this case.
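
       Reloading a saved pattern into main memory, the first step described in
       this section, can be sketched in standard C as follows. The helper name
       read_all() is hypothetical, and the sketch assumes that the stream
       holds exactly one saved pattern and was opened in binary mode:

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads everything from the current position to end of file into a
   malloc()ed buffer that the caller frees. Stores the byte count in
   *len. Returns NULL on failure. */
void *read_all(FILE *fp, size_t *len)
{
    long start, end;
    void *buf;

    start = ftell(fp);
    if (start < 0 || fseek(fp, 0, SEEK_END) != 0) return NULL;
    end = ftell(fp);
    if (end < start || fseek(fp, start, SEEK_SET) != 0) return NULL;

    *len = (size_t)(end - start);
    buf = malloc(*len > 0 ? *len : 1);
    if (buf == NULL) return NULL;
    if (fread(buf, 1, *len, fp) != *len) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

       Memory obtained from malloc() is suitably aligned for any object type,
       so the returned block can be cast to a pcre pointer and handed to the
       matching functions as described above.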
8020
8021       If  you  saved study data with the compiled pattern, you need to create
8022       your own pcre[16]_extra data block and  set  the  study_data  field  to
8023       point   to   the   reloaded   study   data.   You  must  also  set  the
8024       PCRE_EXTRA_STUDY_DATA bit in the flags field  to  indicate  that  study
8025       data  is  present.  Then  pass the pcre[16]_extra block to the matching
8026       function in the usual way. If the pattern was studied for  just-in-time
8027       optimization,  that  data  cannot  be  saved,  and  so  is  lost  by  a
8028       save/restore cycle.
8029
8030
8031COMPATIBILITY WITH DIFFERENT PCRE RELEASES
8032
8033       In general, it is safest to  recompile  all  saved  patterns  when  you
8034       update  to  a new PCRE release, though not all updates actually require
8035       this.
8036
8037
8038AUTHOR
8039
8040       Philip Hazel
8041       University Computing Service
8042       Cambridge CB2 3QH, England.
8043
8044
8045REVISION
8046
8047       Last updated: 10 January 2012
8048       Copyright (c) 1997-2012 University of Cambridge.
8049------------------------------------------------------------------------------
8050
8051
8052PCREPERFORM(3)                                                  PCREPERFORM(3)
8053
8054
8055NAME
8056       PCRE - Perl-compatible regular expressions
8057
8058
8059PCRE PERFORMANCE
8060
8061       Two  aspects  of performance are discussed below: memory usage and pro-
8062       cessing time. The way you express your pattern as a regular  expression
8063       can affect both of them.
8064
8065
8066COMPILED PATTERN MEMORY USAGE
8067
8068       Patterns  are compiled by PCRE into a reasonably efficient interpretive
8069       code, so that most simple patterns do not  use  much  memory.  However,
8070       there  is  one case where the memory usage of a compiled pattern can be
8071       unexpectedly large. If a parenthesized subpattern has a quantifier with
8072       a minimum greater than 1 and/or a limited maximum, the whole subpattern
8073       is repeated in the compiled code. For example, the pattern
8074
8075         (abc|def){2,4}
8076
8077       is compiled as if it were
8078
8079         (abc|def)(abc|def)((abc|def)(abc|def)?)?
8080
8081       (Technical aside: It is done this way so that backtrack  points  within
8082       each of the repetitions can be independently maintained.)
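
       The textual replication shown above can be reproduced with a short
       sketch. It only builds the "as if it were" pattern string; it is not
       how PCRE's compiler is implemented, and expand_repeat() is a
       hypothetical helper that assumes the group is already parenthesized:

```c
#include <string.h>

/* Appends src to out if it fits; returns 0 on success, -1 otherwise. */
static int append(char *out, size_t outsize, const char *src)
{
    size_t used = strlen(out), need = strlen(src);
    if (used + need + 1 > outsize) return -1;
    memcpy(out + used, src, need + 1);
    return 0;
}

/* Writes the replicated form of group{min,max} into out, where group is
   a parenthesized subpattern and 0 <= min <= max. Returns 0 on success,
   -1 if the arguments are invalid or out is too small. */
int expand_repeat(const char *group, int min, int max,
                  char *out, size_t outsize)
{
    int optional = max - min, i;

    if (min < 0 || max < min || outsize == 0) return -1;
    out[0] = '\0';

    for (i = 0; i < min; i++) {               /* mandatory copies */
        if (append(out, outsize, group) != 0) return -1;
    }
    for (i = 1; i < optional; i++) {          /* open nested optional groups */
        if (append(out, outsize, "(") != 0) return -1;
        if (append(out, outsize, group) != 0) return -1;
    }
    if (optional > 0) {                       /* innermost optional copy */
        if (append(out, outsize, group) != 0) return -1;
        if (append(out, outsize, "?") != 0) return -1;
    }
    for (i = 1; i < optional; i++) {          /* close the nested groups */
        if (append(out, outsize, ")?") != 0) return -1;
    }
    return 0;
}
```

       Calling expand_repeat("(abc|def)", 2, 4, buf, sizeof buf) leaves the
       string (abc|def)(abc|def)((abc|def)(abc|def)?)? in buf. Each increase
       in the maximum adds another full copy of the group.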
8083
8084       For  regular expressions whose quantifiers use only small numbers, this
8085       is not usually a problem. However, if the numbers are large,  and  par-
8086       ticularly  if  such repetitions are nested, the memory usage can become
8087       an embarrassment. For example, the very simple pattern
8088
8089         ((ab){1,1000}c){1,3}
8090
8091       uses 51K bytes when compiled using the 8-bit library. When PCRE is com-
8092       piled  with  its  default  internal pointer size of two bytes, the size
8093       limit on a compiled pattern is 64K data units, and this is reached with
8094       the  above  pattern  if  the outer repetition is increased from 3 to 4.
8095       PCRE can be compiled to use larger internal pointers  and  thus  handle
8096       larger  compiled patterns, but it is better to try to rewrite your pat-
8097       tern to use less memory if you can.
8098
8099       One way of reducing the memory usage for such patterns is to  make  use
8100       of PCRE's "subroutine" facility. Re-writing the above pattern as
8101
8102         ((ab)(?2){0,999}c)(?1){0,2}
8103
8104       reduces the memory requirements to 18K, and indeed it remains under 20K
8105       even with the outer repetition increased to 100. However, this  pattern
8106       is  not  exactly equivalent, because the "subroutine" calls are treated
8107       as atomic groups into which there can be no backtracking if there is  a
8108       subsequent  matching  failure.  Therefore,  PCRE cannot do this kind of
8109       rewriting automatically.  Furthermore, there is a  noticeable  loss  of
8110       speed  when executing the modified pattern. Nevertheless, if the atomic
8111       grouping is not a problem and the loss of  speed  is  acceptable,  this
8112       kind  of  rewriting will allow you to process patterns that PCRE cannot
8113       otherwise handle.
8114
8115
8116STACK USAGE AT RUN TIME
8117
8118       When pcre_exec() or pcre16_exec() is used for matching,  certain  kinds
8119       of  pattern  can cause it to use large amounts of the process stack. In
8120       some environments the default process stack is quite small, and  if  it
8121       runs  out  the result is often SIGSEGV. This issue is probably the most
8122       frequently raised problem with PCRE. Rewriting your pattern  can  often
8123       help. The pcrestack documentation discusses this issue in detail.
8124
8125
8126PROCESSING TIME
8127
8128       Certain  items  in regular expression patterns are processed more effi-
8129       ciently than others. It is more efficient to use a character class like
8130       [aeiou]   than   a   set   of  single-character  alternatives  such  as
8131       (a|e|i|o|u). In general, the simplest construction  that  provides  the
8132       required behaviour is usually the most efficient. Jeffrey Friedl's book
8133       contains a lot of useful general discussion  about  optimizing  regular
8134       expressions  for  efficient  performance.  This document contains a few
8135       observations about PCRE.
8136
8137       Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
8138       slow,  because PCRE has to scan a structure that contains data for over
8139       fifteen thousand characters whenever it needs a  character's  property.
8140       If  you  can  find  an  alternative pattern that does not use character
8141       properties, it will probably be faster.
8142
8143       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
8144       character  classes  such  as  [:alpha:]  do not use Unicode properties,
8145       partly for backwards compatibility, and partly for performance reasons.
8146       However,  you can set PCRE_UCP if you want Unicode character properties
8147       to be used. This can double the matching time for  items  such  as  \d,
8148       when matched with a traditional matching function; the performance loss
8149       is less with a DFA matching function, and in both cases  there  is  not
8150       much difference for \b.
8151
8152       When  a  pattern  begins  with .* not in parentheses, or in parentheses
8153       that are not the subject of a backreference, and the PCRE_DOTALL option
8154       is  set, the pattern is implicitly anchored by PCRE, since it can match
8155       only at the start of a subject string. However, if PCRE_DOTALL  is  not
8156       set,  PCRE  cannot  make this optimization, because the . metacharacter
8157       does not then match a newline, and if the subject string contains  new-
8158       lines,  the  pattern may match from the character immediately following
8159       one of them instead of from the very start. For example, the pattern
8160
8161         .*second
8162
8163       matches the subject "first\nand second" (where \n stands for a  newline
8164       character),  with the match starting at the seventh character. In order
8165       to do this, PCRE has to retry the match starting after every newline in
8166       the subject.
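
       The effect of retrying after each newline can be imitated in plain C
       for the special case of a pattern of the form .*LITERAL. This is only
       a sketch of the search order, not PCRE's implementation; the helper
       dotstar_match_start() is hypothetical, and "\n" is assumed to be the
       only newline sequence:

```c
#include <string.h>

/* For a pattern of the form .*LITERAL matched without PCRE_DOTALL,
   returns the offset at which the leftmost match starts, or -1 if there
   is no match. LITERAL must not contain a newline. */
long dotstar_match_start(const char *subject, const char *literal)
{
    const char *line = subject;
    size_t litlen = strlen(literal);

    for (;;) {
        const char *eol = strchr(line, '\n');
        size_t linelen = eol ? (size_t)(eol - line) : strlen(line);
        size_t i;

        /* .* cannot cross a newline, so LITERAL must lie in this line. */
        for (i = 0; i + litlen <= linelen; i++) {
            if (memcmp(line + i, literal, litlen) == 0)
                return (long)(line - subject);
        }
        if (eol == NULL) return -1;
        line = eol + 1;              /* retry just after the newline */
    }
}
```

       For the subject "first\nand second" and the literal "second", the
       function returns offset 6, which is the seventh character.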
8167
8168       If  you  are using such a pattern with subject strings that do not con-
8169       tain newlines, the best performance is obtained by setting PCRE_DOTALL,
8170       or  starting  the pattern with ^.* or ^.*? to indicate explicit anchor-
8171       ing. That saves PCRE from having to scan along the subject looking  for
8172       a newline to restart at.
8173
8174       Beware  of  patterns  that contain nested indefinite repeats. These can
8175       take a long time to run when applied to a string that does  not  match.
8176       Consider the pattern fragment
8177
8178         ^(a+)*
8179
8180       This  can  match "aaaa" in 16 different ways, and this number increases
8181       very rapidly as the string gets longer. (The * repeat can match  0,  1,
8182       2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
8183       repeats can match different numbers of times.) When  the  remainder  of
8184       the pattern is such that the entire match is going to fail, PCRE has in
8185       principle to try  every  possible  variation,  and  this  can  take  an
8186       extremely long time, even for relatively short strings.
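
       The count of 16 can be checked with a small recursive sketch that
       enumerates the choices open to the two repeats. The function
       count_ways() is a hypothetical illustration of the combinatorics, not
       part of PCRE:

```c
/* Counts the ways ^(a+)* can match at the start of a run of n "a"
   characters: either the * repeat stops here, or a+ takes j characters
   and the * repeat goes round again on what is left. */
unsigned long long count_ways(unsigned int n)
{
    unsigned long long total = 1;        /* the * repeat stops here */
    unsigned int j;

    for (j = 1; j <= n; j++)             /* a+ consumes j characters */
        total += count_ways(n - j);
    return total;
}
```

       count_ways(4) returns 16, and the result doubles with every additional
       character, so count_ways(20) is already 1048576.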
8187
8188       An optimization catches some of the more simple cases such as
8189
8190         (a+)*b
8191
8192       where  a  literal  character  follows. Before embarking on the standard
8193       matching procedure, PCRE checks that there is a "b" later in  the  sub-
8194       ject  string, and if there is not, it fails the match immediately. How-
8195       ever, when there is no following literal this  optimization  cannot  be
8196       used. You can see the difference by comparing the behaviour of
8197
8198         (a+)*\d
8199
       with the pattern above. The pattern above, (a+)*b, gives a failure
       almost instantly when applied to a whole line of "a" characters,
       whereas (a+)*\d takes an appreciable time with strings longer than
       about 20 characters.
8203
8204       In many cases, the solution to this kind of performance issue is to use
8205       an atomic group or a possessive quantifier.
8206
8207
8208AUTHOR
8209
8210       Philip Hazel
8211       University Computing Service
8212       Cambridge CB2 3QH, England.
8213
8214
8215REVISION
8216
8217       Last updated: 09 January 2012
8218       Copyright (c) 1997-2012 University of Cambridge.
8219------------------------------------------------------------------------------
8220
8221
8222PCREPOSIX(3)                                                      PCREPOSIX(3)
8223
8224
8225NAME
8226       PCRE - Perl-compatible regular expressions.
8227
8228
8229SYNOPSIS OF POSIX API
8230
8231       #include <pcreposix.h>
8232
8233       int regcomp(regex_t *preg, const char *pattern,
8234            int cflags);
8235
8236       int regexec(regex_t *preg, const char *string,
8237            size_t nmatch, regmatch_t pmatch[], int eflags);
8238
8239       size_t regerror(int errcode, const regex_t *preg,
8240            char *errbuf, size_t errbuf_size);
8241
8242       void regfree(regex_t *preg);
8243
8244
8245DESCRIPTION
8246
8247       This  set  of functions provides a POSIX-style API for the PCRE regular
8248       expression 8-bit library. See the pcreapi documentation for a  descrip-
8249       tion  of  PCRE's native API, which contains much additional functional-
8250       ity. There is no POSIX-style wrapper for PCRE's 16-bit library.
8251
       The functions described here are just wrapper functions that ultimately
       call the PCRE native API. Their prototypes are defined in the
       pcreposix.h header file, and on Unix systems the library itself is
       called libpcreposix.a, so can be accessed by adding -lpcreposix to the
       command for linking an application that uses them. Because the POSIX
       functions call the native ones, it is also necessary to add -lpcre.
8258
8259       I  have implemented only those POSIX option bits that can be reasonably
8260       mapped to PCRE native options. In addition, the option REG_EXTENDED  is
8261       defined  with  the  value  zero. This has no effect, but since programs
8262       that are written to the POSIX interface often use  it,  this  makes  it
8263       easier  to  slot  in PCRE as a replacement library. Other POSIX options
8264       are not even defined.
8265
8266       There are also some other options that are not defined by POSIX.  These
8267       have been added at the request of users who want to make use of certain
8268       PCRE-specific features via the POSIX calling interface.
8269
8270       When PCRE is called via these functions, it is only  the  API  that  is
8271       POSIX-like  in  style.  The syntax and semantics of the regular expres-
8272       sions themselves are still those of Perl, subject  to  the  setting  of
8273       various  PCRE  options, as described below. "POSIX-like in style" means
8274       that the API approximates to the POSIX  definition;  it  is  not  fully
8275       POSIX-compatible,  and  in  multi-byte  encoding domains it is probably
8276       even less compatible.
8277
8278       The header for these functions is supplied as pcreposix.h to avoid  any
8279       potential  clash  with  other  POSIX  libraries.  It can, of course, be
8280       renamed or aliased as regex.h, which is the "correct" name. It provides
8281       two  structure  types,  regex_t  for  compiled internal forms, and reg-
8282       match_t for returning captured substrings. It also  defines  some  con-
8283       stants  whose  names  start  with  "REG_";  these  are used for setting
8284       options and identifying error codes.
8285
8286
8287COMPILING A PATTERN
8288
8289       The function regcomp() is called to compile a pattern into an  internal
8290       form.  The  pattern  is  a C string terminated by a binary zero, and is
8291       passed in the argument pattern. The preg argument is  a  pointer  to  a
8292       regex_t  structure that is used as a base for storing information about
8293       the compiled regular expression.
8294
8295       The argument cflags is either zero, or contains one or more of the bits
8296       defined by the following macros:
8297
8298         REG_DOTALL
8299
8300       The PCRE_DOTALL option is set when the regular expression is passed for
8301       compilation to the native function. Note that REG_DOTALL is not part of
8302       the POSIX standard.
8303
8304         REG_ICASE
8305
8306       The  PCRE_CASELESS  option is set when the regular expression is passed
8307       for compilation to the native function.
8308
8309         REG_NEWLINE
8310
8311       The PCRE_MULTILINE option is set when the regular expression is  passed
8312       for  compilation  to the native function. Note that this does not mimic
8313       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
8314       tion).
8315
8316         REG_NOSUB
8317
8318       The  PCRE_NO_AUTO_CAPTURE  option is set when the regular expression is
8319       passed for compilation to the native function. In addition, when a pat-
8320       tern  that is compiled with this flag is passed to regexec() for match-
8321       ing, the nmatch and pmatch  arguments  are  ignored,  and  no  captured
8322       strings are returned.
8323
8324         REG_UCP
8325
       The PCRE_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.
8330
8331         REG_UNGREEDY
8332
8333       The PCRE_UNGREEDY option is set when the regular expression  is  passed
8334       for  compilation  to the native function. Note that REG_UNGREEDY is not
8335       part of the POSIX standard.
8336
8337         REG_UTF8
8338
8339       The PCRE_UTF8 option is set when the regular expression is  passed  for
8340       compilation  to the native function. This causes the pattern itself and
8341       all data strings used for matching it to be treated as  UTF-8  strings.
8342       Note that REG_UTF8 is not part of the POSIX standard.
8343
       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by . (they are not) or
       by a negative class such as [^a] (they are).
8351
8352       The  yield of regcomp() is zero on success, and non-zero otherwise. The
8353       preg structure is filled in on success, and one member of the structure
8354       is  public: re_nsub contains the number of capturing subpatterns in the
8355       regular expression. Various error codes are defined in the header file.
8356
8357       NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
8358       use the contents of the preg structure. If, for example, you pass it to
8359       regexec(), the result is undefined and your program is likely to crash.
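
       A minimal sketch of the compile, match, and free sequence follows. It
       is written against a system <regex.h> so that it stands alone; the
       assumption is that with PCRE you would include pcreposix.h instead and
       link with -lpcreposix -lpcre. The function posix_match() is a
       hypothetical helper:

```c
#include <regex.h>    /* with PCRE: #include <pcreposix.h> instead */
#include <stddef.h>

/* Returns 1 if pattern matches somewhere in subject, 0 if it does not
   (or if matching fails for another reason), and -1 if the pattern
   fails to compile. */
int posix_match(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                 /* re must not be used after a failure */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       The early return on a non-zero yield from regcomp() is the important
       part: the regex_t is passed to regexec() only after a successful
       compilation.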
8360
8361
8362MATCHING NEWLINE CHARACTERS
8363
8364       This area is not simple, because POSIX and Perl take different views of
8365       things.   It  is  not possible to get PCRE to obey POSIX semantics, but
8366       then PCRE was never intended to be a POSIX engine. The following  table
8367       lists  the  different  possibilities for matching newline characters in
8368       PCRE:
8369
8370                                 Default   Change with
8371
8372         . matches newline          no     PCRE_DOTALL
8373         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
8375         $ matches \n in middle     no     PCRE_MULTILINE
8376         ^ matches \n in middle     no     PCRE_MULTILINE
8377
8378       This is the equivalent table for POSIX:
8379
8380                                 Default   Change with
8381
8382         . matches newline          yes    REG_NEWLINE
8383         newline matches [^a]       yes    REG_NEWLINE
8384         $ matches \n at end        no     REG_NEWLINE
8385         $ matches \n in middle     no     REG_NEWLINE
8386         ^ matches \n in middle     no     REG_NEWLINE
8387
8388       PCRE's behaviour is the same as Perl's, except that there is no equiva-
8389       lent  for  PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
8390       no way to stop newline from matching [^a].
8391
8392       The  default  POSIX  newline  handling  can  be  obtained  by   setting
8393       PCRE_DOTALL  and  PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
8394       behave exactly as for the REG_NEWLINE action.
8395
8396
8397MATCHING A PATTERN
8398
8399       The function regexec() is called  to  match  a  compiled  pattern  preg
8400       against  a  given string, which is by default terminated by a zero byte
8401       (but see REG_STARTEND below), subject to the options in  eflags.  These
8402       can be:
8403
8404         REG_NOTBOL
8405
8406       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
8407       function.
8408
8409         REG_NOTEMPTY
8410
8411       The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
8412       ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
8413       However, setting this option can give more POSIX-like behaviour in some
8414       situations.
8415
8416         REG_NOTEOL
8417
8418       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
8419       function.
8420
8421         REG_STARTEND
8422
8423       The string is considered to start at string +  pmatch[0].rm_so  and  to
8424       have  a terminating NUL located at string + pmatch[0].rm_eo (there need
8425       not actually be a NUL at that location), regardless  of  the  value  of
8426       nmatch.  This  is a BSD extension, compatible with but not specified by
8427       IEEE Standard 1003.2 (POSIX.2), and should  be  used  with  caution  in
8428       software intended to be portable to other systems. Note that a non-zero
8429       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
8430       of the string, not how it is matched.
8431
8432       If  the pattern was compiled with the REG_NOSUB flag, no data about any
8433       matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
8434       regexec() are ignored.
8435
       If the value of nmatch is zero, or if the value of pmatch is NULL, no
       data about any matched strings is returned.

       Otherwise, the portion of the string that was matched, and also any
       captured substrings, are returned via the pmatch argument, which points
       to an array of nmatch structures of type regmatch_t, containing the
       members rm_so and rm_eo. These contain the offset to the first
       character of each substring and the offset to the first character after
       the end of each substring, respectively. The 0th element of the array
       relates to the entire portion of string that was matched; subsequent
       elements relate to the capturing subpatterns of the regular expression.
       Unused entries in the array have both structure members set to -1.
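
       The use of rm_so and rm_eo can be sketched as follows. The example is
       written against a system <regex.h> so that it stands alone (with PCRE,
       the assumption is that pcreposix.h is included instead); copy_group()
       and the MAX_GROUPS limit are hypothetical choices for this
       illustration:

```c
#include <regex.h>    /* with PCRE: #include <pcreposix.h> instead */
#include <string.h>

#define MAX_GROUPS 10    /* arbitrary limit for this sketch */

/* Copies capture group number `group` (0 = whole match) into out.
   Returns the substring length, or -1 if there is no match, the group
   is unset, the pattern is invalid, or out is too small. */
int copy_group(const char *pattern, const char *subject, size_t group,
               char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    if (group >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, MAX_GROUPS, pm, 0) == 0 &&
        pm[group].rm_so != -1) {
        /* rm_eo is the offset just past the end, so the difference
           is the substring length. */
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (n < outsize) {
            memcpy(out, subject + pm[group].rm_so, n);
            out[n] = '\0';
            len = (int)n;
        }
    }
    regfree(&re);
    return len;
}
```

       With the pattern ([a-z]+):([0-9]+) and the subject "ruby:1234", group
       0 yields "ruby:1234", group 1 yields "ruby", and group 2 yields
       "1234". Asking for a higher group number returns -1, because unused
       entries have rm_so set to -1.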
8448
8449       A successful match yields  a  zero  return;  various  error  codes  are
8450       defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
8451       failure code.
8452
8453
8454ERROR MESSAGES
8455
       The regerror() function maps a non-zero error code from either
       regcomp() or regexec() to a printable message. If preg is not NULL, the
       error should have arisen from the use of that structure. A message
       terminated by a binary zero is placed in errbuf. The length of the
       message, including the zero, is limited to errbuf_size. The yield of
       the function is the size of buffer needed to hold the whole message.
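
       Because the yield is the buffer size that is needed, regerror() can be
       called twice: once with a zero-length buffer to learn the size, and
       again to fetch the message. The following sketch assumes a system
       <regex.h> (or pcreposix.h in its place); error_text() is a
       hypothetical helper:

```c
#include <regex.h>    /* with PCRE: #include <pcreposix.h> instead */
#include <stdlib.h>

/* Returns a malloc()ed copy of the whole message for errcode, or NULL
   if memory runs out. The caller frees the result. */
char *error_text(int errcode, const regex_t *preg)
{
    /* First call: nothing is written, only the needed size is returned. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(needed > 0 ? needed : 1);

    if (msg == NULL) return NULL;
    if (needed > 0)
        regerror(errcode, preg, msg, needed);   /* second call: fetch it */
    else
        msg[0] = '\0';
    return msg;
}
```

       This avoids guessing a fixed buffer size and then silently truncating
       a long message.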
8462
8463
8464MEMORY USAGE
8465
8466       Compiling  a regular expression causes memory to be allocated and asso-
8467       ciated with the preg structure. The function regfree() frees  all  such
8468       memory,  after  which  preg may no longer be used as a compiled expres-
8469       sion.
8470
8471
8472AUTHOR
8473
8474       Philip Hazel
8475       University Computing Service
8476       Cambridge CB2 3QH, England.
8477
8478
8479REVISION
8480
8481       Last updated: 09 January 2012
8482       Copyright (c) 1997-2012 University of Cambridge.
8483------------------------------------------------------------------------------
8484
8485
8486PCRECPP(3)                                                          PCRECPP(3)
8487
8488
8489NAME
8490       PCRE - Perl-compatible regular expressions.
8491
8492
8493SYNOPSIS OF C++ WRAPPER
8494
8495       #include <pcrecpp.h>
8496
8497
8498DESCRIPTION
8499
8500       The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
8501       functionality was added by Giuseppe Maxia. This brief man page was con-
8502       structed  from  the  notes  in the pcrecpp.h file, which should be con-
8503       sulted for further details. Note that the C++ wrapper supports only the
8504       original 8-bit PCRE library. There is no 16-bit support at present.
8505
8506
8507MATCHING INTERFACE
8508
       The "FullMatch" operation checks that supplied text matches a supplied
       pattern exactly. If pointer arguments are supplied, it copies the
       sub-strings matched by sub-patterns into them.
8512
8513         Example: successful match
8514            pcrecpp::RE re("h.*o");
8515            re.FullMatch("hello");
8516
8517         Example: unsuccessful match (requires full match):
8518            pcrecpp::RE re("e");
8519            !re.FullMatch("hello");
8520
8521         Example: creating a temporary RE object:
8522            pcrecpp::RE("h.*o").FullMatch("hello");
8523
8524       You  can pass in a "const char*" or a "string" for "text". The examples
8525       below tend to use a const char*. You can, as in the different  examples
8526       above,  store the RE object explicitly in a variable or use a temporary
8527       RE object. The examples below use one mode or  the  other  arbitrarily.
8528       Either could correctly be used for any of these examples.
8529
8530       You must supply extra pointer arguments to extract matched subpieces.
8531
8532         Example: extracts "ruby" into "s" and 1234 into "i"
8533            int i;
8534            string s;
8535            pcrecpp::RE re("(\\w+):(\\d+)");
8536            re.FullMatch("ruby:1234", &s, &i);
8537
8538         Example: does not try to extract any extra sub-patterns
8539            re.FullMatch("ruby:1234", &s);
8540
8541         Example: does not try to extract into NULL
8542            re.FullMatch("ruby:1234", NULL, &i);
8543
8544         Example: integer overflow causes failure
8545            !re.FullMatch("ruby:1234567891234", NULL, &i);
8546
8547         Example: fails because there aren't enough sub-patterns:
8548            !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
8549
8550         Example: fails because string cannot be stored in integer
8551            !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
8552
8553       The  provided  pointer  arguments can be pointers to any scalar numeric
8554       type, or one of:
8555
8556          string        (matched piece is copied to string)
8557          StringPiece   (StringPiece is mutated to point to matched piece)
8558          T             (where "bool T::ParseFrom(const char*, int)" exists)
8559          NULL          (the corresponding matched sub-pattern is not copied)
8560
8561       The function returns true iff all of the following conditions are  sat-
8562       isfied:
8563
8564         a. "text" matches "pattern" exactly;
8565
8566         b. The number of matched sub-patterns is >= number of supplied
8567            pointers;
8568
8569         c. The "i"th argument has a suitable type for holding the
8570            string captured as the "i"th sub-pattern. If you pass in
8571            void * NULL for the "i"th argument, or a non-void * NULL
8572            of the correct type, or pass fewer arguments than the
8573            number of sub-patterns, the "i"th captured sub-pattern is
8574            ignored.
8575
8576       CAVEAT:  An  optional  sub-pattern  that  does not exist in the matched
8577       string is assigned the empty  string.  Therefore,  the  following  will
8578       return false (because the empty string is not a valid number):
8579
8580          int number;
8581          pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
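
       The failure cases above (integer overflow, non-numeric text, and the
       empty string left by an unmatched optional group) all follow from how
       captured text is converted to a number. The sketch below approximates
       that conversion in standard C++. It is not pcrecpp's own code, and
       the helper name parse_int is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper: convert captured text to an int, returning false in
// the situations where the text above says the match fails.
bool parse_int(const std::string& captured, int* out) {
  assert(out != nullptr);
  if (captured.empty()) return false;  // e.g. an unmatched optional group

  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(captured.c_str(), &end, 10);

  // The whole captured string must be consumed; "ruby" consumes nothing.
  if (end != captured.c_str() + captured.size()) return false;
  // Reject values that do not fit in long, and then those that do not fit in int.
  if (errno == ERANGE) return false;
  if (value < INT_MIN || value > INT_MAX) return false;

  *out = static_cast<int>(value);
  return true;
}
```

       With this helper, "1234" converts to 1234, while "1234567891234",
       "ruby", and the empty string are all rejected.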
8582
8583       The  matching interface supports at most 16 arguments per call.  If you
8584       need   more,   consider    using    the    more    general    interface
8585       pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
8586
8587       NOTE:  Do not use no_arg, which is used internally to mark the end of a
8588       list of optional arguments, as a placeholder for missing arguments,  as
8589       this can lead to segfaults.
8590
8591
8592QUOTING METACHARACTERS
8593
8594       You  can use the "QuoteMeta" operation to insert backslashes before all
8595       potentially meaningful characters in a  string.  The  returned  string,
8596       used as a regular expression, will exactly match the original string.
8597
8598         Example:
8599            string quoted = RE::QuoteMeta(unquoted);
8600
8601       Note  that  it's  legal to escape a character even if it has no special
8602       meaning in a regular expression -- so this function  does  that.  (This
8603       also  makes  it  identical  to  the perl function of the same name; see
8604       "perldoc   -f   quotemeta".)    For   example,    "1.5-2.0?"    becomes
8605       "1\.5\-2\.0\?".
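
       As an illustration of the quoting rule described above, the following
       sketch escapes every ASCII character that is not a letter, a digit, or
       an underscore. The function name quote_meta_sketch is hypothetical,
       and leaving bytes with the high bit set untouched (so that UTF-8
       sequences survive) is an assumption of this sketch, not a statement
       about how QuoteMeta is implemented.

```cpp
#include <string>

// Hypothetical re-implementation of the quoting rule, for illustration only.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    const bool is_high = (c & 0x80) != 0;  // assumed: part of a UTF-8 sequence
    if (!is_word && !is_high) quoted += '\\';
    quoted += static_cast<char>(c);
  }
  return quoted;
}
```

       Applied to "1.5-2.0?", this yields the same result as the example in
       the text, and a string such as "abc_123" is returned unchanged.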
8606
8607
8608PARTIAL MATCHES
8609
8610       You  can  use the "PartialMatch" operation when you want the pattern to
8611       match any substring of the text.
8612
8613         Example: simple search for a string:
8614            pcrecpp::RE("ell").PartialMatch("hello");
8615
8616         Example: find first number in a string:
8617            int number;
8618            pcrecpp::RE re("(\\d+)");
8619            re.PartialMatch("x*100 + 20", &number);
8620            assert(number == 100);
8621
8622
8623UTF-8 AND THE MATCHING INTERFACE
8624
8625       By default, pattern and text are plain text, one  byte  per  character.
8626       The  UTF8  flag,  passed  to  the  constructor, causes both pattern and
8627       string to be treated as UTF-8 text, still a byte stream but potentially
8628       multiple  bytes  per character. In practice, the text is likelier to be
8629       UTF-8 than the pattern, but the match returned may depend on  the  UTF8
8630       flag, so always use it when matching UTF-8 text. For example, "."
8631       matches one byte normally, but with UTF8 set it may match up  to  four
8632       bytes of a multi-byte character.
8633
8634         Example:
8635            pcrecpp::RE_Options options;
8636            options.set_utf8();
8637            pcrecpp::RE re(utf8_pattern, options);
8638            re.FullMatch(utf8_string);
8639
8640         Example: using the convenience function UTF8():
8641            pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
8642            re.FullMatch(utf8_string);
8643
8644       NOTE: The UTF8 flag is ignored if pcre was not configured with the
8645             --enable-utf8 flag.
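
       To see why the UTF8 flag changes what "." consumes, recall that a
       single character occupies from one to four bytes in UTF-8, and that
       the first byte announces the length. The sketch below is standard C++
       and independent of PCRE; the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // continuation or invalid byte
}

// Count characters as a UTF-8-aware matcher would see them.
std::size_t count_utf8_chars(const std::string& text) {
  std::size_t chars = 0;
  for (std::size_t i = 0; i < text.size();) {
    std::size_t len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
    if (len == 0) len = 1;  // this sketch treats a stray byte as one unit
    i += len;
    ++chars;
  }
  return chars;
}
```

       For the six-byte string "h\xC3\xA9llo", a byte-oriented matcher sees
       six units, whereas count_utf8_chars() reports five characters.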
8646
8647
8648PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
8649
8650       PCRE  defines  some  modifiers  to  change  the behavior of the regular
8651       expression  engine.  The  C++  wrapper  defines  an  auxiliary   class,
8652       RE_Options,  as  a  vehicle  to pass such modifiers to a RE class. Cur-
8653       rently, the following modifiers are supported:
8654
8655          modifier              description               Perl equivalent
8656
8657          PCRE_CASELESS         case insensitive match      /i
8658          PCRE_MULTILINE        multiple lines match        /m
8659          PCRE_DOTALL           dot matches newlines        /s
8660          PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
8661          PCRE_EXTRA            strict escape parsing       N/A
8662          PCRE_EXTENDED         ignore white space          /x
8663          PCRE_UTF8             handles UTF8 chars          built-in
8664          PCRE_UNGREEDY         reverses * and *?           N/A
8665          PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
8666
8667       (*) Both Perl and PCRE allow non capturing parentheses by means of  the
8668       "?:"  modifier  within the pattern itself. e.g. (?:ab|cd) does not cap-
8669       ture, while (ab|cd) does.
8670
8671       For a full account of how each modifier works, please  check  the  PCRE
8672       API reference page.
8673
8674       For each modifier, there are two member functions whose names are  made
8675       from  the  modifier  in  lowercase,  without  the "PCRE_" prefix. For
8676       instance, PCRE_CASELESS is handled by
8677
8678         bool caseless()
8679
8680       which returns true if the modifier is set, and
8681
8682         RE_Options & set_caseless(bool)
8683
8684       which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
8685       be accessed through  the  set_match_limit()  and  match_limit()  member
8686       functions. Setting match_limit to a non-zero value limits the execution
8687       of PCRE, to keep it from exhausting the stack or taking an  excessively
8688       long time to return a result. A value of 5000 is good  enough  to  stop
8689       stack blowup in a 2MB thread stack. Setting match_limit to zero disables
8690       match limiting. Alternatively, you can call set_match_limit_recursion(),
8691       which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
8692       recurses. The match limit bounds the number of calls PCRE makes to its
8693       internal match() function; the recursion limit bounds  the  depth  of
8694       internal recursion, and therefore the amount of stack that is used.
8695
8696       Normally,  to  pass  one or more modifiers to a RE class, you declare a
8697       RE_Options object, set the appropriate options, and pass this object to
8698       a RE constructor. Example:
8699
8700          RE_Options opt;
8701          opt.set_caseless(true);
8702          if (RE("HELLO", opt).PartialMatch("hello world")) ...
8703
8704       RE_Options has two constructors. The default constructor  takes  no
8705       arguments and creates a set of flags that are all off. The second takes
8706       an option_flags parameter, which eases the transfer of legacy code from
8707       C programs.  This lets you do
8708
8709          RE(pattern,
8710            RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
8711
8712       However, new code is better off doing
8713
8714          RE(pattern,
8715            RE_Options().set_caseless(true).set_multiline(true))
8716              .PartialMatch(str);
8717
8718       If you are going to pass one of the most used modifiers, there are some
8719       convenience functions that return a RE_Options class with the appropri-
8720       ate modifier already set: CASELESS(),  UTF8(),  MULTILINE(),  DOTALL(),
8721       and EXTENDED().
8722
8723       If you need to set several options at once, and you don't want to  go
8724       to the trouble of declaring a RE_Options object and setting each option
8725       separately, you can do it in a single expression. You can chain several
8726       set_xxxxx() member functions, since each of them returns a reference to
8727       its class object. For example, to
8728       pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with  one
8729       statement, you may write:
8730
8731          RE(" ^ xyz \\s+ .* blah$",
8732            RE_Options()
8733              .set_caseless(true)
8734              .set_extended(true)
8735              .set_multiline(true)).PartialMatch(sometext);
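
       The chaining shown above works because every setter returns a
       reference to the object it was called on. The toy class below shows
       the mechanism in isolation. ToyOptions is a hypothetical stand-in for
       RE_Options, and its flag values are made up for the sketch; they are
       not the real PCRE_* constants.

```cpp
// Toy stand-in for RE_Options, for illustration only.
class ToyOptions {
 public:
  enum Flag { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  ToyOptions() : flags_(0) {}                                   // all flags off
  explicit ToyOptions(int option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  ToyOptions& set_caseless(bool on) { return set(kCaseless, on); }
  ToyOptions& set_multiline(bool on) { return set(kMultiline, on); }
  ToyOptions& set_extended(bool on) { return set(kExtended, on); }

  int all_options() const { return flags_; }

 private:
  ToyOptions& set(int flag, bool on) {
    if (on) {
      flags_ |= flag;
    } else {
      flags_ &= ~flag;
    }
    return *this;  // returning *this is what makes chaining possible
  }

  int flags_;
};
```

       Under these assumptions, ToyOptions().set_caseless(true)
       .set_multiline(true) ends up with the same flags as
       ToyOptions(ToyOptions::kCaseless | ToyOptions::kMultiline), which
       mirrors the two styles shown for RE_Options.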
8736
8737
8738SCANNING TEXT INCREMENTALLY
8739
8740       The  "Consume"  operation may be useful if you want to repeatedly match
8741       regular expressions at the front of a string and skip over them as they
8742       match.  This requires use of the "StringPiece" type, which represents a
8743       sub-range of a real string. Like RE,  StringPiece  is  defined  in  the
8744       pcrecpp namespace.
8745
8746         Example: read lines of the form "var = value" from a string.
8747            string contents = ...;                 // Fill string somehow
8748            pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
8749
8750            string var;
8751            int value;
8752            pcrecpp::RE re("(\\w+) = (\\d+)\n");
8753            while (re.Consume(&input, &var, &value)) {
8754              ...;
8755            }
8756
8757       Each successful call to "Consume" sets "var" and "value", and also
8758       advances "input" so that it points past the matched text.
8759
8760       The "FindAndConsume" operation is similar to  "Consume"  but  does  not
8761       anchor  your  match  at  the  beginning of the string. For example, you
8762       could extract all words from a string by repeatedly calling
8763
8764         pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
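
       The anchor-and-advance behaviour of "Consume" can be imitated with
       the standard library, which may help when reasoning about it. The
       sketch below uses std::regex rather than pcrecpp, so it illustrates
       the idea and is not the wrapper's implementation. The names
       consume_sketch and read_assignments are hypothetical, and a plain
       offset stands in for StringPiece.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Match only at offset *pos and, on success, advance *pos past the match.
bool consume_sketch(const std::regex& re, const std::string& text,
                    std::string::size_type* pos, std::smatch* groups) {
  assert(*pos <= text.size());
  if (!std::regex_search(text.begin() + *pos, text.end(), *groups, re,
                         std::regex_constants::match_continuous)) {
    return false;  // nothing matches at the front; *pos is left alone
  }
  *pos += static_cast<std::string::size_type>(groups->length(0));
  return true;
}

// Read leading lines of the form "var = value" from a string.
std::vector<std::pair<std::string, int> > read_assignments(
    const std::string& contents) {
  const std::regex re("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int> > result;
  std::string::size_type pos = 0;
  std::smatch groups;
  while (consume_sketch(re, contents, &pos, &groups)) {
    // std::stoi throws on values too large for int; acceptable in a sketch.
    result.push_back(
        std::make_pair(groups[1].str(), std::stoi(groups[2].str())));
  }
  return result;
}
```

       The loop stops at the first line that does not have the expected
       form, because match_continuous forbids skipping ahead. Dropping that
       flag gives the unanchored behaviour described for "FindAndConsume".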
8765
8766
8767PARSING HEX/OCTAL/C-RADIX NUMBERS
8768
8769       By default, if you pass a pointer to a numeric value, the corresponding
8770       text  is  interpreted  as  a  base-10  number. You can instead wrap the
8771       pointer with a call to one of the operators Hex(), Octal(), or CRadix()
8772       to  interpret  the text in another base. The CRadix operator interprets
8773       C-style "0" (base-8) and  "0x"  (base-16)  prefixes,  but  defaults  to
8774       base-10.
8775
8776         Example:
8777           int a, b, c, d;
8778           pcrecpp::RE re("(.*) (.*) (.*) (.*)");
8779           re.FullMatch("100 40 0100 0x40",
8780                        pcrecpp::Octal(&a), pcrecpp::Hex(&b),
8781                        pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
8782
8783       will leave 64 in a, b, c, and d.
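
       The three interpretations correspond closely to the "base" argument
       of the standard strtol() function, where base 0 means "infer the
       radix from a C-style prefix". The helpers below are hypothetical and
       only illustrate the arithmetic of the example; they make no claim
       about how the Hex(), Octal(), and CRadix() operators are implemented.

```cpp
#include <cstdlib>

// Hypothetical helpers mirroring how each operator reads its text.
long parse_octal(const char* text) { return std::strtol(text, nullptr, 8); }
long parse_hex(const char* text) { return std::strtol(text, nullptr, 16); }
long parse_cradix(const char* text) { return std::strtol(text, nullptr, 0); }
```

       Here parse_octal("100"), parse_hex("40"), parse_cradix("0100"), and
       parse_cradix("0x40") all return 64, matching the example, and
       parse_cradix("64") falls back to base 10.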
8784
8785
8786REPLACING PARTS OF STRINGS
8787
8788       The "Replace" operation replaces the first match of "pattern" in  "str"
8789       with "rewrite". Within "rewrite", backslash-escaped digits (\1  to  \9)
8790       insert the text matched by the corresponding parenthesized  group,  and
8791       \0 refers to the entire matching text. For example:
8792
8793         string s = "yabba dabba doo";
8794         pcrecpp::RE("b+").Replace("d", &s);
8795
8796       will leave "s" containing "yada dabba doo". The result is true  if  the
8797       pattern matches and a replacement occurs, false otherwise.
8798
8799       GlobalReplace  is  like Replace except that it replaces all occurrences
8800       of the pattern in the string with the  rewrite.  Replacements  are  not
8801       subject to re-matching. For example:
8802
8803         string s = "yabba dabba doo";
8804         pcrecpp::RE("b+").GlobalReplace("d", &s);
8805
8806       will  leave  "s"  containing  "yada dada doo". It returns the number of
8807       replacements made.
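
       For comparison, the same two results can be reproduced with the
       standard library. This sketch uses std::regex, not pcrecpp, and the
       function names are hypothetical. Note one difference that the sketch
       does not hide: std::regex rewrite strings refer to groups as $1
       rather than \1.

```cpp
#include <regex>
#include <string>

// Replace only the first match, as "Replace" does.
std::string replace_first(const std::string& s, const std::string& pattern,
                          const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replace every match, as "GlobalReplace" does.
std::string replace_all(const std::string& s, const std::string& pattern,
                        const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}
```

       With the pattern "b+" and the rewrite "d", replace_first() turns
       "yabba dabba doo" into "yada dabba doo" and replace_all() turns it
       into "yada dada doo". Unlike the pcrecpp operations, these helpers
       return a new string instead of modifying the argument in place.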
8808
8809       Extract is like Replace, except that if the pattern matches,  "rewrite"
8810       is  copied into "out" (an additional argument) with substitutions.  The
8811       non-matching portions of "text" are ignored. Returns true iff  a  match
8812       occurred and the extraction happened successfully;  if no match occurs,
8813       the string is left unaffected.
8814
8815
8816AUTHOR
8817
8818       The C++ wrapper was contributed by Google Inc.
8819       Copyright (c) 2007 Google Inc.
8820
8821
8822REVISION
8823
8824       Last updated: 08 January 2012
8825------------------------------------------------------------------------------
8826
8827
8828PCRESAMPLE(3)                                                    PCRESAMPLE(3)
8829
8830
8831NAME
8832       PCRE - Perl-compatible regular expressions
8833
8834
8835PCRE SAMPLE PROGRAM
8836
8837       A simple, complete demonstration program, to get you started with using
8838       PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
8839       listing  of this program is given in the pcredemo documentation. If you
8840       do not have a copy of the PCRE distribution, you can save this  listing
8841       to re-create pcredemo.c.
8842
8843       The  demonstration program, which uses the original PCRE 8-bit library,
8844       compiles the regular expression that is its first argument, and matches
8845       it  against  the subject string in its second argument. No PCRE options
8846       are set, and default character tables are used. If  matching  succeeds,
8847       the  program  outputs the portion of the subject that matched, together
8848       with the contents of any captured substrings.
8849
8850       If the -g option is given on the command line, the program then goes on
8851       to check for further matches of the same regular expression in the same
8852       subject string. The logic is a little bit tricky because of the  possi-
8853       bility  of  matching an empty string. Comments in the code explain what
8854       is going on.
8855
8856       If PCRE is installed in the standard include  and  library  directories
8857       for your operating system, you should be able to compile the demonstra-
8858       tion program using this command:
8859
8860         gcc -o pcredemo pcredemo.c -lpcre
8861
8862       If PCRE is installed elsewhere, you may need to add additional  options
8863       to  the  command line. For example, on a Unix-like system that has PCRE
8864       installed in /usr/local, you  can  compile  the  demonstration  program
8865       using a command like this:
8866
8867         gcc -o pcredemo -I/usr/local/include pcredemo.c \
8868             -L/usr/local/lib -lpcre
8869
8870       In  a  Windows  environment, if you want to statically link the program
8871       against a non-dll pcre.a file, you must uncomment the line that defines
8872       PCRE_STATIC  before  including  pcre.h, because otherwise the pcre_mal-
8873       loc()   and   pcre_free()   exported   functions   will   be   declared
8874       __declspec(dllimport), with unwanted results.
8875
8876       Once  you  have  compiled and linked the demonstration program, you can
8877       run simple tests like this:
8878
8879         ./pcredemo 'cat|dog' 'the cat sat on the mat'
8880         ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
8881
8882       Note that there is a  much  more  comprehensive  test  program,  called
8883       pcretest,  which  supports  many  more  facilities  for testing regular
8884       expressions and both PCRE libraries. The pcredemo program  is  provided
8885       as a simple coding example.
8886
8887       If  you  try to run pcredemo when PCRE is not installed in the standard
8888       library directory, you may get an error like  this  on  some  operating
8889       systems (e.g. Solaris):
8890
8891         ld.so.1: a.out: fatal: libpcre.so.0: open failed:
8892           No such file or directory
8893
8894       This is caused by the way shared library support works  on  those  sys-
8895       tems. You need to add
8896
8897         -R/usr/local/lib
8898
8899       (for example) to the compile command to get round this problem.
8900
8901
8902AUTHOR
8903
8904       Philip Hazel
8905       University Computing Service
8906       Cambridge CB2 3QH, England.
8907
8908
8909REVISION
8910
8911       Last updated: 10 January 2012
8912       Copyright (c) 1997-2012 University of Cambridge.
8913------------------------------------------------------------------------------
8914PCRELIMITS(3)                                                    PCRELIMITS(3)
8915
8916
8917NAME
8918       PCRE - Perl-compatible regular expressions
8919
8920
8921SIZE AND OTHER LIMITATIONS
8922
8923       There  are some size limitations in PCRE but it is hoped that they will
8924       never in practice be relevant.
8925
8926       The maximum length of a compiled  pattern  is  approximately  64K  data
8927       units  (bytes  for  the  8-bit  library,  16-bit  units  for the 16-bit
8928       library) if PCRE is compiled with the default internal linkage size  of
8929       2  bytes.  If  you  want  to process regular expressions that are truly
8930       enormous, you can compile PCRE with an internal linkage size of 3 or  4
8931       (when  building  the  16-bit  library,  3  is rounded up to 4). See the
8932       README file in the source distribution and the pcrebuild  documentation
8933       for  details.  In  these cases the limit is substantially larger.  How-
8934       ever, the speed of execution is slower.
8935
8936       All values in repeating quantifiers must be less than 65536.
8937
8938       There is no limit to the number of parenthesized subpatterns, but there
8939       can be no more than 65535 capturing subpatterns.
8940
8941       There is a limit to the number of forward references to subsequent sub-
8942       patterns of around 200,000.  Repeated  forward  references  with  fixed
8943       upper  limits,  for example, (?2){0,100} when subpattern number 2 is to
8944       the right, are included in the count. There is no limit to  the  number
8945       of backward references.
8946
8947       The maximum length of a name for a named subpattern is  32  characters,
8948       and the maximum number of named subpatterns is 10000.
8949
8950       The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
8951       (*THEN)  verb  is  255  for  the 8-bit library and 65535 for the 16-bit
8952       library.
8953
8954       The maximum length of a subject string is the largest  positive  number
8955       that  an integer variable can hold. However, when using the traditional
8956       matching function, PCRE uses recursion to handle subpatterns and indef-
8957       inite  repetition.  This means that the available stack space may limit
8958       the size of a subject string that can be processed by certain patterns.
8959       For a discussion of stack issues, see the pcrestack documentation.
8960
8961
8962AUTHOR
8963
8964       Philip Hazel
8965       University Computing Service
8966       Cambridge CB2 3QH, England.
8967
8968
8969REVISION
8970
8971       Last updated: 04 May 2012
8972       Copyright (c) 1997-2012 University of Cambridge.
8973------------------------------------------------------------------------------
8974
8975
8976PCRESTACK(3)                                                      PCRESTACK(3)
8977
8978
8979NAME
8980       PCRE - Perl-compatible regular expressions
8981
8982
8983PCRE DISCUSSION OF STACK USAGE
8984
8985       When  you  call  pcre[16]_exec(),  it makes use of an internal function
8986       called match(). This calls itself recursively at branch points  in  the
8987       pattern,  in  order  to  remember the state of the match so that it can
8988       back up and try a different alternative if  the  first  one  fails.  As
8989       matching proceeds deeper and deeper into the tree of possibilities, the
8990       recursion depth increases. The match() function is also called in other
8991       circumstances,  for  example,  whenever  a parenthesized sub-pattern is
8992       entered, and in certain cases of repetition.
8993
8994       Not all calls of match() increase the recursion depth; for an item such
8995       as  a* it may be called several times at the same level, after matching
8996       different numbers of a's. Furthermore, in a number of cases  where  the
8997       result  of  the  recursive call would immediately be passed back as the
8998       result of the current call (a "tail recursion"), the function  is  just
8999       restarted instead.
9000
9001       The  above  comments  apply  when  pcre[16]_exec() is run in its normal
9002       interpretive  manner.   If   the   pattern   was   studied   with   the
9003       PCRE_STUDY_JIT_COMPILE  option, and just-in-time compiling was success-
9004       ful, and the options passed to pcre[16]_exec() were  not  incompatible,
9005       the  matching process uses the JIT-compiled code instead of the match()
9006       function. In this case, the memory requirements  are  handled  entirely
9007       differently. See the pcrejit documentation for details.
9008
9009       The pcre[16]_dfa_exec() function operates in an entirely different way,
9010       and uses recursion only when there is a regular expression recursion or
9011       subroutine  call in the pattern. This includes the processing of asser-
9012       tion and "once-only" subpatterns, which  are  handled  like  subroutine
9013       calls.  Normally,  these are never very deep, and the limit on the com-
9014       plexity of pcre[16]_dfa_exec() is controlled by the amount of workspace
9015       it  is  given.   However, it is possible to write patterns with runaway
9016       infinite recursions; such patterns will  cause  pcre[16]_dfa_exec()  to
9017       run out of stack. At present, there is no protection against this.
9018
9019       The  comments that follow do NOT apply to pcre[16]_dfa_exec(); they are
9020       relevant only for pcre[16]_exec() without the JIT optimization.
9021
9022   Reducing pcre[16]_exec()'s stack usage
9023
9024       Each time that match() is actually called recursively, it  uses  memory
9025       from  the  process  stack.  For certain kinds of pattern and data, very
9026       large amounts of stack may be needed, despite the recognition of  "tail
9027       recursion".   You  can often reduce the amount of recursion, and there-
9028       fore the amount of stack used, by modifying the pattern that  is  being
9029       matched. Consider, for example, this pattern:
9030
9031         ([^<]|<(?!inet))+
9032
9033       It  matches  from wherever it starts until it encounters "<inet" or the
9034       end of the data, and is the kind of pattern that  might  be  used  when
9035       processing an XML file. Each iteration of the outer parentheses matches
9036       either one character that is not "<" or a "<" that is not  followed  by
9037       "inet".  However,  each  time  a  parenthesis is processed, a recursion
9038       occurs, so this formulation uses a stack frame for each matched charac-
9039       ter.  For  a long string, a lot of stack is required. Consider now this
9040       rewritten pattern, which matches exactly the same strings:
9041
9042         ([^<]++|<(?!inet))+
9043
9044       This uses very much less stack, because runs of characters that do  not
9045       contain  "<" are "swallowed" in one item inside the parentheses. Recur-
9046       sion happens only when a "<" character that is not followed  by  "inet"
9047       is  encountered  (and  we assume this is relatively rare). A possessive
9048       quantifier is used to stop any backtracking into the  runs  of  non-"<"
9049       characters, but that is not related to stack usage.
9050
9051       This  example shows that one way of avoiding stack problems when match-
9052       ing long subject strings is to write repeated parenthesized subpatterns
9053       to match more than one character whenever possible.
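
       The difference between the two formulations can be made concrete by
       counting how many times the outer group iterates, since each
       iteration costs a recursion. The sketch below does that by hand, in
       standard C++ and without PCRE. The function names are hypothetical,
       and the iteration count is only a proxy for stack use; the real frame
       count depends on PCRE's internals.

```cpp
#include <cstddef>
#include <string>

// True if the text at offset i starts with "<inet", where both patterns stop.
static bool at_inet(const std::string& s, std::size_t i) {
  return s.compare(i, 5, "<inet") == 0;
}

// Outer-group iterations for ([^<]|<(?!inet))+ : one per character.
std::size_t iterations_single_char(const std::string& s) {
  std::size_t i = 0, n = 0;
  while (i < s.size() && !at_inet(s, i)) {
    ++i;
    ++n;
  }
  return n;
}

// Outer-group iterations for ([^<]++|<(?!inet))+ : one per run of
// non-"<" characters, plus one per "<" that is not followed by "inet".
std::size_t iterations_with_runs(const std::string& s) {
  std::size_t i = 0, n = 0;
  while (i < s.size() && !at_inet(s, i)) {
    if (s[i] == '<') {
      ++i;  // a lone "<"
    } else {
      while (i < s.size() && s[i] != '<') ++i;  // [^<]++ swallows the run
    }
    ++n;
  }
  return n;
}
```

       For the subject "<a>text</a><b>more</b><inet>x", the first count is
       22 and the second is 8. The gap widens as the runs between "<"
       characters get longer.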
9054
9055   Compiling PCRE to use heap instead of stack for pcre[16]_exec()
9056
9057       In  environments  where  stack memory is constrained, you might want to
9058       compile PCRE to use heap memory instead of stack for remembering  back-
9059       up points when pcre[16]_exec() is running. This makes it run a lot more
9060       slowly, however.  Details of how to do this are given in the  pcrebuild
9061       documentation. When built in this way, instead of using the stack, PCRE
9062       obtains and frees memory by calling the functions that are  pointed  to
9063       by  the  pcre[16]_stack_malloc  and  pcre[16]_stack_free  variables. By
9064       default, these point to malloc() and free(), but you  can  replace  the
9065       pointers to cause PCRE to use your own functions. Since the block sizes
9066       are always the same, and are always freed in reverse order, it  may  be
9067       possible  to  implement  customized memory handlers that are more effi-
9068       cient than the standard functions.
9069
9070   Limiting pcre[16]_exec()'s stack usage
9071
9072       You can set limits on the number of times that match() is called,  both
9073       in  total  and  recursively.  If  a  limit is exceeded, pcre[16]_exec()
9074       returns an error code. Setting suitable limits should prevent  it  from
9075       running  out of stack. The default values of the limits are very large,
9076       and unlikely ever to operate. They can be changed when PCRE  is  built,
9077       and they can also be set when pcre[16]_exec() is called. For details of
9078       these interfaces, see the pcrebuild documentation and  the  section  on
9079       extra data for pcre[16]_exec() in the pcreapi documentation.
9080
9081       As a very rough rule of thumb, you should reckon on about 500 bytes per
9082       recursion. Thus, if you want to limit your  stack  usage  to  8MB,  you
9083       should  set  the  limit at 16000 recursions. A 64MB stack, on the other
9084       hand, can support around 128000 recursions.
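
       The arithmetic behind these figures is a single division, sketched
       below. The function name is hypothetical. The figures in the text
       treat a megabyte as a round million bytes, which also leaves some
       headroom for the rest of the program's stack.

```cpp
#include <cassert>
#include <cstddef>

// Largest recursion limit whose frames fit in `stack_bytes`, assuming
// every recursion costs `frame_bytes` of stack.
unsigned long recursion_limit_for(std::size_t stack_bytes,
                                  std::size_t frame_bytes) {
  assert(frame_bytes > 0);
  return static_cast<unsigned long>(stack_bytes / frame_bytes);
}
```

       With 500-byte frames, recursion_limit_for(8000000, 500) gives 16000
       and recursion_limit_for(64000000, 500) gives 128000. Using binary
       megabytes instead, 8*1024*1024 bytes allows 16777 recursions.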
9085
9086       In Unix-like environments, the pcretest test program has a command line
9087       option (-S) that can be used to increase the size of its stack. As long
9088       as the stack is large enough, another option (-M) can be used  to  find
9089       the  smallest  limits  that allow a particular pattern to match a given
9090       subject string. This is done by calling pcre[16]_exec() repeatedly with
9091       different limits.
9092
9093   Obtaining an estimate of stack usage
9094
9095       The  actual  amount  of  stack used per recursion can vary quite a lot,
9096       depending on the compiler that was used to build PCRE and the optimiza-
9097       tion or debugging options that were set for it. The rule of thumb value
9098       of 500 bytes mentioned above may be larger  or  smaller  than  what  is
9099       actually needed. A better approximation can be obtained by running this
9100       command:
9101
9102         pcretest -m -C
9103
9104       The -C option causes pcretest to output information about  the  options
9105       with which PCRE was compiled. When -m is also given (before -C), infor-
9106       mation about stack use is given in a line like this:
9107
9108         Match recursion uses stack: approximate frame size = 640 bytes
9109
9110       The value is approximate because some recursions need a bit more (up to
9111       perhaps 16 more bytes).
9112
9113       If  the  above  command  is given when PCRE is compiled to use the heap
9114       instead of the stack for recursion, the value that  is  output  is  the
9115       size of each block that is obtained from the heap.
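
       If you want to feed the measured frame size into your own limit
       calculation, the line quoted above is easy to parse. The following
       sketch assumes the output has exactly the wording shown in this
       document; the function name is hypothetical.

```cpp
#include <cstdio>

// Returns the frame size in bytes, or -1 if the line does not have the
// wording shown in the documentation.
long frame_size_from_line(const char* line) {
  long bytes = -1;
  const int matched = std::sscanf(
      line, " Match recursion uses stack: approximate frame size = %ld bytes",
      &bytes);
  return matched == 1 ? bytes : -1;
}
```

       For the sample line above this returns 640. Dividing a stack budget
       by that figure gives a recursion limit that reflects your actual
       build rather than the 500-byte rule of thumb.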
9116
9117   Changing stack size in Unix-like systems
9118
9119       In Unix-like environments, there is rarely a problem  with  the  stack
9120       unless very long strings are involved,  though  the  default  limit  on
9121       stack  size  varies  from system to system. Values from 8MB to 64MB are
9122       common. You can find your default limit by running the command:
9123
9124         ulimit -s
9125
9126       Unfortunately, the effect of running out of  stack  is  often  SIGSEGV,
9127       though  sometimes  a more explicit error message is given. You can nor-
9128       mally increase the limit on stack size by code such as this:
9129
9130         struct rlimit rlim;
9131         getrlimit(RLIMIT_STACK, &rlim);
9132         rlim.rlim_cur = 100*1024*1024;
9133         setrlimit(RLIMIT_STACK, &rlim);
9134
9135       This reads the current limits (soft and hard) using  getrlimit(),  then
9136       attempts  to  increase  the  soft limit to 100MB using setrlimit(). You
9137       must do this before calling pcre[16]_exec().
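
       The fragment above omits the header and any error handling. A
       slightly fuller sketch follows, written as a function so that the
       caller can react when the limit cannot be raised. The function name
       raise_stack_limit is hypothetical; getrlimit(), setrlimit(), and
       RLIMIT_STACK are the standard POSIX interfaces.

```cpp
#include <sys/resource.h>

// Try to make the soft stack limit at least `wanted` bytes.
// Returns true if the soft limit is at least `wanted` afterwards.
bool raise_stack_limit(rlim_t wanted) {
  struct rlimit rlim;
  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return false;

  // Nothing to do if the soft limit is already unlimited or large enough.
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return true;

  // An unprivileged process cannot raise the soft limit above the hard limit.
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max) return false;

  rlim.rlim_cur = wanted;  // the hard limit is left unchanged
  return setrlimit(RLIMIT_STACK, &rlim) == 0;
}
```

       A call such as raise_stack_limit(100UL * 1024 * 1024) requests the
       100MB soft limit used in the fragment above, and returns false
       instead of failing silently when the hard limit is lower than that.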
9138
9139   Changing stack size in Mac OS X
9140
9141       Using setrlimit(), as described above, should also work on Mac OS X. It
9142       is also possible to set a stack size when linking a program. There is a
9143       discussion  about  stack  sizes  in  Mac  OS  X  at  this   web   site:
9144       http://developer.apple.com/qa/qa2005/qa1419.html.
9145
9146
9147AUTHOR
9148
9149       Philip Hazel
9150       University Computing Service
9151       Cambridge CB2 3QH, England.
9152
9153
9154REVISION
9155
9156       Last updated: 21 January 2012
9157       Copyright (c) 1997-2012 University of Cambridge.
9158------------------------------------------------------------------------------
9159
9160
9161