README file for PCRE (Perl-compatible regular expression library)
-----------------------------------------------------------------

The latest release of PCRE is always available from

  ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.gz

Please read the NEWS file if you are upgrading from a previous release.

PCRE has its own native API, but a set of "wrapper" functions that are based on
the POSIX API are also supplied in the library libpcreposix. Note that this
just provides a POSIX calling interface to PCRE: the regular expressions
themselves still follow Perl syntax and semantics. The header file
for the POSIX-style functions is called pcreposix.h. The official POSIX name is
regex.h, but I didn't want to risk possible problems with existing files of
that name by distributing it that way. To use it with an existing program that
uses the POSIX API, it will have to be renamed or pointed at by a link.

If you are using the POSIX interface to PCRE and there is already a POSIX regex
library installed on your system, you must take care when linking programs to
ensure that they link with PCRE's libpcreposix library. Otherwise they may pick
up the "real" POSIX functions of the same name.


Documentation for PCRE
----------------------

If you install PCRE in the normal way, you will end up with an installed set of
man pages whose names all start with "pcre". The one that is called "pcre"
lists all the others. In addition to these man pages, the PCRE documentation is
supplied in two other forms; however, as there is no standard place to install
them, they are left in the doc directory of the unpacked source distribution.
These forms are:

  1. Files called doc/pcre.txt, doc/pcregrep.txt, and doc/pcretest.txt. The
     first of these is a concatenation of the text forms of all the section 3
     man pages except those that summarize individual functions.
     The other two are the text forms of the section 1 man pages for the
     pcregrep and pcretest commands. Text forms are provided for ease of
     scanning with text editors or similar tools.

  2. A subdirectory called doc/html contains all the documentation in HTML
     form, hyperlinked in various ways, and rooted in a file called
     doc/index.html.


Contributions by users of PCRE
------------------------------

You can find contributions from PCRE users in the directory

  ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib

where there is also a README file giving brief descriptions of what they are.
Several of them provide support for compiling PCRE on various flavours of
Windows systems (I myself do not use Windows). Some are complete in themselves;
others are pointers to URLs containing relevant files.


Building PCRE on a Unix-like system
-----------------------------------

To build PCRE on a Unix-like system, first run the "configure" command from the
PCRE distribution directory, with your current directory set to the directory
where you want the files to be created. This command is a standard GNU
"autoconf" configuration script, for which generic instructions are supplied in
INSTALL.

Most commonly, people build PCRE within its own distribution directory, and in
this case, on many systems, just running "./configure" is sufficient, but the
usual methods of changing standard defaults are available. For example:

CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local

specifies that the C compiler should be run with the flags '-O2 -Wall' instead
of the default, and that "make install" should install PCRE under /opt/local
instead of the default /usr/local.

If you want to build in a different directory, just run "configure" with that
directory as current.
For example, suppose you have unpacked the PCRE source into
/source/pcre/pcre-xxx, but you want to build it in /build/pcre/pcre-xxx:

cd /build/pcre/pcre-xxx
/source/pcre/pcre-xxx/configure

There are some optional features that can be included or omitted from the PCRE
library. You can read more about them in the pcrebuild man page.

. If you want to make use of the support for UTF-8 character strings in PCRE,
  you must add --enable-utf8 to the "configure" command. Without it, the code
  for handling UTF-8 is not included in the library. (Even when included, it
  still has to be enabled by an option at run time.)

. If, in addition to support for UTF-8 character strings, you want to include
  support for the \P, \p, and \X sequences that recognize Unicode character
  properties, you must add --enable-unicode-properties to the "configure"
  command. This adds about 90K to the size of the library (in the form of a
  property table); only the basic two-letter properties such as Lu are
  supported.

. You can build PCRE to recognize CR or NL as the newline character, instead
  of whatever your compiler uses for "\n", by adding --newline-is-cr or
  --newline-is-nl to the "configure" command, respectively. Only do this if you
  really understand what you are doing. On traditional Unix-like systems, the
  newline character is NL.

. When called via the POSIX interface, PCRE uses malloc() to get additional
  storage for processing capturing parentheses if there are more than 10 of
  them. You can increase this threshold by setting, for example,

  --with-posix-malloc-threshold=20

  on the "configure" command.

. PCRE has a counter which can be set to limit the amount of resources it uses.
  If the limit is exceeded during a match, the match fails. The default is ten
  million. You can change the default by setting, for example,

  --with-match-limit=500000

  on the "configure" command.
  This is just the default; individual calls to pcre_exec() can supply
  their own value. There is discussion on the pcreapi man page.

. The default maximum compiled pattern size is around 64K. You can increase
  this by adding --with-link-size=3 to the "configure" command. You can
  increase it even more by setting --with-link-size=4, but this is unlikely
  ever to be necessary. If you build PCRE with an increased link size, test 2
  (and 5 if you are using UTF-8) will fail. Part of the output of these tests
  is a representation of the compiled pattern, and this changes with the link
  size.

. You can build PCRE so that its match() function does not call itself
  recursively. Instead, it uses blocks of data from the heap via special
  functions pcre_stack_malloc() and pcre_stack_free() to save data that would
  otherwise be saved on the stack. To build PCRE like this, use

  --disable-stack-for-recursion

  on the "configure" command. PCRE runs more slowly in this mode, but it may be
  necessary in environments with limited stack sizes.

The "configure" script builds seven files:

. pcre.h is built by copying pcre.in and making substitutions.
. Makefile is built by copying Makefile.in and making substitutions.
. config.h is built by copying config.in and making substitutions.
. pcre-config is built by copying pcre-config.in and making substitutions.
. libpcre.pc is data for the pkg-config command, built from libpcre.pc.in
. libtool is a script that builds shared and/or static libraries
. RunTest is a script for running tests

Once "configure" has run, you can run "make". It builds two libraries called
libpcre and libpcreposix, a test program called pcretest, and the pcregrep
command. You can use "make install" to copy these, the public header files
pcre.h and pcreposix.h, and the man pages to appropriate live directories on
your system, in the normal way.


Retrieving configuration information on Unix-like systems
---------------------------------------------------------

Running "make install" also installs the command pcre-config, which can be used
to recall information about the PCRE configuration and installation. For
example:

  pcre-config --version

prints the version number, and

  pcre-config --libs

outputs information about where the library is installed. This command can be
included in makefiles for programs that use PCRE, saving the programmer from
having to remember too many details.

The pkg-config command is another system for saving and retrieving information
about installed libraries. Instead of separate commands for each library, a
single command is used. For example:

  pkg-config --cflags pcre

The data is held in *.pc files that are installed in a directory called
pkgconfig.


Shared libraries on Unix-like systems
-------------------------------------

The default distribution builds PCRE as two shared libraries and two static
libraries, as long as the operating system supports shared libraries. Shared
library support relies on the "libtool" script which is built as part of the
"configure" process.

The libtool script is used to compile and link both shared and static
libraries. They are placed in a subdirectory called .libs when they are newly
built. The programs pcretest and pcregrep are built to use these uninstalled
libraries (by means of wrapper scripts in the case of shared libraries). When
you use "make install" to install shared libraries, pcregrep and pcretest are
automatically re-built to use the newly installed shared libraries before being
installed themselves. However, the versions left in the source directory still
use the uninstalled libraries.

To build PCRE using static libraries only you must use --disable-shared when
configuring it.
For example:

./configure --prefix=/usr/gnu --disable-shared

Then run "make" in the usual way. Similarly, you can use --disable-static to
build only shared libraries.


Cross-compiling on a Unix-like system
-------------------------------------

You can specify CC and CFLAGS in the normal way to the "configure" command, in
order to cross-compile PCRE for some other host. However, during the building
process, the dftables.c source file is compiled *and run* on the local host, in
order to generate the default character tables (the chartables.c file). It
therefore needs to be compiled with the local compiler, not the cross compiler.
You can do this by specifying CC_FOR_BUILD (and if necessary CFLAGS_FOR_BUILD)
when calling the "configure" command. If they are not specified, they default
to the values of CC and CFLAGS.


Building on non-Unix systems
----------------------------

For a non-Unix system, read the comments in the file NON-UNIX-USE, though if
the system supports the use of "configure" and "make" you may be able to build
PCRE in the same way as for Unix systems.

PCRE has been compiled on Windows systems and on Macintoshes, but I don't know
the details because I don't use those systems. It should be straightforward to
build PCRE on any system that has a Standard C compiler, because it uses only
Standard C functions.


Testing PCRE
------------

To test PCRE on a Unix system, run the RunTest script that is created by the
configuring process. (This can also be run by "make runtest", "make check", or
"make test".) For other systems, see the instructions in NON-UNIX-USE.

The script runs the pcretest test program (which is documented in its own man
page) on each of the testinput files (in the testdata directory) in turn,
and compares the output with the contents of the corresponding testoutput file.
A file called testtry is used to hold the main output from pcretest
(testsavedregex is also used as a working file). To run pcretest on just one of
the test files, give its number as an argument to RunTest, for example:

  RunTest 2

The first file can also be fed directly into the perltest script to check that
Perl gives the same results. The only difference you should see is in the first
few lines, where the Perl version is given instead of the PCRE version.

The second set of tests checks pcre_fullinfo(), pcre_info(), pcre_study(),
pcre_copy_substring(), pcre_get_substring(), pcre_get_substring_list(), error
detection, and run-time flags that are specific to PCRE, as well as the POSIX
wrapper API. It also uses the debugging flag to check some of the internals of
pcre_compile().

If you build PCRE with a locale setting that is not the standard C locale, the
character tables may be different (see next paragraph). In some cases, this may
cause failures in the second set of tests. For example, in a locale where the
isprint() function yields TRUE for characters in the range 128-255, the use of
[:isascii:] inside a character class defines a different set of characters, and
this shows up in this test as a difference in the compiled code, which is being
listed for checking. Where the comparison test output contains [\x00-\x7f] the
test will contain [\x00-\xff], and similarly in some other cases. This is not a
bug in PCRE.

The third set of tests checks pcre_maketables(), the facility for building a
set of character tables for a specific locale and using them instead of the
default tables. The tests make use of the "fr_FR" (French) locale. Before
running the test, the script checks for the presence of this locale by running
the "locale" command.
If that command fails, or if it doesn't include "fr_FR"
in the list of available locales, the third test cannot be run, and a comment
is output to say why. If running this test produces instances of the error

  ** Failed to set locale "fr_FR"

in the comparison output, it means that locale is not available on your system,
despite being listed by "locale". This does not mean that PCRE is broken.

The fourth test checks the UTF-8 support. It is not run automatically unless
PCRE is built with UTF-8 support. To do this you must set --enable-utf8 when
running "configure". This file can also be fed directly to the perltest script,
provided you are running Perl 5.8 or higher. (For Perl 5.6, a small patch,
commented in the script, can be used.)

The fifth test checks error handling with UTF-8 encoding, and internal UTF-8
features of PCRE that are not relevant to Perl.

The sixth and final test checks the support for Unicode character properties.
It is not run automatically unless PCRE is built with Unicode property support.
To do this you must set --enable-unicode-properties when running "configure".


Character tables
----------------

PCRE uses four tables for manipulating and identifying characters whose values
are less than 256. The final argument of the pcre_compile() function is a
pointer to a block of memory containing the concatenated tables. A call to
pcre_maketables() can be used to generate a set of tables in the current
locale. If the final argument for pcre_compile() is passed as NULL, a set of
default tables that is built into the binary is used.

The source file called chartables.c contains the default set of tables. This is
not supplied in the distribution, but is built by the program dftables
(compiled from dftables.c), which uses the ANSI C character handling functions
such as isalnum(), isalpha(), isupper(), islower(), etc.
to build the table sources. This means that the default C locale which is set
for your system will control the contents of these default tables. You can
change the default tables by editing chartables.c and then re-building PCRE.
If you do this, you should probably also edit Makefile to ensure that the file
doesn't ever get re-generated.

The first two 256-byte tables provide lower casing and case flipping functions,
respectively. The next table consists of three 32-byte bit maps which identify
digits, "word" characters, and white space, respectively. These are used when
building 32-byte bit maps that represent character classes.

The final 256-byte table has bits indicating various character types, as
follows:

    1   white space character
    2   letter
    4   decimal digit
    8   hexadecimal digit
   16   alphanumeric or '_'
  128   regular expression metacharacter or binary zero

You should not alter the set of characters that contain the 128 bit, as that
will cause PCRE to malfunction.
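
As an illustration of how a table like the final one can be put together from
the ANSI C character handling functions, here is a minimal sketch. It is not
the code in dftables.c: the names build_ctype_table, CTYPE_* and META_CHARS are
made up for this example, and the set of metacharacters it uses is an
assumption. Only the bit values come from the list above.

```c
#include <ctype.h>
#include <string.h>

/* Bit values as listed in the table above. */
enum {
    CTYPE_SPACE  = 1,    /* white space character */
    CTYPE_LETTER = 2,    /* letter */
    CTYPE_DIGIT  = 4,    /* decimal digit */
    CTYPE_XDIGIT = 8,    /* hexadecimal digit */
    CTYPE_WORD   = 16,   /* alphanumeric or '_' */
    CTYPE_META   = 128   /* regular expression metacharacter or binary zero */
};

/* Assumed metacharacter set, chosen for illustration only. */
static const char META_CHARS[] = "\\*+?{^.$|()[";

/* Fill a 256-entry table using the ctype functions of the current locale. */
void build_ctype_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; c++) {
        unsigned char bits = 0;
        if (isspace(c))  bits |= CTYPE_SPACE;
        if (isalpha(c))  bits |= CTYPE_LETTER;
        if (isdigit(c))  bits |= CTYPE_DIGIT;
        if (isxdigit(c)) bits |= CTYPE_XDIGIT;
        if (isalnum(c) || c == '_') bits |= CTYPE_WORD;
        /* strchr() also matches the terminating zero of its first argument,
           so c == 0 is flagged here without a separate test. */
        if (strchr(META_CHARS, c) != NULL) bits |= CTYPE_META;
        table[c] = bits;
    }
}
```

Because the ctype functions follow the current locale, calling setlocale()
before build_ctype_table() changes which characters above 127 get the letter
and "word" bits. In the standard C locale, the entry for 'a' is 26 (letter,
hexadecimal digit, and alphanumeric).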


Manifest
--------

The distribution should contain the following files:

(A) The actual source files of the PCRE library functions and their
    headers:

  dftables.c            auxiliary program for building chartables.c

  get.c                 )
  maketables.c          )
  study.c               ) source of the functions
  pcre.c                )   in the library
  pcreposix.c           )
  printint.c            )

  ucp.c                 )
  ucp.h                 ) source for the code that is used for
  ucpinternal.h         )   Unicode property handling
  ucptable.c            )
  ucptypetable.c        )

  pcre.in               "source" for the header for the external API; pcre.h
                          is built from this by "configure"
  pcreposix.h           header for the external POSIX wrapper API
  internal.h            header for internal use
  config.in             template for config.h, which is built by configure

(B) Auxiliary files:

  AUTHORS               information about the author of PCRE
  ChangeLog             log of changes to the code
  INSTALL               generic installation instructions
  LICENCE               conditions for the use of PCRE
  COPYING               the same, using GNU's standard name
  Makefile.in           template for Unix Makefile, which is built by configure
  NEWS                  important changes in this release
  NON-UNIX-USE          notes on building PCRE on non-Unix systems
  README                this file
  RunTest.in            template for a Unix shell script for running tests
  config.guess          ) files used by libtool,
  config.sub            )   used only when building a shared library
  configure             a configuring shell script (built by autoconf)
  configure.in          the autoconf input used to build configure
  doc/Tech.Notes        notes on the encoding
  doc/*.3               man page sources for the PCRE functions
  doc/*.1               man page sources for pcregrep and pcretest
  doc/html/*            HTML documentation
  doc/pcre.txt          plain text version of the man pages
  doc/pcretest.txt      plain text documentation of test program
  doc/perltest.txt      plain text documentation of Perl test program
  install-sh            a shell script for installing files
  libpcre.pc.in         "source" for libpcre.pc for
                          pkg-config
  ltmain.sh             file used to build a libtool script
  mkinstalldirs         script for making install directories
  pcretest.c            comprehensive test program
  pcredemo.c            simple demonstration of coding calls to PCRE
  perltest              Perl test program
  pcregrep.c            source of a grep utility that uses PCRE
  pcre-config.in        source of script which retains PCRE information
  testdata/testinput1   test data, compatible with Perl
  testdata/testinput2   test data for error messages and non-Perl things
  testdata/testinput3   test data for locale-specific tests
  testdata/testinput4   test data for UTF-8 tests compatible with Perl
  testdata/testinput5   test data for other UTF-8 tests
  testdata/testinput6   test data for Unicode property support tests
  testdata/testoutput1  test results corresponding to testinput1
  testdata/testoutput2  test results corresponding to testinput2
  testdata/testoutput3  test results corresponding to testinput3
  testdata/testoutput4  test results corresponding to testinput4
  testdata/testoutput5  test results corresponding to testinput5
  testdata/testoutput6  test results corresponding to testinput6

(C) Auxiliary files for Win32 DLL

  dll.mk
  libpcre.def
  libpcreposix.def
  pcre.def

(D) Auxiliary file for VPASCAL

  makevp.bat

Philip Hazel <ph10@cam.ac.uk>
September 2004