NameDateSize

..Today10

AUTHORSH A D25-Feb-201028

ChangeLog.oldH A D17-Nov-201728.2 KiB

configure.acH A D19-Jan-201815.8 KiB

doc/H19-Jan-20188

include/H19-Jan-20184

lib/H21-Apr-202031

LICENSEH A D25-Feb-20101.4 KiB

m4/H19-Jan-20189

Makefile.amH A D19-Jan-2018460

NEWSH A D25-Feb-20107.3 KiB

po/H19-Jan-20188

python/H19-Jan-20187

README.mdH A D17-Nov-201710 KiB

src/H25-May-20245

tests/H19-Jan-201813

THANKSH A D25-Feb-2010930

TODOH A D25-Feb-20101.6 KiB

tre.pc.inH A D25-Feb-2010247

tre.spec.inH A D25-Feb-20102.8 KiB

utils/H19-Jan-20189

vcbuild/H19-Jan-20185

win32/H19-Jan-201810

README.md

1Introduction
2============
3
4TRE is a lightweight, robust, and efficient POSIX compliant regexp
5matching library with some exciting features such as approximate
6(fuzzy) matching.
7
8The matching algorithm used in TRE uses linear worst-case time in
9the length of the text being searched, and quadratic worst-case
10time in the length of the used regular expression. 
11
12In other words, the time complexity of the algorithm is O(M^2N), where
13M is the length of the regular expression and N is the length of the
14text. The used space is also quadratic on the length of the regex, but
15does not depend on the searched string. This quadratic behaviour
16occurs only on pathological cases which are probably very rare in
17practice.
18
19
20Hacking
21=======
22
23Here's how to work with this code.
24
25Prerequisites
26-------------
27
28You will need the following tools installed on your system:
29
30  - autoconf
31  - automake
32  - gettext
33  - libtool
34  - zip (optional)
35
36
37Building
38--------
39
40First, prepare the tre.  Change to the root of the source directory
41and run
42```
43./utils/autogen.sh
44```
45This will regenerate various things using the prerequisite tools so
46that you end up with a buildable tree.
47
48After this, you can run the configure script and build TRE as usual:
49```
50./configure
51make
52make check
53make install
54```
55
56
57Building a source code package
58------------------------------
59
60In a prepared tree, this command creates a source code tarball:
61```
62./configure && make dist
63```
64
65Alternatively, you can run
66```
67./utils/build-sources.sh
68```
69which builds the source code packages and puts them in the `dist`
70subdirectory.  This script needs a working `zip` command.
71
72
73Features
74========
75
76TRE is not just yet another regexp matcher. TRE has some features
77which are not there in most free POSIX compatible implementations.
78Most of these features are not present in non-free implementations
79either, for that matter.
80
81Approximate matching
82--------------------
83
84Approximate pattern matching allows matches to be approximate, that
85is, allows the matches to be close to the searched pattern under
86some measure of closeness. TRE uses the edit-distance measure (also
87known as the Levenshtein distance) where characters can be
88inserted, deleted, or substituted in the searched text in order to
89get an exact match. 
90
91Each insertion, deletion, or substitution adds the distance, or cost,
92of the match. TRE can report the matches which have a cost lower than
93some given threshold value. TRE can also be used to search for matches
94with the lowest cost.
95
96TRE includes a version of the agrep (approximate grep) command line
97tool for approximate regexp matching in the style of grep. Unlike
98other agrep implementations (like the one by Sun Wu and Udi Manber
99from University of Arizona) TRE agrep allows full regexps of any
100length, any number of errors, and non-uniform costs for insertion,
101deletion and substitution.
102
103Strict standard conformance
104---------------------------
105
106POSIX defines the behaviour of regexp functions precisely. TRE
107attempts to conform to these specifications as strictly as possible.
108TRE always returns the correct matches for subpatterns, for example.
109Very few other implementations do this correctly.  In fact, the only
110other implementations besides TRE that I am aware of (free or not)
111that get it right are Rx by Tom Lord, Regex++ by John Maddock, and the
112AT&T ast regex by Glenn Fowler and Doug McIlroy.
113
114The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
115or Open Group Base Specifications Issue 6, commonly referred to as
116"POSIX".  It can be found online here. The relevant parts are the
117base specifications on regular expressions (and the rationale) and
118the description of the regcomp() API.
119
120For an excellent survey on POSIX regexp matchers, see the testregex
121pages by Glenn Fowler of AT&T Labs Research.
122
123Predictable matching speed
124--------------------------
125
126Because of the matching algorithm used in TRE, the maximum time
127consumed by any regexec() call is always directly proportional to
128the length of the searched string. There is one exception: if back
129references are used, the matching may take time that grows
130exponentially with the length of the string. This is because
131matching back references is an NP complete problem, and almost
132certainly requires exponential time to match in the worst case.
133
134Predictable and modest memory consumption
135-----------------------------------------
136
137A regexec() call never allocates memory from the heap. TRE
138allocates all the memory it needs during a regcomp() call, and some
139temporary working space from the stack frame for the duration of
140the regexec() call. The amount of temporary space needed is
141constant during matching and does not depend on the searched
142string. For regexps of reasonable size TRE needs less than 50K of
143dynamically allocated memory during the regcomp() call, less than
14420K for the compiled pattern buffer, and less than two kilobytes of
145temporary working space from the stack frame during a regexec()
146call. There is no time/memory tradeoff. TRE is also small in code
147size; statically linking with TRE increases the executable size
148less than 30K (gcc-3.2, x86, GNU/Linux).
149
150Wide character and multibyte character set support
151--------------------------------------------------
152
153TRE supports multibyte character sets. This makes it possible to
154use regexps seamlessly with, for example, Japanese locales. TRE
155also provides a wide character API.
156
157Binary pattern and data support
158-------------------------------
159
160TRE provides APIs which allow binary zero characters both in
161regexps and searched strings. The standard API cannot be easily
162used to, for example, search for printable words from binary data
163(although it is possible with some hacking). Searching for patterns
164which contain binary zeroes embedded is not possible at all with
165the standard API.
166
167Completely thread safe
168----------------------
169
170TRE is completely thread safe. All the exported functions are
171re-entrant, and a single compiled regexp object can be used
172simultaneously in multiple contexts; e.g. in main() and a signal
173handler, or in many threads of a multithreaded application.
174
175Portable
176--------
177
178TRE is portable across multiple platforms. Here's a table of
179platforms and compilers that have been successfully used to compile
180and run TRE:
181
182<table>
183   <tr><th>Platform(s)</th>                    <th>Compiler(s)</th></tr>
184   <tr><td>AIX 4.3.2 - 5.3.0</td>              <td>GCC, C for AIX compiler version 5</td></tr>
185   <tr><td>Compaq Tru64 UNIX V5.1A/B</td>      <td>Compaq C V6.4-014 - V6.5-011</td></tr>
186   <tr><td>Cygwin 1.3 - 1.5</td>               <td>GCC</td></tr>
187   <tr><td>Digital UNIX V4.0</td>              <td>DEC C V5.9-005</td></tr>
188   <tr><td>FreeBSD 4 and above</td>            <td>GCC</td></tr>
189   <tr><td>GNU/Linux systems on x86, x86_64, ppc64, s390</td><td>GCC</td></tr>
190   <tr><td>HP-UX 10.20- 11.00</td>             <td>GCC, HP C Compiler</td></tr>
191   <tr><td>IRIX 6.5</td>                       <td>GCC, MIPSpro Compilers 7.3.1.3m</td></tr>
192   <tr><td>Max OS X</td></tr>
193   <tr><td>NetBSD 1.5 and above</td>           <td>GCC, egcs</td></tr>
194   <tr><td>OpenBSD 3.3 and above</td>          <td>GCC</td></tr>
195   <tr><td>Solaris 2.7-10 sparc/x86</td>       <td>GCC, Sun Workshop 6 compilers</td></tr>
196   <tr><td>Windows 98 - XP</td>                <td>Microsoft Visual C++ 6.0</td></tr>
197</table>
198
199
200TRE 0.7.5 should compile without changes on all of the above
201platforms.  Tell me if you are using TRE on a platform that is not
202listed above, and I'll add it to the list. Also let me know if TRE
203does not work on a listed platform.
204
205Depending on the platform, you may need to install libutf8 to get
206wide character and multibyte character set support.
207
208Free
209----
210
211TRE is released under a license which is essentially the same as
212the "2 clause" BSD-style license used in NetBSD.  See the file
213LICENSE for details.
214
215Roadmap
216-------
217
218There are currently two features, both related to collating
219elements, missing from 100% POSIX compliance. These are:
220
221* Support for collating elements (e.g. [[.<X>.]], where <X> is a
222  collating element). It is not possible to support
223  multi-character collating elements portably, since POSIX does
224  not define a way to determine whether a character sequence is a
225  multi-character collating element or not.
226
227* Support for equivalence classes, for example [[=<X>=]], where
228  <X> is a collating element. An equivalence class matches any
229  character which has the same primary collation weight as
230  <X>. Again, POSIX provides no portable mechanism for
231  determining the primary collation weight of a collating
232  element.
233
234Note that other portable regexp implementations don't support
235collating elements either. The single exception is Regex++, which
236comes with its own database for collating elements for different
237locales. Support for collating elements and equivalence classes has
238not been widely requested and is not very high on the TODO list at
239the moment.
240
241These are other features I'm planning to implement real soon now:
242
243* All the missing GNU extensions enabled in GNU regex, such as
244  [[:<:]] and [[:>:]]
245
246* A REG_SHORTEST regexec() flag for returning the shortest match
247  instead of the longest match.
248
249* Perl-compatible syntax
250  * `[:^class:]`  
251  *  Matches anything but the characters in class. Note that
252  *  [^[:class:]] works already, this would be just a
253  *  convenience shorthand.
254  * 
255  * `\A`  
256  * Match only at beginning of string
257  * 
258  * `\Z`  
259  * Match only at end of string, or before newline at the end
260  * 
261  * `\z`  
262  * Match only at end of string
263  * 
264  * `\l`  
265  * Lowercase next char (think vi)
266  * 
267  * `\u`  
268  * Uppercase next char (think vi)
269  * 
270  * `\L`  
271  * Lowercase till \E (think vi)
272  * 
273  * `\U`  
274  * Uppercase till \E (think vi)
275  * 
276  * `(?=pattern)`  
277  * Zero-width positive look-ahead assertions.
278  * 
279  * `(?!pattern)`  
280  * Zero-width negative look-ahead assertions.
281  * 
282  * `(?<=pattern)`  
283  * Zero-width positive look-behind assertions.
284  * 
285  * `(?<!pattern)`
286  * Zero-width negative look-behind assertions.
287
288Documentation especially for the nonstandard features of TRE, such
289as approximate matching, is a work in progress (with "progress"
290loosely defined...)
291