1# -*- mode: rdoc; coding: utf-8; fill-column: 74; -*-
2
3Regular expressions (<i>regexp</i>s) are patterns which describe the
4contents of a string. They're used for testing whether a string contains a
5given pattern, or extracting the portions that match. They are created
6with the <tt>/</tt><i>pat</i><tt>/</tt> and
7<tt>%r{</tt><i>pat</i><tt>}</tt> literals or the <tt>Regexp.new</tt>
8constructor.
9
10A regexp is usually delimited with forward slashes (<tt>/</tt>). For
11example:
12
13    /hay/ =~ 'haystack'   #=> 0
14    /y/.match('haystack') #=> #<MatchData "y">
15
16If a string contains the pattern it is said to <i>match</i>. A literal
17string matches itself.
18
19    # 'haystack' does not contain the pattern 'needle', so doesn't match.
20    /needle/.match('haystack') #=> nil
21    # 'haystack' does contain the pattern 'hay', so it matches
22    /hay/.match('haystack')    #=> #<MatchData "hay">
23
24Specifically, <tt>/st/</tt> requires that the string contains the letter
25_s_ followed by the letter _t_, so it matches _haystack_, also.
26
27== <tt>=~</tt> and Regexp#match
28
Pattern matching may be achieved by using the <tt>=~</tt> operator or the
Regexp#match method.
31
32=== <tt>=~</tt> operator
33
34<tt>=~</tt> is Ruby's basic pattern-matching operator.  When one operand is a
35regular expression and the other is a string then the regular expression is
used as a pattern to match against the string.  (This operator is equivalently
defined by Regexp and String, so the order of String and Regexp does not matter.
Other classes may have different implementations of <tt>=~</tt>.)  If a match
is found, the operator returns the index of the first match in the string;
otherwise it returns +nil+.
41
42    /hay/ =~ 'haystack'   #=> 0
43    'haystack' =~ /hay/   #=> 0
44    /a/   =~ 'haystack'   #=> 1
45    /u/   =~ 'haystack'   #=> nil
46
When the <tt>=~</tt> operator is used with a String and a Regexp, the
<tt>$~</tt> global variable is set after a successful match.  <tt>$~</tt>
holds a MatchData object. Regexp.last_match is equivalent to <tt>$~</tt>.
50
51=== Regexp#match method
52
Regexp#match returns a MatchData object, or +nil+ if there is no match:
54
55    /st/.match('haystack')   #=> #<MatchData "st">
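
As a hedged sketch of how the returned MatchData is typically consumed, the block below checks for +nil+ before reading the match, then inspects the matched text, its offset, and the surrounding text. The local variable names are illustrative only.

```ruby
m = /st/.match('haystack')

if m
  matched = m[0]          # the matched text
  offset  = m.begin(0)    # index at which the match starts
  before  = m.pre_match   # text preceding the match
  after   = m.post_match  # text following the match
end

# Regexp#match returns nil when the pattern is not found.
missing = /needle/.match('haystack')
```

Guarding on the return value avoids calling MatchData methods on +nil+ when the pattern is absent.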
56
57== Metacharacters and Escapes
58
The following are <i>metacharacters</i>: <tt>(</tt>, <tt>)</tt>,
60<tt>[</tt>, <tt>]</tt>, <tt>{</tt>, <tt>}</tt>, <tt>.</tt>, <tt>?</tt>,
61<tt>+</tt>, <tt>*</tt>. They have a specific meaning when appearing in a
62pattern. To match them literally they must be backslash-escaped. To match
63a backslash literally backslash-escape that: <tt>\\\\\\</tt>.
64
65    /1 \+ 2 = 3\?/.match('Does 1 + 2 = 3?') #=> #<MatchData "1 + 2 = 3?">
66
67Patterns behave like double-quoted strings so can contain the same
68backslash escapes.
69
70    /\s\u{6771 4eac 90fd}/.match("Go to 東京都")
71        #=> #<MatchData " 東京都">
72
73Arbitrary Ruby expressions can be embedded into patterns with the
74<tt>#{...}</tt> construct.
75
76    place = "東京都"
77    /#{place}/.match("Go to 東京都")
78        #=> #<MatchData "東京都">
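
Because interpolated text is treated as part of the pattern, any metacharacters it contains keep their special meaning. The sketch below assumes the interpolated string should match literally and uses <tt>Regexp.escape</tt> for that purpose; the variable names are hypothetical.

```ruby
needle = "1+1"

# Interpolated as-is, '+' acts as a quantifier: "one or more 1s, then a 1".
unescaped = /#{needle}/
# Regexp.escape backslash-escapes metacharacters so they match literally.
escaped = /#{Regexp.escape(needle)}/

unescaped_result = unescaped =~ "1+1 = 2"
escaped_result   = escaped   =~ "1+1 = 2"
```

Here only the escaped pattern finds the literal text <tt>1+1</tt>; the unescaped one would instead match a string such as <tt>11</tt>.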
79
80== Character Classes
81
82A <i>character class</i> is delimited with square brackets (<tt>[</tt>,
83<tt>]</tt>) and lists characters that may appear at that point in the
84match. <tt>/[ab]/</tt> means _a_ or _b_, as opposed to <tt>/ab/</tt> which
85means _a_ followed by _b_.
86
87    /W[aeiou]rd/.match("Word") #=> #<MatchData "Word">
88
89Within a character class the hyphen (<tt>-</tt>) is a metacharacter
90denoting an inclusive range of characters. <tt>[abcd]</tt> is equivalent
91to <tt>[a-d]</tt>. A range can be followed by another range, so
92<tt>[abcdwxyz]</tt> is equivalent to <tt>[a-dw-z]</tt>. The order in which
93ranges or individual characters appear inside a character class is
94irrelevant.
95
96    /[0-9a-f]/.match('9f') #=> #<MatchData "9">
97    /[9f]/.match('9f')     #=> #<MatchData "9">
98
99If the first character of a character class is a caret (<tt>^</tt>) the
100class is inverted: it matches any character _except_ those named.
101
102    /[^a-eg-z]/.match('f') #=> #<MatchData "f">
103
104A character class may contain another character class. By itself this
105isn't useful because <tt>[a-z[0-9]]</tt> describes the same set as
106<tt>[a-z0-9]</tt>. However, character classes also support the <tt>&&</tt>
107operator which performs set intersection on its arguments. The two can be
108combined as follows:
109
110    /[a-w&&[^c-g]z]/ # ([a-w] AND ([^c-g] OR z))
111    # This is equivalent to:
112    /[abh-w]/
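
One way to convince yourself of the equivalence claimed above is to test both classes against every lowercase ASCII letter. This is only a sketch; the variable names are made up for illustration.

```ruby
nested = /[a-w&&[^c-g]z]/
flat   = /[abh-w]/

letters = ('a'..'z').to_a

# Collect the letters each character class accepts.
nested_hits = letters.select { |ch| nested =~ ch }
flat_hits   = letters.select { |ch| flat =~ ch }

same_set = (nested_hits == flat_hits)
```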
113
114The following metacharacters also behave like character classes:
115
116* <tt>/./</tt> - Any character except a newline.
117* <tt>/./m</tt> - Any character (the +m+ modifier enables multiline mode)
118* <tt>/\w/</tt> - A word character (<tt>[a-zA-Z0-9_]</tt>)
119* <tt>/\W/</tt> - A non-word character (<tt>[^a-zA-Z0-9_]</tt>)
120* <tt>/\d/</tt> - A digit character (<tt>[0-9]</tt>)
121* <tt>/\D/</tt> - A non-digit character (<tt>[^0-9]</tt>)
122* <tt>/\h/</tt> - A hexdigit character (<tt>[0-9a-fA-F]</tt>)
123* <tt>/\H/</tt> - A non-hexdigit character (<tt>[^0-9a-fA-F]</tt>)
124* <tt>/\s/</tt> - A whitespace character: <tt>/[ \t\r\n\f]/</tt>
125* <tt>/\S/</tt> - A non-whitespace character: <tt>/[^ \t\r\n\f]/</tt>
126
127POSIX <i>bracket expressions</i> are also similar to character classes.
128They provide a portable alternative to the above, with the added benefit
129that they encompass non-ASCII characters. For instance, <tt>/\d/</tt>
130matches only the ASCII decimal digits (0-9); whereas <tt>/[[:digit:]]/</tt>
131matches any character in the Unicode _Nd_ category.
132
133* <tt>/[[:alnum:]]/</tt> - Alphabetic and numeric character
134* <tt>/[[:alpha:]]/</tt> - Alphabetic character
135* <tt>/[[:blank:]]/</tt> - Space or tab
136* <tt>/[[:cntrl:]]/</tt> - Control character
137* <tt>/[[:digit:]]/</tt> - Digit
138* <tt>/[[:graph:]]/</tt> - Non-blank character (excludes spaces, control
139  characters, and similar)
140* <tt>/[[:lower:]]/</tt> - Lowercase alphabetical character
141* <tt>/[[:print:]]/</tt> - Like [:graph:], but includes the space character
142* <tt>/[[:punct:]]/</tt> - Punctuation character
143* <tt>/[[:space:]]/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
144  carriage return, etc.)
145* <tt>/[[:upper:]]/</tt> - Uppercase alphabetical
146* <tt>/[[:xdigit:]]/</tt> - Digit allowed in a hexadecimal number (i.e.,
147  0-9a-fA-F)
148
149Ruby also supports the following non-POSIX character classes:
150
151* <tt>/[[:word:]]/</tt> - A character in one of the following Unicode
152  general categories _Letter_, _Mark_, _Number_,
153  <i>Connector_Punctuation</i>
154* <tt>/[[:ascii:]]/</tt> - A character in the ASCII character set
155
156    # U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
157    /[[:digit:]]/.match("\u06F2")    #=> #<MatchData "\u{06F2}">
158    /[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
159    /[[:xdigit:]][[:xdigit:]]/.match("A6")  #=> #<MatchData "A6">
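
The difference between <tt>/\d/</tt> and <tt>/[[:digit:]]/</tt> described above can be seen with a non-ASCII digit. The following is a minimal sketch that assumes a UTF-8 string; the variable names are illustrative.

```ruby
digit_two = "\u06F2" # EXTENDED ARABIC-INDIC DIGIT TWO

ascii_only = /\d/ =~ digit_two           # \d covers only 0-9
posix      = /[[:digit:]]/ =~ digit_two  # the bracket expression is Unicode-aware
property   = /\p{Nd}/ =~ digit_two       # the general-category equivalent
```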
160
161== Repetition
162
163The constructs described so far match a single character. They can be
164followed by a repetition metacharacter to specify how many times they need
165to occur. Such metacharacters are called <i>quantifiers</i>.
166
167* <tt>*</tt> - Zero or more times
168* <tt>+</tt> - One or more times
169* <tt>?</tt> - Zero or one times (optional)
170* <tt>{</tt><i>n</i><tt>}</tt> - Exactly <i>n</i> times
171* <tt>{</tt><i>n</i><tt>,}</tt> - <i>n</i> or more times
* <tt>{,</tt><i>m</i><tt>}</tt> - <i>m</i> or fewer times
173* <tt>{</tt><i>n</i><tt>,</tt><i>m</i><tt>}</tt> - At least <i>n</i> and
174  at most <i>m</i> times
175
176    # At least one uppercase character ('H'), at least one lowercase
177    # character ('e'), two 'l' characters, then one 'o'
178    "Hello".match(/[[:upper:]]+[[:lower:]]+l{2}o/) #=> #<MatchData "Hello">
179
180Repetition is <i>greedy</i> by default: as many occurrences as possible
181are matched while still allowing the overall match to succeed. By
contrast, <i>lazy</i> matching makes the minimal number of matches
necessary for overall success. A greedy quantifier can be made lazy by
184following it with <tt>?</tt>.
185
186    # Both patterns below match the string. The first uses a greedy
187    # quantifier so '.+' matches '<a><b>'; the second uses a lazy
188    # quantifier so '.+?' matches '<a>'.
189    /<.+>/.match("<a><b>")  #=> #<MatchData "<a><b>">
190    /<.+?>/.match("<a><b>") #=> #<MatchData "<a>">
191
A quantifier followed by <tt>+</tt> matches <i>possessively</i>: once it
has matched it does not backtrack. Possessive quantifiers behave like
greedy quantifiers, but having matched they refuse to "give up" their
match even if this jeopardises the overall match.
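
A small sketch of the difference between greedy and possessive matching follows. The input string and variable names are chosen purely for illustration.

```ruby
text = "aaab"

# Greedy: 'a+' first takes all three 'a's, then gives one back so that
# the trailing 'ab' can match.
greedy = /a+ab/.match(text)

# Possessive: 'a++' takes all the 'a's it can and never gives any back,
# so the trailing 'ab' has nothing left to match.
possessive = /a++ab/.match(text)
```

The greedy pattern matches the whole string, whereas the possessive pattern fails to match at all.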
196
197== Capturing
198
199Parentheses can be used for <i>capturing</i>. The text enclosed by the
200<i>n</i><sup>th</sup> group of parentheses can be subsequently referred to
201with <i>n</i>. Within a pattern use the <i>backreference</i>
202<tt>\n</tt>; outside of the pattern use
203<tt>MatchData[</tt><i>n</i><tt>]</tt>.
204
205    # 'at' is captured by the first group of parentheses, then referred to
206    # later with \1
207    /[csh](..) [csh]\1 in/.match("The cat sat in the hat")
208        #=> #<MatchData "cat sat in" 1:"at">
209    # Regexp#match returns a MatchData object which makes the captured
210    # text available with its #[] method.
211    /[csh](..) [csh]\1 in/.match("The cat sat in the hat")[1] #=> 'at'
212
213Capture groups can be referred to by name when defined with the
214<tt>(?<</tt><i>name</i><tt>>)</tt> or <tt>(?'</tt><i>name</i><tt>')</tt>
215constructs.
216
217    /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")
        #=> #<MatchData "$3.67" dollars:"3" cents:"67">
219    /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")[:dollars] #=> "3"
220
221Named groups can be backreferenced with <tt>\k<</tt><i>name</i><tt>></tt>,
222where _name_ is the group name.
223
224    /(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
225        #=> #<MatchData "ototo" vowel:"o">
226
227*Note*: A regexp can't use named backreferences and numbered
228backreferences simultaneously.
229
230When named capture groups are used with a literal regexp on the left-hand
231side of an expression and the <tt>=~</tt> operator, the captured text is
232also assigned to local variables with corresponding names.
233
234    /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
235    dollars #=> "3"
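
The automatic assignment only happens for a regexp literal on the left-hand side. When the pattern is held in a variable, the named groups can be read from the MatchData object instead. The sketch below assumes a hypothetical <tt>price_pat</tt> variable.

```ruby
price_pat = /\$(?<dollars>\d+)\.(?<cents>\d+)/

# No local variables are created here because the regexp is not a literal
# operand of =~; the captures are read from the MatchData instead.
m = price_pat.match("$3.67")

group_names = m.names
total_cents = m[:dollars].to_i * 100 + m[:cents].to_i
```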
236
237== Grouping
238
239Parentheses also <i>group</i> the terms they enclose, allowing them to be
240quantified as one <i>atomic</i> whole.
241
242    # The pattern below matches a vowel followed by 2 word characters:
243    # 'aen'
244    /[aeiou]\w{2}/.match("Caenorhabditis elegans") #=> #<MatchData "aen">
245    # Whereas the following pattern matches a vowel followed by a word
    # character, twice, i.e. [aeiou]\w[aeiou]\w: 'enor'.
247    /([aeiou]\w){2}/.match("Caenorhabditis elegans")
248        #=> #<MatchData "enor" 1:"or">
249
250The <tt>(?:</tt>...<tt>)</tt> construct provides grouping without
251capturing. That is, it combines the terms it contains into an atomic whole
252without creating a backreference. This benefits performance at the slight
253expense of readability.
254
    # The first group of parentheses captures 'n' and the second captures
    # 'ti'. The second group is referred to later with the backreference \2
257    /I(n)ves(ti)ga\2ons/.match("Investigations")
258        #=> #<MatchData "Investigations" 1:"n" 2:"ti">
259    # The first group of parentheses is now made non-capturing with '?:',
260    # so it still matches 'n', but doesn't create the backreference. Thus,
261    # the backreference \1 now refers to 'ti'.
262    /I(?:n)ves(ti)ga\1ons/.match("Investigations")
263        #=> #<MatchData "Investigations" 1:"ti">
264
265=== Atomic Grouping
266
267Grouping can be made <i>atomic</i> with
268<tt>(?></tt><i>pat</i><tt>)</tt>. This causes the subexpression <i>pat</i>
269to be matched independently of the rest of the expression such that what
270it matches becomes fixed for the remainder of the match, unless the entire
271subexpression must be abandoned and subsequently revisited. In this
272way <i>pat</i> is treated as a non-divisible whole. Atomic grouping is
273typically used to optimise patterns so as to prevent the regular
274expression engine from backtracking needlessly.
275
    # The " in the pattern below matches the first character of the
    # string, then .* matches Quote". This causes the overall match to
    # fail, so the text matched by .* is backtracked by one position,
    # which leaves the final character of the string available to match "
    /".*"/.match('"Quote"')     #=> #<MatchData "\"Quote\"">
    # If .* is grouped atomically, it refuses to backtrack Quote", even
    # though this means that the overall match fails
    /"(?>.*)"/.match('"Quote"') #=> nil
285
286== Subexpression Calls
287
288The <tt>\g<</tt><i>name</i><tt>></tt> syntax matches the previous
289subexpression named _name_, which can be a group name or number, again.
290This differs from backreferences in that it re-executes the group rather
291than simply trying to re-match the same text.
292
    # Matches a ( character and assigns it to the paren group, tries to
    # call the paren sub-expression again but fails, then matches a
    # literal ).
    /\A(?<paren>\(\g<paren>*\))*\z/ =~ '()' #=> 0

299    /\A(?<paren>\(\g<paren>*\))*\z/ =~ '(())' #=> 0
300    # ^1
301    #      ^2
302    #           ^3
303    #                 ^4
304    #      ^5
305    #           ^6
306    #                      ^7
307    #                       ^8
308    #                       ^9
309    #                           ^10
310
3111.  Matches at the beginning of the string, i.e. before the first
312    character.
3132.  Enters a named capture group called <tt>paren</tt>
3143.  Matches a literal <i>(</i>, the first character in the string
3154.  Calls the <tt>paren</tt> group again, i.e. recurses back to the
316    second step
3175.  Re-enters the <tt>paren</tt> group
3186.  Matches a literal <i>(</i>, the second character in the
319    string
7.  Tries to call <tt>paren</tt> a third time, but fails because
    doing so would prevent an overall successful match
8.  Matches a literal <i>)</i>, the third character in the string.
    Marks the end of the second recursive call
9.  Matches a literal <i>)</i>, the fourth character in the string
10. Matches the end of the string
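
The walkthrough above can be exercised with a short sketch that uses the same pattern as a balanced-parentheses check. The <tt>balanced</tt> variable and the sample inputs are illustrative assumptions.

```ruby
balanced = /\A(?<paren>\(\g<paren>*\))*\z/

samples = ['()', '(())', '(()())', '(()', ')(']

# true where the whole string consists of correctly nested parentheses
results = samples.map { |s| !(balanced =~ s).nil? }
```

A backreference could not express this, because each nested call has to match new text rather than repeat text that was already captured.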
326
327== Alternation
328
329The vertical bar metacharacter (<tt>|</tt>) combines two expressions into
330a single one that matches either of the expressions. Each expression is an
331<i>alternative</i>.
332
333    /\w(and|or)\w/.match("Feliformia") #=> #<MatchData "form" 1:"or">
334    /\w(and|or)\w/.match("furandi")    #=> #<MatchData "randi" 1:"and">
335    /\w(and|or)\w/.match("dissemblance") #=> nil
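
Two practical consequences of alternation are sketched below: alternatives are tried from left to right, so the first one that succeeds wins even if a later one would match more text, and <tt>|</tt> binds more loosely than anchors, so grouping is needed to anchor every alternative. The variable names are hypothetical.

```ruby
# The first successful alternative wins, not the longest one.
first_wins = /cat|category/.match("category")[0]
reordered  = /category|cat/.match("category")[0]

# Without grouping this means (\Acat) or (dog\z).
loose  = /\Acat|dog\z/
# Grouping applies both anchors to each alternative.
strict = /\A(?:cat|dog)\z/

loose_hit  = loose  =~ "catalog"
strict_hit = strict =~ "catalog"
```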
336
337== Character Properties
338
339The <tt>\p{}</tt> construct matches characters with the named property,
340much like POSIX bracket classes.
341
342* <tt>/\p{Alnum}/</tt> - Alphabetic and numeric character
343* <tt>/\p{Alpha}/</tt> - Alphabetic character
344* <tt>/\p{Blank}/</tt> - Space or tab
345* <tt>/\p{Cntrl}/</tt> - Control character
346* <tt>/\p{Digit}/</tt> - Digit
347* <tt>/\p{Graph}/</tt> - Non-blank character (excludes spaces, control
348  characters, and similar)
349* <tt>/\p{Lower}/</tt> - Lowercase alphabetical character
350* <tt>/\p{Print}/</tt> - Like <tt>\p{Graph}</tt>, but includes the space character
351* <tt>/\p{Punct}/</tt> - Punctuation character
352* <tt>/\p{Space}/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
353  carriage return, etc.)
354* <tt>/\p{Upper}/</tt> - Uppercase alphabetical
355* <tt>/\p{XDigit}/</tt> - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
* <tt>/\p{Word}/</tt> - A member of one of the following Unicode general
  categories: <i>Letter</i>, <i>Mark</i>, <i>Number</i>,
  <i>Connector\_Punctuation</i>
359* <tt>/\p{ASCII}/</tt> - A character in the ASCII character set
360* <tt>/\p{Any}/</tt> - Any Unicode character (including unassigned
361  characters)
362* <tt>/\p{Assigned}/</tt> - An assigned character
363
364A Unicode character's <i>General Category</i> value can also be matched
365with <tt>\p{</tt><i>Ab</i><tt>}</tt> where <i>Ab</i> is the category's
366abbreviation as described below:
367
368* <tt>/\p{L}/</tt> - 'Letter'
369* <tt>/\p{Ll}/</tt> - 'Letter: Lowercase'
* <tt>/\p{Lm}/</tt> - 'Letter: Modifier'
* <tt>/\p{Lo}/</tt> - 'Letter: Other'
* <tt>/\p{Lt}/</tt> - 'Letter: Titlecase'
* <tt>/\p{Lu}/</tt> - 'Letter: Uppercase'
375* <tt>/\p{M}/</tt> - 'Mark'
376* <tt>/\p{Mn}/</tt> - 'Mark: Nonspacing'
377* <tt>/\p{Mc}/</tt> - 'Mark: Spacing Combining'
378* <tt>/\p{Me}/</tt> - 'Mark: Enclosing'
379* <tt>/\p{N}/</tt> - 'Number'
380* <tt>/\p{Nd}/</tt> - 'Number: Decimal Digit'
381* <tt>/\p{Nl}/</tt> - 'Number: Letter'
382* <tt>/\p{No}/</tt> - 'Number: Other'
383* <tt>/\p{P}/</tt> - 'Punctuation'
384* <tt>/\p{Pc}/</tt> - 'Punctuation: Connector'
385* <tt>/\p{Pd}/</tt> - 'Punctuation: Dash'
386* <tt>/\p{Ps}/</tt> - 'Punctuation: Open'
387* <tt>/\p{Pe}/</tt> - 'Punctuation: Close'
388* <tt>/\p{Pi}/</tt> - 'Punctuation: Initial Quote'
389* <tt>/\p{Pf}/</tt> - 'Punctuation: Final Quote'
390* <tt>/\p{Po}/</tt> - 'Punctuation: Other'
391* <tt>/\p{S}/</tt> - 'Symbol'
392* <tt>/\p{Sm}/</tt> - 'Symbol: Math'
393* <tt>/\p{Sc}/</tt> - 'Symbol: Currency'
395* <tt>/\p{Sk}/</tt> - 'Symbol: Modifier'
396* <tt>/\p{So}/</tt> - 'Symbol: Other'
397* <tt>/\p{Z}/</tt> - 'Separator'
398* <tt>/\p{Zs}/</tt> - 'Separator: Space'
399* <tt>/\p{Zl}/</tt> - 'Separator: Line'
400* <tt>/\p{Zp}/</tt> - 'Separator: Paragraph'
401* <tt>/\p{C}/</tt> - 'Other'
402* <tt>/\p{Cc}/</tt> - 'Other: Control'
403* <tt>/\p{Cf}/</tt> - 'Other: Format'
404* <tt>/\p{Cn}/</tt> - 'Other: Not Assigned'
405* <tt>/\p{Co}/</tt> - 'Other: Private Use'
406* <tt>/\p{Cs}/</tt> - 'Other: Surrogate'
407
408Lastly, <tt>\p{}</tt> matches a character's Unicode <i>script</i>. The
409following scripts are supported: <i>Arabic</i>, <i>Armenian</i>,
410<i>Balinese</i>, <i>Bengali</i>, <i>Bopomofo</i>, <i>Braille</i>,
411<i>Buginese</i>, <i>Buhid</i>, <i>Canadian_Aboriginal</i>, <i>Carian</i>,
412<i>Cham</i>, <i>Cherokee</i>, <i>Common</i>, <i>Coptic</i>,
413<i>Cuneiform</i>, <i>Cypriot</i>, <i>Cyrillic</i>, <i>Deseret</i>,
414<i>Devanagari</i>, <i>Ethiopic</i>, <i>Georgian</i>, <i>Glagolitic</i>,
415<i>Gothic</i>, <i>Greek</i>, <i>Gujarati</i>, <i>Gurmukhi</i>, <i>Han</i>,
416<i>Hangul</i>, <i>Hanunoo</i>, <i>Hebrew</i>, <i>Hiragana</i>,
417<i>Inherited</i>, <i>Kannada</i>, <i>Katakana</i>, <i>Kayah_Li</i>,
418<i>Kharoshthi</i>, <i>Khmer</i>, <i>Lao</i>, <i>Latin</i>, <i>Lepcha</i>,
419<i>Limbu</i>, <i>Linear_B</i>, <i>Lycian</i>, <i>Lydian</i>,
420<i>Malayalam</i>, <i>Mongolian</i>, <i>Myanmar</i>, <i>New_Tai_Lue</i>,
421<i>Nko</i>, <i>Ogham</i>, <i>Ol_Chiki</i>, <i>Old_Italic</i>,
422<i>Old_Persian</i>, <i>Oriya</i>, <i>Osmanya</i>, <i>Phags_Pa</i>,
423<i>Phoenician</i>, <i>Rejang</i>, <i>Runic</i>, <i>Saurashtra</i>,
424<i>Shavian</i>, <i>Sinhala</i>, <i>Sundanese</i>, <i>Syloti_Nagri</i>,
425<i>Syriac</i>, <i>Tagalog</i>, <i>Tagbanwa</i>, <i>Tai_Le</i>,
426<i>Tamil</i>, <i>Telugu</i>, <i>Thaana</i>, <i>Thai</i>, <i>Tibetan</i>,
427<i>Tifinagh</i>, <i>Ugaritic</i>, <i>Vai</i>, and <i>Yi</i>.
428
429    # Unicode codepoint U+06E9 is named "ARABIC PLACE OF SAJDAH" and
430    # belongs to the Arabic script.
431    /\p{Arabic}/.match("\u06E9") #=> #<MatchData "\u06E9">
432
433All character properties can be inverted by prefixing their name with a
434caret (<tt>^</tt>).
435
436    # Letter 'A' is not in the Unicode Ll (Letter; Lowercase) category, so
437    # this match succeeds
438    /\p{^Ll}/.match("A") #=> #<MatchData "A">
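
As an illustrative sketch, script and category properties can be combined with String#scan to split mixed-script text. The sample string below is an assumption made for the example.

```ruby
text = "\u6771\u4EAC\u90FD Tokyo 2020"

han    = text.scan(/\p{Han}+/)    # runs of Han characters
latin  = text.scan(/\p{Latin}+/)  # runs of Latin-script characters
others = text.scan(/\p{^L}+/)     # runs of anything that is not a Letter
```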
439
440== Anchors
441
Anchors are metacharacters that match the zero-width positions between
443characters, <i>anchoring</i> the match to a specific position.
444
445* <tt>^</tt> - Matches beginning of line
446* <tt>$</tt> - Matches end of line
447* <tt>\A</tt> - Matches beginning of string.
448* <tt>\Z</tt> - Matches end of string. If string ends with a newline,
449  it matches just before newline
450* <tt>\z</tt> - Matches end of string
451* <tt>\G</tt> - Matches point where last match finished
452* <tt>\b</tt> - Matches word boundaries when outside brackets;
453  backspace (0x08) when inside brackets
454* <tt>\B</tt> - Matches non-word boundaries
455* <tt>(?=</tt><i>pat</i><tt>)</tt> - <i>Positive lookahead</i> assertion:
456  ensures that the following characters match <i>pat</i>, but doesn't
457  include those characters in the matched text
458* <tt>(?!</tt><i>pat</i><tt>)</tt> - <i>Negative lookahead</i> assertion:
459  ensures that the following characters do not match <i>pat</i>, but
460  doesn't include those characters in the matched text
461* <tt>(?<=</tt><i>pat</i><tt>)</tt> - <i>Positive lookbehind</i>
462  assertion: ensures that the preceding characters match <i>pat</i>, but
463  doesn't include those characters in the matched text
464* <tt>(?<!</tt><i>pat</i><tt>)</tt> - <i>Negative lookbehind</i>
465  assertion: ensures that the preceding characters do not match
466  <i>pat</i>, but doesn't include those characters in the matched text
467
468    # If a pattern isn't anchored it can begin at any point in the string
469    /real/.match("surrealist") #=> #<MatchData "real">
470    # Anchoring the pattern to the beginning of the string forces the
471    # match to start there. 'real' doesn't occur at the beginning of the
472    # string, so now the match fails
473    /\Areal/.match("surrealist") #=> nil
    # The match below fails because although 'Demand' contains 'and', the
    # pattern does not occur at a word boundary.
    /\band/.match("Demand") #=> nil
    # Whereas in the following example 'and' has been anchored to a
    # non-word boundary, so instead of matching the first 'and' it matches
    # from the fourth letter of 'demand'
480    /\Band.+/.match("Supply and demand curve") #=> #<MatchData "and curve">
481    # The pattern below uses positive lookahead and positive lookbehind to
482    # match text appearing in <b></b> tags without including the tags in the
483    # match
484    /(?<=<b>)\w+(?=<\/b>)/.match("Fortune favours the <b>bold</b>")
485        #=> #<MatchData "bold">
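
The distinction between line anchors and string anchors matters for multi-line input. The sketch below uses a made-up two-line string to contrast <tt>^</tt> and <tt>$</tt> with <tt>\A</tt>, <tt>\Z</tt> and <tt>\z</tt>.

```ruby
input = "first\nsecond\n"

line_start   = input =~ /^second/   # ^ matches after any newline
string_start = input =~ /\Asecond/  # \A matches only at index 0

line_end             = input =~ /first$/    # $ matches before any newline
before_final_newline = input =~ /second\Z/  # \Z tolerates one trailing newline
very_end             = input =~ /second\z/  # \z does not
```

When validating a whole string, <tt>\A</tt> and <tt>\z</tt> are usually the safer choice, since <tt>^</tt> and <tt>$</tt> can match at any line within it.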
486
487== Options
488
489The end delimiter for a regexp can be followed by one or more single-letter
490options which control how the pattern can match.
491
492* <tt>/pat/i</tt> - Ignore case
493* <tt>/pat/m</tt> - Treat a newline as a character matched by <tt>.</tt>
494* <tt>/pat/x</tt> - Ignore whitespace and comments in the pattern
495* <tt>/pat/o</tt> - Perform <tt>#{}</tt> interpolation only once
496
497<tt>i</tt>, <tt>m</tt>, and <tt>x</tt> can also be applied on the
subexpression level with the
<tt>(?</tt><i>on</i><tt>-</tt><i>off</i><tt>:</tt><i>pat</i><tt>)</tt>
construct, which enables options <i>on</i>, and disables options
<i>off</i>, for the expression <i>pat</i> enclosed by the parentheses.
502
503    /a(?i:b)c/.match('aBc') #=> #<MatchData "aBc">
504    /a(?i:b)c/.match('abc') #=> #<MatchData "abc">
505
506Options may also be used with <tt>Regexp.new</tt>:
507
508    Regexp.new("abc", Regexp::IGNORECASE)                     #=> /abc/i
509    Regexp.new("abc", Regexp::MULTILINE)                      #=> /abc/m
510    Regexp.new("abc # Comment", Regexp::EXTENDED)             #=> /abc # Comment/x
511    Regexp.new("abc", Regexp::IGNORECASE | Regexp::MULTILINE) #=> /abc/mi
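
The effect of these options can be checked with a brief sketch; the strings and variable names are illustrative assumptions.

```ruby
two_lines = "a\nc"

default_dot   = /a.c/  =~ two_lines  # '.' does not match a newline
multiline_dot = /a.c/m =~ two_lines  # with m, '.' matches a newline too

# Case is ignored only inside the (?i:...) group.
scoped = /a(?i:b)c/
scoped_inner = scoped =~ "aBc"
scoped_outer = scoped =~ "aBC"

# Options passed to Regexp.new behave like the literal modifiers.
built = Regexp.new("abc", Regexp::IGNORECASE | Regexp::MULTILINE)
built_hit = built =~ "ABC"
```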
512
513== Free-Spacing Mode and Comments
514
515As mentioned above, the <tt>x</tt> option enables <i>free-spacing</i>
516mode. Literal white space inside the pattern is ignored, and the
517octothorpe (<tt>#</tt>) character introduces a comment until the end of
518the line. This allows the components of the pattern to be organised in a
519potentially more readable fashion.
520
521    # A contrived pattern to match a number with optional decimal places
522    float_pat = /\A
523        [[:digit:]]+ # 1 or more digits before the decimal point
524        (\.          # Decimal point
525            [[:digit:]]+ # 1 or more digits after the decimal point
526        )? # The decimal point and following digits are optional
527    \Z/x
528    float_pat.match('3.14') #=> #<MatchData "3.14" 1:".14">
529
530*Note*: To match whitespace in an <tt>x</tt> pattern use an escape such as
531<tt>\s</tt> or <tt>\p{Space}</tt>.
532
533Comments can be included in a non-<tt>x</tt> pattern with the
534<tt>(?#</tt><i>comment</i><tt>)</tt> construct, where <i>comment</i> is
535arbitrary text ignored by the regexp engine.
536
537== Encoding
538
539Regular expressions are assumed to use the source encoding. This can be
540overridden with one of the following modifiers.
541
542* <tt>/</tt><i>pat</i><tt>/u</tt> - UTF-8
543* <tt>/</tt><i>pat</i><tt>/e</tt> - EUC-JP
544* <tt>/</tt><i>pat</i><tt>/s</tt> - Windows-31J
545* <tt>/</tt><i>pat</i><tt>/n</tt> - ASCII-8BIT
546
547A regexp can be matched against a string when they either share an
548encoding, or the regexp's encoding is _US-ASCII_ and the string's encoding
549is ASCII-compatible.
550
551If a match between incompatible encodings is attempted an
552<tt>Encoding::CompatibilityError</tt> exception is raised.
553
554The <tt>Regexp#fixed_encoding?</tt> predicate indicates whether the regexp
555has a <i>fixed</i> encoding, that is one incompatible with ASCII. A
556regexp's encoding can be explicitly fixed by supplying
557<tt>Regexp::FIXEDENCODING</tt> as the second argument of
558<tt>Regexp.new</tt>:
559
    r = Regexp.new("a".force_encoding("iso-8859-1"), Regexp::FIXEDENCODING)
    r =~ "a\u3042"
       #=> Encoding::CompatibilityError: incompatible encoding regexp match
       #   (ISO-8859-1 regexp with UTF-8 string)
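
A hedged sketch of querying <tt>Regexp#fixed_encoding?</tt> and handling the exception follows. It builds the ISO-8859-1 pattern with String#encode rather than <tt>force_encoding</tt> so that no string literal is mutated; that substitution, and the variable names, are choices made for this example.

```ruby
plain = /a/   # ASCII-only literal: encoding is not fixed
utf8  = /a/u  # the u modifier fixes the encoding to UTF-8

plain_fixed = plain.fixed_encoding?
utf8_fixed  = utf8.fixed_encoding?

latin1 = Regexp.new("a".encode("ISO-8859-1"), Regexp::FIXEDENCODING)

outcome =
  begin
    latin1 =~ "a\u3042"  # UTF-8 string containing a non-ASCII character
    :matched
  rescue Encoding::CompatibilityError
    :incompatible
  end
```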
564
565== Special global variables
566
Pattern matching sets some global variables:

* <tt>$~</tt> is equivalent to Regexp.last_match;
* <tt>$&</tt> contains the complete matched text;
* <tt>$`</tt> contains the string before the match;
* <tt>$'</tt> contains the string after the match;
* <tt>$1</tt>, <tt>$2</tt> and so on contain the text matched by the first,
  second, etc. capture group;
* <tt>$+</tt> contains the last capture group.
575
576Example:
577
578    m = /s(\w{2}).*(c)/.match('haystack') #=> #<MatchData "stac" 1:"ta" 2:"c">
579    $~                                    #=> #<MatchData "stac" 1:"ta" 2:"c">
    Regexp.last_match                     #=> #<MatchData "stac" 1:"ta" 2:"c">
581
582    $&      #=> "stac"
583            # same as m[0]
584    $`      #=> "hay"
585            # same as m.pre_match
586    $'      #=> "k"
587            # same as m.post_match
588    $1      #=> "ta"
589            # same as m[1]
590    $2      #=> "c"
591            # same as m[2]
592    $3      #=> nil
593            # no third group in pattern
594    $+      #=> "c"
595            # same as m[-1]
596
These global variables are thread-local and method-local.
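
The method-local behaviour can be observed with a short sketch. The helper <tt>first_vowel</tt> is hypothetical and exists only for this example.

```ruby
# A match performed inside a method sets $~ for that method only.
def first_vowel(text)
  text =~ /[aeiou]/
  $~[0]
end

"xyz" =~ /y/                # sets $~ in the calling scope
inner = first_vowel("hat")  # the method sees its own $~
outer = $~[0]               # the caller's $~ is unchanged by the call
```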
598
599== Performance
600
601Certain pathological combinations of constructs can lead to abysmally bad
602performance.
603
604Consider a string of 25 <i>a</i>s, a <i>d</i>, 4 <i>a</i>s, and a
605<i>c</i>.
606
607    s = 'a' * 25 + 'd' + 'a' * 4 + 'c'
608    #=> "aaaaaaaaaaaaaaaaaaaaaaaaadaaaac"
609
610The following patterns match instantly as you would expect:
611
612    /(b|a)/ =~ s #=> 0
613    /(b|a+)/ =~ s #=> 0
    /(b|a+)*/ =~ s #=> 0
615
616However, the following pattern takes appreciably longer:
617
618    /(b|a+)*c/ =~ s #=> 26
619
620This happens because an atom in the regexp is quantified by both an
621immediate <tt>+</tt> and an enclosing <tt>*</tt> with nothing to
622differentiate which is in control of any particular character. The
nondeterminism that results produces super-linear performance. (Consult
<i>Mastering Regular Expressions</i> (3rd ed.), p. 222, by
<i>Jeffrey Friedl</i>, for an in-depth analysis.) This particular case
626can be fixed by use of atomic grouping, which prevents the unnecessary
627backtracking:
628
629    (start = Time.now) && /(b|a+)*c/ =~ s && (Time.now - start)
630       #=> 24.702736882
631    (start = Time.now) && /(?>b|a+)*c/ =~ s && (Time.now - start)
632       #=> 0.000166571
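
The sketch below checks that the atomic rewrite finds the same match as the slow pattern, and shows a possessive quantifier as an alternative way of preventing the backtracking. The slow pattern is deliberately not run, and no timings are claimed.

```ruby
s = 'a' * 25 + 'd' + 'a' * 4 + 'c'

# Atomic group: text consumed by 'a+' is never given back.
atomic = /(?>b|a+)*c/ =~ s

# Possessive quantifier: 'a++' likewise refuses to backtrack.
possessive = /(?:b|a++)*c/ =~ s
```

Both variants report the same index as the original pattern, so for this input the rewrite changes how the match is found rather than what is matched.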
633
634A similar case is typified by the following example, which takes
635approximately 60 seconds to execute for me:
636
    # Match a string of 29 'a's against a pattern of 29 optional 'a's
    # followed by 29 mandatory 'a's.
639    Regexp.new('a?' * 29 + 'a' * 29) =~ 'a' * 29
640
641The 29 optional <i>a</i>s match the string, but this prevents the 29
642mandatory <i>a</i>s that follow from matching. Ruby must then backtrack
643repeatedly so as to satisfy as many of the optional matches as it can
644while still matching the mandatory 29. It is plain to us that none of the
645optional matches can succeed, but this fact unfortunately eludes Ruby.
646
647The best way to improve performance is to significantly reduce the amount of
648backtracking needed.  For this case, instead of individually matching 29
optional <i>a</i>s, a range of optional <i>a</i>s can be matched all at once
with <tt>a{0,29}</tt>:
651
652    Regexp.new('a{0,29}' + 'a' * 29) =~ 'a' * 29
653
654