# -*- mode: rdoc; coding: utf-8; fill-column: 74; -*-

Regular expressions (<i>regexp</i>s) are patterns which describe the
contents of a string. They're used for testing whether a string contains a
given pattern, or extracting the portions that match. They are created
with the <tt>/</tt><i>pat</i><tt>/</tt> and
<tt>%r{</tt><i>pat</i><tt>}</tt> literals or the <tt>Regexp.new</tt>
constructor.

A regexp is usually delimited with forward slashes (<tt>/</tt>). For
example:

    /hay/ =~ 'haystack'   #=> 0
    /y/.match('haystack') #=> #<MatchData "y">

If a string contains the pattern it is said to <i>match</i>. A literal
string matches itself.

    # 'haystack' does not contain the pattern 'needle', so doesn't match.
    /needle/.match('haystack') #=> nil
    # 'haystack' does contain the pattern 'hay', so it matches
    /hay/.match('haystack')    #=> #<MatchData "hay">

Specifically, <tt>/st/</tt> requires that the string contains the letter
_s_ followed by the letter _t_, so it matches _haystack_, also.

== <tt>=~</tt> and Regexp#match

Pattern matching may be achieved by using the <tt>=~</tt> operator or the
Regexp#match method.

=== <tt>=~</tt> operator

<tt>=~</tt> is Ruby's basic pattern-matching operator. When one operand is a
regular expression and the other is a string, the regular expression is
used as a pattern to match against the string. (This operator is equivalently
defined by Regexp and String, so the order of the String and Regexp does not
matter. Other classes may have different implementations of <tt>=~</tt>.) If
a match is found, the operator returns the index of the first match in the
string; otherwise it returns +nil+.

    /hay/ =~ 'haystack'   #=> 0
    'haystack' =~ /hay/   #=> 0
    /a/   =~ 'haystack'   #=> 1
    /u/   =~ 'haystack'   #=> nil

When the <tt>=~</tt> operator is used with a String and a Regexp, the
<tt>$~</tt> global variable is set after a successful match. <tt>$~</tt>
holds a MatchData object.
Regexp.last_match is equivalent to <tt>$~</tt>.

=== Regexp#match method

The #match method returns a MatchData object:

    /st/.match('haystack')   #=> #<MatchData "st">

== Metacharacters and Escapes

The following are <i>metacharacters</i>: <tt>(</tt>, <tt>)</tt>,
<tt>[</tt>, <tt>]</tt>, <tt>{</tt>, <tt>}</tt>, <tt>.</tt>, <tt>?</tt>,
<tt>+</tt>, <tt>*</tt>. They have a specific meaning when appearing in a
pattern. To match them literally they must be backslash-escaped. To match
a backslash literally backslash-escape that: <tt>\\\\\\</tt>.

    /1 \+ 2 = 3\?/.match('Does 1 + 2 = 3?') #=> #<MatchData "1 + 2 = 3?">

Patterns behave like double-quoted strings so can contain the same
backslash escapes.

    /\s\u{6771 4eac 90fd}/.match("Go to 東京都")
        #=> #<MatchData " 東京都">

Arbitrary Ruby expressions can be embedded into patterns with the
<tt>#{...}</tt> construct.

    place = "東京都"
    /#{place}/.match("Go to 東京都")
        #=> #<MatchData "東京都">

== Character Classes

A <i>character class</i> is delimited with square brackets (<tt>[</tt>,
<tt>]</tt>) and lists characters that may appear at that point in the
match. <tt>/[ab]/</tt> means _a_ or _b_, as opposed to <tt>/ab/</tt> which
means _a_ followed by _b_.

    /W[aeiou]rd/.match("Word") #=> #<MatchData "Word">

Within a character class the hyphen (<tt>-</tt>) is a metacharacter
denoting an inclusive range of characters. <tt>[abcd]</tt> is equivalent
to <tt>[a-d]</tt>. A range can be followed by another range, so
<tt>[abcdwxyz]</tt> is equivalent to <tt>[a-dw-z]</tt>. The order in which
ranges or individual characters appear inside a character class is
irrelevant.

    /[0-9a-f]/.match('9f') #=> #<MatchData "9">
    /[9f]/.match('9f')     #=> #<MatchData "9">

If the first character of a character class is a caret (<tt>^</tt>) the
class is inverted: it matches any character _except_ those named.

    /[^a-eg-z]/.match('f') #=> #<MatchData "f">

A character class may contain another character class. By itself this
isn't useful because <tt>[a-z[0-9]]</tt> describes the same set as
<tt>[a-z0-9]</tt>. However, character classes also support the <tt>&&</tt>
operator which performs set intersection on its arguments. The two can be
combined as follows:

    /[a-w&&[^c-g]z]/ # ([a-w] AND ([^c-g] OR z))
    # This is equivalent to:
    /[abh-w]/

The following metacharacters also behave like character classes:

* <tt>/./</tt> - Any character except a newline.
* <tt>/./m</tt> - Any character (the +m+ modifier enables multiline mode)
* <tt>/\w/</tt> - A word character (<tt>[a-zA-Z0-9_]</tt>)
* <tt>/\W/</tt> - A non-word character (<tt>[^a-zA-Z0-9_]</tt>)
* <tt>/\d/</tt> - A digit character (<tt>[0-9]</tt>)
* <tt>/\D/</tt> - A non-digit character (<tt>[^0-9]</tt>)
* <tt>/\h/</tt> - A hexdigit character (<tt>[0-9a-fA-F]</tt>)
* <tt>/\H/</tt> - A non-hexdigit character (<tt>[^0-9a-fA-F]</tt>)
* <tt>/\s/</tt> - A whitespace character: <tt>/[ \t\r\n\f]/</tt>
* <tt>/\S/</tt> - A non-whitespace character: <tt>/[^ \t\r\n\f]/</tt>

POSIX <i>bracket expressions</i> are also similar to character classes.
They provide a portable alternative to the above, with the added benefit
that they encompass non-ASCII characters. For instance, <tt>/\d/</tt>
matches only the ASCII decimal digits (0-9); whereas <tt>/[[:digit:]]/</tt>
matches any character in the Unicode _Nd_ category.
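
The difference between <tt>/\d/</tt> and <tt>/[[:digit:]]/</tt> described
above can be checked with a short sketch. The variable names are
illustrative only, and the string is assumed to be UTF-8 encoded (the
default for Ruby source files):

```ruby
# U+0663 is ARABIC-INDIC DIGIT THREE: a decimal digit (category Nd)
# that lies outside the ASCII range.
arabic_three = "\u0663"

# \d is limited to the ASCII digits 0-9, so it does not match.
ascii_digit = /\d/.match?(arabic_three)

# [[:digit:]] covers the whole Unicode Nd category, so it does match.
posix_digit = /[[:digit:]]/.match?(arabic_three)

puts "\\d matched:          #{ascii_digit}"
puts "[[:digit:]] matched: #{posix_digit}"
```

Regexp#match? is used here only because it returns a plain boolean;
<tt>=~</tt> or Regexp#match would show the same distinction.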

* <tt>/[[:alnum:]]/</tt> - Alphabetic and numeric character
* <tt>/[[:alpha:]]/</tt> - Alphabetic character
* <tt>/[[:blank:]]/</tt> - Space or tab
* <tt>/[[:cntrl:]]/</tt> - Control character
* <tt>/[[:digit:]]/</tt> - Digit
* <tt>/[[:graph:]]/</tt> - Non-blank character (excludes spaces, control
  characters, and similar)
* <tt>/[[:lower:]]/</tt> - Lowercase alphabetical character
* <tt>/[[:print:]]/</tt> - Like [:graph:], but includes the space character
* <tt>/[[:punct:]]/</tt> - Punctuation character
* <tt>/[[:space:]]/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
  carriage return, etc.)
* <tt>/[[:upper:]]/</tt> - Uppercase alphabetical character
* <tt>/[[:xdigit:]]/</tt> - Digit allowed in a hexadecimal number (i.e.,
  0-9a-fA-F)

Ruby also supports the following non-POSIX character classes:

* <tt>/[[:word:]]/</tt> - A character in one of the following Unicode
  general categories: _Letter_, _Mark_, _Number_,
  <i>Connector_Punctuation</i>
* <tt>/[[:ascii:]]/</tt> - A character in the ASCII character set

    # U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
    /[[:digit:]]/.match("\u06F2")           #=> #<MatchData "\u{06F2}">
    /[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
    /[[:xdigit:]][[:xdigit:]]/.match("A6")  #=> #<MatchData "A6">

== Repetition

The constructs described so far match a single character. They can be
followed by a repetition metacharacter to specify how many times they need
to occur. Such metacharacters are called <i>quantifiers</i>.

* <tt>*</tt> - Zero or more times
* <tt>+</tt> - One or more times
* <tt>?</tt> - Zero or one times (optional)
* <tt>{</tt><i>n</i><tt>}</tt> - Exactly <i>n</i> times
* <tt>{</tt><i>n</i><tt>,}</tt> - <i>n</i> or more times
* <tt>{,</tt><i>m</i><tt>}</tt> - <i>m</i> or fewer times
* <tt>{</tt><i>n</i><tt>,</tt><i>m</i><tt>}</tt> - At least <i>n</i> and
  at most <i>m</i> times

    # At least one uppercase character ('H'), at least one lowercase
    # character ('e'), two 'l' characters, then one 'o'
    "Hello".match(/[[:upper:]]+[[:lower:]]+l{2}o/) #=> #<MatchData "Hello">

Repetition is <i>greedy</i> by default: as many occurrences as possible
are matched while still allowing the overall match to succeed. By
contrast, <i>lazy</i> matching makes the minimal number of matches
necessary for overall success. A greedy metacharacter can be made lazy by
following it with <tt>?</tt>.

    # Both patterns below match the string. The first uses a greedy
    # quantifier so the whole of '<a><b>' is matched; the second uses a
    # lazy quantifier so only '<a>' is matched.
    /<.+>/.match("<a><b>")  #=> #<MatchData "<a><b>">
    /<.+?>/.match("<a><b>") #=> #<MatchData "<a>">

A quantifier followed by <tt>+</tt> matches <i>possessively</i>: once it
has matched it does not backtrack. Possessive quantifiers behave like
greedy quantifiers, but having matched they refuse to "give up" their
match even if this jeopardises the overall match.

== Capturing

Parentheses can be used for <i>capturing</i>. The text enclosed by the
<i>n</i><sup>th</sup> group of parentheses can be subsequently referred to
with <i>n</i>. Within a pattern use the <i>backreference</i>
<tt>\n</tt>; outside of the pattern use
<tt>MatchData[</tt><i>n</i><tt>]</tt>.

    # 'at' is captured by the first group of parentheses, then referred to
    # later with \1
    /[csh](..) [csh]\1 in/.match("The cat sat in the hat")
        #=> #<MatchData "cat sat in" 1:"at">
    # Regexp#match returns a MatchData object which makes the captured
    # text available with its #[] method.
    /[csh](..) [csh]\1 in/.match("The cat sat in the hat")[1] #=> 'at'

Capture groups can be referred to by name when defined with the
<tt>(?<</tt><i>name</i><tt>>)</tt> or <tt>(?'</tt><i>name</i><tt>')</tt>
constructs.

    /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")
        #=> #<MatchData "$3.67" dollars:"3" cents:"67">
    /\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")[:dollars] #=> "3"

Named groups can be backreferenced with <tt>\k<</tt><i>name</i><tt>></tt>,
where _name_ is the group name.

    /(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
        #=> #<MatchData "ototo" vowel:"o">

*Note*: A regexp can't use named backreferences and numbered
backreferences simultaneously.

When named capture groups are used with a literal regexp on the left-hand
side of an expression and the <tt>=~</tt> operator, the captured text is
also assigned to local variables with corresponding names.

    /\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
    dollars #=> "3"

== Grouping

Parentheses also <i>group</i> the terms they enclose, allowing them to be
quantified as one <i>atomic</i> whole.

    # The pattern below matches a vowel followed by 2 word characters:
    # 'aen'
    /[aeiou]\w{2}/.match("Caenorhabditis elegans") #=> #<MatchData "aen">
    # Whereas the following pattern matches a vowel followed by a word
    # character, twice, i.e. [aeiou]\w[aeiou]\w: 'enor'.
    /([aeiou]\w){2}/.match("Caenorhabditis elegans")
        #=> #<MatchData "enor" 1:"or">

The <tt>(?:</tt>...<tt>)</tt> construct provides grouping without
capturing. That is, it combines the terms it contains into an atomic whole
without creating a backreference.
This benefits performance at the slight expense of
readability.

    # The first group of parentheses captures 'n' and the second 'ti'. The
    # second group is referred to later with the backreference \2
    /I(n)ves(ti)ga\2ons/.match("Investigations")
        #=> #<MatchData "Investigations" 1:"n" 2:"ti">
    # The first group of parentheses is now made non-capturing with '?:',
    # so it still matches 'n', but doesn't create the backreference. Thus,
    # the backreference \1 now refers to 'ti'.
    /I(?:n)ves(ti)ga\1ons/.match("Investigations")
        #=> #<MatchData "Investigations" 1:"ti">

=== Atomic Grouping

Grouping can be made <i>atomic</i> with
<tt>(?></tt><i>pat</i><tt>)</tt>. This causes the subexpression <i>pat</i>
to be matched independently of the rest of the expression such that what
it matches becomes fixed for the remainder of the match, unless the entire
subexpression must be abandoned and subsequently revisited. In this
way <i>pat</i> is treated as a non-divisible whole. Atomic grouping is
typically used to optimise patterns so as to prevent the regular
expression engine from backtracking needlessly.

    # The " in the pattern below matches the first character of the
    # string, then .* matches Quote". This causes the overall match to
    # fail, so the text matched by .* is backtracked by one position,
    # which leaves the final character of the string available to match "
    /".*"/.match('"Quote"')     #=> #<MatchData "\"Quote\"">
    # If .* is grouped atomically, it refuses to backtrack Quote", even
    # though this means that the overall match fails
    /"(?>.*)"/.match('"Quote"') #=> nil

== Subexpression Calls

The <tt>\g<</tt><i>name</i><tt>></tt> syntax matches the previous
subexpression named _name_, which can be a group name or number, again.
This differs from backreferences in that it re-executes the group rather
than simply trying to re-match the same text.

    # Matches a ( character and assigns it to the paren group, tries to
    # call the paren sub-expression again but fails, then matches a
    # literal ).
    /\A(?<paren>\(\g<paren>*\))*\z/ =~ '()'   #=> 0

    /\A(?<paren>\(\g<paren>*\))*\z/ =~ '(())' #=> 0
    # ^1
    #      ^2
    #           ^3
    #                 ^4
    #      ^5
    #           ^6
    #                      ^7
    #                       ^8
    #                       ^9
    #                           ^10

1. Matches at the beginning of the string, i.e. before the first
   character.
2. Enters a named capture group called <tt>paren</tt>.
3. Matches a literal <i>(</i>, the first character in the string.
4. Calls the <tt>paren</tt> group again, i.e. recurses back to the
   second step.
5. Re-enters the <tt>paren</tt> group.
6. Matches a literal <i>(</i>, the second character in the
   string.
7. Tries to call <tt>paren</tt> a third time, but fails because
   doing so would prevent an overall successful match.
8. Matches a literal <i>)</i>, the third character in the string.
   Marks the end of the second recursive call.
9. Matches a literal <i>)</i>, the fourth character in the string.
10. Matches the end of the string.

== Alternation

The vertical bar metacharacter (<tt>|</tt>) combines two expressions into
a single one that matches either of the expressions. Each expression is an
<i>alternative</i>.

    /\w(and|or)\w/.match("Feliformia")   #=> #<MatchData "form" 1:"or">
    /\w(and|or)\w/.match("furandi")      #=> #<MatchData "randi" 1:"and">
    /\w(and|or)\w/.match("dissemblance") #=> nil

== Character Properties

The <tt>\p{}</tt> construct matches characters with the named property,
much like POSIX bracket classes.

* <tt>/\p{Alnum}/</tt> - Alphabetic and numeric character
* <tt>/\p{Alpha}/</tt> - Alphabetic character
* <tt>/\p{Blank}/</tt> - Space or tab
* <tt>/\p{Cntrl}/</tt> - Control character
* <tt>/\p{Digit}/</tt> - Digit
* <tt>/\p{Graph}/</tt> - Non-blank character (excludes spaces, control
  characters, and similar)
* <tt>/\p{Lower}/</tt> - Lowercase alphabetical character
* <tt>/\p{Print}/</tt> - Like <tt>\p{Graph}</tt>, but includes the space
  character
* <tt>/\p{Punct}/</tt> - Punctuation character
* <tt>/\p{Space}/</tt> - Whitespace character (<tt>[:blank:]</tt>, newline,
  carriage return, etc.)
* <tt>/\p{Upper}/</tt> - Uppercase alphabetical character
* <tt>/\p{XDigit}/</tt> - Digit allowed in a hexadecimal number (i.e.,
  0-9a-fA-F)
* <tt>/\p{Word}/</tt> - A member of one of the following Unicode general
  categories: <i>Letter</i>, <i>Mark</i>, <i>Number</i>,
  <i>Connector_Punctuation</i>
* <tt>/\p{ASCII}/</tt> - A character in the ASCII character set
* <tt>/\p{Any}/</tt> - Any Unicode character (including unassigned
  characters)
* <tt>/\p{Assigned}/</tt> - An assigned character

A Unicode character's <i>General Category</i> value can also be matched
with <tt>\p{</tt><i>Ab</i><tt>}</tt> where <i>Ab</i> is the category's
abbreviation as described below:

* <tt>/\p{L}/</tt> - 'Letter'
* <tt>/\p{Ll}/</tt> - 'Letter: Lowercase'
* <tt>/\p{Lm}/</tt> - 'Letter: Modifier'
* <tt>/\p{Lo}/</tt> - 'Letter: Other'
* <tt>/\p{Lt}/</tt> - 'Letter: Titlecase'
* <tt>/\p{Lu}/</tt> - 'Letter: Uppercase'
* <tt>/\p{M}/</tt> - 'Mark'
* <tt>/\p{Mn}/</tt> - 'Mark: Nonspacing'
* <tt>/\p{Mc}/</tt> - 'Mark: Spacing Combining'
* <tt>/\p{Me}/</tt> - 'Mark: Enclosing'
* <tt>/\p{N}/</tt> - 'Number'
* <tt>/\p{Nd}/</tt> - 'Number: Decimal Digit'
* <tt>/\p{Nl}/</tt> - 'Number: Letter'
* <tt>/\p{No}/</tt> - 'Number: Other'
* <tt>/\p{P}/</tt> - 'Punctuation'
* <tt>/\p{Pc}/</tt> - 'Punctuation: Connector'
* <tt>/\p{Pd}/</tt> - 'Punctuation: Dash'
* <tt>/\p{Ps}/</tt> - 'Punctuation: Open'
* <tt>/\p{Pe}/</tt> - 'Punctuation: Close'
* <tt>/\p{Pi}/</tt> - 'Punctuation: Initial Quote'
* <tt>/\p{Pf}/</tt> - 'Punctuation: Final Quote'
* <tt>/\p{Po}/</tt> - 'Punctuation: Other'
* <tt>/\p{S}/</tt> - 'Symbol'
* <tt>/\p{Sm}/</tt> - 'Symbol: Math'
* <tt>/\p{Sc}/</tt> - 'Symbol: Currency'
* <tt>/\p{Sk}/</tt> - 'Symbol: Modifier'
* <tt>/\p{So}/</tt> - 'Symbol: Other'
* <tt>/\p{Z}/</tt> - 'Separator'
* <tt>/\p{Zs}/</tt> - 'Separator: Space'
* <tt>/\p{Zl}/</tt> - 'Separator: Line'
* <tt>/\p{Zp}/</tt> - 'Separator: Paragraph'
* <tt>/\p{C}/</tt> - 'Other'
* <tt>/\p{Cc}/</tt> - 'Other: Control'
* <tt>/\p{Cf}/</tt> - 'Other: Format'
* <tt>/\p{Cn}/</tt> - 'Other: Not Assigned'
* <tt>/\p{Co}/</tt> - 'Other: Private Use'
* <tt>/\p{Cs}/</tt> - 'Other: Surrogate'

Lastly, <tt>\p{}</tt> matches a character's Unicode <i>script</i>.
The following scripts are supported: <i>Arabic</i>, <i>Armenian</i>,
<i>Balinese</i>, <i>Bengali</i>, <i>Bopomofo</i>, <i>Braille</i>,
<i>Buginese</i>, <i>Buhid</i>, <i>Canadian_Aboriginal</i>, <i>Carian</i>,
<i>Cham</i>, <i>Cherokee</i>, <i>Common</i>, <i>Coptic</i>,
<i>Cuneiform</i>, <i>Cypriot</i>, <i>Cyrillic</i>, <i>Deseret</i>,
<i>Devanagari</i>, <i>Ethiopic</i>, <i>Georgian</i>, <i>Glagolitic</i>,
<i>Gothic</i>, <i>Greek</i>, <i>Gujarati</i>, <i>Gurmukhi</i>, <i>Han</i>,
<i>Hangul</i>, <i>Hanunoo</i>, <i>Hebrew</i>, <i>Hiragana</i>,
<i>Inherited</i>, <i>Kannada</i>, <i>Katakana</i>, <i>Kayah_Li</i>,
<i>Kharoshthi</i>, <i>Khmer</i>, <i>Lao</i>, <i>Latin</i>, <i>Lepcha</i>,
<i>Limbu</i>, <i>Linear_B</i>, <i>Lycian</i>, <i>Lydian</i>,
<i>Malayalam</i>, <i>Mongolian</i>, <i>Myanmar</i>, <i>New_Tai_Lue</i>,
<i>Nko</i>, <i>Ogham</i>, <i>Ol_Chiki</i>, <i>Old_Italic</i>,
<i>Old_Persian</i>, <i>Oriya</i>, <i>Osmanya</i>, <i>Phags_Pa</i>,
<i>Phoenician</i>, <i>Rejang</i>, <i>Runic</i>, <i>Saurashtra</i>,
<i>Shavian</i>, <i>Sinhala</i>, <i>Sundanese</i>, <i>Syloti_Nagri</i>,
<i>Syriac</i>, <i>Tagalog</i>, <i>Tagbanwa</i>, <i>Tai_Le</i>,
<i>Tamil</i>, <i>Telugu</i>, <i>Thaana</i>, <i>Thai</i>, <i>Tibetan</i>,
<i>Tifinagh</i>, <i>Ugaritic</i>, <i>Vai</i>, and <i>Yi</i>.

    # Unicode codepoint U+06E9 is named "ARABIC PLACE OF SAJDAH" and
    # belongs to the Arabic script.
    /\p{Arabic}/.match("\u06E9") #=> #<MatchData "\u06E9">

All character properties can be inverted by prefixing their name with a
caret (<tt>^</tt>).

    # Letter 'A' is not in the Unicode Ll (Letter; Lowercase) category, so
    # this match succeeds
    /\p{^Ll}/.match("A") #=> #<MatchData "A">

== Anchors

Anchors are metacharacters that match the zero-width positions between
characters, <i>anchoring</i> the match to a specific position.

* <tt>^</tt> - Matches beginning of line
* <tt>$</tt> - Matches end of line
* <tt>\A</tt> - Matches beginning of string.
* <tt>\Z</tt> - Matches end of string. If string ends with a newline,
  it matches just before newline
* <tt>\z</tt> - Matches end of string
* <tt>\G</tt> - Matches point where last match finished
* <tt>\b</tt> - Matches word boundaries when outside brackets;
  backspace (0x08) when inside brackets
* <tt>\B</tt> - Matches non-word boundaries
* <tt>(?=</tt><i>pat</i><tt>)</tt> - <i>Positive lookahead</i> assertion:
  ensures that the following characters match <i>pat</i>, but doesn't
  include those characters in the matched text
* <tt>(?!</tt><i>pat</i><tt>)</tt> - <i>Negative lookahead</i> assertion:
  ensures that the following characters do not match <i>pat</i>, but
  doesn't include those characters in the matched text
* <tt>(?<=</tt><i>pat</i><tt>)</tt> - <i>Positive lookbehind</i>
  assertion: ensures that the preceding characters match <i>pat</i>, but
  doesn't include those characters in the matched text
* <tt>(?<!</tt><i>pat</i><tt>)</tt> - <i>Negative lookbehind</i>
  assertion: ensures that the preceding characters do not match
  <i>pat</i>, but doesn't include those characters in the matched text

    # If a pattern isn't anchored it can begin at any point in the string
    /real/.match("surrealist") #=> #<MatchData "real">
    # Anchoring the pattern to the beginning of the string forces the
    # match to start there. 'real' doesn't occur at the beginning of the
    # string, so now the match fails
    /\Areal/.match("surrealist") #=> nil
    # The match below fails because although 'Demand' contains 'and', the
    # pattern does not occur at a word boundary.
    /\band/.match("Demand") #=> nil
    # Whereas in the following example 'and' has been anchored to a
    # non-word boundary so instead of matching the first 'and' it matches
    # from the fourth letter of 'demand' instead
    /\Band.+/.match("Supply and demand curve") #=> #<MatchData "and curve">
    # The pattern below uses positive lookahead and positive lookbehind to
    # match text appearing in <b></b> tags without including the tags in the
    # match
    /(?<=<b>)\w+(?=<\/b>)/.match("Fortune favours the <b>bold</b>")
        #=> #<MatchData "bold">

== Options

The end delimiter for a regexp can be followed by one or more single-letter
options which control how the pattern can match.

* <tt>/pat/i</tt> - Ignore case
* <tt>/pat/m</tt> - Treat a newline as a character matched by <tt>.</tt>
* <tt>/pat/x</tt> - Ignore whitespace and comments in the pattern
* <tt>/pat/o</tt> - Perform <tt>#{}</tt> interpolation only once

<tt>i</tt>, <tt>m</tt>, and <tt>x</tt> can also be applied on the
subexpression level with the
<tt>(?</tt><i>on</i><tt>-</tt><i>off</i><tt>)</tt> construct, which
enables options <i>on</i>, and disables options <i>off</i> for the
expression enclosed by the parentheses.

    /a(?i:b)c/.match('aBc') #=> #<MatchData "aBc">
    /a(?i:b)c/.match('abc') #=> #<MatchData "abc">

Options may also be used with <tt>Regexp.new</tt>:

    Regexp.new("abc", Regexp::IGNORECASE)                     #=> /abc/i
    Regexp.new("abc", Regexp::MULTILINE)                      #=> /abc/m
    Regexp.new("abc # Comment", Regexp::EXTENDED)             #=> /abc # Comment/x
    Regexp.new("abc", Regexp::IGNORECASE | Regexp::MULTILINE) #=> /abc/mi

== Free-Spacing Mode and Comments

As mentioned above, the <tt>x</tt> option enables <i>free-spacing</i>
mode. Literal white space inside the pattern is ignored, and the
octothorpe (<tt>#</tt>) character introduces a comment until the end of
the line.
This allows the components of the pattern to be organised in a
potentially more readable fashion.

    # A contrived pattern to match a number with optional decimal places
    float_pat = /\A
        [[:digit:]]+ # 1 or more digits before the decimal point
        (\.          # Decimal point
            [[:digit:]]+ # 1 or more digits after the decimal point
        )? # The decimal point and following digits are optional
    \Z/x
    float_pat.match('3.14') #=> #<MatchData "3.14" 1:".14">

*Note*: To match whitespace in an <tt>x</tt> pattern use an escape such as
<tt>\s</tt> or <tt>\p{Space}</tt>.

Comments can be included in a non-<tt>x</tt> pattern with the
<tt>(?#</tt><i>comment</i><tt>)</tt> construct, where <i>comment</i> is
arbitrary text ignored by the regexp engine.

== Encoding

Regular expressions are assumed to use the source encoding. This can be
overridden with one of the following modifiers.

* <tt>/</tt><i>pat</i><tt>/u</tt> - UTF-8
* <tt>/</tt><i>pat</i><tt>/e</tt> - EUC-JP
* <tt>/</tt><i>pat</i><tt>/s</tt> - Windows-31J
* <tt>/</tt><i>pat</i><tt>/n</tt> - ASCII-8BIT

A regexp can be matched against a string when they either share an
encoding, or the regexp's encoding is _US-ASCII_ and the string's encoding
is ASCII-compatible.

If a match between incompatible encodings is attempted an
<tt>Encoding::CompatibilityError</tt> exception is raised.

The <tt>Regexp#fixed_encoding?</tt> predicate indicates whether the regexp
has a <i>fixed</i> encoding, that is one incompatible with ASCII.
A regexp's encoding can be explicitly fixed by supplying
<tt>Regexp::FIXEDENCODING</tt> as the second argument of
<tt>Regexp.new</tt>:

    r = Regexp.new("a".force_encoding("iso-8859-1"), Regexp::FIXEDENCODING)
    r =~ "a\u3042"
       #=> Encoding::CompatibilityError: incompatible encoding regexp match
       #   (ISO-8859-1 regexp with UTF-8 string)

== Special global variables

Pattern matching sets some global variables:

* <tt>$~</tt> is equivalent to Regexp.last_match;
* <tt>$&</tt> contains the complete matched text;
* <tt>$`</tt> contains the string before the match;
* <tt>$'</tt> contains the string after the match;
* <tt>$1</tt>, <tt>$2</tt> and so on contain the text matched by the first,
  second, etc. capture group;
* <tt>$+</tt> contains the last capture group.

Example:

    m = /s(\w{2}).*(c)/.match('haystack') #=> #<MatchData "stac" 1:"ta" 2:"c">
    $~                                    #=> #<MatchData "stac" 1:"ta" 2:"c">
    Regexp.last_match                     #=> #<MatchData "stac" 1:"ta" 2:"c">

    $&      #=> "stac"
            # same as m[0]
    $`      #=> "hay"
            # same as m.pre_match
    $'      #=> "k"
            # same as m.post_match
    $1      #=> "ta"
            # same as m[1]
    $2      #=> "c"
            # same as m[2]
    $3      #=> nil
            # no third group in pattern
    $+      #=> "c"
            # same as m[-1]

These global variables are thread-local and method-local.

== Performance

Certain pathological combinations of constructs can lead to abysmally bad
performance.

Consider a string of 25 <i>a</i>s, a <i>d</i>, 4 <i>a</i>s, and a
<i>c</i>.

    s = 'a' * 25 + 'd' + 'a' * 4 + 'c'
    #=> "aaaaaaaaaaaaaaaaaaaaaaaaadaaaac"

The following patterns match instantly as you would expect:

    /(b|a)/ =~ s #=> 0
    /(b|a+)/ =~ s #=> 0
    /(b|a+)*/ =~ s #=> 0

However, the following pattern takes appreciably longer:

    /(b|a+)*c/ =~ s #=> 26

This happens because an atom in the regexp is quantified by both an
immediate <tt>+</tt> and an enclosing <tt>*</tt> with nothing to
differentiate which is in control of any particular character. The
nondeterminism that results produces super-linear performance. (Consult
<i>Mastering Regular Expressions</i> (3rd ed.), p. 222, by
<i>Jeffrey Friedl</i>, for an in-depth analysis). This particular case
can be fixed by use of atomic grouping, which prevents the unnecessary
backtracking:

    (start = Time.now) && /(b|a+)*c/ =~ s && (Time.now - start)
       #=> 24.702736882
    (start = Time.now) && /(?>b|a+)*c/ =~ s && (Time.now - start)
       #=> 0.000166571

A similar case is typified by the following example, which takes
approximately 60 seconds to execute for me:

    # Match a string of 29 'a's against a pattern of 29 optional 'a's
    # followed by 29 mandatory 'a's.
    Regexp.new('a?' * 29 + 'a' * 29) =~ 'a' * 29

The 29 optional <i>a</i>s match the string, but this prevents the 29
mandatory <i>a</i>s that follow from matching. Ruby must then backtrack
repeatedly so as to satisfy as many of the optional matches as it can
while still matching the mandatory 29. It is plain to us that none of the
optional matches can succeed, but this fact unfortunately eludes Ruby.

The best way to improve performance is to significantly reduce the amount of
backtracking needed.
For this case, instead of individually matching 29
optional <i>a</i>s, a range of optional <i>a</i>s can be matched all at once
with <tt>a{0,29}</tt>:

    Regexp.new('a{0,29}' + 'a' * 29) =~ 'a' * 29
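
The two forms above can be compared with a rough timing sketch. It is
illustrative only: +match_seconds+ is a hypothetical helper rather than
part of Ruby, and the repetition count is reduced from 29 to 20 (an
arbitrary choice) so that the slow pattern finishes quickly. Timings
depend on the Ruby version and the machine, so no figures are shown.

```ruby
require 'benchmark'

# Hypothetical helper: seconds taken by a single match attempt.
def match_seconds(pattern, string)
  Benchmark.realtime { pattern =~ string }
end

n = 20 # assumed smaller than the 29 used above to keep the run short
subject = 'a' * n

# n individually optional 'a's followed by n mandatory 'a's.
slow = Regexp.new('a?' * n + 'a' * n)
# The same language expressed with a single bounded quantifier.
fast = Regexp.new("a{0,#{n}}" + 'a' * n)

# Both patterns match the subject from its first character.
slow_result = slow =~ subject
fast_result = fast =~ subject

puts format('a?...a?a...a : %.6f s', match_seconds(slow, subject))
puts format('a{0,n}a...a  : %.6f s', match_seconds(fast, subject))
```

Both patterns accept exactly the same strings; only the amount of
backtracking the engine may need differs.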