Regular Expressions

A regular expression is a pattern or sequence of characters that serves as a template to find (and possibly replace) a desired arrangement in a body of text.


Intro

A regular expression is a pattern or sequence of characters that has regular characters and metacharacters. The pattern serves as a template to find (and possibly replace) a desired arrangement in a body of text.

Here is a rough history of regular expressions:

  • 1950s. Stephen Cole Kleene, American mathematician, develops "regular sets".
  • Late 1960s. Ken Thompson's version of the QED text editor for CTSS, and then his ed text editor for UNIX, had the first syntax that looks like most of today's regular expressions. Ken had read Kleene's work. The command for a pattern search in ed was g/re/p. Based upon his work in ed, Ken went on to make grep, the command line utility of UNIX, which is typically used like this: grep -i dog animals.txt. At the Windows command prompt, the rough equivalent is findstr /i dog animals.txt. (Enter findstr /? to find out more.)
  • Most of the regular expressions since grep were based on grep (like those in expr, AWK, Emacs, vi, and lex). However, regular expressions are actually a subset of pattern matching, and variants can be found in languages like SNOBOL, NPL, KRC, Mathematica, Haskell, ML, and LISP. This page deals mostly with the Perl and POSIX based regular expressions or patterns.
  • Henry Spencer developed "regex", a library for regular expressions. The Perl and Tcl languages based their regular expressions on Spencer's regex. Philip Hazel developed Perl Compatible Regular Expressions (PCRE) based on Perl's regular expressions. PCRE is used by the PHP language and the Apache HTTP server. The regular expression libraries of Java, JavaScript, Python, Ruby, VB, .NET, and the W3C's XML Schema are based on Perl. Perl 5.10 in turn has incorporated modifications in the regular expression library from Python, PCRE, .NET, and Java.
  • 1988. The Portable Operating System Interface (POSIX) standards specified by IEEE define application programming interfaces (APIs) in an attempt to make operating systems (especially UNIX-based ones) more compatible. POSIX has regular expression standards for basic (BRE) and extended (ERE) regular expressions. The difference between Perl-based and POSIX-based regular expressions is a significant fork. POSIX character classes can only be used within bracket expressions. EG: [[:alnum:]_] is the same as [\w] in Perl-style regular expressions.

The VBScript RegExp object implements regular expressions in a slightly different way than does the JavaScript/JScript RegExp object. Both of these RegExp objects are modeled after regular expressions in PERL.

Pattern Syntax

Most characters in a regular expression will look for themselves. EG: /geo/ will find george and gorgeous. However some characters have special meaning in regular expression patterns. Here is the basic list of these metacharacters.

\ | () [] {} ^ $ * + ? .
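
To search for one of these metacharacters literally, escape it with a backslash. The following is a minimal sketch, assuming a helper named escapeRegExp (a hypothetical name, not a built-in) that escapes every character in the list above before a pattern is built from arbitrary text.

```javascript
// Hypothetical helper: put a backslash before every metacharacter
// so that the text matches itself literally.
function escapeRegExp(text) {
    return text.replace(/[\\|()[\]{}^$*+?.]/g, "\\$&");
}

var price = "$5.00 (approx.)";
var literal = new RegExp(escapeRegExp(price));

console.log(escapeRegExp(price));                             // \$5\.00 \(approx\.\)
console.log(literal.test("It costs $5.00 (approx.) today"));  // true
console.log(/5.00/.test("5x00"));  // true, because an unescaped "." is a wildcard
```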

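Assertions are zero-width parts of a pattern: they match a position, such as the start of the string or a word boundary, without consuming any characters. Atoms are the non-zero-width parts that consume characters, such as a literal character, a character set, or a parenthesized group.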
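
One way to see the difference between a position and a character is to replace each match with a marker. This is only an illustrative sketch; the sample text is made up.

```javascript
var text = "cat dog";

// \b is an assertion: it matches the four positions at the word edges
// and consumes no characters, so nothing in the text is overwritten.
console.log(text.replace(/\b/g, "|"));  // |cat| |dog|

// \w is an atom: each match consumes one character.
console.log(text.replace(/\w/g, "|"));  // ||| |||

// A match made only of an assertion has length 0.
console.log(/^/.exec(text)[0].length);  // 0
```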

Quantifiers say how many of the atom immediately preceding should match in a row. The quantifiers are *, +, ?, and {}. EG: /hi{2}/ matches hii while /(hi){2}/ matches hihi.
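
A short JavaScript sketch of the same point, using made-up sample strings: the quantifier binds to the single atom before it unless parentheses make a larger atom.

```javascript
// The quantifier applies only to the atom immediately before it.
console.log(/^hi{2}$/.test("hii"));     // true: "h", then "i" twice
console.log(/^hi{2}$/.test("hihi"));    // false
console.log(/^(hi){2}$/.test("hihi"));  // true: the group "hi" twice

// *, +, and ? are shorthand for {0,}, {1,}, and {0,1}.
var words = "b bo boo";
console.log(words.match(/bo*/g));  // ["b", "bo", "boo"]
console.log(words.match(/bo+/g));  // ["bo", "boo"]
console.log(words.match(/bo?/g));  // ["b", "bo", "bo"]
```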

Flags are not part of the pattern, but affect the application of the pattern. They can be combined. EG: ig.

  • g. Global match, i.e. find all matches. Without g, the pattern finds just the first match.
  • i. Ignore case, as in upper or lower case.
  • m. Match over multiple lines. "^" and "$" change from matching at only the start or end of the entire string to matching at the start or end of any line within the string.
  • y. Sticky. (First implemented in JavaScript 1.8; standardized in ECMAScript 2015; not in JScript.) "matches only from the index indicated by the lastIndex property of this regular expression in the target string (and does not attempt to match from any later indexes). This allows the match-only-at-start capabilities of the character "^" to effectively be used at any location in a string by changing the value of the lastIndex property." -http://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp
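
The flags can be tried out as follows. This is a sketch with a made-up sample string; the sticky part assumes an engine that supports the y flag (ECMAScript 2015 or later).

```javascript
var text = "Dog\ndog\nhot dog";

// No g flag: match() returns only the first match.
console.log(text.match(/dog/)[0]);          // dog
// g finds every match; i ignores case.
console.log(text.match(/dog/gi).length);    // 3
// m lets ^ match at the start of every line.
console.log(text.match(/^dog/gim).length);  // 2

// y (sticky) matches only at lastIndex, never later in the string.
var sticky = /dog/y;
sticky.lastIndex = 4;
console.log(sticky.test(text));  // true: "dog" starts at index 4
sticky.lastIndex = 5;
console.log(sticky.test(text));  // false: no scan forward to index 12
```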

Here is the syntax for regular expressions in JavaScript.

var RegExpLiteral = /pattern/[flags];
var RegExpObject = new RegExp("pattern"[, "flags"]);
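
The two forms build the same kind of object. The sketch below uses made-up sample values to show the main practical difference: the constructor takes strings, so backslashes must be doubled, but the pattern can be assembled at run time.

```javascript
// A literal is written directly in the source.
var literal = /\d+\.\d+/g;

// The constructor takes strings, so each backslash is doubled.
var built = new RegExp("\\d+\\.\\d+", "g");

console.log(literal.source === built.source);  // true
console.log("pi 3.14, e 2.72".match(built));   // ["3.14", "2.72"]

// A run-time value (hypothetical) spliced into a pattern.
var word = "dog";
var wholeWord = new RegExp("\\b" + word + "\\b", "i");
console.log(wholeWord.test("Hot Dog stand"));  // true
console.log(wholeWord.test("dogma"));          // false
```

If the spliced value may contain metacharacters, it needs escaping first.
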
Character Description
\ Escapes, i.e. marks the next character as special, a literal, a back reference, or an octal.
EG: "n" is "n", but "\n" is a newline. An escape of particular note is "\\", which matches a literal backslash.
^ (1) Anchors at start, i.e. matches at the beginning of target string. If RegExp.Multiline is set, then also matches after "\n" or "\r".
EG: "^a" matches the first a in "ana" but not the second.

(2) In sets, this means not the set.
EG: "[^x-z]" matches any character except for "x", "y", or "z".
$ Anchors at end, i.e. matches at the end of target string. If RegExp.Multiline is set, then also matches before "\n" or "\r".
EG: "a$" matches the second a in "ana" but not the first.
. Matches any 1 character except characters related to new lines: [\n\r\u2028\u2029].
EG: "bo." matches "boo" and "bot", but not "b" or "bo".
* Quantifier: Matches the preceding sub-expression 0 or more times. Same as {0,}.
EG: "bo*" matches "b", "bo", "boo", "booo", "boooo".
+ Quantifier: Matches the preceding sub-expression 1 or more times. Same as {1,}.
EG: "bo+" matches "bo", "boo", "booo", "boooo", but not "b".
? (1) Quantifier: Matches the preceding sub-expression 0 or 1 times. Same as {0,1}.
EG: "bo?" matches "b" and "bo", and only the "bo" of "boo".

(2) If used immediately after one of the other quantifiers (*, +, ?, and {}), then makes the pattern non-greedy.
EG: "X.+X" matches "XHello world.X Xfoo barX", while "X.+?X" matches "XHello world.X"

(3) Used in the non-capturing group (?:) and in the look ahead and look behind assertions: (?=), (?!), (?<=), and (?<!).
{n} Matches the preceding sub-expression n times.
EG: "bo{2}" matches "boo", but not "b" or "bo". In "booo" it matches only the first "boo".
{n,} Matches the preceding sub-expression n or more times.
EG: "bo{2,}" matches "boo", "booo", "boooo", but not "b" or "bo".
{n,m} Matches the preceding sub-expression n to m times.
EG: "bo{2,3}" matches "boo" and "booo", and only the "booo" of "boooo".
(pattern)
\(pattern\)
(Latter in POSIX)
(1) Used in a mathematical fashion for grouping, scoping, and setting precedence.
EG: "dais(y|ies)" is the same as "daisy|daisies".

(2) Matches the pattern and captures/remembers/parenthesizes it. Captured matches can be retrieved from the Matches collection (VBScript), the RegExp.$1 ... RegExp.$9 properties (JavaScript), the $1, $2, ... variables (PERL), or \n (POSIX). Inside the pattern itself, a captured match is referenced as \n.
EG: "<(.*)>.*<\/\1>" matches paired elements like "<p>hi</p>".
EG: "(.)(.).*\2\1" matches strings like "ABcdedBA".
(?:pattern) Non-capturing group: Matches the pattern and groups it for quantifiers and alternation, but does not capture it. Unlike the look ahead and look behind assertions below, it is not zero-width.
EG: "(?:ha)+" matches "hahaha" without capturing. In flavors where the empty alternative in "|a" is not valid, "(?:)|a" matches the empty string or a.
pattern1(?=pattern2) Positive look ahead assertion: Matches the pattern1 if it is followed by pattern2. In spite of parentheses, this is zero-width and does not capture.
EG: "Windows (?=95|98)" matches "Windows" of "Windows 98" but not "Windows" of "Windows 2000".
pattern1(?!pattern2) Negative look ahead assertion: Matches the pattern1 if it is not followed by pattern2. In spite of parentheses, this is zero-width and does not capture.
EG: "Windows (?!95|98)" matches "Windows" of "Windows 2000" but not "Windows" of "Windows 98".
(?<=pattern1)pattern2 Positive look behind assertion: Matches the pattern2 if it is preceded by pattern1. In spite of parentheses, this is zero-width and does not capture.
EG: "(?<=Satur|Sun)day" matches "day" of "Sunday" but not "day" of "Monday".
(?<!pattern1)pattern2 Negative look behind assertion: Matches the pattern2 if it is not preceded by pattern1. In spite of parentheses, this is zero-width and does not capture.
EG: "(?<!Satur|Sun)day" matches "day" of "Monday" but not "day" of "Sunday".
x|y Separates alternatives. Matches x or y.
EG: "g|food" matches "g" or "food". "(g|f)ood" matches "good" or "food".
[xyz] Positive character set matches any character enclosed.
EG: "[ab]" matches "a" of "abcd", and then "b" on a global search.
[^xyz] Negative character set matches any character not enclosed.
EG: "[^ab]" matches "c" of "abcd", and then "d" on a global search.
[x-z] Positive range of characters.
EG: "[x-z]" matches any "x", "y", or "z".
[^x-z] Negative range of characters.
EG: "[^x-z]" matches any character except for "x", "y", or "z".
\b Matches a word boundary, i.e. position between a word character (\w) and a non-word character (\W) or an end of the string.
EG: "er\b" matches "er" in "hover x" but not the "er" in "Ebert".
\B Matches a non-word boundary, i.e. position between two word characters or between two non-word characters.
EG: "er\B" matches "er" in "Ebert" but not the "er" in "hover x".
\cx Matches a control character x, where x is A-Z or a-z.
EG: "\cM" matches ctrl+M (carriage return character).
\d
[:digit:]
(latter in POSIX)
Matches a digit character. Same as [0-9].
\D Matches a non-digit character. Same as [^0-9].
\f Matches a form-feed character. Same as [\x0c\cL].
\n Matches a newline character. Same as [\x0a\cJ]. FYI: EOLs by sys: Win \r\n; Unix \n; classic Mac \r.
\r Matches a carriage return character. Same as [\x0d\cM]. FYI: EOLs by sys: Win \r\n; Unix \n; classic Mac \r.
\s
[:space:]
(latter in POSIX)
Matches a whitespace character. Same as [\t\n\v\f\r ] or [\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000].
\S Matches a non-whitespace character. Same as [^\t\n\v\f\r ] or [^\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000].
\t Matches a tab character. Same as [\x09\cI].
\v Matches a vertical tab character. Same as [\x0b\cK].
\w Matches a word character. Same as [A-Za-z0-9_].
\W Matches a non-word character. Same as [^A-Za-z0-9_].
[:alnum:] Matches alphanumeric characters in POSIX. Same as [A-Za-z0-9].
[:alpha:] Matches alphabet characters in POSIX. Same as [A-Za-z].
[:blank:] Matches space and tab in POSIX. Same as [ \t].
[:cntrl:] Matches control characters in POSIX. Same as [\x00-\x1F\x7F].
[:graph:] Matches graphical or visible characters in POSIX. Same as [\x21-\x7E].
[:lower:] Matches a lowercase character in POSIX. Same as [a-z].
[:print:] Matches graphical or visible characters and space in POSIX. Same as [\x20-\x7E].
[:punct:] Matches punctuation characters in POSIX. Same as [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~].
[:upper:] Matches an uppercase character in POSIX. Same as [A-Z].
[:xdigit:] Matches any characters used in hexadecimal digits in POSIX. Same as [A-Fa-f0-9].
\n If integer n is preceded by at least n captured (parenthesized) matches, then back references the captured matches.
EG: "one(,)\stwo\1" matches "one, two," in "one, two, three".
EG: "/<(.*)>.*<\/\1>/" matches paired elements like "<p>hi</p>".
EG: "/^(.)(.).*\2\1$/" matches strings like "ABcdedBA".

Else if n is an octal number, then matches an octal character. In VBScript, n must be between 1-3 digits (0-777).
EG: "\132" matches "Z".
\on Matches an ASCII octal character code n. Support varies by flavor; PERL 5.14 and later writes this as \o{n}.
EG: "\o{132}" matches "Z" in PERL.
\xn Matches an ASCII hexadecimal character code n, where n has 2 digits.
EG: "\x5a" matches "Z".
\un Matches a Unicode hexadecimal character code n, where n has 4 digits.
EG: "\u00A2" matches "¢".
\0 Matches the NUL character. Same as [\u0000].
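
Several entries in the table are easier to follow when run. The following JavaScript sketch exercises greedy versus non-greedy quantifiers, look ahead, look behind, back references, and non-capturing groups with the sample strings used above. The look behind line assumes an ECMAScript 2018 or later engine.

```javascript
var text = "XHello world.X Xfoo barX";

// Greedy: .+ takes as much as it can.
console.log(text.match(/X.+X/)[0]);   // XHello world.X Xfoo barX
// Non-greedy: .+? stops at the first X that completes the match.
console.log(text.match(/X.+?X/)[0]);  // XHello world.X

// Look ahead: match "Windows " only when 95 or 98 follows (or does not).
console.log(/Windows (?=95|98)/.test("Windows 98"));    // true
console.log(/Windows (?=95|98)/.test("Windows 2000"));  // false
console.log(/Windows (?!95|98)/.test("Windows 2000"));  // true

// Look behind: "day" only when preceded by "Satur" or "Sun".
console.log("Sunday Monday".match(/(?<=Satur|Sun)day/).index);  // 3

// Back references: \1 and \2 repeat what the groups captured.
console.log(/^(.)(.).*\2\1$/.test("ABcdedBA"));   // true
console.log(/<(.*)>.*<\/\1>/.test("<p>hi</p>"));  // true
console.log(/<(.*)>.*<\/\1>/.test("<p>hi</b>"));  // false

// (?:...) groups without capturing, so (y|ies) is still group 1.
console.log("daisies".match(/(?:dais)(y|ies)/)[1]);  // ies
```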

Matching Rules

There are six basic rules that regular expressions apply in order.

  1. Starting before the first character, try to match the pattern against everything to the right, then against the same text with characters dropped one at a time from the right. If there is no match, repeat starting after the first character, and so on.
    Match /ar/ in "Cart":
    Cart // Does "Cart" match? ... NO
    Cart // Does "Car" match?  ... NO
    Cart // Does "Ca" match?   ... NO
    Cart // Does "C" match?    ... NO
    Cart // Does "" match?     ... NO
    Cart // Does "art" match?  ... NO
    Cart // Does "ar" match?   ... YES
    
  2. The whole pattern is regarded as a set of alternatives separated by vertical bars (|).
  3. Any specific alternative matches if every assertion or quantified atom in the alternatives matches sequentially according to Rules 4 and 5.
  4. If an assertion does not match according to the following table, backtrack to Rule 3 and try a higher-pecking-order item with different choices.
    Assertion Description
    ^ Matches at the beginning of the string.
    $ Matches at the end of the string.
    \b Matches a word boundary (between \w and \W), when not inside [].
    \B Matches a non-word boundary.
  5. A quantified atom matches only if the atom itself matches some number of times allowed by its following quantifier. Multiple matches must be adjacent within the string. If no match can be found at a current position for any allowed quantity of the atom, backtrack to Rule 3 and try higher-pecking-order items with different choices.
  6. Each atom matches according to its type. If the atom doesn't match, or doesn't allow a match of the rest of the regular expression, backtrack to Rule 5 and try the next choice for the atom's quantity.
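
The order of attempts in Rule 1 can be sketched in JavaScript. This is an illustration only, not how an engine is implemented: real engines backtrack through the pattern instead of listing substrings, and this longest-first listing ignores non-greedy quantifiers and assertions that look outside the candidate. The function name traceRule1 is hypothetical.

```javascript
// Lists the candidates in the order Rule 1 describes and stops at the
// first one that the whole pattern matches.
function traceRule1(source, text) {
    var whole = new RegExp("^(?:" + source + ")$");
    var tried = [];
    for (var start = 0; start <= text.length; start++) {
        // Longest candidate first, then drop characters from the right.
        for (var end = text.length; end >= start; end--) {
            var candidate = text.slice(start, end);
            tried.push(candidate);
            if (whole.test(candidate)) {
                return { match: candidate, index: start, tried: tried };
            }
        }
    }
    return { match: null, index: -1, tried: tried };
}

var result = traceRule1("ar", "Cart");
console.log(result.tried);                // ["Cart", "Car", "Ca", "C", "", "art", "ar"]
console.log(result.match, result.index);  // ar 1
console.log("Cart".search(/ar/));         // 1, the built-in engine agrees
```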

EGs

Numbers

  • To match whole positive numbers.
    ^\d+$
  • To match positive or negative whole numbers.
    ^-?\d+$
  • To match positive or negative real numbers with a decimal point.
    ^-?\d+\.\d+$
  • To match positive or negative real numbers with or without a decimal point.
    ^-?\d+(\.\d+)?$
  • To match positive or negative real numbers with or without a decimal point and 1 to 3 decimal places.
    ^-?\d+(\.\d{1,3})?$
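
A quick way to compare the five number patterns is to run each against the same samples. The names in the patterns object and the helper matchingPatterns are made up for this sketch.

```javascript
var patterns = {
    wholePositive: /^\d+$/,
    wholeSigned: /^-?\d+$/,
    realRequired: /^-?\d+\.\d+$/,
    realOptional: /^-?\d+(\.\d+)?$/,
    realUpTo3: /^-?\d+(\.\d{1,3})?$/
};

// Hypothetical helper: names of the patterns that accept the sample.
function matchingPatterns(sample) {
    return Object.keys(patterns).filter(function (name) {
        return patterns[name].test(sample);
    });
}

["42", "-42", "3.14", "1.2345", "7.", "abc"].forEach(function (sample) {
    console.log(sample + " -> " + (matchingPatterns(sample).join(", ") || "no match"));
});
// 42 -> wholePositive, wholeSigned, realOptional, realUpTo3
// 1.2345 -> realRequired, realOptional
// 7. -> no match
```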

Tags

Tag as in HTML, XML, etc.

  • Find all tags.
    /<(.|\n)+?>/
    /(<([^>]+)>)/ig
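  • Replace non-breaking space entities with spaces (PERL substitution).
    s/&nbsp;/ /g
  • Visual Studio Find and Replace expression (older syntax, where {} tags a group) to replace <img> with well-formed <img />.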
    Find: {\<img([^>]*[^/])}{\>}
    Replace: \1 /\2
  • Select an attribute and its value.
    onclick="[^"]*"
  • Find tags.
    <td[^>]*>
  • JavaScript StripTags, not PHP strip_tags
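
As a rough sketch of how the tag patterns can be used from JavaScript, assuming a hypothetical helper named stripTags and a made-up HTML fragment. Patterns like these suit simple, well-behaved markup; an attribute value that contains ">" would defeat them.

```javascript
// Hypothetical helper: remove anything that looks like a tag.
function stripTags(html) {
    return html.replace(/<[^>]+>/g, "");
}

var html = '<td class="x" onclick="go()">Hi&nbsp;<b>there</b></td>';

console.log(stripTags(html));                           // Hi&nbsp;there
console.log(html.match(/<td[^>]*>/)[0]);                // <td class="x" onclick="go()">
console.log(html.match(/onclick="[^"]*"/)[0]);          // onclick="go()"
console.log(stripTags(html).replace(/&nbsp;/g, " "));   // Hi there
```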

Whitespace

  • Find blank lines.
    ^\r?\n?$
  • Find extra spaces and tabs on the right. Commonly used for a right trim.
    [ \t]+$
  • Find leading indentation in spaces (soft tabs) and replace with hard tabs. May require multiple passes.
    str.replace(/^(\t*) {4}/gm, "$1\t")
  • Find bad "whitespace", i.e. spaces, tab(s), or a mix of tabs and spaces, in the middle of text.
    (?<=\S)( {2,}|\t+| +\t+|\t+ +)(?=\S)
  • Put in missing line and tabs. Watch the number of tabs.
    str.replace(/(\S)(<\/li>)(\t*)/g, "$1\r\n$3$2")
  • Find spaces while ignoring spaces after things like commas. Useful while cleaning CSVs.
    \b \b
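
A few of these whitespace patterns in use. This is a sketch under two assumptions: the m flag is added so that ^ and $ work on every line of a multi-line string, and the helper name spacesToTabs is hypothetical. The loop shows why the indentation replacement can need multiple passes.

```javascript
// Right trim every line: m makes $ match at each line end.
var ragged = "one  \n\ttwo\t \nthree";
console.log(JSON.stringify(ragged.replace(/[ \t]+$/gm, "")));  // "one\n\ttwo\nthree"

// Hypothetical helper: each pass converts one more level of
// four-space indentation, so repeat until nothing changes.
function spacesToTabs(str) {
    var previous;
    do {
        previous = str;
        str = str.replace(/^(\t*) {4}/gm, "$1\t");
    } while (str !== previous);
    return str;
}
console.log(JSON.stringify(spacesToTabs("        x\n    y")));  // "\t\tx\n\ty"

// \b \b finds a space between two word characters,
// so the space after a comma is left alone.
console.log("a b, c d".replace(/\b \b/g, "_"));  // a_b, c_d
```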

Miscellany

  • Full names of family members.
    (George|Julia|Connie|York|Amy) Hernandez.
  • Switch first two words.
    • In JavaScript.
      str = "George Hernandez";
      newstr = str.replace(/(\S+)\s(\S+)/, "$2 $1");
    • In PERL.
      s/(\S+)\s+(\S+)/$2\, $1/
  • Make the first word of many the last.
    str.replace(/(\w+)\s([\w\s]*)/, "$2_$1") would make 20060728t1503 My File part1of2.txt into My File part1of2_20060728t1503.txt. I've used this variation, /([\w\-]+\s\w+)\s([\w\s]*)/, to change 2006-07-28 1503 My File part1of2.txt into My File part1of2_2006-07-28 1503.txt.
  • Make a word before the extension the first.
    str.replace(/(.*)\s(\w*)(\.\w{3})/, "$2 $1$3") would make My File part1of2 200607281503.txt into 200607281503 My File part1of2.txt.
  • Begins with y or Y, as in a Yes answer.
    ^[yY].
  • Exactly "yes", "Yes", or "YES".
    (yes|YES|Yes)
  • Begins with colon-delimited field that has no spaces or tabs.
    ^[^ \t:]+:.
  • Find non-ASCII characters:
    [^\x00-\x7F]
  • Fighting SQL injection. Do this on the client and server side: Only accept entries with the characters in the format you need.
  • Find extra commas:
    ,\s*[\}\|\]]
  • U.S. zip code:
    ^\d{5}([\-]\d{4})?$
  • Social Security Number (SSN) format (NNN-NN-NNNN).
    ^\d{3}-\d{2}-\d{4}$
  • U.S. phone numbers. There are many variations for phones. This one is 123-1234 or 123-123-1234 with dashes optional.
    ^(\d{3}[-]?){1,2}(\d{4})$
  • Common characters for most U.S. names and addresses.
    [ \w\,\'\.\#\-]{1,50}
  • Match parameters in a query string.
    var match, mypar, myparDefault = 0;
    mypar = myparDefault;
    match = document.URL.match(/[?&]mypar=(\d+)/);
    if (match && match[1]) {
        mypar = parseInt(match[1], 10) || myparDefault;
    }
    
  • Find a six-letter word that contains the word "dog".
    \b(?=\w{6}\b)\w{0,3}dog\w*
  • Check password strength. Does not do a dictionary check.
    ^(?=.{8,30}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*\W)(?!.*\s).*$
    ^(?=.{6,30}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9\W]).*$
  • Email. The official standards RFC 5321 and 5322 are more extensive. See the Wikipedia article on email addresses.
    /^([\w\-]+)(\.[\w\-]+)*@([\w\-]+\.){1,5}([A-Za-z]){2,4}$/
    /^(?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-\/=?^_`{|}~]+)+@((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}$/
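
Some of the patterns in this list, tried from JavaScript. This is only a sketch: the variable names and sample values are made up, and the patterns check format only, not whether a zip code, SSN, or phone number actually exists.

```javascript
var zip = /^\d{5}(-\d{4})?$/;
var ssn = /^\d{3}-\d{2}-\d{4}$/;
var phone = /^(\d{3}-?){1,2}\d{4}$/;
var sixLetterDog = /\b(?=\w{6}\b)\w{0,3}dog\w*/;

console.log(zip.test("60614"), zip.test("60614-1234"), zip.test("6061"));
// true true false
console.log(ssn.test("123-45-6789"), ssn.test("123456789"));
// true false
console.log(phone.test("123-1234"), phone.test("1231231234"), phone.test("12-1234"));
// true true false

// "bulldog" has seven letters, so the look ahead rejects it.
console.log("My bulldog ate a hotdog".match(sixLetterDog)[0]);  // hotdog

// Switch the first two words.
console.log("George Hernandez".replace(/(\S+)\s(\S+)/, "$2 $1"));  // Hernandez George
```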

GeorgeHernandez.com. Some rights reserved.