Wednesday, July 29, 2015

A Simple Introduction To Regular Expressions

Scott Hanselman has a saying: once you solve a problem with a regular expression, you now have two problems (the quip is usually credited to Jamie Zawinski).  Regular expressions can be intimidating because of their syntax.  However, they can be understood and debugged as long as you have the right tools, and they are powerful when used appropriately and correctly.  You can do it.

The Is Numeric Regular Expression

Our first regular expression detects whether the entire string consists of digits.

^\d+$

Let's break down each character of the regular expression:

  • ^ This means the match must start at the beginning of the string
  • \d This means to detect a single digit
  • + This means to detect 1 or more of the preceding item (in other words, we are looking for one or more digits)
  • $ This means the match must end at the end of the string
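
As a quick way to try the breakdown above outside of a form field, here is a minimal JavaScript sketch.  The function name isNumeric is only an illustration and is not part of any library.

```javascript
// Returns true only when the whole string is one or more digits.
// isNumeric is a hypothetical helper name used for this sketch.
function isNumeric(text) {
  return /^\d+$/.test(text);
}

console.log(isNumeric("12345")); // true
console.log(isNumeric("12a45")); // false: the "a" is not a digit
console.log(isNumeric(""));      // false: + requires at least one digit
```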

Try it out in an HTML5 compliant browser

Here is the HTML code for the above.  Notice we are using "tel" for the input type.  This will bring up the numeric keyboard for mobile devices.
<input pattern="^\d+$" required="required" title="Please enter a number" type="tel" />

The 5 Digit Zip Code Regular Expression


Let's take what we know from above and extend our knowledge.  We can easily make a five digit zip code detector.

^\d{5}$

In this example we have introduced a new concept with the {5}.  This means we are looking for exactly 5 of the preceding item, in this case 5 digits.
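
The sketch below, in JavaScript, shows why the ^ and $ anchors matter once a count like {5} is involved.  The names anchored, unanchored and isFiveDigitZip are made up for the example.

```javascript
const anchored = /^\d{5}$/;  // the whole string must be exactly five digits
const unanchored = /\d{5}/;  // five digits anywhere in the string is enough

// isFiveDigitZip is a hypothetical helper name used for this sketch.
function isFiveDigitZip(text) {
  return anchored.test(text);
}

console.log(isFiveDigitZip("90210"));   // true
console.log(isFiveDigitZip("9021"));    // false: only four digits
console.log(isFiveDigitZip("902101"));  // false: six digits
console.log(unanchored.test("902101")); // true: without anchors, six digits slip through
```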

Try it out in an HTML5 compliant browser

Here is the HTML code for the above.  Again, we are using "tel" for the input type which will bring up the numeric keyboard for mobile devices.
<input pattern="^\d{5}$" required="required" title="Please enter 5 digit zip code" type="tel" />


The Zip Code with optional Zip+4 Regular Expression


Now we are starting to see the familiar, overwhelming spaghetti of characters that characterizes regular expressions.  When you see something like the expression below, your first reaction may be to throw up your hands and say to yourself, "I can't understand this mess."  The best way to approach it is to review it character by character.  If you still don't understand it, tools like Expresso can be invaluable for explaining it to you.

^\d{5}(-?\d{4})?$

In this regular expression we add the concept of groups.  A group is delimited by open and close parentheses ( ).  We also add the question mark ? character, which indicates zero or one instance of the preceding item.  So within the group there can be an optional dash followed by four digits.  The entire group is itself optional because a question mark follows it.  Expresso can explain all of this.


So this will match

  • 12345
  • 123456789
  • 12345-6789
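
To see the optional group at work, here is a small JavaScript sketch that reads the captured group back out.  The helper name parseZip and the shape of the object it returns are assumptions made for illustration.

```javascript
const zipPlusFourPattern = /^\d{5}(-?\d{4})?$/;

// parseZip is a hypothetical helper: it returns the two parts of the zip code,
// or null when the text does not match at all.
function parseZip(text) {
  const match = zipPlusFourPattern.exec(text);
  if (match === null) return null;
  // match[1] is undefined when the optional group did not take part in the match.
  const plusFour = match[1] === undefined ? null : match[1].replace("-", "");
  return { zip: text.slice(0, 5), plusFour: plusFour };
}

console.log(parseZip("12345"));      // zip "12345", plusFour null
console.log(parseZip("123456789"));  // zip "12345", plusFour "6789"
console.log(parseZip("12345-6789")); // zip "12345", plusFour "6789"
console.log(parseZip("1234-56789")); // null: the first five characters must be digits
```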


Try it out in an HTML5 compliant browser

Here is what the HTML Code looks like:
<input pattern="^\d{5}(-?\d{4})?$" required="required" title="Please enter 5 digit zip code or zip+4" type="tel" />

The Currency Regular Expression

In our currency example we want to allow the user to enter a $ and commas along with digits and cents.  The expression contains these new concepts:


  • \$ matches a literal dollar sign.  Without the backslash, $ is the end of string operator.  Special characters such as ^ $ ( ) [ ] { } . * + ? | \ must be escaped with a backslash when you want to match them literally.
  • [0-9] is a set with a range, meaning any digit from 0 through 9.  Ranges work with letters too, for example [a-z], [A-Z], or [a-zA-Z] for both lower and upper case.
  • {1,3} is a quantifier meaning we want from 1 to 3 of the preceding item
  • * means zero or more of the preceding item



^\$? ?([0-9]{1,3},([0-9]{3},)*[0-9]{3}|[0-9]+)(\.[0-9][0-9])?$

This will match

  • $1,240.99
  • $ 1,240.99
  • $10,000
  • 10000.00
Here is what it looks like in Expresso






Try it out in an HTML5 compliant browser
This is what the HTML code looks like:
<input pattern="^\$? ?([0-9]{1,3},([0-9]{3},)*[0-9]{3}|[0-9]+)(\.[0-9][0-9])?$" required="required" title="Please enter currency amount" type="text" />
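
The currency expression can also be exercised from JavaScript.  This is only a sketch: the helper name parseCurrency is made up, and the pattern here escapes the decimal point as \. on the assumption that only a literal period should separate the cents.

```javascript
// Assumption: the decimal point is escaped (\.) so that only a real period is accepted.
const currencyPattern = /^\$? ?([0-9]{1,3},([0-9]{3},)*[0-9]{3}|[0-9]+)(\.[0-9][0-9])?$/;

// parseCurrency is a hypothetical helper: it validates the text first, then strips
// the dollar sign, commas and spaces before converting what is left to a number.
function parseCurrency(text) {
  if (!currencyPattern.test(text)) return null;
  return Number(text.replace(/[$, ]/g, ""));
}

console.log(parseCurrency("$1,240.99")); // 1240.99
console.log(parseCurrency("$10,000"));   // 10000
console.log(parseCurrency("10000.00"));  // 10000
console.log(parseCurrency("$12x34"));    // null: "x" is not a decimal point
console.log(parseCurrency("1,24"));      // null: a comma must be followed by three digits
```

Validating before stripping characters keeps malformed input such as "1,24" from being silently turned into 124.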

The US Phone Regular Expression


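For a US phone number we want ten digits.  The area code and prefix can be separated by common characters.  Notice the set operator, denoted by square brackets.  For example, [-. ] means we will match a dash, a period, or a space.  Inside a set, a dash between two characters defines a range, so the dash in the first set is escaped as \- to keep it a literal dash.

^\(?\d{3}[\)\-.]? ?\d{3}[-. ]?\d{4}$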

This will match

  • (999) 999-9999
  • (999)999-9999
  • 999-999-9999
  • 999.999.9999
  • 9999999999
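
Here is a hedged JavaScript sketch that validates a phone number and then rewrites it in one consistent format.  The helper name normalizePhone is invented for the example, and the first set is written as [)\-.] with the dash escaped, on the assumption that only a close parenthesis, a dash, or a period should be accepted after the area code.

```javascript
// Assumption: the dash in the first set is escaped so it is a literal dash, not a range.
const phonePattern = /^\(?\d{3}[)\-.]? ?\d{3}[-. ]?\d{4}$/;

// normalizePhone is a hypothetical helper: it returns "(999) 999-9999" style text,
// or null when the input is not a valid ten digit number.
function normalizePhone(text) {
  if (!phonePattern.test(text)) return null;
  const digits = text.replace(/\D/g, ""); // \D matches anything that is not a digit
  return "(" + digits.slice(0, 3) + ") " + digits.slice(3, 6) + "-" + digits.slice(6);
}

console.log(normalizePhone("999.999.9999")); // (999) 999-9999
console.log(normalizePhone("9999999999"));   // (999) 999-9999
console.log(normalizePhone("999*999-9999")); // null: * is not in the set
console.log(normalizePhone("999-9999"));     // null: only seven digits
```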


Try it out in an HTML5 compliant browser



Here is the HTML Code:
<input pattern="^\(?\d{3}[\)\-.]? ?\d{3}[-. ]?\d{4}$" required="required" title="Please enter valid phone in the format (999) 999-9999 or 999-999-9999" type="tel" />

Valid URL Regular Expression


We are now going to introduce several new things with a regular expression that validates a URL.  I took the valid characters from this lovely Stack Overflow post.  .NET supports the concept of named groups.  See the groups below: Protocol, Prefix, Domain and QueryString.  JavaScript did not support named groups when this was written (they were added in ES2018), so an equivalent with unnamed groups is also shown.  It is also possible to use OR statements inside the regular expression by using the pipe | operator.  In this particular case the OR operator is used to accept percent-encoded URL characters, such as %20 for a space.

.NET Named Group Example
^(?<Protocol>https?://)?(?<Prefix>www\d*\.)?(?<Domain>([!#$&-;-\[\]_a-z~]|%[0-9a-fA-F]{2})+)(?<QueryString>\??([!#$&-;=-\[\]_a-z~]|%[0-9a-fA-F]{2})+)?$

JavaScript Equivalent
^(https?://)?(www\d*\.)?(([!#$&-;-\[\]_a-z~]|%[0-9a-fA-F]{2})+)(\??([!#$&-;=-\[\]_a-z~]|%[0-9a-fA-F]{2})+)?$

This will match

  • http://www.kellermansoftware.com
  • http://kellermansoftware.com
  • https://www.kellermansoftware.com
  • http://www1.kellermansoftware.com
  • kellermansoftware.com
  • www.kellermansoftware.com
  • http://www.kellermansoftware.com?FirstName=Greg
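
JavaScript engines that implement ES2018 or later accept the (?<Name>...) syntax as well, so the named groups can be read back through match.groups.  The sketch below rests on two assumptions: that the Protocol and Prefix groups are meant to be optional (the list above includes addresses without them), and that the dot after www should be a literal period.  The helper name parseUrl is made up for the example.

```javascript
// Assumptions: Protocol and Prefix are optional, and the dot after "www" is escaped.
// The slashes are escaped only because this is a JavaScript regex literal.
const urlPattern = /^(?<Protocol>https?:\/\/)?(?<Prefix>www\d*\.)?(?<Domain>([!#$&-;\-\[\]_a-z~]|%[0-9a-fA-F]{2})+)(?<QueryString>\??([!#$&-;=-\[\]_a-z~]|%[0-9a-fA-F]{2})+)?$/;

// parseUrl is a hypothetical helper: it returns the named parts, or null when
// the text does not match.  Parts that were not present come back as undefined.
function parseUrl(text) {
  const match = urlPattern.exec(text);
  if (match === null) return null;
  const { Protocol, Prefix, Domain, QueryString } = match.groups;
  return { protocol: Protocol, prefix: Prefix, domain: Domain, query: QueryString };
}

console.log(parseUrl("http://www.kellermansoftware.com?FirstName=Greg"));
console.log(parseUrl("kellermansoftware.com"));
console.log(parseUrl("http://kellerman software.com")); // null: a space is not an allowed character
```

Note that the Domain set allows characters such as : and /, so this pattern checks the characters in the address rather than proving that it is a well-formed URL.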




Try it out in an HTML5 compliant browser

Here is the HTML:
<input pattern="^(https?://)?(www\d*\.)?(([!#$&amp;-;-\[\]_a-z~]|%[0-9a-fA-F]{2})+)(\??([!#$&amp;-;=-\[\]_a-z~]|%[0-9a-fA-F]{2})+)?$" required="required" title="Please enter valid url" type="text" />

Date Regular Expression


Even though HTML5 supports the date input type, it is not supported by most browsers as of yet.  You should probably use either the jQuery date picker or, if you are using Bootstrap, the Bootstrap date picker.  Here is a regular expression for dates that I wrote about before.

When should you use regular expressions?

Client-side validation is always a good candidate for regular expressions, whether you are building a web application, smart client, or phone application.  Regular expressions can be slow when run against megabytes of text, even if compiled beforehand.  If you are dealing with large data sets or large sets of regular expressions, you should test the performance against plain IndexOf string parsing.  Sometimes I have split up large text using IndexOf and then run a regular expression on the remainder.
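
The idea of narrowing the text with IndexOf before running a regular expression might look like the following JavaScript sketch, where indexOf plays the role of .NET's IndexOf.  The helper name findZipAfterMarker, the marker text and the sample string are all invented for illustration, and no performance gain is claimed without measuring.

```javascript
// findZipAfterMarker is a hypothetical helper: it uses a plain string search to
// skip ahead to a known marker, then runs the regular expression only on the rest.
function findZipAfterMarker(text, marker) {
  const start = text.indexOf(marker);
  if (start === -1) return null; // marker absent, so the regex never runs
  const remainder = text.slice(start + marker.length);
  const match = /\b\d{5}(-?\d{4})?\b/.exec(remainder);
  return match === null ? null : match[0];
}

const sample = "Order 7 notes. Ship to: 123 Main St, Springfield, IL 62704-1234";
console.log(findZipAfterMarker(sample, "Ship to:")); // 62704-1234
console.log(findZipAfterMarker(sample, "Bill to:")); // null
```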

Resources for Regular Expressions

Reggie - If you are just starting out, this is a nice self-contained HTML page that has a reference built in.
Expresso - This is an incredible free editor.  It is written in .NET so it supports authoring of named groups.  It will run from a flash drive if your PC is locked down.
Regexlib - A library of regular expressions.  Test these before you use them; the quality varies by author.
Regular Expression Cheat Sheet by Dave Child
Regular Expression Tutorial by DZone