Opinions on regular expressions vary greatly, ranging from obliviousness to adoration to disgust.  It’s no secret that I commonly use regular expressions to solve silly string problems.  I maintain that they can be simple to understand and simple to unit test when used often and responsibly.  Using them in small doses is a good way to get over the learning curve.  Here’s a simple problem that I came across and was able to solve quickly without going out of my way.

Imagine that I’m working in a code base where a Person class has a string property called Phone that holds a phone number.  That string can hold an impressive variety of phone number formats.  Now imagine that I’m tasked with consuming a third-party library, which defines its own Person object.  Instead of a string property for the phone number, it defines a PhoneNumber class with properties for Area, Exchange, and Phone.  The constructor is parameterless, and the class adheres to a tight, formal definition of a phone number.  Crap.

This is where regular expressions shine.  Sure, I could write a ton of code and account for a million and one cases of what a phone number could look like.  Or, I could be generic about it and describe the pattern that phone numbers follow:

\W*\(*\d{3}\)*\W*\.*\W*\d{3}\W*-*\W*\d{4}\W*

This pattern roughly matches sets of 3, 3, and 4 digits with some input forgiveness in between (random spaces, dashes, and parentheses).  With a pattern like this, we are able to reasonably assert that a string is a phone number.  Test it for yourself.
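If you'd rather test it in code, Regex.IsMatch is probably the quickest route.  Here's a minimal sketch; the class and method names (PhonePatternCheck, LooksLikePhoneNumber) are made up for illustration and aren't part of any library.

```csharp
using System.Text.RegularExpressions;

static class PhonePatternCheck
{
    // The pattern from above, unchanged.
    const string Pattern = @"\W*\(*\d{3}\)*\W*\.*\W*\d{3}\W*-*\W*\d{4}\W*";

    // True when the input contains something shaped like a phone number.
    // LooksLikePhoneNumber is a hypothetical helper name.
    public static bool LooksLikePhoneNumber(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }
}
```

With this, "(123) 456 - 7890", "123.456.7890", and "1234567890" should all come back true, while "hello" comes back false.  Keep in mind that the pattern isn't anchored, so it reports a match when a phone number appears anywhere in the string.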

There is a problem with this pattern.  It only confirms that the string matches.  Sure, that’s a good thing, but let us not forget the original problem.  The PhoneNumber class from this third party is string-ignorant.  I need a way to extract the critical parts of this string to an instance of this PhoneNumber class.

How?  Named Capturing Groups, that’s how.  Named capturing groups aren’t too bad.  Let’s look at a contrived example (those are the best).

Capture a string of three digits, named “digits”:

(?<digits>\d{3})

In this pattern, "\d{3}" matches a string like "123", and the match is captured in a group named "digits".  Named groups are written as "(?<name>...)": the group name goes between the angle brackets, as in "<digits>", and the sub-pattern to capture follows it.  Perhaps it will make more sense to see it in practice.

string pattern = @"(?<digits>\d{3})";
string input = "hello 123 world";
Match match = Regex.Match(input, pattern);
string result = match.Groups["digits"].Value;


The Regex class has a static Match() method that takes an input string and a pattern.  The Match object it returns exposes a GroupCollection through its Groups property, which you can index with string keys.  The keys are the names given to the groups.  In this example, our group was named "digits" and the result is "123".  Neat, right?
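One thing worth knowing: when the input doesn't match at all, Regex.Match still returns a Match object, its Success property is false, and the group's Value is an empty string.  Here's a small sketch that makes the distinction explicit; DigitsExample and ExtractDigits are hypothetical names, and returning null on failure is just one way to signal it.

```csharp
using System.Text.RegularExpressions;

static class DigitsExample
{
    const string Pattern = @"(?<digits>\d{3})";

    // Returns the captured digits, or null when the input has no run of three digits.
    // ExtractDigits is a hypothetical helper name.
    public static string ExtractDigits(string input)
    {
        Match match = Regex.Match(input, Pattern);

        // Success tells a real capture apart from "nothing matched".
        return match.Success ? match.Groups["digits"].Value : null;
    }
}
```

Here ExtractDigits("hello 123 world") gives "123", and ExtractDigits("hello world") gives null.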

Now that we have a pattern that can match phone numbers, it would be nice to allow Regex to do the dirty work for us as it matches.  We can take advantage of named capturing groups here:

\W*\(*(?<area>\d{3})\)*\W*\.*\W*(?<exchange>\d{3})\W*-*\W*(?<phone>\d{4})\W*

Now we’re able to capture "area", "exchange", and "phone" while performing a match.  There are lots of ways to implement this.  Since PhoneNumber comes from a third-party library, I chose to write an extension method.

using System.Text.RegularExpressions;

static class PhoneNumberExtensions
{
    const string pattern
        = @"\W*\(*(?<area>\d{3})\)*\W*\.*\W*(?<exchange>\d{3})\W*-*\W*(?<phone>\d{4})\W*";

    // Fills in Area, Exchange, and Phone from the named groups.
    // If the input does not match, all three values are empty strings.
    public static void SetFromString(this PhoneNumber phoneNumber, string input)
    {
        Match match = Regex.Match(input, pattern);

        phoneNumber.Area = match.Groups["area"].Value;
        phoneNumber.Exchange = match.Groups["exchange"].Value;
        phoneNumber.Phone = match.Groups["phone"].Value;
    }
}

 
The code is virtually the same as the first example.  The usage is also simple.

PhoneNumber myPhoneNumber = new PhoneNumber();
myPhoneNumber.SetFromString("(123) 456 - 7890");
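SetFromString quietly assigns empty strings when the input isn't a phone number.  If the caller needs to know whether anything was found, a Try-style variant is one option.  This is only a sketch: the PhoneNumber class below is a stand-in I wrote so the example compiles (the real third-party type isn't shown here), and TrySetFromString is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

// Hypothetical stand-in for the third-party type, assuming three string properties.
public class PhoneNumber
{
    public string Area { get; set; }
    public string Exchange { get; set; }
    public string Phone { get; set; }
}

public static class PhoneNumberTryExtensions
{
    const string Pattern
        = @"\W*\(*(?<area>\d{3})\)*\W*\.*\W*(?<exchange>\d{3})\W*-*\W*(?<phone>\d{4})\W*";

    // Returns false and leaves phoneNumber untouched when the input does not match.
    public static bool TrySetFromString(this PhoneNumber phoneNumber, string input)
    {
        Match match = Regex.Match(input, Pattern);
        if (!match.Success)
        {
            return false;
        }

        phoneNumber.Area = match.Groups["area"].Value;
        phoneNumber.Exchange = match.Groups["exchange"].Value;
        phoneNumber.Phone = match.Groups["phone"].Value;
        return true;
    }
}
```

Calling TrySetFromString("(123) 456 - 7890") returns true and sets Area, Exchange, and Phone to "123", "456", and "7890".  Calling it with "no digits here" returns false.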


These are the kinds of problems I like to solve with regular expressions.  They take the tedium out of dealing with strings.

Thursday, March 25, 2010 9:00:33 PM (Eastern Standard Time, UTC-05:00)
Regular Expressions

I’m the first to admit—I love regular expressions.  It’s kind of a hammer and nail situation.  I see text, I immediately think:

using System.Text.RegularExpressions;

They’re just so useful.  How can you not like them?  Okay, they’re a bit obfuscated and horrid to debug (even for the person who wrote them).  I’m always thinking of my coworkers, so here are a few regurgitated thoughts on how to improve the readability and maintainability of regular expressions.

Since Regular Expressions can be so difficult to understand, it helps to properly document the individual tokens in the pattern.  There are a few reasons behind this:

  1. Check your work up front
  2. Clearly state what the expression intends to match
  3. Clearly state how the expression intends to match

I have this new friend—uh, buddy… his name is RegexBuddy.  Damn, I love this tool.  You can type in an expression, some input text, and see real time results.  But, there’s something invaluable about its presentation.  As you write an expression, it generates this great explanation.  The best part is that it is in plain English.  Looking at this got me thinking a little.

RegexBuddy even allows you to export the explanation to various places.  Seems like the clipboard could come in handy.  Wait, what if we could get this kind of information into comments?  Maybe it could be pasted and massaged into comments to look something like this:

//---------------------------------------------------------------------
/// <summary>
///     This Regular Expression can be used to extract
///     a customer name from the salutation of a form letter.
/// </summary>
#region Regex Explanation
// Expression:
// Dear (?<name>[A-Za-z ]*),
//
// Explanation:
//   Match the characters “Dear ” literally «Dear »
//   Match the regular expression below and capture its match into backreference with name “name” «(?<name>[A-Za-z ]*)»
//     Match a single character present in the list below «[A-Za-z ]*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//       A character in the range between “A” and “Z” «A-Z»
//       A character in the range between “a” and “z” «a-z»
//       The character “ ” « »
//   Match the character “,” literally «,»
//
// Sample Input:
// "Dear Loyal Reader,
//
// Thanks for reading John Coder!  :-)"
//
// Matches:
// "Loyal Reader"
//
// Created with RegexBuddy
#endregion
public const string GetCustomer = @"Dear (?<name>[A-Za-z ]*),";

Is it overkill?  Maybe.  Still, there are a couple of things to note.  Jamming all of this “stuff” into the <summary> XML comment tag will inevitably break it.  Plus, you probably don’t want that much information to pop up in a tooltip anyway.  So, I wrapped it in a region to make it collapsible.  The summary is really a summary, and the drawn-out explanation is deferred to a less obtrusive location.
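A different approach, which isn't the one I took above, is to put the comments inside the pattern itself.  .NET supports this through RegexOptions.IgnorePatternWhitespace, which drops unescaped whitespace from the pattern and treats "#" as the start of a comment.  Here's a rough sketch using the same expression; SalutationPattern, GetCustomerVerbose, and ExtractName are hypothetical names.

```csharp
using System.Text.RegularExpressions;

static class SalutationPattern
{
    // Same expression as GetCustomer, spread over several lines with inline comments.
    // The space after "Dear" sits in a character class because whitespace
    // inside a class is still matched literally under IgnorePatternWhitespace.
    public const string GetCustomerVerbose = @"
        Dear[ ]                 # the literal text Dear followed by one space
        (?<name>[A-Za-z ]*)     # letters and spaces, captured as name
        ,                       # the comma that ends the salutation
    ";

    // Returns the captured name, or null when the letter has no salutation.
    public static string ExtractName(string letter)
    {
        Match match = Regex.Match(
            letter, GetCustomerVerbose, RegexOptions.IgnorePatternWhitespace);
        return match.Success ? match.Groups["name"].Value : null;
    }
}
```

The trade-off is that the documentation lives in the pattern and can't drift away from it, but you have to remember that bare spaces in the pattern no longer match anything.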

Is this easy enough to understand?  Leave a comment and let me know what you think.

Friday, July 17, 2009 12:34:24 AM (Eastern Daylight Time, UTC-04:00)
Coding Horror | Commenting | Jeff Atwood | Maintainability | RegexBuddy | Regular Expressions

John Nelson

I am a passionate C# Developer working in ASP.NET on an e-commerce solution for ticketing software. I work across all of the application layers, including server side functionality, and client side programming with jQuery and MS Ajax. Although my full time job is in WebForms, I spend many of my off hours working with MVC. I am especially interested in productivity and good programming practices.

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2010
johncoder.com