Opinions on regular expressions vary widely, from obliviousness to adoration to disgust. It’s no secret that I commonly use regular expressions to solve silly string problems. I maintain that they can be simple to understand and simple to unit test when used often and responsibly. Using them in small doses is a good way to get over the learning curve. Here’s a simple problem I came across that I was able to solve quickly, without going far out of my way.
Imagine that I’m working in a source base where a Person class contains a string property called Phone, which holds a phone number. That string property can contain an impressive variety of phone number formats. Now imagine that I was tasked with consuming a third-party library, which defines its own Person object. Instead of a string property for the phone number, it defines a PhoneNumber class with properties for Area, Exchange, and Phone. Its constructor is parameterless and it offers no way to hand it a raw string, so it holds me to a strict, formal definition of a phone number. Crap.
This is where regular expressions shine. Sure, I could write a ton of code and account for a million and one cases of what a phone number could look like. Or, I could be generic about it and describe the pattern that phone numbers follow:
\W*\(*\d{3}\)*\W*\.*\W*\d{3}\W*-*\W*\d{4}\W*
This pattern roughly matches groups of 3, 3, and 4 digits with some input forgiveness in between (stray spaces, dots, dashes, and parentheses). With a pattern like this, we can reasonably assert that a string contains a phone number. Test it for yourself.
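If you’d rather test it in code, here is a minimal sketch. The class and method names (PhonePatternCheck, LooksLikePhoneNumber) are made up for illustration; the only real API involved is Regex.IsMatch.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for trying the pattern out; not part of any library.
public static class PhonePatternCheck
{
    // The pattern from above, unchanged.
    public const string Pattern = @"\W*\(*\d{3}\)*\W*\.*\W*\d{3}\W*-*\W*\d{4}\W*";

    // Returns true when the pattern is found anywhere in the input.
    // The pattern is not anchored, so text around the number is tolerated.
    public static bool LooksLikePhoneNumber(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }
}
```

Calling PhonePatternCheck.LooksLikePhoneNumber with "(123) 456 - 7890", "123.456.7890", or "1234567890" returns true, while "12345" and "hello" return false. Because nothing anchors the pattern to the start or end of the string, a longer run of digits will also pass, so treat this as a sanity check and not as strict validation.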
There is a problem with this pattern. It only confirms that the string matches. Sure, that’s a good thing, but let us not forget the original problem. The PhoneNumber class from this third party is string-ignorant. I need a way to extract the critical parts of this string to an instance of this PhoneNumber class.
How? Named Capturing Groups, that’s how. Named capturing groups aren’t too bad. Let’s look at a contrived example (those are the best).
Capture a string of three digits, named “digits”:
(?<digits>\d{3})
In this pattern, "\d{3}" matches a string like "123" and is grouped under the key "digits". Groups are created by using "(?<name>...)". The group name goes between the angle brackets, "<digits>", and the sub-pattern to capture follows it. Perhaps it will make more sense to see it in practice.
string pattern = @"(?<digits>\d{3})";
string input = "hello 123 world";
Match match = Regex.Match(input, pattern);
string result = match.Groups["digits"].Value;
The Regex class has a static Match() method that takes an input string and a pattern. The Match object returned from this call exposes a GroupCollection through its Groups property, which you can index with string keys. The keys are the names given to the groups. In this example, our group was named "digits" and the result is "123". Neat, right?
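One thing the snippet above glosses over is what happens when nothing matches. In that case Match.Success is false and the group’s Value is an empty string. Here is a small sketch of a guarded version; DigitsExample and TryGetDigits are names I invented for the example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the "digits" example with a success check.
public static class DigitsExample
{
    // Finds the first run of three digits in the input.
    // Returns false (and an empty string) when there is no such run.
    public static bool TryGetDigits(string input, out string digits)
    {
        Match match = Regex.Match(input, @"(?<digits>\d{3})");

        // On a failed match the group's Value is already "", but checking
        // Success makes the two outcomes explicit for the caller.
        digits = match.Success ? match.Groups["digits"].Value : string.Empty;
        return match.Success;
    }
}
```

With "hello 123 world" this returns true and sets digits to "123". With "hello world" it returns false and digits is empty.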
Now that we have a pattern that can match phone numbers, it would be nice to allow Regex to do the dirty work for us as it matches. We can take advantage of named capturing groups here:
\W*\(*(?<area>\d{3})\)*\W*\.*\W*(?<exchange>\d{3})\W*-*\W*(?<phone>\d{4})\W*
Now we’re able to capture "area", "exchange", and "phone" while performing a match. There are lots of ways to implement this. Since PhoneNumber belongs to a third-party library that I can’t change, I chose to write an extension method.
using System.Text.RegularExpressions;

static class PhoneNumberExtensions
{
    // Captures the area code, exchange, and line number as named groups.
    const string pattern
        = @"\W*\(*(?<area>\d{3})\)*\W*\.*\W*(?<exchange>\d{3})\W*-*\W*(?<phone>\d{4})\W*";

    public static void SetFromString(this PhoneNumber phoneNumber, string input)
    {
        Match match = Regex.Match(input, pattern);

        // If the input does not match, each group's Value is an empty string.
        phoneNumber.Area = match.Groups["area"].Value;
        phoneNumber.Exchange = match.Groups["exchange"].Value;
        phoneNumber.Phone = match.Groups["phone"].Value;
    }
}
The code is virtually the same as the first example. The usage is also simple.
PhoneNumber myPhoneNumber = new PhoneNumber();
myPhoneNumber.SetFromString("(123) 456 - 7890");
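If the input might not be a phone number at all, a Try-style variant may be a safer fit. This is only a sketch: the PhoneNumber class below is a stand-in I wrote so the example compiles on its own (I’m assuming the real one exposes string properties with these names), and TrySetFromString is a name I made up.

```csharp
using System.Text.RegularExpressions;

// Stand-in for the third-party class so this sketch is self-contained.
// Assumption: the real class has string properties with these names.
public class PhoneNumber
{
    public string Area { get; set; } = "";
    public string Exchange { get; set; } = "";
    public string Phone { get; set; } = "";
}

public static class PhoneNumberTryExtensions
{
    // Same pattern as above, built once and reused across calls.
    private static readonly Regex PhonePattern = new Regex(
        @"\W*\(*(?<area>\d{3})\)*\W*\.*\W*(?<exchange>\d{3})\W*-*\W*(?<phone>\d{4})\W*");

    // Hypothetical variant of SetFromString: returns false and leaves
    // the object untouched when the input does not match.
    public static bool TrySetFromString(this PhoneNumber phoneNumber, string input)
    {
        Match match = PhonePattern.Match(input);
        if (!match.Success)
        {
            return false;
        }

        phoneNumber.Area = match.Groups["area"].Value;
        phoneNumber.Exchange = match.Groups["exchange"].Value;
        phoneNumber.Phone = match.Groups["phone"].Value;
        return true;
    }
}
```

Calling TrySetFromString("(123) 456 - 7890") on a new PhoneNumber returns true and fills in "123", "456", and "7890". Calling it with "not a phone" returns false and changes nothing, which lets the caller decide how to handle bad input.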
These are the kinds of problems I like to solve with regular expressions. They take the tedium out of dealing with strings.