What's a Regular Expression?
I. Introduction
Today Farsight Security announced DNSDB 2.0 Flexible Search for DNSDB API. Flexible Search offers powerful new search capabilities that enhance DNSDB API, and which make it possible to easily do the DNSDB searches you’ve always wished you could make.
Early Adopter Access will be available on August 19th, 2020 and General Availability is scheduled for October 20th, 2020. If you’re interested in applying for Early Adopter access, please contact [email protected].
Flexible Search will be bundled at no charge for paid DNSDB API customers (and customers given access to DNSDB API under a grant from Farsight), but will NOT be included as part of DNSDB Community Edition, the free, entry-level version of our flagship solution.
II. Search Syntax Options
Flexible Search is a “finding aid” that supplements and enhances (but does not replace) Standard DNSDB API.
Flexible Search offers users three search syntax modes in DNSDB Scout, and two otherwise.
Keyword Search: easily search for a brand name or domain name — just type in a word or string of characters to match.
Keyword Search is meant to provide an easy starting point for novice searchers. [Available In DNSDB Scout Only]
Regular Expressions: Regular expressions are the industry-standard way of expressing search patterns. Regular expressions support simple keyword searches, but also give you the most power when you want or need to begin making more-complex pattern searches.
This article is meant to give you the chance to learn a little about regular expressions in general now, before Flexible Search is actually made available for your use.
Globbing: Globbing is the other pattern matching option that will be available in DNSDB 2.0. We offer globbing as an option for those who may prefer it, but please note that it is simultaneously:
- More syntactically complex when it comes to doing basic keyword searches, and
- Much more limited when it comes to supporting non-trivial pattern matching.
If you are nonetheless interested in it, see our new blogpost, “What is Globbing?” Most users who aren’t familiar with either regular expressions or globbing should focus on learning to work with regular expressions, as described in this article, instead.
The goal of this article is to introduce regular expressions (“regexes”) to those who find themselves wondering “what the heck ARE these ‘regex’ things that some ‘techies’ keep talking about?”
At its most basic, a regular expression (or “regex”) is just a string that describes a pattern to be matched.
For example, imagine a program scanning lines in one or more files, looking for lines that contain the regular expression pattern of interest. When it finds a line with that pattern, it prints that line out. Simple as that sounds, regexes can be extremely powerful and useful. Regexes are routinely used in the cybersecurity world by:
- Analysts searching logs and other large data files
- Data scientists massaging input data files so they can be ingested into machine learning models
- Developers validating input fields and so on.
In a comparatively short article like this one, we can only “scratch the surface” when it comes to all the features and capabilities of regular expressions, but we hope that even this short introduction will still serve to pique your interest in regular expressions and motivate you to learn more about them. If that happens, there are a number of regular expression books you should check out, including O’Reilly’s:
- Michael Fitzgerald’s Introducing Regular Expressions
- Jeffrey E. F. Friedl’s Mastering Regular Expressions (3rd Edition)
- Goyvaerts and Levithan’s Regular Expressions Cookbook (2nd Edition)
- Tony Stubblebine’s Regular Expression Pocket Reference (2nd Edition)
III. Sample Data Set
To help illustrate how regular expressions work, we’ve created a small sample data file with fifty-two assorted domain names called domains.txt (see Appendix 1). We’re going to use that file as data for our examples.
IV. The Tool We’ll Use To Demonstrate Regular Expression Matching: GNU egrep
While some of you may participate in our Early Adopter Program, most of you won’t have access to DNSDB Flexible Search until October 20th, 2020. Therefore, we’ve couched the following discussion in terms of the commonly available GNU egrep command. That command treats regular expressions the same way that DNSDB 2.0 Flexible Search will, allowing eager people to get some time in learning and practicing before Flexible Search goes to General Availability status.
Once you get access to Flexible Search, using regular expressions will be as simple as plugging them into the Find field in DNSDB Scout, or using the --regex qualifier to the dnsdbflex command-line client.
The Unix “grep” command name is a “portmanteau” word. It was built from parts of the words in the phrase “globally search for a regular expression and print.”
It is a staple command-line utility on Unix systems (and on Unix-like operating systems such as Linux and Mac OS X) and should exist (in some form or another) on virtually every Unix or Unix-like system. egrep is an enhanced version of grep. It’s what we’re going to use for the examples shown in this article.
On the latest version of Mac OS X (aka Catalina), the system-provided egrep appears to (still) be:
$ /usr/bin/egrep --version
egrep (BSD grep) 2.5.1-FreeBSD
The 2.5.1 version of egrep is known to have bugs that have been fixed in later versions. Unfortunately, because the later versions use a different open source license, Mac OS X has not updated to one of the later versions of egrep where those bugs have been corrected. The bugs are serious enough that they visibly impact the results you may get for even relatively simple queries.
Thus, we normally prefer to use, and recommend that you use, the GNU version of egrep. If GNU egrep is installed on your system (and used by default), you should see something like:
$ egrep --version
grep (GNU grep) 3.3
[etc]
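If GNU egrep is not installed, you may be able to install GNU egrep via your operating system’s package manager. For example, on a Mac using homebrew you can install GNU grep by saying:
$ brew install grep
(Note that Homebrew installs the GNU commands with a “g” prefix, e.g., ggrep and gegrep, unless you add its gnubin directory to your PATH.)
You can also download and install GNU egrep from source.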
V. Regular Expression “Building Block” Characters
Regular expressions are just strings (sometimes quite cryptic-looking strings, but still, just strings). We’ll normally put regular expressions inside single quote marks. Each regex gets built using a combination of:
- “Literal” characters (e.g., uppercase and lowercase letters, numbers, and some symbols)
- “Meta” characters: there are many symbols that serve as a shorthand for special things such as “match any one character here.” The meta characters that normally do special things in at least some circumstances are as follows:
Character   Name                  Special Thing This Symbol Means
\           backslash             "Escapes" the character after this one
.           dot                   Match any one character here
*           star                  Repeat the previous zero or more times
^           caret                 Match start of line
$           dollar sign           Match end of line
?           question mark         Optional (match zero or one time only)
+           plus sign             Match one or more times
|           vertical bar          Logical or (match either)
{ and }     curly braces          Repetition count {min}, {min,max}, or {,max}
( and )     parentheses           Define logical subexpression ("create grouping")
[           left square bracket   Define character class
If you want to literally match those metacharacters, prefix them with a backslash. [Note: Some versions of egrep may attempt to guess if a metacharacter “should” be treated as a metacharacter or as the literal character. That is risky, however, so we generally urge you to explicitly indicate if you want a metacharacter to be treated as a literal.]
- “Character classes:” These can be of two types: shorthand character classes, and bracketed character classes.
Shorthand character classes (such as \w, \d, \s) are used in some regular expression implementations, but will not be available in DNSDB’s Flexible Search regex implementation.
Bracketed character classes are either predefined character classes that look like the following (this is not an exhaustive list of these):
[[:alpha:]]    Any upper or lower case alphabetic character
[[:digit:]]    Any digit from 0 to 9
[[:alnum:]]    Any alphanumeric character
[[:xdigit:]]   Any hexadecimal digit (e.g., 0-9 plus A-F or a-f)
or classes that the user defines, such as
[aeiouy]    Matches any vowel (or pseudo-vowel, in the case of "y")
[^aeiouy]   Matches any NON-vowel (including other letters, numbers, symbols, etc.)
Note that MOST metacharacters lose their special meanings within square brackets (a notable exception is the caret symbol, as just shown in the [^aeiouy] example).
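As a minimal sketch of bracketed character classes in action, the following pipes three names from the sample data through grep -E (which accepts the same extended regular expressions as egrep). The names are supplied inline rather than read from domains.txt, and the variable names are ours, chosen purely for illustration:

```shell
# Three names from the sample data, supplied inline so the sketch is self-contained.
names='office365.com
4j.lane.edu
eugene-or.gov'

# Predefined class: keep names containing at least one digit.
with_digit=$(printf '%s\n' "$names" | grep -E '[[:digit:]]')
printf 'contain a digit:\n%s\n' "$with_digit"

# User-defined negated class: keep names containing at least one character
# that is neither alphanumeric nor a dot (here, the hyphen).
odd_char=$(printf '%s\n' "$names" | grep -E '[^[:alnum:].]')
printf 'contain something other than letters, digits and dots:\n%s\n' "$odd_char"
```

Note that the dot inside the square brackets is just a literal dot; it does not need a backslash there.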
VI. The Simplest Regular Expression: Matching A Literal Substring
Let’s use a regex to find lines from our sample data file that contain the literal string “off”. We’ll run the egrep command from a Terminal window on our Mac:
$ egrep 'off' domains.txt
coffee.com
office.com
office365.com
This is a pretty straight-forward command: it takes a regular expression (in this case the literal string off, in single quote marks) and looks for matching lines in the specified file (domains.txt). Three “hits” are found: coffee.com, office.com and office365.com. Those get printed out when we run that command.
While in this case we just looked for a short three-character string, we could have looked for a single character, many characters, or even multiple words. (Just be sure to enclose the literal string to be matched in single quote marks if the string includes spaces!)
If we want to find lines that DON’T contain the string ‘off’, we can use the egrep -v option, which outputs only the lines that DON’T match the specified pattern:
$ egrep -v 'off' domains.txt
[all lines EXCEPT coffee.com, office.com and office365.com get output here]
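One quick way to convince yourself that -v really selects the complement is to count both sides. The sketch below is only an illustration: it uses a four-name inline sample instead of domains.txt, grep -E in place of egrep, and the standard -c option (not otherwise used in this article), which prints a count of selected lines instead of the lines themselves:

```shell
# Small inline sample (a subset of the article's data file).
sample='coffee.com
office.com
github.com
reed.edu'

# Count the lines that contain 'off' ...
hits=$(printf '%s\n' "$sample" | grep -E -c 'off')

# ... and the lines that do NOT contain 'off'.
misses=$(printf '%s\n' "$sample" | grep -E -v -c 'off')

echo "$hits matching + $misses non-matching = $((hits + misses)) lines"
```

Every input line lands in exactly one of the two groups, so the two counts always add up to the number of lines read.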
VII. Case (In)sensitivity
Regular expressions are case sensitive by default (so if we’d looked for ‘OFF’ instead of ‘off’, we wouldn’t have found any matches).
If we want egrep to do case-insensitive matches, we can add the -i option to our egrep command:
$ egrep -i 'OFF' domains.txt
coffee.com
office.com
office365.com
VIII. Figuring Out WHAT’S Matching
Another handy option to egrep is the --color option. It highlights the text that matches the regular expression we supplied:
$ egrep --color 'off' domains.txt
coffee.com
office.com
office365.com
We don’t urgently need this option to understand such a simple match, but when regexes get more complex – or we make a mistake constructing our regex – highlighting the text that matched a regex can really come in handy as a debugging tool.
IX. Matching EITHER of Two Literal Substrings
Let’s do another literal substring regex.
What if we want to find lines that have EITHER the literal substring ‘go’ OR the literal substring ‘off’?
GNU egrep can help us do that with the vertical bar (or “pipe”) meta character.
$ egrep --color 'go|off' domains.txt
coffee.com
duckduckgo.com
eugene-or.gov
google.com
house.gov
office.com
office365.com
oregonstate.edu
senate.gov
supremecourt.gov
uoregon.edu
whitehouse.gov
Note that the vertical bar (“pipe”) character is a metacharacter: it does NOT need to be physically part of the string text we’re matching.
If helpful or necessary, you can also use parentheses to set off the limits of an alternating match. For example:
$ egrep --color 'e(go|ug)' domains.txt
eugene-or.gov
oregonstate.edu
uoregon.edu
That pattern matches all records that have ego or eug in them.
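To see why the parentheses matter, it can help to compare the grouped pattern with an ungrouped one. This is a small sketch using grep -E (equivalent to egrep) on an inline sample; rug.example is a made-up name added only to expose the difference, and the variable names are ours:

```shell
# Two names from the sample data plus one hypothetical name (rug.example).
sample='uoregon.edu
eugene-or.gov
rug.example'

# Grouped: an "e" followed by either "go" or "ug".
grouped=$(printf '%s\n' "$sample" | grep -E 'e(go|ug)')
printf 'grouped:\n%s\n' "$grouped"

# Ungrouped: either the whole string "ego" or the whole string "ug",
# so the leading "e" no longer applies to the second alternative.
ungrouped=$(printf '%s\n' "$sample" | grep -E 'ego|ug')
printf 'ungrouped:\n%s\n' "$ungrouped"
```

Without the grouping, rug.example matches too, because the bare alternative ug is satisfied even though no e precedes it.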
X. The Dot
Up until this part of the article, we’ve been matching literal strings. That’s cool and useful, but the real power of regular expressions comes when we begin to work with wildcards – in this case literally the dot (“.”) character. Dot is a metacharacter that matches any single character.
$ egrep --color 'g.p' domains.txt
blogspot.com
If we have two dots in a row, that matches any two characters:
$ egrep --color 'r..e' domains.txt
lclark.edu
marines.mil
supremecourt.gov
and we could also search for any three characters in a row, any four characters in a row, etc.
Note that if we want to match an ACTUAL dot (and dots are obviously VERY common in domain names), we need to ask to match an escaped (“backslashed”) dot:
$ egrep --color '\.k12\.' domains.txt
bethel.k12.or.us
cal.k12.or.us
springfield.k12.or.us
If we didn’t remember to escape those “real dots,” the unescaped dots would still match real dots, but they’d also match any OTHER single character in that spot, too.
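A small sketch of that pitfall, using grep -E (equivalent to egrep) on an inline sample. The name ok12x.example is invented purely for this illustration, as are the variable names:

```shell
# One real name from the sample data and one hypothetical look-alike.
sample='bethel.k12.or.us
ok12x.example'

# Unescaped dots: "any character, then k12, then any character".
loose=$(printf '%s\n' "$sample" | grep -E '.k12.')
printf 'unescaped dots match:\n%s\n' "$loose"

# Escaped dots: a literal dot, then k12, then a literal dot.
strict=$(printf '%s\n' "$sample" | grep -E '\.k12\.')
printf 'escaped dots match:\n%s\n' "$strict"
```

The unescaped pattern accepts ok12x.example because the "o" and the "x" each satisfy a wildcard dot; the escaped pattern does not.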
XI. The Power of “dot star”
If you think dot was cool, wait until you learn about dot star (‘.*’) – it’s VERY cool!
- dot stands for “match any character”
- star stands for “repeat the previous match zero or more times”
If we had a regular expression that was simply ‘.*’ it would match all lines.
Therefore, most matches that contain ‘.*’ also include other specific patterns to match. For example, let’s find lines that have a b, then zero or more other characters, then a c:
$ egrep --color 'b.*c' domains.txt
bing.com
blogspot.com
crabcake.com
ebay.com
facebook.com
github.com
youtube.com
If we didn’t have the star metacharacter to give us flexibility here, we’d have to write a much “clunkier” regex with all possible patterns of zero or more dots in between the two letters of interest:
$ egrep '(bc|b.c|b..c|b...c|b....c|b.....c|b......c|b.......c|b........c)' domains.txt
[same output as the previous example omitted here]
Yuck! And just imagine how ugly that expression would get if one of the domain names in the file happened to be a long name with a b near the start and a c twenty or thirty characters later! Truly, the “magic of dot star” is a huge convenience when it comes to writing some regular expressions.
XII. A Note About “dot star” Matches: “Greed Is Good”
When GNU egrep finds matches, sometimes there are different options that might work. For example, if you asked to match '^st.*o' there are three ways it could match one line from our sample data (the matched text is shown in square brackets):
[stacko]verflow.com OR [stackoverflo]w.com OR [stackoverflow.co]m
All three of those matches start with “st” and end with “o”, right? But which one of those will GNU egrep return by default?
The answer is that GNU egrep agrees with the fictional character Gordon Gekko, played by Michael Douglas in the 1987 movie “Wall Street,” who became (in)famous for saying “Greed is good.”
By default, wildcards in grep will always try to match as much as possible while still satisfying the requested pattern. So in this example, it will match as shown in the last of the possible results, [stackoverflow.co]m.
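You can check the greedy behavior yourself with the -o option (not otherwise used in this article), which prints only the part of each line that matched. The sketch below uses grep -E, equivalent to egrep. The second pattern is our own illustration of one common way to get the shortest match instead: replace the dot with a negated character class so the wildcard cannot run past the first "o".

```shell
# Greedy: .* grabs as much as it can, so the match runs to the LAST "o".
greedy=$(printf 'stackoverflow.com\n' | grep -E -o '^st.*o')
echo "greedy match:   $greedy"

# Workaround for a shortest match: [^o]* cannot cross an "o",
# so the match stops at the FIRST "o".
shortest=$(printf 'stackoverflow.com\n' | grep -E -o '^st[^o]*o')
echo "shortest match: $shortest"
```

The whole line is selected either way; -o simply makes visible how much of it the pattern consumed.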
XIII. Matching A Single Character That’s Part of an Enumerated Set of Characters
We’ve seen how dot matches any SINGLE character, and dot star matches any ZERO OR MORE characters. But what if we want to match a single character from just an enumerated set of characters? For example, what if we want to match:
- the character b
- zero or more other characters
- at least one vowel (a, e, i, o, u, or y)
- zero or more other characters
- the character c
It turns out that regular expressions can help us do this as well, using square brackets (as introduced in Section V, above) to define a character set:
$ egrep --color 'b.*[aeiouy].*c' domains.txt
bing.com
blogspot.com
crabcake.com
ebay.com
facebook.com
youtube.com
If you’re referring to a contiguous range of characters, rather than a short, enumerated list of characters, you can take advantage of the dash character to avoid having to type a long list:
- [a-z] (lowercase letters)
- [a-zA-Z] (uppercase and lowercase letters)
- [a-zA-Z0-9] (uppercase and lowercase letters plus digits)
If you want to put a literal caret (^) in a list of characters, you can, just don’t put it first (if you do, it will be interpreted as meaning “take the complement of the characters that follow”).
If you want to include a literal right square bracket (]) in a list of characters, you can, but it must be the FIRST character in the list.
If you want to put a literal dash (-) in a list of characters in square brackets, put it LAST.
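The three placement rules above can be combined in a single character set. In this sketch (grep -E, equivalent to egrep), only eugene-or.gov comes from the sample data; the other test strings are made up so that each special character gets exercised:

```shell
# One real name plus three hypothetical strings for illustration.
sample='eugene-or.gov
a^b
x]y
plain'

# The set []^-] contains three literal characters:
#   ]  placed first, ^  placed somewhere other than first, -  placed last.
special=$(printf '%s\n' "$sample" | grep -E '[]^-]')
printf 'lines containing ], ^ or -:\n%s\n' "$special"
```

Only the line "plain" is left out, since it contains none of the three characters.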
XIV. Repetition Factors (“Counts”)
You can also use “repetition factors” or “counts” to ask for multiples of patterns. For example, if you wanted to find names from our sample file that had two successive vowels, you could write:
$ egrep --color '[aeiouy]{2}' domains.txt
coffee.com
ebay.com
eou.edu
eugene-or.gov
facebook.com
freedom.com
geoduck.com
google.com
house.gov
oit.edu
paypal.com
reed.edu
sou.edu
springfield.k12.or.us
supremecourt.gov
uoregon.edu
whitehouse.gov
wikipedia.org
wou.edu
yahoo.com
youtube.com
In addition to asking for exactly a specific value, you can also specify a repetition range, such as:
- {2,5} (meaning match between two and five times)
- {3,} (meaning match if present at least 3 times)
- ? (can appear zero or one time only – same as saying {0,1})
- * (can appear zero or more times – same as saying {0,})
- + (can appear one or more times – same as saying {1,})
For example:
$ egrep --color 'ube?\.com$' domains.txt
github.com
youtube.com
In this case, the “e” was optional, which is why youtube.com AND github.com successfully matched.
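Here is a short sketch of an exact count and an open-ended range applied to a character set and to a single literal. It uses grep -E (equivalent to egrep) on four names from the sample data, supplied inline; the variable names are ours:

```shell
# Four names from the sample data, supplied inline.
sample='google.com
facebook.com
stackoverflow.com
reed.edu'

# Exact count: two "o" characters in a row.
double_o=$(printf '%s\n' "$sample" | grep -E 'o{2}')
printf 'contain "oo":\n%s\n' "$double_o"

# Open-ended range: a run of at least eight lowercase letters.
long_run=$(printf '%s\n' "$sample" | grep -E '[a-z]{8,}')
printf 'contain 8+ consecutive letters:\n%s\n' "$long_run"
```

stackoverflow.com has two "o" characters, but they are not adjacent, so it only matches the second pattern.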
XV. Anchors
The patterns that we’ve been matching have all been “floating” patterns. Those patterns can potentially match suitable text seen anywhere in lines they’re scanning. But what if we only want to match a particular pattern at the start of a line, or at the end of a line? Those types of searches are called “anchored searches,” and we can use special metacharacters to limit our results:
^ (the caret symbol)    "At the start of the line"
$ (the dollar sign)     "At the end of the line"
For example, let’s find the domains in the file that are ‘.edu’ domains:
$ egrep --color '\.edu$' domains.txt
4j.lane.edu
eou.edu
lanecc.edu
lclark.edu
oit.edu
oregonstate.edu
pdx.edu
reed.edu
sou.edu
uoregon.edu
willamette.edu
wou.edu
Important note: In DNSDB 2.0 Flexible Search, domain names in RRnames (and some Rdata) are written with a “formal ending dot.” Literal dots are also escaped with a backslash. That means that the domain name wou.edu would be written in regular expression format as:
wou\.edu\.$
If that’s the case for the data you’re matching against, the anchored search would need to be written:
$ egrep --color '\.edu\.$' domains.txt
rather than just
$ egrep --color '\.edu$' domains.txt
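The effect of that trailing dot can be simulated locally. This is only a sketch under stated assumptions: sed is used to append a dot to each name so the input resembles the fully-qualified form, it is not actual DNSDB output, and grep -E stands in for egrep:

```shell
# Two names from the sample data.
names='wou.edu
yahoo.com'

# Append a trailing dot to every line to mimic the "formal ending dot".
fqdn=$(printf '%s\n' "$names" | sed 's/$/./')

# The pattern that accounts for the trailing dot finds the .edu name ...
with_dot=$(printf '%s\n' "$fqdn" | grep -E '\.edu\.$')
echo "matched by '\\.edu\\.\$': $with_dot"

# ... while the original anchored pattern no longer matches anything.
# (grep exits non-zero when nothing matches, hence the "|| true".)
without_dot_count=$(printf '%s\n' "$fqdn" | grep -E -c '\.edu$' || true)
echo "lines matched by '\\.edu\$': $without_dot_count"
```

The end-of-line anchor is unforgiving: once the data ends in a dot, a pattern that expects "edu" to be the very last thing on the line silently returns nothing.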
Or as another example, let’s find the domains that begin with an s:
$ egrep --color '^s' domains.txt
senate.gov
sou.edu
springfield.k12.or.us
stackoverflow.com
supremecourt.gov
XVI. Specialty Versions of grep
You may also want to know that there are some “specialty” versions of grep, such as:
agrep: “approximate GREP for fast fuzzy string searching.”
cidrgrep: “A grep-like tool used to filter IP addresses against one or more CIDR network patterns.” (see also grepcidr)
ripgrep: an extremely fast grep implementation that also supports searching non-UTF8 files, searching compressed files, and much more.
XVII. Conclusion
You’ve now had a bit of a whirlwind introduction to regular expressions. If you want to learn more, check out the books mentioned in the introduction, or consider trying one of the online interactive regular expression tutorials.
Regular expressions may feel a bit like they’re “brain teasers” or puzzles from The New York Times puzzle page, but if you tackle them with the right attitude, you may find they’re exceptionally powerful and sort of fun, too!
Acknowledgements
The author would like to acknowledge valuable reviews and commentary from colleagues, including (in alphabetical order) Chris Mikkelson, Jeremy Reed, Chuq Von Rospach, Stephen Watt, and Eric Ziegast. Any remaining issues or errors are solely the responsibility of the author.
APPENDIX 1. DOMAINS USED IN THIS ARTICLE’S EXAMPLES
$ cat domains.txt
4j.lane.edu
af.mil
amazon.com
apple.com
army.mil
bethel.k12.or.us
bing.com
blogspot.com
cal.k12.or.us
coffee.com
crabcake.com
duckduckgo.com
ebay.com
eou.edu
eugene-or.gov
facebook.com
freedom.com
geoduck.com
github.com
google.com
house.gov
instagram.com
lanecc.edu
lclark.edu
linkedin.com
live.com
marines.mil
microsoft.com
msn.com
navy.mil
netflix.com
office.com
office365.com
oit.edu
oregonstate.edu
paypal.com
pdx.edu
reddit.com
reed.edu
senate.gov
sou.edu
springfield.k12.or.us
stackoverflow.com
supremecourt.gov
twitter.com
uoregon.edu
whitehouse.gov
wikipedia.org
willamette.edu
wou.edu
yahoo.com
youtube.com
ENDNOTE
More on validating input fields: An example of input validation might be a rule that says “the employee salary field can only contain numbers, commas, a decimal point and/or a dollar sign.” “$83,412.15” would pass that validation definition but “$7K/month” would not. More carefully-defined validation rules might be used to screen out typos/data entry errors such as “8$3,412.15” or “$83,,412.15” or “$83,412.155”. Validation rules might also be used to identify likely-out-of-range values such as “$8341215”.
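As a rough sketch of what one of those “more carefully-defined” rules could look like, the hypothetical pattern below requires a leading dollar sign, one to three digits, then zero or more comma-separated groups of exactly three digits, and optionally a decimal point followed by exactly two digits. The function name and the exact rule are our own assumptions for illustration, not a standard; grep -E stands in for egrep, and -q simply suppresses output so only the exit status is used:

```shell
# Hypothetical validator: succeeds (exit status 0) only for well-formed amounts.
is_valid_salary() {
  printf '%s\n' "$1" | grep -E -q '^\$[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{2})?$'
}

# Try the examples from the endnote.
for value in '$83,412.15' '$7K/month' '8$3,412.15' '$83,,412.15' '$83,412.155'; do
  if is_valid_salary "$value"; then
    echo "accept: $value"
  else
    echo "reject: $value"
  fi
done
```

The anchors at both ends are what make this a validation rule rather than a search: the entire field must fit the pattern, not just some part of it.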
Joe St Sauver Ph.D. is a Distinguished Scientist and Director of Research with Farsight Security®, Inc.