So far we have been reading through files, looking for patterns and extracting various bits of lines that we find interesting. We have been using string methods like find and using lists and string slicing to extract portions of the lines.
This task of searching and extracting is so common that Python has a very powerful library called regular expressions that handles many of these tasks quite elegantly. We have not introduced regular expressions earlier in the book because, while they are very powerful, they are a little complicated and their syntax takes some getting used to.
Regular expressions are almost their own little programming language for searching and parsing strings. As a matter of fact, entire books have been written on the topic of regular expressions. In this chapter, we will only cover the basics of regular expressions. For more detail on regular expressions, see:
The regular expression library re must be imported into your program before you can use it. The simplest use of the regular expression library is the search() function. The following program demonstrates a trivial use of the search function.
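A minimal sketch of such a program is shown here. To keep it self-contained, it loops over a short list of made-up sample lines instead of opening a mail file; the sample data and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample lines standing in for the lines of a mail file.
lines = [
    'From: alice@example.com',
    'Subject: meeting notes',
    'Reply sent. From: bob@example.com',
]

matches = []
for line in lines:
    line = line.rstrip()
    # re.search() returns a match object if the pattern occurs anywhere
    # in the line, and None otherwise.
    if re.search('From:', line):
        matches.append(line)
        print(line)
```

Both lines containing "From:" are kept, including the one where it appears in the middle of the line.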
We open the file, loop through each line, and use the regular expression search() to only print out lines that contain the string "From:". This program does not use the real power of regular expressions, since we could have just as easily used line.find() to accomplish the same result.
The power of regular expressions comes when we add special characters to the search string that allow us to more precisely control which lines match the string. Adding these special characters to our regular expression allows us to do sophisticated matching and extraction while writing very little code.
For example, the caret character is used in regular expressions to match "the beginning" of a line. We could change our program to only match lines where "From:" was at the beginning of the line as follows:
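A small sketch of the caret in use, again run against made-up sample lines rather than a real file (the data and names are illustrative assumptions):

```python
import re

# Hypothetical sample lines; only the first one begins with "From:".
lines = [
    'From: alice@example.com',
    'Subject: meeting notes',
    'Reply sent. From: bob@example.com',
]

# The caret anchors the pattern to the beginning of the line.
matches = [line for line in lines if re.search('^From:', line)]
print(matches)
```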
Now we will only match lines that start with the string "From:". This is still a very simple example that we could have done equivalently with the startswith() method from the string library. But it serves to introduce the notion that regular expressions contain special action characters that give us more control as to what will match the regular expression.
Character matching in regular expressions
There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the period or full stop, which matches any character.
In the following example, the regular expression "F..m:" would match any of the strings "From:", "Fxxm:", "F12m:", or "F!@m:" since the period characters in the regular expression match any character.
This is particularly powerful when combined with the ability to indicate that a character can be repeated any number of times using the "*" or "+" characters in your regular expression. These special characters mean that instead of matching a single character in the search string, they match zero-or-more characters (in the case of the asterisk) or one-or-more of the characters (in the case of the plus sign).
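The behavior of the period, asterisk, and plus sign can be checked with a short experiment. The candidate strings and the "ab*c" and "ab+c" patterns below are made up for illustration.

```python
import re

# The period matches any single character, so "F..m:" needs exactly
# two characters between the "F" and the "m:".
candidates = ['From:', 'Fxxm:', 'F12m:', 'F!@m:', 'Fm:', 'Froom:']
dot_matches = [s for s in candidates if re.search('F..m:', s)]
print(dot_matches)

# "*" means zero or more of the preceding character, "+" means one or more.
samples = ['ac', 'abc', 'abbbc']
star_matches = [s for s in samples if re.search('ab*c', s)]
plus_matches = [s for s in samples if re.search('ab+c', s)]
print(star_matches)
print(plus_matches)
```

Note that "ac" matches "ab*c" (zero occurrences of "b" are allowed) but not "ab+c".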
We can further narrow down the lines that we match using a repeated wild card character in the following example:
The search string "^From:.+@" will successfully match lines that start with "From:", followed by one or more characters (".+"), followed by an at-sign. So this will match the following line:
You can think of the ".+" wildcard as expanding to match all the characters between the colon character and the at-sign.
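As a rough illustration, the pattern "^From:.+@" can be tried on a few made-up lines (the sample data is an assumption, not taken from a real mail file):

```python
import re

# Hypothetical sample lines.
lines = [
    'From: alice@example.com',
    'From: no address here',
    'To: bob@example.com',
]

# A line must start with "From:" and contain an at-sign later on.
matches = [line for line in lines if re.search('^From:.+@', line)]
print(matches)
```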
It is good to think of the plus and asterisk characters as "pushy". For example, the following string would match the last at-sign in the string as the ".+" pushes outwards, as shown below:
It is possible to tell an asterisk or plus sign not to be so "greedy" by adding another character. See the detailed documentation for information on turning off the greedy behavior.
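The sketch below contrasts the two behaviors on a made-up line containing several at-signs. In Python's re module, the extra character that turns off greediness is a question mark placed after the "+" or "*".

```python
import re

# Hypothetical line with three at-signs.
text = 'From: alice@example.com, bob@example.org, and carol @example.net'

# Greedy: ".+" pushes outward as far as it can, so the match ends
# at the last at-sign in the string.
greedy = re.search('^From:.+@', text).group()

# Non-greedy: ".+?" stops at the first at-sign it can reach.
lazy = re.search('^From:.+?@', text).group()

print(greedy)
print(lazy)
```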
Extracting data using regular expressions
If we want to extract data from a string in Python we can use the findall() method to extract all of the substrings which match a regular expression. Let's use the example of wanting to extract anything that looks like an email address from any line regardless of format. For example, we want to pull the email addresses from each of the following lines:
From firstname.lastname@example.org Sat Jan 5 09:14:16 2008
Return-Path: <email@example.com>
for <firstname.lastname@example.org>;
Received: (from apache@localhost)
Author: email@example.com
We don't want to write code for each of the types of lines, splitting and slicing differently for each line. The following program uses findall() to find the lines with email addresses in them and extract one or more addresses from each of those lines.
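A minimal sketch of findall() applied to a single string is shown here. The sample sentence and its addresses are made up for illustration.

```python
import re

# Hypothetical string containing two addresses and a stray "@2PM".
s = 'A message from alice@example.com to bob@example.org about meeting @2PM'

# \S matches one non-whitespace character; a raw string keeps the
# backslash intact.
lst = re.findall(r'\S+@\S+', s)
print(lst)
```

The stray "@2PM" is not returned because nothing but a space comes before its at-sign.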
The findall() method searches the string in the second argument and returns a list of all of the strings that look like email addresses. We are using a two-character sequence that matches a non-whitespace character (\S).
The output of the program would be:
Translating the regular expression, we are looking for substrings that have at least one non-whitespace character, followed by an at-sign, followed by at least one more non-whitespace character. The "\S+" matches as many non-whitespace characters as possible.
The regular expression would match twice (firstname.lastname@example.org and email@example.com), but it would not match the string "@2PM" because there are no non-blank characters before the at-sign. We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an email address as follows:
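One way such a program might look is sketched below. A short list of made-up lines stands in for the file so that the example runs on its own; with a real file you would loop over the open file handle instead.

```python
import re

# Hypothetical sample lines standing in for the contents of a mail file.
lines = [
    'From alice@example.com Sat Jan  5 09:14:16 2008',
    'Return-Path: <bob@example.org>',
    'Received: (from apache@localhost)',
    'Subject: no address on this line',
]

found = []
for line in lines:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    # findall() returns an empty list when nothing matches.
    if len(x) > 0:
        found.append(x)
        print(x)
```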
We read each line and then extract all the substrings that match our regular expression. Since findall() returns a list, we simply check if the number of elements in our returned list is more than zero to print only lines where we found at least one substring that looks like an email address.
If we run the program on mbox.txt we get the following output:
['firstname.lastname@example.org']
['email@example.com']
['<firstname.lastname@example.org>']
['<200801032122.m03LMFo4005148@nakamura.uits.iupui.edu>']
['<email@example.com>;']
['<firstname.lastname@example.org>;']
['<email@example.com>;']
['apache@localhost)']
['firstname.lastname@example.org;']
Some of our email addresses have incorrect characters like "<" or ";" at the beginning or end. Let's declare that we are only interested in the portion of the string that starts and ends with a letter or a number.
To do this, we use another feature of regular expressions. Square brackets are used to indicate a set of multiple acceptable characters we are willing to consider matching. In a sense, the "\S" is asking to match the set of "non-whitespace characters". Now we will be a little more explicit in terms of the characters we will match.
Here is our new regular expression:
[a-zA-Z0-9]\S*@\S*[a-zA-Z]
This is getting a little complicated and you can begin to see why regular expressions are their own little language unto themselves. Translating this regular expression, we are looking for substrings that start with a single lowercase letter, uppercase letter, or number "[a-zA-Z0-9]", followed by zero or more non-blank characters ("\S*"), followed by an at-sign, followed by zero or more non-blank characters ("\S*"), followed by an uppercase or lowercase letter. Note that we switched from "+" to "*" to indicate zero or more non-blank characters since "[a-zA-Z0-9]" is already one non-blank character. Remember that the "*" or "+" applies to the single character immediately to the left of the plus or asterisk.
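The effect of the bracketed character sets can be checked with a short sketch. The expression is assembled from the translation above, and the sample lines are made up for illustration.

```python
import re

# Must start with a letter or digit and end with a letter.
pattern = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'

# Hypothetical lines with stray punctuation around the addresses.
lines = [
    'From alice@example.com Sat Jan  5 09:14:16 2008',
    'Return-Path: <bob@example.org>',
    'for <carol@example.net>;',
    'Received: (from apache@localhost)',
]

cleaned = [re.findall(pattern, line) for line in lines]
for x in cleaned:
    print(x)
```

The angle brackets, semicolon, and closing parenthesis are no longer part of the extracted strings.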
If we use this expression in our program, our data is much cleaner:
...
['email@example.com']
['firstname.lastname@example.org']
['email@example.com']
['200801032122.m03LMFo4005148@nakamura.uits.iupui.edu']
['firstname.lastname@example.org']
['email@example.com']
['firstname.lastname@example.org']
['apache@localhost']
Notice that on the "email@example.com" lines, our regular expression eliminated two characters at the end of the string (">;"). This is because when we append "[a-zA-Z]" to the end of our regular expression, we are demanding that whatever string the regular expression parser finds must end with a letter. So when it sees the ">" after "sakaiproject.org>;" it simply stops at the last "matching" letter it found (i.e., the "g" was the last good match).
Also note that the output of the program is a Python list that has a string as the single element in the list.
Combining searching and extracting
If we want to find numbers on lines that start with the string "X-" such as:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
we don't just want any floating-point numbers from any lines. We only want to extract numbers from lines that have the above syntax.
We can construct the following regular expression to select the lines:
^X-.*: [0-9.]+
Translating this, we are saying, we want lines that start with "X-", followed by zero or more characters (".*"), followed by a colon (":") and then a space. After the space we are looking for one or more characters that are either a digit (0-9) or a period "[0-9.]+". Note that inside the square brackets, the period matches an actual period (i.e., it is not a wildcard between the square brackets).
This is a very tight expression that will pretty much match only the lines we are interested in as follows:
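A sketch of a program that selects such lines is shown below. The expression is built from the translation above, and the sample lines (including the two that should be rejected) are made up so the example runs without a mail file.

```python
import re

# Hypothetical sample lines; only the first two have the
# "X-...: number" form.
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-Sieve: CMU Sieve 2.3',
    'Subject: 0.5 is not a header value',
]

selected = []
for line in lines:
    line = line.rstrip()
    # Require "X-" at the start, then ": " followed directly by
    # digits or periods.
    if re.search(r'^X-.*: [0-9.]+', line):
        selected.append(line)
        print(line)
```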
When we run the program, we see the data nicely filtered to show only the lines we are looking for.
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
But now we have to solve the problem of extracting the numbers. While it would be simple enough to use split, we can use another feature of regular expressions to both search and parse the line at the same time.
Parentheses are another special character in regular expressions. When you add parentheses to a regular expression, they are ignored when matching the string. But when you are using findall(), parentheses indicate that while you want the whole expression to match, you only are interested in extracting a portion of the substring that matches the regular expression.
So we make the following change to our program:
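A sketch of the changed program, using findall() with parentheses around the number portion of the expression. As before, the sample lines are made up and stand in for a mail file.

```python
import re

# Hypothetical sample lines.
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-Sieve: CMU Sieve 2.3',
]

extracted = []
for line in lines:
    line = line.rstrip()
    # The whole expression must match, but findall() returns only
    # the part inside the parentheses.
    x = re.findall(r'^X-.*: ([0-9.]+)', line)
    if len(x) > 0:
        extracted.append(x)
        print(x)
```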
Instead of calling search(), we call findall() and add parentheses around the part of the regular expression that represents the floating-point number, to indicate that we only want findall() to give us back the floating-point number portion of the matching string.
The output from this program is as follows:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
..
The numbers are still in a list and need to be converted from strings to floating point, but we have used the power of regular expressions to both search and extract the information we found interesting.
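The conversion step might be sketched as follows, assuming each extracted string is a well-formed number. The sample lines are made up for illustration.

```python
import re

# Hypothetical sample lines.
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
]

values = []
for line in lines:
    # findall() returns a list of strings; convert each to a float.
    for text in re.findall(r'^X-.*: ([0-9.]+)', line):
        values.append(float(text))

print(values)
```

Because "[0-9.]+" would also accept something like "1.2.3", a more defensive version could wrap float() in try/except.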
As another example of this technique, if you look at the file there are a number of lines of the form:
If we wanted to extract all of the revision numbers (the integer number at the end of these lines) using the same technique as above, we could write the following program:
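A sketch of such a program follows. The "Details:" lines below use a placeholder URL invented for illustration; only the "Details:" prefix and the trailing "rev=" number are taken from the description in the text.

```python
import re

# Hypothetical sample lines; the URL is a placeholder.
lines = [
    'Details: http://example.org/viewsvn/?view=rev&rev=39772',
    'Details: http://example.org/viewsvn/?view=rev&rev=39771',
    'Subject: rev=12345 mentioned elsewhere',
]

revisions = []
for line in lines:
    line = line.rstrip()
    # Match the whole line shape, but extract only the digits
    # that follow "rev=".
    x = re.findall(r'^Details:.*rev=([0-9]+)', line)
    if len(x) > 0:
        revisions.append(x)
        print(x)
```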
Translating our regular expression, we are looking for lines that start with "Details:", followed by any number of characters (".*"), followed by "rev=", and then by one or more digits. We want to find lines that match the entire expression but we only want to extract the integer number at the end of the line, so we surround "[0-9]+" with parentheses.
When we run the program, we get the following output:
['39772']
['39771']
['39770']
['39769']
...
Remember that the "[0-9]+" is "greedy" and it tries to make as large a string of digits as possible before extracting those digits. This "greedy" behavior is why we get all five digits for each number. The regular expression library expands in both directions until it encounters a non-digit, or the beginning or the end of a line.
Now we can use regular expressions to redo an exercise from earlier in the book where we were interested in the time of day of each mail message. We looked for lines of the form:
From firstname.lastname@example.org Sat Jan 5 09:14:16 2008
and wanted to extract the hour of the day for each line. Previously we did this with two calls to split. First the line was split into words and then we pulled out the fifth word and split it again on the colon character to pull out the two characters we were interested in.
While this worked, it actually results in pretty brittle code that assumes the lines are nicely formatted. If you were to add enough error checking (or a big try/except block) to ensure that your program never failed when presented with incorrectly formatted lines, the code would balloon to 10-15 lines that were pretty hard to read.
We can do this in a far simpler way with the following regular expression:
^From .* [0-9][0-9]:
The translation of this regular expression is that we are looking for lines that start with "From " (note the space), followed by any number of characters (".*"), followed by a space, followed by two digits "[0-9][0-9]", followed by a colon character. This is the definition of the kinds of lines we are looking for.
In order to pull out only the hour using findall(), we add parentheses around the two digits as follows:
^From .* ([0-9][0-9]):
This results in the following program:
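A sketch of that program, run on a few made-up lines so that it is self-contained (the addresses and timestamps are illustrative assumptions):

```python
import re

# Hypothetical sample lines; the last one starts with "From:" rather
# than "From " and should be skipped.
lines = [
    'From alice@example.com Sat Jan  5 09:14:16 2008',
    'From bob@example.org Fri Jan  4 18:10:48 2008',
    'From: alice@example.com',
]

hours = []
for line in lines:
    line = line.rstrip()
    # A space, two digits, and a colon: only the hour fits this shape.
    x = re.findall(r'^From .* ([0-9][0-9]):', line)
    if len(x) > 0:
        hours.append(x)
        print(x)
```

The minutes and seconds are not picked up because they are preceded by a colon rather than a space, or are not followed by a colon.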