The open source spammer: extracting email addresses from an openoffice.org document
I’m organizing a workshop later this month (see here, scroll to session V), and I have just received the attendee list from the main conference’s organizers. Since I need to spam, er, send the attendees an informational email about the workshop, I needed their email addresses. Here’s what I did.
The file itself is an MS Word doc. I save those in the native OpenOffice format on my system. Now, an OpenOffice document is really just a bunch of files, mostly XML, zipped together. If you do the following:
unzip -l conference-delegates.odt
You get a listing that looks like this:
Archive:  conference-delegates.odt
  Length      Date    Time    Name
---------  ---------- -----   ----
       39  2010-09-01 18:16   mimetype
    71244  2010-09-01 18:16   content.xml
       94  2010-09-01 18:16   layout-cache
    15522  2010-09-01 18:16   styles.xml
     1241  2010-09-01 18:16   meta.xml
    24852  2010-09-01 18:16   Thumbnails/thumbnail.png
        0  2010-09-01 18:16   Configurations2/accelerator/current.xml
        0  2010-09-01 18:16   Configurations2/progressbar/
        0  2010-09-01 18:16   Configurations2/floater/
        0  2010-09-01 18:16   Configurations2/popupmenu/
        0  2010-09-01 18:16   Configurations2/menubar/
        0  2010-09-01 18:16   Configurations2/toolbar/
        0  2010-09-01 18:16   Configurations2/images/Bitmaps/
        0  2010-09-01 18:16   Configurations2/statusbar/
     8961  2010-09-01 18:16   settings.xml
     1988  2010-09-01 18:16   META-INF/manifest.xml
---------                     -------
   123941                     16 files
Wow. Which file contains the delegates’ emails in all that? Actually, content.xml contains the textual content of the OpenOffice.org document. You can open it with your favorite XML viewer and see how it’s constructed (I like Firefox myself for browsing, and XML Copy Editor for more in-depth diagnosis). But for now, we would like to extract the emails. So we unzip content.xml only:
unzip conference-delegates.odt content.xml
This unzip command will only extract content.xml from the archive that is the .odt file.
When looking at the content.xml file, we see lines like this:
<text:a xlink:type="simple" xlink:href="mailto:noone@usc.edu">
  <text:span text:style-name="Internet_20_link">
    <text:span text:style-name="T2">noone@usc.edu</text:span>
  </text:span>
</text:a>
</text:p>
Which means that “noone’s” (usernames have been changed to protect the innocent) email appears both as text and as a hyperlink. It may or may not be that all the delegates’ emails are hyperlinked, so we may expect some duplicates that we need to get rid of.
To get the email addresses themselves, we use egrep. egrep uses the extended regular expression syntax in searching for emails. What is a good regex for email addresses? There is a good discussion of that at the regex-guru site. I use the rather simple form:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml
Explanation: the -o option prints only the part of each line that matches the regex, one match per line. -i makes the match case-insensitive. egrep is the extended version of grep (equivalent to grep -E), which can handle regexes with things like {m,n} repeats. However, the result of our little exercise would still contain duplicate emails, because each hyperlinked address appears both in the link target and in the visible text. Here is how to get rid of the duplicates:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml | sort | uniq
sort sorts the output alphabetically, preparing it for uniq to get rid of duplicates.
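To see why the sort matters, here is a small sketch using made-up addresses (the sample data and variable names are hypothetical, not taken from the delegates file). uniq only collapses adjacent duplicate lines, so an unsorted list can keep repeats; sort -u is a common shorthand for sort | uniq.

```shell
# Hypothetical matches, one per line, the way egrep -o would print them.
matches='noone@usc.edu
noone@usc.edu
someone@example.org
noone@usc.edu'

# uniq alone only removes adjacent repeats, so the last noone@usc.edu survives.
without_sort=$(printf '%s\n' "$matches" | uniq)

# Sorting first puts identical lines next to each other, so uniq can drop them all.
deduped=$(printf '%s\n' "$matches" | sort | uniq)

# sort -u does the sort and the de-duplication in one step.
deduped_short=$(printf '%s\n' "$matches" | sort -u)

printf '%s\n' "$deduped"
```

With this sample, uniq on its own leaves three lines, while both sorted variants leave two.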
One last touch-up: we really don’t need to physically extract the content.xml file. “unzip -c” extracts files to stdout. Therefore, we can get the email addresses without cluttering our disk:
unzip -c conference-delegates.odt content.xml | \
  egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' | sort |\
  uniq > email-these.txt
Voila! email-these.txt now contains the emails of the conference delegates.
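If you do this often, the pipeline can be wrapped in a couple of shell functions. This is only a sketch: the function names emails_from_stream and emails_from_odt are made up for illustration, and it swaps in unzip -p, which writes just the file contents to stdout, where unzip -c also prints the name of each file it extracts.

```shell
# Reads text on stdin and prints the unique email addresses found, one per line.
# grep -E is the same as egrep; the regex is the one used in this post.
emails_from_stream() {
    grep -E -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' | sort -u
}

# Hypothetical helper: $1 is the path to an .odt file.
# unzip -p pipes content.xml to stdout without any extra banner lines.
emails_from_odt() {
    unzip -p "$1" content.xml | emails_from_stream
}
```

Usage would then be emails_from_odt conference-delegates.odt > email-these.txt. Note that the {2,4} in the regex skips addresses whose top-level domain is longer than four letters; {2,} is a looser alternative if that matters for your list.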
One last word: it may have been easier just to save the MS Word doc file as text using the “File -> Save as…” option in OpenOffice.org. Suppose we saved the file as conference-delegates.txt. We wouldn’t have to muck about with all the XML, or remove the email address duplicates due to hyperlinking. We could have just done:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' \
  conference-delegates.txt > email-these.txt
But where’s the fun in that?
Happy spamming!
“strings” is a really cool command for this purpose.
@Martin Jambon
Thanks Martin, I never used strings… got an example one-liner using strings?
strings extracts printable ASCII substrings from any file. It’s useful for files like .doc documents:
.docx documents are just compressed XML, so the following should work:
strings is also handy for inspecting the bowels of a compiled executable: