The open source spammer: extracting email addresses from an openoffice.org document

By Iddo on September 1st, 2010

I’m organizing a workshop later this month (see here, scroll to session V), and I have just received the attendee list from the main conference’s organizers. Since I need to spam, er, send the attendees an informative email about the workshop, I needed their email addresses. Here’s what I did.

The file itself is an MS Word doc, which I saved in the native openoffice.org format (.odt) on my system. Now, an openoffice.org document is really just a bunch of mostly XML files zipped together. If you do the following:

unzip  -l conference-delegates.odt

You get a listing that looks like this:

Archive:  conference-delegates.odt
 Length      Date    Time    Name
---------  ---------- -----   ----
 39     2010-09-01 18:16   mimetype
 71244  2010-09-01 18:16   content.xml
 94     2010-09-01 18:16   layout-cache
 15522  2010-09-01 18:16   styles.xml
 1241   2010-09-01 18:16   meta.xml
 24852  2010-09-01 18:16   Thumbnails/thumbnail.png
 0      2010-09-01 18:16   Configurations2/accelerator/current.xml
 0      2010-09-01 18:16   Configurations2/progressbar/
 0      2010-09-01 18:16   Configurations2/floater/
 0      2010-09-01 18:16   Configurations2/popupmenu/
 0      2010-09-01 18:16   Configurations2/menubar/
 0      2010-09-01 18:16   Configurations2/toolbar/
 0      2010-09-01 18:16   Configurations2/images/Bitmaps/
 0      2010-09-01 18:16   Configurations2/statusbar/
 8961   2010-09-01 18:16   settings.xml
 1988   2010-09-01 18:16   META-INF/manifest.xml
---------                     -------
 123941                     16 files

Wow. Which file contains the delegates’ emails in all that? Actually, content.xml contains the textual content of the openoffice.org document. You can open it with your favorite XML viewer and see how it’s constructed (I like Firefox myself for browsing, and XML Copy Editor for more in-depth diagnosis). But for now, we would like to extract the emails. So we unzip content.xml only:

unzip conference-delegates.odt content.xml

This unzip command will only extract content.xml from the archive that is the .odt file.

When looking at the content.xml file, we see lines like this:

 <text:a xlink:type="simple" xlink:href="mailto:noone@usc.edu">

 <text:span text:style-name="Internet_20_link">
 <text:span text:style-name="T2">noone@usc.edu</text:span>
 </text:span>
 </text:a>
 </text:p>

Which means that “noone’s” (usernames have been changed to protect the innocent) email appears both as text and as a hyperlink. It may or may not be that all the delegates’ emails are hyperlinked, so we should expect some duplicates that we need to get rid of.
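To see the duplication concretely, here is a small sketch that recreates the fragment above in a scratch file and counts the lines that mention the address. The scratch file and the variable names are made up for illustration; nothing here touches the real document.

```shell
# Recreate the fragment shown above in a temporary scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<text:a xlink:type="simple" xlink:href="mailto:noone@usc.edu">
<text:span text:style-name="Internet_20_link">
<text:span text:style-name="T2">noone@usc.edu</text:span>
</text:span>
</text:a>
EOF

# Count the lines that mention the address:
# one is the mailto: link, the other is the visible text.
hits=$(grep -c 'noone@usc\.edu' "$sample")
echo "lines mentioning the address: $hits"

rm -f "$sample"
```

So a hyperlinked address should show up twice in whatever we extract, while a plain-text one would show up once.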

To get the email addresses themselves, we use egrep. egrep uses the extended regular expression syntax in searching for emails. What is a good regex for email addresses? There is a good discussion of that at the regex-guru site. I use the rather simple form:

egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml

Explanation: the -o option prints only the part of each line that matches the regex, one match per line. -i means a case-insensitive match, which is why the character classes can get away with listing only A-Z. egrep is the extended version of grep (equivalent to grep -E), which can handle regexes with things like {m,n} repeats. However, the result of our little exercise would still have duplicate emails, because of the hyperlinking tags. Here is how to get rid of the duplicates:

egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml | sort |  uniq

sort sorts the output alphabetically, preparing it for uniq to get rid of duplicates.
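Here is a minimal sketch of what -o, -i and sort | uniq each contribute, run on a made-up line of input rather than the real content.xml. It uses grep -E, which is the same thing as egrep; the sample text and variable names are assumptions for illustration only.

```shell
# Made-up input: one address appears twice, another is in upper case.
sample='<a href="mailto:noone@usc.edu">noone@usc.edu</a> and SOMEONE@EXAMPLE.ORG'

# -o prints each match on its own line; -i lets [A-Z] match lower case too.
matches=$(printf '%s\n' "$sample" |
  grep -E -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b')

# uniq only collapses adjacent duplicate lines, hence the sort in front of it.
unique=$(printf '%s\n' "$matches" | sort | uniq)

echo "raw matches:"
printf '%s\n' "$matches"
echo "after sort | uniq:"
printf '%s\n' "$unique"
```

The raw list has three lines and the de-duplicated one has two. The order in which sort prints them depends on your locale settings.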

One last touch-up: we really don’t need to physically extract the content.xml file. “unzip -c” extracts files to stdout. Therefore, we can get the email addresses without cluttering our disk:

unzip -c conference-delegates.odt content.xml | \
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' | sort |\
uniq > email-these.txt

Voila! email-these.txt now contains the emails of the conference delegates.
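If you do this sort of thing often, the pipeline could be wrapped in a small shell function. This is only a sketch: the name extract_emails is made up here, and the function reads from standard input so it can be tried on a made-up fragment without a real document. It uses sort -u, which is shorthand for sort | uniq.

```shell
# Hypothetical helper: read text on stdin, print each distinct address once.
extract_emails() {
  grep -E -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' | sort -u
}

# Try it on a made-up fragment instead of a real content.xml.
result=$(printf '%s\n' \
  '<text:a xlink:type="simple" xlink:href="mailto:noone@usc.edu">' \
  '<text:span text:style-name="T2">noone@usc.edu</text:span>' |
  extract_emails)
printf '%s\n' "$result"
```

With the real file, this would be used as: unzip -p conference-delegates.odt content.xml | extract_emails > email-these.txt. The -p option of unzip writes the file data to stdout like -c does, but without the extra header lines that -c prints.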

One last word: it may have been easier just to save the MS Word doc file as text using the “File -> Save as…” option in openoffice.org. Suppose we saved the file as conference-delegates.txt. We wouldn’t have to muck about with all the XML, or remove the email address duplicates due to hyperlinking. We could have just done:

egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' \
conference-delegates.txt > email-these.txt

But where’s the fun in that?
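One caveat about the “rather simple” regex, whichever route you take: the {2,4} at the end only accepts top-level domains of two to four letters, so an address at a longer one such as .museum is skipped entirely. The sketch below compares it with a looser {2,} variant. That variant is my assumption of a reasonable tweak, not something taken from the regex-guru discussion, and the addresses are made up.

```shell
# Made-up addresses: one with a three-letter TLD, one with a six-letter TLD.
addresses='noone@usc.edu
curator@example.museum'

# The pattern used in this post: TLD limited to 2-4 letters.
short_count=$(printf '%s\n' "$addresses" |
  grep -E -c -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b')

# A looser variant (an assumption, not the post's regex): 2 or more letters.
loose_count=$(printf '%s\n' "$addresses" |
  grep -E -c -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b')

echo "lines matched with {2,4}: $short_count"
echo "lines matched with {2,}:  $loose_count"
```

For a delegate list it is worth eyeballing the result against the attendee count either way.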

Happy spamming!

Categorized under: Software.
Tagged with: email, hacks, Linux, openoffice.org, programming, regex, regular expressions, shell, xml.

3 Responses to “The open source spammer: extracting email addresses from an openoffice.org document”

Martin Jambon says:

2-September-2010 at 2:06 AM

“strings” is a really cool command for this purpose.

Iddo says:

2-September-2010 at 10:32 AM

@Martin Jambon
Thanks Martin, I never used strings… got an example one-liner using strings?

Martin Jambon says:

3-September-2010 at 1:37 AM

strings extracts printable ASCII substrings from any file. It’s useful for files like .doc documents:

strings foo.doc | egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

.docx documents are just compressed XML, so the following should work:

unzip -c foo.docx | egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

strings is also handy for inspecting the bowels of a compiled executable:

$ strings `which cat`
/lib64/ld-linux-x86-64.so.2
__gmon_start__
libc.so.6
stpcpy
ioctl
stdout
memmove
getopt_long
...
Written by %s, %s, %s,
%s, %s, %s, %s,
%s, and %s.
Written by %s, %s, %s,
%s, %s, %s, %s,
%s, %s, and others.
Copyright %s %d Free Software Foundation, Inc.
%s: %s
literal
shell
shell-always
escape
clocale
memory exhausted
?P< @
@(NULL)
                 0

Byte Size Biology

The musings and ravings of a computational biologist about science, computers, music and, you know, stuff
