The open source spammer: extracting email addresses from an openoffice.org document

I'm organizing a workshop later this month (see here, scroll to session V), and I have just received the attendee list from the main conference's organizers. Since I need to spam, er, send the attendees informative email about the workshop, I needed their email addresses. Here's what I did. The file itself is an MS Word doc; I save those in the native openoffice.org format (.odt) on my system. Now, an openoffice.org document is really just a bunch of (mostly XML) files zipped together. If you do the following:
unzip  -l conference-delegates.odt
You get a listing that looks like this:
Archive:  conference-delegates.odt
  Length      Date    Time    Name
---------  ---------- -----   ----
       39  2010-09-01 18:16   mimetype
    71244  2010-09-01 18:16   content.xml
       94  2010-09-01 18:16   layout-cache
    15522  2010-09-01 18:16   styles.xml
     1241  2010-09-01 18:16   meta.xml
    24852  2010-09-01 18:16   Thumbnails/thumbnail.png
        0  2010-09-01 18:16   Configurations2/accelerator/current.xml
        0  2010-09-01 18:16   Configurations2/progressbar/
        0  2010-09-01 18:16   Configurations2/floater/
        0  2010-09-01 18:16   Configurations2/popupmenu/
        0  2010-09-01 18:16   Configurations2/menubar/
        0  2010-09-01 18:16   Configurations2/toolbar/
        0  2010-09-01 18:16   Configurations2/images/Bitmaps/
        0  2010-09-01 18:16   Configurations2/statusbar/
     8961  2010-09-01 18:16   settings.xml
     1988  2010-09-01 18:16   META-INF/manifest.xml
---------                     -------
   123941                     16 files
Wow. Which file contains the delegates' emails in all that? Actually, content.xml contains the textual content of the openoffice.org document. You can open it with your favorite XML viewer and see how it's constructed (I like Firefox myself for browsing, and XML Copy Editor for more in-depth diagnosis). But for now, we would like to extract the emails. So we unzip content.xml only:
unzip conference-delegates.odt content.xml
Naming content.xml after the archive tells unzip to extract only that file from the .odt. Looking inside content.xml, we see lines like this:
<text:a xlink:type="simple" xlink:href="mailto:noone@usc.edu">
  <text:span text:style-name="Internet_20_link">
    <text:span text:style-name="T2">noone@usc.edu</text:span>
  </text:span>
</text:a>
</text:p>
Which means that "noone's" email (usernames have been changed to protect the innocent) appears both as text and as a hyperlink. Not all the delegates' emails are necessarily hyperlinked, but those that are will show up twice, so we should expect duplicates that we need to get rid of. To get the email addresses themselves, we use egrep, which searches using the extended regular expression syntax. What is a good regex for email addresses? There is a good discussion of that at the regex-guru site. I use the rather simple form:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml
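To see what this pattern picks up before pointing it at the real file, here is a small sketch run on a made-up line. The `sample` string is an assumption standing in for a fragment of content.xml, and `grep -E` is used as the equivalent spelling of egrep (recent GNU grep releases warn that egrep is obsolescent).

```shell
# Hypothetical sample input standing in for a fragment of content.xml
sample='<text:a xlink:href="mailto:noone@usc.edu"><text:span>noone@usc.edu</text:span></text:a> and Someone.Else@Example.ORG'

# -E: extended regex syntax, -o: print only the matched text, -i: ignore case
matches=$(printf '%s\n' "$sample" | grep -E -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b')

# One match per line; the hyperlinked address shows up twice
printf '%s\n' "$matches"
```

Note that the `{2,4}` at the end limits the top-level domain to two to four letters, so an address in a longer domain such as .museum would be skipped entirely. That is the price of the "rather simple form".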
Explanation: the -o option prints only the part of each line that matches the regex, rather than the whole line. -i means a case-insensitive match. egrep is the extended version of grep, which understands the + and {m,n} repeats as written here. However, the result of our little exercise would still have duplicate emails, because of the hyperlinking tags. Here is how to get rid of the duplicates:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' content.xml | sort |  uniq
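One caveat with this step: the match is case-insensitive, but uniq compares lines exactly, so the same address typed with different capitalization would survive as two entries. A hedged variant, assuming the receiving mail servers ignore case (nearly always true in practice), lowercases everything first and lets `sort -u` do the work of sort plus uniq. The `found` list below is made up for illustration.

```shell
# Hypothetical list of matches, as the egrep step might print them
found='noone@usc.edu
Noone@USC.edu
someone@example.org
noone@usc.edu'

# Lowercase first so case variants collapse, then sort and drop duplicates
unique=$(printf '%s\n' "$found" | tr '[:upper:]' '[:lower:]' | sort -u)

printf '%s\n' "$unique"
```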
sort sorts the output alphabetically, which uniq needs because it only removes adjacent duplicate lines. One last touch-up: we really don't need to extract content.xml to disk at all. unzip -c extracts files to stdout. Therefore, we can get the email addresses without cluttering our disk:
unzip -c conference-delegates.odt content.xml | \
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' | sort |\
uniq > email-these.txt
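If you do this more than once, the pipeline can be wrapped in a couple of shell functions. This is only a sketch: the function names `emails_from_stream` and `emails_from_odt` are my own inventions, and it uses `unzip -p`, which behaves like `-c` but does not print the name of each extracted file.

```shell
# The pattern from above, kept in one place
EMAIL_RE='\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

# Read text on stdin, print each distinct address once, sorted
emails_from_stream() {
    grep -E -o -i "$EMAIL_RE" | sort -u
}

# Hypothetical helper: $1 is the path to an .odt file
emails_from_odt() {
    unzip -p "$1" content.xml | emails_from_stream
}
```

Running `emails_from_odt conference-delegates.odt > email-these.txt` should then produce the same file as the one-liner.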
Voilà! email-these.txt now contains the emails of the conference delegates. One last word: it might have been easier just to save the MS Word doc as text using the "File -> Save As..." option in openoffice.org. Suppose we saved the file as conference-delegates.txt. We wouldn't have to muck about with all the XML, or remove the duplicate email addresses caused by hyperlinking. We could have just done:
egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' \
conference-delegates.txt > email-these.txt
But where's the fun in that? Happy spamming!

3 Responses to “The open source spammer: extracting email addresses from an openoffice.org document”

  1. “strings” is a really cool command for such purpose.

  2. Iddo says:

    @Martin Jambon
    Thanks Martin, I never used strings… got an example one-liner using strings?

  3. strings extracts printable ASCII substrings from any file. It’s useful for files like .doc documents:

    strings foo.doc | egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
    

    .docx documents are just compressed XML, so the following should work:

    unzip -c foo.docx | egrep -o -i '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'
    

    strings is also handy for inspecting the bowels of a compiled executable:

    $ strings `which cat`
    /lib64/ld-linux-x86-64.so.2
    __gmon_start__
    libc.so.6
    stpcpy
    ioctl
    stdout
    memmove
    getopt_long
    ...
    Written by %s, %s, %s,
    %s, %s, %s, %s,
    %s, and %s.
    Written by %s, %s, %s,
    %s, %s, %s, %s,
    %s, %s, and others.
    Copyright %s %d Free Software Foundation, Inc.
    %s: %s
    literal
    shell
    shell-always
    escape
    clocale
    memory exhausted
    ?P< @
    @(NULL)
                     0