
This is a utility that builds a Lucene index of mail in an IMAP message store.

This is just an indexer - there is no search utility included. If you want a fuller app see Zoe.

Also note that IMAP servers already have their own search built in. This code is more of a donation, or an example, so that Lucene has code for indexing as many data sources as possible. If redundant indexing of IMAP bothers you, then turn back! :)

Block Diagram

Requirements

Download

Building

Should be obvious enough - there's only one file.

Command Line Syntax

The code is all in ImapIndex which contains both a driver (main()) and the code that traverses the message store and indexes the mail. If you run it with -help you get this:
Syntax is -host HOST -user USER -pw PASSWORD [-index INDEX] [-folder FOLDER]
-folder and -index are optional, others are needed and can appear in any order
		
So note that there are three mandatory flags: -host, -user, and -pw. If you just want to index one folder, which is a good way to start off, add -folder. If you don't like the default index name of imap_index (in the current directory), use -index.
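
The flag handling described above could look roughly like the following. This is a minimal sketch and not the actual ImapIndex source: the class name ArgsSketch and the method parse are hypothetical, and only the flag names, the usage line, and the default index name come from the text above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the flag handling; not the actual ImapIndex source.
class ArgsSketch {
    static final String USAGE =
        "Syntax is -host HOST -user USER -pw PASSWORD [-index INDEX] [-folder FOLDER]";
    static final List<String> KNOWN = Arrays.asList("host", "user", "pw", "index", "folder");
    static final List<String> REQUIRED = Arrays.asList("host", "user", "pw");

    // Reads "-flag value" pairs in any order and returns them keyed by flag name.
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("index", "imap_index"); // default index name
        for (int i = 0; i < args.length; i += 2) {
            String flag = args[i];
            boolean known = flag.startsWith("-") && KNOWN.contains(flag.substring(1));
            if (!known || i + 1 >= args.length) {
                // Unknown flag or missing value; this path also catches -help.
                throw new IllegalArgumentException(USAGE);
            }
            opts.put(flag.substring(1), args[i + 1]);
        }
        if (!opts.keySet().containsAll(REQUIRED)) {
            throw new IllegalArgumentException(USAGE);
        }
        return opts;
    }

    public static void main(String[] args) {
        try {
            Map<String, String> opts = parse(args);
            System.out.println("Index: " + opts.get("index"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Treating every flag as a name/value pair keeps the loop short, and it means -help needs no special case because it simply fails validation and prints the usage line.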

Running

This assumes you have a lib/ subdirectory which has Lucene, JavaMail and the JAF installed.

For every message indexed there is one line output which contains the time it took to retrieve the message, the size, and the subject.

At the end of traversing each folder there is a line summarizing the performance so far, and at the end of execution there is more summary output.

Annoyingly, I use an IMAP server in London from my office in Santa Monica, CA, so I have high latency and it takes approximately one second to retrieve and index each message.

Here's the invocation and output. Note that the invocation line is long, and that the classpath uses ; as the separator, as on Windows. On Unix-like systems use : instead.

prompt> java -classpath imap_index.jar;lib/mail.jar;lib/activation.jar;lib/imap.jar;lib/lucene-1.2.jar com.tropo.lucene.ImapIndex -pw ***** -user ... -host ... -folder "Lists/Lucene Dev"
Index: imap_index
Connecting to ... as ...
Connected

FOLDER: Lists/Lucene Dev messages=38
            
        1/38 dt=1211(ms) bytes=2999 Latent Semantic Analysis/Indexing
        2/38 dt=661(ms) bytes=3155 Re: Latent Semantic Analysis/Indexing
        3/38 dt=701(ms) bytes=2731 [PMX:#] DO NOT REPLY [Bug 17242]  -     IllegalStateException: docs out of order after 10 insert/delete/optimize
        4/38 dt=561(ms) bytes=5792 Re: FSDirectory patch for file renaming

        5/38 dt=541(ms) bytes=5003 Re: literal operator?
        6/38 dt=711(ms) bytes=3260 [PMX:#] DO NOT REPLY [Bug 16816]  -     Enhanced FSDirectory that allow lock disable via API
        7/38 dt=530(ms) bytes=3449 [PMX:#] DO NOT REPLY [Bug 16438]  -     time not supported in date ranges
        8/38 dt=611(ms) bytes=4777 [PMX:#] DO NOT REPLY [Bug 16437]  -     AND NOT queries not working
        9/38 dt=551(ms) bytes=4403 Filters, range queries and MultiSearcher
        10/38 dt=661(ms) bytes=6461 Re: literal operator?
        11/38 dt=541(ms) bytes=6367 Re: literal operator?
        12/38 dt=1111(ms) bytes=3670 User documentation for scoring
        13/38 dt=551(ms) bytes=7462 Re: literal operator?
        14/38 dt=531(ms) bytes=5000 Re: literal operator?
        15/38 dt=531(ms) bytes=6276 Re: literal operator?
        16/38 dt=771(ms) bytes=2928 Removing a sandbox committer's account
        17/38 dt=731(ms) bytes=11344 Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
        18/38 dt=561(ms) bytes=4586 Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
        19/38 dt=530(ms) bytes=4241 Re: New PhrasePrefixQuery
        20/38 dt=721(ms) bytes=2919 MultiTermQuery question
        21/38 dt=641(ms) bytes=3647 Re: User documentation for scoring
        22/38 dt=521(ms) bytes=3738 Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
        23/38 dt=531(ms) bytes=3874 Re: MultiTermQuery question
        24/38 dt=531(ms) bytes=4765 Re: MultiTermQuery question
        25/38 dt=721(ms) bytes=3285 Question about .f1, .f2, etc. files in index directory
        26/38 dt=711(ms) bytes=3682 Re: MultiTermQuery question
        27/38 dt=520(ms) bytes=3838 Re: Question about .f1, .f2, etc. files in index directory
        28/38 dt=531(ms) bytes=4330 Re: Filters, range queries and MultiSearcher
        29/38 dt=521(ms) bytes=4414 Re: MultiTermQuery question
        30/38 dt=671(ms) bytes=3358 Re: MultiTermQuery question
        31/38 dt=901(ms) bytes=13875 Re: User documentation for scoring
        32/38 dt=521(ms) bytes=4308 Re: MultiTermQuery question
        33/38 dt=641(ms) bytes=3059 Re: MultiTermQuery question
        34/38 dt=701(ms) bytes=3376 RE : Filters, range queries and MultiSearcher
        35/38 dt=540(ms) bytes=5585 Re: literal operator?
        36/38 dt=521(ms) bytes=4213 Re: MultiTermQuery question
        37/38 dt=641(ms) bytes=2577 1.3RC1?
        38/38 dt=701(ms) bytes=3004 Phrase Query
    	
Folder done in 24(s), rate=5(kb/s),  total data = 171(kb), total time = 29(s)

All done, bytes read=171(kb) / 0(MB), time=29(s), rate=5(kb/sec)
Messages added to index: 38
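
The per-message and summary lines follow a simple pattern. As a rough sketch, the formatting could be done as below. The class and method names are hypothetical, and the use of integer division is an assumption that happens to agree with the sample run, where 171 kb over 29 s is reported as 5 kb/s.

```java
// Hypothetical sketch of the progress output; not the actual ImapIndex source.
class ProgressSketch {
    // One line per message: position in folder, fetch time, size, subject.
    static String messageLine(int n, int total, long dtMillis, int bytes, String subject) {
        return n + "/" + total + " dt=" + dtMillis + "(ms) bytes=" + bytes + " " + subject;
    }

    // Cumulative transfer rate in kb/s, using integer arithmetic (an assumption).
    static long rateKbPerSec(long totalBytes, long totalMillis) {
        long kb = totalBytes / 1024;
        long secs = Math.max(1, totalMillis / 1000); // avoid dividing by zero
        return kb / secs;
    }

    public static void main(String[] args) {
        System.out.println(messageLine(1, 38, 1211, 2999, "Latent Semantic Analysis/Indexing"));
        System.out.println("rate=" + rateKbPerSec(171L * 1024, 29000) + "(kb/s)");
    }
}
```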


    

About the Index

I put lots of fields in the index to be thorough. The most important fields are probably subject and contents, however.

Field Name           Contents
content-description  One for each MIME attachment.
content-type         One for each MIME attachment.
contents             The body of the mail message and each attachment.
folder               The folder name
from                 Who sent the mail (from:)
received             The time the mail was received
reply-to             The reply to address
sent                 The time the mail was sent
size                 Size in bytes or not present if not available
subject              Subject: line
to                   The To: line
uid                  UID of the message
url                  The URL in a form that Mozilla seems to accept. It did not seem to accept the form defined in RFC 2192.

Future Directions

This code should have an incremental mode where it can sync up with a message store (which is one of the reasons why the UIDs are stored in the index). As it is now, it is only suitable for batch operation because it wipes out the index every time it's run.
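
To illustrate what such an incremental mode would have to compute, here is a minimal sketch of the UID comparison for a single folder. Nothing like this exists in the current code: the class and method names are hypothetical, and the sets of UIDs are assumed to have been read from the server and from the index beforehand.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a per-folder UID comparison; not part of ImapIndex.
class UidSyncSketch {
    // UIDs on the server that are not yet in the index: messages to fetch and add.
    static Set<Long> toAdd(Set<Long> serverUids, Set<Long> indexedUids) {
        Set<Long> result = new TreeSet<>(serverUids);
        result.removeAll(indexedUids);
        return result;
    }

    // UIDs in the index that are gone from the server: documents to delete.
    static Set<Long> toDelete(Set<Long> serverUids, Set<Long> indexedUids) {
        Set<Long> result = new TreeSet<>(indexedUids);
        result.removeAll(serverUids);
        return result;
    }

    public static void main(String[] args) {
        Set<Long> server = new TreeSet<>(Set.of(1L, 2L, 3L, 5L));
        Set<Long> indexed = new TreeSet<>(Set.of(1L, 2L, 4L));
        System.out.println("add=" + toAdd(server, indexed));
        System.out.println("delete=" + toDelete(server, indexed));
    }
}
```

IMAP UIDs are only unique within one folder, and only for as long as the folder's UIDVALIDITY value stays the same, so a real implementation would need to compare per folder and rebuild that folder's entries when UIDVALIDITY changes.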

I think there's a problem in JavaMail whereby you cannot use multiple threads to process the messages in a folder in parallel, due to lock contention in Sun's IMAP implementation. I believe you can process multiple folders in parallel, however, though I haven't tried it yet.
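
If folder-level parallelism does work, the driver could hand each folder to a thread pool. The following is an untested sketch of that idea only: the class name, the indexAll method, and the per-folder worker function are hypothetical, and whether the worker is safe to run concurrently against JavaMail and a shared index is exactly the open question above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of one-task-per-folder indexing; not part of ImapIndex.
class ParallelFoldersSketch {
    // Runs indexFolder once per folder on a fixed pool and returns the
    // per-folder message counts in the order the folders were given.
    static Map<String, Integer> indexAll(List<String> folders, int threads,
                                         Function<String, Integer> indexFolder) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Integer>> pending = new LinkedHashMap<>();
            for (String folder : folders) {
                Callable<Integer> task = () -> indexFolder.apply(folder);
                pending.put(folder, pool.submit(task));
            }
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Integer>> e : pending.entrySet()) {
                counts.put(e.getKey(), e.getValue().get()); // waits for that folder
            }
            return counts;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("Folder indexing failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder worker: a real one would open the folder and index its mail.
        Function<String, Integer> worker = folder -> 0;
        System.out.println(indexAll(Arrays.asList("INBOX", "Lists/Lucene Dev"), 2, worker));
    }
}
```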