This is a utility that builds a Lucene index of mail in an IMAP message store.
This is just an indexer - there is no search utility included.
If you want a fuller app see Zoe.
Also note that IMAP servers already have their own search built in - this
code is more of a donation or example code so that there will be code for Lucene
to index all data sources. If redundant indexing of IMAP bothers you
then turn back! :)
For every message indexed there is one line output which
contains the time it took to retrieve the message, the size, and the subject.
At the end of traversing each folder there is a line summarizing the
performance so far, and at the end of execution there is more summary output.
Annoyingly I use an IMAP server in London from my office in Santa Monica, CA so I have high latency and it takes approximately one second to retrieve and index each message.
Here's the invocation and output. Note that the invocation line is long.
Requirements
Download
Building
Should be obvious enough - there's only one file.
Command Line Syntax
The code is all in ImapIndex, which contains both a driver (main()) and the code that traverses the message store and indexes the mail.
If you run it with -help you get this:
Syntax is -host HOST -user USER -pw PASSWORD [-index INDEX] [-folder FOLDER]
-folder and -index are optional, others are needed and can appear in any order
So note that there are 3 mandatory flags: -host, -user, and -pw.
If you just want to index one folder, which is a good way to start off, then add -folder.
If you don't like the default index name of imap_index (in the current directory) then use -index.
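The flag handling described above could be sketched roughly as follows. This is not the actual ImapIndex source: the class name ImapIndexArgs and its methods are made up for illustration, and only the flag names and the default index name imap_index are taken from the text above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of ImapIndex: parses the flags in any order. */
final class ImapIndexArgs {
    static final String SYNTAX =
        "Syntax is -host HOST -user USER -pw PASSWORD [-index INDEX] [-folder FOLDER]";

    /**
     * Returns a map of flag name (without the dash) to value.
     * Throws IllegalArgumentException for -help, unknown flags, a flag with
     * no value, or a missing mandatory flag.
     */
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("index", "imap_index"); // default index name, in the current directory
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            boolean known = flag.equals("-host") || flag.equals("-user") || flag.equals("-pw")
                    || flag.equals("-index") || flag.equals("-folder");
            if (!known || i + 1 >= args.length) {
                // Covers -help, unknown flags, and a flag with nothing after it.
                throw new IllegalArgumentException(SYNTAX);
            }
            opts.put(flag.substring(1), args[++i]);
        }
        for (String required : new String[] {"host", "user", "pw"}) {
            if (!opts.containsKey(required)) {
                throw new IllegalArgumentException("Missing -" + required + ". " + SYNTAX);
            }
        }
        return opts;
    }

    public static void main(String[] args) {
        try {
            Map<String, String> opts = parse(args);
            System.out.println("Index: " + opts.get("index"));
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Keeping the options in a map means -folder is simply absent when it was not given, which matches the idea of indexing everything by default.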
Running
This assumes you have a lib/ subdirectory which has Lucene, JavaMail and the JAF installed.
prompt> java -classpath imap_index.jar;lib/mail.jar;lib/activation.jar;lib/imap.jar;lib/lucene-1.2.jar com.tropo.lucene.ImapIndex -pw ***** -user ... -host ... -folder "Lists/Lucene Dev"
Index: imap_index
Connecting to ... as ...
Connected
FOLDER: Lists/Lucene Dev messages=38
1/38 dt=1211(ms) bytes=2999 Latent Semantic Analysis/Indexing
2/38 dt=661(ms) bytes=3155 Re: Latent Semantic Analysis/Indexing
3/38 dt=701(ms) bytes=2731 [PMX:#] DO NOT REPLY [Bug 17242] - IllegalStateException: docs out of order after 10 insert/delete/optimize
4/38 dt=561(ms) bytes=5792 Re: FSDirectory patch for file renaming
5/38 dt=541(ms) bytes=5003 Re: literal operator?
6/38 dt=711(ms) bytes=3260 [PMX:#] DO NOT REPLY [Bug 16816] - Enhanced FSDirectory that allow lock disable via API
7/38 dt=530(ms) bytes=3449 [PMX:#] DO NOT REPLY [Bug 16438] - time not supported in date ranges
8/38 dt=611(ms) bytes=4777 [PMX:#] DO NOT REPLY [Bug 16437] - AND NOT queries not working
9/38 dt=551(ms) bytes=4403 Filters, range queries and MultiSearcher
10/38 dt=661(ms) bytes=6461 Re: literal operator?
11/38 dt=541(ms) bytes=6367 Re: literal operator?
12/38 dt=1111(ms) bytes=3670 User documentation for scoring
13/38 dt=551(ms) bytes=7462 Re: literal operator?
14/38 dt=531(ms) bytes=5000 Re: literal operator?
15/38 dt=531(ms) bytes=6276 Re: literal operator?
16/38 dt=771(ms) bytes=2928 Removing a sandbox committer's account
17/38 dt=731(ms) bytes=11344 Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
18/38 dt=561(ms) bytes=4586 Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
19/38 dt=530(ms) bytes=4241 Re: New PhrasePrefixQuery
20/38 dt=721(ms) bytes=2919 MultiTermQuery question
21/38 dt=641(ms) bytes=3647 Re: User documentation for scoring
22/38 dt=521(ms) bytes=3738 Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
23/38 dt=531(ms) bytes=3874 Re: MultiTermQuery question
24/38 dt=531(ms) bytes=4765 Re: MultiTermQuery question
25/38 dt=721(ms) bytes=3285 Question about .f1, .f2, etc. files in index directory
26/38 dt=711(ms) bytes=3682 Re: MultiTermQuery question
27/38 dt=520(ms) bytes=3838 Re: Question about .f1, .f2, etc. files in index directory
28/38 dt=531(ms) bytes=4330 Re: Filters, range queries and MultiSearcher
29/38 dt=521(ms) bytes=4414 Re: MultiTermQuery question
30/38 dt=671(ms) bytes=3358 Re: MultiTermQuery question
31/38 dt=901(ms) bytes=13875 Re: User documentation for scoring
32/38 dt=521(ms) bytes=4308 Re: MultiTermQuery question
33/38 dt=641(ms) bytes=3059 Re: MultiTermQuery question
34/38 dt=701(ms) bytes=3376 RE : Filters, range queries and MultiSearcher
35/38 dt=540(ms) bytes=5585 Re: literal operator?
36/38 dt=521(ms) bytes=4213 Re: MultiTermQuery question
37/38 dt=641(ms) bytes=2577 1.3RC1?
38/38 dt=701(ms) bytes=3004 Phrase Query
Folder done in 24(s), rate=5(kb/s), total data = 171(kb), total time = 29(s)
All done, bytes read=171(kb) / 0(MB), time=29(s), rate=5(kb/sec)
Messages added to index: 38
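For reference, lines shaped like the per-message and final summary lines above could be produced by something like the sketch below. The class and method names are hypothetical, and the arithmetic (kb as bytes/1024, rate as total kb divided by total seconds, all with truncating integer division) is an assumption inferred from the sample numbers rather than read from the ImapIndex source.

```java
/** Hypothetical formatter; mirrors the shape of the sample output only. */
final class ProgressLines {
    /** One line per message: position in folder, elapsed time, size, subject. */
    static String messageLine(int n, int total, long dtMillis, long bytes, String subject) {
        return n + "/" + total + " dt=" + dtMillis + "(ms) bytes=" + bytes + " " + subject;
    }

    /** Final summary line; kb is bytes/1024 and every division truncates. */
    static String summaryLine(long totalBytes, long totalMillis) {
        long kb = totalBytes / 1024;
        long mb = kb / 1024;
        long secs = Math.max(1, totalMillis / 1000); // avoid dividing by zero on very short runs
        return "All done, bytes read=" + kb + "(kb) / " + mb + "(MB), time=" + secs
                + "(s), rate=" + (kb / secs) + "(kb/sec)";
    }

    public static void main(String[] args) {
        System.out.println(messageLine(1, 38, 1211, 2999, "Latent Semantic Analysis/Indexing"));
        // 175751 is the sum of the bytes= values in the listing above.
        System.out.println(summaryLine(175751, 29000));
    }
}
```

With those inputs the two calls give the same text as the first message line and the "All done" line in the listing, which is at least consistent with the assumed arithmetic.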
About the Index
I put lots of fields in the index to be thorough.
The most important fields are probably subject and contents, however.
Field Name | Contents |
content-description | One for each MIME attachment. |
content-type | One for each MIME attachment. |
contents | The body of the mail message and each attachment. |
folder | The folder name |
from | Who sent the mail (from:) |
received | The time the mail was received |
reply-to | The reply to address |
sent | The time the mail was sent |
size | Size in bytes or not present if not available |
subject | Subject: line |
to | The To: line |
uid | UID of the message |
url | The URL in a form that Mozilla seems to accept. It did not seem to accept the form defined in RFC 2192. |
I think there's a problem in JavaMail whereby you cannot use multiple threads to process the messages in a folder in parallel, due to lock contention in Sun's IMAP implementation. I believe you can process multiple folders in parallel, however, though I haven't tried it yet.
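If per-folder parallelism does work, the structure might look like the sketch below: one task per folder on a small thread pool, with the messages inside each folder still handled sequentially. This is untested against a real IMAP server. ParallelFolders, FolderIndexer and indexFolder are hypothetical stand-ins for the per-folder traversal, and whether the index can safely be written from several threads would need checking separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: index several folders concurrently, one task per folder. */
final class ParallelFolders {
    /** Stand-in for the real per-folder work; returns the number of messages indexed. */
    interface FolderIndexer {
        int indexFolder(String folderName) throws Exception;
    }

    /** Runs one task per folder and returns the total number of messages indexed. */
    static int indexAll(List<String> folders, FolderIndexer indexer, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String folder : folders) {
                // Messages within a folder stay sequential inside indexFolder.
                Callable<Integer> task = () -> indexer.indexFolder(folder);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // rethrows (wrapped) any failure from that folder's task
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Dummy indexer that pretends every folder holds 38 messages.
        int total = indexAll(Arrays.asList("INBOX", "Lists/Lucene Dev"), name -> 38, 2);
        System.out.println("Messages added to index: " + total);
    }
}
```

Given the latency to the server, even two or three folders in flight at once might hide much of the per-message round trip, but that is speculation until someone measures it.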