
This utility builds a Lucene index of synonyms from the WordNet Prolog database package. It parses the Prolog source file and writes a 3MB index containing 43,372 documents. Each "document" has a 'word' field holding the target word (such as 'big'), followed by one 'syn' field per synonym ('grown', 'large', 'adult', ...; 23 or so for 'big').
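
The grouping step described above can be sketched as follows. This is a minimal illustration, not the code in Syns2Index.java: the class and method names (SynonymGrouper, parseLine, group) are hypothetical, the lowercasing is inferred from the sample output, and the synset ids in the usage example are made up. It assumes each input line looks like the s(...) facts shown in the status output, with the synset id as the first argument and the word in single quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SynonymGrouper {
    /** Returns {synsetId, word} for a line such as s(100001742,1,'entity',n,1,11). or null if it does not match. */
    static String[] parseLine(String line) {
        line = line.trim();
        if (!line.startsWith("s(")) return null;
        int firstComma = line.indexOf(',');
        int openQuote = line.indexOf('\'');
        int closeQuote = line.lastIndexOf('\'');
        if (firstComma < 0 || openQuote < 0 || closeQuote <= openQuote) return null;
        String synsetId = line.substring(2, firstComma);
        // Prolog escapes an apostrophe inside a quoted atom by doubling it.
        String word = line.substring(openQuote + 1, closeQuote)
                .replace("''", "'")
                .toLowerCase(Locale.ROOT); // assumption: the index stores lowercase words
        return new String[] { synsetId, word };
    }

    /** Maps each word to the other words that share at least one synset with it. */
    static Map<String, Set<String>> group(List<String> lines) {
        Map<String, List<String>> bySynset = new TreeMap<>();
        for (String line : lines) {
            String[] parsed = parseLine(line);
            if (parsed == null) continue;
            bySynset.computeIfAbsent(parsed[0], k -> new ArrayList<>()).add(parsed[1]);
        }
        Map<String, Set<String>> synonyms = new TreeMap<>();
        for (List<String> members : bySynset.values()) {
            for (String word : members) {
                for (String other : members) {
                    if (!other.equals(word)) {
                        synonyms.computeIfAbsent(word, k -> new TreeSet<>()).add(other);
                    }
                }
            }
        }
        return synonyms;
    }

    public static void main(String[] args) {
        // Illustrative lines only; the synset ids here are invented.
        List<String> lines = Arrays.asList(
                "s(1,1,'nard',n,1,0).",
                "s(1,2,'spikenard',n,1,0).");
        System.out.println(group(lines));
    }
}
```

Each key of the resulting map would become one document's 'word' field, and each value in its set one 'syn' field.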

Phrases (with spaces and hyphens) and non-alphabetic words are stripped out.
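
A filter of this kind could look like the sketch below. The predicate name isDecent is hypothetical (it only echoes the ndecent counter in the status output), and the rule "every character must be a letter" is an assumption about what "non-alphabetic" means here. Under that rule, entries such as 'quick_fix' and 'fine-tooth_comb' are rejected because of the underscore and hyphen.

```java
class WordFilter {
    /** True only for non-empty words made up entirely of letters. */
    static boolean isDecent(String word) {
        if (word.isEmpty()) return false;
        for (int i = 0; i < word.length(); i++) {
            if (!Character.isLetter(word.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        for (String w : new String[] { "big", "quick_fix", "fine-tooth_comb" }) {
            System.out.println(w + " -> " + isDecent(w));
        }
    }
}
```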

The intent is to run this once and then, in a separate search tool, offer an option to expand a user's query with the synonyms of each word used. This is for experimental use; such expansion may return "too many" documents.

Download

Syns2Index.java

Building

Should be obvious enough - there's only one file.

Data Flow

Input is the Prolog synonym file, which defaults to c:/proj/wordnet/prolog/wn_s.pl. You can pass a different file as an argument on the command line. Output is hardcoded to be a Lucene index named syn_index in the current directory.
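
As a rough sketch of that behaviour (the class, constant, and method names are hypothetical, and it is assumed that the first command-line argument is the one that overrides the default):

```java
class InputPath {
    // Default input location and fixed output directory, as described above.
    static final String DEFAULT_PROLOG_FILE = "c:/proj/wordnet/prolog/wn_s.pl";
    static final String INDEX_DIR = "syn_index";

    /** Uses the first argument if one is given, otherwise the default path. */
    static String resolve(String[] args) {
        return args.length > 0 ? args[0] : DEFAULT_PROLOG_FILE;
    }

    public static void main(String[] args) {
        System.out.println("Opening " + resolve(args));
        System.out.println("Writing index to " + INDEX_DIR);
    }
}
```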

Running

This takes a few minutes to run and produces some status output as it goes.

    C:\proj\tropo_java>java com.tropo.wordnet.Syns2Index
    Opening c:/proj/wordnet/prolog/wn_s.pl
    
	2 s(100001742,1,'entity',n,1,11). 0 0 ndecent=0
	4 s(100002219,1,'thing',n,12,0). 1 1 ndecent=1
	8 s(100002579,2,'nonentity',n,3,0). 5 5 ndecent=1
	16 s(100004024,1,'life',n,11,31). 10 7 ndecent=4
	32 s(100011413,4,'brute',n,2,0). 23 13 ndecent=7
	64 s(100021905,1,'event',n,1,62). 50 28 ndecent=12
	128 s(100032210,1,'advent',n,1,2). 99 67 ndecent=24
	256 s(100042358,1,'completion',n,1,40). 198 126 ndecent=47
	512 s(100065408,2,'return',n,4,3). 377 234 ndecent=107
	1024 s(100109417,1,'fine-tooth_comb',n,2,0). 734 463 ndecent=226
	2048 s(100206038,2,'quick_fix',n,1,0). 1406 930 ndecent=459
	4096 s(100397839,4,'interpretative_dancing',n,1,0). 2677 1890 ndecent=940
	8192 s(100805532,2,'gavage',n,1,0). 4911 3724 ndecent=2194
	16384 s(101612956,1,'Haliotidae',n,1,0). 8525 6939 ndecent=6254
	32768 s(103096786,2,'hydroxyzine',n,1,0). 15286 13240 ndecent=14337
	65536 s(106301017,1,'roast',n,1,0). 28778 27352 ndecent=25437
	131072 s(112574257,1,'liquefied_petroleum_gas',n,1,0). 50535 52425 ndecent=56968
	
	row=1 doc= Document<Keyword<word:scum> Unindexed<syn:trash>>
	row=2 doc= Document<Keyword<word:nard> Unindexed<syn:spikenard>>
	row=4 doc= Document<Keyword<word:intromit> Unindexed<syn:admit>>
	row=8 doc= Document<Keyword<word:shitter> Unindexed<syn:voider> Unindexed<syn:defecator>>
	row=16 doc= Document<Keyword<word:winning> Unindexed<syn:victorious> Unindexed<syn:taking> Unindexed<syn:fetching>>
	row=32 doc= Document<Keyword<word:grampus> Unindexed<syn:orca> Unindexed<syn:killer>>
	row=64 doc= Document<Keyword<word:chopper> Unindexed<syn:whirlybird> Unindexed<syn:pearly> Unindexed<syn:helicopter> Unindexed<syn:eggbeater> Unindexed<syn:cleaver> Unindexed<syn:chop>>
	row=128 doc= Document<Keyword<word:fuchsia> Unindexed<syn:magenta>>
	row=256 doc= Document<Keyword<word:adrianople> Unindexed<syn:edirne> Unindexed<syn:adrianopolis>>
	row=512 doc= Document<Keyword<word:lack> Unindexed<syn:want> Unindexed<syn:miss> Unindexed<syn:deficiency>>
	row=1024 doc= Document<Keyword<word:battler> Unindexed<syn:scrapper> Unindexed<syn:fighter> Unindexed<syn:combatant> Unindexed<syn:belligerent>>
	row=2048 doc= Document<Keyword<word:disfavour> Unindexed<syn:dislike> Unindexed<syn:disfavor> Unindexed<syn:disapproval> Unindexed<syn:disadvantage>>
	row=4096 doc= Document<Keyword<word:deflect> Unindexed<syn:parry> Unindexed<syn:obviate> Unindexed<syn:distract> Unindexed<syn:deviate> Unindexed<syn:debar> Unindexed<syn:block> Unindexed<syn:bend> Unindexed<syn:avoid> Unindexed<syn:avert>>
	row=8192 doc= Document<Keyword<word:collapse> Unindexed<syn:tumble> Unindexed<syn:give> Unindexed<syn:founder> Unindexed<syn:flop> Unindexed<syn:crumple> Unindexed<syn:crumble> Unindexed<syn:crash> Unindexed<syn:crack> Unindexed<syn:burst> Unindexed<syn:break>>
	row=16384 doc= Document<Keyword<word:bahrein> Unindexed<syn:bahrain>>
	row=32768 doc= Document<Keyword<word:overbearingness> Unindexed<syn:imperiousness> Unindexed<syn:domineeringness>>
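
The status lines above appear at counts of 1, 2, 4, 8, and so on, which keeps the log short even for a large input. One common way to get that pattern is to log only when the counter is a power of two; the sketch below shows that check. It is an assumption about how such output could be produced, not a description of the actual code, and the names are hypothetical.

```java
class ProgressLog {
    /** True for 1, 2, 4, 8, ...: a positive number with exactly one bit set. */
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    public static void main(String[] args) {
        for (int row = 1; row <= 100; row++) {
            if (isPowerOfTwo(row)) {
                System.out.println("row=" + row);
            }
        }
    }
}
```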
    
    

What Now

Any generic Lucene query tool can show you synonyms: search for something like "word:big" and you get output like this. The point, however, is to go to the next step and modify queries against "real" indexes to see if you can find the docs the user intended to search for.
	
	Documents : 43,372
	Index Size: 3MB
	Searching for: word:big
	1 total matching documents after 380(ms)

        name=word sv="big"
        name=syn sv="vauntingly"
        name=syn sv="vainglorious"
        name=syn sv="swelled"
        name=syn sv="prominent"
        name=syn sv="openhanded"
        name=syn sv="momentous"
        name=syn sv="magnanimous"
        name=syn sv="liberal"
        name=syn sv="large"
        name=syn sv="handsome"
        name=syn sv="grownup"
        name=syn sv="grown"
        name=syn sv="giving"
        name=syn sv="freehanded"
        name=syn sv="crowing"
        name=syn sv="braggy"
        name=syn sv="bountiful"
        name=syn sv="bounteous"
        name=syn sv="boastfully"
        name=syn sv="boastful"
        name=syn sv="bighearted"
        name=syn sv="bad"
        name=syn sv="adult"
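
The query expansion step itself might be sketched like this. To stay independent of any particular Lucene version, the sketch uses a plain in-memory map as a stand-in for the synonym index lookup and produces a query string in which each word is OR-ed with its synonyms. The class and method names are hypothetical, and a real tool would presumably look up the 'syn' fields from syn_index instead.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class QueryExpander {
    /** Rewrites each whitespace-separated term as (term OR syn1 OR syn2 ...) when synonyms are known. */
    static String expand(String query, Map<String, List<String>> synonyms) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (term.isEmpty()) continue;
            if (out.length() > 0) out.append(' ');
            List<String> syns = synonyms.getOrDefault(
                    term.toLowerCase(Locale.ROOT), Collections.emptyList());
            if (syns.isEmpty()) {
                out.append(term); // no synonyms known: leave the term untouched
            } else {
                out.append('(').append(term);
                for (String s : syns) out.append(" OR ").append(s);
                out.append(')');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("big", Arrays.asList("large", "adult"));
        System.out.println(expand("big dog", synonyms));
    }
}
```

With 23 synonyms for 'big' alone, an expansion like this grows quickly, which is one reason the expanded query may match "too many" documents.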