Lucene 4 is Super Convenient for Developing NLP Tools


Author: Koji Sekiguchi

The other day, I wrote a system that automatically obtains synonym knowledge from a dictionary corpus. A dictionary corpus is a collection of entries, each consisting of a “keyword” and its “description”. Put simply, it’s a dictionary. Familiar examples of dictionary corpora are electronic dictionaries and Wikipedia data. You could also say that the combination of “item name” and “item description” on EC sites forms a dictionary corpus.

See SlideShare for the details of this system.

Automatically Obtaining Synonym Knowledge from Dictionary Corpus

I originally wrote this system because I wanted to use Wikipedia to automatically create the synonyms.txt file used by the SynonymFilter of Lucene/Solr. The SynonymFilter of Lucene/Solr can consume the output CSV file, and the system itself uses Lucene 4.0 internally.

I had always thought that Lucene 4.0 would be convenient for developing NLP tools, and building this system confirmed that impression.

The Lucene 4.0 classes that I used to develop this system are as follows:

  • IndexSearcher, TermQuery, TopDocs
    This system calculates the similarity of synonym candidates, which are nouns extracted from keywords and their descriptions. The system determines that a candidate is a synonym of the keyword if the similarity is greater than a threshold value, and outputs it to a CSV file.
    But how do I calculate the similarity of a keyword and its synonym candidate? This system determines it by calculating the similarity between the keyword description Aa and the set of dictionary entry descriptions {Ab} that are written using the synonym candidate.
    Thus, I have to find {Ab}, which is where I used classes such as IndexSearcher, TermQuery, and TopDocs to search the description field for the synonym candidate.
  • PriorityQueue
    Next, I have to pick out the “feature words” from Aa and {Ab} to calculate the similarity of the two. In order to do so, I select the N most important words to build a feature vector. Here, I use the TF*IDF of the target word as its degree of importance; see the SlideShare above for the details. I use PriorityQueue to select the “N most important words”.
  • DocsEnum, TotalHitCountCollector
    I used TF*IDF as the weight when extracting the feature words above, and used DocsEnum.freq() to obtain the TF. The docFreq (the number of articles containing the synonym candidate), which is a required parameter for obtaining the IDF, was calculated by passing a TotalHitCountCollector to the search() method of IndexSearcher.
  • Terms, TermsEnum
    I used these classes to search the “description” field for synonym candidates.
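Lucene aside, the core of the steps above (building top-N TF*IDF feature vectors and comparing Aa with each description in {Ab}) can be sketched in plain Python. This is a minimal illustration, not the system's actual code: the toy corpus, top_n value, and the smoothed IDF formula are my assumptions, and Lucene's own scoring formula differs.

```python
import math
import heapq
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, num_docs, top_n=10):
    """Build a TF*IDF feature vector, keeping only the N most important
    terms (the role PriorityQueue plays in the system described above)."""
    tf = Counter(doc_tokens)
    weights = {}
    for term, freq in tf.items():
        # Smoothed IDF (an assumption; Lucene's formula differs slightly).
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0))) + 1.0
        weights[term] = freq * idf
    top = heapq.nlargest(top_n, weights.items(), key=lambda kv: kv[1])
    return dict(top)

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus: keyword description Aa and candidate descriptions {Ab}.
docs = {
    "car":  "a road vehicle with an engine and four wheels".split(),
    "auto": "a vehicle with an engine used on roads".split(),
    "fish": "an animal that lives and swims in water".split(),
}
num_docs = len(docs)
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))

vec_a = tfidf_vector(docs["car"], doc_freq, num_docs)
for cand in ("auto", "fish"):
    sim = cosine(vec_a, tfidf_vector(docs[cand], doc_freq, num_docs))
    print(cand, round(sim, 3))
```

A candidate whose similarity exceeds the chosen threshold would then be written to the CSV file; in this toy example, “auto” scores higher against “car” than “fish” does.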

These are the ways this system uses Lucene 4.0. I believe Lucene will be a great help for other NLP tool developers as well. For a lexical knowledge acquisition task using bootstrapping, for example, you can run a cycle (1: pattern extraction, 2: pattern selection, 3: instance extraction, 4: instance selection) to obtain knowledge from a small number of seed instances. I believe that you can replace pattern extraction and instance extraction with simple search tasks if you use Lucene for them.
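The bootstrap cycle mentioned above can be sketched as a toy loop. This is purely illustrative: the corpus, the naive `search()` stand-in, and the context-pair patterns are my assumptions; in a real system, both the pattern-matching and instance-matching steps would be Lucene queries against an index.

```python
# Toy bootstrap: grow a set of instances from a single seed by
# alternating pattern extraction and instance extraction over a
# tiny in-memory "index" (a real system would query Lucene here).
corpus = [
    "lives in Paris now",
    "lives in Tokyo now",
    "moved to Tokyo last year",
    "moved to Berlin last year",
]

def search(phrase):
    """Stand-in for an index search: sentences containing the phrase."""
    return [s for s in corpus if phrase in s]

seeds = {"Paris"}
patterns = set()
for _ in range(3):  # a few bootstrap cycles
    # Steps 1-2: pattern extraction/selection -- contexts of known instances.
    for inst in list(seeds):
        for sent in search(inst):
            left, _, right = sent.partition(inst)
            patterns.add((left.strip(), right.strip()))
    # Steps 3-4: instance extraction/selection -- what the patterns capture.
    for left, right in patterns:
        for sent in corpus:
            if sent.startswith(left) and sent.endswith(right):
                inst = sent[len(left):len(sent) - len(right)].strip()
                if inst:
                    seeds.add(inst)

print(sorted(seeds))
```

Starting from the seed “Paris”, the pattern “lives in _ now” yields “Tokyo”, which in turn yields “moved to _ last year” and then “Berlin”, showing how each cycle reduces to simple search operations.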
