News

07/03/2014
Comparing Document Classification Functions of Lucene and Mahout

Author: Koji Sekiguchi

Starting with version 4.2, Lucene provides a document classification function. In this article, we will run the document classification functions of both Lucene and Mahout on the same corpus and compare the results.

Lucene implements Naive Bayes and k-NN rule classifiers. The trunk, which corresponds to Lucene 5, the next major release, additionally implements a boolean (2-class) perceptron classifier. Here we use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and the k-NN rule.

On the Mahout side, we will perform document classification with Naive Bayes and Random Forest.

Overview of Lucene Document Classification

Lucene’s classifier for document classification is defined as the Classifier interface.

public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type <code>T</code> and score
   * @throws IOException If there is a low-level I/O error.
   */
  public ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Train the classifier using the underlying Lucene index
   * @param atomicReader the reader to use to access the Lucene index
   * @param textFieldName the name of the field used to compare documents
   * @param classFieldName the name of the field containing the class assigned to documents
   * @param analyzer the analyzer used to tokenize / filter the unseen text
   * @param query the query to filter which documents use for training
   * @throws IOException If there is a low-level I/O error.
   */
  public void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer, Query query)
      throws IOException;
}

Because Classifier uses a Lucene index as its training data, you need to open an IndexReader on a prepared index and pass it as the first argument of the train() method. The second argument is the name of the Lucene field that holds the text, which must be tokenized and indexed, and the third argument is the name of the Lucene field that holds the document category. In the same manner, a Lucene Analyzer is passed as the fourth argument and a Query as the fifth. The Analyzer is the one used to tokenize the unknown text to be classified (in my personal opinion, this is a bit confusing, and it would be better to pass it to the assignClass() method described below instead). The Query is used to narrow down the documents used for training; pass null if there is no need to do so. The train() method has two more overloads with different arguments, but I will skip them here.

After calling train(), you call the assignClass() method with the unknown document as a String argument to obtain the classification result. Classifier is a generic interface, and assignClass() returns a ClassificationResult parameterized with the type variable T.

public class ClassificationResult<T> {

  private final T assignedClass;
  private final double score;

  /**
   * Constructor
   * @param assignedClass the class <code>T</code> assigned by a {@link Classifier}
   * @param score the score for the assignedClass as a <code>double</code>
   */
  public ClassificationResult(T assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  /**
   * retrieve the result class
   * @return a <code>T</code> representing an assigned class
   */
  public T getAssignedClass() {
    return assignedClass;
  }

  /**
   * retrieve the result score
   * @return a <code>double</code> representing a result score
   */
  public double getScore() {
    return score;
  }
}

Calling the getAssignedClass() method of ClassificationResult gives you a classification result of the type T.
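
Putting the two methods together, a minimal call sequence looks roughly like the following. This is only a sketch: it assumes an index that already has "body" (tokenized text) and "cat" (class) fields, like the Solr schema shown later in this article, and it omits error handling.

// Minimal usage sketch of the Classifier API (Lucene 4.6); not a complete program.
Directory dir = FSDirectory.open(new File("solr2/collection1/data/index"));
IndexReader reader = DirectoryReader.open(dir);
AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);

Classifier<BytesRef> classifier = new SimpleNaiveBayesClassifier();
// train on the "body" (text) field and the "cat" (class) field;
// passing null as the Query means "use all documents for training"
classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46), null);

ClassificationResult<BytesRef> result = classifier.assignClass("text of an unknown document ...");
System.out.println(result.getAssignedClass().utf8ToString() + " : " + result.getScore());
reader.close();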

Note that Lucene’s classifier is unusual in that the train() method does little work while assignClass() does most of it. This is very different from common machine learning software. In the learning phase of typical machine learning software, a model file is created by learning the corpus with the selected machine learning algorithm (this is where most of the time and effort goes; because Mahout is based on Hadoop, it uses MapReduce to reduce the time required here). In the classification phase, an unknown document is then classified by referring to the previously created model file, which usually requires few resources.

Because Lucene uses the index itself as its model file, the train() method, which corresponds to the learning phase, does almost nothing here (learning is complete as soon as the index is created). Lucene’s index, however, is optimized for high-speed keyword search and is not an ideal format for a document classification model. The classification phase therefore does the real work: the assignClass() method classifies a document by searching the index. Contrary to common machine learning software, Lucene’s classifier needs most of its computing power in the classification phase. For sites that are mainly focused on search, this should still be appealing, because they get document classification from the index they already build, at no additional cost.

Now, let’s quickly go through how the two implementation classes of the Classifier interface classify documents and actually call them from a program.

Using Lucene SimpleNaiveBayesClassifier

SimpleNaiveBayesClassifier is the first implementation class of the Classifier interface. As the name suggests, it is a Naive Bayes classifier. Naive Bayes classification finds the class c for which the conditional probability P(c|d), the probability that document d belongs to class c, is highest. Applying Bayes’ theorem to P(c|d), it is enough to find the class c that maximizes P(c)P(d|c). Logarithms are normally used to avoid underflow, and the assignClass() method of SimpleNaiveBayesClassifier repeats this calculation once per class to perform maximum likelihood estimation (MLE).
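
Written out, this is the standard Naive Bayes decision rule (shown here for reference, not copied from the Lucene source), where the unknown document d is represented by its tokens w1, ..., wn:

\hat{c} \;=\; \arg\max_{c} P(c \mid d)
       \;=\; \arg\max_{c} P(c)\,P(d \mid c)
       \;\approx\; \arg\max_{c} \Bigl[\, \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \,\Bigr]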

We now use SimpleNaiveBayesClassifier, but before that, we need to prepare learning data in an index. Here we use the livedoor news corpus as our corpus. Let’s add the livedoor news corpus to the index using the following Solr schema definition.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <fields>
    <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_ja" indexed="true" stored="true" multiValued="false"/>
    <field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>

Note that the cat field holds the classification class while the body field is the field used for learning. First, start Solr with the above schema.xml and add the livedoor news corpus. You can stop Solr as soon as you finish adding the corpus.

Next, we need a Java program that uses SimpleNaiveBayesClassifier. To keep things simple, we classify the very same documents that were used for learning. The program is as follows.

public final class TestLuceneIndexClassifier {

  public static final String INDEX = "solr2/collection1/data/index";
  public static final String[] CATEGORIES = {
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news"
  };
  private static int[][] counts;
  private static Map<String, Integer> catindex;

  public static void main(String[] args) throws Exception {
    init();

    final long startTime = System.currentTimeMillis();
    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
    IndexReader reader = DirectoryReader.open(dir());
    AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);

    classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46));
    final int maxdoc = reader.maxDoc();
    for(int i = 0; i < maxdoc; i++){
      Document doc = ar.document(i);
      String correctAnswer = doc.get("cat");
      final int cai = idx(correctAnswer);
      ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("body"));
      String classified = result.getAssignedClass().utf8ToString();
      final int cli = idx(classified);
      counts[cai][cli]++;
    }
    final long endTime = System.currentTimeMillis();
    final int elapse = (int)(endTime - startTime) / 1000;

    // print results
    int fc = 0, tc = 0;
    for(int i = 0; i < CATEGORIES.length; i++){
      for(int j = 0; j < CATEGORIES.length; j++){
        System.out.printf(" %3d ", counts[i][j]);
        if(i == j){
          tc += counts[i][j];
        }
        else{
          fc += counts[i][j];
        }
      }
      System.out.println();
    }
    float accrate = (float)tc / (float)(tc + fc);
    float errrate = (float)fc / (float)(tc + fc);
    System.out.printf("\n\n*** accuracy rate = %f, error rate = %f; time = %d (sec); %d docs\n", accrate, errrate, elapse, maxdoc);

    reader.close();
  }

  static Directory dir() throws IOException {
    return FSDirectory.open(new File(INDEX));
  }

  static void init(){
    counts = new int[CATEGORIES.length][CATEGORIES.length];
    catindex = new HashMap<String, Integer>();
    for(int i = 0; i < CATEGORIES.length; i++){
      catindex.put(CATEGORIES[i], i);
    }
  }

  static int idx(String cat){
    return catindex.get(cat);
  }
}

Here we specified JapaneseAnalyzer as the Analyzer (strictly speaking, this differs slightly from index time, where Solr applied JapaneseTokenizer and the related TokenFilters configured in the schema). The String array CATEGORIES hard-codes the document categories. Executing this program prints a confusion matrix, as Mahout does, with rows and columns in the same order as the hard-coded category array.

Executing this program displays the following.

 760    0    4   23   37   37    2    2    5
  40  656    7   44   25    4   90    1    3
  87   57  392  102   68   24  113    5   16
  40   15    6  391   33    8   16    2    0
  14    2    0    5  845    2    0    1    1
 134    2    2   26  107  549   19    3    0
  43   36   13   17   26   36  693    5    1
   6    0    0   23   35    0    1  829    6
  10    9    9   25   66    6    5   45  595 

*** accuracy rate = 0.775078, error rate = 0.224922; time = 67 (sec); 7367 docs

The classification accuracy rate reached about 77%.

Using Lucene KNearestNeighborClassifier

The other implementation class of Classifier is KNearestNeighborClassifier. You create an instance of KNearestNeighborClassifier by passing k (no less than 1) as a constructor argument. You can reuse exactly the same program as for SimpleNaiveBayesClassifier; all you need to do is replace the line that creates the SimpleNaiveBayesClassifier instance with KNearestNeighborClassifier, as shown below.
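
For example, with the TestLuceneIndexClassifier program shown above, the only line that changes is the one that creates the classifier instance (a sketch; declaring the variable through the Classifier<BytesRef> interface lets the two implementations be swapped freely):

// Instead of: SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
// instantiate the k-NN classifier, passing k (here k = 1) to the constructor:
Classifier<BytesRef> classifier = new KNearestNeighborClassifier(1);
// train() and assignClass() are then called exactly as before.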

As before, the assignClass() method does all the work for KNearestNeighborClassifier as well, and one interesting point is that it uses Lucene’s MoreLikeThis. MoreLikeThis is a tool that turns a reference document into a query and runs a search with it, letting you find documents similar to that reference document. KNearestNeighborClassifier uses MoreLikeThis to find the k documents most similar to the unknown document passed to assignClass(), and the majority class among those k documents determines the category of the unknown document.

Executing the same program with KNearestNeighborClassifier displays the following when k=1.

 724   14   28   22    6   30    8   18   20
 121  630   41   13    2    9   35    6   13
 165   28  582   10    5   16   26    7   25
 229   15   15  213    6   14    6    2   11
 134   37   15    8  603   12   19    7   35
 266   38   39   24   14  412   22    9   18
 810   16    1    3    2    3   32    1    2
 316   18   14   12    5    7    8  439   81
 362   17   29   10    1    7    7   16  321 

*** accuracy rate = 0.536989, error rate = 0.463011; time = 13 (sec); 7367 docs

This time the accuracy rate is about 53%. Moreover, with k=3 the accuracy rate drops further, to about 48%.

 652    5   78    3    7   40   13   38   34
 127  540   82   15    1   10   58   23   14
 169   34  553    3    7   16   38   15   29
 242   10   32  156   12   13   15   10   21
 136   30   21    9  592   11   19   15   37
 309   34   58    5   23  318   40   28   27
 810    8    3    1    0   10   37    1    0
 312    8   44    7    5    2   13  442   67
 362   11   45    5    6   10   16   34  281 

*** accuracy rate = 0.484729, error rate = 0.515271; time = 9 (sec); 7367 docs

Document Classification by NLP4L and Mahout

If you want to use a Lucene index as input data for Mahout, there is a handy command available. However, since our purpose here is supervised document classification, we also need to output the field that specifies the class in addition to the document vectors.

The tools that make this easy are MSDDumper and TermsDumper from NLP4L, which we developed. NLP4L stands for Natural Language Processing for Lucene and is a natural language processing tool set that treats a Lucene index as a corpus.

Depending on their settings, MSDDumper and TermsDumper select and extract important words from a Lucene field according to measures such as tf*idf and output them in a format that Mahout commands can read easily. Let’s use this function to select the 2,000 most important words from the body field of the index and run the Mahout classification.

Looking only at the result, Mahout Naive Bayes shows an accuracy rate of about 96%.

=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7128	   96.7689%
Incorrectly Classified Instances        :        238	    3.2311%
Total Classified Instances              :       7366

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	h    	i    	<--Classified as
823  	1    	1    	6    	12   	19   	2    	4    	2    	 |  870   	a     = dokujo-tsushin
1    	848  	2    	1    	0    	1    	11   	4    	2    	 |  870   	b     = it-life-hack
5    	6    	830  	1    	1    	0    	3    	1    	17   	 |  864   	c     = kaden-channel
2    	6    	6    	486  	3    	1    	6    	0    	0    	 |  510   	d     = livedoor-homme
0    	0    	1    	1    	865  	1    	0    	1    	1    	 |  870   	e     = movie-enter
31   	3    	6    	12   	14   	762  	6    	4    	4    	 |  842   	f     = peachy
0    	0    	2    	0    	0    	1    	867  	0    	0    	 |  870   	g     = smax
0    	0    	0    	1    	0    	0    	0    	897  	2    	 |  900   	h     = sports-watch
2    	4    	1    	1    	0    	0    	0    	12   	750  	 |  770   	i     = topic-news

=======================================================
Statistics
-------------------------------------------------------
Kappa                                        0.955
Accuracy                                   96.7689%
Reliability                                87.0076%
Reliability (standard deviation)             0.307

Mahout Random Forest, in turn, shows an accuracy rate of about 97%.

=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7156	   97.1359%
Incorrectly Classified Instances        :        211	    2.8641%
Total Classified Instances              :       7367

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	h    	i    	<--Classified as
838  	5    	2    	6    	3    	7    	2    	0    	1    	 |  864   	a     = kaden-channel
0    	895  	0    	1    	4    	0    	0    	0    	0    	 |  900   	b     = sports-watch
0    	0    	869  	0    	0    	1    	0    	0    	0    	 |  870   	c     = smax
0    	2    	0    	839  	1    	0    	14   	2    	12   	 |  870   	d     = dokujo-tsushin
1    	17   	0    	0    	748  	0    	2    	0    	2    	 |  770   	e     = topic-news
1    	5    	0    	1    	5    	855  	2    	0    	1    	 |  870   	f     = it-life-hack
0    	1    	0    	23   	0    	0    	793  	1    	24   	 |  842   	g     = peachy
0    	11   	0    	14   	1    	2    	18   	454  	11   	 |  511   	h     = livedoor-homme
0    	1    	0    	2    	0    	0    	2    	0    	865  	 |  870   	i     = movie-enter

=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.9608
Accuracy                                   97.1359%
Reliability                                87.0627%
Reliability (standard deviation)            0.3076

Summary

In this article, we ran document classification with both Lucene and Mahout on the same corpus and compared the results. The accuracy rate appears higher for Mahout but, as noted above, its training data used not all of the words but only the 2,000 most important words in the body field. Lucene’s classifier, whose accuracy rate topped out in the 70% range, uses all the words in the body field. Lucene should be able to pass the 90% accuracy mark if you prepare a field that holds only words selected specifically for document classification. It might also be a good idea to create another Classifier implementation whose train() method performs such feature selection.

I should add that the accuracy rate drops to around 80% when the test data is not also used for learning but is treated as truly unknown data.

I hope this article will help you all in some way.

20/12/2013
Mahout and Machine Learning Training Course is Here

Author: Koji Sekiguchi

  • This training course is currently provided in Tokyo in Japanese only. We are, however, considering providing the course overseas as its contents are very intriguing.

    Please contact me if you are interested and would like to help me provide the course overseas.


  • Many enterprises have been tackling the task of analyzing big data to gain new insights. In the course of this challenge, machine learning is drawing attention as an essential concept.

    RONDHUIT is pleased to announce the introduction of a machine learning training course using Apache Mahout.

    Click to go to the “Machine Learning Using Apache Mahout” page.

    Machine Learning Using Apache Mahout is a training course that mainly consists of hands-on sessions. The course systematically organizes the basics of machine learning and uses Mahout as needed. Each unit provides practical exercises that you can solve to deepen your understanding and gain a foothold in applying the knowledge to real operations.

    Training Course Features

    • Our textbooks use plenty of graphics and charts so that you can study the material effectively in a limited amount of time. In addition, the detailed notes provided on each page will be very useful for self-study outside the class.
    • We are confident you will gain both a view of the background theory and practical knowledge by solving the approachable exercises in each unit together with the instructor.

    Here is one of the exercises:

    Use the following two methods to show that, among all rectangles with a fixed perimeter, the square has the largest area (a brief sketch of the second method follows the list below).

    • A solution using simple variable elimination with simultaneous equations.
    • A solution using a Lagrange multiplier.
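
    As a quick sketch of the second method (my own illustration, not part of the course textbook): maximize the area A = xy of a rectangle with sides x and y subject to the perimeter constraint 2x + 2y = P.

        \mathcal{L}(x, y, \lambda) = xy - \lambda\,(2x + 2y - P)

        \frac{\partial \mathcal{L}}{\partial x} = y - 2\lambda = 0, \qquad
        \frac{\partial \mathcal{L}}{\partial y} = x - 2\lambda = 0
        \;\Rightarrow\; x = y = \frac{P}{4}

    so the rectangle of maximum area is the square with side P/4.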

    This training course is recommended if you:

    • Want to systematically study the hot topic of machine learning and put your experience to good use in your development work.
    • Are an information systems staff member at an enterprise that holds big data and need a minimum of knowledge in order to commission a big-data development project to an integrator.
    • Have bought books on machine learning in order to study the topic but have trouble reading them because of the mathematical expressions.
    • Purchased “Mahout in Action” and are using Mahout but are not comfortable using the technology.

    Contents of Training Course

    This training course is a 2-day course.

    Day 1

    Day 1 of this 2-day course first looks at “What is Machine Learning?”, goes on to study pattern recognition and classification algorithms for supervised learning, and finishes with writing a handwritten character recognition program for which all students help create the handwritten character data. You will be amazed to find out how many handwritten characters the Mahout classifier recognizes!

    • Machine Learning and Apache Mahout
      • What is Machine Learning?
      • Model [Exercises]
      • What is Apache Mahout?
      • Installing Mahout [Exercises]
    • Pattern Recognition
      • What is Pattern Recognition?
      • Feature (or Attribute) Vector
      • Various Distance Measures [Exercises]
      • Prototypes and Learning Data
    • Classification
      • Nearest Neighbor Rule
      • k-NN Rule [Exercises]
      • Prototyping by Learning
      • Derivation of Discriminant Function
      • Perceptron Learning Rule [Exercises]
      • Averaged Perceptron
      • Problems of Perceptron Learning Rule
      • Widrow-Hoff Learning Rule [Exercises]
      • Neural Network [Exercises]
      • Support Vector Machine
      • Lagrange Multiplier [Exercises]
      • Decision Tree [Exercises]
      • Learning Decision Tree [Exercises]
      • Naive Bayes Classifier [Exercises]
      • Multivariable Bernoulli Model [Exercises]
      • Extension to Multiclass Classification
    • Programming Handwritten Character Recognition
      • Structure of the Handwritten Character Recognition Program
      • Creating Learning Data [Exercises]
      • Facts of Handwritten Character Recognition [Exercises]

    Day 2

    Day 2 of this 2-day course first looks at functions other than classification that Mahout provides – recommendation and clustering – and goes on to study principal component analysis for reducing the dimensionality of feature vectors, evaluation of machine learning, and machine learning for natural language processing, all through exercises using Mahout.

    • Recommendation
      • What is Recommendation?
      • Information retrieval and Recommendation
      • Types of Recommendation Architecture
      • User Profiles and Their Collection
      • Forecasting Evaluation Values [Exercises]
      • Pearson Correlation Coefficient
      • Explanation in Recommendation
    • PageRank
      • Importance of Ranking
      • Rating Scale of Information retrieval System – Theory and Practice
      • Vector Space Model [Exercises]
      • Score Calculation of Apache Lucene
      • PageRank [Exercises]
      • HITS
    • Clustering
      • What is Clustering?
      • Clustering Methods
      • K-Means [Exercises]
      • Nearest Neighbor Method [Exercises]
      • Evaluating and Analyzing Clustering Results
      • Similar Image Search – Apache alike
      • Information retrieval and Clustering
    • Principal Component Analysis
      • Relationship between the Number of Learning Patterns and Dimensions [Exercises]
      • What is Principal Component Analysis?
      • Average and Variance [Exercises]
      • Covariance Matrix [Exercises]
      • Eigenvalue and Eigenvector [Exercises]
    • Evaluating Machine Learning
      • Evaluating and Analyzing Results
      • Partitioning Training Data
      • Precision and Recall Ratio [Exercises]
      • False Positive and False Negative [Exercises]
      • Evaluating Features
      • Within-Class Variance, Between-Class Variance
      • Bayes Error Rate
      • Feature Selection
    • Machine Learning in Natural Language Processing
      • What is Natural Language Processing?
      • Natural Language Processing for Lalognosis
      • Corpus
      • bag-of-words
      • N-gram Model [Exercises]
      • Sequential Labeling
      • Hidden Markov Model [Exercises]
      • Viterbi Algorithm [Exercises]
      • Introduction to NLP4L

    Prerequisite

    Familiarity with an editor such as vi or Emacs and knowledge of Linux commands are helpful, as the exercises require the use of an Ubuntu machine.

    Equipment

    • Please bring a notebook PC that has ssh installed and running. We can provide you with one if you don’t have access to a notebook PC.
    • Please bring a (mechanical) pencil and an eraser as some exercises involve hand calculations.

    We are looking forward to working with you!

    27/05/2013
    Automatically Acquiring Synonym Knowledge from Wikipedia

    Author: Koji Sekiguchi

    Lucene/Solr can use SynonymFilter to easily perform synonym searches. For example, you can create a synonyms.txt that contains the following …

    International Monetary Fund, IMF
    

    …then define “text” field type as follows in the schema.xml file of Solr:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>
    

    In a field that uses this text field type, a search for either “International Monetary Fund” or “IMF” will match documents that contain either of the two, with no extra effort on the user’s part.

    Synonym search certainly is convenient. For an administrator to offer users this convenient search function, however, he or she has to provide the synonym dictionary (the CSV file described above). New words are created every day, and so are new synonyms. A synonym dictionary may be prepared with great effort by the person in charge, but it is often left unmaintained as time goes by or when that person’s role is handed over.

    That is why people start wishing for automatic creation of the synonym dictionary, and that wish drove me to write the system I explain below. This system learns synonym knowledge from a “dictionary corpus” and outputs highly similar “original word – synonym” pairs to a CSV file, which in turn can be fed to the SynonymFilter of Lucene/Solr as is.

    A “dictionary corpus” is a corpus whose entries consist of “keywords” and their “descriptions”. An electronic dictionary is exactly that, and so is Wikipedia, which is familiar to everyone and easily accessible.

    Let’s look at a method that uses the Japanese version of Wikipedia to automatically acquire synonym knowledge.

    System Architecture

    Following is the architecture of this system.

    This system first registers the contents of the dictionary corpus in a Lucene index and then analyzes that index. Give the index “keyword” and “description” fields and enable term vectors on the description field. Using Solr, you should be able to prepare this part easily without writing any program. From here on, I will focus on how to analyze the index in which the dictionary corpus is registered.

    To summarize, the analysis program repeats the following steps for every keyword.

    1. Identify synonym candidate tB from the description field.
    2. Calculate semantic similarity S(tA, tB) between keyword tA and synonym candidate tB.
    3. If similarity S is greater than appropriate threshold value, write the pair (tA, tB) to a CSV file.

    Each step is described in detail below.

    Identifying Synonym Candidate tB

    The synonym entries that this system outputs have the format “original word, abbreviation 1, abbreviation 2, …”, where “original word” is the string in the “keyword” field and each “abbreviation n” is an abbreviation of the keyword found in the “description” field. Japanese abbreviations are similar to English acronyms. Unlike English, however, Japanese has far more abbreviations because it has many more characters.

    This system compares the keyword tA with each word tR in the description field from the following standpoints, verifying whether the acronym-like conditions are met.

    • The first letters of tA and tR match.
    • tA does not completely include tR.
    • Letters that make up tR are all included in tA and their orders of appearance match as well.

    The system discards the word tR if any of the above conditions is not satisfied and moves on to the next word. Even when all the conditions are met, it is possible that the surface strings merely happened to match. In that case the system treats the word tR as synonym candidate tB and proceeds to the next step, where semantic similarity is calculated.
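
    As an illustration of the three conditions above (a sketch only, not the actual code of this system), such a check could look like the following. It accepts, for example, 入管 as an abbreviation candidate for the keyword 入国管理局.

    /** Returns true if tR satisfies the acronym-like conditions for keyword tA (illustrative sketch). */
    static boolean isAcronymCandidate(String tA, String tR) {
      if (tA.isEmpty() || tR.isEmpty()) return false;
      // 1. the first letters of tA and tR must match
      if (tA.charAt(0) != tR.charAt(0)) return false;
      // 2. tA must not completely include tR as a contiguous substring
      if (tA.contains(tR)) return false;
      // 3. every letter of tR must appear in tA in the same order (subsequence check)
      int pos = 0;
      for (int i = 0; i < tR.length(); i++) {
        pos = tA.indexOf(tR.charAt(i), pos);
        if (pos < 0) return false;
        pos++;
      }
      return true;
    }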

    Calculating Semantic Similarity S

    The system next has to calculate the semantic similarity S(tA, tB) between keyword tA and synonym candidate tB, but it cannot do so directly. Instead of S, the system calculates the similarity S*(AA, {AB}) between the article AA of the keyword and the set of articles {AB} in which synonym candidate tB appears. The article similarity S* is obtained by computing the cosine similarity between the term vectors xA and xB.
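
    Written out, S* is simply the cosine similarity of the two term vectors (the same computation the cosine() method below performs):

        S^*(A_A, \{A_B\}) = \cos(x_A, x_B)
          = \frac{x_A \cdot x_B}{\lVert x_A \rVert\,\lVert x_B \rVert}
          = \frac{\sum_t w_A(t)\, w_B(t)}{\sqrt{\sum_t w_A(t)^2}\;\sqrt{\sum_t w_B(t)^2}}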

    Note that if {AB} = ∅, then S* = 1. This portion of the program code is as follows.

    List<String> acronyms = new ArrayList<String>();
    		:
    Map<String,Float> originVector = getFeatureVector(null, liveDocs, docId, terms, wordVectorSize, null);
    		:
    Query query = new TermQuery(new Term(fieldNameDesc, candidate));
    TopDocs topDocs = searcher.search(query, userDocsSize);
    // if there are no hits or only one hit, the candidate is an uncommon word
    double score = 1;
    if(topDocs.totalHits > 1){
      Map<String,Float> articleVector = null;
      for(int j = 0; j < userDocsSize && j < topDocs.totalHits; j++){
        if(topDocs.scoreDocs[j].doc != docId)
          articleVector = getFeatureVector(articleVector, liveDocs, topDocs.scoreDocs[j].doc, null, userDocVectorSize, candidate);
      }
      score = cosine(originVector, articleVector);
    }
    if(score > minScore){
      acronyms.add(candidate);
    }
    

    To compute the article set {AB} containing synonym candidate tB, the system creates a TermQuery for the synonym candidate on the description field fieldNameDesc and runs a search. From the search result topDocs, it collects every article except AA as the set {AB}. It does not keep {AB} itself, however; it converts the articles into term vectors and accumulates them in articleVector. The system then calculates the cosine of originVector, the term vector of the keyword article AA, and articleVector, the term vector of the article set {AB} containing synonym candidate tB. The cosine calculation code is as follows.

    public static double cosine(Map<String,Float> v1, Map<String,Float> v2){
      if(v1.size() == 0 || v2.size() == 0) return 0;
      double numerator = 0;
      for(Map.Entry<String,Float> e1 : v1.entrySet()){
        Float v2Value = v2.get(e1.getKey());
        numerator += e1.getValue() * (v2Value == null ? 0 : v2Value);
      }
      if(numerator == 0) return 0;
      double denominator = getSumSquareRoot(v1) * getSumSquareRoot(v2);
      // should not happen, but avoid division by zero
      if(denominator == 0.0) return 0;
      return numerator / denominator;
    }
    
    public static double getSumSquareRoot(Map<String,Float> v){
      double sum = 0;
      for(Map.Entry<String,Float> e : v.entrySet()){
        sum += e.getValue() * e.getValue();
      }
      return Math.sqrt(sum);
    }
    

    The term vectors xA and xB have the weight w(t) of each term t (excluding synonym candidate tB) as their elements.

    The weight wX(t) (X = A, B) of term t is computed from the term frequency tf of t in the article and its document frequency docFreq as

        w_X(t) = \sqrt{tf} \times \left(1 + \log \frac{maxDoc}{docFreq + 1}\right)

    which is essentially the same as the implementation of Lucene's TFIDFSimilarity class.

    The getFeatureVector method, which obtains the term vector (feature vector) x_docId of article docId, is as follows.

    protected Map<String,Float> getFeatureVector(Map<String,Float> results, Bits liveDocs,
          int docId, Terms termsReuse, int size, String stopWord) throws IOException {
      TermsEnum termsEnum = null;
      if(termsReuse == null)
        termsReuse = reader.getTermVector(docId, fieldNameDesc);
      termsEnum = termsReuse.iterator(termsEnum);
      TermVectorEntityQueue queue = new TermVectorEntityQueue(size);
      BytesRef text;
      while((text = termsEnum.next()) != null){
        // candidate feature term
        final String term = text.utf8ToString();
        if(stopWord != null && term.equals(stopWord)) continue;
        DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, liveDocs, fieldNameDesc, text);
        int d = docsEnum.advance(docId);
        if(d != docId){
          throw new RuntimeException("wrong docId!");
        }
        final int tf = docsEnum.freq();
        if(tf < vectorMinTf) continue;
        final int docFreq = docFreq(fieldNameDesc, text);
        final TfDocFreq tfdfPair = new TfDocFreq(tf, docFreq, maxDoc);
        Map.Entry<String,TfDocFreq> entry = new Map.Entry<String,TfDocFreq>() {
          public String getKey() {
            return term;
          }
          public TfDocFreq getValue() {
            return tfdfPair;
          }
          public TfDocFreq setValue(TfDocFreq arg0) {
            return null;
          }
        };
        queue.insertWithOverflow(entry);
      }
    
      if(results == null)
        results = new HashMap<String, Float>(queue.size());
      Map.Entry<String,TfDocFreq> entry = null;
      while((entry = queue.pop()) != null){
        Float existsValue = results.get(entry.getKey());
        float exv = existsValue == null ? 0 : existsValue.floatValue();
        results.put(entry.getKey(), entry.getValue().tfIdf() + exv);
      }
    
      return results;
    }
    

    The return type of getFeatureVector is a Map, so the weight of each String term is expressed as a Float. The stopWord argument receives synonym candidate tB, making sure the elements of term vector x_docId do not include tB. The size argument specifies the number of term vector elements, and the system keeps that many elements, starting from the largest weight (tf*idf) downward. For this it uses the class TermVectorEntityQueue, an extension of Lucene's PriorityQueue.

    protected static class TermVectorEntityQueue extends PriorityQueue<Map.Entry<String,TfDocFreq>> {
    
      public TermVectorEntityQueue(int maxSize) {
        super(maxSize);
      }
    
      // collect terms with larger weight
      protected boolean lessThan(Entry<String,TfDocFreq> a, Entry<String,TfDocFreq> b){
        return a.getValue().tfIdf() < b.getValue().tfIdf();
      }
    }
    

    TermVectorEntityQueue collects Map.Entry objects whose values are TfDocFreq instances. The TfDocFreq class below keeps the tf and docFreq of term t and has a method that calculates and returns the tf*idf weight.

    protected static class TfDocFreq {
      public final int tf, docFreq, maxDoc;
      public TfDocFreq(int tf, int docFreq, int maxDoc){
        this.tf = tf;
        this.docFreq = docFreq;
        this.maxDoc = maxDoc;
      }
    
      public float tfIdf(){
        return (float)(Math.sqrt(tf) * (1 + Math.log(maxDoc / (docFreq + 1))));
      }
    }
    

    docFreq is calculated by calling the docFreq method inside getFeatureVector, which obtains the number of hits from the following search.

    protected int docFreq(String field, BytesRef text) throws IOException {
      Query query = new TermQuery(new Term(field, text));
      TotalHitCountCollector c = new TotalHitCountCollector();
      searcher.search(query, c);
      return c.getTotalHits();
    }
    

    Finally, the system decides that keyword tA and synonym candidate tB are semantically similar if the S* score computed above is larger than the threshold value (minScore), and it outputs the pair (tA, tB) to a CSV file.

    Examples of Acquired Synonym Knowledge

    When this program was run against the Japanese Wikipedia, the system found about 11 thousand synonym entries among about 850 thousand items. The following are some of the Japanese synonyms actually obtained.

    入国管理局, 入管
    文房具, 文具
    社員食堂, 社食
    国際連盟, 国連
    リポビタンD, リポD
    :
    

    The Japanese Wikipedia also has alphabetic entries, so the system obtained English synonym knowledge such as the following as well.

    Universal Serial Bus, USB
    File Transfer Protocol, FTP
    World Wide Web, WWW
    Document Object Model, DOM
    Read Only Memory, ROM
    Cascading Style Sheets, CSS
    Domain Name System, DNS
    Local Area Network, LAN
    :
    

    Applying to Other Languages and Corpora

    This program is designed to be pluggable so that it can obtain synonym knowledge from any “dictionary corpus”. This article described how to output a synonym CSV file from the Japanese Wikipedia, and the method is applicable with little modification to Wikipedias in other languages, including English, which has the notion of acronyms. Of course, sources other than Wikipedia, such as electronic dictionaries and encyclopedias as well as manufacturing industry catalogs, also count as “dictionary corpora”, giving this program a wide range of applications.

    26/10/2012
    Lucene 4 is Super Convenient for Developing NLP Tools

    Author: Koji Sekiguchi

    The other day, I wrote a system that automatically obtains synonym knowledge from a dictionary corpus. A dictionary corpus is a collection of entries consisting of “keywords” and their “descriptions”. Put simply, it is a dictionary. Familiar examples are electronic dictionaries and Wikipedia data. You could also say that the combination of “item name” and “item description” on EC sites forms a dictionary corpus.

    See SlideShare for the details of this system.

    Automatically Obtaining Synonym Knowledge from Dictionary Corpus

    I originally wrote this system because I wanted to use Wikipedia to automatically create the synonyms.txt used by the SynonymFilter of Lucene/Solr. The output CSV file can be fed to the SynonymFilter, but the system itself also uses Lucene 4.0 internally.

    I had always thought that Lucene 4.0 would be convenient for developing NLP tools, and building this system confirmed that impression.

    The Lucene 4.0 classes I used to develop this system are as follows:

    • IndexSearcher, TermQuery, TopDocs
      This system calculates similarities for synonym candidates, which are nouns extracted from keywords and their descriptions. The system decides that a candidate is a synonym of the keyword if the similarity is above a threshold value and outputs it to a CSV file.
      But how do I calculate the similarity between a keyword and its synonym candidate? This system determines it by computing the similarity between the keyword description Aa and the set {Ab} of dictionary entry descriptions that contain the synonym candidate.
      Thus, I have to find {Ab}, and for that I used classes such as IndexSearcher, TermQuery, and TopDocs to search the description field for the synonym candidate.
    • PriorityQueue
      Next, I have to pick out “feature words” from Aa and {Ab} to calculate the similarity of the two. To do so, I select the N most important words to build a feature vector, using the TF*IDF of each word as its degree of importance (see the SlideShare above for details). Here, I use PriorityQueue to select the “N most important words”.
    • DocsEnum, TotalHitCountCollector
      I used TF*IDF as the weight when extracting the feature words above, and used DocsEnum.freq() to obtain the TF. The docFreq (the number of articles containing the synonym candidate), which is needed to compute the IDF, was obtained by passing a TotalHitCountCollector to the search() method of IndexSearcher (see the small sketch after this list).
    • Terms, TermsEnum
      I use these classes to search the “description” field for synonym candidates.
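
    As a small illustration of the last two points (a sketch only, using the same API calls as the synonym system described earlier on this page), TF and docFreq can be obtained like this with the Lucene 4.x API:

    // term frequency of term in document docId, via DocsEnum.freq()
    static int tf(IndexReader reader, int docId, String field, BytesRef term) throws IOException {
      Bits liveDocs = MultiFields.getLiveDocs(reader);
      DocsEnum de = MultiFields.getTermDocsEnum(reader, liveDocs, field, term);
      return (de != null && de.advance(docId) == docId) ? de.freq() : 0;
    }

    // document frequency of term, via TotalHitCountCollector
    static int docFreq(IndexSearcher searcher, String field, BytesRef term) throws IOException {
      TotalHitCountCollector collector = new TotalHitCountCollector();
      searcher.search(new TermQuery(new Term(field, term)), collector);
      return collector.getTotalHits();
    }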

    These are the ways this system uses Lucene 4.0. I believe Lucene can be a great help to NLP tool developers in general. In a lexical knowledge acquisition task using bootstrapping, for example, you can run a cycle of (1: pattern extraction, 2: pattern selection, 3: instance extraction, 4: instance selection) to acquire knowledge from a small number of seed instances. I believe pattern extraction and instance extraction can be reduced to simple search tasks if you use Lucene for them.

    29/06/2012
    Starting Lab Work

    Author: Koji Sekiguchi

    Several months into my school life at the Japan Advanced Institute of Science and Technology (JAIST), I was assigned to a lab that studies natural language processing (NLP). Meanwhile, a lecture on NLP theory has just begun, and we have been studying how to learn from a corpus with various machine learning algorithms, including Naive Bayes, decision trees, support vector machines, and hidden Markov models, to resolve ambiguity in classification problems.

    Curious about the career paths of students who studied NLP, I asked my lab professor about it. Most of them “start their career as an SE at a major electronics company”, he said, adding that they do not use NLP in their work. As I asked further, he said that some students join a corporate laboratory specializing in NLP after the doctoral course, but those with a master’s degree only become “ordinary SEs”. Evidently, even the professor had no idea how sought-after the techniques we use in his lab are in parts of the IT industry.

    Because the professor’s research theme is very difficult and the time when we can process natural language with high accuracy is still a long way off, it may be that he simply does not picture NLP being used in the real world in the first place. If the professor feels that way, graduating students will look for jobs without realizing that (the fundamental knowledge behind) their research theme is useful at work, and they will end up “getting a job as an SE at a major electronics company”. That phrase is the professor’s standard answer to this question from students, and it means you can still land a decent job after doing research in our specialized field.

    I am not saying it is unfortunate to join a major electronics company and become an ordinary SE. You might end up working on a back-end system, and I believe there are areas there where you can apply your techniques. I just think it is a shame for both academia and industry if, having grown used to such an environment, you lose the mindset that the fundamental knowledge gained from your research can make the system you work on more useful.

    13/04/2012
    Two Starts – JAIST and Twitter

    Author: Koji Sekiguchi

    As of this April, I am a graduate student at the School of Information Science of the Japan Advanced Institute of Science and Technology (JAIST) and a new user of Twitter. Please follow me on Twitter, @kojisays, as I tweet about my school and research life.

    Classes are held on Friday nights and on weekends. We had the first morning class last Sunday, right after the orientation session on Saturday. Our professor even gave us an essay assignment in the first class!

    Our Saturday subject is Statistical Signal Processing, in which we study the basic mathematical handling of stochastic processes as well as signal-processing algorithms and statistical model estimation. We have not gotten into stochastic processes yet and have only covered basic probability theory. The timing was great for me, as I have been looking into score calculation with probabilistic models, including BM25 and the Language Model, that are introduced in Lucene 4.0.

    One of my dreams is to apply the knowledge from this class to soleami’s query log analysis. If we treat the number of searches for a keyword over a period of time as a stochastic process, we might be able to mathematically predict the number of future searches for that keyword. What do you say?

    We will apply a variety of ideas, including those I pick up in JAIST lectures, to soleami to make the service more exciting and valuable for you. Enjoy soleami!

    18/02/2012
    Notes on Uploading Log File

    Three days after the launch of soleami, an increasing number of users are registering and uploading files. It would be our greatest pleasure to see more and more users come to feel that “watching query logs is interesting.”

    That said, there are strict and unfortunate limitations on the query log format when you use soleami to visualize your Solr query logs.

    Please see the list of limitations here. Even if you receive an e-mail informing you of the “completion of analysis”, the visualization screen displays “no log data…” if your log file does not comply with the specified format. For example, a user uploaded the following log file but was unable to visualize it because:

    • The month name is not in a supported format.
      14 févr. 2012 06:39:23
      

      We are afraid soleami recognizes only the following date formats.

      Jan 14, 2012 9:21:27 AM
      2012/1/14 09:21:27
      

      Any format other than the above is currently not recognized.

    • Each query is formatted on a single line. Currently, soleami can only analyze the 2-line-per-query JUL format that Solr outputs by default.

    Our soleami development team will continue to improve the service to ease these limitations.

    Happy visualization!

    14/02/2012
    [ news release ] Apache Solr Query Log Visualization Service

    Tokyo, Japan – RONDHUIT Co.,Ltd. announced the launch of “soleami” today. Soleami – pronounced “so ray me” – is a service that visualizes query logs collected by Apache Solr, an open source software search engine, and is provided for free.

    soleami
    http://soleami.com/

    Solr is an open source software (OSS) search engine used worldwide for search within private intranet sites as well as public company sites on the internet.

    Solr keeps track of search requests in a log file called the query log. The query log is like “a list of visitors’ needs”, but it has rarely been exploited because summarizing it by search keyword requires a huge amount of work.

    To solve this problem, soleami was designed and developed so that users only need to upload their query logs; soleami then summarizes them and displays charts in a web browser. Any Solr administrator can now upload Solr query logs to soleami and visualize the needs of visitors for free.

    Soleami supports the following charts (See sample screenshots here).

    • Top 10
      Top 10 summarizes and displays 10 most popular search keywords by month. You can see the seasonal variability of search keywords because it can display keywords as old as 12 months.
    • Trend 1000
      Trend 1000 displays a line chart of 1000 most popular search keywords for last 12 months. It also displays a bar chart of 20 most popular sub keywords that are specified along with search keywords.
    • Zero Hits
      “0 hits” displays a bar chart that indicates the number of queries in last 12 months that ended up with “0 hit” – queries that failed to find any search result. “0 hits” indicates that your site was unable to fill the needs of the visitors. Cutting the number of 0 hits would increase visitor satisfaction for general sites and conversion rate for EC sites.

    The soleami development team is making constant efforts to improve quality and function of the service so that you can visualize your query logs in various angles.


    Origin of soleami

    Soleami is derived from the French phrase “ami du soleil”, which means “friend of the sun”. We named the service soleami hoping you will always keep it at hand and use it as a friend of Solr, whose name in turn comes from the word “solar”, meaning sun.


    About Apache Solr

    Apache Solr is a search engine developed by the Apache Software Foundation, an organization that develops and maintains open source software. Solr is based on Apache Lucene, originally created by Doug Cutting, the founder of Apache Hadoop.

    http://lucene.apache.org/solr/


    About RONDHUIT Co.,Ltd.

    RONDHUIT provides services that help enterprises and educational institutions implement Apache Lucene/Solr. In addition to consulting on Solr implementations, RONDHUIT provides training and support services around Solr. Koji Sekiguchi, who leads RONDHUIT, is also an active Apache Lucene/Solr committer.

    http://www.rondhuit.com/en/

    14/02/2012
    soleami – Apache Solr Query Log Visualization Service Launched

    Soleami, an Apache Solr query log analysis service, is generally available today. Soleami summarizes query logs from various angles and analyzes and charts them for you to view in your browser.

    soleami – Visualize the needs of your visitors.
    http://soleami.com/

    Solr keeps track of search requests in a log file called the query log. The query log is like “a list of visitors’ needs”, but it has rarely been exploited because summarizing it by search keyword requires a huge amount of work.

    To solve this problem, soleami was designed and developed so that users only need to upload their query logs; soleami then summarizes them and displays charts in a web browser. Any Solr administrator can now upload Solr query logs to soleami and visualize the needs of visitors for free.

    Visualizing your query logs takes just three steps.

    1. Save query log file output from Solr.
    2. Zip the query log file and upload it to soleami.
    3. Watch the displayed charts.

    Soleami supports the following charts (see sample screenshots here).

    • Top 10
      Top 10 summarizes and displays 10 most popular search keywords by month. You can see the seasonal variability of search keywords because it can display keywords as old as 12 months.
    • Trend 1000
      Trend 1000 displays a line chart of 1000 most popular search keywords for last 12 months. It also displays a bar chart of 20 most popular sub keywords that are specified along with search keywords.
    • Zero Hits
      “0 hits” displays a bar chart that indicates the number of queries in last 12 months that ended up with “0 hits” – queries that failed to find any search result. “0 hits” indicates that your site was unable to fill the needs of the visitors. Cutting the number of 0 hits would increase visitor satisfaction for general sites and conversion rate for EC sites.

    These are the charts soleami currently provides, but our development team is constantly working to improve the quality and functionality of the service so that you can visualize your query logs from various angles. Please e-mail us if you have a request such as “I want to analyze my data from this particular angle. Can you do it?”

    Take advantage of soleami along with Solr to help enhance the convenience of your site.