Comparing Document Classification Functions of Lucene and Mahout

07/03/2014

Author: Koji Sekiguchi

Starting with version 4.2, Lucene provides a document classification function. In this article, we run the document classification functions of both Lucene and Mahout on the same corpus and compare the results.

Lucene implements Naive Bayes and k-NN classifiers. The trunk, which corresponds to Lucene 5 (the next major release), additionally implements a perceptron for boolean (2-class) classification. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and k-NN.

For Mahout, we perform document classification with Naive Bayes and Random Forest.

Overview of Lucene Document Classification

Lucene’s classifier for document classification is defined as the Classifier interface.

public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type <code>T</code> and score
   * @throws IOException If there is a low-level I/O error.
   */
  public ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Train the classifier using the underlying Lucene index
   * @param atomicReader the reader to use to access the Lucene index
   * @param textFieldName the name of the field used to compare documents
   * @param classFieldName the name of the field containing the class assigned to documents
   * @param analyzer the analyzer used to tokenize / filter the unseen text
   * @param query the query to filter which documents use for training
   * @throws IOException If there is a low-level I/O error.
   */
  public void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer, Query query)
      throws IOException;
}

Classifier uses an index as its training data, so you need to open an IndexReader (more precisely, an AtomicReader) on a prepared index and pass it as the first argument of the train() method. The second argument is the name of the Lucene field that holds the tokenized and indexed text, and the third argument is the name of the Lucene field that holds the document category. The fourth argument is a Lucene Analyzer and the fifth is a Query. The Analyzer is the one used when classifying an unknown document (in my personal opinion, this is a bit confusing, and it should be an argument of the assignClass() method described next instead). The Query narrows down the documents used for training; pass null if there is no need to do so. The train() method has two more variants with different arguments, but I will skip them here.

After calling train() on the Classifier, call the assignClass() method with the unknown document as a String argument to obtain the classification result. Classifier is an interface that uses Java generics, and assignClass() returns a ClassificationResult parameterized with the type variable T.

public class ClassificationResult<T> {

  private final T assignedClass;
  private final double score;

  /**
   * Constructor
   * @param assignedClass the class <code>T</code> assigned by a {@link Classifier}
   * @param score the score for the assignedClass as a <code>double</code>
   */
  public ClassificationResult(T assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  /**
   * retrieve the result class
   * @return a <code>T</code> representing an assigned class
   */
  public T getAssignedClass() {
    return assignedClass;
  }

  /**
   * retrieve the result score
   * @return a <code>double</code> representing a result score
   */
  public double getScore() {
    return score;
  }
}

Calling the getAssignedClass() method of ClassificationResult gives you a classification result of the type T.

Note that Lucene’s classifier is unusual in that the train() method does little work while assignClass() does most of it. This is very different from other commonly used machine learning software. In the learning phase of such software, a model file is created by learning from the corpus with the selected machine learning algorithm. This is where most of the time and effort goes; because Mahout is based on Hadoop, it uses MapReduce to try to reduce the time required here. In the classification phase, an unknown document is then classified by referring to the previously created model file. This phase usually requires few resources.

Because Lucene uses the index as its model file, the train() method, which is the learning phase, does almost nothing (learning is complete as soon as the index is created). Lucene’s index, however, is optimized for high-speed keyword search and is not in a format suited to a document classification model. Classification is therefore done by searching the index in the assignClass() method, which is the classification phase. Contrary to commonly used machine learning software, Lucene’s classifier requires a lot of computing power in the classification phase. For sites that are mainly focused on search, this function should still be appealing, because they already maintain an index and can add document classification at no additional cost.

Now, let’s quickly go through how the two implementation classes of the Classifier interface perform document classification, and then call them from a program.

Using Lucene SimpleNaiveBayesClassifier

SimpleNaiveBayesClassifier is the first implementation class of the Classifier interface. As the name suggests, it is a Naive Bayes classifier. Naive Bayes classification finds the class c that maximizes the conditional probability P(c|d), the probability that document d belongs to class c. Applying Bayes’ theorem to P(c|d) shows that it is enough to compute P(c)P(d|c) and pick the class c with the highest value, because P(d) is the same for every class. Logarithms are usually used in this calculation to avoid underflow. The assignClass() method of SimpleNaiveBayesClassifier repeats the calculation once per class to perform MLE (maximum likelihood estimation).
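To make the log-space calculation concrete, here is a minimal sketch of Naive Bayes scoring in plain Java. It is not Lucene’s implementation: the class name NaiveBayesSketch, the toy training data, and the add-one smoothing are all assumptions made for illustration. It computes log P(c) + log P(d|c) once per class and returns the class with the highest score.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative multinomial Naive Bayes in log space; NOT Lucene's implementation.
final class NaiveBayesSketch {

  // Hypothetical toy training data: category -> tokenized training documents.
  static final Map<String, List<List<String>>> TRAINING = new HashMap<>();
  static {
    TRAINING.put("sports", Arrays.asList(
        Arrays.asList("goal", "match", "team"),
        Arrays.asList("team", "win", "match")));
    TRAINING.put("it", Arrays.asList(
        Arrays.asList("phone", "app", "update"),
        Arrays.asList("app", "release", "phone")));
  }

  // log P(c): share of training documents that belong to category c.
  static double logPrior(String c) {
    int total = 0;
    for (List<List<String>> docs : TRAINING.values()) {
      total += docs.size();
    }
    return Math.log((double) TRAINING.get(c).size() / total);
  }

  // log P(d|c): sum of log word probabilities, with add-one smoothing
  // (the smoothing is an assumption of this sketch).
  static double logLikelihood(List<String> doc, String c) {
    Map<String, Integer> counts = new HashMap<>();
    Set<String> vocabulary = new HashSet<>();
    int tokensInClass = 0;
    for (Map.Entry<String, List<List<String>>> e : TRAINING.entrySet()) {
      for (List<String> d : e.getValue()) {
        for (String w : d) {
          vocabulary.add(w);
          if (e.getKey().equals(c)) {
            counts.merge(w, 1, Integer::sum);
            tokensInClass++;
          }
        }
      }
    }
    double sum = 0.0;
    for (String w : doc) {
      int n = counts.getOrDefault(w, 0);
      sum += Math.log((n + 1.0) / (tokensInClass + vocabulary.size()));
    }
    return sum;
  }

  // Repeats the calculation once per category and keeps the best score.
  static String classify(List<String> doc) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String c : TRAINING.keySet()) {
      double score = logPrior(c) + logLikelihood(doc, c);
      if (score > bestScore) {
        bestScore = score;
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(classify(Arrays.asList("team", "match")));   // prints "sports"
    System.out.println(classify(Arrays.asList("phone", "update"))); // prints "it"
  }
}
```

Adding logarithms instead of multiplying probabilities keeps the score in a safe numeric range even for long documents, where the product of many small probabilities would underflow to zero.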

Before we can use SimpleNaiveBayesClassifier, we need to prepare the training data in an index. Here we use the livedoor news corpus. Let’s add the livedoor news corpus to an index using the following Solr schema definition.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <fields>
    <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_ja" indexed="true" stored="true" multiValued="false"/>
    <field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>

Note that the cat field holds the classification class and the body field is the field used for learning. First, start Solr with the above schema.xml and add the livedoor news corpus. You can stop Solr as soon as you finish adding the corpus.

Next, we need a Java program that uses SimpleNaiveBayesClassifier. To keep things simple, we classify the same documents that were used for learning. The program looks as follows.

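import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public final class TestLuceneIndexClassifier {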

  public static final String INDEX = "solr2/collection1/data/index";
  public static final String[] CATEGORIES = {
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news"
  };
  private static int[][] counts;
  private static Map<String, Integer> catindex;

  public static void main(String[] args) throws Exception {
    init();

    final long startTime = System.currentTimeMillis();
    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
    IndexReader reader = DirectoryReader.open(dir());
    AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);

    classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46));
    final int maxdoc = reader.maxDoc();
    for(int i = 0; i < maxdoc; i++){
      Document doc = ar.document(i);
      String correctAnswer = doc.get("cat");
      final int cai = idx(correctAnswer);
      ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("body"));
      String classified = result.getAssignedClass().utf8ToString();
      final int cli = idx(classified);
      counts[cai][cli]++;
    }
    final long endTime = System.currentTimeMillis();
    final int elapse = (int)((endTime - startTime) / 1000);

    // print results
    int fc = 0, tc = 0;
    for(int i = 0; i < CATEGORIES.length; i++){
      for(int j = 0; j < CATEGORIES.length; j++){
        System.out.printf(" %3d ", counts[i][j]);
        if(i == j){
          tc += counts[i][j];
        }
        else{
          fc += counts[i][j];
        }
      }
      System.out.println();
    }
    float accrate = (float)tc / (float)(tc + fc);
    float errrate = (float)fc / (float)(tc + fc);
    System.out.printf("\n\n*** accuracy rate = %f, error rate = %f; time = %d (sec); %d docs\n", accrate, errrate, elapse, maxdoc);

    reader.close();
  }

  static Directory dir() throws IOException {
    return FSDirectory.open(new File(INDEX));
  }

  static void init(){
    counts = new int[CATEGORIES.length][CATEGORIES.length];
    catindex = new HashMap<String, Integer>();
    for(int i = 0; i < CATEGORIES.length; i++){
      catindex.put(CATEGORIES[i], i);
    }
  }

  static int idx(String cat){
    return catindex.get(cat);
  }
}

Here we specified JapaneseAnalyzer as the Analyzer. (This differs slightly from index creation, where the Solr schema configures JapaneseTokenizer and the related TokenFilters.) The string array CATEGORIES holds the hard-coded document categories. Like Mahout, this program displays a confusion matrix; its rows and columns follow the order of the hard-coded CATEGORIES array.

Executing this program produces the following output.

 760    0    4   23   37   37    2    2    5
  40  656    7   44   25    4   90    1    3
  87   57  392  102   68   24  113    5   16
  40   15    6  391   33    8   16    2    0
  14    2    0    5  845    2    0    1    1
 134    2    2   26  107  549   19    3    0
  43   36   13   17   26   36  693    5    1
   6    0    0   23   35    0    1  829    6
  10    9    9   25   66    6    5   45  595 

*** accuracy rate = 0.775078, error rate = 0.224922; time = 67 (sec); 7367 docs

The classification accuracy came to about 77%.
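The overall accuracy hides large differences between categories. As a rough sketch, the following plain Java program recomputes the accuracy from the confusion matrix printed above and adds per-category recall and precision. The class name ConfusionMatrixStats and its helper methods are made up for this illustration; the numbers are copied from the output above, with rows as the correct category and columns as the assigned category, as in the test program.

```java
// Recomputes accuracy, recall and precision from the confusion matrix above.
// Rows are the correct category, columns are the category assigned by the classifier.
final class ConfusionMatrixStats {

  static final String[] CATEGORIES = {
    "dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
    "movie-enter", "peachy", "smax", "sports-watch", "topic-news"
  };

  static final int[][] COUNTS = {
    {760,   0,   4,  23,  37,  37,   2,   2,   5},
    { 40, 656,   7,  44,  25,   4,  90,   1,   3},
    { 87,  57, 392, 102,  68,  24, 113,   5,  16},
    { 40,  15,   6, 391,  33,   8,  16,   2,   0},
    { 14,   2,   0,   5, 845,   2,   0,   1,   1},
    {134,   2,   2,  26, 107, 549,  19,   3,   0},
    { 43,  36,  13,  17,  26,  36, 693,   5,   1},
    {  6,   0,   0,  23,  35,   0,   1, 829,   6},
    { 10,   9,   9,  25,  66,   6,   5,  45, 595}
  };

  // Total number of classified documents.
  static int total(int[][] m) {
    int sum = 0;
    for (int[] row : m) {
      for (int v : row) {
        sum += v;
      }
    }
    return sum;
  }

  // Number of correctly classified documents (the diagonal).
  static int correct(int[][] m) {
    int sum = 0;
    for (int i = 0; i < m.length; i++) {
      sum += m[i][i];
    }
    return sum;
  }

  // Recall of category c: correct documents divided by the row total.
  static double recall(int[][] m, int c) {
    int rowSum = 0;
    for (int v : m[c]) {
      rowSum += v;
    }
    return (double) m[c][c] / rowSum;
  }

  // Precision of category c: correct documents divided by the column total.
  static double precision(int[][] m, int c) {
    int colSum = 0;
    for (int[] row : m) {
      colSum += row[c];
    }
    return (double) m[c][c] / colSum;
  }

  public static void main(String[] args) {
    System.out.printf("accuracy = %f%n", (double) correct(COUNTS) / total(COUNTS));
    for (int c = 0; c < CATEGORIES.length; c++) {
      System.out.printf("%-15s recall = %.3f, precision = %.3f%n",
          CATEGORIES[c], recall(COUNTS, c), precision(COUNTS, c));
    }
  }
}
```

For example, only 392 of the 864 kaden-channel documents land on the diagonal, which is the lowest recall of the nine categories.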

Using Lucene KNearestNeighborClassifier

The other implementation class of Classifier is KNearestNeighborClassifier. You create a KNearestNeighborClassifier instance by passing k, which must be at least 1, as a constructor argument. You can use exactly the same program as for SimpleNaiveBayesClassifier; all you need to do is replace the part that creates the SimpleNaiveBayesClassifier instance with KNearestNeighborClassifier.

As described before, the assignClass() method does all the work for KNearestNeighborClassifier as well, but one interesting point is that it uses Lucene’s MoreLikeThis. MoreLikeThis is a tool that treats a reference document as a query and runs a search with it, which lets you find documents similar to the reference document. KNearestNeighborClassifier uses MoreLikeThis to retrieve the k documents most similar to the unknown document passed to the assignClass() method. A majority vote over those k documents then determines the category of the unknown document.
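The retrieve-then-vote idea can be sketched in a few lines of plain Java. This is only an illustration, not how KNearestNeighborClassifier is written: the class KnnSketch and its toy documents are hypothetical, and a simple Jaccard overlap of term sets stands in for the MoreLikeThis search, whose scoring is different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative k-NN with a majority vote; NOT Lucene's KNearestNeighborClassifier.
final class KnnSketch {

  // A training document: its category and its set of terms.
  static final class Labeled {
    final String category;
    final Set<String> terms;

    Labeled(String category, String... terms) {
      this.category = category;
      this.terms = new HashSet<String>(Arrays.asList(terms));
    }
  }

  // Hypothetical toy training data.
  static final List<Labeled> TRAINING = Arrays.asList(
      new Labeled("sports", "team", "match", "goal"),
      new Labeled("sports", "team", "win", "league"),
      new Labeled("it", "phone", "app", "update"),
      new Labeled("it", "app", "release", "phone"),
      new Labeled("it", "team", "app", "chat"));

  // Stand-in for the MoreLikeThis search: Jaccard overlap of the two term sets.
  static double similarity(Set<String> a, Set<String> b) {
    Set<String> intersection = new HashSet<String>(a);
    intersection.retainAll(b);
    Set<String> union = new HashSet<String>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
  }

  // Takes the k most similar training documents and lets them vote.
  // On a tie, the category that reached the winning count first (nearest first) wins.
  static String classify(Set<String> unknown, int k) {
    List<Labeled> sorted = new ArrayList<Labeled>(TRAINING);
    sorted.sort((x, y) -> Double.compare(
        similarity(y.terms, unknown), similarity(x.terms, unknown)));
    Map<String, Integer> votes = new HashMap<String, Integer>();
    String best = null;
    for (int i = 0; i < Math.min(k, sorted.size()); i++) {
      String c = sorted.get(i).category;
      int v = votes.merge(c, 1, Integer::sum);
      if (best == null || v > votes.get(best)) {
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    Set<String> unknown = new HashSet<String>(Arrays.asList("team", "match", "win"));
    System.out.println(classify(unknown, 3)); // prints "sports"
  }
}
```

In this sketch, as in the description above, all of the work happens at classification time: nothing is precomputed from the training documents.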

Executing the same program with KNearestNeighborClassifier displays the following when k=1.

 724   14   28   22    6   30    8   18   20
 121  630   41   13    2    9   35    6   13
 165   28  582   10    5   16   26    7   25
 229   15   15  213    6   14    6    2   11
 134   37   15    8  603   12   19    7   35
 266   38   39   24   14  412   22    9   18
 810   16    1    3    2    3   32    1    2
 316   18   14   12    5    7    8  439   81
 362   17   29   10    1    7    7   16  321 

*** accuracy rate = 0.536989, error rate = 0.463011; time = 13 (sec); 7367 docs

Now the accuracy is about 54%. With k=3, the accuracy goes down further, to about 48%.

 652    5   78    3    7   40   13   38   34
 127  540   82   15    1   10   58   23   14
 169   34  553    3    7   16   38   15   29
 242   10   32  156   12   13   15   10   21
 136   30   21    9  592   11   19   15   37
 309   34   58    5   23  318   40   28   27
 810    8    3    1    0   10   37    1    0
 312    8   44    7    5    2   13  442   67
 362   11   45    5    6   10   16   34  281 

*** accuracy rate = 0.484729, error rate = 0.515271; time = 9 (sec); 7367 docs

Document Classification by NLP4L and Mahout

If you want to use Lucene’s index as input data for Mahout, there is a handy command available. However, because our purpose here is supervised document classification, you need to output the field that specifies the class in addition to the document vectors.

The tools that make this easy are MSDDumper and TermsDumper from NLP4L, which we developed. NLP4L stands for Natural Language Processing for Lucene; it is a natural language processing tool set that treats Lucene’s index as a corpus.

Depending on the settings, MSDDumper and TermsDumper select and extract important words from a Lucene field according to measures such as tf*idf and output them in a format that Mahout commands can read easily. Let’s use this function to select 2,000 important words from the body field of the index and run the Mahout classification.
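To give a feel for what selecting important words by tf*idf means, here is a small plain Java sketch. It is not the NLP4L implementation: the class TfIdfSketch is hypothetical, and the score used here (total term frequency times log(N/df)) is just one common tf*idf variant chosen as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative tf*idf word selection; NOT the NLP4L implementation.
final class TfIdfSketch {

  // Returns the n terms with the highest tf*idf score over the whole corpus.
  // score(w) = (total occurrences of w) * log(numDocs / docsContaining(w))
  static List<String> topTerms(List<List<String>> docs, int n) {
    Map<String, Integer> tf = new HashMap<String, Integer>();
    Map<String, Integer> df = new HashMap<String, Integer>();
    for (List<String> doc : docs) {
      for (String w : doc) {
        tf.merge(w, 1, Integer::sum);
      }
      for (String w : new HashSet<String>(doc)) {
        df.merge(w, 1, Integer::sum);
      }
    }
    final Map<String, Double> score = new HashMap<String, Double>();
    for (Map.Entry<String, Integer> e : tf.entrySet()) {
      double idf = Math.log((double) docs.size() / df.get(e.getKey()));
      score.put(e.getKey(), e.getValue() * idf);
    }
    List<String> terms = new ArrayList<String>(score.keySet());
    // Highest score first; ties are broken alphabetically to keep the result stable.
    terms.sort((a, b) -> {
      int c = Double.compare(score.get(b), score.get(a));
      return c != 0 ? c : a.compareTo(b);
    });
    return new ArrayList<String>(terms.subList(0, Math.min(n, terms.size())));
  }

  public static void main(String[] args) {
    // Hypothetical toy corpus of three tokenized documents.
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("the", "team", "team", "match"),
        Arrays.asList("the", "phone", "app"),
        Arrays.asList("the", "app", "update"));
    System.out.println(topTerms(docs, 3)); // prints "[team, match, phone]"
  }
}
```

A word such as "the" that appears in every document gets a score of zero however often it occurs, so it ranks last and carries no weight in the selection.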

Looking only at the result, Mahout Naive Bayes shows an accuracy of about 96.8%.

=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7128	   96.7689%
Incorrectly Classified Instances        :        238	    3.2311%
Total Classified Instances              :       7366

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	h    	i    	<--Classified as
823  	1    	1    	6    	12   	19   	2    	4    	2    	 |  870   	a     = dokujo-tsushin
1    	848  	2    	1    	0    	1    	11   	4    	2    	 |  870   	b     = it-life-hack
5    	6    	830  	1    	1    	0    	3    	1    	17   	 |  864   	c     = kaden-channel
2    	6    	6    	486  	3    	1    	6    	0    	0    	 |  510   	d     = livedoor-homme
0    	0    	1    	1    	865  	1    	0    	1    	1    	 |  870   	e     = movie-enter
31   	3    	6    	12   	14   	762  	6    	4    	4    	 |  842   	f     = peachy
0    	0    	2    	0    	0    	1    	867  	0    	0    	 |  870   	g     = smax
0    	0    	0    	1    	0    	0    	0    	897  	2    	 |  900   	h     = sports-watch
2    	4    	1    	1    	0    	0    	0    	12   	750  	 |  770   	i     = topic-news

=======================================================
Statistics
-------------------------------------------------------
Kappa                                        0.955
Accuracy                                   96.7689%
Reliability                                87.0076%
Reliability (standard deviation)             0.307

Mahout Random Forest shows an accuracy of about 97.1%.

=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7156	   97.1359%
Incorrectly Classified Instances        :        211	    2.8641%
Total Classified Instances              :       7367

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	h    	i    	<--Classified as
838  	5    	2    	6    	3    	7    	2    	0    	1    	 |  864   	a     = kaden-channel
0    	895  	0    	1    	4    	0    	0    	0    	0    	 |  900   	b     = sports-watch
0    	0    	869  	0    	0    	1    	0    	0    	0    	 |  870   	c     = smax
0    	2    	0    	839  	1    	0    	14   	2    	12   	 |  870   	d     = dokujo-tsushin
1    	17   	0    	0    	748  	0    	2    	0    	2    	 |  770   	e     = topic-news
1    	5    	0    	1    	5    	855  	2    	0    	1    	 |  870   	f     = it-life-hack
0    	1    	0    	23   	0    	0    	793  	1    	24   	 |  842   	g     = peachy
0    	11   	0    	14   	1    	2    	18   	454  	11   	 |  511   	h     = livedoor-homme
0    	1    	0    	2    	0    	0    	2    	0    	865  	 |  870   	i     = movie-enter

=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.9608
Accuracy                                   97.1359%
Reliability                                87.0627%
Reliability (standard deviation)            0.3076

Summary

In this article, we ran document classification with both Lucene and Mahout on the same corpus and compared the results. The accuracy looks higher for Mahout but, as already stated, its training data does not use all words, only the top 2,000 important words in the body field. Lucene’s classifier, whose accuracy was at best about 77%, uses all the words in the body field. Lucene should be able to pass 90% accuracy if you prepare a field that holds only words selected specifically for document classification. It may also be a good idea to create another Classifier implementation class whose train() method provides such a function.

I should add that the accuracy goes down to around 80% when the test data is not used for learning and is instead tested as truly unknown data.
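One simple way to test on truly unknown data is a hold-out split, where some documents are set aside and never used for learning. The following plain Java sketch only shows the splitting step. The class HoldOutSplit, the 20% test ratio, and the fixed seed are assumptions for illustration, not the procedure used for the figure quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative hold-out split of document ids into training and test sets.
final class HoldOutSplit {

  // Shuffles the ids 0..numDocs-1 with a fixed seed and returns [trainingIds, testIds].
  static List<List<Integer>> split(int numDocs, double testRatio, long seed) {
    List<Integer> ids = new ArrayList<Integer>();
    for (int i = 0; i < numDocs; i++) {
      ids.add(i);
    }
    Collections.shuffle(ids, new Random(seed));
    int testSize = (int) Math.round(numDocs * testRatio);
    List<Integer> test = new ArrayList<Integer>(ids.subList(0, testSize));
    List<Integer> train = new ArrayList<Integer>(ids.subList(testSize, numDocs));
    return Arrays.asList(train, test);
  }

  public static void main(String[] args) {
    // 7367 is the number of documents in the index used in this article.
    List<List<Integer>> parts = split(7367, 0.2, 42L);
    System.out.printf("training docs = %d, test docs = %d%n",
        parts.get(0).size(), parts.get(1).size());
  }
}
```

With Lucene, the Query argument of train() described earlier might be one way to restrict learning to the training documents, although that wiring is not shown here.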

I hope this article will help you all in some way.
