Automatically Acquiring Synonym Knowledge from Wikipedia

27/05/2013

Author: Koji Sekiguchi

Lucene/Solr can use SynonymFilter to perform synonym searches easily. For example, you can create a synonyms.txt file with the following content …

International Monetary Fund, IMF

…then define the “text” field type as follows in Solr's schema.xml file:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

In a field that uses this field type, you can find the relevant documents whether you search for “International Monetary Fund” or “IMF”; documents containing either form will match.
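
For reference, a field that uses this type might be declared as follows (the field name “body” here is only an example, not something from the original configuration):

<field name="body" type="text" indexed="true" stored="true"/>

With this in place, a search for either “IMF” or “International Monetary Fund” against the body field matches documents containing either form.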

Synonym search is certainly convenient. To offer it to users, however, an administrator has to provide a synonym dictionary (a CSV file) like the one described above. New words are coined every day, and so are new synonyms. A synonym dictionary may be prepared with great effort by the person in charge, yet it is often left unmaintained as time passes or when that person moves on.

This is why people come to wish for a way to create a synonym dictionary automatically, and that wish drove me to write the system described below. The system learns synonym knowledge from a “dictionary corpus” and writes “original word – synonym” pairs with high similarity to a CSV file, which can be applied to the SynonymFilter of Lucene/Solr as is.

A “dictionary corpus” is a corpus whose entries consist of a “keyword” and its “description”. An electronic dictionary is exactly that, and so is Wikipedia, which is familiar and easily accessible.

Let’s look at a method that uses the Japanese edition of Wikipedia to acquire synonym knowledge automatically.

System Architecture

Following is the architecture of this system.

The system first registers the contents of the dictionary corpus in a Lucene index and then analyzes that index. The index has a “keyword” field and a “description” field, and term vectors are enabled on the description field. Using Solr, you can prepare this part easily without writing any program. From here on, I will focus on how to analyze the index in which the dictionary corpus has been registered.
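
For reference, the two fields might be defined in schema.xml roughly as follows (the field and type names are only illustrative); the important point is that termVectors is enabled on the description field:

<field name="keyword" type="string" indexed="true" stored="true"/>
<field name="description" type="text_ja" indexed="true" stored="true" termVectors="true"/>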

To summarize, the analysis program repeats the following steps for every keyword (a rough outline of this loop is sketched after the list).

  1. Identify synonym candidate tB from the description field.
  2. Calculate semantic similarity S(tA, tB) between keyword tA and synonym candidate tB.
  3. If the similarity S is greater than an appropriate threshold value, write the pair (tA, tB) to a CSV file.

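The whole loop can be pictured as follows. Note that findCandidates, similarity, writeCsvLine and fieldNameKeyword are hypothetical names standing in for the three steps above and the keyword field; they are not the actual identifiers used in the program.

// hypothetical outline of the analysis loop (the helper names are illustrative only)
void analyze() throws IOException {
  for(int docId = 0; docId < reader.maxDoc(); docId++){
    String keyword = reader.document(docId).get(fieldNameKeyword);   // keyword tA of this entry
    List<String> acronyms = new ArrayList<String>();
    for(String candidate : findCandidates(keyword, docId)){          // step 1: candidates tB from the description
      double score = similarity(keyword, candidate, docId);          // step 2: semantic similarity S*
      if(score > minScore) acronyms.add(candidate);                  // step 3: keep only pairs above the threshold
    }
    if(!acronyms.isEmpty()) writeCsvLine(keyword, acronyms);         // one CSV line: "tA,tB1,tB2,..."
  }
}
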
Each step is described in detail below.

Identifying Synonym Candidate tB

The synonym entries that this system outputs have the format “original word, abbreviation 1, abbreviation 2, …”, where “original word” is the string in the “keyword” field and “abbreviation n” are abbreviations of the keyword found in the “description” field. Japanese abbreviations are similar to English acronyms; unlike English, however, Japanese has far more abbreviations because it has a much larger character set.

The system compares the keyword tA with each word tR in the description field and verifies whether the following acronym-like conditions are met:

  • The first letters of tA and tR match.
  • tA does not completely include tR.
  • All of the letters that make up tR appear in tA, in the same order.

If any of the above conditions is not satisfied, the system discards the word tR and moves on to the next word. Even when all the conditions are met, the letters may simply happen to match. The system therefore treats such a word tR as a synonym candidate tB and proceeds to the next step, where its semantic similarity is calculated.
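
As an illustration, the three checks above could be implemented along the following lines. This is only a minimal sketch, not the actual code of the system; details such as character normalization are omitted.

// returns true if word tR looks like an acronym-style abbreviation of keyword tA
boolean isAcronymCandidate(String keyword, String word){
  if(keyword.isEmpty() || word.isEmpty()) return false;
  // condition 1: the first letters of tA and tR must match
  if(keyword.charAt(0) != word.charAt(0)) return false;
  // condition 2: tA must not completely include tR as a substring
  if(keyword.contains(word)) return false;
  // condition 3: all letters of tR must appear in tA in the same order
  int pos = 0;
  for(int i = 0; i < word.length(); i++){
    pos = keyword.indexOf(word.charAt(i), pos);
    if(pos < 0) return false;
    pos++;
  }
  return true;
}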

Calculating Semantic Similarity S

Next, the system needs the semantic similarity S(tA, tB) between keyword tA and synonym candidate tB, but it cannot compute it directly. Instead of S, the system calculates the similarity S*(AA, {AB}) between the keyword’s article AA and the set of articles {AB} in which synonym candidate tB appears. The similarity S* between articles is obtained as the cosine similarity between the term vectors xA and xB.

Note that if {AB} = ∅, then S* = 1. The corresponding part of the program is as follows.

List<String> acronyms = new ArrayList<String>();
		:
Map<String,Float> originVector = getFeatureVector(null, liveDocs, docId, terms, wordVectorSize, null);
		:
Query query = new TermQuery(new Term(fieldNameDesc, candidate));
TopDocs topDocs = searcher.search(query, userDocsSize);
// no hits, or only the keyword's own article hits: treat the candidate as a rare word (score stays 1)
double score = 1;
if(topDocs.totalHits > 1){
  Map<String,Float> articleVector = null;
  for(int j = 0; j < userDocsSize && j < topDocs.totalHits; j++){
    if(topDocs.scoreDocs[j].doc != docId)
      articleVector = getFeatureVector(articleVector, liveDocs, topDocs.scoreDocs[j].doc, null, userDocVectorSize, candidate);
  }
  score = cosine(originVector, articleVector);
}
if(score > minScore){
  acronyms.add(candidate);
}

To build the set {AB} of articles in which synonym candidate tB appears, the system creates a TermQuery for the candidate against the description field fieldNameDesc and runs a search. From the returned topDocs, it collects the articles other than AA as the set {AB}. The system does not keep {AB} itself, however; it converts the articles into term vectors and accumulates them in articleVector. It then calculates the cosine between originVector, the term vector of the keyword article AA, and articleVector, the accumulated term vector of the article set {AB} containing synonym candidate tB. The cosine calculation is as follows.

public static double cosine(Map<String,Float> v1, Map<String,Float> v2){
  if(v1.size() == 0 || v2.size() == 0) return 0;
  double numerator = 0;   // weights are fractional, so accumulate in double (not long)
  for(Map.Entry<String,Float> e1 : v1.entrySet()){
    Float v2Value = v2.get(e1.getKey());
    numerator += e1.getValue() * (v2Value == null ? 0 : v2Value);
  }
  if(numerator == 0) return 0;
  double denominator = getSumSquareRoot(v1) * getSumSquareRoot(v2);
  // should not happen, but avoid division by zero
  if(denominator == 0.0) return 0;
  return numerator / denominator;
}

public static double getSumSquareRoot(Map<String,Float> v){
  double sum = 0;
  for(Map.Entry<String,Float> e : v.entrySet()){
    sum += e.getValue() * e.getValue();
  }
  return Math.sqrt(sum);
}

Term vectors xA and xB are vectors whose elements are the weights w(t) of the terms t that appear in the description (the synonym candidate tB itself is excluded):

  xX = ( wX(t1), wX(t2), … , wX(tn) )   (X = A, B)

The weight wX(t) of term t is calculated as follows, using the term frequency tf of t, its document frequency docFreq, and the total number of articles maxDoc:

  wX(t) = sqrt(tf) × ( 1 + log( maxDoc / (docFreq + 1) ) )

This is the same as the implementation of the TFIDFSimilarity class of Lucene.
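
For instance, with these definitions a term that appears tf = 4 times and has docFreq = 99 in a corpus of maxDoc = 850,000 articles gets the weight sqrt(4) × (1 + log(850000 / 100)) ≈ 2 × (1 + 9.05) ≈ 20.1 (log here is the natural logarithm, as in Math.log).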

The method getFeatureVector, which obtains the term vector (feature vector) x_docId of article docId, is as follows.

protected Map<String,Float> getFeatureVector(Map<String,Float> results, Bits liveDocs,
      int docId, Terms termsReuse, int size, String stopWord) throws IOException {
  TermsEnum termsEnum = null;
  if(termsReuse == null)
    termsReuse = reader.getTermVector(docId, fieldNameDesc);
  termsEnum = termsReuse.iterator(termsEnum);
  TermVectorEntityQueue queue = new TermVectorEntityQueue(size);
  BytesRef text;
  while((text = termsEnum.next()) != null){
    // candidate feature term
    final String term = text.utf8ToString();
    if(stopWord != null && term.equals(stopWord)) continue;
    DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, liveDocs, fieldNameDesc, text);
    int d = docsEnum.advance(docId);
    if(d != docId){
      throw new RuntimeException("wrong docId!");
    }
    final int tf = docsEnum.freq();
    if(tf < vectorMinTf) continue;
    final int docFreq = docFreq(fieldNameDesc, text);
    final TfDocFreq tfdfPair = new TfDocFreq(tf, docFreq, maxDoc);
    Map.Entry<String,TfDocFreq> entry = new Map.Entry<String,TfDocFreq>() {
      public String getKey() {
        return term;
      }
      public TfDocFreq getValue() {
        return tfdfPair;
      }
      public TfDocFreq setValue(TfDocFreq arg0) {
        return null;
      }
    };
    queue.insertWithOverflow(entry);
  }

  if(results == null)
    results = new HashMap<String, Float>(queue.size());
  Map.Entry<String,TfDocFreq> entry = null;
  while((entry = queue.pop()) != null){
    Float existsValue = results.get(entry.getKey());
    float exv = existsValue == null ? 0 : existsValue.floatValue();
    results.put(entry.getKey(), entry.getValue().tfIdf() + exv);
  }

  return results;
}

The return type of getFeatureVector is Map<String,Float>, so the weight of each String term is expressed as a Float. The argument stopWord receives synonym candidate tB, ensuring that the elements of the term vector x_docId do not include tB. The argument size specifies the number of elements in the term vector; the system keeps the size terms with the largest weights (tf*idf). For this it uses the class TermVectorEntityQueue, an extension of Lucene’s PriorityQueue.

protected static class TermVectorEntityQueue extends PriorityQueue<Map.Entry<String,TfDocFreq>> {

  public TermVectorEntityQueue(int maxSize) {
    super(maxSize);
  }

  // collect terms with larger weight
  protected boolean lessThan(Entry<String,TfDocFreq> a, Entry<String,TfDocFreq> b){
    return a.getValue().tfIdf() < b.getValue().tfIdf();
  }
}

The system collects Map.Entry objects in the TermVectorEntityQueue, while the following class TfDocFreq holds the tf and docFreq of term t and provides a method that calculates the tf*idf weight from them.

protected static class TfDocFreq {
  public final int tf, docFreq, maxDoc;
  public TfDocFreq(int tf, int docFreq, int maxDoc){
    this.tf = tf;
    this.docFreq = docFreq;
    this.maxDoc = maxDoc;
  }

  public float tfIdf(){
    return (float)(Math.sqrt(tf) * (1 + Math.log((double)maxDoc / (docFreq + 1))));   // floating-point division for the idf ratio
  }
}

docFreq is obtained by calling the method docFreq inside getFeatureVector, which simply counts the number of hits of the following search.

protected int docFreq(String field, BytesRef text) throws IOException {
  Query query = new TermQuery(new Term(field, text));
  TotalHitCountCollector c = new TotalHitCountCollector();
  searcher.search(query, c);
  return c.getTotalHits();
}

Based on this calculation, the system decides that keyword tA and synonym candidate tB are semantically similar if the similarity S* is larger than the threshold value (minScore), and it outputs the pair (tA, tB) to the CSV file.
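
The CSV line itself can be written out along the following lines. This is a minimal sketch: keyword and acronyms correspond to tA and its accepted candidates, and writer is assumed to be a PrintWriter opened on the output file.

// write one synonym entry: "original word,abbreviation 1,abbreviation 2,..."
if(!acronyms.isEmpty()){
  StringBuilder line = new StringBuilder(keyword);
  for(String acronym : acronyms){
    line.append(',').append(acronym);
  }
  writer.println(line.toString());   // e.g. "International Monetary Fund,IMF"
}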

Examples of Acquired Synonym Knowledge

When this program was run against the Japanese Wikipedia, it found about 11 thousand synonym entries among about 850 thousand articles. The following are some of the Japanese synonyms actually obtained.

入国管理局, 入管
文房具, 文具
社員食堂, 社食
国際連盟, 国連
リポビタンD, リポD
:

The Japanese Wikipedia also has entries written in the Latin alphabet, so the system obtained English synonym knowledge such as the following as well.

Universal Serial Bus, USB
File Transfer Protocol, FTP
World Wide Web, WWW
Document Object Model, DOM
Read Only Memory, ROM
Cascading Style Sheets, CSS
Domain Name System, DNS
Local Area Network, LAN
:

Applying to Other Languages and Corpora

This program is designed to be pluggable so that it can obtain synonym knowledge from any general “dictionary corpus”. This article described how to produce a synonym CSV file from the Japanese Wikipedia, but the method is applicable with little modification to Wikipedias in other languages, including English, which has the notion of acronyms. Of course, “dictionary corpora” other than Wikipedia, such as electronic dictionaries, encyclopedias, and manufacturing catalogs, can be used as well, which gives the program an even wider range of application.
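
As a purely hypothetical illustration of such a pluggable design (this interface is not part of the actual program), a corpus source could be abstracted roughly like this:

// hypothetical abstraction of a "dictionary corpus": anything that yields keyword/description entries
public interface DictionaryCorpus {
  Iterable<Entry> entries() throws IOException;

  public static class Entry {
    public final String keyword;
    public final String description;
    public Entry(String keyword, String description){
      this.keyword = keyword;
      this.description = description;
    }
  }
}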
