Thursday, 10 February 2011

Approximate string matching algorithm (fuzzy search)

This is a great algorithm to find similarity coefficient of the two words.
Author is Viterbi or Bellman.I didn't find anything about it in the Internet.

For example exists dictionary: mother, uncle, cat, dog. And you typed smth like
"maaather", as the result will be "maaather" -> mother.
This algorithm possible to use in spell checking,sound recognition etc.


Algorithm description will be later

Here is the source code:
/**
 *
 * This class is for finding coefficient of similarity of the two words
 * using Viterbi(Bellman) algorithm
 * @author Sergey
 */
public class Analyzer {
  
    private double init[][];
    private double result[][];
    private int n;
    private int m;
 
    private void calculateASCIIDifference(String str1, String str2) {
     try{
        n = str1.length();
        m = str2.length();
        init = new double[n][m];
        for (int i = n - 1,i2 = 0; i >= 0; i--, i2++) {
            for (int j = 0; j < m; j++) {
                 init[i2][j] = Math.abs((int) str1.charAt(i) - (int) str2.charAt(j));
            }
        }
     } catch(Exception e){
         System.out.println("calculateASCIIDifference exception: "+e.toString());
        }
    }
 
    private double[][] getASCIIDifference() {
        return init;
    }
 
    private void setASCIIDifference(double[][] a,int n, int m) {
        this.n = n;
        this.m = m;
        this.init = a;
    }
 
       /**
    * 
    * Calculates similarity index, minimal value is more similar
    * </br>
    * you should always compare source with template word like this:
    * </br>
    * calculateSimilarity(iphone 5, iphone) </br>
    * calculateSimilarity(nokia, iphone) </br>
    * calculateSimilarity(inone, iphone) </br>
    * 
    * @param str1 - source word
    * @param str2 - template word
    * @return
    */
    public double calculateSimilarity(String str1, String str2) {
     
     calculateASCIIDifference(str1,str2);    
     try{
         result = new double[n][m];
         double res, val1, val2, val3;
  
         //initialize first element
         result[n - 1][0] = init[n - 1][0];
         
         //initialize first column
         for (int i = n - 2; i >= 0; i--) {
             result[i][0] = result[i + 1][0] + init[i][0];
         }  
         //initialize first row(from down)
         for (int j = 1; j < m; j++) {
             result[n - 1][j] = result[n - 1][j - 1] + init[n - 1][j];
         }  
         //initialize others
         for (int i = n - 2; i >= 0; i--) {
             for (int j = 1; j < m; j++) {
                 val1 = result[i][j - 1] + init[i][j];
                 val2 = result[i + 1][j] + init[i][j];
                 val3 = result[i + 1][j - 1] + (init[i][j] * 2);
  
                 //minimum of the 3 val's
                 res = val1 < val2 ? val1 : val2;
                 res = res < val3 ? res : val3;
  
                 result[i][j] = res;
             }
         }
         return getSimilarityValue();
     }
     catch(Exception e){
      System.out.println("calculateSimilarity exception: "+e.toString());
     }
     return -1;
    }
    //getting result of Similarity
    private double getSimilarityValue() {
     //need to devide if m != n
      if(m != n)
         return result[0][m - 1] / (m + n);
      else
      return result[0][m - 1];
    }
 
    //getting result array
    private double[][] getSimilarityArray() {
        return result;
    }
    
    //to test   
    public static void main(String[] args) {  
     double result1,result2;    
     Analyzer an = new Analyzer();
     result1 = an.calculateSimilarity("mather","mother");   
     result2 = an.calculateSimilarity("father","mother"); 
     //result1 will be less than result2 => result1 is more similar
     System.out.println(result1 + " " + result2);
    }
}

Good links:
Related stackoverflow question
Levenshtein distancesoundex and metaphone

5 comments:

  1. This is not accurate enough
    for example:
    "iphone" and "iphone": Similarity is 0
    "iphone" and "iphone5": Similarity is 3.69
    "iphone" and "iphone 5": Similarity is 8.36
    and
    "iphone" and "casemate": Similarity is 3.57 ????

    Why?

    ReplyDelete
    Replies
    1. maybe the program does not compute similarity but their distance....so there is obvious to find that the first case have distance equal 0

      Delete
    2. you should always compare with same word like this:
      iphone -> iphone 5 (gives 8)
      casemate -> iphone5 (gives 160)

      8 < 160

      Delete