This is a great algorithm to find similarity coefficient of the two words.
Author is Viterbi or Bellman.I didn't find anything about it in the Internet.
For example exists dictionary: mother, uncle, cat, dog. And you typed smth like
"maaather", as the result will be "maaather" -> mother.
This algorithm possible to use in spell checking,sound recognition etc.
Algorithm description will be later
Here is the source code:
Author is Viterbi or Bellman.I didn't find anything about it in the Internet.
For example exists dictionary: mother, uncle, cat, dog. And you typed smth like
"maaather", as the result will be "maaather" -> mother.
This algorithm possible to use in spell checking,sound recognition etc.
Algorithm description will be later
Here is the source code:
/**
*
* This class is for finding coefficient of similarity of the two words
* using Viterbi(Bellman) algorithm
* @author Sergey
*/
public class Analyzer {
private double init[][];
private double result[][];
private int n;
private int m;
private void calculateASCIIDifference(String str1, String str2) {
try{
n = str1.length();
m = str2.length();
init = new double[n][m];
for (int i = n - 1,i2 = 0; i >= 0; i--, i2++) {
for (int j = 0; j < m; j++) {
init[i2][j] = Math.abs((int) str1.charAt(i) - (int) str2.charAt(j));
}
}
} catch(Exception e){
System.out.println("calculateASCIIDifference exception: "+e.toString());
}
}
private double[][] getASCIIDifference() {
return init;
}
private void setASCIIDifference(double[][] a,int n, int m) {
this.n = n;
this.m = m;
this.init = a;
}
/**
*
* Calculates similarity index, minimal value is more similar
* </br>
* you should always compare source with template word like this:
* </br>
* calculateSimilarity(iphone 5, iphone) </br>
* calculateSimilarity(nokia, iphone) </br>
* calculateSimilarity(inone, iphone) </br>
*
* @param str1 - source word
* @param str2 - template word
* @return
*/
public double calculateSimilarity(String str1, String str2) {
calculateASCIIDifference(str1,str2);
try{
result = new double[n][m];
double res, val1, val2, val3;
//initialize first element
result[n - 1][0] = init[n - 1][0];
//initialize first column
for (int i = n - 2; i >= 0; i--) {
result[i][0] = result[i + 1][0] + init[i][0];
}
//initialize first row(from down)
for (int j = 1; j < m; j++) {
result[n - 1][j] = result[n - 1][j - 1] + init[n - 1][j];
}
//initialize others
for (int i = n - 2; i >= 0; i--) {
for (int j = 1; j < m; j++) {
val1 = result[i][j - 1] + init[i][j];
val2 = result[i + 1][j] + init[i][j];
val3 = result[i + 1][j - 1] + (init[i][j] * 2);
//minimum of the 3 val's
res = val1 < val2 ? val1 : val2;
res = res < val3 ? res : val3;
result[i][j] = res;
}
}
return getSimilarityValue();
}
catch(Exception e){
System.out.println("calculateSimilarity exception: "+e.toString());
}
return -1;
}
//getting result of Similarity
private double getSimilarityValue() {
//need to devide if m != n
if(m != n)
return result[0][m - 1] / (m + n);
else
return result[0][m - 1];
}
//getting result array
private double[][] getSimilarityArray() {
return result;
}
//to test
public static void main(String[] args) {
double result1,result2;
Analyzer an = new Analyzer();
result1 = an.calculateSimilarity("mather","mother");
result2 = an.calculateSimilarity("father","mother");
//result1 will be less than result2 => result1 is more similar
System.out.println(result1 + " " + result2);
}
}
Good links:
Related stackoverflow question
Levenshtein distance, soundex and metaphone
This is not accurate enough
ReplyDeletefor example:
"iphone" and "iphone": Similarity is 0
"iphone" and "iphone5": Similarity is 3.69
"iphone" and "iphone 5": Similarity is 8.36
and
"iphone" and "casemate": Similarity is 3.57 ????
Why?
maybe the program does not compute similarity but their distance....so there is obvious to find that the first case have distance equal 0
Deleteyou should always compare with same word like this:
Deleteiphone -> iphone 5 (gives 8)
casemate -> iphone5 (gives 160)
8 < 160
Why this is a fuzzy search?
ReplyDeleteToru
Delete