Damerau–Levenshtein distance

Now you'd thing what the hell am I rambling about now, but here's the thing. Reading through my daily dose of programming news in Google Reader I stumbled upon this clever algorithm - the Damerau-Levenshtein distance algorithm. It is used to find the number of steps it would take to transform one string into another using the following operations: insertion, deletion, substitution and trasposition. Or put simply - Damerau-Levenshtein distance is a metric to defining the difference between two strings. The main areas of use are spell-checking (typos etc), DNA difference calculation etc. So I read up on the algorith at the Wiki page and decided to implement it just for the implementation's sake. Having fiddled around with the code for some time it dawned on me - I have this OCD when I perform the exact same operation in my head for two words that have a revelance to whichever situation I happen to be in at a given moment. I also provide a corresponding String extension method. Oh and if anyone hasn't picked it up yet, I mainly write code in C# 3.0 as is the case with this post. So without further ado, I present to you the implementation:

public static class DamerauLevenshtein
{
 public static int DamerauLevenshteinDistanceTo(this string @string, string targetString)
 {
  return DamerauLevenshteinDistance(@string, targetString);
 }

 public static int DamerauLevenshteinDistance(string string1, string string2)
 {
  if (String.IsNullOrEmpty(string1))
  {
   if (!String.IsNullOrEmpty(string2))
    return string2.Length;

   return 0;
  }

  if (String.IsNullOrEmpty(string2))
  {
   if (!String.IsNullOrEmpty(string1))
    return string1.Length;

   return 0;
  }

  int length1 = string1.Length;
  int length2 = string2.Length;

  int[,] d = new int[length1 + 1, length2 + 1];

  int cost, del, ins, sub;

  for (int i = 0; i <= d.GetUpperBound(0); i++)
   d[i, 0] = i;

  for (int i = 0; i <= d.GetUpperBound(1); i++)
   d[0, i] = i;

  for (int i = 1; i <= d.GetUpperBound(0); i++)
  {
   for (int j = 1; j <= d.GetUpperBound(1); j++)
   {
    if (string1[i - 1] == string2[j - 1])
     cost = 0;
    else
     cost = 1;

    del = d[i - 1, j] + 1;
    ins = d[i, j - 1] + 1;
    sub = d[i - 1, j - 1] + cost;

    d[i, j] = Math.Min(del, Math.Min(ins, sub));

    if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
     d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
   }
  }

  return d[d.GetUpperBound(0), d.GetUpperBound(1)];
 }
}

What do you guys think, should the presentation of interesting algorithm finds in my blog become a recurring subject? I think it wouldn't hurt.

Peace out,
Mihkel

4 comments:

KiCHY said...

Thank you for this great source! It saved some hours of work ;)

Anonymous said...

This is a great way to handle errors. I've implemented it along with a URL rewriting method to ensure that if there's a mistake in a URL, the website redirects you to your most likely intended page.

Thanks a lot!

Anonymous said...

As explained in the wikipedia entry http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance, your algorythm does not actually compute the Damerau-Levenshtein distance. (TO and OST have a distance of 3 in your algorythm, but should have 2).

Tyler Sommer said...

Thanks, you saved a bunch of time for me! I've added another layer to your code to allow me to compare a string's similarity to another. Usage is:

string string1 = "This is a string";

string string2 = "Thsi si a stirng";

bool thisIsTrue = string1.IsSimilarTo(string2);


I divide the the total Dam-Lev distance by the average number of words between the two strings, thus giving an overall 'average' distance of the string or phrase.

private static int CountWords(string s)
{
MatchCollection collection = Regex.Matches(s, @"[\S]+");
return collection.Count;
}

public static bool IsSimilarTo(this string string1, string string2)
{
string1 = string1.Replace("-", " ").Replace("_", " ").ToLower();
string2 = string2.Replace("-", " ").Replace("_", " ").ToLower();

if (string1.Equals(string2)) { return true; }

float distance = DamerauLevenshteinDistance(string1, string2) / ((CountWords(string1) + CountWords(string2)) / 2);

return distance <= 1.1 ? true : false;
}