2012-12-10 7 views
11

I have problems finding an implementation of closest match strings for .netFind closest match to input string in a list of strings

I would like to match a list of strings, example:

input string: "Publiczna Szkoła Podstawowa im. Bolesława Chrobrego w Wąsoszu"

List of strings:

Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu

Szkoła Podstawowa Specjalna

Szkoła Podstawowa im.Henryka Sienkiewicza w Wąsoszu

Szkoła Podstawowa im. Romualda Traugutta w Wąsoszu Górnym

This would clearly need to be matched with "Publiczna Szkoła Podstawowa im. B. Chrobrego w Wąsoszu".

What algorithms are there available for .net?

Odpowiedz

10

Edit distance

Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

Levenshtein distance

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

Fast, memory efficient Levenshtein algorithm

C# Levenshtein

using System; 

/// <summary> 
/// Contains approximate string matching 
/// </summary> 
static class LevenshteinDistance 
{ 
    /// <summary> 
    /// Compute the distance between two strings. 
    /// </summary> 
    public static int Compute(string s, string t) 
    { 
    int n = s.Length; 
    int m = t.Length; 
    int[,] d = new int[n + 1, m + 1]; 

    // Step 1 
    if (n == 0) 
    { 
     return m; 
    } 

    if (m == 0) 
    { 
     return n; 
    } 

    // Step 2 
    for (int i = 0; i <= n; d[i, 0] = i++) 
    { 
    } 

    for (int j = 0; j <= m; d[0, j] = j++) 
    { 
    } 

    // Step 3 
    for (int i = 1; i <= n; i++) 
    { 
     //Step 4 
     for (int j = 1; j <= m; j++) 
     { 
     // Step 5 
     int cost = (t[j - 1] == s[i - 1]) ? 0 : 1; 

     // Step 6 
     d[i, j] = Math.Min(
      Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), 
      d[i - 1, j - 1] + cost); 
     } 
    } 
    // Step 7 
    return d[n, m]; 
    } 
} 

class Program 
{ 
    static void Main() 
    { 
    Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant")); 
    Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha")); 
    Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax")); 
    } 
} 
15

.NET does not supply anything out of the box - you need to implement a an Edit Distance algorithm yourself. For example, you can use Levenshtein Distance , like this:

// This code is an implementation of the pseudocode from the Wikipedia, 
// showing a naive implementation. 
// You should research an algorithm with better space complexity. 
public static int LevenshteinDistance(string s, string t) { 
    int n = s.Length; 
    int m = t.Length; 
    int[,] d = new int[n + 1, m + 1]; 
    if (n == 0) { 
     return m; 
    } 
    if (m == 0) { 
     return n; 
    } 
    for (int i = 0; i <= n; d[i, 0] = i++) 
     ; 
    for (int j = 0; j <= m; d[0, j] = j++) 
     ; 
    for (int i = 1; i <= n; i++) { 
     for (int j = 1; j <= m; j++) { 
      int cost = (t[j - 1] == s[i - 1]) ? 0 : 1; 
      d[i, j] = Math.Min(
       Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), 
       d[i - 1, j - 1] + cost); 
     } 
    } 
    return d[n, m]; 
} 

Call LevenshteinDistance(targetString, possible[i]) for each i , then pick the string possible[i] for which LevenshteinDistance returns the smallest value.

+0

thanks. works great. – gleapman