Monday, 13 February 2017

feedback on proposed address matching algorithm

The problem I have is to find duplicate matches for addresses such as:

  • 999, 99th cross, 9th block Jayanagar, Bangalore - 560070
  • 999, 99th cross, 9th block Jayangr, Bengaluru - 560070

One address can also be a minor permutation of the words in the other. The approach I am taking is as follows:

  1. Convert both address strings (A1 and A2) into word vectors AV1 and AV2 by splitting on punctuation and spaces.
  2. Convert each word of each vector into its metaphone, giving metaphone vectors MV1 and MV2. If a word is a number, keep it unchanged in the metaphone vector. (Rationale: metaphones help eliminate spelling mistakes and treat nearly identical words as the same.)
  3. Find the Levenshtein distance between every pair of elements of MV1 and MV2. (Rationale: comparing each word with every word in the other address helps to eliminate the effect of word rearrangements.)
  4. Find the minimum Levenshtein distance for each word of MV1. (Rationale: this finds the best match for a given word of one address in the other address.)
  5. This gives a vector holding, for each word of MV1, its minimum Levenshtein distance to some word in MV2. Let this vector be L1.
  6. For example, L1 = [0 0 0 171 171 0 0] would be the minimum Levenshtein distances for the words of MV1.
  7. Find the average of the values in L1. This gives an approximate score of how likely addresses A1 and A2 are duplicates.

As this value tends to zero, the probability that the two addresses match becomes higher.
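
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not a reference implementation: the function names (`tokenize`, `phonetic_key`, `levenshtein`, `address_score`) are hypothetical, and `phonetic_key` is a crude consonant-skeleton stand-in rather than real Metaphone, used here only to keep the example free of third-party dependencies. A real Metaphone implementation would replace it.

```python
import re


def tokenize(address):
    """Step 1: split on punctuation and whitespace, dropping empty tokens."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", address.lower()) if t]


def phonetic_key(word):
    """Step 2 stand-in: NOT real Metaphone, only a crude consonant skeleton.

    Tokens containing a digit are kept unchanged, as described in step 2.
    """
    if any(ch.isdigit() for ch in word):
        return word
    # Keep the first letter, drop later vowels, collapse repeated letters.
    key = word[0]
    for ch in word[1:]:
        if ch in "aeiou" or ch == key[-1]:
            continue
        key += ch
    return key


def levenshtein(a, b):
    """Step 3: classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def address_score(a1, a2):
    """Steps 4-7: average, over the words of MV1, of the best match in MV2."""
    mv1 = [phonetic_key(w) for w in tokenize(a1)]
    mv2 = [phonetic_key(w) for w in tokenize(a2)]
    if not mv1 or not mv2:
        raise ValueError("both addresses must contain at least one token")
    l1 = [min(levenshtein(w1, w2) for w2 in mv2) for w1 in mv1]
    return sum(l1) / len(l1)


a1 = "999, 99th cross, 9th block Jayanagar, Bangalore - 560070"
a2 = "999, 99th cross, 9th block Jayangr, Bengaluru - 560070"
print(address_score(a1, a2))
```

Note that, as written in the steps, the score only averages over the words of MV1, so `address_score(a1, a2)` and `address_score(a2, a1)` can differ when one address contains extra words.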

This is the algorithm I've come up with. I'd appreciate feedback on whether it has inherent flaws I might have missed, or ways in which it can be improved. It would also be helpful if anyone could point to some good material or papers.

PS. I've seen the following answers on S.O.
