3D grphique: feedback on proposed address matching algorithm

Monday, 13 February 2017

The problem I have is finding duplicate matches for an address, for example:

999, 99th cross, 9th block Jayanagar, Bangalore - 560070
999, 99th cross, 9th block Jayangr, Bengaluru - 560070

One address can also be a minor permutation of the words in the other. The approach I am taking is as follows:

Convert both address strings into vectors of words by splitting on punctuation and spaces, giving AV1 and AV2.
Convert each word vector into a Metaphone vector by replacing every word with its Metaphone code, giving MV1 and MV2. If a word is a number, keep the number as it is in the Metaphone vector. (Rationale: Metaphone codes help eliminate spelling mistakes and nearly identical words.)
Find the Levenshtein distance between every combination of elements of MV1 and MV2. (Rationale: comparing each word with every word in the other address helps eliminate the effect of word rearrangements.)
Find the minimum Levenshtein distance for each word of MV1. (Rationale: this finds the best match for a given word of one address in the other address.)
This gives a vector holding, for each word of MV1, its minimum Levenshtein distance to some word in MV2. Let this vector be L1.
For example, L1 = [0 0 0 171 171 0 0] would be the minimum Levenshtein distances of the words of MV1.
Find the average of the values in L1. This gives an approximate duplicate score for addresses A1 and A2.

As this value tends to zero, the probability that the two addresses match becomes higher.
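
To make the steps concrete, here is a minimal Python sketch of the pipeline as described. It is only an illustration under some assumptions: the function names (tokenize, crude_phonetic_key, levenshtein, duplicate_score) are hypothetical, and crude_phonetic_key is a deliberately simplified stand-in for Metaphone, not the real Metaphone algorithm. A real encoder from a phonetic library could be passed in through the encode parameter instead. Keeping tokens that merely contain digits (such as "99th") unchanged is also an assumption that slightly extends the "keep numbers as they are" rule.

```python
import re


def tokenize(address):
    """Split an address into lowercase words on punctuation and whitespace."""
    return [w for w in re.split(r"[^0-9A-Za-z]+", address.lower()) if w]


def crude_phonetic_key(word):
    """Hypothetical stand-in for Metaphone (NOT the real algorithm):
    keep the first letter, drop later vowels, collapse repeated letters."""
    if any(ch.isdigit() for ch in word):
        return word  # assumption: tokens containing digits are kept as they are
    word = word.upper()
    key = word[0]
    for ch in word[1:]:
        if ch in "AEIOU" or ch == key[-1]:
            continue
        key += ch
    return key


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def duplicate_score(addr1, addr2, encode=crude_phonetic_key):
    """Return (average of L1, L1), where L1 holds the minimum distance
    from each encoded word of addr1 to any encoded word of addr2."""
    mv1 = [encode(w) for w in tokenize(addr1)]
    mv2 = [encode(w) for w in tokenize(addr2)]
    if not mv1 or not mv2:
        raise ValueError("both addresses must contain at least one word")
    l1 = [min(levenshtein(x, y) for y in mv2) for x in mv1]
    return sum(l1) / len(l1), l1


a1 = "999, 99th cross, 9th block Jayanagar, Bangalore - 560070"
a2 = "999, 99th cross, 9th block Jayangr, Bengaluru - 560070"
score, l1 = duplicate_score(a1, a2)
print(score, l1)
```

With the simplified encoder, the two example addresses reduce to the same keys, so the score comes out as 0.0. Note that, as written, the score is computed only from the words of the first address, so duplicate_score(a1, a2) and duplicate_score(a2, a1) can differ when one address has extra words.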

This is the algorithm I've come up with. I'd appreciate feedback on whether it has inherent flaws I might have missed, or ways in which it can be improved. It would also be helpful if anyone could point to some good material or papers.

PS. I've seen the following answers on S.O.

palash kulshreshtha
