Spell error mapping


#1

Hi Guys,

I am working on a large dataset that contains many spelling errors, introduced when the data was recorded manually or electronically. A single provider ID can map to several providers, but for the same provider the first name and last name vary because of spelling errors. For example:
provider id   first name   last name
12345         arun         rastogi
12345         arrun        rastogi
12345         arun         raastogi
1234          aruun        rastoge
The names are the same but have been spelled differently. We have millions of records like this. Please suggest how to deal with the spelling errors and treat these records as one.


#2

Hi @pkumchauhan,

I think you will have to write your own spell corrector for this.

What I would do is either:

  • Build a database of probable names, then create a spell checker that searches the database and returns the closest name

or, (if the above method is too hard to do manually)

  • Use a string-matching algorithm that, for each entry in the dataset, replaces the erroneous name with the most frequent matching string in the entire dataset.
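
To make the two options concrete, here is a minimal Python sketch that uses only the standard library's difflib. The function names (closest_name, canonicalize) and the 0.8 similarity cutoff are my own assumptions for illustration and would need tuning on real data. The same idea should carry over to R, for example with the built-in adist() function.

```python
from collections import Counter
from difflib import get_close_matches


def closest_name(name, known_names, cutoff=0.8):
    """Option 1: look a name up in a list of known-good names.

    Returns the closest known name, or the input unchanged when
    nothing is similar enough (cutoff is an assumed threshold).
    """
    matches = get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name


def canonicalize(names, cutoff=0.8):
    """Option 2: map every spelling to the most frequent similar spelling.

    Spellings are visited from most to least frequent, so the first
    spelling of each cluster becomes that cluster's representative.
    """
    counts = Counter(names)
    representatives = []  # canonical spellings chosen so far
    mapping = {}
    for name, _ in counts.most_common():
        match = get_close_matches(name, representatives, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            representatives.append(name)
            mapping[name] = name
    return mapping


# Illustrative input: the spellings from post #1, assuming the
# correct spelling is the one that occurs most often.
names = ["arun rastogi"] * 3 + ["arrun rastogi", "arun raastogi", "aruun rastoge"]
mapping = canonicalize(names)
```

With this input every variant maps to "arun rastogi". Option 2 only works when the correct spelling really is the most common one; if the variants are equally frequent, the first one seen wins.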

I would not rely on a machine learning algorithm, as it would easily break when a new name (i.e. a name not present in the historical data) occurs, although there has been work done in this domain (refer to this).


#3

Thanks for replying.
Is that possible using R? Please suggest.


#4

Could you specify what exactly you want to do in R?

PS: Here’s a spell checker for R


#5

I guess a spell checker won't help this time, because I need to identify each unique individual based on their names, addresses, and phone numbers (available in the dataset), mapped against a unique provider ID. How might this be done without manually going through each record, one at a time?
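
One possible way to approach this is fuzzy record matching: compare records only within the same provider ID, score each pair on name, address and phone, and group the pairs that score above a threshold. Below is a rough Python sketch using only the standard library. The field names, the weights (0.5 / 0.3 / 0.2), the 0.75 threshold and the sample records are all made up for illustration and are not taken from the actual dataset.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def digits(phone):
    """Keep only the digits so '555-0100' and '5550100' compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())


def same_person(r1, r2, threshold=0.75):
    """Weighted score over name, address and phone (assumed weights)."""
    p1, p2 = digits(r1["phone"]), digits(r2["phone"])
    phone_score = 1.0 if p1 and p1 == p2 else 0.0
    score = (0.5 * similarity(r1["name"], r2["name"])
             + 0.3 * similarity(r1["address"], r2["address"])
             + 0.2 * phone_score)
    return score >= threshold


def group_records(records, threshold=0.75):
    """Return one group label per record; equal labels mean same individual."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Only compare records that share a provider ID, which avoids
    # comparing every record against every other record.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec["provider_id"]].append(idx)

    for indices in blocks.values():
        for pos, i in enumerate(indices):
            for j in indices[pos + 1:]:
                if same_person(records[i], records[j], threshold):
                    parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]


# Made-up sample records, for illustration only.
records = [
    {"provider_id": "12345", "name": "arun rastogi", "address": "12 mg road", "phone": "555-0100"},
    {"provider_id": "12345", "name": "arrun rastogi", "address": "12 m.g. road", "phone": "5550100"},
    {"provider_id": "12345", "name": "sunita verma", "address": "7 park street", "phone": "555-0199"},
    {"provider_id": "99999", "name": "arun rastogi", "address": "12 mg road", "phone": "555-0100"},
]
labels = group_records(records)
```

Here the first two records end up in the same group, while the third (a different person) and the fourth (a different provider ID) stay separate. Comparisons within one provider ID are still pairwise, so very large groups would need a further blocking key, such as the first letter of the last name.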