Suggestion needed regarding data cleaning for correcting addresses



I am working on a problem and am seeking advice/guidance on how to solve it.

I am trying to normalize the data and populate the correct zipcode, city, and state. The data contains zipcode, city, state, and address fields, along with a lot of incorrect information such as typos, e.g. the city field contains "DL" instead of "Delhi", or a Mumbai address shows the state as Gujarat. I have tried the following approaches:

  1. Look up the correct zipcode, city, and state information and normalize against it; this correctly normalizes only 30-40% of the records.

  2. Tokenize the address and apply many conditional rules, together with the lookup information, to determine the correct zipcode, city, and state. The address field contains a lot of rich information that is useful for building the lookup and normalizing the data. This approach correctly normalizes only 40-50% of the records.
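The lookup-plus-rules idea from approaches 1 and 2 can be sketched roughly as below. All names here (`PIN_LOOKUP`, `CITY_ALIASES`, the sample entries) are invented for illustration; in practice the tables would come from a postal master file and from abbreviations observed in your data. Fuzzy matching via the standard library's `difflib` helps absorb spelling mistakes that exact lookups miss.

```python
import difflib

# Hypothetical reference table of correct zipcode -> (city, state) pairs.
PIN_LOOKUP = {
    "110001": ("Delhi", "Delhi"),
    "400001": ("Mumbai", "Maharashtra"),
    "380001": ("Ahmedabad", "Gujarat"),
}

# Abbreviations/typos assumed to occur in the raw data (examples only).
CITY_ALIASES = {"DL": "Delhi", "BOM": "Mumbai"}

KNOWN_CITIES = ["Delhi", "Mumbai", "Ahmedabad"]

def normalize_record(zipcode, city, state):
    """Return a corrected (zipcode, city, state), trusting the zipcode first."""
    # 1. If the zipcode is in the lookup, it wins over the city/state fields.
    if zipcode in PIN_LOOKUP:
        good_city, good_state = PIN_LOOKUP[zipcode]
        return zipcode, good_city, good_state
    # 2. Expand known abbreviations, e.g. "DL" -> "Delhi".
    city = CITY_ALIASES.get(city.strip().upper(), city)
    # 3. Fuzzy-match the city against the master list to fix typos.
    match = difflib.get_close_matches(city.title(), KNOWN_CITIES, n=1, cutoff=0.6)
    if match:
        city = match[0]
    return zipcode, city, state
```

The key design choice is the precedence order: a valid zipcode is usually the most reliable field, so it overrides the free-text city/state; fuzzy matching is only a fallback when the zipcode itself is bad.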

The data contains a lot of historical records, and new data keeps arriving, so normalization is an iterative process. Is there a better way to do data normalization using a machine learning technique, i.e. one where the model learns from the historical data and populates the correct information itself?
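One way the "learn from historical data" idea is commonly framed is as text classification: use the records you have already normalized (by lookup or by hand) as labeled training data, and predict the city from the raw address text. A minimal sketch with scikit-learn, assuming you have such labeled pairs (the addresses below are invented examples); character n-grams are used because they are robust to spelling mistakes like "Gujurat" vs "Gujarat":

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy historical records (invented): raw address text paired with the
# manually verified city label. A real training set would be much larger.
addresses = [
    "12 Connaught Place New Delhi 110001",
    "Karol Bagh Delhi near metro station",
    "Flat 4B Andheri West Mumbai 400058",
    "Colaba Causeway Mumbai Maharashtra",
    "CG Road Navrangpura Ahmedabad 380009",
    "Maninagar Ahmedabad Gujarat",
]
cities = ["Delhi", "Delhi", "Mumbai", "Mumbai", "Ahmedabad", "Ahmedabad"]

# Character n-grams tolerate typos; TF-IDF down-weights common fragments.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(addresses, cities)

# Predict the city for a new, messy record (note the misspelled city).
prediction = model.predict(["Shop 7 Andheri East Mumbi 4000xx"])[0]
```

The same pattern extends to predicting state or zipcode, and the model can be retrained periodically as newly verified records accumulate, which fits the iterative nature of the problem.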



In addition to what you have already mentioned, you can try using the Google Maps Geocoding API to clean up the addresses. Here is a simple example that does something similar:

It will not correct all the addresses, but it can definitely be automated to reduce the errors.
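A minimal sketch of what such a cleanup could look like, using only the standard library. This assumes you have a Google API key; the request shape and the `postal_code` / `locality` / `administrative_area_level_1` component types follow the Geocoding API's documented JSON response, and the parsing is split into its own function so it can be tested without network access:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(address, api_key):
    """Send a raw address to the Google Geocoding API (needs a valid key)."""
    url = GEOCODE_URL + "?" + urlencode({"address": address, "key": api_key})
    with urlopen(url) as resp:
        return json.load(resp)

def extract_fields(geocode_response):
    """Pull zipcode, city, and state out of a geocoding JSON response."""
    if geocode_response.get("status") != "OK":
        return None  # address could not be geocoded
    components = geocode_response["results"][0]["address_components"]
    fields = {}
    for comp in components:
        if "postal_code" in comp["types"]:
            fields["zipcode"] = comp["long_name"]
        elif "locality" in comp["types"]:
            fields["city"] = comp["long_name"]
        elif "administrative_area_level_1" in comp["types"]:
            fields["state"] = comp["long_name"]
    return fields
```

You would loop `geocode` over the messy records (respecting the API's rate limits and billing) and use `extract_fields` to overwrite the bad zipcode/city/state values where the service returns a confident match.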