I am working on a problem and would like some advice or guidance on how to approach it.
I am trying to normalize data so that each record ends up with the correct zipcode, city, and state. The data contains zipcode, city, state, and address fields, along with many errors such as typos and misplaced values: for example, the city field contains "DL" instead of "Delhi", or a record for Mumbai lists its state as Gujarat. I have tried the following approaches:
Lookup against a reference table of correct zipcode, city, and state values. This normalizes only 30-40% of the records correctly.
Tokenize the address field and apply many conditional rules, combined with the lookup table, to derive the correct zipcode, city, and state. The address field contains rich information that is useful for building lookups and normalizing the data. This approach normalizes only 40-50% of the records correctly.
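To make the first (lookup-based) approach concrete, here is a minimal sketch of what I mean, assuming a reference table keyed on zipcode; the column names and sample rows are made up for illustration:

```python
# Minimal sketch of the lookup approach: join raw records against a
# reference table of known-good (zipcode, city, state) triples.
# All column names and sample data here are hypothetical.

import pandas as pd

# Reference table of correct zipcode -> city/state mappings
reference = pd.DataFrame({
    "zipcode": ["110001", "400001"],
    "city": ["Delhi", "Mumbai"],
    "state": ["DL", "MH"],
})

# Raw records with typos / misplaced values
raw = pd.DataFrame({
    "zipcode": ["110001", "400001"],
    "city": ["DL", "Mumbai"],        # state code typed into the city field
    "state": ["Delhi", "Gujarat"],   # wrong state for Mumbai
    "address": ["12 Connaught Place, New Delhi", "5 Fort Road, Mumbai"],
})

# Where the zipcode is trustworthy, overwrite city/state from the lookup
normalized = raw.drop(columns=["city", "state"]).merge(
    reference, on="zipcode", how="left"
)
print(normalized[["zipcode", "city", "state"]])
```

This only helps when the zipcode itself is correct, which is roughly why it covers just a fraction of the records.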
The data contains a lot of historical records, and new data keeps arriving, so normalization is an iterative process. Is there a better way to do this using machine learning techniques, i.e. can a model learn from the historical data and populate the correct information itself?
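To illustrate the kind of learned approach I have in mind (not something I have built): train a text classifier on historical, already-corrected records so it predicts the city from the free-text address field. The training set, column values, and model choice below are all hypothetical; character n-grams are just one option that tolerates typos like "Mumbay" or "Dehli":

```python
# Hedged sketch: learn city labels from historical addresses with a
# character n-gram classifier. Training data here is tiny and made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Historical addresses whose city was already verified
addresses = [
    "12 Connaught Place, New Delhi 110001",
    "Flat 3, Karol Bagh, Delhi",
    "Hauz Khas Village, New Delhi",
    "5 Fort Road, Mumbai 400001",
    "Marine Drive, Mumbai",
    "Bandra West, Mumbai MH",
]
cities = ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Mumbai"]

# Character n-grams are robust to misspellings in the raw fields
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(addresses, cities)

# Predict the city for a new, messy record
pred = model.predict(["Shop 4, Connaught Place, Delhi"])
print(pred[0])
```

Is something along these lines a reasonable direction, or is there a more standard technique for this kind of iterative address normalization?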