How to cleanse place of work in a dataset of 300K records

cleaning

#1

I have a table in which one column holds the place of work. As you know, a place-of-work string can be anything (alphanumeric characters plus special characters). In my data this column also contains driving licence numbers (DL, from any country), passport numbers (US passport, Indian passport, etc.), ID numbers and telephone numbers. In some cases these DL, passport and telephone numbers are concatenated with the place of work. I have written SQL to filter them out; however, it does not give me the correct result for all 300k records. Manually going through each record in Excel takes a lot of time. Hence, I wanted to know: what is the best way or technique to separate out only the place of work? Note: I have around 300k records.

This is just sample data; however, please imagine that a place of work can be anything, anywhere in the world. The sample data is attached.


#2

Hi, if you are using Python, I would recommend using the re module to parse that column, since passport numbers, DL numbers and telephone numbers each have their own pattern. Share your sample data and we can analyze it.
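In the meantime, here is a rough sketch of what I mean. The real formats depend on your data, which I have not seen yet, so the label keywords, both patterns and the function name `extract_place_of_work` are assumptions for illustration only. It removes a labelled identifier (for example "Passport: A1234567") and any phone-like run of digits, then tidies what is left.

```python
import re

# Hypothetical label keywords; adjust them to what actually appears in the column.
LABEL = r"\b(?:passport|driving\s+licen[cs]e|dl|id|tel|phone)\b\s*(?:no\.?|number|#)?\s*[:\-]?\s*"

# A value is either a phone-like run of digits or an alphanumeric token
# of at least 5 characters that contains at least one digit.
VALUE = r"(?:\+?\d[\d\s\-()]{6,}\d|(?=[A-Z0-9\-]*\d)[A-Z0-9][A-Z0-9\-]{4,})"

LABELLED_VALUE = re.compile(LABEL + VALUE, re.IGNORECASE)
PHONE = re.compile(r"\+?\d[\d\s\-()]{6,}\d")


def extract_place_of_work(raw):
    """Strip labelled IDs and phone-like numbers, return the remaining text."""
    text = LABELLED_VALUE.sub(" ", raw)   # e.g. "Passport: A1234567"
    text = PHONE.sub(" ", text)           # unlabelled phone numbers
    text = re.sub(r"\s+", " ", text)      # collapse repeated whitespace
    return text.strip(" ,;:-/")           # drop leftover separators at the ends


if __name__ == "__main__":
    samples = [
        "Acme Corp, Passport: A1234567",
        "Infosys Ltd Tel +1 415-555-0100",
        "Passport no. Z7654321 Tata Motors",
    ]
    for s in samples:
        print(repr(s), "->", repr(extract_place_of_work(s)))
```

This will not catch identifiers glued directly onto the name with no separator, and it may eat a legitimate number in a company name, so I would run it on a sample first and review the rows where the output differs from the input.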


#3

The sample data is attached. Thanks.


#4

Hi @jrout

Your data seems a bit difficult to clean automatically. I would suggest working on the data collection process itself so that such unclean data is not generated.