How to map 1000+ unique occupations/professions to standard occupation names

pandas
python
regular_expression
regex

#1

Hello,
I am working on an H1B visa practice project.

It has 1000+ unique occupations in the SOC_NAME field/feature. Most of them differ only by a small change in name, e.g. teacher, teacher maths, teacher maths post studies, teacher maths high school, etc.

I need to map them to standard names so that the number of distinct values comes down and becomes more manageable.

I can use .loc or a command like the following

df.OCCUPATION[df['SOC_NAME'].str.contains('computer','programmer')] = 'computer occupations'

df.OCCUPATION[df['SOC_NAME'].str.contains('software','web developer')] = 'computer occupations'

But this is a cumbersome and repetitive process.

Is there any other way to map these 1000+ values, for example by using regex?

Thanks

Mohit


#2

Hello Mohit, I'm not sure you can simplify this as much as you'd like. str.contains accepts a regex, so you could first build your different regex combinations, such as:

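import re

regex1 = 'computer|programmer|software|web developer'  # regex for 'computer occupations'
regex2 = 'a|b|c'  # placeholder keywords for another category

# assign through .loc so the write goes to df itself rather than to a temporary copy
df.loc[df['SOC_NAME'].str.contains(regex1, flags=re.IGNORECASE), 'OCCUPATION'] = 'computer occupations'
df.loc[df['SOC_NAME'].str.contains(regex2, flags=re.IGNORECASE), 'OCCUPATION'] = 'xyz'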
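
To avoid writing one line per category, the regexes can be kept in a dictionary and applied in a loop. The sketch below is only one possible way to do it. The sample rows, the KEYWORDS dictionary, the "teachers" category, the "other" default and the map_occupations helper are all made up for illustration and are not part of the real dataset. Note that str.contains takes a single pattern, so alternatives have to be joined with | rather than passed as separate arguments.

```python
import re
import pandas as pd

# Hypothetical sample rows standing in for the real SOC_NAME column.
df = pd.DataFrame({
    "SOC_NAME": [
        "Computer Programmer",
        "SOFTWARE DEVELOPERS, APPLICATIONS",
        "teacher maths high school",
        "Web Developer",
        None,
        "chief executive",
    ]
})

# Hypothetical mapping: standard name -> keywords that should trigger it.
KEYWORDS = {
    "computer occupations": ["computer", "programmer", "software", "web developer"],
    "teachers": ["teacher"],
}


def map_occupations(soc, keywords, default="other"):
    """Return a Series of standard names; the first matching category wins."""
    result = pd.Series(default, index=soc.index, dtype="object")
    unassigned = pd.Series(True, index=soc.index)
    for standard_name, words in keywords.items():
        # Escape each keyword so it is matched literally, then join with |.
        pattern = "|".join(re.escape(word) for word in words)
        # na=False treats missing SOC_NAME values as "no match".
        hits = soc.str.contains(pattern, flags=re.IGNORECASE, na=False)
        result[hits & unassigned] = standard_name
        unassigned = unassigned & ~hits
    return result


df["OCCUPATION"] = map_occupations(df["SOC_NAME"], KEYWORDS)
print(df)
```

Rows that still end up as "other" can then be inspected with df.loc[df["OCCUPATION"] == "other", "SOC_NAME"].value_counts() to decide which keywords to add next.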

Rodolphe.