How to map 1000+ unique occupations/professions to standard occupation names



I am working on an H1B visa practice project.

It has 1000+ unique occupations under the SOC_NAME field/feature. Most of them differ only by a small change in name, e.g. teacher, teacher maths, teacher maths post studies, teacher maths high school, etc.

I need to map them to standard feature names so that their number comes down and becomes more manageable.

I can use .loc or a command like the following

df.OCCUPATION[df['SOC_NAME'].str.contains('computer','programmer')] = 'computer occupations'

df.OCCUPATION[df['SOC_NAME'].str.contains('software','web developer')] = 'computer occupations'

but this is a cumbersome method and a repetitive process.

Is there any other way to achieve the end result of mapping the 1000+ values, for example by using regex?




Hello Mohit, not sure you can simplify it as much as you'd like. str.contains accepts a regex, so you could first build your different regex combinations, such as:

import re

regex1 = 'computer|programmer|software|web developer'  # regex for 'computer occupations'
regex2 = 'a|b|c'

# use .loc with the boolean mask to avoid chained assignment
df.loc[df['SOC_NAME'].str.contains(regex1, flags=re.IGNORECASE), 'OCCUPATION'] = 'computer occupations'
df.loc[df['SOC_NAME'].str.contains(regex2, flags=re.IGNORECASE), 'OCCUPATION'] = 'xyz'
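
If the list of groups grows, one way to cut the repetition is to keep the regexes in a dictionary and loop over it. What follows is only a rough sketch under some assumptions: the sample DataFrame, the patterns table, the map_occupations helper and the 'other' fallback label are all made up for illustration, and I am assuming that the first matching pattern should win and that missing SOC_NAME values should simply fall into the fallback group.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real SOC_NAME column.
df = pd.DataFrame({
    "SOC_NAME": [
        "Computer Programmer",
        "Web Developer",
        "Teacher Maths High School",
        "teacher",
        "Accountant",
        None,
    ]
})

# Hypothetical table of standard name -> regex; extend it with your own groups.
patterns = {
    "computer occupations": r"computer|programmer|software|web developer",
    "teachers": r"teacher",
}

def map_occupations(soc_names, patterns, default="other"):
    """Return a Series of standard names; the first matching pattern wins."""
    result = pd.Series(default, index=soc_names.index, dtype="object")
    assigned = pd.Series(False, index=soc_names.index)
    for label, pattern in patterns.items():
        # na=False treats missing values as non-matches instead of NaN.
        mask = soc_names.str.contains(pattern, flags=re.IGNORECASE, na=False)
        # Only label rows that no earlier pattern has claimed.
        result[mask & ~assigned] = label
        assigned |= mask
    return result

df["OCCUPATION"] = map_occupations(df["SOC_NAME"], patterns)
print(df)
```

You still have to write the regexes by hand, but adding a group becomes one extra dictionary entry instead of another assignment line, and counting the rows left in 'other' shows how much of the column is still unmapped.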