How can we use textual data variables for modeling?

textanalytics

#1

I am working on a classification problem that has several text variables such as skill, college, educational degree, and others. Each of these variables holds a combination of several words.

For example, Skill has values such as:
1st Record: SAS, Python, Analytics, Predictive Modeling, SQL
2nd Record: R, Python, Consultancy, SQL, Project Management, Regression, Classification,
3rd Record: Qlikview, Python, Java Script, Excel, Retail Analytics, BFSI Analytics,

The other variables have similar kinds of values. Please suggest a way to make sense of this data. What are the various ways to treat textual data variables in a data set?

Thanks!


#2

One simple way is to use the “bag of words” approach, which counts the number of occurrences of each word present in the data. The resulting numeric values can then be used as features for classification. You could also try “TF-IDF” weights as features.
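As a rough illustration of both ideas, here is a minimal pure-Python sketch applied to the three Skill records from the question. It rests on a few assumptions that are not in the original posts: each comma-separated skill is treated as one token (so "Predictive Modeling" stays together), tokens are lower-cased, and TF-IDF is the plain textbook variant, tf = count / record length and idf = log(N / df). Library implementations often use smoothed or normalized variants, so their numbers would differ.

```python
import math
from collections import Counter

# The three example Skill records from the question (trailing commas included).
records = [
    "SAS, Python, Analytics, Predictive Modeling, SQL",
    "R, Python, Consultancy, SQL, Project Management, Regression, Classification,",
    "Qlikview, Python, Java Script, Excel, Retail Analytics, BFSI Analytics,",
]

def tokenize(record):
    # Assumption: one token per comma-separated skill, lower-cased,
    # with empty pieces (from trailing commas) dropped.
    return [s.strip().lower() for s in record.split(",") if s.strip()]

docs = [tokenize(r) for r in records]

# Vocabulary: every distinct skill seen in any record, in sorted order.
vocab = sorted({token for doc in docs for token in doc})

# Bag of words: one row per record, one column per skill, cell = count.
counts = [Counter(doc) for doc in docs]
bow = [[c[token] for token in vocab] for c in counts]

# Plain TF-IDF (an assumed variant): tf = count / record length,
# idf = log(N / number of records containing the skill).
n_docs = len(docs)
doc_freq = {token: sum(1 for doc in docs if token in doc) for token in vocab}
tfidf = [
    [(c[token] / len(doc)) * math.log(n_docs / doc_freq[token]) for token in vocab]
    for c, doc in zip(counts, docs)
]

for name, row in zip(("record 1", "record 2", "record 3"), bow):
    print(name, row)
```

The rows of `bow` or `tfidf` can be joined with the other columns and passed to a classifier. Note that under this variant a skill that appears in every record, such as Python here, gets a TF-IDF weight of zero, which reflects that it does not help tell the records apart.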