Dealing with a large number of features in a document-term matrix in text mining

machine_learning

#1

I’m working on a dataset for language identification. Building the document-term matrix has generated a large number of features, because I split the words into n-grams of length up to 5. Please tell me how to better predict the language of each word. I tried neural networks in R but did not succeed.
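
A minimal Python sketch of this kind of feature extraction, assuming the n-grams are character n-grams taken from each word. The function names, the `^`/`$` word-boundary markers, the `min_df` cutoff, and the toy words are all illustrative assumptions, not part of the original setup. Dropping n-grams that occur in very few words is one common way to shrink the matrix, and storing each row sparsely avoids materialising the zeros.

```python
from collections import Counter


def char_ngrams(word, max_n=5):
    """Return all character n-grams of length 1..max_n.

    The word is lowercased and wrapped in '^' and '$' so that prefixes
    and suffixes become distinct features (an assumption of this sketch).
    """
    padded = "^" + word.lower() + "$"
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def build_vocabulary(words, max_n=5, min_df=2):
    """Map each kept n-gram to a column index.

    An n-gram is kept only if it occurs in at least min_df distinct words,
    which removes the long tail of one-off features.
    """
    doc_freq = Counter()
    for w in words:
        doc_freq.update(set(char_ngrams(w, max_n)))
    kept = sorted(g for g, c in doc_freq.items() if c >= min_df)
    return {g: idx for idx, g in enumerate(kept)}


def vectorize(word, vocab, max_n=5):
    """Return one sparse row as {column_index: count}."""
    counts = Counter(g for g in char_ngrams(word, max_n) if g in vocab)
    return {vocab[g]: c for g, c in counts.items()}


# Hypothetical toy words, only to show the calls.
words = ["khana", "khaana", "dinner", "winner"]
vocab = build_vocabulary(words, max_n=5, min_df=2)
row = vectorize("dinner", vocab)
```

Raising `min_df` trades vocabulary size against coverage of rare n-grams, so it is worth tuning on held-out data rather than fixing it up front.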


#2

Hi @shivakrishna,

Could you describe your problem more precisely? Your approach seems correct as it stands.

Some questions I have:

  • What dataset are you using? How is it structured (i.e. contents of dataset)?
  • Could you compare your results with some benchmarks (other people’s results)?
  • Have you tried algorithms other than neural networks?

#3

Thank you for your reply, jalFaizy.
I’m working on subtask 1 of this shared task: http://research.microsoft.com/en-us/events/fire13_st_on_transliteratedsearch/fire15st.aspx


#4

Did you refer to the resources provided on the competition’s home page?


#5

Yes. The best I got was 78% sentence-level accuracy with naive Bayes using the MALLET toolkit. I want to improve on that, preferably using R.
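
For reference, a naive Bayes baseline of this kind can be sketched in plain Python. This is a multinomial naive Bayes with add-one smoothing over character n-grams, predicting one label per word. It is not MALLET's implementation. The class name, the `^`/`$` boundary markers, the `alpha` parameter, and the toy words and labels are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, max_n=5):
    """All character n-grams of length 1..max_n, with '^'/'$' boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (illustrative sketch)."""

    def __init__(self, max_n=5, alpha=1.0):
        self.max_n = max_n
        self.alpha = alpha  # additive (Laplace) smoothing constant

    def fit(self, words, labels):
        self.class_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for w, y in zip(words, labels):
            self.gram_counts[y].update(char_ngrams(w, self.max_n))
        self.vocab = set()
        for counts in self.gram_counts.values():
            self.vocab.update(counts)
        self.totals = {y: sum(c.values()) for y, c in self.gram_counts.items()}
        return self

    def log_scores(self, word):
        """Unnormalised log posterior for each class."""
        n_words = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for y in self.class_counts:
            score = math.log(self.class_counts[y] / n_words)
            denom = self.totals[y] + self.alpha * vocab_size
            for g in char_ngrams(word, self.max_n):
                if g in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((self.gram_counts[y][g] + self.alpha) / denom)
            scores[y] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


# Hypothetical toy data: a handful of words with made-up language labels.
train_words = ["khana", "paani", "ghar", "kitab", "water", "house", "book", "dinner"]
train_labels = ["hi", "hi", "hi", "hi", "en", "en", "en", "en"]
model = NgramNaiveBayes(max_n=5).fit(train_words, train_labels)
```

Because this scores words one at a time, a sentence-level figure like the one above would need a separate step that checks whether every word in a sentence was labelled correctly.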


#6

I have not worked on NLP, so I may not be the right person to advise you. But you could refer to the approaches used in NLP competitions like this one.


#7

Thank you…