Dealing with a large number of features in a document-term matrix in text mining



I’m working on a dataset for language identification. Building a document-term matrix from it has produced a large number of features, because I split the words into n-grams of up to length 5. How can I better predict the language of each word? I tried neural networks in R but did not succeed.
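For readers following along, here is a minimal sketch of the feature construction described above and one common way to keep the feature count manageable: dropping n-grams that appear in too few words and storing each row sparsely. It is written in Python rather than R, purely for illustration. The function names, the `^`/`$` boundary markers, the `min_df` threshold, the reading of "grams" as character n-grams, and the toy word list are all assumptions, not part of the original setup.

```python
from collections import Counter


def char_ngrams(word, max_n=5):
    """Return all character n-grams of length 1..max_n.

    '^' and '$' mark the word boundaries (an illustrative choice).
    """
    padded = f"^{word}$"
    return [padded[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(padded) - n + 1)]


def build_vocabulary(words, max_n=5, min_df=2):
    """Keep only n-grams that occur in at least min_df distinct words."""
    doc_freq = Counter()
    for w in words:
        # set() so each n-gram is counted once per word
        doc_freq.update(set(char_ngrams(w, max_n)))
    kept = sorted(g for g, df in doc_freq.items() if df >= min_df)
    return {g: idx for idx, g in enumerate(kept)}


def to_sparse_row(word, vocab, max_n=5):
    """Map a word to {column index: count}, storing non-zero entries only."""
    counts = Counter(g for g in char_ngrams(word, max_n) if g in vocab)
    return {vocab[g]: c for g, c in counts.items()}


# Hypothetical toy data, only to show the shapes involved.
words = ["the", "then", "there", "casa", "cosa"]
vocab = build_vocabulary(words, max_n=5, min_df=2)
row = to_sparse_row("then", vocab)
print(len(vocab), row)
```

Raising `min_df` shrinks the vocabulary further. Whether that helps or hurts accuracy depends on the data, so it is worth checking on a held-out set.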


Hi @shivakrishna,

Could you describe your problem more clearly? Your approach seems correct.

I have a few questions:

  • What dataset are you using? How is it structured (i.e. contents of dataset)?
  • Could you compare your results with some benchmarks (other people’s results)?
  • Have you tried algorithms other than Neural networks?


Thank you for your reply, jalFaizy.
I’m working on “” 's subtask1.


Did you refer to the resources provided on the competition’s home page?


Yes, I got a maximum of 78% sentence-level accuracy with naive Bayes using the MALLET tool, so I want to improve on that, preferably using R.
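As a rough point of comparison, a word-level naive Bayes baseline over character n-grams can be sketched in a few lines. This is not the MALLET configuration mentioned above. It is a from-scratch multinomial naive Bayes with add-one smoothing, written in Python for illustration, and the class name, the `alpha` parameter, and the toy training words are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, max_n=5):
    """Character n-grams of length 1..max_n with '^'/'$' boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (illustrative sketch)."""

    def __init__(self, max_n=5, alpha=1.0):
        self.max_n = max_n
        self.alpha = alpha  # additive (Laplace) smoothing constant

    def fit(self, words, labels):
        self.label_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for w, y in zip(words, labels):
            self.gram_counts[y].update(char_ngrams(w, self.max_n))
        self.vocab = set().union(*self.gram_counts.values())
        self.totals = {y: sum(c.values()) for y, c in self.gram_counts.items()}
        return self

    def predict(self, word):
        n_words = sum(self.label_counts.values())
        v = len(self.vocab)

        def score(y):
            # log prior + sum of smoothed log likelihoods
            s = math.log(self.label_counts[y] / n_words)
            for g in char_ngrams(word, self.max_n):
                if g in self.vocab:  # n-grams never seen in training are skipped
                    s += math.log((self.gram_counts[y][g] + self.alpha)
                                  / (self.totals[y] + self.alpha * v))
            return s

        return max(self.label_counts, key=score)


# Hypothetical toy training set, far too small for real use.
train_words = ["the", "this", "that", "with", "que", "para", "como", "pero"]
train_labels = ["en", "en", "en", "en", "es", "es", "es", "es"]
model = NgramNaiveBayes(max_n=5).fit(train_words, train_labels)
print(model.predict("then"), model.predict("pera"))
```

A baseline like this gives a reference number before trying heavier models. Sentence-level accuracy would still need a rule for combining the word-level predictions, which is outside this sketch.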


I have not worked on NLP, so I may not be the right person to advise you. But you could refer to the approaches used in NLP competitions like this one


Thank you…