Welcome to Practice Problem : Twitter Sentiment Analysis




This will be the official thread for any discussion related to the practice problem. Feel free to ask questions, share approaches and learn.


Where can I get the data?
Under the Data Files section there are no links for test_tweets.csv and train.csv.


Hey @jamilur,

You will find the links to download data just below the Evaluation Metric section.

Sanad :slight_smile:


In the training sample, all the tweets are labeled as 1. Why is this so?



Kindly check again, because the training file contains labels having both 0 and 1.


Thanks NSS. This is the classification report on the test set (an 80:20 split of train) that I am getting. Will you please guide me on how to improve it? I am unable to use NLTK/spaCy; this is just scikit-learn.
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      7433
           1       0.87      0.47      0.61       558

   avg / total     0.95      0.96      0.95      7991


So far I can say it may be because of the imbalanced data; correct me if I'm wrong. Also, please share some links on how to treat imbalanced data in text analytics.
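On the imbalance point: one common remedy in scikit-learn is class weighting, so the minority class (1) is not drowned out by the majority class. A minimal sketch (the toy texts and labels below are made up purely for illustration, not from the contest data):

```python
# Sketch: countering class imbalance with class_weight='balanced'.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: label 1 (offensive) is the minority class.
texts = ["great day", "love this", "hate speech example", "awful racist tweet",
         "nice weather", "happy vibes", "good morning", "terrible slur"]
labels = [0, 0, 1, 1, 0, 0, 0, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# class_weight='balanced' reweights the loss inversely to class frequency,
# so mistakes on the rare class 1 cost more during training.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, labels)
preds = clf.predict(X)
```

Oversampling the minority class (or undersampling the majority) before fitting is the other common approach.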




I would need help here writing the R/Python code. Earlier I have done regression and classification problems with categorical/numerical attributes, but this time the attribute itself is twitter comments. It has noise (punctuation, hashtags, stop words), etc. Do we need to remove stop words using NLP techniques and then do the classification?
Could you please share the code?
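Since you asked about cleaning: a minimal Python sketch of the usual steps (the regexes and the example tweet here are illustrative, not an official solution; stop-word removal via NLTK can be added on top):

```python
import re

def clean_tweet(text):
    """Minimal cleaning sketch: lowercase, drop @handles and URLs,
    keep hashtag words but drop the '#', strip remaining punctuation."""
    text = text.lower()
    text = re.sub(r"@\w+", "", text)       # remove @user handles
    text = re.sub(r"http\S+", "", text)    # remove URLs
    text = re.sub(r"#", "", text)          # keep the word, drop the '#'
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation/digits/emoji
    return " ".join(text.split())          # collapse extra whitespace

print(clean_tweet("@user I LOVE this!!! #model http://t.co/x"))
# -> "i love this model"
```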



Sorry I asked the wrong question


I also have the same question. I have cleaned the data somewhat using NLP.
However, I would like to know whether I need to break the tweet sentences down into words for labels 1 and 0 and then process them.


Hi, I am unable to get beyond a score of 0.7237813885. Can anyone share their approach here?


How did you clean the tweets?
I am having some problems. For example, in tweet 4:
#model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦

Which encoding did you use? I am using Python 2.7.



I have the same problem, and for the moment, when I clean my Train and Test datasets, I just delete them.

When I open the train file it looks like this : 😙😎👄ðŸ‘
And I don’t know what they are. Numbers or letters?
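Those characters are mojibake: UTF-8 bytes that got decoded as cp1252/latin-1. Assuming that is what happened to your file, a sketch of the round-trip repair (opening the CSV with encoding='utf-8' in the first place avoids the problem entirely):

```python
# Sketch: undoing UTF-8-read-as-cp1252 mojibake.
# 'ðŸ˜™' is what the emoji '😙' looks like after the mis-decode.
garbled = "ðŸ˜™"
fixed = garbled.encode("cp1252").decode("utf-8")  # re-encode, decode correctly
print(fixed)  # -> 😙
```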


How shall I handle the actual test data set? I have converted the training data set into a sparse matrix by using CountVectorizer from sklearn.
#Line of code:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(train).toarray()

Now what shall I do for the test dataset?
Should it be cv.fit_transform(test).toarray() or cv.transform(test)?


cv.transform(test) will do. Fit the vectorizer on the training data only, so the test matrix is built with the same vocabulary and columns.
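To illustrate the fit/transform split (toy sentences, not the contest files): learn the vocabulary from train only, then reuse it on test so both matrices share the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["i love this", "i hate this", "love love love"]
test = ["hate this day"]

cv = CountVectorizer()
X_train = cv.fit_transform(train)  # learn the vocabulary from train only
X_test = cv.transform(test)        # reuse that vocabulary for test

# Columns match one-to-one; unseen test words like 'day' are simply dropped.
assert X_train.shape[1] == X_test.shape[1]
```

Calling fit_transform on test would build a different vocabulary, and the classifier's learned coefficients would no longer line up with the columns.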