Welcome to Practice Problem : Twitter Sentiment Analysis



This will be the official thread for any discussion related to the practice problem. Feel free to ask questions, share approaches and learn.


Where can I get the data?
Under the data file section there is no link for test_tweets.csv or train.csv.


Hey @jamilur,

You will find the links to download data just below the Evaluation Metric section.

Sanad :slight_smile:


In the training sample, all the tweets are labeled as 1. Why is this so?



Kindly check again: the training file contains both labels, 0 and 1.


Thanks NSS. This is the classification report I am getting on the test portion of an 80:20 split of the training data. Could you please guide me on how to improve it? I am unable to use NLTK/spaCy, so this uses scikit-learn only.
             precision    recall  f1-score   support

          0       0.96      0.99      0.98      7433
          1       0.87      0.47      0.61       558

avg / total       0.95      0.96      0.95      7991


So far, my guess is that the low recall on class 1 is because of the imbalanced data; correct me if I am wrong. Also, please share some links on how to treat imbalanced data in text analytics.
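One common way to treat class imbalance with scikit-learn alone is class weighting. The following is only a minimal sketch: the TF-IDF plus logistic regression pipeline is an assumed choice, not something used in this thread, and the toy `texts` and `labels` are hypothetical stand-ins for train.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train.csv: label 1 is the rare class.
texts = [
    "have a great day",
    "love this sunny morning",
    "happy birthday friend",
    "so excited for the weekend",
    "good vibes only",
    "what a lovely song",
    "hateful slur example one",
    "another hateful slur example",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# class_weight="balanced" reweights each class inversely to its frequency,
# so mistakes on the minority class cost more during training.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(texts, labels)
preds = model.predict(texts)
```

Whether this actually helps recall on class 1 depends on the data, so it is worth comparing the classification report with and without the weighting.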




I need help writing the R/Python code. Earlier I have worked on regression and classification problems with categorical/numerical attributes, but this time the attribute itself is the tweet text. It contains noise (punctuation, hashtags, stop words, etc.). Do we need to remove stop words using NLP techniques and then do the classification?
Could you please share the code?



Sorry I asked the wrong question


I also have the same question. I have cleaned the data somewhat using NLP.
However, I would like to know whether I need to break the tweets down into words for labels 1 and 0 before processing them.


Hi, I am unable to get beyond a score of 0.7237813885. Can anyone share their approach here?


How did you clean the tweets?
I am having some problems. For example, in tweet 4:
#model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦

Which encoding did you use? I am using Python 2.7.



I have the same problem, and for the moment I simply delete these characters when I clean my train and test datasets.

When I open the train file it looks like this: 😙😎👄ðŸ‘
And I don’t know what it is. Numbers or letters?
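Those symbols are emoji; sequences like "ðŸ‘" typically show up when UTF-8 bytes are displayed with a different encoding. A rough cleaning sketch follows, assuming Python 3 and that the file has been read as UTF-8. The function name `clean_tweet` and the exact rules are hypothetical illustrations, not a method taken from this thread.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and keep only letters and single spaces."""
    text = text.lower()
    # Drop user handles such as @user.
    text = re.sub(r"@\w+", " ", text)
    # Replace everything except a-z, '#', and whitespace
    # (this removes emoji, punctuation, and digits).
    text = re.sub(r"[^a-z#\s]", " ", text)
    # Keep the hashtag word but drop the '#' symbol.
    text = text.replace("#", "")
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_tweet("#model i love u take with u all the time in ur📱!!! 😙😎👄👅💦💦💦")
```

Dropping emoji entirely is a simplification; they may carry sentiment, so keeping or mapping them to words is another option to try.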


How should I handle the actual test dataset? I have converted the training dataset into a matrix using CountVectorizer from sklearn.
#Line of code:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(train).toarray()

Now what should I do for the test dataset?
Should it be cv.fit_transform(test).toarray() or cv.transform(test)?


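cv.transform() will do. The vectorizer should be fitted on the training data only, so that the test matrix uses the same vocabulary and column order.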
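A small sketch of the fit-on-train, transform-on-test pattern, using hypothetical toy lists in place of the real tweet columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the tweet columns of train.csv and test_tweets.csv.
train_texts = ["i love this", "this is bad"]
test_texts = ["love this new thing"]

cv = CountVectorizer()
# Learn the vocabulary from the training tweets and encode them.
X_train = cv.fit_transform(train_texts)
# Reuse the learned vocabulary; words unseen in training are ignored.
X_test = cv.transform(test_texts)
```

Calling fit_transform on the test set instead would build a different vocabulary, so the columns would no longer match what the model was trained on.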


Can I see your code?



I was going through the training data and found that a few tweets are wrongly labelled as 0.

The tweet below is labelled as 0, but I think it should be labelled as 1 (racist/sexist).

17844 0 #sorry #potus this #bigotry is fully #american #obama blasts #trump for #antimuslim #language,as #unamerican

Did anyone else feel the same? Tweets like this are causing the algorithm to make wrong predictions.



You could always remove the bad examples from training if that helps you to get better accuracy.


Hi, can anyone name the kinds of variables one can create from the tweet text? Put simply, should we start with basic features such as (1) identifying positive/negative words and (2) lexical diversity?

If not, what approach should I take to create variables and then do the classification?



Hi @Rahul_P_R, there are multiple ways to extract features from text. Some of them are as follows:

  1. Bag of Words: In this approach, occurrence or frequency of each word is used as a feature.
  2. TF-IDF: It takes into account not just the occurrence of a word in a single document but in the entire corpus.
  3. Word Embeddings: It involves transforming every unique word in the corpus to an N-dimensional vector representation.
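The first two approaches can be sketched with scikit-learn as below. The toy `corpus` is hypothetical. Word embeddings are left out because they usually need an additional library or pretrained vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; each string plays the role of one tweet.
corpus = ["love this phone", "hate this phone", "love love love"]

# Bag of Words: each column counts how often a word occurs in a tweet.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)

# TF-IDF: counts are down-weighted for words that appear in many tweets.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# vocabulary_ maps each word to its column index in the matrix.
love_col = bow.vocabulary_["love"]
love_count_in_third_tweet = X_bow.toarray()[2][love_col]
```

Either matrix can then be passed to a classifier in place of hand-built variables.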

To explore more, do check these articles