XGBoost script for classifying text

r
machine_learning

#1

I am learning R so that I can create a machine learning classification script that classifies a dataset of movie reviews by their sentiment scores, either 1 or 0 for positive or negative. I believe my script is missing two pieces. First, I need the proper syntax for the test data partition for XGBoost. Second, I want to create a confusion matrix to evaluate performance. Could someone please tell me what I am missing in my code? Thanks.

====================================================================
library(text2vec)
library(xgboost)
library(pdp)
setwd('C:/rscripts/movies')

imdb = read.csv('movies.csv', stringsAsFactors = FALSE)

# Create the document term matrix (bag of words) using the movie_review data
# frame provided in the text2vec package (sentiment analysis problem)
# data("movie_review")

# Tokenize the movie reviews and create a vocabulary of tokens including
# document counts
vocab <- create_vocabulary(itoken(imdb$text,
                                  preprocessor = tolower,
                                  tokenizer = word_tokenizer))

# Build a document-term matrix using the tokenized review text. This returns
# a dgCMatrix object
dtm_train <- create_dtm(itoken(imdb$text,
                               preprocessor = tolower,
                               tokenizer = word_tokenizer),
                        vocab_vectorizer(vocab))

# Turn the DTM into an XGB matrix using the sentiment labels that are to be
# learned
train_matrix <- xgb.DMatrix(dtm_train, label = imdb$class)

# xgboost model building
xgb_params = list(
  objective = "binary:logistic",
  eta = 0.01,
  max.depth = 5,
  eval_metric = "auc")

xgb_fit <- xgboost(data = train_matrix, params = xgb_params, nrounds = 10)

set.seed(1)
# train_matrix already carries the labels, so no separate label argument is needed
cv <- xgb.cv(data = train_matrix, params = xgb_params, nfold = 5,
             nrounds = 60)

library(caret)
library(Matrix)

# Create our prediction probabilities
pred <- predict(xgb_fit, dtm_train)

# Set our cutoff threshold
pred.resp <- ifelse(pred >= 0.86, 1, 0)

# Create the confusion matrix
confusionMatrix(factor(pred.resp), factor(imdb$class), positive = "1")

Thanks for any help I can get.


#2

You can use the createDataPartition() function from the caret package to split
dtm_train into two datasets: one for model training and the other for validation.
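
For example, a rough sketch of that split (assuming the sentiment labels are in imdb$class, as in your script above):

library(caret)

# Hypothetical sketch: hold out 20% of the DTM rows for validation, stratified by class
set.seed(1)
id_train <- createDataPartition(factor(imdb$class), p = 0.80, list = FALSE)[, 1]

reviews.train <- dtm_train[id_train, ]
reviews.test  <- dtm_train[-id_train, ]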

Here, confusionMatrix(factor(pred.resp), factor(imdb$class)) should work, since confusionMatrix() needs its arguments as factors.
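
For the held-out evaluation, a minimal sketch might look like this (assuming the model is refit on the training partition, and using a 0.5 cutoff purely as an example):

# Hypothetical sketch: refit on the training rows only, then evaluate on the held-out rows
train_matrix <- xgb.DMatrix(reviews.train, label = imdb$class[id_train])
xgb_fit      <- xgboost(data = train_matrix, params = xgb_params, nrounds = 10)

pred      <- predict(xgb_fit, reviews.test)
pred.resp <- ifelse(pred >= 0.5, 1, 0)

confusionMatrix(factor(pred.resp), factor(imdb$class[-id_train]), positive = "1")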

Thanks!


#3

Thank you for the reply. I had some luck splitting my document-term matrix. Now I need to convert my training and test partitions into xgb matrices so that I can train my XGBoost model. Can you help me with the code for that? Here is what I have so far; everything works except for the last line of code:

====================================================================
library(text2vec)
library(xgboost)
library(pdp)

setwd('C:/rscripts/movies')

imdb = read.csv('movies.csv', stringsAsFactors = FALSE)

# Create the document term matrix (bag of words) using the movie_review data frame provided
# in the text2vec package (sentiment analysis problem)
# data("movie_review")

# Tokenize the movie reviews and create a vocabulary of tokens including document counts
vocab <- create_vocabulary(itoken(imdb$text,
                                  preprocessor = tolower,
                                  tokenizer = word_tokenizer))

# Build a document-term matrix using the tokenized review text. This returns a dgCMatrix object
dtm_train <- create_dtm(itoken(imdb$text,
                               preprocessor = tolower,
                               tokenizer = word_tokenizer),
                        vocab_vectorizer(vocab))

# Hold out 20% of the document-term matrix rows for testing
id_train <- sample(nrow(dtm_train), nrow(dtm_train) * 0.80)
reviews.train = dtm_train[id_train, ]
reviews.test  = dtm_train[-id_train, ]


#4

Hi @jdude48

The create_dtm() function returns a sparse matrix (a dgCMatrix). I suggest converting dtm_train to a data frame and then splitting it into reviews.train and reviews.test.
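
For example, a rough sketch of that approach, continuing from your code above (the dense conversion and the 0.80 split fraction are just illustrative, and the labels are assumed to be in imdb$class as in your first post):

# Hypothetical sketch: densify the DTM, split it, then build xgb.DMatrix objects
dtm_df <- as.data.frame(as.matrix(dtm_train))   # note: the dense conversion can use a lot of memory

set.seed(1)
id_train <- sample(nrow(dtm_df), floor(nrow(dtm_df) * 0.80))

reviews.train <- dtm_df[id_train, ]
reviews.test  <- dtm_df[-id_train, ]

# xgb.DMatrix() expects a matrix, so convert the data frames back before training
train_matrix <- xgb.DMatrix(as.matrix(reviews.train), label = imdb$class[id_train])
test_matrix  <- xgb.DMatrix(as.matrix(reviews.test),  label = imdb$class[-id_train])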