Need help preparing training data for machine learning



I was given test text data only. I want to apply the Multinomial Naive Bayes algorithm, so I need training data for it. Can anyone help me prepare training data?


You can partition the data into training and test sets and use them for training and validation, respectively. This assumes the data you were given is labeled.


Thanks for your reply. If we partition the data, will the accuracy be maintained?


Assume the imported data is stored in a data frame variable named dataset:

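library(caTools)  # provides sample.split()
set.seed(123)     # fix the seed so the split is reproducible
# Splitting the dataset into the Training set and Test set
split = sample.split(dataset$DependentVariable, SplitRatio = 0.8)  # TRUE marks training rows
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)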

Note that you need to call set.seed() with the same number before splitting if you want the same training and test samples every time you re-run the code. In this case I’m using 123, so the same training and test data will be returned each time. If I pass 124 instead, a different random split will be generated.
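To make the effect of the seed concrete, here is a minimal sketch. It uses base R's sample() as a stand-in for sample.split() so that it runs without extra packages (unlike sample.split(), it does not preserve class ratios). The labels vector and the split_indices() helper are made up for illustration.

```r
# Hypothetical toy labels; base R sample() stands in for caTools::sample.split()
labels <- rep(c("pos", "neg"), each = 10)

# Illustrative helper: returns the row indices chosen for the training set
split_indices <- function(seed, n, ratio = 0.8) {
  set.seed(seed)                             # fix the random number generator state
  sort(sample(n, size = floor(ratio * n)))   # draw 80% of the row indices
}

first_run  <- split_indices(123, length(labels))
second_run <- split_indices(123, length(labels))
other_seed <- split_indices(124, length(labels))

identical(first_run, second_run)  # same seed, so the same training rows
identical(first_run, other_seed)  # different seed, so very likely different rows
```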


The purpose of partitioning is to have a training set and a test set, so that we can train the model on the training set and validate its accuracy on the test set (before running the model against unknown data).
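As a rough illustration of that train-then-validate loop, here is a small multinomial naive Bayes written in base R with Laplace smoothing. It is a sketch and not a replacement for a tested package. The toy corpus, the labels, and the helper names (tokenize, count_matrix, train_nb, predict_nb) are all assumptions made for the example.

```r
# Hypothetical toy corpus; replace with your own labeled text.
train_text  <- c("great movie loved it", "loved the acting", "great fun",
                 "terrible movie hated it", "hated the plot", "boring and terrible")
train_label <- c("pos", "pos", "pos", "neg", "neg", "neg")
test_text   <- c("loved it great", "boring plot hated it")
test_label  <- c("pos", "neg")

# Lowercase, split on non-letters, drop empty tokens
tokenize <- function(x) {
  lapply(strsplit(tolower(x), "[^a-z]+"), function(tok) tok[nzchar(tok)])
}

# The vocabulary is built from the training set only
vocab <- sort(unique(unlist(tokenize(train_text))))

# One row per document, one column per vocabulary word; unseen words are dropped
count_matrix <- function(texts, vocab) {
  counts <- lapply(tokenize(texts), function(tok) {
    as.integer(table(factor(tok, levels = vocab)))
  })
  matrix(unlist(counts), nrow = length(texts), byrow = TRUE,
         dimnames = list(NULL, vocab))
}

train_nb <- function(X, y, alpha = 1) {
  classes <- sort(unique(y))
  # log P(class), estimated from class frequencies
  log_prior <- log(as.numeric(table(factor(y, levels = classes))) / length(y))
  # log P(word | class) with add-alpha (Laplace) smoothing; classes in rows
  log_lik <- t(sapply(classes, function(cl) {
    word_counts <- colSums(X[y == cl, , drop = FALSE])
    log((word_counts + alpha) / (sum(word_counts) + alpha * ncol(X)))
  }))
  list(classes = classes, log_prior = log_prior, log_lik = log_lik)
}

predict_nb <- function(model, X) {
  # Score per class = log prior + sum over words of count * log likelihood
  scores <- X %*% t(model$log_lik)
  scores <- sweep(scores, 2, model$log_prior, "+")
  model$classes[max.col(scores, ties.method = "first")]
}

model     <- train_nb(count_matrix(train_text, vocab), train_label)
predicted <- predict_nb(model, count_matrix(test_text, vocab))
accuracy  <- mean(predicted == test_label)  # share of test documents classified correctly
```

The test set is touched only in the last three lines, which is the point of the partition: the accuracy is measured on documents the model did not see during training.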

If you don’t get the desired accuracy when validating with the test set, you have to prepare the data again by applying feature engineering techniques before retraining. This process is repeated until validation on the test set gives the desired result.

Finally, apply the same transformations to the unknown data before predicting, so that the model sees input in the same form it was trained on.
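One way to keep the transformations identical is to put them in a single function and to freeze the vocabulary after training. The sketch below assumes a simple bag-of-words representation. The preprocess() and to_counts() helpers and the example sentences are hypothetical.

```r
# Hypothetical preprocessing step, defined once and reused for every data set
preprocess <- function(texts) {
  texts  <- tolower(texts)
  texts  <- gsub("[^a-z ]", " ", texts)   # replace punctuation and digits with spaces
  tokens <- strsplit(texts, " +")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

# Turn texts into word counts over a fixed vocabulary
to_counts <- function(texts, vocab) {
  counts <- lapply(preprocess(texts), function(tok) {
    as.integer(table(factor(tok, levels = vocab)))  # words outside vocab are dropped
  })
  matrix(unlist(counts), nrow = length(texts), byrow = TRUE,
         dimnames = list(NULL, vocab))
}

train_text   <- c("Great movie, loved it!", "Terrible plot. Hated it.")
unknown_text <- c("LOVED the soundtrack!!")

# The vocabulary is fixed from the training data and never rebuilt afterwards
vocab <- sort(unique(unlist(preprocess(train_text))))

train_counts   <- to_counts(train_text, vocab)
unknown_counts <- to_counts(unknown_text, vocab)

# Same columns in the same order, so a model fitted on train_counts can score unknown_counts
identical(colnames(train_counts), colnames(unknown_counts))
```

Words that appear only in the unknown data ("the" and "soundtrack" here) are ignored, because the model has no parameters for them.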



Can you clarify what your problem statement actually is?

Is it something like sentiment analysis, where you have to predict the sentiment of a sentence, or a sequence generation problem, such as generating new text? Without this information, how would you be able to formulate the data when you don't even know which problem you are trying to solve?