Text mining query


#1

Hi all, I have a query.
I am downloading a zip file of text from a source and importing it into Excel through the "extract from text" feature.
It is about 200 MB of text data and I only need two columns, but I am not able to get just those. The query may be a simple one, but I need clarity. Also, str_trim is not available in RStudio; instead, an lstrtrim function is present. How do I use that?
Thank you in advance.


#2

@Rudra11 Could you support your question with code/screenshots for better understanding?

Sanad :slight_smile:


#3

I can't quite get your question. Do you want to import the file into RStudio or into Excel?


#4

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input> in <module>()
      2     ('tfidf', TfidfTransformer()),
      3     ('clf', MNB())])
----> 4 text_clf = text_clf.fit(X_train, y_train)

C:\ProgramData\Anaconda2\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267         """
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

C:\ProgramData\Anaconda2\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
    837
    838         vocabulary, X = self._count_vocab(raw_documents,
--> 839                                           self.fixed_vocabulary_)
    840
    841         if self.binary:

C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    760         for doc in raw_documents:
    761             feature_counter = {}
--> 762             for feature in analyze(doc):
    763                 try:
    764                     feature_idx = vocabulary[feature]

C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    239
    240             return lambda doc: self._word_ngrams(
--> 241                 tokenize(preprocess(self.decode(doc))), stop_words)
    242
    243         else:

C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py in decode(self, doc)
    116
    117         if isinstance(doc, bytes):
--> 118             doc = doc.decode(self.encoding, self.decode_error)
    119
    120         if doc is np.nan:

C:\ProgramData\Anaconda2\lib\encodings\utf_8.pyc in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 32752: unexpected end of data


#5

Convert your data to ASCII instead of Unicode.

import unicodedata
unicodedata.normalize('NFKD', x).encode('ascii', 'ignore')  # where x is your variable holding a unicode string
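
For example, applied to all the training documents before fitting the pipeline from the traceback above (a rough sketch assuming Python 2 and that each entry of X_train is already a unicode string; X_train, y_train and text_clf are the names used in the traceback):

import unicodedata

# strip accents and drop any character that has no ASCII equivalent
X_train = [unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore')
           for doc in X_train]
text_clf = text_clf.fit(X_train, y_train)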

#6

Hi, thanks a lot for your response. I am getting the following error after implementing the code you suggested. Please throw some light on this.
Thanks in advance.
TypeError: normalize() argument 2 must be unicode, not str
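
normalize() only accepts unicode input, so this error usually means x is still a Python 2 byte string (str). A minimal sketch of one way around it, assuming the raw text is UTF-8 encoded (the decode step and its 'ignore' error handling are assumptions, not part of the suggestion above):

import unicodedata

if isinstance(x, str):               # still a Python 2 byte string
    x = x.decode('utf-8', 'ignore')  # decode to unicode first, skipping malformed bytes
x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore')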