TypeError: doc2bow expects an array of unicode tokens on input, not a single string

lda
python
gensim

#1

Hi, I was trying out a guide to topic modelling in Python and came across this blog post from Analytics Vidhya: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

I ran into the following error and am not sure how to interpret the message:

Traceback (most recent call last):
  File "Topic.py", line 15, in <module>
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in text]
  File "C:\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 233, in doc2bow
    raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
TypeError: doc2bow expects an array of unicode tokens on input, not a single string


#2

As the error message says, doc2bow expects a list of tokens, but each doc you passed is a single string. Try:

doc_term_matrix = [dictionary.doc2bow(doc.split()) for doc in text]

This splits each document string on whitespace, so doc2bow receives a list of words. You can also use a different tokenizer, depending on the data you are working with.
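For instance, doc.split() leaves punctuation attached to words ("consume." stays as one token), so a small regex tokenizer may suit you better. Here is a rough sketch, not the blog's code: the tokenize helper and the two sample sentences are made up for illustration, and I am assuming your dictionary should be built from the same token lists. The gensim import is guarded only so that the tokenizer part runs even without gensim installed.

```python
import re

def tokenize(doc):
    # Hypothetical helper: lowercase the text and keep runs of letters/digits,
    # which drops punctuation that a plain doc.split() would leave attached.
    return re.findall(r"[a-z0-9]+", doc.lower())

# Made-up sample documents standing in for your `text` list of strings.
text = [
    "Sugar is bad to consume.",
    "My father spends a lot of time driving my sister around.",
]

# Each document becomes a list of tokens, which is what doc2bow expects.
tokenized = [tokenize(doc) for doc in text]

try:
    from gensim import corpora
except ImportError:
    corpora = None  # gensim not installed; the tokenizer above still works

if corpora is not None:
    # Build the dictionary from token lists, then convert each document
    # to a bag-of-words list of (token_id, count) pairs.
    dictionary = corpora.Dictionary(tokenized)
    doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in tokenized]
```

If you already have a dictionary, the only line you need to change is the doc2bow one: pass it tokenize(doc) (or doc.split()) instead of doc.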