I am working on indexing and searching documents. I have saved all my .docx, .pdf, .xls and .ppt files in a folder named datasets. I want to extract the text from every document for indexing, and also to clean it with basic NLTK tasks. I tried textract for this, but it does not work. Could you help me find a solution?
So far I only read the file names from the directory, as below:
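import os

root = r"D:\Harshal\search"
path = os.path.join(root, "datasets")
# Write one "index,filename" line per file found under the datasets folder.
# "file_list.csv" is a placeholder name; the original snippet did not show how f was opened.
with open("file_list.csv", "w", encoding="utf-8") as f:
    i = 0
    for dirpath, subdirs, files in os.walk(path):
        for name in files:
            i = i + 1
            f.write(str(i) + "," + name + "\n")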
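
One possible direction without textract is to pick a reader per file extension. What follows is a minimal sketch, not a tested solution. It assumes the third-party packages python-docx, pypdf, openpyxl and python-pptx are installed, and that the files are in the newer .docx, .xlsx and .pptx formats; legacy .xls and .ppt files are not handled by these readers and would need converting first or a different library. The function names and the EXTRACTORS table are hypothetical, introduced only for illustration.

```python
import os


def extract_docx(path):
    # Assumes python-docx is installed; reads body paragraphs only.
    from docx import Document
    return "\n".join(p.text for p in Document(path).paragraphs)


def extract_pdf(path):
    # Assumes pypdf is installed; extract_text() may return nothing for scanned pages.
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_xlsx(path):
    # Assumes openpyxl is installed; joins the non-empty cells of each row.
    from openpyxl import load_workbook
    workbook = load_workbook(path, read_only=True, data_only=True)
    rows = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            rows.append(" ".join(str(cell) for cell in row if cell is not None))
    return "\n".join(rows)


def extract_pptx(path):
    # Assumes python-pptx is installed; collects text from shapes that hold a text frame.
    from pptx import Presentation
    texts = []
    for slide in Presentation(path).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                texts.append(shape.text_frame.text)
    return "\n".join(texts)


# Hypothetical mapping from file extension to the reader that handles it.
EXTRACTORS = {
    ".docx": extract_docx,
    ".pdf": extract_pdf,
    ".xlsx": extract_xlsx,
    ".pptx": extract_pptx,
}


def extract_all(folder, extractors=EXTRACTORS):
    """Walk folder and return {file path: extracted text} for supported files."""
    results = {}
    for dirpath, _subdirs, files in os.walk(folder):
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            extractor = extractors.get(extension)
            if extractor is None:
                continue  # unsupported extension, skip it
            full_path = os.path.join(dirpath, name)
            try:
                results[full_path] = extractor(full_path)
            except Exception as exc:
                # A missing package or a corrupt file should not stop the whole run.
                print("Could not read", full_path, "-", exc)
    return results


if __name__ == "__main__":
    texts = extract_all(r"D:\Harshal\search\datasets")
    for file_path, text in texts.items():
        print(file_path, len(text))
```

The returned dictionary can then be passed to the NLTK cleaning step and to the indexer, one document at a time. Keeping the readers in a table means another format can be supported by adding one entry.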