How to read PDF, Excel and Word documents using Python



I am working on indexing and searching documents. I have saved all the .docx, .pdf, .xls and .ppt files in a folder named datasets. I want to extract the text from all of these documents for indexing, and to clean it using basic NLTK tasks. I tried textract for this, but it does not work. Could you help me find a solution?

So far I only list the files in the directory using the os module, as below:

import os

root = r"D:\Harshal\search"
path = os.path.join(root, "datasets")

# Write a numbered list of every file found under the datasets folder
with open("filenames1.txt", "w", encoding="utf-8") as f:
    i = 0
    for dirpath, subdirs, files in os.walk(path):
        for name in files:
            i = i + 1
            f.write(str(i) + "," + name + "\n")
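
Since each file type needs a different reader, one possible next step is to group the files by extension before extracting anything. This is only a sketch: the function name group_by_extension is made up for illustration, and it assumes the extension reliably indicates the file type.

```python
import os
from collections import defaultdict


def group_by_extension(folder):
    """Map each lowercase file extension to the full paths that carry it."""
    groups = defaultdict(list)
    for dirpath, _subdirs, files in os.walk(folder):
        for name in files:
            # splitext returns (stem, extension); the extension keeps its dot
            ext = os.path.splitext(name)[1].lower()
            groups[ext].append(os.path.join(dirpath, name))
    return dict(groups)
```

For example, group_by_extension(path).get(".pdf", []) would give the list of PDF files to hand to a PDF reader.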




To read data from Excel, CSV, the clipboard, SQL and similar sources, we can use the pandas library, which provides functions for working with all of these.
import pandas as pd
Example: pd.read_excel("file_path", sheet_name="sheetname")

You can refer to the official pandas website for details.
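
For indexing you probably want plain text rather than a DataFrame, so here is a rough sketch of how a whole workbook could be flattened. The helper names rows_to_text and excel_to_text are invented for this example. It assumes pandas is installed, plus openpyxl for .xlsx files or xlrd for old .xls files.

```python
def rows_to_text(rows):
    """Join the non-empty cells of each row into one line of text."""
    lines = []
    for row in rows:
        cells = [str(cell).strip() for cell in row if str(cell).strip()]
        if cells:
            lines.append(" ".join(cells))
    return "\n".join(lines)


def excel_to_text(file_path):
    """Return the text of every sheet in a workbook as one string."""
    import pandas as pd  # third-party; imported here so the helper above works without it

    # sheet_name=None loads all sheets into a dict of {sheet name: DataFrame}
    sheets = pd.read_excel(file_path, sheet_name=None, header=None, dtype=str)
    parts = []
    for _sheet_name, frame in sheets.items():
        # Empty cells come back as NaN; replace them so they are skipped
        parts.append(rows_to_text(frame.fillna("").values.tolist()))
    return "\n".join(part for part in parts if part)
```

The result of excel_to_text("file_path") is a single string that can then go through the NLTK cleaning step.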

To read PDF files you need the PyPDF2 library.
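
A minimal sketch of text extraction with it could look like this. It assumes PyPDF2 2.x or later, where the reader class is called PdfReader (older releases used PdfFileReader). The function names are made up for the example. Scanned PDFs that contain only images will give little or no text this way.

```python
def join_pages(page_texts):
    """Combine per-page text, skipping pages that yielded nothing."""
    return "\n".join(t.strip() for t in page_texts if t and t.strip())


def pdf_to_text(file_path):
    """Extract the text of every page in a PDF as one string."""
    from PyPDF2 import PdfReader  # third-party; assumes PyPDF2 2.x or later

    reader = PdfReader(file_path)
    # extract_text() works page by page; image-only pages give empty strings
    return join_pages(page.extract_text() for page in reader.pages)
```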

You can look at the page below to see how to use it.
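
The question also mentions .docx files. One option that needs no extra library is to treat the .docx as what it is: a zip archive containing word/document.xml. The sketch below reads only the main body paragraphs (including those in tables) and ignores headers, footers and footnotes. The function name docx_to_text is just an illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(file_path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

This does not work for the old binary .doc format, which is not a zip archive.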