How to read PDF, Excel and Word documents using Python

machine_learning
data_science
python

#1

I am working on indexing and searching documents. I have saved all the .docx, .pdf, .xls and .ppt files in a folder named datasets. I want to extract the text from all of these documents for indexing, and to clean it using basic nltk tasks. I tried textract for this, but it does not work. Could you help me find a solution?

So far I only list the documents in the directory using os.walk, as below:

    import os

    root = r"D:\Harshal\search"
    path = os.path.join(root, "datasets")

    # Write a numbered list of every file name found under the datasets folder
    with open("filenames1.txt", "w", encoding="utf-8") as f:
        i = 0
        for dirpath, subdirs, files in os.walk(path):
            for name in files:
                i = i + 1
                f.write(str(i) + "," + name + "\n")


#2

Hi,

To read data from Excel, CSV, the clipboard, SQL and similar sources, we can use the pandas library, which provides a reader function for each of them.
import pandas as pd
Example: pd.read_excel("file_path", sheet_name="sheetname")

You can refer to the official documentation:
https://pandas.pydata.org/pandas-docs/stable/io.html
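
As a rough sketch of how this could feed an indexer, the snippet below reads one worksheet and flattens it to plain text. The helper names rows_to_text and excel_to_text are made up for illustration, and it assumes pandas is installed together with an Excel engine (xlrd for .xls files, openpyxl for .xlsx).

```python
def rows_to_text(rows):
    """Join the non-empty cells of each row into one line of text."""
    lines = []
    for row in rows:
        cells = [str(cell).strip() for cell in row if str(cell).strip()]
        if cells:
            lines.append(" ".join(cells))
    return "\n".join(lines)


def excel_to_text(file_path, sheet_name=0):
    """Read one worksheet with pandas and flatten it to plain text."""
    # Imported here so rows_to_text works even without pandas installed.
    # pandas also needs an Excel engine: xlrd for .xls, openpyxl for .xlsx.
    import pandas as pd

    # header=None keeps the first row as data; dtype=str avoids numeric coercion.
    frame = pd.read_excel(file_path, sheet_name=sheet_name, header=None, dtype=str)
    # Empty cells come back as NaN, so replace them before flattening.
    return rows_to_text(frame.fillna("").values.tolist())
```

The returned string can then be passed to whatever tokenizing or cleaning step you use before indexing.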

To read PDF files, you need the PyPDF2 library.

You can look at the page below to see how to use it.
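
For a quick idea of what PyPDF2 usage can look like, here is a minimal sketch that pulls the text out of every page. The function names join_pages and pdf_to_text are hypothetical, and it assumes PyPDF2 2.x or later, where PdfReader and extract_text() are available. Scanned PDFs without a text layer will come back empty.

```python
def join_pages(page_texts):
    """Combine per-page strings, skipping pages with no extractable text."""
    cleaned = [text.strip() for text in page_texts if text and text.strip()]
    return "\n".join(cleaned)


def pdf_to_text(file_path):
    """Extract text from every page of a PDF with PyPDF2."""
    # Assumes PyPDF2 2.x or later; older releases used
    # PdfFileReader and extractText() instead.
    from PyPDF2 import PdfReader

    reader = PdfReader(file_path)
    # reader.pages behaves like a list of page objects.
    return join_pages(page.extract_text() for page in reader.pages)
```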


#3

Thanks a lot.