How to read a bunch of .docx files in Python?

python
docx
python-docx

#1

Hello everyone,
I am working on parsing a bunch of .docx files in python. I have saved all the .docx files in a folder named .docx_files. I want to make a list docx_list whose each element is one docx file saved in my .docx_files folder. To do this, I am using python-docx library. I created a list which contains paths of all the .docx files using os.walk() method. The list looks as given below
[‘C:/Users/HP/Desktop/.docx files\AbhijeetPawar[3_7] (1).docx’,
‘C:/Users/HP/Desktop/.docx files\AbhijeetPolkamwar[0_0].docx’,
‘C:/Users/HP/Desktop/.docx files\AbhijeetRikame[6_0].docx’]
I passed this list to Document() method to get the required list. The code looks as below

for path in document_list:
resume_list = Document(path)

Now, when I printed the list by using the following code

for para in resume_list.paragraphs:
print(para.text)

I can see only the document corresponding to the last path in my list. Can anyone help me find out what is going wrong? Also, suggest how to fix it.


#2

import subprocess
from os import listdir
from os.path import isfile, join
from docx import Document

def file_operation():
mypath = ‘C:/Users/HP/Desktop/’ #Path to folder/Directory
#mypath = '/home/catana/Downloads/doc/'
list_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for k in range(len(list_files)):
# list_files[k] is counting path of file e.g. ‘C:/Users/HP/Desktop/.docx files\AbhijeetPawar3_7.docx’,
document = Document(list_files[k])
print document.paragraphs


#3

Thanks for the answer!
It would be more helpful if you would add comments about what each code block is meant to perform.