How to best extract and prepare data from glossy PDF presentations in order to feed it to GPT-4
I aim to combine classical text analysis with quantitative data science using GPT-4. My professional background relies mainly on my ability to analyze textual and subjective information, so my knowledge of quantitative data analysis is rather theoretical. Please forgive my amateur mistakes and mistaken assumptions.
Let me explain my circumstances and what I am trying to do. I want to combine a textual analysis of a company with insights from years of its financial reports, using GPT-4 to process the combined input. I am studying the risks of a given company, so I want to relate the textual information to the available quantitative data, mainly the financial numbers for the last 10 years, published yearly in glossy PDF files. If you are curious what such a glossy PDF looks like, this random report is very similar to the ones I am talking about: Rapport de gestion des Transports publics fribourgeois | TPF. As you can see, there are many reports there; pick any one and you will see they are all alike.

What I want to do: extract all the small written texts in the many reports, and also all the numbers. Once everything is logically and cleanly prepared, I want to combine the textual and the quantitative information so that GPT-4 can understand all of it, exactly as a human would understand it by looking at the different reports. For instance, GPT-4 should understand the temporal dimension attached to each piece of information: these are yearly reports, always structured a bit differently, with changing perspectives from year to year, but always focused on the same company.
I found that GPT-4 is pretty good at processing transcripts of interviews that I conducted myself and prepared carefully as well-formatted Word (DOCX) files. On the other hand, the AI is very often confused when fed unprepared files such as HTML downloads of websites, XML exports, and especially PDF files. The frustrating thing is that the AI does not tell you straightforwardly where the problems are, or whether there are problems at all. In one long session, when I was first trying out new approaches, GPT-4 even pretended to understand my PDF inputs and created beautiful data-analysis visualizations as JPGs, only to tell me later that they were based on invented data. I was in disbelief…
I thus understood that I have to carefully prepare all the data I want to feed GPT-4.
So I started to create two sets of manually prepared files. First, I used Chrome to open the PDF files and copy/pasted the text parts I need into a Word DOCX file. Sometimes this works well; other times, because of the poor structure of the PDF file, I almost have to retype the words myself. It takes a lot of time, many, many hours. And when it is done, it is only the first step.
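If the PDFs have a real text layer (i.e. they are not pure scans), a scripted extraction might spare you most of the manual copy-typing. A minimal sketch, assuming the third-party pdfplumber library; the file name is hypothetical, and the tidy() helper is my own guess at the kind of cleanup such reports need:

```python
import os
import re

def tidy(text: str) -> str:
    """Re-join words hyphenated across line breaks and collapse extra spaces."""
    text = re.sub(r"-\n(\w)", r"\1", text)   # "finan-\ncier" -> "financier"
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

def extract_report_text(path: str) -> str:
    """Pull the text layer out of one yearly report (needs `pip install pdfplumber`)."""
    import pdfplumber  # third-party; only works if the PDF is not a pure scan
    with pdfplumber.open(path) as pdf:
        pages = [page.extract_text() or "" for page in pdf.pages]
    return tidy("\n".join(pages))

if os.path.exists("rapport_2018.pdf"):  # hypothetical file name
    print(extract_report_text("rapport_2018.pdf"))
```

Even so, you would still need to proofread the output; layout-heavy "glossy" pages often come out in the wrong reading order.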
So that GPT-4 does not confuse the content, I have to carefully format it using heading levels 1, 2 and 3, in order to organize the text in a logical, thematic and temporal sense. For every yearly report I use a title marked as a main title (Heading 1) in MS Word, so the temporal aspect is clear. For every main and secondary textual division within the yearly PDF, I use heading levels 2 and 3. This formatting is very important: if you do not do it, GPT-4 confuses sections, mixes things up, etc. I am just not sure how much thought should go into the wording of the titles. Should the year a section belongs to always be written out, for instance? Are three heading levels too many? What is your experience with preparing textual information? Thank you for feedback on this assumption.
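One hedged thought on the title-wording question: the same hierarchy can also be expressed as plain Markdown-style headings, and repeating the year in every lower-level heading keeps each chunk self-dating even if the text is later split into pieces. A sketch of that convention, with section names taken from the 2018 report as examples:

```python
# Sketch: encode the yearly-report hierarchy as Markdown headings,
# with the year repeated at every level so no chunk loses its date.

def build_report(year: int, sections: dict[str, list[str]]) -> str:
    lines = [f"# Rapport de gestion {year}"]       # level 1: the yearly report
    for section, subsections in sections.items():
        lines.append(f"## {section} ({year})")     # level 2: main division
        for sub in subsections:
            lines.append(f"### {sub} ({year})")    # level 3: secondary division
            lines.append("… text copied from the PDF …")
    return "\n\n".join(lines)

print(build_report(2018, {"Bilan": ["ACTIF", "PASSIF"]}))
```

Whether Word headings or Markdown headings work better for you is worth testing; the point of the sketch is only the convention of carrying the year everywhere.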
Second: the numbers. I of course also need the figures from the glossy PDFs, the financial numbers the company has to publish every year. But, as you can imagine from what I wrote above, when you feed GPT-4 the many PDF company reports from different years, it is completely unable to make any sense of the input. Just chaos and hallucinations.
After asking GPT-4 how best to proceed, I was advised to use Smallpdf.com to convert the PDF files to Excel. Apparently that is the best there is, and not even Adobe itself has a better solution. Anyhow, the results were horrible. In the output files I often got two rows merged together, always differently from file to file. Sometimes important parts are rendered so badly that you have to retype them completely by hand, because there is nothing to even build on in the first place.
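For the tables specifically, a scriptable extractor might be more reproducible than an online converter, because you can rerun and adjust it per file. A minimal sketch, again assuming pdfplumber and a text-layer PDF; parse_amount() is my own guess at the number formats in such reports (thousands separated by spaces or apostrophes, comma decimals):

```python
import re

def parse_amount(cell: str):
    """Parse numbers as printed in Swiss/French reports: '1 234,56' or "12'345".
    Returns None for label/header cells that are not numbers."""
    cleaned = re.sub(r"[ \u00a0'’]", "", cell).replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

def extract_raw_tables(path: str):
    """Pull raw tables from one PDF (needs `pip install pdfplumber`);
    each table comes back as a list of rows, each row a list of strings."""
    import pdfplumber  # third-party
    with pdfplumber.open(path) as pdf:
        return [t for page in pdf.pages for t in page.extract_tables()]
```

The merged-rows problem would likely still appear here too, but at least the failures are deterministic and can be fixed once per report layout.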
After many hours of manually cleaning output Excel files of varying quality, I face a further problem. To the human eye, the numbers as organized in the report make sense: in the 2018 version (see the link above), there is a section “Bilan” (page 18), then a section “COMPTE DE RESULTAT DE L’EXERCICE 2018” (page 20), etc. So for GPT-4, I thought I would have to add two more columns: one for the main topics, like “Bilan” or “COMPTE DE RESULTAT DE L’EXERCICE 2018”, and a second one for the subtopics, like “ACTIF” and “PASSIF” in the rows belonging to “Bilan”.
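The two extra columns you describe amount to what data-science people call a "long" or "tidy" table: one row per figure, with the context repeated in columns. A minimal sketch with pandas; the positions and amounts below are invented placeholders, not figures from the report:

```python
import pandas as pd

# Hypothetical rows illustrating the two extra context columns for 2018:
rows = [
    ("Bilan",              "ACTIF",  "Liquidités", 2018, 1234.5),
    ("Bilan",              "ACTIF",  "Débiteurs",  2018,  987.0),
    ("Bilan",              "PASSIF", "Capital",    2018,  500.0),
    ("Compte de résultat", None,     "Produits",   2018, 4321.0),
]
df = pd.DataFrame(rows, columns=["section", "subsection", "position", "year", "amount"])
print(df)
```

With the year carried as a column as well, every single row is self-describing, which is exactly the property you want when handing the data to GPT-4.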
Finally: when I have finished cleaning and reformatting the numerical data, I would like to merge all the files into one single table, in one single Excel file. But the people who created the financial statements did not use exactly the same categories from year to year. Sometimes they use new positions, because the company did something in that year which called for them. For instance, the company may have lent some money to a partner and then received it back; in the following year, that position no longer exists. So even when I work very carefully, I end up with 10 slightly different files, with slightly differing financial categories and definitions. I can already see that merging them all into one will take a lot of time: I will have to go through every line, check whether both data sets use it or not, and if not, decide how to fill in some other value.
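One hedged suggestion here: if each year is kept as a long table (one row per figure, with a year column), merging stops being a line-by-line reconciliation problem. You simply stack the rows, and positions that exist only in some years come out as blanks automatically. A sketch with pandas and invented numbers; "Prêt au partenaire" stands for a position that exists in only one year, like your loan-to-a-partner example:

```python
import pandas as pd

y2018 = pd.DataFrame({"year": 2018,
                      "position": ["Produits", "Prêt au partenaire"],
                      "amount": [100.0, 20.0]})
y2019 = pd.DataFrame({"year": 2019,
                      "position": ["Produits"],
                      "amount": [110.0]})

# Stacking long tables needs no category matching at all:
all_years = pd.concat([y2018, y2019], ignore_index=True)

# One wide overview, one column per year; missing positions become NaN:
wide = all_years.pivot(index="position", columns="year", values="amount")
print(wide)
```

In the wide view, the one-off position is simply NaN in 2019, which you can leave blank or fill explicitly; there is no need to pad every file to a common category list by hand.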
As a newcomer to data science, I am not sure whether this even makes sense, or whether I am doing it all wrong. I hope my description of what I am trying to do, and how I go about it, was clear. Thank you in advance for your feedback and tips. I can imagine many people face similar challenges.