How can I extract variably formatted executive compensation tables from long PDFs using fine-tuning?

I have many long (100+ page) PDFs that each contain 1-2 specific tables on executive and non-executive director compensation/remuneration. However, the page location and table format differ from document to document, which makes the tables hard to extract with standard techniques.

I tried using a Python PDF-to-text package followed by the OpenAI API to locate the table, but the results were not meaningful. I'm considering fine-tuning a model so that it better understands what I'm looking for. Does anyone have suggestions or alternative approaches for this scenario?
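To make the "locate the table" step concrete, here is a minimal sketch of one way to shortlist candidate pages before sending anything to a model. It is only an illustration, not the code I ran: it assumes each page's text has already been extracted into a list of strings by whatever PDF-to-text package is in use, and the keyword list and the names `score_page` and `rank_pages` are hypothetical, not part of any library.

```python
import re
from typing import List

# Hypothetical keyword list (an assumption); tune it for your own documents.
KEYWORDS = (
    "remuneration", "compensation", "salary",
    "director", "executive", "bonus", "superannuation",
)

# Matches numbers such as 350,000 or 7.5
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def score_page(text: str) -> int:
    """Score a page by keyword hits multiplied by the count of numeric tokens.

    Compensation tables tend to be both keyword-rich and number-heavy,
    so a page needs both to get a non-zero score.
    """
    lowered = text.lower()
    keyword_hits = sum(lowered.count(k) for k in KEYWORDS)
    numeric_tokens = len(NUMBER_RE.findall(text))
    return keyword_hits * numeric_tokens


def rank_pages(pages: List[str], top_n: int = 3) -> List[int]:
    """Return 0-based indices of the top_n highest-scoring pages."""
    scored = [(score_page(text), idx) for idx, text in enumerate(pages)]
    scored = [(s, idx) for s, idx in scored if s > 0]
    # Highest score first; ties broken by page order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_n]]


if __name__ == "__main__":
    sample_pages = [
        "Chairman's letter. We had a good year.",
        "Remuneration report\nExecutive salary 350,000 bonus 50,000\n"
        "Director fees 80,000 superannuation 7,600",
        "Notes to accounts 1,234 5,678 9,012",
    ]
    print(rank_pages(sample_pages))
```

Only the shortlisted pages would then be passed to the OpenAI API (or to a table extractor), which keeps the prompt small for 100+ page documents. Whether this simple heuristic is good enough for these reports is untested.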

Example PDF: https://cdn-api.markitdigital.com/apiman-gateway/ASX/asx-research/1.0/file/2924-02701314-2A1468465?access_token=83ff96335c2d45a094df02a206a39ff4

The two tables of interest from the example are shown below:

Here is my current code:

PYTHON
# Your Python code here

I have new code and have posted it in reply to this.
