
Why Does My Streamlit App Only Process the First Uploaded PDF When Extracting Financial Data?

Asked 1 month ago by CosmicAdventurer416

I'm building a Streamlit app designed to extract financial data from multiple uploaded annual report PDFs and compile the results into a single Excel file.

The intended workflow is as follows:

  1. Extract the company_code and FY from each file name (e.g., NASDAQ_AAPL_2023.pdf yields Apple Inc., FY 2023).
  2. Parse each PDF sequentially: extract text, split it into manageable chunks, generate embeddings, and retrieve financial data using a model.
  3. Append the extracted information from each PDF to a cumulative DataFrame (final_df).
  4. Generate a single Excel file containing the combined results for all processed PDFs.

However, the app only processes the first uploaded PDF, and the extracted data from subsequent PDFs is missing from the final Excel output.


Code Workflow

File Name Format:

Each uploaded file is named in the format: NASDAQ_<COMPANY_CODE>_<FY>.pdf. The company_code and FY are extracted directly from the file name using a regular expression.

Key Workflow Steps:

  1. Upload PDFs: Users can upload multiple PDF files via st.file_uploader.
  2. Process PDFs Sequentially:
    • Extract the company_code and FY from the file name.
    • Parse the PDF to extract financial data (e.g., business_segment, currency, revenue).
    • Append the extracted data to a cumulative DataFrame (final_df).
  3. Generate Excel File: Combine all extracted data into a single Excel file (a simplified sketch of this step appears after this list) with columns:
    • company_code, business_segment, currency, revenue, FY.
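
The Excel-generation code itself isn't shown in the snippets below; for reference, here is a simplified sketch of how that step could look (this assumes pandas writing through the openpyxl engine and Streamlit's st.download_button, neither of which appears in my snippets):

PYTHON
import io

import pandas as pd
import streamlit as st

def offer_excel_download(final_df: pd.DataFrame) -> None:
    # Write the cumulative DataFrame to an in-memory Excel workbook.
    buffer = io.BytesIO()
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        final_df.to_excel(writer, index=False, sheet_name="financials")
    # Hand the raw bytes to the browser as a download.
    st.download_button(
        label="Download combined results",
        data=buffer.getvalue(),
        file_name="combined_financial_data.xlsx",
        mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    )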

Problem

The app processes only the first PDF, appending its data to final_df; data from the remaining PDFs is ignored even though the code loops over all uploaded files.


Below are some critical sections of my code:

Extracting company_code and FY

PYTHON
import re

def extract_company_code_and_fy(file_name):
    # Pull the ticker and fiscal year out of names like NASDAQ_AAPL_2023.pdf
    match = re.match(r"NASDAQ_([A-Z]+)_(\d{4})", file_name)
    if match:
        return match.group(1), match.group(2)
    return None, None
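
For example, extract_company_code_and_fy("NASDAQ_AAPL_2023.pdf") returns ("AAPL", "2023"), while a name that doesn't match the pattern returns (None, None).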

Processing Each PDF

PYTHON
def process_pdf_and_extract_data(user_question, text_chunks):
    # Embed the chunks and build a temporary in-memory FAISS index.
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    # Retrieve the chunks most relevant to the question and run the chain.
    docs = vector_store.similarity_search(user_question)
    chain = get_conversational_chain()
    response = chain({"input_documents": docs}, return_only_outputs=True)
    del vector_store  # Clear memory
    gc.collect()
    return response["output_text"]

Main Function

PYTHON
def main():
    # Initialize the cumulative DataFrame once per session.
    if "final_df" not in st.session_state:
        st.session_state["final_df"] = pd.DataFrame(
            columns=["company_code", "business_segment", "currency", "revenue", "FY"]
        )
    pdf_docs = st.file_uploader("Upload your Annual Reports (PDF Files)", accept_multiple_files=True)
    if st.button("Submit & Process"):
        if pdf_docs:
            for pdf in pdf_docs:
                file_name = os.path.basename(pdf.name)
                company_code, fy = extract_company_code_and_fy(file_name)
                if not company_code or not fy:
                    st.error(f"Invalid file name format: {file_name}. Skipping...")
                    continue
                raw_text = get_pdf_text(pdf)
                text_chunks = get_text_chunks(raw_text)
                extracted_data = process_pdf_and_extract_data(
                    "Extract business segments and financial data.", text_chunks
                )
                if extracted_data:
                    # Parse the model's pipe-delimited table output, skipping
                    # header and separator rows.
                    rows = []
                    for line in extracted_data.split("\n"):
                        if "|" in line and not line.strip().startswith(
                            ("business_segment", "currency", "revenue", "---")
                        ):
                            parsed_row = line.split("|")[1:-1]
                            if len(parsed_row) == 3:
                                rows.append([field.strip() for field in parsed_row])
                    temp_df = pd.DataFrame(rows, columns=["business_segment", "currency", "revenue"])
                    temp_df["company_code"] = company_code
                    temp_df["FY"] = fy
                    temp_df = temp_df[["company_code", "business_segment", "currency", "revenue", "FY"]]
                    st.session_state["final_df"] = pd.concat(
                        [st.session_state["final_df"], temp_df], ignore_index=True
                    )
        else:
            st.error("Please upload at least one PDF file.")

  • The app correctly processes the first PDF and appends its data to final_df.
  • However, data from subsequent PDFs is not added.

Steps I've Taken:

  1. Ensured Persistence of final_df:
    • Initialized st.session_state["final_df"] only once.
  2. Debugged Data Extraction:
    • Verified data extraction for the first PDF and added logging to ensure the loop runs for all PDFs.
  3. Managed Memory Efficiently:
    • Cleared temporary variables and used gc.collect().

Question

Why does my Streamlit app only process the first PDF and skip the rest? How can I modify the code so that data from all PDFs is correctly appended to final_df and included in the final Excel file?

2 Answers

Answer by StackOverflowTychoAI Agent 1 month ago

Below is one common culprit and several things to check:


1. PDF File Pointer Exhaustion

Streamlit’s file uploader returns file-like objects (backed by BytesIO). When a function like get_pdf_text reads such a stream, it consumes the content and leaves the pointer at the end. So if a file’s stream has already been consumed (for example, by an earlier read or a previous rerun), any subsequent read returns nothing. To resolve this, reset each file’s pointer before you read it:

PYTHON
for pdf in pdf_docs:
    pdf.seek(0)  # Reset file pointer to the beginning
    file_name = os.path.basename(pdf.name)
    ...

Adding pdf.seek(0) at the start of the loop (or right before calling get_pdf_text) ensures you read every file from the beginning.
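
Alternatively, the reset can live inside the reader itself so callers don't have to remember it. A minimal sketch, assuming get_pdf_text is built on PyPDF2's PdfReader (the question doesn't show its implementation):

PYTHON
from PyPDF2 import PdfReader

def get_pdf_text(uploaded_file):
    # Rewind so a previously consumed stream is read from the start.
    uploaded_file.seek(0)
    reader = PdfReader(uploaded_file)
    # extract_text() may return None for pages without extractable text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)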


2. Verify Data Extraction Consistency

Even if the files are read correctly, it’s a good idea to:

  • Confirm that get_pdf_text and get_text_chunks return non-empty strings/chunks.
  • Log or display the value of pdf.name and the length of extracted text so that you can verify processing on every iteration.

For example, temporarily add:

PYTHON
raw_text = get_pdf_text(pdf)
st.write(f"Processing {file_name}: {len(raw_text)} characters extracted.")

This can help confirm that each PDF is processed.


3. Session State and Re-Rendering

Streamlit re-runs the app on interactions. Make sure that:

  • The session state variable (st.session_state["final_df"]) is initialized only once.
  • The loop iterates over all uploaded files on each run triggered by st.button.

Your code snippet shows the right approach by checking if “final_df” exists in st.session_state. However, double-check that no code outside this snippet inadvertently resets it.
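
To illustrate the rerun model in isolation, here is a minimal, self-contained pattern (a hypothetical example, not taken from the question) showing state that survives reruns:

PYTHON
import streamlit as st

# This top-level code runs on every rerun, but the list is created only once.
if "results" not in st.session_state:
    st.session_state["results"] = []

if st.button("Add item"):
    # The click triggers a rerun; mutations to session state persist across it.
    st.session_state["results"].append(f"item {len(st.session_state['results'])}")

st.write(st.session_state["results"])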


Summary

Your issue most likely comes from the file pointer of the uploaded PDFs not being reset. Reset each file’s pointer with pdf.seek(0) before processing, and confirm that your text extraction functions are running on valid data for each file. With that change, the loop should successfully iterate over all PDFs, and the cumulative DataFrame will contain all the extracted data.

Feel free to ask if you need further assistance with any part of the code!

Answer by VoidTracker636 3 months ago

It seems your Streamlit app is encountering an issue where it's only processing the first PDF file and skipping the rest. This could be due to several reasons:

  1. Incorrect File Input: Ensure the file object handed to your PDF-reading function (get_pdf_text) is valid; uploaded files are in-memory streams, not paths on disk, and note that pandas has no pd.read_pdf() function.
  2. Loop Termination: If you're using a loop to iterate over multiple PDF files, the loop might be terminated prematurely due to an error or incorrect condition.
  3. Dataframe Appending: The way you're appending data to the final_df might be causing issues. Ensure that the dataframes being appended have compatible structures.

Troubleshooting Steps:

  • Print File Names - Add a print or st.write statement inside the loop to verify that every uploaded file is actually reached.
  • Check Dataframe Structure - Before appending, ensure that the DataFrames have the same columns and data types.
  • Handle Errors - Use a try-except block to catch potential exceptions during PDF reading or appending (see the sketch after this list).
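
A minimal sketch of that try-except wrapping (hypothetical, built around the question's get_pdf_text loop):

PYTHON
for pdf in pdf_docs:
    try:
        pdf.seek(0)  # Defensive rewind before reading the stream
        raw_text = get_pdf_text(pdf)
        st.write(f"Read {pdf.name}: {len(raw_text)} characters")
    except Exception as exc:
        # Surface the failure instead of silently dropping the file.
        st.error(f"Failed to process {pdf.name}: {exc}")
        continue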
