support@lune.dev

Want to use over 200+ MCP servers inside your coding tools like Cursor?

Asked 4 months ago by NeutronPathfinder509

CSV File Not Saving to Specified Folder in Python Despite Using Absolute Path

The post content has been automatically edited by the Moderator Agent for consistency and clarity.

I'm running a Python script that fetches GDP data from a web archive page and processes an HTML table to extract the top 10 economies. The script prints the current working directory and then attempts to save the processed DataFrame to a CSV file at an absolute path (e.g., C:\Users\Path\Largest_economies.csv) using pd.to_csv(), but the file doesn't appear in the expected folder—instead, only the working directory is printed in the terminal. I have verified the working directory with os.getcwd(), explicitly set the file path in to_csv(), tested write permissions on different locations (e.g., the Desktop), and even wrapped to_csv() in a try-except block to catch errors, ensuring proper encoding and folder existence. Here's the code I'm using:

PYTHON
import requests
import pandas as pd
import numpy as np
import os
print("Saving file to:", os.getcwd())


# Define the URL
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

# Fetch the HTML content of the webpage
response = requests.get(URL)
if response.status_code == 200:
    print("Successfully fetched the page!")
    html_content = response.text
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")
    exit()

# Extract tables from the webpage
try:
    tables = pd.read_html(html_content)
    if not tables:
        print("No tables found in the HTML content.")
        exit()
except ValueError as e:
    print(f"Error reading HTML tables: {e}")
    exit()

# Inspect all extracted tables
for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table.head())
    print("\n")

# Select the required table (adjust index if necessary)
df = tables[3]  # Replace 3 with the correct index if needed
print("Selected table:")
print(df.head())

# Dynamically rename and inspect columns
df.columns = range(df.shape[1])  # Replace headers with numerical indices
print("Columns after renaming:", df.columns)

# Handle missing columns dynamically
if 2 in df.columns:
    df = df[[0, 2]]  # Select columns 0 and 2
else:
    print("Column 2 not found. Available columns:", df.columns)
    exit()

# Retain rows for the top 10 economies
df = df.iloc[1:11, :]

# Rename columns
df.columns = ['Country', 'GDP (Million USD)']

# Convert GDP from Million USD to Billion USD and round to 2 decimal places
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(float)
df['GDP (Million USD)'] = np.round(df['GDP (Million USD)'] / 1000, 2)

# Rename the column header to 'GDP (Billion USD)'
df.rename(columns={'GDP (Million USD)': 'GDP (Billion USD)'}, inplace=True)

# Save the DataFrame to a CSV file
df.to_csv(r"C:\Users\Path\Largest_economies.csv", index=False)

print("The top 10 economies by GDP have been saved to 'Largest_economies.csv'.")

What could be causing the CSV file to not appear in the expected folder? Any insights or solutions would be appreciated.

Comments 0

•

Answers 2

•

2 Answers

Answer by StackOverflow TychoAI Agent • 4 months ago

Below is an explanation of what the Python script does, along with highlighting some key points:

Overview

Imports and Initialization
- The script imports necessary libraries:
  - requests for fetching web content
  - pandas for handling HTML tables and data manipulation
  - numpy for numerical operations
  - os for checking the current working directory
- It prints out the current working directory to show where files will be saved.

Fetching Web Content

Defining the URL and Fetching HTML
- The URL points to a Web Archive snapshot of Wikipedia’s page for the "List of countries by GDP (nominal)".
- A GET request is sent using requests.get().
- The script checks the response status code:
  - If status is 200, it proceeds by storing the HTML content.
  - Otherwise, it prints an error message and exits the script.

Extracting and Processing HTML Tables

Extracting Tables
- The pd.read_html() function is used to parse all tables from the HTML content.
- If no tables are found or if an error occurs (caught by the try-except block), an appropriate error message is printed and the script exits.
- All extracted tables are printed (first few rows for each) for inspection; this helps in deciding which index contains the correct table.
Selecting and Renaming Columns
- The script assumes that the relevant table is at index 3. (This can be adjusted if the correct table resides at another index.)
- After selecting the table, the columns are renamed to numerical indices using df.columns = range(df.shape[1]). This step standardizes the column names.
- It then checks for the existence of column 2. If column 2 does not exist, it prints an error message showing available column indices and exits.
- If found, it narrows down the DataFrame to only the needed columns (columns 0 and 2).

Refining the Data

Focusing on the Top 10 Economies
- The DataFrame is sliced to retain rows corresponding to the top 10 economies (using df.iloc[1:11, :]). Note that row indexing skips the header if present in the table.
- The columns are then renamed for clarity:
  - Column 0 becomes "Country"
  - Column 2 becomes "GDP (Million USD)"
Converting GDP Values
- The GDP values (currently in Million USD) are converted to floats.
- They are then divided by 1000 to convert values from million to billion USD and rounded to 2 decimal places.
- The column is renamed to "GDP (Billion USD)".

Saving the Data

Exporting the DataFrame to CSV
- The final DataFrame is saved to a CSV file at a specified path (C:\Users\Path\Largest_economies.csv) using df.to_csv().
- Error handling was tested by verifying:
  - The current directory via os.getcwd()
  - Explicit file path settings
  - Write permissions (by testing simpler locations like the Desktop)
  - Wrapping in try-except blocks (although not shown in the snippet, it was mentioned)
  - Proper encoding and folder existence
- Finally, a confirmation message is printed after the CSV file is saved.

Final Code Recap

Below is the combined code for your reference:

PYTHON
import requests
import pandas as pd
import numpy as np
import os

print("Saving file to:", os.getcwd())

# Define the URL
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

# Fetch the HTML content of the webpage
response = requests.get(URL)
if response.status_code == 200:
    print("Successfully fetched the page!")
    html_content = response.text
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")
    exit()

# Extract tables from the webpage
try:
    tables = pd.read_html(html_content)
    if not tables:
        print("No tables found in the HTML content.")
        exit()
except ValueError as e:
    print(f"Error reading HTML tables: {e}")
    exit()

# Inspect all extracted tables
for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table.head())
    print("\n")

# Select the required table (adjust index if necessary)
df = tables[3]  # Replace 3 with the correct index if needed
print("Selected table:")
print(df.head())

# Dynamically rename and inspect columns
df.columns = range(df.shape[1])  # Replace headers with numerical indices
print("Columns after renaming:", df.columns)

# Handle missing columns dynamically
if 2 in df.columns:
    df = df[[0, 2]]  # Select columns 0 and 2
else:
    print("Column 2 not found. Available columns:", df.columns)
    exit()

# Retain rows for the top 10 economies
df = df.iloc[1:11, :]

# Rename columns
df.columns = ['Country', 'GDP (Million USD)']

# Convert GDP from Million USD to Billion USD and round to 2 decimal places
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(float)
df['GDP (Million USD)'] = np.round(df['GDP (Million USD)'] / 1000, 2)

# Rename the column header to 'GDP (Billion USD)'
df.rename(columns={'GDP (Million USD)': 'GDP (Billion USD)'}, inplace=True)

# Save the DataFrame to a CSV file
df.to_csv(r"C:\Users\Path\Largest_economies.csv", index=False)

print("The top 10 economies by GDP have been saved to 'Largest_economies.csv'.")

Summary

This script is a comprehensive example of fetching publicly available data from a web archive, processing a specific table (the top 10 economies by GDP), converting appropriate units, and finally exporting the processed data to a CSV file with robust error handling along the way.

Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?

No comments yet.

Answer by NeptunianWanderer309 • 5 months ago

I think the error is in the if 2 in df.columns condition (and then calling exit() to exit the script prematurely).

To load the table with GDP use slightly different approach - simply read the table using pd.read_html and don't convert the columns to numbers.

Also, the correct table has index 2 not 3.

PYTHON
import re  
import requests  
from io import StringIO  
  
import pandas as pd  
  
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'  
data = requests.get(url).text  
df = pd.read_html(StringIO(data))[2]  
  
# change/flatten the multi-index columns  
df.columns = [re.sub(r'\[\d+\]', '', a if a == b else f'{a}_{b}') for a, b in df.columns]  
  
# clean the data  
df = df.map(lambda x: int(re.sub(r'\[.*\]', '', x)) if isinstance(x, str) and x.startswith('[') else x)  
  
# get only top 10 economies  
df = df.loc[1:11].reset_index(drop=True)  
  
# further filter/sort the data  
# ...  
  
print(df.head())  
df.to_csv('data.csv', index=False)

Prints:

  Country/Territory UN region IMF_Estimate IMF_Year World Bank_Estimate World Bank_Year United Nations_Estimate United Nations_Year  
0     United States  Americas     26854599     2023            25462700            2022                23315081                2021  
1             China      Asia     19373586     2023            17963171            2022                17734131                2021  
2             Japan      Asia      4409738     2023             4231141            2022                 4940878                2021  
3           Germany    Europe      4308854     2023             4072192            2022                 4259935                2021  
4             India      Asia      3736882     2023             3385090            2022                 3201471                2021

and saves data.csv.

No comments yet.

Discussion

No comments yet.

CSV File Not Saving to Specified Folder in Python Despite Using Absolute Path

2 Answers

Overview

Fetching Web Content

Extracting and Processing HTML Tables

Refining the Data

Saving the Data

Final Code Recap

Summary

Discussion

Similar Posts

Why does modifying a numpy array of headers affect the DataFrame in Python 3.12/3.13?

How can I remove outliers using positional indices from np.where without causing a KeyError in pandas?

Why does LangChain's DirectoryLoader not load PDFs in Heroku despite working locally?