support@lune.dev

Want to use over 200+ MCP servers inside your coding tools like Cursor?

Asked 5 months ago by PlanetaryObserver128

How can I remove outliers using positional indices from np.where without causing a KeyError in pandas?

The post content has been automatically edited by the Moderator Agent for consistency and clarity.

My machine learning dataset contains several variables, and I want to remove outliers from the 'final_test' column using the interquartile range (IQR) method, but only for rows where the 'hours_per_week' value is 0, 1, 2, 3, or 4 (there are no outliers for values 5 and above).

I wrote the function below to remove outliers for a given 'hours_per_week' value:

PYTHON
import pandas as pd 
import numpy as np

# Load the dataset
df = pd.read_csv('.xxx.csv')

# Function to remove outliers
def remove_outlier_using_IQR(df: pd.DataFrame, name_column: str, value: int) -> pd.DataFrame:
    """
    Remove outliers using IQR in the 'name_column'.

    Args:
        df (pd.DataFrame): The DataFrame containing the columns for outlier removal.
        name_column (str): The name of the column containing outliers to be removed.
        Value: Value in name column for outlier removal.

    Returns:
        pd.DataFrame: The DataFrame with outliers in 'name_column' removed.
    """

    # Detect outliers in the 'name_column'
    df2 = df[df[name_column]==value]
    Q1 = df2['final_test'].quantile(0.25)
    Q3 = df2['final_test'].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    df2.info()
    print(df2.shape)
    print(Q3)

    # Create arrays of Boolean values indicating the outlier rows
    upper_array = np.where(df2['final_test'] >= upper)[0]
    lower_array = np.where(df2['final_test'] <= lower)[0]

    print(upper)
    print(upper_array)

    # Removing the outliers
    df2.drop(index=upper_array, inplace=True)
    df2.drop(index=lower_array, inplace=True)

    df3 = df[df[name_column]!=value]
    df_merged = pd.concat([df2,df3], ignore_index=False, sort=False)

    return df_merged

# Use function to remove outliers
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(0))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(1))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(2))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(3))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(4))

When I run this code, I see the following output:

PLAINTEXT
print(df2.shape) => (56,18)
print(Q3) => 48.0
print(upper) => 55.5
print(upper_array) => [15 25 34 53]

and then encounter this error:

PLAINTEXT
KeyError: '[15, 25, 34, 53] not found in axis'

I intended to drop the rows corresponding to the outliers (with indices [15, 25, 34, 53] in df2), but pandas raises a KeyError because np.where returns positional indices rather than the actual index labels in df2. How can I correctly remove these outliers without encountering this error?

Comments 0

•

Answers 3

•

3 Answers

Answer by StackOverflow TychoAI Agent • 5 months ago

The error occurs because df2 has a non-default index. When you use np.where on df2, it returns positional indices (i.e., 0, 1, 2, … for the rows in df2) rather than the actual index labels in df2. When you then pass these positional indices directly to df2.drop(index=…), pandas looks for those values as index labels and fails if they don’t exist.

To fix this, select the actual index labels using df2.index. For example, change:

PYTHON
df2.drop(index=upper_array, inplace=True)

to:

PYTHON
df2.drop(index=df2.index[upper_array], inplace=True)

and do the same for lower_array:

PYTHON
df2.drop(index=df2.index[lower_array], inplace=True)

Below is your modified function:

PYTHON
import pandas as pd 
import numpy as np

# Load the dataset
df = pd.read_csv('.xxx.csv')

# Function to remove outliers using IQR
def remove_outlier_using_IQR(df: pd.DataFrame, name_column: str, value: int) -> pd.DataFrame:
    """
    Remove outliers using IQR in the 'name_column'.
    
    Args:
        df (pd.DataFrame): The DataFrame containing the columns for outlier removal.
        name_column (str): The name of the column containing outliers to be removed.
        Value: Value in name column for outlier removal.
        
    Returns:
        pd.DataFrame: The DataFrame with outliers in 'name_column' removed.
    """
    # Filter the subset for the specified value
    df2 = df[df[name_column] == value]
    
    # Compute quartiles and IQR on 'final_test'
    Q1 = df2['final_test'].quantile(0.25)
    Q3 = df2['final_test'].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    # Debug prints
    df2.info()
    print("Shape:", df2.shape)
    print("Q3:", Q3)
    print("Upper threshold:", upper)
    
    # Identify outlier positions
    upper_array = np.where(df2['final_test'] >= upper)[0]
    lower_array = np.where(df2['final_test'] <= lower)[0]
    
    print("Outlier indices (relative positions):", upper_array)
    
    # Drop outliers using actual index labels
    df2.drop(index=df2.index[upper_array], inplace=True)
    df2.drop(index=df2.index[lower_array], inplace=True)
    
    # Combine with the rest of the data that was not filtered by 'name_column'
    df3 = df[df[name_column] != value]
    df_merged = pd.concat([df2, df3], ignore_index=False, sort=False)
    
    return df_merged

# Use the function to remove outliers for x values 0 through 4
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=0)
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=1)
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=2)
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=3)
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=4)

By indexing into df2.index with your upper_array and lower_array, you ensure that the correct rows are dropped by label rather than by relative position.

Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?

No comments yet.

Answer by CosmicCosmonaut441 • 5 months ago

You can simplify solution for remove values in boolean indexing with Series.between:

PYTHON
upper_array = np.where(df2['final_test'] >= upper)[0]
lower_array = np.where(df2['final_test'] <= lower)[0]

print(upper)
print(upper_array)

# Removing the outliers
df2.drop(index=upper_array, inplace=True)
df2.drop(index=lower_array, inplace=True)

change to:

PYTHON
df2 = df2[df2['final_test'].between(lower, upper, inclusive='neither')]

No comments yet.

Answer by VoidSurveyor572 • 5 months ago

You need to convert the positional indices to corresponding labels in df2.index. You can do that by indexing df2.index with those arrays:

So do this:

PYTHON
    df2.drop(df2.index[upper_array], inplace=True)
    df2.drop(df2.index[lower_array], inplace=True)

That way, you are telling .drop() actual row labels from df2.index rather than numeric positions.

np.where(...) giving you positional indices (0,1,2, …) relative to df2 rather than actual row labels in df2. When you do

PYTHON
df2.drop(index=upper_array, inplace=True)

pandas interprets upper_array as actual row labels that must exist in df2.index. So KeyError [15, 25, 34,53] not found in axis simply means that df2 does not have row labels [15, 25, 34, 53].

No comments yet.

Discussion

No comments yet.

How can I remove outliers using positional indices from np.where without causing a KeyError in pandas?

3 Answers

Discussion

Similar Posts

How can I integrate Whisper.cpp's WASM worker into my SvelteKit project?

How can I automatically skip unsupported file types like .mp4 in Langchain's GoogleDriveLoader?