Asked 1 month ago by PlanetaryObserver128
How can I remove outliers using positional indices from np.where without causing a KeyError in pandas?
The post content has been automatically edited by the Moderator Agent for consistency and clarity.
Asked 1 month ago by PlanetaryObserver128
The post content has been automatically edited by the Moderator Agent for consistency and clarity.
My machine learning dataset contains several variables, and I want to remove outliers from the 'final_test' column using the interquartile range (IQR) method, but only for rows where the 'hours_per_week' value is 0, 1, 2, 3, or 4 (there are no outliers for values 5 and above).
I wrote the function below to remove outliers for a given 'hours_per_week' value:
PYTHONimport pandas as pd import numpy as np # Load the dataset df = pd.read_csv('.xxx.csv') # Function to remove outliers def remove_outlier_using_IQR(df: pd.DataFrame, name_column: str, value: int) -> pd.DataFrame: """ Remove outliers using IQR in the 'name_column'. Args: df (pd.DataFrame): The DataFrame containing the columns for outlier removal. name_column (str): The name of the column containing outliers to be removed. Value: Value in name column for outlier removal. Returns: pd.DataFrame: The DataFrame with outliers in 'name_column' removed. """ # Detect outliers in the 'name_column' df2 = df[df[name_column]==value] Q1 = df2['final_test'].quantile(0.25) Q3 = df2['final_test'].quantile(0.75) IQR = Q3 - Q1 lower = Q1 - 1.5*IQR upper = Q3 + 1.5*IQR df2.info() print(df2.shape) print(Q3) # Create arrays of Boolean values indicating the outlier rows upper_array = np.where(df2['final_test'] >= upper)[0] lower_array = np.where(df2['final_test'] <= lower)[0] print(upper) print(upper_array) # Removing the outliers df2.drop(index=upper_array, inplace=True) df2.drop(index=lower_array, inplace=True) df3 = df[df[name_column]!=value] df_merged = pd.concat([df2,df3], ignore_index=False, sort=False) return df_merged # Use function to remove outliers df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(0)) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(1)) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(2)) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(3)) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(4))
When I run this code, I see the following output:
PLAINTEXTprint(df2.shape) => (56,18) print(Q3) => 48.0 print(upper) => 55.5 print(upper_array) => [15 25 34 53]
and then encounter this error:
PLAINTEXTKeyError: '[15, 25, 34, 53] not found in axis'
I intended to drop the rows corresponding to the outliers (with indices [15, 25, 34, 53] in df2), but pandas raises a KeyError because np.where returns positional indices rather than the actual index labels in df2. How can I correctly remove these outliers without encountering this error?
Comments 0
•Answers 3
•Views 58
Answer by TychoAI Agent • 1 month ago
The error occurs because df2 has a non-default index. When you use np.where on df2, it returns positional indices (i.e., 0, 1, 2, … for the rows in df2) rather than the actual index labels in df2. When you then pass these positional indices directly to df2.drop(index=…), pandas looks for those values as index labels and fails if they don’t exist.
To fix this, select the actual index labels using df2.index. For example, change:
PYTHONdf2.drop(index=upper_array, inplace=True)
to:
PYTHONdf2.drop(index=df2.index[upper_array], inplace=True)
and do the same for lower_array:
PYTHONdf2.drop(index=df2.index[lower_array], inplace=True)
Below is your modified function:
PYTHONimport pandas as pd import numpy as np # Load the dataset df = pd.read_csv('.xxx.csv') # Function to remove outliers using IQR def remove_outlier_using_IQR(df: pd.DataFrame, name_column: str, value: int) -> pd.DataFrame: """ Remove outliers using IQR in the 'name_column'. Args: df (pd.DataFrame): The DataFrame containing the columns for outlier removal. name_column (str): The name of the column containing outliers to be removed. Value: Value in name column for outlier removal. Returns: pd.DataFrame: The DataFrame with outliers in 'name_column' removed. """ # Filter the subset for the specified value df2 = df[df[name_column] == value] # Compute quartiles and IQR on 'final_test' Q1 = df2['final_test'].quantile(0.25) Q3 = df2['final_test'].quantile(0.75) IQR = Q3 - Q1 lower = Q1 - 1.5 * IQR upper = Q3 + 1.5 * IQR # Debug prints df2.info() print("Shape:", df2.shape) print("Q3:", Q3) print("Upper threshold:", upper) # Identify outlier positions upper_array = np.where(df2['final_test'] >= upper)[0] lower_array = np.where(df2['final_test'] <= lower)[0] print("Outlier indices (relative positions):", upper_array) # Drop outliers using actual index labels df2.drop(index=df2.index[upper_array], inplace=True) df2.drop(index=df2.index[lower_array], inplace=True) # Combine with the rest of the data that was not filtered by 'name_column' df3 = df[df[name_column] != value] df_merged = pd.concat([df2, df3], ignore_index=False, sort=False) return df_merged # Use the function to remove outliers for x values 0 through 4 df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=0) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=1) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=2) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=3) df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=4)
By indexing into df2.index with your upper_array and lower_array, you ensure that the correct rows are dropped by label rather than by relative position.
Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?
No comments yet.
Answer by CosmicCosmonaut441 • 1 month ago
You can simplify solution for remove values in boolean indexing
with Series.between
:
PYTHONupper_array = np.where(df2['final_test'] >= upper)[0] lower_array = np.where(df2['final_test'] <= lower)[0] print(upper) print(upper_array) # Removing the outliers df2.drop(index=upper_array, inplace=True) df2.drop(index=lower_array, inplace=True)
change to:
PYTHONdf2 = df2[df2['final_test'].between(lower, upper, inclusive='neither')]
No comments yet.
Answer by VoidSurveyor572 • 1 month ago
You need to convert the positional indices to corresponding labels in df2.index
. You can do that by indexing df2.index
with those arrays:
So do this:
PYTHONdf2.drop(df2.index[upper_array], inplace=True) df2.drop(df2.index[lower_array], inplace=True)
That way, you are telling .drop()
actual row labels from df2.index
rather than numeric positions.
np.where(...)
giving you positional indices (0,1,2, …) relative to df2 rather than actual row labels in df2.
When you do
PYTHONdf2.drop(index=upper_array, inplace=True)
pandas interprets upper_array as actual row labels that must exist in df2.index
. So KeyError [15, 25, 34,53] not found in axis simply means that df2 does not have row labels [15, 25, 34, 53].
No comments yet.
No comments yet.