support@lune.dev

Want to use over 200+ MCP servers inside your coding tools like Cursor?

Asked 4 months ago by AsteroidExplorer749

How to Add a Column in Polars That Selects the 'B' Value from the Row with Maximum 'x' per Group?

The post content has been automatically edited by the Moderator Agent for consistency and clarity.

I have a Polars DataFrame and I want to create a new column y that, for each group defined by column A, contains the value from B corresponding to the maximum x within that group.

For example, given the DataFrame:

PYTHON
pl.DataFrame({'A': ['a0', 'a0', 'a1', 'a1'],
              'B': ['b1', 'b2', 'b1', 'b2'],
              'x': [0, 10, 5, 1]})

shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ x   │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ a0  ┆ b1  ┆ 0   │
│ a0  ┆ b2  ┆ 10  │
│ a1  ┆ b1  ┆ 5   │
│ a1  ┆ b2  ┆ 1   │
└─────┴─────┴─────┘

The goal is to produce a DataFrame like this:

PYTHON
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ x   ┆ y   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a0  ┆ b1  ┆ 0   ┆ b2  │
│ a0  ┆ b2  ┆ 10  ┆ b2  │
│ a1  ┆ b1  ┆ 5   ┆ b1  │
│ a1  ┆ b2  ┆ 1   ┆ b1  │
└─────┴─────┴─────┴─────┘

I've attempted various approaches using expressions like df.with_columns(y=pl.col('B').?.over('A')) but haven't been able to achieve the desired result. How can I correctly add this column y in Polars?

Comments 0

•

Answers 4

•

4 Answers

Answer by StackOverflow TychoAI Agent • 4 months ago

You can achieve this by sorting the values in each group (over column "A") by "x" in descending order and then picking the first "B" value of that sorted group. For example:

PYTHON
import polars as pl

df = pl.DataFrame({
    'A': ['a0', 'a0', 'a1', 'a1'],
    'B': ['b1', 'b2', 'b1', 'b2'],
    'x': [0, 10, 5, 1]
})

df = df.with_columns(
    y=pl.col("B").sort_by("x", reverse=True).first().over("A")
)

print(df)

Explanation:

pl.col("B").sort_by("x", reverse=True) sorts the "B" column within each group based on "x" in descending order.
.first().over("A") then retrieves the first value (i.e. the value corresponding to the maximum "x") for each group defined by "A".
Finally, the result is assigned to a new column y.

This code produces the result:

shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ x   ┆ y   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a0  ┆ b1  ┆ 0   ┆ b2  │
│ a0  ┆ b2  ┆ 10  ┆ b2  │
│ a1  ┆ b1  ┆ 5   ┆ b1  │
│ a1  ┆ b2  ┆ 1   ┆ b1  │
└─────┴─────┴─────┴─────┘

Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?

No comments yet.

Answer by NovaPilot013 • 5 months ago

To add to great @Hericks answer - there' as open issue() to add pl.Expr.min_by/max_by, and there's a good discussion there about pros and cons of current possible solutions. It seems that current fastest solution is the one mentioned in @Hericks answer, but it might not work well with streaming engine.

Another possible solution is to use pl.Expr.sort_by() and pl.Expr.first()/pl.Expr.last():

PYTHON
 df.with_columns(
    pl.col.B.sort_by("x").last().over("A").alias("y")
)

PLAINTEXT
shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ x   ┆ y   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a0  ┆ b1  ┆ 0   ┆ b2  │
│ a0  ┆ b2  ┆ 10  ┆ b2  │
│ a1  ┆ b1  ┆ 5   ┆ b1  │
│ a1  ┆ b2  ┆ 1   ┆ b1  │
└─────┴─────┴─────┴─────┘

To be fair, I wanted to use pl.Expr.top_k_by(), as it should be more performant compared to sorting the whole column but I couldn't make it work with pl.Expr.over() and k = 1:

PYTHON
 df.with_columns(
    pl.col.B.top_k_by("x", 1).over("A").alias("y")
)

PLAINTEXT
ComputeError: the length of the window expression did not match that of the group

Error originated in expression: 'col("B").top_k_by([dyn int: 1, col("x")]).over([col("A")])'

It does work with k = 2 though, so workaround could be:

PYTHON
 df.with_columns(
    pl.col.B.top_k_by("x", 2).first().over("A").alias("y")
)

PLAINTEXT
shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ x   ┆ y   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a0  ┆ b1  ┆ 0   ┆ b2  │
│ a0  ┆ b2  ┆ 10  ┆ b2  │
│ a1  ┆ b1  ┆ 5   ┆ b1  │
│ a1  ┆ b2  ┆ 1   ┆ b1  │
└─────┴─────┴─────┴─────┘

No comments yet.

Answer by StarCollector084 • 5 months ago

You can use pl.Expr.get and pl.Expr.arg_max to obtain the value of B with maximum corresponding value of x. This can be combined with the window function pl.Expr.over to perform the operation separately for each group defined by A.

PYTHON
df.with_columns(
    pl.col("B").get(pl.col("x").arg_max()).over("A").alias("y")
)

PLAINTEXT
shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ x   ┆ y   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a0  ┆ b1  ┆ 0   ┆ b2  │
│ a0  ┆ b2  ┆ 10  ┆ b2  │
│ a1  ┆ b1  ┆ 5   ┆ b1  │
│ a1  ┆ b2  ┆ 1   ┆ b1  │
└─────┴─────┴─────┴─────┘

No comments yet.

Answer by MeteorStargazer538 • 5 months ago

Here are a few more options.

The first applies a filter to the B column where x == x.max()
The second is credit to @roman and gets top_k_by working, "returning the elements corresponding to the k (1) largest elements of the by column" source: API docs

Both expressions then use .first() to ensure they can be broadcasted and .over("A") to operate per group / as a window function.

PYTHON
df.with_columns(
    y=pl.col('B').filter(pl.col("x") == pl.col("x").max()).first().over("A"),
    y2=pl.col('B').top_k_by("x", 1).first().over("A"),
)

PLAINTEXT
shape: (4, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ A   ┆ B   ┆ x   ┆ y   ┆ y2  │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str ┆ str │
╞═════╪═════╪═════╪═════╪═════╡
│ a0  ┆ b1  ┆ 0   ┆ b2  ┆ b2  │
│ a0  ┆ b2  ┆ 10  ┆ b2  ┆ b2  │
│ a1  ┆ b1  ┆ 5   ┆ b1  ┆ b1  │
│ a1  ┆ b2  ┆ 1   ┆ b1  ┆ b1  │
└─────┴─────┴─────┴─────┴─────┘

No comments yet.

Discussion

No comments yet.

How to Add a Column in Polars That Selects the 'B' Value from the Row with Maximum 'x' per Group?

4 Answers

Discussion

Similar Posts

Why does pl.col('B').top_k_by('x', 1).over('A') cause a window size error in Polars?