Asked 5 months ago by MartianCommander931

From Stack Overflow

How can I perform a rolling aggregation without manually adding a temporary index column?

The post content has been automatically edited by the Moderator Agent for consistency and clarity.

I have a DataFrame like this:

PYTHON
import polars as pl

df = pl.DataFrame({"x": [1.2, 1.3, 3.4, 3.5]})
df

# shape: (4, 1)
# ┌─────┐
# │ x   │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 1.2 │
# │ 1.3 │
# │ 3.4 │
# │ 3.5 │
# └─────┘

I want to compute a rolling aggregation on column x using a window of [-2:1], i.e. from two rows before the current row through one row after it, so that each row yields the list of values in its window. For example:

PLAINTEXT
shape: (4, 2)
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘

So far, I achieved this by manually creating an index column, performing the rolling aggregation, and then removing the index:

PYTHON
df.with_row_index("index").with_columns(
  y = pl.col("x").rolling(index_column="index", period="4i", offset="-3i")
).drop("index")

Is there a way to perform this rolling operation in a single with_columns() call without having to add and remove a temporary index column?

3 Answers

Answer by StackOverflow TychoAI Agent • 5 months ago

pl.Expr.rolling defines its window through period and offset relative to an index column, so it needs that column to determine each row's "position" in the DataFrame. In the current API there isn't a built-in way to use pl.Expr.rolling with a window defined relative to row positions without first creating that temporary index column (via with_row_index, or its deprecated predecessor with_row_count).

For example, the following approach

PYTHON
df.with_row_index("index").with_columns(
    y=pl.col("x").rolling(index_column="index", period="4i", offset="-3i")
).drop("index")

is needed because the rolling function uses the index to resolve what "-2:1" means for each row: with offset="-3i" and period="4i", the window for index i covers (i - 3, i + 1], that is, rows i - 2 through i + 1. Although you can inline the creation of the index column in your query (using, e.g., with_row_index), you still must create (and subsequently drop) that column.
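
To make the mapping from "-2:1" to period="4i" and offset="-3i" concrete, here is a minimal pure-Python sketch (no Polars involved). It assumes a right-closed window of the form (i + offset, i + offset + period]. The helper names window_to_period_offset and rows_in_window are hypothetical and exist only for this illustration.

```python
def window_to_period_offset(lo: int, hi: int) -> tuple[str, str]:
    """Translate an inclusive row window [lo, hi] into "Ni" strings.

    Assumes a right-closed window (i + offset, i + offset + period].
    """
    period = hi - lo + 1  # number of rows in the window
    offset = lo - 1       # lower bound is exclusive, hence the extra -1
    return f"{period}i", f"{offset}i"


def rows_in_window(i: int, n_rows: int, lo: int, hi: int) -> list[int]:
    """Row positions covered by the window around row i, clipped to the frame."""
    period, offset = hi - lo + 1, lo - 1
    start = i + offset           # exclusive
    end = i + offset + period    # inclusive
    return [j for j in range(n_rows) if start < j <= end]


period, offset = window_to_period_offset(-2, 1)
print(period, offset)  # 4i -3i

windows = [rows_in_window(i, 4, -2, 1) for i in range(4)]
print(windows)  # [[0, 1], [0, 1, 2], [0, 1, 2, 3], [1, 2, 3]]
```

The row positions printed at the end line up with the lists in the desired output: the first and last rows get shorter windows because part of the window falls outside the frame.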

So, to summarize: As of now there isn’t a built‑in one‑step method that avoids the creation of a temporary index column when using offset‑based rolling windows.


Answer by AstroSentinel010 • 5 months ago

Pure expressions approach (apparently slow)

You can use concat_list with shift:

PYTHON
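(
    df
    .with_columns(
        # shift(2), shift(1), shift(0), shift(-1): two rows back through one row ahead
        y=pl.concat_list(
            [pl.col("x").shift(n) for n in range(2, -2, -1)]
        )
        # rows near the edges get nulls from the shifts; drop them
        .list.drop_nulls()
    )
)

PLAINTEXT
shape: (4, 2)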
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘

There are a few things to note here:

A positive argument to shift moves values towards later rows, so shift(2) reads the value from two rows back. This is the opposite sign from your notation.
range can count backwards with (start, stop, step), but stop is exclusive, so to include an offset of -1 (one row ahead) you need to pass -2.
After concat_list you need to drop the nulls that the shifts introduce for rows at the beginning and end of the series. Note that list.drop_nulls() also removes any genuine nulls in the data.
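
The three notes above can be checked with a small pure-Python sketch. The shift function below is a hypothetical stand-in that mimics the sign convention of Polars' shift on a plain list, padding with None. It is not the Polars implementation.

```python
def shift(values: list, n: int) -> list:
    """Mimic the shift sign convention on a plain list, padding with None."""
    size = len(values)
    out = [None] * size
    for i, v in enumerate(values):
        j = i + n  # positive n moves a value to a later row
        if 0 <= j < size:
            out[j] = v
    return out


x = [1.2, 1.3, 3.4, 3.5]

# stop is exclusive, so a stop of -2 makes -1 the last offset produced
offsets = list(range(2, -2, -1))
print(offsets)  # [2, 1, 0, -1]

# one shifted copy per offset, then read them row by row
columns = [shift(x, n) for n in offsets]
# drop the None padding that the shifts introduce at the edges
y = [[v for v in row if v is not None] for row in zip(*columns)]
print(y)  # [[1.2, 1.3], [1.2, 1.3, 3.4], [1.2, 1.3, 3.4, 3.5], [1.3, 3.4, 3.5]]
```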

As always, you can wrap this into a function, including a translation of your preferred notation to what you actually need in range for it to work.

PYTHON
from typing import Sequence


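def my_roll(in_column: str | pl.Expr, window: Sequence[int]) -> pl.Expr:
    """Collect the values in rows [window[0], window[1]] relative to each row."""
    if isinstance(in_column, str):
        in_column = pl.col(in_column)
    # Negate the bounds because a positive shift looks backwards;
    # the extra -1 makes the upper bound inclusive.
    pl_window = range(-window[0], -window[1] - 1, -1)
    return pl.concat_list([in_column.shift(n) for n in pl_window]).list.drop_nulls()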

which then allows you to do

PYTHON
df.with_columns(y=my_roll("x", [-2,1]))

If you don't care about static typing, you can even monkey patch it onto pl.Expr with pl.Expr.my_roll = my_roll and then write df.with_columns(y=pl.col("x").my_roll([-2,1])), but pylance/pyright/mypy/etc. will complain that the attribute does not exist.

Another approach that's kind of cheating if you're an expression purist

You can wrap the built-in approach (.with_row_index plus .rolling) in a .map_batches call that turns your column into a DataFrame and returns the Series you care about.

PYTHON
def my_roll(in_column: str | pl.Expr, window):
    if isinstance(in_column, str):
        in_column = pl.col(in_column)
    # Inclusive window [lo, hi] -> hi - lo + 1 rows, starting just after lo - 1
    period = f"{window[1] - window[0] + 1}i"
    offset = f"{window[0] - 1}i"
    return in_column.map_batches(
        lambda s: (
            s.to_frame()
            # with_row_index() names the new column "index" by default
            .with_row_index()
            .select(
                pl.col(s.name).rolling(
                    index_column="index",
                    period=period,
                    offset=offset,
                )
            )
            .get_column(s.name)
        )
    )

This works because map_batches passes your column to a function as a Series and expects a Series back. If the function turns that Series into a DataFrame, attaches the row index, does the rolling, and extracts the resulting Series, you get exactly what you want, all contained in an expression. It should be just as performant as the verbose way, assuming you don't have any other use for the row index.

Then you do:

PYTHON
df.with_columns(y=my_roll("x", [-2,1]))

Answer by MercurialSatellite936 • 5 months ago

It looks like pl.Expr.rolling() expects a string as the index column, so it needs an actual column in the frame. You can use pl.DataFrame.select() instead of pl.DataFrame.with_columns() if you prefer:

PYTHON
df.with_row_index().select(
    df.columns,
    y=pl.col("x").rolling(index_column="index", period="4i", offset="-3i")
)

PLAINTEXT
shape: (4, 2)
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘

You could also use pl.DataFrame.rolling(), which accepts an expression as the index column, together with pl.int_range(), but it doesn't look much better to be honest:

PYTHON
df.with_columns(
    df
    .rolling(index_column=pl.int_range(pl.len()), period="4i", offset="-3i")
    .agg(pl.col.x.alias("y"))["y"]
)

PLAINTEXT
shape: (4, 2)
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘
