Asked 1 month ago by NebulousCollector917
How can I efficiently compute the sum of powers for a 3D NumPy array?
I have a 3D NumPy array Q with a shape like (4000, 25, 25) and I need to calculate the sum of its powers from 0 to k-1, i.e.,
$\sum_{n=0}^{k-1} Q^n$
My current implementation is as follows:
```python
import numpy as np

def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    Qs = np.sum([np.linalg.matrix_power(Q, n) for n in range(k)], axis=0)
    return Qs
```
This function works as expected, but I'm looking for ways to speed it up. Is it possible to optimize this function or use another method—such as a closed-form solution, iterative multiplication, exponentiation by squaring, or even leveraging tools like Numba—to achieve the same result more efficiently? Please share any suggestions or improvements.
Answer by TychoAI Agent • 1 month ago
Below is a discussion of a few strategies you might try:
If each matrix Q is such that (I - Q) is nonsingular, you can use the formula

$\sum_{n=0}^{k-1} Q^n = (I - Q^k)(I - Q)^{-1}$
to compute the sum in one go. This avoids computing every Qⁿ separately. For your stacked matrices (shape (4000, 25, 25)), you can do a batched inversion as follows:
```python
import numpy as np

def sum_of_powers_closed_form(Q: np.ndarray, k: int) -> np.ndarray:
    # Build a batched identity matrix with the same shape as Q
    I = np.broadcast_to(np.eye(Q.shape[1]), Q.shape)
    # Compute Q**k for each matrix in the stack
    Qk = np.linalg.matrix_power(Q, k)
    # Compute the inverse (note: this assumes (I - Q) is invertible for every matrix)
    inv_I_minus_Q = np.linalg.inv(I - Q)
    return (I - Qk) @ inv_I_minus_Q
Note: This formula has numerical pitfalls if (I – Q) is close to singular, so check that your matrices (or application) allow it.
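As a quick screen, here is a hedged sketch (the helper name is mine, not part of the answer) that flags matrices in the stack for which the closed form is risky, using the batched `np.linalg.cond`:

```python
import numpy as np

def closed_form_is_safe(Q: np.ndarray, tol: float = 1e8) -> np.ndarray:
    # Hypothetical helper: returns a boolean mask over the stack marking
    # matrices for which (I - Q) is well-conditioned enough to invert.
    I = np.broadcast_to(np.eye(Q.shape[1]), Q.shape)
    conds = np.linalg.cond(I - Q)  # np.linalg.cond accepts stacked matrices
    return conds < tol
```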
If you want to avoid repeatedly calling np.linalg.matrix_power, you can build the running sum incrementally: start with an identity array for Q⁰ and then multiply by Q at each step:
```python
import numpy as np

def sum_of_powers_iterative(Q: np.ndarray, k: int) -> np.ndarray:
    # Create a batched identity matrix (same shape as Q)
    I = np.broadcast_to(np.eye(Q.shape[1]), Q.shape).copy()
    S = I.copy()      # S will accumulate the sum
    power = Q.copy()  # current power: Q^1 initially
    # Loop from 1 to k-1 (since Q^0 is already in S)
    for _ in range(1, k):
        S += power
        power = power @ Q  # advance to the next exponent
    return S
```
This way you multiply once per iteration and update the previous result, potentially lowering overhead if k is large.
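A quick sanity check, assuming both this function and the original `sum_of_powers` from the question are in scope:

```python
import numpy as np

# Compare against the question's implementation on a small random stack
rng = np.random.default_rng(0)
Q = rng.standard_normal((100, 25, 25)) * 0.1
assert np.allclose(sum_of_powers_iterative(Q, 10), sum_of_powers(Q, 10))
```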
Your current method

```python
def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    Qs = np.sum([np.linalg.matrix_power(Q, n) for n in range(k)], axis=0)
    return Qs
```

builds a Python list of k arrays and then sums them with NumPy. The overhead comes from:

- calling np.linalg.matrix_power k times, recomputing each power from scratch, and
- materializing all k intermediate arrays before the final sum.

For small k this may be fast enough, but for large k or very large stacks it will be less efficient than the two approaches above.
Choose the one that best fits your numerical assumptions and k value.
Answer by NebulousResearcher216 • 2 months ago
TL;DR: Yes, the code can be massively sped up with more efficient code written in Numba that manages CPU and memory resources carefully and has a lower complexity.
The code in the question is very inefficient for multiple reasons.
First of all, the algorithmic complexity of the code is superlinear in k: each np.linalg.matrix_power call recomputes its power from scratch (roughly O(k log k) multiplications in total, O(k**2) if the powers were multiplied out naively), while the whole sum can be performed in O(k) with successive matrix multiplications, as pointed out by @Homer512.
Moreover, the operation is not cache friendly. Each matrix multiplication operates on a batch of 4000 25x25 matrices, which means 2*8*4000*25*25 ≈ 38 MiB of memory is needed to perform the computation. This does not fit in the LLC cache of most CPUs. As a result, the slow DRAM is used to store the data, so a significant amount of data needs to be fetched and stored back on each iteration (≈114 MiB/iteration).
Additionally, the operation runs sequentially: NumPy does not parallelize it, and the BLAS library used for the matrix multiplications cannot parallelize it either, because a 25x25 matrix is too small for multithreading to pay off.
Finally, each matrix multiplication is certainly sub-optimal because of the overhead of NumPy's internal iterators and of calling BLAS functions once per matrix.
We can write a Numba code that iteratively performs the matrix multiplication on each 25x25 matrix and sums the results up. The code can be parallelized trivially so that each thread operates on a chunk along the first axis (of size 4000). The operation is thus parallel, cache friendly, free of NumPy/CPython overhead, and SIMD-friendly thanks to Numba. Here is the resulting code:
```python
import numpy as np
import numba as nb

@nb.njit('(float64[:,:,::1], int32)', parallel=True)
def sum_of_powers_numba(Q: np.ndarray, k: int) -> np.ndarray:
    s0, s1, s2 = Q.shape
    assert s1 == s2
    Qs = np.zeros((s0, s1, s1))
    for i in nb.prange(s0):
        tmp = np.eye(s1, s1)   # Q^0 for this matrix of the stack
        Qs[i, :, :] = tmp
        for n in range(k - 1):
            tmp = tmp @ Q[i]   # advance to the next power
            Qs[i, :, :] += tmp
    return Qs
```
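Because the signature is given explicitly, Numba compiles the function eagerly at decoration time, and the input must be a C-contiguous float64 array. A hypothetical usage sketch:

```python
import numpy as np

# np.ascontiguousarray is a no-op if Q is already C-contiguous float64
Q = np.ascontiguousarray(np.random.rand(4000, 25, 25))
result = sum_of_powers_numba(Q, 30)
```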
The BLAS library seems to be sub-optimal for 25x25 matrices, but not so bad either. Writing a faster implementation in Numba is very challenging; it is easier in native languages. I do not think it is worth the effort of rewriting a specialized implementation for small matrices. I expect the BLIS library to be a bit faster (close to optimal) than the default BLAS implementation (OpenBLAS), so it is certainly better to just try it (or possibly the highly optimized Intel MKL).
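To see which BLAS/LAPACK implementation your NumPy build actually links against before swapping in BLIS or MKL, you can inspect the build configuration:

```python
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against
```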
Besides, if you want a more accurate version, you should adapt the code to perform exponentiation by squaring, as shown by @Homer512, while still using Numba.
Here are results on my i5-9600KF CPU (6 cores):
```bash
sum_of_powers_naive:     23.6 sec
sum_of_powers_homer512:   7.8 sec
sum_of_powers_numba:      1.5 sec   <-----
user2357112's option2:    0.16 sec
user2357112's option1:    0.13 sec
```
Interestingly, profiling results show that about 80% of the time is spent in the BLAS library, so the main speedup comes from parallelizing the code with nb.prange in Numba.
In the end, the approach of @user2357112 is the best since it is the fastest and certainly the most numerically stable. Note that it can be made even faster by parallelizing the computation over the first axis.
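For instance, here is a minimal sketch of that parallelization (the wrapper name is mine), assuming option2 from @user2357112's answer is in scope; the actual speedup depends on how much of the work happens inside GIL-releasing BLAS calls:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def option2_parallel(Q, k, workers=6):
    # Hypothetical wrapper: split the stack along the first axis and run
    # option2 on each chunk in its own thread.
    chunks = np.array_split(Q, workers)
    with ThreadPoolExecutor(workers) as pool:
        parts = list(pool.map(lambda chunk: option2(chunk, k), chunks))
    return np.concatenate(parts)
```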
Answer by NovaOrbiter546 • 2 months ago
Matrix exponentiation by integer powers is the same as chaining matrix multiplications. Since you sum up every power up to a limit, you can simply do regular matrix multiplications.
```python
import numpy as np

def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    def gen_powers():
        if k < 0:
            raise ValueError("negative power not implemented")
        # zeroth power
        yield np.broadcast_to(np.eye(Q.shape[-1], dtype=Q.dtype), Q.shape)
        Qi = Q
        for _ in range(1, k):
            yield Qi
            Qi = Qi @ Q
    return sum(gen_powers())
```
Performance considerations:

- Using the builtin `sum` instead of `np.sum` saves a ton of temporary memory.
- Even more effective versions using `+=` are possible, avoiding the temporary allocations that come from repeated `array + array`.
- `np.matmul(…, out=…)` can similarly write each product into a preallocated buffer instead of allocating a new array (see the sketch after this list).
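A minimal sketch (the function name is mine, not from the answer) combining `+=` accumulation with `np.matmul(..., out=...)` buffer reuse:

```python
import numpy as np

def sum_of_powers_inplace(Q: np.ndarray, k: int) -> np.ndarray:
    S = np.empty_like(Q)
    S[:] = np.eye(Q.shape[-1], dtype=Q.dtype)  # start from Q^0
    Qi = Q.copy()                # current power Q^n, starting at n = 1
    scratch = np.empty_like(Q)   # reusable buffer for the next power
    for _ in range(1, k):
        S += Qi                        # accumulate without a temporary
        np.matmul(Qi, Q, out=scratch)  # next power into the buffer
        Qi, scratch = scratch, Qi      # swap buffers, no reallocation
    return S
```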
Accuracy considerations:

- `np.linalg.matrix_power` needs only O(log(n)) matrix multiplications instead of a chain of n matrix multiplications, which accumulates less rounding error.
- `np.sum` uses pairwise summation, which also has lower errors but also takes a lot of memory.

Since both accuracy improvements need a history of past powers, you can recoup their costs by using both at the same time if you're smart about the implementation. This more accurate version is implemented here:
```python
import numpy as np

def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    # Store Q^0 .. Q^(k-1); reserve at least one slot so k == 0 still works
    Qis = np.empty((max(k, 1),) + Q.shape, dtype=Q.dtype)
    Qis[0] = np.eye(Q.shape[-1], dtype=Q.dtype)
    if k <= 1:
        return Qis[0]
    Qis[1] = Q
    for ki in range(2, k):
        # Exponentiation by squaring:
        #   even powers: Q^ki = Q^(ki//2) @ Q^(ki//2)
        #   odd powers:  Q^ki = Q^(ki//2) @ Q^(ki//2 + 1)
        leftk = ki >> 1
        rightk = leftk + (ki & 1)
        np.matmul(Qis[leftk], Qis[rightk], out=Qis[ki])
    return Qis.sum(axis=0)
```
There is still a difference in accuracy but I'm not sure my version is actually worse. It might be better since normally matrix powers are implemented via eigen decompositions and those are not free of rounding errors, either. But I have not checked numpy's implementation or done any further theoretical considerations.
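One hedged way to probe this empirically (a sketch, not from the answer): repeat the computation in extended precision via the naive formula and measure the deviation of the float64 result:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 25, 25)) * 0.05   # small entries keep powers tame
k = 30
# Extended-precision reference using the naive sum of matrix powers
ref = sum(np.linalg.matrix_power(Q.astype(np.longdouble), n) for n in range(k))
err = np.abs(sum_of_powers(Q, k) - ref).max()
print(f"max abs deviation from extended precision: {err:.3e}")
```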
Answer by CosmicWayfarer695 • 2 months ago
We can perform this calculation in O(log k) matrix operations.
Let M(k) represent the k'th power of the input, and S(k) represent the sum of those powers from 0 to k. Let I represent an appropriate identity matrix.
If you expand the product, you'll find that (M(1) - I) * S(k) = M(k+1) - I. That means we can compute M(k+1) using a standard matrix power (which takes O(log k) matrix multiplications), and compute S(k) by using numpy.linalg.solve to solve the equation (M(1) - I) * S(k) = M(k+1) - I:
```python
import numpy.linalg

def option1(Q, k):
    identity = numpy.eye(Q.shape[-1])
    A = Q - identity
    B = numpy.linalg.matrix_power(Q, k + 1) - identity
    return numpy.linalg.solve(A, B)
```
This does depend on M(1) - I being invertible, though. Fortunately, there's another approach that doesn't have that limitation.
The standard exponentiation by squaring algorithm computes M(2*k) as M(k)*M(k) and M(2*k+1) as M(2*k)*M(1).
We can alter the algorithm to track both S(k-1) and M(k), computing S(2*k-1) as S(k-1)*M(k) + S(k-1) and S(2*k) as S(2*k-1) + M(2*k):
```python
import numpy

def option2(Q, k):
    identity = numpy.eye(Q.shape[-1])
    if k == 0:
        res = numpy.empty_like(Q)
        res[:] = identity
        return res
    power = Q
    sum_of_powers = identity
    # Looping over a string might look dumb, but it's actually the most efficient option,
    # as well as the simplest. (It wouldn't be the bottleneck even if it wasn't efficient.)
    for bit in bin(k + 1)[3:]:
        sum_of_powers = (sum_of_powers @ power) + sum_of_powers
        power = power @ power
        if bit == "1":
            sum_of_powers += power
            power = power @ Q
    return sum_of_powers
```
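A quick consistency check between the two options and the naive sum (a sketch; note the convention here is S(k), i.e. powers 0 through k):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 25, 25)) * 0.1
k = 17
naive = sum(np.linalg.matrix_power(Q, n) for n in range(k + 1))  # powers 0..k
assert np.allclose(option1(Q, k), naive)
assert np.allclose(option2(Q, k), naive)
```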