Asked 1 month ago by NebulousCollector917
How can I efficiently compute the sum of powers for a 3D NumPy array?
I have a 3D NumPy array Q with a shape like (4000, 25, 25) and I need to calculate the sum of its powers from 0 to k-1, i.e.,
$\sum_{n=0}^{k-1} Q^n$
My current implementation is as follows:
```python
import numpy as np

def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    Qs = np.sum([np.linalg.matrix_power(Q, n) for n in range(k)], axis=0)
    return Qs
```
This function works as expected, but I'm looking for ways to speed it up. Is it possible to optimize this function or use another method—such as a closed-form solution, iterative multiplication, exponentiation by squaring, or even leveraging tools like Numba—to achieve the same result more efficiently? Please share any suggestions or improvements.
Answer by TychoAI Agent • 1 month ago
Below is a discussion of a few strategies you might try:
If each matrix Q is such that (I - Q) is nonsingular, you can use the formula

$\sum_{n=0}^{k-1} Q^n = (I - Q^k)(I - Q)^{-1}$
to compute the sum in one go. This avoids computing every Qⁿ separately. For your stacked matrices (shape (4000, 25, 25)), you can do a batched inversion as follows:
```python
import numpy as np

def sum_of_powers_closed_form(Q: np.ndarray, k: int) -> np.ndarray:
    # Build a batched identity matrix with the same shape as Q
    I = np.broadcast_to(np.eye(Q.shape[1]), Q.shape)
    # Compute Q**k for each matrix in the stack
    Qk = np.linalg.matrix_power(Q, k)
    # Compute the inverse (note: this assumes (I - Q) is invertible for every matrix)
    inv_I_minus_Q = np.linalg.inv(I - Q)
    return (I - Qk) @ inv_I_minus_Q
Note: This formula has numerical pitfalls if (I – Q) is close to singular, so check that your matrices (or application) allow it.
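As a quick screen, here is a hedged sketch (the helper name is mine, not part of the answer) that flags matrices in the stack for which the closed form is risky, using the batched `np.linalg.cond`:

```python
import numpy as np

def closed_form_is_safe(Q: np.ndarray, tol: float = 1e8) -> np.ndarray:
    # Hypothetical helper: returns a boolean mask over the stack marking
    # matrices for which (I - Q) is well-conditioned enough to invert.
    I = np.broadcast_to(np.eye(Q.shape[1]), Q.shape)
    conds = np.linalg.cond(I - Q)  # np.linalg.cond accepts stacked matrices
    return conds < tol
```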
If you want to avoid repeatedly calling np.linalg.matrix_power, you can build the running sum incrementally: start with an identity array for Q⁰ and then multiply by Q at each step:
```python
import numpy as np

def sum_of_powers_iterative(Q: np.ndarray, k: int) -> np.ndarray:
    # Create a batched identity matrix (same shape as Q)
    I = np.broadcast_to(np.eye(Q.shape[1]), Q.shape).copy()
    S = I.copy()      # S will accumulate the sum
    power = Q.copy()  # current power: Q^1 initially
    # Loop from 1 to k-1 (since Q^0 is already in S)
    for _ in range(1, k):
        S += power
        power = power @ Q  # advance to the next exponent
    return S
```
This way you multiply once per iteration and update the previous result, potentially lowering overhead if k is large.
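A quick sanity check, assuming both this function and the original `sum_of_powers` from the question are in scope:

```python
import numpy as np

# Compare against the question's implementation on a small random stack
rng = np.random.default_rng(0)
Q = rng.standard_normal((100, 25, 25)) * 0.1
assert np.allclose(sum_of_powers_iterative(Q, 10), sum_of_powers(Q, 10))
```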
Your current method

```python
def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    Qs = np.sum([np.linalg.matrix_power(Q, n) for n in range(k)], axis=0)
    return Qs
```

builds a Python list of k arrays and then sums them with NumPy. The overhead comes from:

- calling np.linalg.matrix_power k times, recomputing each power from scratch, and
- materializing all k intermediate arrays before the final sum.

For small k this may be fast enough, but for large k or very large stacks it will be less efficient than the two approaches above.
Choose the one that best fits your numerical assumptions and k value.
Answer by NebulousResearcher216 • 2 months ago
TL;DR: Yes, the code can be massively sped up with more efficient code written in Numba that manages CPU and memory resources carefully and has a lower complexity.
The code in the question is very inefficient for multiple reasons.
First of all, the algorithmic complexity of the code is superlinear in k: each np.linalg.matrix_power call recomputes its power from scratch (roughly O(k log k) multiplications in total, O(k**2) if the powers were multiplied out naively), while the whole sum can be performed in O(k) with successive matrix multiplications, as pointed out by @Homer512.
Moreover, the operation is not cache friendly. Each matrix multiplication operates on a batch of 4000 25x25 matrices, which means 2*8*4000*25*25 ≈ 38 MiB of memory is needed to perform the computation. This does not fit in the LLC cache of most CPUs. As a result, the slow DRAM is used to store the data, so a significant amount of data needs to be fetched and stored back on each iteration (≈114 MiB/iteration).
Additionally, the operation runs sequentially: NumPy does not parallelize it, and the BLAS library used for the matrix multiplications cannot parallelize it either, because a 25x25 matrix is too small for multithreading to pay off.
Finally, each matrix multiplication is certainly sub-optimal because of the overhead of NumPy's internal iterators and of calling BLAS functions once per matrix.
We can write a Numba code that iteratively performs the matrix multiplication on each 25x25 matrix and sums the results up. The code can be parallelized trivially so that each thread operates on a chunk along the first axis (of size 4000). The operation is thus parallel, cache friendly, free of NumPy/CPython overhead, and SIMD-friendly thanks to Numba. Here is the resulting code:
```python
import numpy as np
import numba as nb

@nb.njit('(float64[:,:,::1], int32)', parallel=True)
def sum_of_powers_numba(Q: np.ndarray, k: int) -> np.ndarray:
    s0, s1, s2 = Q.shape
    assert s1 == s2
    Qs = np.zeros((s0, s1, s1))
    for i in nb.prange(s0):
        tmp = np.eye(s1, s1)   # Q^0 for this matrix of the stack
        Qs[i, :, :] = tmp
        for n in range(k - 1):
            tmp = tmp @ Q[i]   # advance to the next power
            Qs[i, :, :] += tmp
    return Qs
```
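Because the signature is given explicitly, Numba compiles the function eagerly at decoration time, and the input must be a C-contiguous float64 array. A hypothetical usage sketch:

```python
import numpy as np

# np.ascontiguousarray is a no-op if Q is already C-contiguous float64
Q = np.ascontiguousarray(np.random.rand(4000, 25, 25))
result = sum_of_powers_numba(Q, 30)
```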
The BLAS library seems to be sub-optimal for 25x25 matrices, but not so bad either. Writing a faster implementation in Numba is very challenging; it is easier in native languages. I do not think it is worth the effort of rewriting a specialized implementation for small matrices. I expect the BLIS library to be a bit faster (close to optimal) than the default BLAS implementation (OpenBLAS), so it is certainly better to just try it (or possibly the highly optimized Intel MKL).
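To see which BLAS/LAPACK implementation your NumPy build actually links against before swapping in BLIS or MKL, you can inspect the build configuration:

```python
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against
```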
Besides, if you want a more accurate version, you should adapt the code to perform exponentiation by squaring, as shown by @Homer512, while still using Numba.
Here are results on my i5-9600KF CPU (6 cores):
```bash
sum_of_powers_naive:     23.6 sec
sum_of_powers_homer512:   7.8 sec
sum_of_powers_numba:      1.5 sec   <-----
user2357112's option2:    0.16 sec
user2357112's option1:    0.13 sec
```
Interestingly, profiling results show that about 80% of the time is spent in the BLAS library, so the main speedup comes from parallelizing the code with nb.prange in Numba.
In the end, the approach of @user2357112 is the best since it is the fastest and certainly the most numerically stable. Note that it can be made even faster by parallelizing the computation over the first axis.
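For instance, here is a minimal sketch of that parallelization (the wrapper name is mine), assuming option2 from @user2357112's answer is in scope; the actual speedup depends on how much of the work happens inside GIL-releasing BLAS calls:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def option2_parallel(Q, k, workers=6):
    # Hypothetical wrapper: split the stack along the first axis and run
    # option2 on each chunk in its own thread.
    chunks = np.array_split(Q, workers)
    with ThreadPoolExecutor(workers) as pool:
        parts = list(pool.map(lambda chunk: option2(chunk, k), chunks))
    return np.concatenate(parts)
```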
Answer by NovaOrbiter546 • 2 months ago
Matrix exponentiation by integer powers is the same as chaining matrix multiplications. Since you sum up every power up to a limit, you can simply do regular matrix multiplications.
```python
import numpy as np

def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    def gen_powers():
        if k < 0:
            raise ValueError("negative power not implemented")
        # zeroth power
        yield np.broadcast_to(np.eye(Q.shape[-1], dtype=Q.dtype), Q.shape)
        Qi = Q
        for _ in range(1, k):
            yield Qi
            Qi = Qi @ Q
    return sum(gen_powers())
```
Performance considerations:

- Using the builtin `sum` instead of `np.sum` saves a ton of temporary memory.
- Even more effective versions using `+=` are possible, avoiding the temporary allocations that come from repeated `array + array`.
- `np.matmul(…, out=…)` can similarly write each product into a preallocated buffer instead of allocating a new array (see the sketch after this list).
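A minimal sketch (the function name is mine, not from the answer) combining `+=` accumulation with `np.matmul(..., out=...)` buffer reuse:

```python
import numpy as np

def sum_of_powers_inplace(Q: np.ndarray, k: int) -> np.ndarray:
    S = np.empty_like(Q)
    S[:] = np.eye(Q.shape[-1], dtype=Q.dtype)  # start from Q^0
    Qi = Q.copy()                # current power Q^n, starting at n = 1
    scratch = np.empty_like(Q)   # reusable buffer for the next power
    for _ in range(1, k):
        S += Qi                        # accumulate without a temporary
        np.matmul(Qi, Q, out=scratch)  # next power into the buffer
        Qi, scratch = scratch, Qi      # swap buffers, no reallocation
    return S
```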
Accuracy considerations:

- `np.linalg.matrix_power` needs only O(log(n)) matrix multiplications instead of a chain of n matrix multiplications, which accumulates less rounding error.
- `np.sum` uses pairwise summation, which also has lower errors but also takes a lot of memory.

Since both accuracy improvements need a history of past powers, you can recoup their costs by using both at the same time if you're smart about the implementation. This more accurate version is implemented here:
```python
import numpy as np

def sum_of_powers(Q: np.ndarray, k: int) -> np.ndarray:
    # Store Q^0 .. Q^(k-1); reserve at least one slot so k == 0 still works
    Qis = np.empty((max(k, 1),) + Q.shape, dtype=Q.dtype)
    Qis[0] = np.eye(Q.shape[-1], dtype=Q.dtype)
    if k <= 1:
        return Qis[0]
    Qis[1] = Q
    for ki in range(2, k):
        # Exponentiation by squaring:
        #   even powers: Q^ki = Q^(ki//2) @ Q^(ki//2)
        #   odd powers:  Q^ki = Q^(ki//2) @ Q^(ki//2 + 1)
        leftk = ki >> 1
        rightk = leftk + (ki & 1)
        np.matmul(Qis[leftk], Qis[rightk], out=Qis[ki])
    return Qis.sum(axis=0)
```
There is still a difference in accuracy but I'm not sure my version is actually worse. It might be better since normally matrix powers are implemented via eigen decompositions and those are not free of rounding errors, either. But I have not checked numpy's implementation or done any further theoretical considerations.
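One hedged way to probe this empirically (a sketch, not from the answer): repeat the computation in extended precision via the naive formula and measure the deviation of the float64 result:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 25, 25)) * 0.05   # small entries keep powers tame
k = 30
# Extended-precision reference using the naive sum of matrix powers
ref = sum(np.linalg.matrix_power(Q.astype(np.longdouble), n) for n in range(k))
err = np.abs(sum_of_powers(Q, k) - ref).max()
print(f"max abs deviation from extended precision: {err:.3e}")
```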
Answer by CosmicWayfarer695 • 2 months ago
We can perform this calculation in O(log k) matrix operations.
Let M(k) represent the k'th power of the input, and S(k) represent the sum of those powers from 0 to k. Let I represent an appropriate identity matrix.
If you expand the product, you'll find that (M(1) - I) * S(k) = M(k+1) - I. That means we can compute M(k+1) using a standard matrix power (which takes O(log k) matrix multiplications), and compute S(k) by using numpy.linalg.solve to solve the equation (M(1) - I) * S(k) = M(k+1) - I:
```python
import numpy.linalg

def option1(Q, k):
    identity = numpy.eye(Q.shape[-1])
    A = Q - identity
    B = numpy.linalg.matrix_power(Q, k + 1) - identity
    return numpy.linalg.solve(A, B)
```
This does depend on M(1) - I being invertible, though. Fortunately, there's another approach that doesn't have that limitation.
The standard exponentiation by squaring algorithm computes M(2*k) as M(k)*M(k) and M(2*k+1) as M(2*k)*M(1).
We can alter the algorithm to track both S(k-1) and M(k), computing S(2*k-1) as S(k-1)*M(k) + S(k-1) and S(2*k) as S(2*k-1) + M(2*k):
```python
import numpy

def option2(Q, k):
    identity = numpy.eye(Q.shape[-1])
    if k == 0:
        res = numpy.empty_like(Q)
        res[:] = identity
        return res
    power = Q
    sum_of_powers = identity
    # Looping over a string might look dumb, but it's actually the most efficient option,
    # as well as the simplest. (It wouldn't be the bottleneck even if it wasn't efficient.)
    for bit in bin(k + 1)[3:]:
        sum_of_powers = (sum_of_powers @ power) + sum_of_powers
        power = power @ power
        if bit == "1":
            sum_of_powers += power
            power = power @ Q
    return sum_of_powers
```
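A quick consistency check between the two options and the naive sum (a sketch; note the convention here is S(k), i.e. powers 0 through k):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 25, 25)) * 0.1
k = 17
naive = sum(np.linalg.matrix_power(Q, n) for n in range(k + 1))  # powers 0..k
assert np.allclose(option1(Q, k), naive)
assert np.allclose(option2(Q, k), naive)
```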