
How can I optimize the nested-loop construction of a large NumPy matrix S1?

Asked 1 month ago by PlutonianGuide152


I need to construct a matrix S1 of size num x num, where num = d^2 and d is large (for example, d = 81 gives num = 6561). The current code computes each element by performing a trace of a product involving precomputed matrices, which results in 6561^2 iterations and very slow performance. Here’s the relevant part of my code:

PYTHON
S1 = np.zeros((num, num), dtype=np.complex128)
for i in range(num):
    for j in range(num):
        S1[i][j] = np.trace(np.dot(b1[i], np.dot(b1[j], state)))

In this context, 'b1' is a list of d x d Hermitian matrices (of length d^2) forming an orthonormal basis under the Hilbert-Schmidt inner product, and 'state' is a d x d density matrix. I generate 'b1' using the following function based on an idea from this answer:

PYTHON
def basis(n):
    list2 = []
    list3 = []
    list4 = []
    list5 = []
    list6 = []
    # Off-diagonal generators (symmetric and antisymmetric pairs).
    for i in range(0, n):
        for j in range(i + 1, n):
            m1 = np.zeros((n, n), dtype=np.complex128)
            m1[i, j] = 1j
            m1[j, i] = 1j
            list2.append(m1 / np.sqrt(2))
            m2 = np.zeros((n, n), dtype=np.complex128)
            m2[i, j] = -1
            m2[j, i] = 1
            list3.append(m2 / np.sqrt(2))
    # Diagonal traceless generators, orthonormalized via QR.
    for i in range(0, n - 1):
        m3 = np.zeros(n, dtype=np.complex128)
        m3[i] = 1j
        m3[i + 1] = -1j
        list4.append(m3)
        org = np.linalg.qr(np.array(list4).T)[0]
        l1 = org[:, i]
        list5.append(l1)
        l2 = np.diag(list5[i])
        list6.append(l2)
    return [np.identity(n) / np.sqrt(n)] + [mat * (-1j) for mat in (list2 + list3 + list6)]

Because both 'b1' and 'state' are quite sparse, I’m looking for ways to exploit their sparsity or use vectorized operations instead of these nested loops to construct S1 more efficiently. Any suggestions or strategies to improve this performance would be greatly appreciated.


2 Answers

Answer by StackOverflowTychoAI Agent 1 month ago


Below is one strategy to improve performance: replace the Python-level nested loops with a vectorized operation. If you can pack your basis into a 3-dimensional NumPy array (say, with shape (num, d, d)), then you can use tensor contractions (for example, via np.einsum) to compute S1 all at once.

For example, note that

  S1[i, j] = trace(b1[i] @ (b1[j] @ state))

can be rewritten as the double contraction

  S1[i, j] = sum_{p,q} (b1[i])[p, q] * (b1[j] @ state)[q, p]

If you let A be the tensor of all b1 matrices, with shape (num, d, d), then you can precompute

  P = np.matmul(A, state)

which gives an array of shape (num, d, d) with P[j] = b1[j] @ state.

Then the whole S1 may be constructed as

PYTHON
# A: a (num, d, d) array with A[i] = b1[i]
P = np.matmul(A, state)                 # shape (num, d, d)
S1 = np.einsum('ipq,jqp->ij', A, P)

This np.einsum call contracts over the indices p and q, so that for each i and j you obtain the trace of b1[i] @ (b1[j] @ state).
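As a quick sanity check, the vectorized expression can be compared with the original loop on a small random problem (a minimal sketch; the size d = 3 and the random complex matrices are placeholders, not your actual basis or state):

PYTHON
import numpy as np

# Small placeholder problem: random complex matrices stand in for b1 and state.
d = 3
num = d ** 2
rng = np.random.default_rng(0)
A = rng.standard_normal((num, d, d)) + 1j * rng.standard_normal((num, d, d))
state = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

# Vectorized version.
P = np.matmul(A, state)
S1_fast = np.einsum('ipq,jqp->ij', A, P)

# Reference: the original double loop.
S1_loop = np.array([[np.trace(A[i] @ A[j] @ state) for j in range(num)]
                    for i in range(num)])

assert np.allclose(S1_fast, S1_loop)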

Additional Considerations

  1. Memory usage:
    Packing all b1 matrices into a single tensor A uses memory roughly equal to
      num × d × d × sizeof(complex128).
    For example, if d = 81 then num = 6561, so you store ≈ 6561 × 81 × 81 ≈ 43 million complex numbers, which at 16 bytes each comes out to around 700 MB. If this is too heavy, you can work block-wise (see the sketch after this list).

  2. Exploiting sparsity:
    If b1 and state are sparse, consider storing them in a sparse format (for example, SciPy's sparse matrices). If many entries are zero, sparse matrix routines (or custom loops that iterate only over nonzero entries) can reduce the computation time. Note, however, that converting many small d×d sparse matrices into a block-sparse structure may take some work.

  3. Profiling before switching strategy:
    Profile first to confirm where the time goes; here the dominant cost is the Python-level nested loop itself. Moving the costly innermost operations into vectorized NumPy calls, using np.einsum (or sometimes np.tensordot), should provide a very significant speedup.
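For the memory concern in point 1, here is a minimal block-wise sketch. It assumes b1 is still a Python list; the block size of 256 is an arbitrary choice, and P is kept whole for simplicity, so this bounds the extra memory for the stacked basis slabs rather than eliminating the large arrays entirely:

PYTHON
import numpy as np

def build_S1_blockwise(b1, state, block=256):
    num = len(b1)
    # P[j] = b1[j] @ state; kept whole here for simplicity.
    P = np.stack([b @ state for b in b1])             # shape (num, d, d)
    S1 = np.empty((num, num), dtype=np.complex128)
    for start in range(0, num, block):
        stop = min(start + block, num)
        A_blk = np.stack(b1[start:stop])              # shape (<=block, d, d)
        # Contract one slab of rows at a time to bound peak memory.
        S1[start:stop] = np.einsum('ipq,jqp->ij', A_blk, P)
    return S1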

Summary

Using vectorized NumPy operations you can replace the double loop:

PYTHON
# Original code:
S1 = np.zeros((num, num), dtype=np.complex128)
for i in range(num):
    for j in range(num):
        S1[i, j] = np.trace(np.dot(b1[i], np.dot(b1[j], state)))

with the vectorized version:

PYTHON
# Convert the list of d x d matrices to a single (num, d, d) tensor
A = np.array(b1)             # ensure b1 is a NumPy array with shape (num, d, d)
P = np.matmul(A, state)      # shape (num, d, d)
S1 = np.einsum('ipq,jqp->ij', A, P)

This approach replaces the overhead of nested Python loops with NumPy's optimized, vectorized routines (and could exploit the sparsity structure further if you switch to sparse formats), which dramatically speeds up the calculation.



Answer by CosmicResearcher223 1 month ago


A first, trivial optimization is to hoist np.dot(b1[j], state) (call it tmp) out of the inner loop by making j the outer loop, so it is computed once per j instead of num = d**2 = 6561 times. Since this product is one of the two matrix multiplications per iteration, this alone should make the computation nearly twice as fast.

Another optimization is to avoid the matrix multiplication b1[i] @ tmp entirely (where @ is the matrix-multiplication operator). Indeed, np.trace needs only the diagonal of the result (it sums the items on this diagonal), so the whole np.trace(...) expression can be computed as np.sum(b1[i] * tmpT), where tmpT = tmp.T can also be precomputed. This replaces an O(d**3) matrix product with an O(d**2) element-wise one, making the operation about d times faster (i.e. 81 times here).
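The identity behind this trick is trace(X @ Y) == sum(X * Y.T); here is a tiny check with arbitrary random matrices, purely for illustration:

PYTHON
import numpy as np

# trace(X @ Y) only needs the diagonal of X @ Y, and the sum of that
# diagonal equals the element-wise sum of X * Y.T.
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
Y = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
assert np.isclose(np.trace(X @ Y), np.sum(X * Y.T))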

Here is the final code:

PYTHON
S1 = np.zeros((num, num), dtype=np.complex128)
for j in range(num):
    tmpT = (b1[j] @ state).T.copy()
    for i in range(num):
        S1[i][j] = np.sum(b1[i] * tmpT)

If b1 were a NumPy array rather than a list, we could optimize this further: np.einsum, or even a plain matrix multiplication, can compute np.sum(b1[i] * tmpT) for many i simultaneously (see the sketch below). On a CPU with T cores this is roughly T times faster (typically 5~15 times in practice on a mainstream PC) because BLAS operations run in parallel. You can also parallelize the loop manually with tools like Numba, Cython, and certainly NumExpr too (not as fast as BLAS matrix multiplications, but not far off). In the end, the whole computation should be around three orders of magnitude faster than the original code.
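For completeness, here is a minimal sketch of that batched formulation, assuming b1 has first been stacked into a (num, d, d) array. Flattening each matrix into a row vector turns all the np.sum(b1[i] * tmpT) reductions into a single BLAS matrix multiplication:

PYTHON
import numpy as np

A = np.asarray(b1)                               # shape (num, d, d)
num = A.shape[0]
tmpT = np.matmul(A, state).transpose(0, 2, 1)    # tmpT[j] = (b1[j] @ state).T

# Each S1[i, j] = np.sum(A[i] * tmpT[j]) is a dot product of flattened
# matrices, so one matrix multiplication computes all (i, j) pairs at once.
S1 = A.reshape(num, -1) @ tmpT.reshape(num, -1).T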

