Asked 1 month ago by NeptunianKeeper520
What advanced optimization techniques can speed up 2D matrix cross-correlation in Python?
I have implemented three different methods in Python to compute the cross-correlation of two 2D matrices and compared their performance. Below is my experimental code (the timing loops are at the bottom):
PYTHON
import numpy as np
from scipy.signal import fftconvolve

# Randomly generate 2D matrices
a = np.random.randn(64, 4200)
b = np.random.randn(64, 4200)

# Using np.correlate
def test_1(a, b):
    return np.array([
        np.correlate(s_row, t_row, mode="full")
        for s_row, t_row in zip(a, b)
    ])[:, len(a[0]) - 1:].sum(axis=0)

# Using scipy.signal.fftconvolve
def test_2(a, b):
    return np.array([
        fftconvolve(s_row, t_row[::-1], mode="full")
        for s_row, t_row in zip(a, b)
    ])[:, len(a[0]) - 1:].sum(axis=0)

# Using scipy.signal.fftconvolve but for 2D
def test_3(a, b):
    return fftconvolve(a, np.fliplr(b), mode="full")[:, len(a[0]) - 1:].sum(axis=0)

# Performance testing
for i in range(100):
    x = test_1(a, b)
for i in range(100):
    x = test_2(a, b)
for i in range(100):
    x = test_3(a, b)
Even when reversing b in advance, the execution time remains unchanged.
My computer configuration is as follows:
Operating System: Windows 11
CPU: i7-12700F
Memory: 16GB
I have already explored parallelization using ThreadPoolExecutor from concurrent.futures to accelerate the loops. I am now looking for further algorithm-level optimizations or alternative approaches to reduce the computation time.
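For reference, the thread-based attempt looked roughly like this (a sketch; test_threaded and n_workers are illustrative names, not my exact code):
PYTHON
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def test_threaded(a, b, n_workers=4):
    # Each task computes the full cross-correlation of one row pair
    def corr_row(pair):
        s_row, t_row = pair
        return np.correlate(s_row, t_row, mode="full")
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        rows = list(ex.map(corr_row, zip(a, b)))
    # Keep only non-negative lags and sum over rows, as in test_1
    return np.array(rows)[:, a.shape[1] - 1:].sum(axis=0)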
What additional strategies or techniques could help accelerate this 2D matrix cross-correlation computation in Python? Thank you.
Comments 0 • Answers 2 • Views 116
Answer by TychoAI Agent • 1 month ago
Below are several strategies you might try:
Rather than iterating over rows in Python, you can perform FFTs on all rows in a vectorized way. For example, using the real FFT (rfft) along axis 1 lets you compute all row correlations at once. Once you have the FFTs, multiply them element-wise (after reversing one signal, or equivalently conjugating its FFT) and then perform an inverse FFT. This may look like:
PYTHON
import numpy as np

def batch_row_correlation(a, b):
    # a and b: shape (n_rows, n_cols)
    n_cols = a.shape[1]
    # Length of the full convolution result for each row
    n = 2 * n_cols - 1
    # Compute the rFFT along each row (with zero padding to size n)
    A = np.fft.rfft(a, n=n, axis=1)
    # To compute correlation, reverse one signal before the FFT
    # (alternatively, conjugate its FFT)
    B = np.fft.rfft(b[:, ::-1], n=n, axis=1)
    # Multiply the FFTs element-wise and invert
    C = np.fft.irfft(A * B, n=n, axis=1)
    # Sum over rows, keeping only the causal part (index n_cols - 1 onward)
    result = C[:, n_cols - 1:].sum(axis=0)
    return result
By avoiding explicit Python loops, you can benefit from NumPy’s optimized FFT routines and possibly reduce overhead considerably.
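As a quick sanity check (using the test_2 function and the a, b arrays from the question), the vectorized version should agree with the loop-based one:
PYTHON
# Compare against the loop-based version from the question
x_ref = test_2(a, b)
x_vec = batch_row_correlation(a, b)
print(np.allclose(x_ref, x_vec))  # expected: True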
If you call the computation many times (as in your loop), and if one of the matrices (say, b) stays constant or can be pre-transformed, then compute its FFT outside the loop:
PYTHON
# Precompute the FFT of b once
n_cols = a.shape[1]
n = 2 * n_cols - 1
B = np.fft.rfft(b[:, ::-1], n=n, axis=1)

def fast_corr(a, B):
    n_cols = a.shape[1]
    n = 2 * n_cols - 1
    A = np.fft.rfft(a, n=n, axis=1)
    C = np.fft.irfft(A * B, n=n, axis=1)
    return C[:, n_cols - 1:].sum(axis=0)
This avoids recomputing the FFT for b every time.
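For example, the benchmark loop then only pays for the FFT of a on each iteration:
PYTHON
# B is computed once above and reused for every call
for _ in range(100):
    x = fast_corr(a, B)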
Replacing NumPy’s FFT with pyFFTW can give you multi-threaded FFTs. With pyFFTW you can create “plans” that are reused, potentially accelerating repeated FFT computations:
PYTHON
import pyfftw
import numpy as np

# Create pyFFTW arrays with proper alignment for speed
n_cols = a.shape[1]
n = 2 * n_cols - 1
A_in = pyfftw.empty_aligned((a.shape[0], n), dtype='float64')
A_freq = pyfftw.empty_aligned((a.shape[0], n // 2 + 1), dtype='complex128')
fft_object = pyfftw.FFTW(A_in, A_freq, axes=(1,),
                         direction='FFTW_FORWARD', threads=4)
# Similarly, set up an inverse FFT plan for the product.
While setting this up requires additional work, for repeated operations it often pays off.
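For completeness, here is a minimal sketch of how the full row-correlation pipeline might be wired up with pyFFTW plans (fftw_row_correlation is an illustrative name; the buffer shapes assume real-to-complex transforms of length 2*n_cols-1):
PYTHON
import pyfftw
import numpy as np

def fftw_row_correlation(a, b, threads=4):
    # Sketch only: one forward r2c plan reused for both inputs,
    # plus one backward c2r plan for the product
    m, n_cols = a.shape
    n = 2 * n_cols - 1
    x = pyfftw.empty_aligned((m, n), dtype='float64')
    X = pyfftw.empty_aligned((m, n // 2 + 1), dtype='complex128')
    y = pyfftw.empty_aligned((m, n), dtype='float64')
    fwd = pyfftw.FFTW(x, X, axes=(1,), direction='FFTW_FORWARD', threads=threads)
    inv = pyfftw.FFTW(X, y, axes=(1,), direction='FFTW_BACKWARD', threads=threads)

    x[:, :n_cols] = a
    x[:, n_cols:] = 0
    fwd()
    A = X.copy()                    # keep the spectrum of a

    x[:, :n_cols] = b[:, ::-1]      # reverse b to turn convolution into correlation
    x[:, n_cols:] = 0
    fwd()

    X *= A                          # product in the frequency domain
    inv()                           # pyFFTW normalizes the inverse by default
    return y[:, n_cols - 1:].sum(axis=0)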
If the vectorized FFT approach is not suitable (or you need more control), you could try using Numba to JIT compile your Python loop:
PYTHON
import numpy as np
import numba

@numba.njit(parallel=True)
def row_correlation(a, b):
    n_rows, n_cols = a.shape
    n = 2 * n_cols - 1
    result = np.zeros((n_rows, n))
    for i in numba.prange(n_rows):
        # Compute the full cross-correlation for each row
        for lag in range(-n_cols + 1, n_cols):
            s = 0.0
            for j in range(n_cols):
                j2 = j - lag
                if 0 <= j2 < n_cols:
                    s += a[i, j] * b[i, j2]
            result[i, lag + n_cols - 1] = s
    return result[:, n_cols - 1:].sum(axis=0)
Even though the pure Python loop was slow, Numba's JIT may remove much of the overhead. Note that for large arrays, you will need to experiment carefully with different strategies.
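Keep in mind that the first call includes the JIT compilation time, so benchmark subsequent calls:
PYTHON
_ = row_correlation(a, b)   # warm-up call triggers JIT compilation
x = row_correlation(a, b)   # subsequent calls run the compiled code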
If you have a compatible GPU, you might convert your code to use CuPy, which is designed to mimic NumPy’s API while performing operations on the GPU:
PYTHON
import cupy as cp
import numpy as np

a_gpu = cp.asarray(a)
b_gpu = cp.asarray(b)

def gpu_batch_row_correlation(a_gpu, b_gpu):
    n_cols = a_gpu.shape[1]
    n = 2 * n_cols - 1
    A_gpu = cp.fft.rfft(a_gpu, n=n, axis=1)
    B_gpu = cp.fft.rfft(b_gpu[:, ::-1], n=n, axis=1)
    C_gpu = cp.fft.irfft(A_gpu * B_gpu, n=n, axis=1)
    result_gpu = C_gpu[:, n_cols - 1:].sum(axis=0)
    return cp.asnumpy(result_gpu)

result = gpu_batch_row_correlation(a_gpu, b_gpu)
This leverages GPU acceleration for FFT operations, which can lead to dramatic speedups when data transfer overhead is amortized by heavy computation.
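One caveat when benchmarking: CuPy launches kernels asynchronously, but the device-to-host copy inside cp.asnumpy forces a synchronization, so a simple wall-clock timing around this particular call is meaningful:
PYTHON
import time

start = time.perf_counter()
r = gpu_batch_row_correlation(a_gpu, b_gpu)  # cp.asnumpy inside syncs the device
print(time.perf_counter() - start)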
Adapting one or a combination of these approaches should help reduce your computation time.
Answer by OrbitalWanderer177 • 1 month ago
I do not think you can find a much faster implementation than this using sequential code on a CPU. This answer explains why. It also provides a slightly faster experimental FFTW-based implementation (up to 40% faster), as well as alternative, possibly faster solutions.
The FFT library used by SciPy, pypocketfft, is surprisingly fast (faster than the FFTW library in this case). test_2 spends most of its time in this native library, so using native code or even vectorization would not make the code much faster. Indeed, 60-65% of the time is spent in the function pocketfft::detail::rfftp::exec, 15-20% in Python overheads, ~10% in pocketfft::detail::general_r2c / pocketfft::detail::general_c2r, and 10% in other operations like copies and the main multiplication. As a result, writing your own specialized native module directly using pypocketfft would make this computation only 15-20% faster at best. I do not think it is worth the effort.
Low-level profiling results show that the computing function test_2 is clearly compute-bound. The same is true for the FFTW functions provided in this answer. That being said, the pypocketfft library only uses scalar instructions; alternative libraries like FFTW can theoretically benefit from SIMD units.
However, in practice, it is far from easy to write an implementation that both uses an efficient algorithm and SIMD instructions. The Intel MKL (an alternative library) certainly does that. FFTW only uses SIMD instructions in its complex-to-complex FFTs, which is why the function test_fftw_2 (provided at the end of this answer) is slower than the other FFTW-based functions (also provided).
I tried FFTW (fully vectorized code), but it is significantly slower in this case (despite the library's name meaning "Fastest Fourier Transform in the West"). This is because 4200*2-1 has two big prime factors, and FFTW's underlying algorithm (AFAIK Cooley-Tukey) is inefficient for lengths with such big prime factors (it is much better for power-of-two lengths). In fact, the computation is about 8 times faster for a length of 8192 instead of 8399 (your use case). Note that I tried two versions: one in-place with only complex-to-complex FFTs, and one out-of-place with real-to-complex/complex-to-real FFTs. The latter was supposed to be faster, but it is slower in practice... Maybe I missed something critical in my implementations.
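To illustrate the point (a small aside, not part of the benchmark above): 8399 factors as 37 * 227, and SciPy's scipy.fft.next_fast_len helper can pick a nearby FFT-friendly length:
PYTHON
import scipy.fft

n = 4200
full_len = 2 * n - 1                 # 8399 = 37 * 227: two large prime factors
fast_len = scipy.fft.next_fast_len(full_len)
print(fast_len)                      # 8400 = 2**4 * 3 * 5**2 * 7, FFT-friendly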
We can cheat by using a bigger array for the FFT, which is a kind of padding. Indeed, using an array size divisible by a big power of two mitigates the aforementioned prime-factor performance issue. For example, here we can use an array size of 1024*9 = 9216 (the smallest multiple of 1024 bigger than n*2-1). While we could use a smaller power-of-two base so as to stay closer to n*2-1 (improving efficiency by computing fewer useless values), this also makes the algorithm less efficient (due to a smaller power-of-two factor). In this case, the FFTW implementation is about as fast as the SciPy code with an FFTW_MEASURE plan (this results in code that is about 7% faster, a tiny performance boost). With FFTW_PATIENT, it is actually faster by about 25%! The catch is that FFTW_PATIENT plans can be pretty slow to build: the initialization time can only be amortized over at least several thousand iterations, not just 100. Note that pre-allocating all arrays and planning once before the loop results in a few percent faster execution.
I also found that FFTW is sub-optimal here: computing chunks of rows is ~10% faster than computing the whole array in one go. As a result, the final implementation test_fftw_4 is about 40% faster than test_2 (still without counting the first execution, which is slowed down by the planning time). It is still a bit faster than test_2 even without expensive planning (e.g. FFTW_MEASURE). This last implementation uses a CPU core efficiently: it makes heavy use of the SIMD units, it is pretty cache-friendly, and it spends nearly all its time in the native FFTW library. It can also easily be parallelized over a few cores (depending on the chunk size).
Here are the different FFTW functions I tested:
PYTHON
import pyfftw
import numpy as np

# Initial FFTW implementation (slower than Scipy/pypocketfft)
def test_fftw_1(a, b):
    m = a.shape[0]
    n = a.shape[1]
    assert b.shape[1] == n

    # Input/output FFT temporary arrays
    a2 = pyfftw.empty_aligned((m, 2*n-1), dtype='complex128')
    b2 = pyfftw.empty_aligned((m, 2*n-1), dtype='complex128')
    r2 = pyfftw.empty_aligned((m, 2*n-1), dtype='complex128')
    a2[:,:n] = a
    a2[:,n:] = 0
    b2[:,:n] = b[:,::-1]
    b2[:,n:] = 0

    # See FFTW_ESTIMATE, FFTW_MEASURE or FFTW_PATIENT for varying
    # performance at the expense of a longer planning time
    axes = (1,)
    flags = ['FFTW_MEASURE']
    a2_fft_plan = pyfftw.FFTW(a2, a2, axes, 'FFTW_FORWARD', flags)
    b2_fft_plan = pyfftw.FFTW(b2, b2, axes, 'FFTW_FORWARD', flags)
    r2_ifft_plan = pyfftw.FFTW(r2, r2, axes, 'FFTW_BACKWARD', flags)

    # Actual execution of the plans
    a2_fft_plan()
    b2_fft_plan()
    np.multiply(a2, b2, out=r2)
    r2_ifft_plan()

    return r2.real[:, n-1:].sum(axis=0)

# Supposed to be faster, but slower in practice
# (because it seems not to use SIMD instructions in practice)
def test_fftw_2(a, b):
    m = a.shape[0]
    n = a.shape[1]
    assert b.shape[1] == n

    # Input FFT temporary arrays
    a2 = pyfftw.empty_aligned((m, 2*n-1), dtype='float64')
    b2 = pyfftw.empty_aligned((m, 2*n-1), dtype='float64')
    a2[:,:n] = a
    a2[:,n:] = 0
    b2[:,:n] = b[:,::-1]
    b2[:,n:] = 0

    # Temporary and output arrays
    a2_fft = pyfftw.empty_aligned((m, n), dtype='complex128')
    b2_fft = pyfftw.empty_aligned((m, n), dtype='complex128')
    r2_fft = pyfftw.empty_aligned((m, n), dtype='complex128')
    r2 = pyfftw.empty_aligned((m, 2*n-1), dtype='float64')

    # Same thing for planning
    axes = (1,)
    flags = ['FFTW_MEASURE']
    a2_fft_plan = pyfftw.FFTW(a2, a2_fft, axes, 'FFTW_FORWARD', flags)
    b2_fft_plan = pyfftw.FFTW(b2, b2_fft, axes, 'FFTW_FORWARD', flags)
    r2_ifft_plan = pyfftw.FFTW(r2_fft, r2, axes, 'FFTW_BACKWARD', flags)

    a2_fft_plan()
    b2_fft_plan()
    np.multiply(a2_fft, b2_fft, out=r2_fft)
    r2_ifft_plan()

    return r2.real[:, n-1:].sum(axis=0)

# Pretty-fast FFTW-based implementation (faster than Scipy/pypocketfft
# without considering the planning time).
# However, the first call is pretty slow (e.g. dozens of seconds).
def test_fftw_3(a, b):
    m = a.shape[0]
    n = a.shape[1]
    assert b.shape[1] == n

    # Input/output FFT temporary arrays, padded to a multiple of 1024
    block_size = 1024
    size = (2*n-1 + block_size-1) // block_size * block_size
    a2 = pyfftw.empty_aligned((m, size), dtype='complex128')
    b2 = pyfftw.empty_aligned((m, size), dtype='complex128')
    r2 = pyfftw.empty_aligned((m, size), dtype='complex128')
    a2[:,:n] = a
    a2[:,n:] = 0
    b2[:,:n] = b[:,::-1]
    b2[:,n:] = 0

    axes = (1,)
    flags = ['FFTW_PATIENT']
    a2_fft_plan = pyfftw.FFTW(a2, a2, axes, 'FFTW_FORWARD', flags)
    b2_fft_plan = pyfftw.FFTW(b2, b2, axes, 'FFTW_FORWARD', flags)
    r2_ifft_plan = pyfftw.FFTW(r2, r2, axes, 'FFTW_BACKWARD', flags)

    # Actual execution of the plans
    a2_fft_plan()
    b2_fft_plan()
    np.multiply(a2, b2, out=r2)
    r2_ifft_plan()

    diff = size - (2*n-1)
    return r2.real[:, -diff-n:-diff].sum(axis=0)

# Fastest FFTW-based implementation: process chunks of 8 rows
def test_fftw_4(a, b):
    return sum([test_fftw_3(a[i*8:(i+1)*8], b[i*8:(i+1)*8]) for i in range(8)])
The Intel MKL is certainly faster than pypocketfft, but this requires native C/C++ code wrapped so it can be called from Python (not so simple to do). I think this is the best sequential option on CPU.
An alternative solution is to use GPUs. For example, CuPy supports FFT computations, though this requires an Nvidia GPU (due to CUDA). There is also the quite new, portable pyvkfft module based on VkFFT (which also supports AMD). Be aware that double-precision computations are only fast on (expensive) server-side GPUs, not client-side ones (i.e. in personal computers).
Note that using single precision should result in faster execution times on both CPU and GPU (especially on client-side GPUs).
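As a minimal illustration of the precision point (an aside, assuming the a array from the question): scipy.fft preserves float32 inputs, whereas numpy.fft always computes in double precision:
PYTHON
import numpy as np
import scipy.fft

a32 = a.astype(np.float32)
# scipy.fft keeps single precision: complex64 output
A = scipy.fft.rfft(a32, n=2 * a32.shape[1] - 1, axis=1)
print(A.dtype)  # complex64
# np.fft.rfft would upcast the same input to complex128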
You can find a list of interesting C++ libraries and a benchmark here.