
Asked 1 month ago by NeptunianPathfinder407

How can I accurately and efficiently perform FFT‐based convolution on np.float128 arrays in Python?


The bounty expires in 3 days. Answers to this question are eligible for a +100 reputation bounty.
Simd wants to draw more attention to this question.

On my Linux system, using np.float128 yields an 80‐bit long double:

PYTHON
np.finfo(np.float128)
info(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)

I want to convolve two long arrays of np.float128 values. While using scipy.signal.convolve with method='direct' produces the correct result, using method='fft' does not. Consider this toy example:

PYTHON
import numpy as np
from scipy.signal import convolve

a = np.array(['1.e+401', '1.e+000', '1.e+401', '1.e+000'], dtype=np.float128)

convolve(a, a, mode='full', method='direct')
# array([1.e+802, 2.e+401, 2.e+802, 4.e+401, 1.e+802, 2.e+401, 1.e+000], dtype=float128)  # correct

convolve(a, a, mode='full', method='fft')
# array([1.e+802, 0.e+000, 2.e+802, 0.e+000, 1.e+802, 0.e+000, 0.e+000], dtype=float128)  # wrong

I even tried implementing convolution from scratch using pyfftw, but the result was still incorrect.

My ultimate goal is to quickly and accurately compute a convolution similar to the following using FFTs:

PYTHON
# String literals avoid overflowing the intermediate float64 parsing step (1e401 as a plain literal is inf)
a = np.array(['1e401', '1', '1e401', '1'] * 10000, dtype=np.float128)
convolve(a, a)

How can this be achieved?

5

Comments 0

Answers 2

Views 170

2 Answers

Answer by StackOverflowTychoAI Agent 1 month ago

0

Below is a summary of the situation and one viable workaround:


What’s Going Wrong

The FFT-based convolution path (i.e. using method="fft") in SciPy's routines (and in similar libraries such as pyFFTW when built with default options) ends up giving wrong answers for np.float128 data. In short, the FFT routines underlying these functions typically do not really support 80-bit (extended) precision. Even though NumPy's FFT routines now compute in the precision of the input, the FFT libraries that SciPy (or a typical FFTW installation) relies on are usually compiled only for double precision. This leads to one of two things:

• Either the np.float128 inputs are internally cast (or promoted) to double (or have conversion issues), or
• The routines simply do not have a proper long double implementation so the arithmetic is not done at the intended precision.

As a result, while direct convolution produces the “correct” results in extended precision, the FFT method fails to do so.
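
One quick sanity check (nothing here is specific to the question's arrays, and the exact result depends on your NumPy/SciPy/FFTW builds) is to look at the dtype returned by the FFT call:

PYTHON
import numpy as np
import scipy.fft

a = np.ones(8, dtype=np.longdouble)

# A long-double complex dtype (clongdouble / complex256 on many Linux builds)
# suggests extended precision was kept; complex128 means the transform was done
# (or the input was cast) in double precision.
print(np.fft.fft(a).dtype)
print(scipy.fft.fft(a).dtype)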


How to Solve It

To perform an FFT-based convolution in high precision (np.float128), you need to use an FFT library that actually supports long double arithmetic. In practice there are two options:

  1. Use an FFT Library with Extended Precision Support

    FFTW can be compiled with long double support (via the --enable-long-double configure flag). If you build FFTW yourself with this flag and then compile pyFFTW against that FFTW, you can obtain FFTs that truly work in extended precision. Then use the pyFFTW interface explicitly specifying the long double dtype. For example:

    PYTHON
    import numpy as np
    import pyfftw

    # Ensure that FFTW is compiled with long double support.
    # Create arrays using np.longdouble (alias for np.float128 on some systems).
    a = np.array(['1.e+401', '1.e+000', '1.e+401', '1.e+000'], dtype=np.longdouble)

    # Create FFTW objects for forward and inverse transforms.
    # NOTE: You may need to consult the pyFFTW documentation to force extended precision.
    A = pyfftw.interfaces.numpy_fft.fft(a)
    a_rec = pyfftw.interfaces.numpy_fft.ifft(A)

    # Use your own convolution routine via FFT if needed.

    Make sure that your installation of FFTW (and pyFFTW) is built to use the extended-precision routines; otherwise, you will still get double-precision results. A sketch of a custom convolution routine built on such a setup follows after this list.

  2. Fallback to Direct Convolution or Lower Precision

    If a true long double FFT is not available, you have two choices:

    • Use method="direct" (which does not rely on FFTs) for smaller arrays, or
    • Convert to np.float64 for the FFT-based convolution and then convert back if possible (this is not lossless, however); see the fallback sketch after this list.
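
To illustrate option 1, here is a minimal sketch of a custom FFT-based convolution routine built on pyFFTW's NumPy-like interface. The function name fftconvolve_longdouble is just illustrative, and the routine only keeps extended precision if FFTW and pyFFTW were actually built with long-double support:

PYTHON
import numpy as np
import pyfftw

def fftconvolve_longdouble(a, b):
    # Full linear convolution via FFTs; stays in long double only if FFTW was
    # built with --enable-long-double and pyFFTW was linked against it.
    n = len(a) + len(b) - 1
    fa = np.zeros(n, dtype=np.longdouble)
    fb = np.zeros(n, dtype=np.longdouble)
    fa[:len(a)] = a
    fb[:len(b)] = b
    A = pyfftw.interfaces.numpy_fft.fft(fa)  # should be a long-double complex dtype
    B = pyfftw.interfaces.numpy_fft.fft(fb)
    c = pyfftw.interfaces.numpy_fft.ifft(A * B)
    return c.real.astype(np.longdouble)

a = np.array(['1e40', '1', '1e40', '1'], dtype=np.longdouble)
print(fftconvolve_longdouble(a, a))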
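
To illustrate option 2, here is a minimal sketch of the float64 fallback (convolve_via_float64 is a hypothetical helper; values beyond roughly 1e308 overflow to inf and the extra long-double digits are lost):

PYTHON
import numpy as np
from scipy.signal import fftconvolve

def convolve_via_float64(a, b):
    # Downcast to double, run SciPy's FFT-based convolution, cast back.
    # Not lossless: precision and range are limited by float64.
    c = fftconvolve(np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64))
    return c.astype(np.longdouble)

a = np.array([1e5, 1, 1e5, 1], dtype=np.longdouble)
print(convolve_via_float64(a, a))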

Summary

The key point is that SciPy's FFT-based convolution (and a typical FFTW installation) is not yet "extended-precision aware" and so will mishandle np.float128 arrays. The recommended workaround is to use an FFT implementation that is compiled with long double support (for example, FFTW with the extended precision flag) and then call it via a custom routine (or via pyFFTW once it is properly set up), making sure you work with np.longdouble arrays.

Since your toy example shows the FFT method producing zeros where you expect nonzero values, this indicates the chosen FFT implementation is not computing in extended precision even though the input arrays are np.float128.


By ensuring you have an FFT library that genuinely supports long double arithmetic (and wiring it up correctly in Python), you can achieve a fast and accurate FFT-based convolution for large np.float128 arrays.

Happy computing!


No comments yet.

Answer by QuantumScientist512 1 month ago

0

SciPy does not seem to fully support the np.float128 data type, since I get the following pretty clear warning when using it on my Linux machine, which supports this type properly.

DeprecationWarning: dtype=float128 is not supported by correlate and will raise an error in SciPy 1.17.0. Supported dtypes are: boolean, integer, np.float16,np.float32, np.float64, np.complex64, np.complex128

Still, the results are different and a bit more accurate. My version of SciPy is 1.15.1, which is the latest release on PyPI at the time of writing.

Technically, pyFFTW is supposed to support it since the documentation explicitly states:

Internally, three precisions of FFT are supported. These correspond to single precision floating point, double precision floating point and long double precision floating point, which correspond to numpy's float32, float64 and longdouble dtypes respectively (and the corresponding complex types). The precision is decided by the relevant scheme, which is specified by the dtype of the input array.


Actual issue

The thing is, this is not a precision issue but a numerical stability issue! Indeed, let's check the output with much smaller double-precision numbers:

PYTHON
import numpy as np
from scipy.signal import convolve

a = np.array([1e40, 1, 1e40, 1])
print(convolve(a, a, mode='full', method='direct'))  # Correct
print(convolve(a, a, mode='full', method='fft'))     # Wrong
print()

a = np.array([1e5, 1, 1e5, 1])
print(convolve(a, a, mode='full', method='direct'))  # Correct
print(convolve(a, a, mode='full', method='fft'))     # Correct
print()

a = np.array([1e15, 1, 1e15, 1])
print(convolve(a, a, mode='full', method='direct'))  # Correct
print(convolve(a, a, mode='full', method='fft'))     # Wrong, but... Oh!

Here is the detailed output:

PYTHON
[1.e+80 2.e+40 2.e+80 4.e+40 1.e+80 2.e+40 1.e+00]
[1.e+80 0.e+00 2.e+80 0.e+00 1.e+80 0.e+00 0.e+00]

[1.e+10 2.e+05 2.e+10 4.e+05 1.e+10 2.e+05 1.e+00]
[1.e+10 2.e+05 2.e+10 4.e+05 1.e+10 2.e+05 1.e+00]

[1.e+30 2.e+15 2.e+30 4.e+15 1.e+30 2.e+15 1.e+00]
[1.00000000e+30 1.99520395e+15 2.00000000e+30 3.93577438e+15 1.00000000e+30 1.94544573e+15 0.00000000e+00]

This shows that when the exponent is big, the FFT method is not correct anymore. In fact, the numerical error is already pretty big with input values like 1e15, even though the double-precision data type supports values as big as 1e+308. The last item is wrong even with input values of 1e8, which is ridiculously small compared to what double precision supports! When the exponent is too big (> ~7.97), some values are clamped to 0 and the results are clearly not reliable anymore.

The FFT-based algorithm is numerically inaccurate. This is likely due to catastrophic cancellation: huge values have small values added to and subtracted from them several times during the algorithm. Increasing the precision helps a bit, but the few additional digits available in the 80-bit x86-specific long double data type are far from enough here (just as the 64 bits of a double are far from enough for input values around 1e15).
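
A tiny demonstration of that cancellation in plain float64 (the spacing value shown in the comment is typical, not exact):

PYTHON
import numpy as np

x = np.float64(1e30)
print((x + 1.0) - x)  # 0.0: the 1.0 is completely absorbed by the huge value
print(np.spacing(x))  # ~1e14: the gap between adjacent float64 values near 1e30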

In fact, if we use the np.float128 data type, we can actually see only a slight improvement:

PYTHON
With 1e8:  [1.e+16 2.e+08 2.e+16 4.e+08 1.e+16 2.e+08 1.e+00]  # Correct
With 1e9:  [1.e+18 2.e+09 2.e+18 4.e+09 1.e+18 2.e+09 1.e+00]  # Correct
With 1e10: [1.e+20 2.e+10 2.e+20 4.e+10 1.e+20 2.e+10 0.e+00]  # Wrong (last item)

I think the issue arises roughly when:

out_min_abs_value / out_max_abs_value < data_type_epsilon

where out_min_abs_value and out_max_abs_value are respectively the minimum/maximum absolute values of the expected output.
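
For illustration, that criterion can be turned into a small check; fft_convolution_looks_risky is a hypothetical helper that uses a direct convolution only to evaluate the heuristic on short inputs:

PYTHON
import numpy as np

def fft_convolution_looks_risky(a, b, dtype=np.float64):
    # Heuristic from above: the FFT result is suspect when the smallest nonzero
    # expected output magnitude falls below eps times the largest one.
    out = np.convolve(np.asarray(a, dtype=dtype), np.asarray(b, dtype=dtype))
    nz = np.abs(out[out != 0])
    return nz.min() / nz.max() < np.finfo(dtype).eps

print(fft_convolution_looks_risky([1e5, 1, 1e5, 1], [1e5, 1, 1e5, 1]))      # False: FFT was correct
print(fft_convolution_looks_risky([1e15, 1, 1e15, 1], [1e15, 1, 1e15, 1]))  # True: FFT was wrong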


Solutions

I do not know if this numerical stability issue can be solved or if it is a fundamental limitation of using an FFT for the convolution. That being said, note that you can push the precision a bit further by calling the native (C/Fortran) FFTW implementation supporting quad-precision numbers (on Linux with GCC, and possibly Clang too). This should help push the limit a bit further: numbers with twice as many digits. This additional precision comes at a significant price: a significantly slower computation, since quad-precision numbers are not native. In fact, 80-bit floating-point (FP) numbers are already pretty slow on x86-64 CPUs (and not even supported on other architectures) because the computations are performed on the x87 FP unit, which has been rather obsolete for one or two decades.

A naive solution would be to re-implement your own FFT with arbitrary-precision arithmetic. For example, you can write native code using the fast GMP library. If you need really high precision (much more than in the example), you could even write this with gmpy2 in pure Python, since a big portion of the time should be spent multiplying huge numbers. For numbers as big as those in the example, ~2800-bit numbers should be enough (that is, ~350 bytes). In this case, more than 80% of the time of gmpy2 operations is actually overhead, so I advise you to write native code.
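
As a rough sketch of that idea, here is a naive recursive radix-2 FFT convolution in pure Python with gmpy2; all names are illustrative and the working precision has to be tuned to the magnitude of your data:

PYTHON
import gmpy2
from gmpy2 import mpc, mpfr

gmpy2.get_context().precision = 300  # working precision in bits; tune to your data

def _fft(x, invert=False):
    # Naive recursive radix-2 Cooley-Tukey FFT on a list of gmpy2.mpc values.
    # len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return x[:]
    even = _fft(x[0::2], invert)
    odd = _fft(x[1::2], invert)
    ang = (2 if invert else -2) * gmpy2.const_pi() / n
    out = [mpc(0)] * n
    for k in range(n // 2):
        w = mpc(gmpy2.cos(k * ang), gmpy2.sin(k * ang))
        t = w * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_convolve_mp(a, b):
    # Full linear convolution through arbitrary-precision FFTs.
    m = len(a) + len(b) - 1
    n = 1
    while n < m:
        n *= 2
    fa = [mpc(mpfr(str(v))) for v in a] + [mpc(0)] * (n - len(a))
    fb = [mpc(mpfr(str(v))) for v in b] + [mpc(0)] * (n - len(b))
    fc = [x * y for x, y in zip(_fft(fa), _fft(fb))]
    return [(v / n).real for v in _fft(fc, invert=True)[:m]]

a = ['1e40', '1', '1e40', '1']
print(fft_convolve_mp(a, a))  # ~ [1e80, 2e40, 2e80, 4e40, 1e80, 2e40, 1]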

One key to avoiding this numerical issue could be to avoid huge variations in the input numbers. This is AFAIK called preconditioning (and it is pretty common). A common practice is to compute the values in log space, but then it is not a convolution anymore because of the FFT... Thus, I do not know whether preconditioning is possible in this case.

No comments yet.

Discussion

No comments yet.