Asked 1 month ago by NebulousExplorer860
Why Do Cached Numpy Array Calculations Sometimes Run Slower Than Uncached Ones?
I am performing a large number of calculations on numpy arrays where many operations are repeated. To avoid recomputation, I implemented a caching strategy. However, I observed a surprising result: enabling the cache makes the runs slower overall, and the time spent per numpy addition appears to increase.
This behavior is also seen when using scipy sparse matrices. The full application is more complex, but the following minimal example reproduces the behavior:
PYTHON
import numpy as np
from time import time


def numpy_comparison(do_cache: bool, array_size: int, num_arrays: int, num_iter: int):
    # Create random arrays
    arrays: dict[int, np.ndarray] = {}
    for i in range(num_arrays):
        arrays[i] = np.random.rand(array_size)

    if do_cache:
        # Set up the cache if needed - I cannot use lru_cache or similar in practice
        cache: dict[tuple[int, int], np.ndarray] = {}

    for _ in range(num_iter):
        # Loop over random pairs of arrays, add, store if relevant
        i, j = np.random.randint(num_arrays, size=2)
        if do_cache and (i, j) in cache:
            a = cache[(i, j)]  # a is not used further here, but would be in the real case
        else:
            a = arrays[i] + arrays[j]
            if do_cache:
                cache[(i, j)] = a
Performance is measured (without multithreading) using:
BASH
%timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
%timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
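Note that these %timeit calls assume num_iter is already defined in the session. An equivalent standalone driver (a sketch using timeit directly, sweeping the values shown in the table below) would be:

PYTHON
import timeit

for num_iter in (100, 1_000, 10_000, 100_000):
    for do_cache in (False, True):
        total = timeit.timeit(
            lambda: numpy_comparison(
                do_cache=do_cache, array_size=10000, num_arrays=100, num_iter=num_iter
            ),
            number=10,
        )
        print(f"{num_iter=} {do_cache=} {total / 10 * 1e3:.1f} ms per call")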
The results are as follows:
num_iter | No caching | With caching |
---|---|---|
100 | 10.3ms | 13.7ms |
1000 | 28.8ms | 62.7ms |
10000 | 225ms | 392ms |
100000 | 2.12s | 1.62s |
I have observed similar behavior when varying the array size and number of arrays. In the regime of interest (num_iter ≈ 1000 and about a 5% chance of cache hits) the overhead of caching appears to increase the time per numpy addition. Can anyone explain why caching in this case causes an overall increase in time spent in numpy operations, and what might be happening under the hood?
Answer by TychoAI Agent • 1 month ago
Below is one likely explanation:
When you store results in a cache, you “pin” the computed arrays so that they never get freed. In the no‐cache case every computed result is discarded immediately so that the memory allocator (or numpy’s own memory pools) can reuse those buffers efficiently. Once you begin caching—even with a low hit rate—the overall memory footprint grows and may affect:
CPU Cache and Memory Locality: A larger working set can push useful data (and even the temporary buffers generated by numpy’s BLAS/SIMD routines) out of the CPU cache. Even though you compute fewer additions, each addition may run slower because of poorer cache locality and increased memory pressure.
Memory Allocator Behavior: Keeping many arrays alive (even if not used in every iteration) interferes with the fast allocation/deallocation patterns optimized for numpy’s temporary buffers. This extra overhead can slow down the low‐level implementations of arithmetic operations.
Although your profiling shows the overhead is in the numpy addition, adding a branch (check for cache membership) may have a subtle effect on how the inner loop is optimized. In tight loops that rely on highly tuned BLAS or SIMD routines, even a small change in the call pattern can prevent certain optimizations, making each operation “appear” slower.
Even though the cache prevents some additions, the additions that are still computed may run slower for the reasons above.
In essence, the combination of a growing cache that “pins” memory, together with the extra branch in your loop, disrupts the memory-locality and fast allocation patterns that numpy (and underlying BLAS/SIMD libraries) rely on. This can lead to each individual addition taking slightly longer, even though you are doing fewer of them overall. When the cache hit rate becomes higher (as in your 100000-iteration case) the benefits finally overcome the overhead. Until then, the increased memory and potential microarchitectural penalties outweigh the savings from avoiding a few calculations.
This explanation is consistent with the fact that many numpy/scipy operations benefit from "small" working sets that fit entirely in fast cache or can be recycled by the memory allocator, and that disturbing this can slow down even highly optimized routines.
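If the real application allows it, one way to separate the cost of extra allocations from the cost of the additions themselves is to reuse a preallocated output buffer via np.add(..., out=...) instead of creating a fresh result array on every miss. This is only a sketch (whether it applies depends on how the results are used later; numpy_comparison_reuse is a hypothetical variant, not the original function):

PYTHON
import numpy as np

def numpy_comparison_reuse(array_size: int, num_arrays: int, num_iter: int):
    arrays = {i: np.random.rand(array_size) for i in range(num_arrays)}
    out = np.empty(array_size)  # one reusable buffer instead of a new array per iteration
    for _ in range(num_iter):
        i, j = np.random.randint(num_arrays, size=2)
        np.add(arrays[i], arrays[j], out=out)  # writes into the existing buffer
        # ... use `out` here before the next iteration overwrites it ...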
Answer by QuasarObserver700 • 1 month ago
TL;DR: page faults explain why the cache-based version is significantly slower than the uncached one when num_iter is small. This is a side effect of creating many new Numpy arrays that are only deleted at the end. When num_iter is big, the cache becomes more effective (as explained in JonSG's answer). Using another system allocator such as TCMalloc can strongly reduce this overhead.
When you create many new temporary arrays, Numpy requests memory from the system allocator, which in turn requests relatively large buffers from the operating system (OS). The first touch of a memory page causes a page fault, which lets the OS actually set up the page: the virtual page is mapped to a physical one and filled with zeros for security reasons (information should not leak from one process to another). When all arrays are deleted, Numpy frees the memory and the underlying allocator has a good chance of giving it back to the OS (so other processes can use it).
Page faults are very expensive because the CPU needs to switch from user-land to kernel-land (with elevated privileges). The kernel needs to set up many data structures and call a lot of functions to do that.
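A minimal way to see this first-touch cost directly (a sketch; the exact numbers depend on the OS, allocator, and hardware) is to time the first write to a freshly allocated array against a second write to the same, now-mapped, pages:

PYTHON
import numpy as np
from time import perf_counter

a = np.empty(50_000_000)  # ~400 MB of float64: virtual pages reserved, not yet mapped

t0 = perf_counter()
a.fill(1.0)               # first touch: page faults, the kernel maps and zeroes the pages
t1 = perf_counter()
a.fill(2.0)               # second touch: pages already mapped, no faults
t2 = perf_counter()

print(f"first touch: {t1 - t0:.3f} s, second touch: {t2 - t1:.3f} s")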
To prove page faults are the issue and how bad they are performance-wise, here is the low-level breakdown of the time when the cache is enabled (using perf):
TEXT
13,75%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so   [.] DOUBLE_add_AVX2
 7,47%  [kernel]                                            [k] __irqentry_text_end
 6,94%  _mt19937.cpython-311-x86_64-linux-gnu.so            [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
 3,63%  [kernel]                                            [k] clear_page_erms
 3,22%  [kernel]                                            [k] error_entry
 2,98%  [kernel]                                            [k] native_irq_return_iret
 2,88%  libpython3.11.so.1.0                                [.] _PyEval_EvalFrameDefault
 2,35%  [kernel]                                            [k] sync_regs
 2,28%  [kernel]                                            [k] __list_del_entry_valid_or_report
 2,27%  [kernel]                                            [k] unmap_page_range
 1,62%  [kernel]                                            [k] __handle_mm_fault
 1,45%  [kernel]                                            [k] __mod_memcg_lruvec_state
 1,43%  mtrand.cpython-311-x86_64-linux-gnu.so              [.] random_standard_uniform_fill
 1,10%  [kernel]                                            [k] do_anonymous_page
 1,10%  _mt19937.cpython-311-x86_64-linux-gnu.so            [.] mt19937_gen
 0,98%  [kernel]                                            [k] mas_walk
 0,93%  libpython3.11.so.1.0                                [.] PyObject_GenericGetAttr
 0,91%  [kernel]                                            [k] get_mem_cgroup_from_mm
 0,89%  [kernel]                                            [k] get_page_from_freelist
 0,79%  libpython3.11.so.1.0                                [.] _PyObject_Malloc
 0,77%  [kernel]                                            [k] lru_gen_add_folio
 0,72%  [nvidia]                                            [k] _nv039919rm
 0,65%  [kernel]                                            [k] lru_gen_del_folio.constprop.0
 0,63%  [kernel]                                            [k] blk_cgroup_congested
 0,62%  [kernel]                                            [k] handle_mm_fault
 0,59%  [kernel]                                            [k] __alloc_pages_noprof
 0,57%  [kernel]                                            [k] lru_add
 0,57%  [kernel]                                            [k] folio_batch_move_lru
 0,56%  [kernel]                                            [k] __rcu_read_lock
 0,52%  [kernel]                                            [k] do_user_addr_fault
[...] (many other functions taking <0.52% each)
As we can see, a lot of [kernel] functions are called, and most of them are due to page faults. For example, __irqentry_text_end and native_irq_return_iret (taking ~10% of the time) are caused by CPU interrupts triggered when the CPython process accesses pages for the first time. clear_page_erms is the function that clears a memory page on first touch. Several other functions are related to the virtual-to-physical memory mapping (e.g. AFAIK the ones related to the LRU cache). Note that DOUBLE_add_AVX2 is the internal native Numpy function actually summing the two arrays. In comparison, here is the breakdown with the cache disabled:
TEXT
20,85%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so   [.] DOUBLE_add_AVX2
17,39%  _mt19937.cpython-311-x86_64-linux-gnu.so            [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
 5,69%  libpython3.11.so.1.0                                [.] _PyEval_EvalFrameDefault
 3,35%  mtrand.cpython-311-x86_64-linux-gnu.so              [.] random_standard_uniform_fill
 2,46%  _mt19937.cpython-311-x86_64-linux-gnu.so            [.] mt19937_gen
 2,15%  libpython3.11.so.1.0                                [.] PyObject_GenericGetAttr
 1,76%  [kernel]                                            [k] __irqentry_text_end
 1,46%  libpython3.11.so.1.0                                [.] _PyObject_Malloc
 1,07%  libpython3.11.so.1.0                                [.] PyUnicode_FromFormatV
 1,03%  libc.so.6                                           [.] printf_positional
 0,93%  libpython3.11.so.1.0                                [.] _PyObject_Free
 0,88%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so   [.] NpyIter_AdvancedNew
 0,79%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so   [.] ufunc_generic_fastcall
 0,77%  [kernel]                                            [k] error_entry
 0,74%  _bounded_integers.cpython-311-x86_64-linux-gnu.so   [.] __pyx_f_5numpy_6random_17_bounded_integers__rand_int64
 0,72%  [nvidia]                                            [k] _nv039919rm
 0,69%  [kernel]                                            [k] native_irq_return_iret
 0,66%  [kernel]                                            [k] clear_page_erms
 0,55%  libpython3.11.so.1.0                                [.] PyType_IsSubtype
 0,55%  [kernel]                                            [k] sync_regs
 0,52%  mtrand.cpython-311-x86_64-linux-gnu.so              [.] __pyx_pw_5numpy_6random_6mtrand_11RandomState_31randint
 0,48%  [kernel]                                            [k] unmap_page_range
 0,46%  libpython3.11.so.1.0                                [.] PyDict_GetItemWithError
 0,43%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so   [.] PyArray_NewFromDescr_int
 0,40%  libpython3.11.so.1.0                                [.] _PyFunction_Vectorcall
 0,38%  [kernel]                                            [k] __mod_memcg_lruvec_state
 0,38%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so   [.] promote_and_get_ufuncimpl
 0,37%  libpython3.11.so.1.0                                [.] PyDict_SetItem
 0,36%  libpython3.11.so.1.0                                [.] PyObject_RichCompareBool
 0,36%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so   [.] PyUFunc_GenericReduction
[...] (many other functions taking <0.35% each)
There are clearly fewer kernel functions among the top 30 most expensive ones. We can still see the interrupt-related functions, but note that this low-level profiling is global to my whole machine, so these interrupt-related functions likely come from other processes (e.g. perf itself, the graphical desktop environment, and Firefox running as I write this answer).
There is another observation showing that page faults are the main culprit: during my tests, the system allocator suddenly changed its behavior because I had allocated many arrays before running the same commands, and this resulted in only a few kernel calls (in perf) as well as very close timings between the two analyzed variants (with/without cache):
PYTHON
# First execution:
In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
20.2 ms ± 63.7 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.4 ms ± 81.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Second execution:
In [4]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
19.8 ms ± 43.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.3 ms ± 155 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# After creating many Numpy arrays (not deleted since) with no change of `numpy_comparison`:
In [95]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
    ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
18.4 ms ± 15.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.9 ms ± 26.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-----
We can clearly see that the overhead of using a cache is now pretty small. This also means the performance could be very different in a real-world application (because of a different allocator state), or even on different platforms.
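For reference, that "many arrays allocated beforehand" state can be approximated with something like the following (a hypothetical warm-up; the exact arrays allocated during my tests are not shown above, and the precise effect is allocator-dependent):

PYTHON
import numpy as np

# Hypothetical warm-up: allocating many arrays up front grows the heap and changes the
# allocator's behavior, so later temporary arrays are more likely to be served from
# already-mapped memory rather than from freshly-faulted pages.
_warmup = [np.random.rand(10_000) for _ in range(5_000)]  # ~400 MB kept alive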
You can use an alternative system allocator like TCMalloc, which requests a big chunk of memory from the OS at startup so that page faults are paid far less often. Here are results with it (using the command line LD_PRELOAD=libtcmalloc_minimal.so.4 ipython):
PYTHON
In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
16.9 ms ± 51.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.5 ms ± 29.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The overhead related to the cache, which was actually due to the creation of many temporary arrays and more specifically to page faults, is now >10 times smaller!
I think the remaining overhead is due to the larger memory working set, as pointed out by NickODell in the comments (this causes more cache misses due to cache thrashing and more data to be loaded from the slow DRAM). Put shortly, the cached version is simply less cache-friendly.
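A rough back-of-the-envelope calculation (assuming float64 arrays and the question's parameters, so an estimate rather than a measurement) shows how much bigger the working set gets:

PYTHON
array_bytes = 10_000 * 8            # one float64 array of length 10_000: 80 KB

inputs = 100 * array_bytes          # the 100 source arrays: ~8 MB in total
# At num_iter=1000 with a ~5% hit rate, the cache ends up holding close to 1000
# result arrays that stay alive until the function returns: roughly 80 MB pinned,
# versus ~8 MB (plus one short-lived temporary) in the uncached variant.
pinned_by_cache = 1_000 * array_bytes

print(f"inputs: {inputs / 1e6:.0f} MB, pinned by cache: {pinned_by_cache / 1e6:.0f} MB")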
Here are results with num_iter=100_000 with/without TCMalloc:
PYTHON
# Default system allocator (glibc)
In [97]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
    ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
1.31 s ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
859 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# TCMalloc allocator
In [3]: %timeit -n 1 numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit -n 1 numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
1.28 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
774 ms ± 83.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This behavior is expected: the cache starts to be useful once there are more hits. Note that TCMalloc makes the overall execution faster in most cases!
Answer by NebularAstronaut471 • 1 month ago
I believe part of the issue here is that the work involved in the example is very quick compared to testing for and fetching the value from the cache. So I think it is reasonable to see an initial penalty for enabling the cache.
Here is a quick workup allowing you to vary the "amount of work done" as well as the size of the cache. Hopefully this will allow you to explore in more detail how the work and cache hit ratio are impacting performance.
If this is not of any value, just drop me a note here and I will delete this.
PYTHON
import numpy as np
import cachetools
import timeit


def build_tester(array_size, num_arrays, num_iter, work_size, cache_size):
    arrays: dict[int, np.ndarray] = {}
    for i in range(num_arrays):
        arrays[i] = np.random.rand(array_size)

    def _delay(work_size):
        for _ in range(work_size):
            pass
        return 0

    def _work_uncached(i, j):
        return arrays[i] + arrays[j] + _delay(work_size)

    @cachetools.cached(cachetools.LRUCache(maxsize=cache_size), info=True)
    def _work_cached(i, j):
        return _work_uncached(i, j)

    _work = _work_cached if cache_size else _work_uncached

    def _test():
        if hasattr(_work, "cache_info"):
            _work.cache_clear()
        a = 0
        for _ in range(num_iter):
            # Loop over random pairs of arrays, add, store if relevant
            i, j = np.random.randint(num_arrays, size=2)
            a += _work(i, j)
        if hasattr(_work, "cache_info"):
            print(f"\t\t{_work.cache_info()}")
        return a

    return _test


for work_amount in [10, 100, 500, 1_000, 2_000, 4_000, 8_000]:
    print(f"{work_amount=}")
    for cache_size in [0, 10, 100, 1_000, 2_000, 4_000, 8_000, 16_000]:
        print(f"\t{cache_size=}")
        tester = build_tester(10_000, 100, 100_000, work_amount, cache_size)
        seconds = timeit.timeit(tester, number=1)
        print(f"\t\t{seconds=:0.4f}")
That gives me a result like:
BASH
work_amount=10
    cache_size=0  seconds=2.6566
    cache_size=10  CacheInfo(hits=112, misses=99888, maxsize=10, currsize=10)  seconds=3.8053
    cache_size=100  CacheInfo(hits=954, misses=99046, maxsize=100, currsize=100)  seconds=4.7884
    cache_size=1000  CacheInfo(hits=9791, misses=90209, maxsize=1000, currsize=1000)  seconds=4.7839
    cache_size=2000  CacheInfo(hits=19702, misses=80298, maxsize=2000, currsize=2000)  seconds=4.6119
    cache_size=4000  CacheInfo(hits=39029, misses=60971, maxsize=4000, currsize=4000)  seconds=4.2463
    cache_size=8000  CacheInfo(hits=75199, misses=24801, maxsize=8000, currsize=8000)  seconds=3.2509
    cache_size=16000  CacheInfo(hits=90001, misses=9999, maxsize=16000, currsize=9999)  seconds=2.8887
work_amount=100
    cache_size=0  seconds=2.7871
    cache_size=10  CacheInfo(hits=95, misses=99905, maxsize=10, currsize=10)  seconds=4.1492
    cache_size=100  CacheInfo(hits=1017, misses=98983, maxsize=100, currsize=100)  seconds=5.4277
    cache_size=1000  CacheInfo(hits=9923, misses=90077, maxsize=1000, currsize=1000)  seconds=5.6435
    cache_size=2000  CacheInfo(hits=19928, misses=80072, maxsize=2000, currsize=2000)  seconds=5.5402
    cache_size=4000  CacheInfo(hits=38916, misses=61084, maxsize=4000, currsize=4000)  seconds=4.9680
    cache_size=8000  CacheInfo(hits=75266, misses=24734, maxsize=8000, currsize=8000)  seconds=3.3910
    cache_size=16000  CacheInfo(hits=90001, misses=9999, maxsize=16000, currsize=9999)  seconds=2.8241
work_amount=500
    cache_size=0  seconds=3.4740
    cache_size=10  CacheInfo(hits=90, misses=99910, maxsize=10, currsize=10)  seconds=4.8242
    cache_size=100  CacheInfo(hits=1005, misses=98995, maxsize=100, currsize=100)  seconds=5.9154
    cache_size=1000  CacheInfo(hits=10124, misses=89876, maxsize=1000, currsize=1000)  seconds=6.2072
    cache_size=2000  CacheInfo(hits=19756, misses=80244, maxsize=2000, currsize=2000)  seconds=6.0852
    cache_size=4000  CacheInfo(hits=39104, misses=60896, maxsize=4000, currsize=4000)  seconds=5.2798
    cache_size=8000  CacheInfo(hits=75312, misses=24688, maxsize=8000, currsize=8000)  seconds=3.6092
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=2.8147
work_amount=1000
    cache_size=0  seconds=4.4938
    cache_size=10  CacheInfo(hits=92, misses=99908, maxsize=10, currsize=10)  seconds=6.3360
    cache_size=100  CacheInfo(hits=970, misses=99030, maxsize=100, currsize=100)  seconds=6.7533
    cache_size=1000  CacheInfo(hits=10069, misses=89931, maxsize=1000, currsize=1000)  seconds=7.1395
    cache_size=2000  CacheInfo(hits=19799, misses=80201, maxsize=2000, currsize=2000)  seconds=7.1510
    cache_size=4000  CacheInfo(hits=39254, misses=60746, maxsize=4000, currsize=4000)  seconds=6.0416
    cache_size=8000  CacheInfo(hits=74989, misses=25011, maxsize=8000, currsize=8000)  seconds=3.8283
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=2.9638
work_amount=2000
    cache_size=0  seconds=6.0297
    cache_size=10  CacheInfo(hits=108, misses=99892, maxsize=10, currsize=10)  seconds=8.0671
    cache_size=100  CacheInfo(hits=1014, misses=98986, maxsize=100, currsize=100)  seconds=8.3695
    cache_size=1000  CacheInfo(hits=9978, misses=90022, maxsize=1000, currsize=1000)  seconds=9.6132
    cache_size=2000  CacheInfo(hits=19729, misses=80271, maxsize=2000, currsize=2000)  seconds=9.1182
    cache_size=4000  CacheInfo(hits=39244, misses=60756, maxsize=4000, currsize=4000)  seconds=7.2789
    cache_size=8000  CacheInfo(hits=75272, misses=24728, maxsize=8000, currsize=8000)  seconds=4.3689
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=3.0879
work_amount=4000
    cache_size=0  seconds=10.8127
    cache_size=10  CacheInfo(hits=105, misses=99895, maxsize=10, currsize=10)  seconds=11.5326
    cache_size=100  CacheInfo(hits=1008, misses=98992, maxsize=100, currsize=100)  seconds=12.3085
    cache_size=1000  CacheInfo(hits=9846, misses=90154, maxsize=1000, currsize=1000)  seconds=12.2700
    cache_size=2000  CacheInfo(hits=19831, misses=80169, maxsize=2000, currsize=2000)  seconds=11.5156
    cache_size=4000  CacheInfo(hits=39071, misses=60929, maxsize=4000, currsize=4000)  seconds=9.7006
    cache_size=8000  CacheInfo(hits=75259, misses=24741, maxsize=8000, currsize=8000)  seconds=5.2042
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=3.4366
work_amount=8000
    cache_size=0  seconds=16.8940
    cache_size=10  CacheInfo(hits=84, misses=99916, maxsize=10, currsize=10)  seconds=18.3884
    cache_size=100  CacheInfo(hits=992, misses=99008, maxsize=100, currsize=100)  seconds=19.4869
    cache_size=1000  CacheInfo(hits=9929, misses=90071, maxsize=1000, currsize=1000)  seconds=19.1156
    cache_size=2000  CacheInfo(hits=19845, misses=80155, maxsize=2000, currsize=2000)  seconds=18.9519
    cache_size=4000  CacheInfo(hits=38982, misses=61018, maxsize=4000, currsize=4000)  seconds=14.2407
    cache_size=8000  CacheInfo(hits=75141, misses=24859, maxsize=8000, currsize=8000)  seconds=7.4209
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=4.2533