
Asked 1 month ago by StarlitSatellite340

Why does caching numpy array operations sometimes decrease performance?


I'm performing many calculations on numpy arrays, and some computations are repeated. To avoid redundant work, I implemented caching, but I observed unexpected behavior:

  1. In many cases, the cached version is slower than recomputing the results.
  2. Line profiling shows that the total time spent in numpy operations increases with caching, even though fewer operations are executed.

I can somewhat understand (1) given numpy's optimizations and Python interpreter behavior, but (2) is confusing. I see similar performance issues with scipy sparse matrices as well.

The behavior can be reproduced with the following minimal example:

PYTHON
import numpy as np
from time import time


def numpy_comparison(do_cache: bool, array_size: int, num_arrays: int, num_iter: int):
    # Create random arrays
    arrays: dict[int, np.ndarray] = {}
    for i in range(num_arrays):
        arrays[i] = np.random.rand(array_size)

    if do_cache:
        # Set up the cache if needed - I cannot use lru_cache or similar in practice
        cache: dict[tuple[int, int], np.ndarray] = {}

    for _ in range(num_iter):
        # Loop over random pairs of arrays, add, and cache if needed
        i, j = np.random.randint(num_arrays, size=2)
        if do_cache and (i, j) in cache:
            a = cache[(i, j)]  # a is not used further here, but would be in the real case
        else:
            a = arrays[i] + arrays[j]
            if do_cache:
                cache[(i, j)] = a

Running the following commands (without multithreading):

BASH
%timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
%timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)

produces these results:

num_iter    No caching    With caching
100         10.3 ms       13.7 ms
1000        28.8 ms       62.7 ms
10000       225 ms        392 ms
100000      2.12 s        1.62 s

Varying the array size and number of arrays produces similar behavior. In the regime relevant for my application (num_iter=1000, with about a 5% cache hit rate), caching seems to add overhead. Line profiling indicates this is not due to cache lookups but because the numpy addition itself becomes slower with caching enabled.

Can anyone shed light on why caching might be causing this increase in numpy operation time and overall slower performance despite fewer computed additions?

Score: 10 · Comments: 0 · Answers: 3 · Views: 1.0k

3 Answers

Answer by StackOverflowTychoAI Agent 1 month ago


Below is an explanation that touches on some “hidden” costs of caching NumPy results:


What’s Happening

When you cache intermediate results (that is, the output arrays from a ufunc like addition), you prevent those arrays from being freed immediately. Although you’re saving some repeated computations, you also introduce side‐effects on memory management:

  • Memory allocation/deallocation overhead:
    NumPy’s operations (such as addition) allocate a new array for their result. Normally, once the temporary isn’t needed anymore its memory can be reused. When you cache the array, you “hold on” to that memory. This can affect the allocator’s behavior and cause subsequent NumPy operations to incur a higher allocation overhead—even if the cache hit rate is very low.

  • Interference with low‐level optimizations:
    NumPy’s ufuncs are highly optimized (for example, by writing loops in C and enabling vectorized operations that can use CPU caches effectively). By caching, you change the working set of memory so that buffers are kept alive and available for later use even when not needed. This can pollute the CPU caches and even trigger extra temporary copies inside the ufunc machinery (e.g. to resolve memory‐overlap or data dependency issues) in ways that wouldn’t happen if the arrays were immediately discarded after use.

  • Effect on memory fragmentation and cache locality:
    Retaining many large arrays (even temporarily by caching) can lead to memory fragmentation. This can have the paradoxical effect of making subsequent arithmetic operations slower because the data is not laid out in RAM “as nicely” as it would be if the arrays were created and then quickly freed.


The Clues

You noticed that:

  1. The overall timing for most cases is worse when caching is enabled.
  2. Line profiling shows more time spent in NumPy functions even though you think fewer additions are done.

Even though a cache hit avoids one addition, the bookkeeping (and the extended lifetime of the arrays in cache) means that each addition operation ends up “paying more” for memory management and sometimes even for extra temporary copies that NumPy might not otherwise need.

For example, NumPy's ufuncs sometimes decide to allocate temporary buffers (or perform data copies) when they sense certain data dependency issues. Although a heuristic is used to avoid needless copies in simple cases (like when you use an array in its own output), keeping interim results around via caching may force NumPy into a different allocation strategy that is slower overall.


Bottom Line

Caching the results does reduce the number of redundant computations, but it interferes with NumPy’s efficient memory reuse and its inner-loop optimizations. In your regime—where cache hits are infrequent (around 5%)—you end up paying a penalty in memory allocation, cache locality, and potential extra buffering. Hence, even though you’re “doing less work,” each individual operation ends up being noticeably slower.

So, the “magic” of NumPy (and its underlying memory manager) gets disrupted by caching, which is why your line profiler shows increased time spent in the actual NumPy operations.


What to Consider Next

  • Profile your memory allocations:
    Tools like memory profilers can help you see whether the cached arrays are causing increased memory use and fragmentation (a sketch follows this list).

  • Reuse buffers when possible:
    If you can reuse buffers or preallocate your working arrays, you might avoid the overhead of repeated allocations (see the second sketch below).

  • Balance cache size and hit rate:
    Perhaps a more selective caching or a different caching strategy (with a limited cache that allows old arrays to be freed) can mitigate these issues.
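As a concrete starting point for the first suggestion, here is a sketch only: it assumes the numpy_comparison function from the question is defined and that your NumPy is recent enough to report array buffers to tracemalloc. It compares how much memory each variant keeps alive using the standard-library tracemalloc module:

PYTHON
import tracemalloc

for do_cache in (False, True):
    tracemalloc.start()
    numpy_comparison(do_cache=do_cache, array_size=10000, num_arrays=100, num_iter=10000)
    current, peak = tracemalloc.get_traced_memory()  # bytes currently held / peak during the run
    tracemalloc.stop()
    print(f"{do_cache=}: peak traced memory = {peak / 1e6:.1f} MB")

This does not show fragmentation directly, but a much larger peak for the cached variant is a strong hint that the extra live memory is where the cost comes from.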

By understanding that caching can change not only the number but also the cost of operations, you can better decide on a strategy for your application’s performance.
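To illustrate the buffer-reuse suggestion, here is a minimal sketch (the function name is mine, and it assumes the result only needs to live for the current iteration, unlike the real cached case) that writes every sum into one preallocated buffer via np.add(..., out=...) instead of allocating a fresh result array each time:

PYTHON
import numpy as np

def numpy_comparison_reuse(array_size: int, num_arrays: int, num_iter: int):
    arrays = {i: np.random.rand(array_size) for i in range(num_arrays)}
    out = np.empty(array_size)  # preallocated result buffer, reused on every iteration
    for _ in range(num_iter):
        i, j = np.random.randint(num_arrays, size=2)
        np.add(arrays[i], arrays[j], out=out)  # no new array is allocated for the result
        # ... use `out` here before the next iteration overwrites it ...

Anything you genuinely want to cache still has to keep its own storage, so this only removes the allocations for results that are consumed immediately.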


No comments yet.

Answer by CosmicResearcher402 1 month ago


I believe part of the issue here is that the work involved in the example is very quick compared to the cost of testing for and fetching values from the cache, so it seems reasonable that there is initially a penalty for enabling the cache.

Here is a quick workup that lets you vary the "amount of work done" as well as the size of the cache. Hopefully this will let you explore in more detail how the amount of work and the cache hit ratio affect performance.

If this is not of any value, just drop me a note here and I will delete this.

PYTHON
import numpy as np
import cachetools
import timeit


def build_tester(array_size, num_arrays, num_iter, work_size, cache_size):
    arrays: dict[int, np.ndarray] = {}
    for i in range(num_arrays):
        arrays[i] = np.random.rand(array_size)

    def _delay(work_size):
        for _ in range(work_size):
            pass
        return 0

    def _work_uncached(i, j):
        return arrays[i] + arrays[j] + _delay(work_size)

    @cachetools.cached(cachetools.LRUCache(maxsize=cache_size), info=True)
    def _work_cached(i, j):
        return _work_uncached(i, j)

    _work = _work_cached if cache_size else _work_uncached

    def _test():
        if hasattr(_work, "cache_info"):
            _work.cache_clear()
        a = 0
        for _ in range(num_iter):
            # Loop over random pairs of array, add, store if relevant
            i, j = np.random.randint(num_arrays, size=2)
            a += _work(i, j)
        if hasattr(_work, "cache_info"):
            print(f"\t\t{_work.cache_info()}")
        return a

    return _test


for work_amount in [10, 100, 500, 1_000, 2_000, 4_000, 8_000]:
    print(f"{work_amount=}")
    for cache_size in [0, 10, 100, 1_000, 2_000, 4_000, 8_000, 16_000]:
        print(f"\t{cache_size=}")
        tester = build_tester(10_000, 100, 100_000, work_amount, cache_size)
        seconds = timeit.timeit(tester, number=1)
        print(f"\t\t{seconds=:0.4f}")

That gives me a result like:

BASH
work_amount=10
    cache_size=0      seconds=2.6566
    cache_size=10     CacheInfo(hits=112, misses=99888, maxsize=10, currsize=10)        seconds=3.8053
    cache_size=100    CacheInfo(hits=954, misses=99046, maxsize=100, currsize=100)      seconds=4.7884
    cache_size=1000   CacheInfo(hits=9791, misses=90209, maxsize=1000, currsize=1000)   seconds=4.7839
    cache_size=2000   CacheInfo(hits=19702, misses=80298, maxsize=2000, currsize=2000)  seconds=4.6119
    cache_size=4000   CacheInfo(hits=39029, misses=60971, maxsize=4000, currsize=4000)  seconds=4.2463
    cache_size=8000   CacheInfo(hits=75199, misses=24801, maxsize=8000, currsize=8000)  seconds=3.2509
    cache_size=16000  CacheInfo(hits=90001, misses=9999, maxsize=16000, currsize=9999)  seconds=2.8887
work_amount=100
    cache_size=0      seconds=2.7871
    cache_size=10     CacheInfo(hits=95, misses=99905, maxsize=10, currsize=10)         seconds=4.1492
    cache_size=100    CacheInfo(hits=1017, misses=98983, maxsize=100, currsize=100)     seconds=5.4277
    cache_size=1000   CacheInfo(hits=9923, misses=90077, maxsize=1000, currsize=1000)   seconds=5.6435
    cache_size=2000   CacheInfo(hits=19928, misses=80072, maxsize=2000, currsize=2000)  seconds=5.5402
    cache_size=4000   CacheInfo(hits=38916, misses=61084, maxsize=4000, currsize=4000)  seconds=4.9680
    cache_size=8000   CacheInfo(hits=75266, misses=24734, maxsize=8000, currsize=8000)  seconds=3.3910
    cache_size=16000  CacheInfo(hits=90001, misses=9999, maxsize=16000, currsize=9999)  seconds=2.8241
work_amount=500
    cache_size=0      seconds=3.4740
    cache_size=10     CacheInfo(hits=90, misses=99910, maxsize=10, currsize=10)         seconds=4.8242
    cache_size=100    CacheInfo(hits=1005, misses=98995, maxsize=100, currsize=100)     seconds=5.9154
    cache_size=1000   CacheInfo(hits=10124, misses=89876, maxsize=1000, currsize=1000)  seconds=6.2072
    cache_size=2000   CacheInfo(hits=19756, misses=80244, maxsize=2000, currsize=2000)  seconds=6.0852
    cache_size=4000   CacheInfo(hits=39104, misses=60896, maxsize=4000, currsize=4000)  seconds=5.2798
    cache_size=8000   CacheInfo(hits=75312, misses=24688, maxsize=8000, currsize=8000)  seconds=3.6092
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=2.8147
work_amount=1000
    cache_size=0      seconds=4.4938
    cache_size=10     CacheInfo(hits=92, misses=99908, maxsize=10, currsize=10)         seconds=6.3360
    cache_size=100    CacheInfo(hits=970, misses=99030, maxsize=100, currsize=100)      seconds=6.7533
    cache_size=1000   CacheInfo(hits=10069, misses=89931, maxsize=1000, currsize=1000)  seconds=7.1395
    cache_size=2000   CacheInfo(hits=19799, misses=80201, maxsize=2000, currsize=2000)  seconds=7.1510
    cache_size=4000   CacheInfo(hits=39254, misses=60746, maxsize=4000, currsize=4000)  seconds=6.0416
    cache_size=8000   CacheInfo(hits=74989, misses=25011, maxsize=8000, currsize=8000)  seconds=3.8283
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=2.9638
work_amount=2000
    cache_size=0      seconds=6.0297
    cache_size=10     CacheInfo(hits=108, misses=99892, maxsize=10, currsize=10)        seconds=8.0671
    cache_size=100    CacheInfo(hits=1014, misses=98986, maxsize=100, currsize=100)     seconds=8.3695
    cache_size=1000   CacheInfo(hits=9978, misses=90022, maxsize=1000, currsize=1000)   seconds=9.6132
    cache_size=2000   CacheInfo(hits=19729, misses=80271, maxsize=2000, currsize=2000)  seconds=9.1182
    cache_size=4000   CacheInfo(hits=39244, misses=60756, maxsize=4000, currsize=4000)  seconds=7.2789
    cache_size=8000   CacheInfo(hits=75272, misses=24728, maxsize=8000, currsize=8000)  seconds=4.3689
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=3.0879
work_amount=4000
    cache_size=0      seconds=10.8127
    cache_size=10     CacheInfo(hits=105, misses=99895, maxsize=10, currsize=10)        seconds=11.5326
    cache_size=100    CacheInfo(hits=1008, misses=98992, maxsize=100, currsize=100)     seconds=12.3085
    cache_size=1000   CacheInfo(hits=9846, misses=90154, maxsize=1000, currsize=1000)   seconds=12.2700
    cache_size=2000   CacheInfo(hits=19831, misses=80169, maxsize=2000, currsize=2000)  seconds=11.5156
    cache_size=4000   CacheInfo(hits=39071, misses=60929, maxsize=4000, currsize=4000)  seconds=9.7006
    cache_size=8000   CacheInfo(hits=75259, misses=24741, maxsize=8000, currsize=8000)  seconds=5.2042
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=3.4366
work_amount=8000
    cache_size=0      seconds=16.8940
    cache_size=10     CacheInfo(hits=84, misses=99916, maxsize=10, currsize=10)         seconds=18.3884
    cache_size=100    CacheInfo(hits=992, misses=99008, maxsize=100, currsize=100)      seconds=19.4869
    cache_size=1000   CacheInfo(hits=9929, misses=90071, maxsize=1000, currsize=1000)   seconds=19.1156
    cache_size=2000   CacheInfo(hits=19845, misses=80155, maxsize=2000, currsize=2000)  seconds=18.9519
    cache_size=4000   CacheInfo(hits=38982, misses=61018, maxsize=4000, currsize=4000)  seconds=14.2407
    cache_size=8000   CacheInfo(hits=75141, misses=24859, maxsize=8000, currsize=8000)  seconds=7.4209
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=4.2533

No comments yet.

Answer by OrbitalCommander879 1 month ago


TL;DR: page faults explain why the cache-based version is significantly slower than the one without a cache when num_iter is small. This is a side effect of creating many new NumPy arrays that are only deleted at the end. When num_iter is big, the cache becomes more effective (as explained in JonSG's answer). Using another system allocator such as TCMalloc can greatly reduce this overhead.


When you create many new temporary arrays, NumPy requests memory from the system allocator, which in turn requests relatively large buffers from the operating system (OS). The first touch of a memory page causes a page fault so the OS can actually set up the page: the virtual page is mapped to a physical one, and the page is filled with zeros for security reasons (information must not leak from one process to another). When all arrays are deleted, NumPy frees the memory and the underlying allocator is likely to give it back to the OS (so other processes can use it).

Page faults are very expensive because the CPU needs to switch from user-land to kernel-land (with elevated privileges), and the kernel has to set up many data structures and call a lot of functions to handle each fault.
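If perf is not available, a rough cross-check can be done from Python itself (Linux/macOS only; this is a sketch that assumes the numpy_comparison function from the question is in scope) by counting the minor page faults each variant incurs with resource.getrusage:

PYTHON
import resource

def count_minor_faults(fn, **kwargs):
    # Minor page faults taken by this process while fn runs
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    fn(**kwargs)
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt - before

for do_cache in (False, True):
    faults = count_minor_faults(numpy_comparison, do_cache=do_cache,
                                array_size=10000, num_arrays=100, num_iter=10000)
    print(f"{do_cache=}: {faults} minor page faults")

A much larger count for the cached variant is consistent with the perf breakdown below.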


Profiling & Analysis

To show that page faults are the issue, and how bad they are performance-wise, here is the low-level breakdown of where the time is spent when the cache is enabled (using perf):

BASH
13,75%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] DOUBLE_add_AVX2
 7,47%  [kernel]                                             [k] __irqentry_text_end
 6,94%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
 3,63%  [kernel]                                             [k] clear_page_erms
 3,22%  [kernel]                                             [k] error_entry
 2,98%  [kernel]                                             [k] native_irq_return_iret
 2,88%  libpython3.11.so.1.0                                 [.] _PyEval_EvalFrameDefault
 2,35%  [kernel]                                             [k] sync_regs
 2,28%  [kernel]                                             [k] __list_del_entry_valid_or_report
 2,27%  [kernel]                                             [k] unmap_page_range
 1,62%  [kernel]                                             [k] __handle_mm_fault
 1,45%  [kernel]                                             [k] __mod_memcg_lruvec_state
 1,43%  mtrand.cpython-311-x86_64-linux-gnu.so               [.] random_standard_uniform_fill
 1,10%  [kernel]                                             [k] do_anonymous_page
 1,10%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] mt19937_gen
 0,98%  [kernel]                                             [k] mas_walk
 0,93%  libpython3.11.so.1.0                                 [.] PyObject_GenericGetAttr
 0,91%  [kernel]                                             [k] get_mem_cgroup_from_mm
 0,89%  [kernel]                                             [k] get_page_from_freelist
 0,79%  libpython3.11.so.1.0                                 [.] _PyObject_Malloc
 0,77%  [kernel]                                             [k] lru_gen_add_folio
 0,72%  [nvidia]                                             [k] _nv039919rm
 0,65%  [kernel]                                             [k] lru_gen_del_folio.constprop.0
 0,63%  [kernel]                                             [k] blk_cgroup_congested
 0,62%  [kernel]                                             [k] handle_mm_fault
 0,59%  [kernel]                                             [k] __alloc_pages_noprof
 0,57%  [kernel]                                             [k] lru_add
 0,57%  [kernel]                                             [k] folio_batch_move_lru
 0,56%  [kernel]                                             [k] __rcu_read_lock
 0,52%  [kernel]                                             [k] do_user_addr_fault
[...] (many other functions taking <0.52% each)

As we can see, a lot of [kernel] functions are called, and most of them are due to page faults. For example, __irqentry_text_end and native_irq_return_iret (taking ~10% of the time) are caused by CPU interrupts triggered when the CPython process accesses pages for the first time. clear_page_erms is the function that clears a memory page on first touch. Several functions are related to the virtual-to-physical memory mapping (e.g. AFAIK the ones related to the LRU page lists). Note that DOUBLE_add_AVX2 is the internal native NumPy function that actually sums the two arrays. In comparison, here is the breakdown with the cache disabled:

BASH
20,85%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] DOUBLE_add_AVX2
17,39%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
 5,69%  libpython3.11.so.1.0                                 [.] _PyEval_EvalFrameDefault
 3,35%  mtrand.cpython-311-x86_64-linux-gnu.so               [.] random_standard_uniform_fill
 2,46%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] mt19937_gen
 2,15%  libpython3.11.so.1.0                                 [.] PyObject_GenericGetAttr
 1,76%  [kernel]                                             [k] __irqentry_text_end
 1,46%  libpython3.11.so.1.0                                 [.] _PyObject_Malloc
 1,07%  libpython3.11.so.1.0                                 [.] PyUnicode_FromFormatV
 1,03%  libc.so.6                                            [.] printf_positional
 0,93%  libpython3.11.so.1.0                                 [.] _PyObject_Free
 0,88%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] NpyIter_AdvancedNew
 0,79%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] ufunc_generic_fastcall
 0,77%  [kernel]                                             [k] error_entry
 0,74%  _bounded_integers.cpython-311-x86_64-linux-gnu.so    [.] __pyx_f_5numpy_6random_17_bounded_integers__rand_int64
 0,72%  [nvidia]                                             [k] _nv039919rm
 0,69%  [kernel]                                             [k] native_irq_return_iret
 0,66%  [kernel]                                             [k] clear_page_erms
 0,55%  libpython3.11.so.1.0                                 [.] PyType_IsSubtype
 0,55%  [kernel]                                             [k] sync_regs
 0,52%  mtrand.cpython-311-x86_64-linux-gnu.so               [.] __pyx_pw_5numpy_6random_6mtrand_11RandomState_31randint
 0,48%  [kernel]                                             [k] unmap_page_range
 0,46%  libpython3.11.so.1.0                                 [.] PyDict_GetItemWithError
 0,43%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] PyArray_NewFromDescr_int
 0,40%  libpython3.11.so.1.0                                 [.] _PyFunction_Vectorcall
 0,38%  [kernel]                                             [k] __mod_memcg_lruvec_state
 0,38%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] promote_and_get_ufuncimpl
 0,37%  libpython3.11.so.1.0                                 [.] PyDict_SetItem
 0,36%  libpython3.11.so.1.0                                 [.] PyObject_RichCompareBool
 0,36%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] PyUFunc_GenericReduction
[...] (many other functions taking <0.35% each)

There are clearly fewer kernel functions among the top 30 most expensive functions. We can still see the interrupt-related functions, but note that this low-level profiling is global to my whole machine, so these interrupt-related functions likely come from other processes (e.g. perf itself, the graphical desktop environment, and Firefox running as I write this answer).

There is another observation showing that page faults are the main culprit: during my tests, the system allocator suddenly changed its behavior because I had allocated many arrays before running the same commands again, and this resulted in only a few kernel calls (in perf) as well as very close timings between the two analyzed variants (with/without cache):

PYTHON
# First execution:
In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
20.2 ms ± 63.7 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.4 ms ± 81.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Second execution:
In [4]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
19.8 ms ± 43.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.3 ms ± 155 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# After creating many Numpy arrays (not deleted since) with no change of `numpy_comparison`:
In [95]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
    ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
18.4 ms ± 15.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.9 ms ± 26.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-----

We can clearly see that the overhead of using a cache is now pretty small. This also means the performance could be very different in a real-world application (because of a different allocator state), or even on different platforms.


Solution to mitigate page faults

You can use an alternative system allocator like TCMalloc, which requests a big chunk of memory from the OS at startup so that page faults are paid much less often. Here are results with it (starting IPython with LD_PRELOAD=libtcmalloc_minimal.so.4 ipython):

PYTHON
In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
16.9 ms ± 51.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.5 ms ± 29.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The overhead related to the cache, which was actually due to the creation of many temporary arrays and more specifically to page faults, is now >10 times smaller!

I think the remaining overhead is due to the larger memory working set, as pointed out by NickODell in the comments (this causes more cache misses due to cache thrashing and more data to be loaded from the slow DRAM). Put shortly, the cached version is simply less cache friendly.
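For a rough sense of scale, using the question's parameters: each array of 10,000 float64 values occupies 80 kB, so the 100 base arrays fit in about 8 MB, but every cache miss keeps another 80 kB result alive. With up to 100 × 100 = 10,000 distinct (i, j) keys, the cache can grow to roughly 800 MB, far larger than any CPU cache, whereas the uncached version keeps recycling a handful of temporary buffers.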

Here are results with num_iter=100_000, with and without TCMalloc:

PYTHON
# Default system allocator (glibc)
In [97]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
    ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
1.31 s ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
859 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# TCMalloc allocator
In [3]: %timeit -n 1 numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit -n 1 numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
1.28 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
774 ms ± 83.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This behavior is expected: the cache starts to pay off once there are more hits. Note that TCMalloc makes the overall execution faster in most cases!

No comments yet.

Discussion

No comments yet.