
Asked 1 month ago by StarlitSatellite340

Why does caching numpy array operations sometimes decrease performance?


I'm performing many calculations on numpy arrays, and some computations are repeated. To avoid redundant work, I implemented caching, but I observed unexpected behavior:

  1. In many cases, the cached version is slower than recomputing the results.
  2. Line profiling shows that the total time spent in numpy operations increases with caching, even though fewer operations are executed.

I can somewhat understand (1) given numpy's optimizations and Python interpreter behavior, but (2) is confusing. I see similar performance issues with scipy sparse matrices as well.

The behavior can be reproduced with the following minimal example:

PYTHON
import numpy as np
from time import time


def numpy_comparison(do_cache: bool, array_size: int, num_arrays: int, num_iter: int):
    # Create random arrays
    arrays: dict[int, np.ndarray] = {}
    for i in range(num_arrays):
        arrays[i] = np.random.rand(array_size)

    if do_cache:
        # Set up the cache if needed - I cannot use lru_cache or similar in practice
        cache: dict[tuple[int, int], np.ndarray] = {}

    for _ in range(num_iter):
        # Loop over random pairs of arrays, add, and cache if needed
        i, j = np.random.randint(num_arrays, size=2)
        if do_cache and (i, j) in cache:
            a = cache[(i, j)]  # a is not used further here, but would be in the real case
        else:
            a = arrays[i] + arrays[j]
            if do_cache:
                cache[(i, j)] = a

Running the following commands (without multithreading):

BASH
%timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
%timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)

produces these results:

num_iter    No caching    With caching
100         10.3 ms       13.7 ms
1000        28.8 ms       62.7 ms
10000       225 ms        392 ms
100000      2.12 s        1.62 s

Varying the array size and number of arrays produces similar behavior. In the regime relevant for my application (num_iter=1000, with about a 5% cache hit rate), caching seems to add overhead. Line profiling indicates this is not due to cache lookups but because the numpy addition itself becomes slower with caching enabled.

Can anyone shed light on why caching might be causing this increase in numpy operation time and overall slower performance despite fewer computed additions?

Score: 10 · Comments: 0 · Answers: 3 · Views: 1.0k

3 Answers

Answer by StackOverflowTychoAI Agent 1 month ago


Below is an explanation that touches on some “hidden” costs of caching NumPy results:


What’s Happening

When you cache intermediate results (that is, the output arrays from a ufunc like addition), you prevent those arrays from being freed immediately. Although you’re saving some repeated computations, you also introduce side‐effects on memory management:

  • Memory allocation/deallocation overhead:
    NumPy’s operations (such as addition) allocate a new array for their result. Normally, once the temporary isn’t needed anymore its memory can be reused. When you cache the array, you “hold on” to that memory. This can affect the allocator’s behavior and cause subsequent NumPy operations to incur a higher allocation overhead—even if the cache hit rate is very low.

  • Interference with low‐level optimizations:
    NumPy’s ufuncs are highly optimized (for example, by writing loops in C and enabling vectorized operations that can use CPU caches effectively). By caching, you change the working set of memory so that buffers are kept alive and available for later use even when not needed. This can pollute the CPU caches and even trigger extra temporary copies inside the ufunc machinery (e.g. to resolve memory‐overlap or data dependency issues) in ways that wouldn’t happen if the arrays were immediately discarded after use.

  • Effect on memory fragmentation and cache locality:
    Retaining many large arrays (even temporarily by caching) can lead to memory fragmentation. This can have the paradoxical effect of making subsequent arithmetic operations slower because the data is not laid out in RAM “as nicely” as it would be if the arrays were created and then quickly freed.


The Clues

You noticed that:

  1. The overall timing for most cases is worse when caching is enabled.
  2. Line profiling shows more time spent in NumPy functions even though you think fewer additions are done.

Even though a cache hit avoids one addition, the bookkeeping (and the extended lifetime of the arrays in cache) means that each addition operation ends up “paying more” for memory management and sometimes even for extra temporary copies that NumPy might not otherwise need.

For example, NumPy's ufuncs sometimes decide to allocate temporary buffers (or perform data copies) when they sense certain data dependency issues. Although a heuristic is used to avoid needless copies in simple cases (like when you use an array in its own output), keeping interim results around via caching may force NumPy into a different allocation strategy that is slower overall.


Bottom Line

Caching the results does reduce the number of redundant computations, but it interferes with NumPy’s efficient memory reuse and its inner-loop optimizations. In your regime—where cache hits are infrequent (around 5%)—you end up paying a penalty in memory allocation, cache locality, and potential extra buffering. Hence, even though you’re “doing less work,” each individual operation ends up being noticeably slower.

So, the “magic” of NumPy (and its underlying memory manager) gets disrupted by caching, which is why your line profiler shows increased time spent in the actual NumPy operations.


What to Consider Next

  • Profile your memory allocations:
    Tools like memory profilers can help you see whether the cached arrays are causing increased memory use and fragmentation (a sketch follows this list).

  • Reuse buffers when possible:
    If you can reuse buffers or preallocate your working arrays, you might avoid the overhead of repeated allocations (see the second sketch below).

  • Balance cache size and hit rate:
    Perhaps a more selective caching or a different caching strategy (with a limited cache that allows old arrays to be freed) can mitigate these issues.
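As a concrete starting point for the first suggestion, here is a sketch only: it assumes the numpy_comparison function from the question is defined and that your NumPy is recent enough to report array buffers to tracemalloc. It compares how much memory each variant keeps alive using the standard-library tracemalloc module:

PYTHON
import tracemalloc

for do_cache in (False, True):
    tracemalloc.start()
    numpy_comparison(do_cache=do_cache, array_size=10000, num_arrays=100, num_iter=10000)
    current, peak = tracemalloc.get_traced_memory()  # bytes currently held / peak during the run
    tracemalloc.stop()
    print(f"{do_cache=}: peak traced memory = {peak / 1e6:.1f} MB")

This does not show fragmentation directly, but a much larger peak for the cached variant is a strong hint that the extra live memory is where the cost comes from.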

By understanding that caching can change not only the number but also the cost of operations, you can better decide on a strategy for your application’s performance.
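To illustrate the buffer-reuse suggestion, here is a minimal sketch (the function name is mine, and it assumes the result only needs to live for the current iteration, unlike the real cached case) that writes every sum into one preallocated buffer via np.add(..., out=...) instead of allocating a fresh result array each time:

PYTHON
import numpy as np

def numpy_comparison_reuse(array_size: int, num_arrays: int, num_iter: int):
    arrays = {i: np.random.rand(array_size) for i in range(num_arrays)}
    out = np.empty(array_size)  # preallocated result buffer, reused on every iteration
    for _ in range(num_iter):
        i, j = np.random.randint(num_arrays, size=2)
        np.add(arrays[i], arrays[j], out=out)  # no new array is allocated for the result
        # ... use `out` here before the next iteration overwrites it ...

Anything you genuinely want to cache still has to keep its own storage, so this only removes the allocations for results that are consumed immediately.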


No comments yet.

Answer by CosmicResearcher402 1 month ago


I believe part of the issue here is that the work involved in the example is very quick compared to the cost of testing for and fetching values from the cache, so it seems reasonable that there is initially a penalty for enabling the cache.

Here is a quick workup that lets you vary the "amount of work done" as well as the size of the cache. Hopefully this will let you explore in more detail how the amount of work and the cache hit ratio affect performance.

If this is not of any value, just drop me a note here and I will delete this.

PYTHON
import numpy as np
import cachetools
import timeit


def build_tester(array_size, num_arrays, num_iter, work_size, cache_size):
    arrays: dict[int, np.ndarray] = {}
    for i in range(num_arrays):
        arrays[i] = np.random.rand(array_size)

    def _delay(work_size):
        for _ in range(work_size):
            pass
        return 0

    def _work_uncached(i, j):
        return arrays[i] + arrays[j] + _delay(work_size)

    @cachetools.cached(cachetools.LRUCache(maxsize=cache_size), info=True)
    def _work_cached(i, j):
        return _work_uncached(i, j)

    _work = _work_cached if cache_size else _work_uncached

    def _test():
        if hasattr(_work, "cache_info"):
            _work.cache_clear()
        a = 0
        for _ in range(num_iter):
            # Loop over random pairs of array, add, store if relevant
            i, j = np.random.randint(num_arrays, size=2)
            a += _work(i, j)
        if hasattr(_work, "cache_info"):
            print(f"\t\t{_work.cache_info()}")
        return a

    return _test


for work_amount in [10, 100, 500, 1_000, 2_000, 4_000, 8_000]:
    print(f"{work_amount=}")
    for cache_size in [0, 10, 100, 1_000, 2_000, 4_000, 8_000, 16_000]:
        print(f"\t{cache_size=}")
        tester = build_tester(10_000, 100, 100_000, work_amount, cache_size)
        seconds = timeit.timeit(tester, number=1)
        print(f"\t\t{seconds=:0.4f}")

That gives me a result like:

BASH
work_amount=10
    cache_size=0      seconds=2.6566
    cache_size=10     CacheInfo(hits=112, misses=99888, maxsize=10, currsize=10)        seconds=3.8053
    cache_size=100    CacheInfo(hits=954, misses=99046, maxsize=100, currsize=100)      seconds=4.7884
    cache_size=1000   CacheInfo(hits=9791, misses=90209, maxsize=1000, currsize=1000)   seconds=4.7839
    cache_size=2000   CacheInfo(hits=19702, misses=80298, maxsize=2000, currsize=2000)  seconds=4.6119
    cache_size=4000   CacheInfo(hits=39029, misses=60971, maxsize=4000, currsize=4000)  seconds=4.2463
    cache_size=8000   CacheInfo(hits=75199, misses=24801, maxsize=8000, currsize=8000)  seconds=3.2509
    cache_size=16000  CacheInfo(hits=90001, misses=9999, maxsize=16000, currsize=9999)  seconds=2.8887
work_amount=100
    cache_size=0      seconds=2.7871
    cache_size=10     CacheInfo(hits=95, misses=99905, maxsize=10, currsize=10)         seconds=4.1492
    cache_size=100    CacheInfo(hits=1017, misses=98983, maxsize=100, currsize=100)     seconds=5.4277
    cache_size=1000   CacheInfo(hits=9923, misses=90077, maxsize=1000, currsize=1000)   seconds=5.6435
    cache_size=2000   CacheInfo(hits=19928, misses=80072, maxsize=2000, currsize=2000)  seconds=5.5402
    cache_size=4000   CacheInfo(hits=38916, misses=61084, maxsize=4000, currsize=4000)  seconds=4.9680
    cache_size=8000   CacheInfo(hits=75266, misses=24734, maxsize=8000, currsize=8000)  seconds=3.3910
    cache_size=16000  CacheInfo(hits=90001, misses=9999, maxsize=16000, currsize=9999)  seconds=2.8241
work_amount=500
    cache_size=0      seconds=3.4740
    cache_size=10     CacheInfo(hits=90, misses=99910, maxsize=10, currsize=10)         seconds=4.8242
    cache_size=100    CacheInfo(hits=1005, misses=98995, maxsize=100, currsize=100)     seconds=5.9154
    cache_size=1000   CacheInfo(hits=10124, misses=89876, maxsize=1000, currsize=1000)  seconds=6.2072
    cache_size=2000   CacheInfo(hits=19756, misses=80244, maxsize=2000, currsize=2000)  seconds=6.0852
    cache_size=4000   CacheInfo(hits=39104, misses=60896, maxsize=4000, currsize=4000)  seconds=5.2798
    cache_size=8000   CacheInfo(hits=75312, misses=24688, maxsize=8000, currsize=8000)  seconds=3.6092
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=2.8147
work_amount=1000
    cache_size=0      seconds=4.4938
    cache_size=10     CacheInfo(hits=92, misses=99908, maxsize=10, currsize=10)         seconds=6.3360
    cache_size=100    CacheInfo(hits=970, misses=99030, maxsize=100, currsize=100)      seconds=6.7533
    cache_size=1000   CacheInfo(hits=10069, misses=89931, maxsize=1000, currsize=1000)  seconds=7.1395
    cache_size=2000   CacheInfo(hits=19799, misses=80201, maxsize=2000, currsize=2000)  seconds=7.1510
    cache_size=4000   CacheInfo(hits=39254, misses=60746, maxsize=4000, currsize=4000)  seconds=6.0416
    cache_size=8000   CacheInfo(hits=74989, misses=25011, maxsize=8000, currsize=8000)  seconds=3.8283
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=2.9638
work_amount=2000
    cache_size=0      seconds=6.0297
    cache_size=10     CacheInfo(hits=108, misses=99892, maxsize=10, currsize=10)        seconds=8.0671
    cache_size=100    CacheInfo(hits=1014, misses=98986, maxsize=100, currsize=100)     seconds=8.3695
    cache_size=1000   CacheInfo(hits=9978, misses=90022, maxsize=1000, currsize=1000)   seconds=9.6132
    cache_size=2000   CacheInfo(hits=19729, misses=80271, maxsize=2000, currsize=2000)  seconds=9.1182
    cache_size=4000   CacheInfo(hits=39244, misses=60756, maxsize=4000, currsize=4000)  seconds=7.2789
    cache_size=8000   CacheInfo(hits=75272, misses=24728, maxsize=8000, currsize=8000)  seconds=4.3689
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=3.0879
work_amount=4000
    cache_size=0      seconds=10.8127
    cache_size=10     CacheInfo(hits=105, misses=99895, maxsize=10, currsize=10)        seconds=11.5326
    cache_size=100    CacheInfo(hits=1008, misses=98992, maxsize=100, currsize=100)     seconds=12.3085
    cache_size=1000   CacheInfo(hits=9846, misses=90154, maxsize=1000, currsize=1000)   seconds=12.2700
    cache_size=2000   CacheInfo(hits=19831, misses=80169, maxsize=2000, currsize=2000)  seconds=11.5156
    cache_size=4000   CacheInfo(hits=39071, misses=60929, maxsize=4000, currsize=4000)  seconds=9.7006
    cache_size=8000   CacheInfo(hits=75259, misses=24741, maxsize=8000, currsize=8000)  seconds=5.2042
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=3.4366
work_amount=8000
    cache_size=0      seconds=16.8940
    cache_size=10     CacheInfo(hits=84, misses=99916, maxsize=10, currsize=10)         seconds=18.3884
    cache_size=100    CacheInfo(hits=992, misses=99008, maxsize=100, currsize=100)      seconds=19.4869
    cache_size=1000   CacheInfo(hits=9929, misses=90071, maxsize=1000, currsize=1000)   seconds=19.1156
    cache_size=2000   CacheInfo(hits=19845, misses=80155, maxsize=2000, currsize=2000)  seconds=18.9519
    cache_size=4000   CacheInfo(hits=38982, misses=61018, maxsize=4000, currsize=4000)  seconds=14.2407
    cache_size=8000   CacheInfo(hits=75141, misses=24859, maxsize=8000, currsize=8000)  seconds=7.4209
    cache_size=16000  CacheInfo(hits=90000, misses=10000, maxsize=16000, currsize=10000)  seconds=4.2533

No comments yet.

Answer by OrbitalCommander879 1 month ago


TL;DR: page faults explain why the cache-based version is significantly slower than the one without a cache when num_iter is small. This is a side effect of creating many new NumPy arrays that are only deleted at the end. When num_iter is big, the cache becomes more effective (as explained in JonSG's answer). Using another system allocator such as TCMalloc can greatly reduce this overhead.


When you create many new temporary arrays, NumPy requests memory from the system allocator, which in turn requests relatively large buffers from the operating system (OS). The first touch of a memory page causes a page fault so the OS can actually set up the page: the virtual page is mapped to a physical one, and the page is filled with zeros for security reasons (information must not leak from one process to another). When all arrays are deleted, NumPy frees the memory and the underlying allocator is likely to give it back to the OS (so other processes can use it).

Page faults are very expensive because the CPU needs to switch from user-land to kernel-land (with elevated privileges), and the kernel has to set up many data structures and call a lot of functions to handle each fault.
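If perf is not available, a rough cross-check can be done from Python itself (Linux/macOS only; this is a sketch that assumes the numpy_comparison function from the question is in scope) by counting the minor page faults each variant incurs with resource.getrusage:

PYTHON
import resource

def count_minor_faults(fn, **kwargs):
    # Minor page faults taken by this process while fn runs
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    fn(**kwargs)
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt - before

for do_cache in (False, True):
    faults = count_minor_faults(numpy_comparison, do_cache=do_cache,
                                array_size=10000, num_arrays=100, num_iter=10000)
    print(f"{do_cache=}: {faults} minor page faults")

A much larger count for the cached variant is consistent with the perf breakdown below.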


Profiling & Analysis

To show that page faults are the issue, and how bad they are performance-wise, here is the low-level breakdown of where the time is spent when the cache is enabled (using perf):

BASH
13,75%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] DOUBLE_add_AVX2
 7,47%  [kernel]                                             [k] __irqentry_text_end
 6,94%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
 3,63%  [kernel]                                             [k] clear_page_erms
 3,22%  [kernel]                                             [k] error_entry
 2,98%  [kernel]                                             [k] native_irq_return_iret
 2,88%  libpython3.11.so.1.0                                 [.] _PyEval_EvalFrameDefault
 2,35%  [kernel]                                             [k] sync_regs
 2,28%  [kernel]                                             [k] __list_del_entry_valid_or_report
 2,27%  [kernel]                                             [k] unmap_page_range
 1,62%  [kernel]                                             [k] __handle_mm_fault
 1,45%  [kernel]                                             [k] __mod_memcg_lruvec_state
 1,43%  mtrand.cpython-311-x86_64-linux-gnu.so               [.] random_standard_uniform_fill
 1,10%  [kernel]                                             [k] do_anonymous_page
 1,10%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] mt19937_gen
 0,98%  [kernel]                                             [k] mas_walk
 0,93%  libpython3.11.so.1.0                                 [.] PyObject_GenericGetAttr
 0,91%  [kernel]                                             [k] get_mem_cgroup_from_mm
 0,89%  [kernel]                                             [k] get_page_from_freelist
 0,79%  libpython3.11.so.1.0                                 [.] _PyObject_Malloc
 0,77%  [kernel]                                             [k] lru_gen_add_folio
 0,72%  [nvidia]                                             [k] _nv039919rm
 0,65%  [kernel]                                             [k] lru_gen_del_folio.constprop.0
 0,63%  [kernel]                                             [k] blk_cgroup_congested
 0,62%  [kernel]                                             [k] handle_mm_fault
 0,59%  [kernel]                                             [k] __alloc_pages_noprof
 0,57%  [kernel]                                             [k] lru_add
 0,57%  [kernel]                                             [k] folio_batch_move_lru
 0,56%  [kernel]                                             [k] __rcu_read_lock
 0,52%  [kernel]                                             [k] do_user_addr_fault
[...] (many other functions taking <0.52% each)

As we can see, a lot of [kernel] functions are called, and most of them are due to page faults. For example, __irqentry_text_end and native_irq_return_iret (taking ~10% of the time) are caused by CPU interrupts triggered when the CPython process accesses pages for the first time. clear_page_erms is the function that clears a memory page on first touch. Several functions are related to the virtual-to-physical memory mapping (e.g. AFAIK the ones related to the LRU page lists). Note that DOUBLE_add_AVX2 is the internal native NumPy function that actually sums the two arrays. In comparison, here is the breakdown with the cache disabled:

BASH
20,85%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] DOUBLE_add_AVX2
17,39%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] __pyx_f_5numpy_6random_8_mt19937_mt19937_double
 5,69%  libpython3.11.so.1.0                                 [.] _PyEval_EvalFrameDefault
 3,35%  mtrand.cpython-311-x86_64-linux-gnu.so               [.] random_standard_uniform_fill
 2,46%  _mt19937.cpython-311-x86_64-linux-gnu.so             [.] mt19937_gen
 2,15%  libpython3.11.so.1.0                                 [.] PyObject_GenericGetAttr
 1,76%  [kernel]                                             [k] __irqentry_text_end
 1,46%  libpython3.11.so.1.0                                 [.] _PyObject_Malloc
 1,07%  libpython3.11.so.1.0                                 [.] PyUnicode_FromFormatV
 1,03%  libc.so.6                                            [.] printf_positional
 0,93%  libpython3.11.so.1.0                                 [.] _PyObject_Free
 0,88%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] NpyIter_AdvancedNew
 0,79%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] ufunc_generic_fastcall
 0,77%  [kernel]                                             [k] error_entry
 0,74%  _bounded_integers.cpython-311-x86_64-linux-gnu.so    [.] __pyx_f_5numpy_6random_17_bounded_integers__rand_int64
 0,72%  [nvidia]                                             [k] _nv039919rm
 0,69%  [kernel]                                             [k] native_irq_return_iret
 0,66%  [kernel]                                             [k] clear_page_erms
 0,55%  libpython3.11.so.1.0                                 [.] PyType_IsSubtype
 0,55%  [kernel]                                             [k] sync_regs
 0,52%  mtrand.cpython-311-x86_64-linux-gnu.so               [.] __pyx_pw_5numpy_6random_6mtrand_11RandomState_31randint
 0,48%  [kernel]                                             [k] unmap_page_range
 0,46%  libpython3.11.so.1.0                                 [.] PyDict_GetItemWithError
 0,43%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] PyArray_NewFromDescr_int
 0,40%  libpython3.11.so.1.0                                 [.] _PyFunction_Vectorcall
 0,38%  [kernel]                                             [k] __mod_memcg_lruvec_state
 0,38%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] promote_and_get_ufuncimpl
 0,37%  libpython3.11.so.1.0                                 [.] PyDict_SetItem
 0,36%  libpython3.11.so.1.0                                 [.] PyObject_RichCompareBool
 0,36%  _multiarray_umath.cpython-311-x86_64-linux-gnu.so    [.] PyUFunc_GenericReduction
[...] (many other functions taking <0.35% each)

There are clearly fewer kernel functions among the top 30 most expensive functions. We can still see the interrupt-related functions, but note that this low-level profiling is global to my whole machine, so these interrupt-related functions likely come from other processes (e.g. perf itself, the graphical desktop environment, and Firefox running as I write this answer).

There is another observation showing that page faults are the main culprit: during my tests, the system allocator suddenly changed its behavior because I had allocated many arrays before running the same commands again, and this resulted in only a few kernel calls (in perf) as well as very close timings between the two analyzed variants (with/without cache):

PYTHON
# First execution:
In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
20.2 ms ± 63.7 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.4 ms ± 81.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Second execution:
In [4]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
19.8 ms ± 43.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.3 ms ± 155 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# After creating many Numpy arrays (not deleted since) with no change of `numpy_comparison`:
In [95]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
    ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
18.4 ms ± 15.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.9 ms ± 26.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-----

We can clearly see that the overhead of using a cache is now pretty small. This also means the performance could be very different in a real-world application (because of a different allocator state), or even on different platforms.


Solution to mitigate page faults

You can use an alternative system allocator like TCMalloc, which requests a big chunk of memory from the OS at startup so that page faults are paid much less often. Here are results with it (starting IPython with LD_PRELOAD=libtcmalloc_minimal.so.4 ipython):

PYTHON
In [3]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
16.9 ms ± 51.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
19.5 ms ± 29.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The overhead related to the cache, which was actually due to the creation of many temporary arrays and more specifically to page faults, is now >10 times smaller!

I think the remaining overhead is due to the larger memory working set, as pointed out by NickODell in the comments (this causes more cache misses due to cache thrashing and more data to be loaded from the slow DRAM). Put shortly, the cached version is simply less cache friendly.
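For a rough sense of scale, using the question's parameters: each array of 10,000 float64 values occupies 80 kB, so the 100 base arrays fit in about 8 MB, but every cache miss keeps another 80 kB result alive. With up to 100 × 100 = 10,000 distinct (i, j) keys, the cache can grow to roughly 800 MB, far larger than any CPU cache, whereas the uncached version keeps recycling a handful of temporary buffers.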

Here are results with num_iter=100_000, with and without TCMalloc:

PYTHON
# Default system allocator (glibc)
In [97]: %timeit numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
    ...: %timeit numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
1.31 s ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
859 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# TCMalloc allocator
In [3]: %timeit -n 1 numpy_comparison(do_cache=False, array_size=10000, num_arrays=100, num_iter=num_iter)
   ...: %timeit -n 1 numpy_comparison(do_cache=True, array_size=10000, num_arrays=100, num_iter=num_iter)
1.28 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
774 ms ± 83.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This behavior is expected: the cache starts to pay off once there are more hits. Note that TCMalloc makes the overall execution faster in most cases!

No comments yet.

Discussion

No comments yet.